Towards scalable Monte Carlo algorithms for some models involving latent variables

Christophe Andrieu, joint work with Arnaud Doucet (Oxford) and Sinan Yıldırım (Bristol/Istanbul)

April 22, 2016

Overview

- Assume we are interested in sampling from a probability distribution with density $\pi(\theta)$, defined on some space $(\Theta, \mathcal{E})$.
- Most general-purpose algorithms to do so require one to be able to evaluate $\pi(\theta)$ for $\theta \in \Theta$. This is not always possible.
- Some progress has been made in recent years on this problem, but the solutions do not always scale with the complexity of some problems.
- The presentation is about how some of these algorithms can be made to scale.

The Metropolis-Hastings algorithm

- Assume we are interested in sampling from a probability distribution with density $\pi(\theta)$, defined on some space $(\Theta, \mathcal{E})$.
- The Metropolis-Hastings (MH) algorithm generates a Markov chain $\{\theta_n, n \geq 0\}$ which leaves $\pi$ invariant. The MH update proceeds as follows. Given $\theta_n = \theta$:
  1. Propose $\theta' \sim q(\theta, \cdot)$.
  2. Calculate the acceptance ratio
     \[ r(\theta, \theta') := \frac{\pi(\theta')\, q(\theta', \theta)}{\pi(\theta)\, q(\theta, \theta')}. \]
  3. Set $\theta_{n+1} = \theta'$ with probability $\alpha(\theta, \theta') := \min\{1, r(\theta, \theta')\}$; otherwise set $\theta_{n+1} = \theta$.
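As a concrete reference point, here is a minimal random-walk Metropolis sketch in Python (not from the slides; the standard-normal target and the step size are illustrative choices). With a symmetric proposal, the $q$ terms cancel in $r(\theta, \theta')$:

```python
import numpy as np

def metropolis_hastings(log_pi, theta0, n_iters, step=2.4):
    """Random-walk Metropolis: symmetric proposal, so q terms cancel in r."""
    rng = np.random.default_rng(0)
    theta = theta0
    chain = np.empty(n_iters)
    for n in range(n_iters):
        prop = theta + step * rng.standard_normal()
        # log r(theta, theta') = log pi(theta') - log pi(theta) for symmetric q
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(theta):
            theta = prop
        chain[n] = theta
    return chain

# Example: standard normal target
chain = metropolis_hastings(lambda t: -0.5 * t**2, theta0=0.0, n_iters=10_000)
```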
Intractable acceptance ratio

- Being able to implement the MH update requires one to evaluate
  \[ r(\theta, \theta') = \frac{\pi(\theta')\, q(\theta', \theta)}{\pi(\theta)\, q(\theta, \theta')}. \]
- In some situations $r(\theta, \theta')$ is intractable (impossible or expensive to compute).
- Example: there is a latent variable $z$ such that $\pi(\theta) = \int \pi(\theta, z)\,dz$, which is intractable.
- Intractability of $r(\theta, \theta')$ is the motivation for exact-approximations of MCMC algorithms:
  - exact: the transition kernel leaves $\pi(\theta)$ invariant;
  - approximate: they use an approximation of $r(\theta, \theta')$.
- But they may not scale.

Latent variables and pseudo-marginals

- Assume interest is in a posterior distribution
  \[ \pi(\theta) = p(\theta \mid y) \propto \eta(\theta)\, p(y \mid \theta) = \eta(\theta) \int p(y, x \mid \theta)\, dx, \]
  where the integral cannot be computed analytically.
- Then with $x^{(i)} \overset{iid}{\sim} Q_\theta$ and $p(y, x \mid \theta)/Q_\theta(x)$ well defined, consider an importance sampling (IS) approximation of the likelihood
  \[ \hat{p}(y \mid \theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{p(y, x^{(i)} \mid \theta)}{Q_\theta(x^{(i)})}. \]
  This is a noisy measurement of the intractable "likelihood" $p(y \mid \theta)$.
- One could define the "pseudo-marginal" $\hat{\pi}(\theta) \propto \eta(\theta)\, \hat{p}(y \mid \theta)$.
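A sketch of this estimator in Python, assuming user-supplied (hypothetical) callables for $\log p(y, x \mid \theta)$ and for sampling and evaluating $Q_\theta$; the log-sum-exp shift is there only for numerical stability:

```python
import numpy as np

def is_likelihood_estimate(log_p_joint, sample_q, log_q, theta, y, N, rng):
    """Unbiased IS estimate (1/N) sum_i p(y, x_i | theta) / Q_theta(x_i)."""
    x = sample_q(theta, N, rng)              # x^(i) iid ~ Q_theta
    log_w = log_p_joint(theta, x, y) - log_q(theta, x)
    m = log_w.max()                          # stabilise before exponentiating
    return np.exp(m) * np.mean(np.exp(log_w - m))
```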
Pseudo-marginal approach

- Idea: replace $\pi(\theta)$ with non-negative "unbiased" estimators $\hat{\pi}(\theta)$, i.e. such that for some $C > 0$
  \[ \mathbb{E}[\hat{\pi}(\theta)] = C \times \pi(\theta), \quad \theta \in \Theta. \]

Pseudo-marginal MH [Andrieu and Roberts, 2009]. Given $\theta_n = \theta$ and $\hat{\pi}(\theta)$:
  1. Propose $\theta' \sim q(\theta, \cdot)$ and calculate $\hat{\pi}(\theta')$.
  2. Calculate the acceptance ratio
     \[ \hat{r}(\theta, \theta') := \frac{\hat{\pi}(\theta')}{\hat{\pi}(\theta)}\, \frac{q(\theta', \theta)}{q(\theta, \theta')}. \]
  3. Set $\theta_{n+1} = \theta'$ with probability $\min\{1, \hat{r}(\theta, \theta')\}$; otherwise set $\theta_{n+1} = \theta$.

- Pseudo-marginal algorithms are exact approximations of the marginal MH.

Toy latent variables example

- We consider here a simple example where the target distribution is
  \[ \pi(\theta, x) = \mathcal{N}\left( \begin{pmatrix} \theta \\ x \end{pmatrix};\ \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} 1 & -0.9 \\ -0.9 & 1 \end{pmatrix} \right). \]
- The marginal is $\pi(\theta) = \mathcal{N}(\theta; 0, 1)$.
- Sample with a random walk Metropolis algorithm:
  - $q(\theta, \theta') = \mathcal{N}(\theta'; \theta, 2.4^2)$, and $Q_\theta(X) = \prod_{i=1}^{N} \mathcal{N}(x_i; 0, 1)$ for IS;
  - $q(\theta, \theta') = \mathcal{N}(\theta'; \theta, 2.4^2)$ is known to be optimal in terms of asymptotic variance.
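A sketch of the pseudo-marginal random walk Metropolis on this toy target (seed and chain length are arbitrary). The key mechanics: a fresh estimate $\hat{\pi}(\theta')$ is drawn at the proposal only, and the current estimate is recycled until a move is accepted:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, -0.9], [-0.9, 1.0]])
joint = multivariate_normal(mean=[0.0, 0.0], cov=Sigma)

def pi_hat(theta, N):
    """IS estimate of pi(theta) = integral of pi(theta, x) dx, with Q(x) = N(0, 1)."""
    x = rng.standard_normal(N)
    w = joint.pdf(np.column_stack([np.full(N, theta), x])) / norm.pdf(x)
    return w.mean()

def pm_rwm(N, n_iters=5000, step=2.4):
    theta, est = 0.0, pi_hat(0.0, N)
    chain = np.empty(n_iters)
    for n in range(n_iters):
        prop = theta + step * rng.standard_normal()
        prop_est = pi_hat(prop, N)          # fresh estimate at the proposal only
        if rng.uniform() * est < prop_est:  # accept w.p. min(1, prop_est / est)
            theta, est = prop, prop_est     # the current estimate is recycled
        chain[n] = theta
    return chain
```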
[Figure: histograms and trace plots for "Beaumont's algorithm" with N = 1, 5, 10 and 20, compared with the standard algorithm.]

Intuition

- The acceptance probability of the algorithm is
  \[ \min\left\{1, \frac{\hat{\pi}(\theta')\, q(\theta', \theta)}{\hat{\pi}(\theta)\, q(\theta, \theta')}\right\}. \]
- The probability of escaping $(\theta, \hat{\pi}(\theta))$ can be made arbitrarily small by increasing $\hat{\pi}(\theta)$...
- The Markov chain becomes "sticky".

Asymptotic variance and expected acceptance probability

- With $\Pi$ a Markov transition kernel with invariant distribution $\pi$, letting $\theta_1 \sim \pi$ and $\theta_n \sim \Pi(\theta_{n-1}, \cdot)$,
  \[ \tau := \lim_{T \to \infty} T\, \mathbb{E}\left[ \left( \frac{1}{T} \sum_{k=1}^{T} f(\theta_k) - \pi(f) \right)^{2} \right] \in [0, \infty]. \]
  In other words, $\tau$ is such that
  \[ \mathrm{var}\left( \frac{1}{T} \sum_{k=1}^{T} f(\theta_k) \right) \approx \frac{\tau}{T}. \]
- The expected acceptance probability of a MH algorithm with invariant distribution $\pi$ is
  \[ \int \alpha(\theta, \theta')\, \pi(d\theta)\, q(\theta, d\theta'). \]

[Figure: performance as a function of N; expected acceptance probability and estimated autocorrelation time plotted against the value of N.]
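An estimated autocorrelation time such as the one plotted above can be computed from chain output with a crude truncated-sum estimator; a minimal sketch (the fixed lag cutoff is a naive choice, and the $\tau$ defined above equals $\mathrm{var}_\pi(f)$ times this factor):

```python
import numpy as np

def iact(chain, max_lag=200):
    """Integrated autocorrelation time 1 + 2 * sum_k rho_k, truncated at max_lag."""
    x = chain - chain.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]  # lags 0, 1, 2, ...
    acf = acf[:max_lag] / acf[0]                        # normalise to rho_0 = 1
    return 1.0 + 2.0 * acf[1:].sum()
```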
Large datasets?

- In order to fix ideas we are going to consider a scenario where the target distribution is a posterior distribution,
- that is, a posterior distribution obtained from Bayes' rule: given a prior distribution $\eta(\theta)$,
  \[ \pi(\theta) = p(\theta \mid y_{1:T}) \propto \eta(\theta)\, p(y_{1:T} \mid \theta), \]
  where $y_{1:T}$ are observations and $p(y_{1:T} \mid \theta)$ is the assumed probability density of the observations for the value $\theta$ of the model's parameter.
- We are going to assume that
  \[ p_\theta(y_{1:T}) := p(y_{1:T} \mid \theta) = \int_{\mathsf{X}^T} p_\theta(x_{1:T}, y_{1:T})\, d(x_{1:T}), \]
- where $p_\theta(y_{1:T})$ is hard or impossible to evaluate, while $p_\theta(x_{1:T}, y_{1:T})$ is relatively easy to evaluate for any $x_{1:T} \in \mathsf{X}^T$, $y_{1:T} \in \mathsf{Y}^T$.
- Some progress has been made in recent years for large classes of models, but it is unclear how these algorithms behave for large $T$.

Motivating example: state-space models

- Let $\{X_n, n \geq 1\}$ be a Markov chain defined by $X_1 \sim \mu_\theta(\cdot)$ and
  \[ X_n \mid (X_{n-1} = x_{n-1}) \sim f_\theta(\cdot \mid x_{n-1}). \]
- We only have access to a process $\{Y_n, n \geq 1\}$ such that, conditional upon $\{X_n, n \geq 1\}$, the observations are statistically independent and
  \[ Y_n \mid (X_n = x_n) \sim g_\theta(\cdot \mid x_n). \]
- $\theta$ is an unknown parameter with prior density $\eta(\theta)$. In general,
  \[ p_\theta(y_{1:T}) = \int_{\mathsf{X}^T} \mu_\theta(x_1)\, g_\theta(y_1 \mid x_1) \prod_{i=2}^{T} g_\theta(y_i \mid x_i)\, f_\theta(x_i \mid x_{i-1})\, d(x_{1:T}) \]
  is intractable.
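For concreteness, a generic sketch for simulating $(x_{1:T}, y_{1:T})$ from such a model, with a scalar state for simplicity; the three sampling callables stand in for $\mu_\theta$, $f_\theta$ and $g_\theta$ (names are illustrative):

```python
import numpy as np

def simulate_ssm(T, sample_mu, sample_f, sample_g, rng):
    """Draw (x_{1:T}, y_{1:T}): x_1 ~ mu, x_t ~ f(. | x_{t-1}), y_t ~ g(. | x_t)."""
    x = np.empty(T)
    y = np.empty(T)
    x[0] = sample_mu(rng)
    y[0] = sample_g(x[0], rng)
    for t in range(1, T):
        x[t] = sample_f(x[t - 1], rng)
        y[t] = sample_g(x[t], rng)
    return x, y
```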
Motivating example: state-space models

- Then the MH acceptance ratio is
  \[ r(\theta, \theta') = \frac{\eta(\theta')\, p_{\theta'}(y_{1:T})\, q(\theta', \theta)}{\eta(\theta)\, p_\theta(y_{1:T})\, q(\theta, \theta')}. \]
- The intractable quantity is the "likelihood" ratio
  \[ L_T(\theta, \theta') := \frac{p_{\theta'}(y_{1:T})}{p_\theta(y_{1:T})}. \]
- There are efficient methods to estimate $p_\theta(y_{1:T})$ unbiasedly, based on particle filters (or sequential Monte Carlo methods).

SMC methods

- Particle filters, or SMC, were initially developed to approximate $\{p(x_t \mid y_{1:t}), t = 1, \ldots\}$ recursively (or $p(x_{1:T} \mid y_{1:T})$).
- The main idea is to propagate through time a cloud of (possibly weighted) samples $\{x_t^i, i = 1, \ldots, N\}$ such that their empirical distribution is a good approximation of $p(x_t \mid y_{1:t})$.
- An unbiased and non-negative estimator of $p_\theta(y_{1:T})$ can easily be obtained as a by-product.

Sequential Monte Carlo

Algorithm 1: SMC(N, M_θ, A_θ)

  for i = 1, ..., N:
      sample x_1^(i) ~ m_θ(·)
      compute w_1^(i) ∝ μ_θ(x_1^(i)) g_θ(x_1^(i), y_1) / m_θ(x_1^(i))
  for t = 2, ..., T:
      for i = 1, ..., N:
          sample a_{t-1}^(i) ~ P(w_{t-1}^(1), ..., w_{t-1}^(N)) and x_t^(i) ~ M_θ(x_{t-1}^(a_{t-1}^(i)), ·)
          compute w_t^(i) = f_θ(x_{t-1}^(a_{t-1}^(i)), x_t^(i)) g_θ(x_t^(i), y_t) / M_θ(x_{t-1}^(a_{t-1}^(i)), x_t^(i))

[Figure: evolution of an SMC particle cloud, state against time index, across propagation and resampling steps. Picture created by Olivier Cappé.]
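A minimal Python sketch of the bootstrap special case of Algorithm 1 (proposing from the transition, i.e. $M_\theta = f_\theta$, so the weights reduce to $g_\theta$), returning the log of the unbiased likelihood estimate; the callables and the multinomial resampling at every step are simplifying assumptions:

```python
import numpy as np

def bootstrap_pf_loglik(y, sample_m0, sample_f, log_g, N, rng):
    """Bootstrap PF: the product over t of the mean unnormalised weights
    is unbiased for p(y_{1:T} | theta); we accumulate its log."""
    T = len(y)
    x = sample_m0(N, rng)                   # x_1^(i) ~ mu_theta
    loglik = 0.0
    for t in range(T):
        if t > 0:
            a = rng.choice(N, size=N, p=w)  # multinomial resampling
            x = sample_f(x[a], rng)         # x_t^(i) ~ f_theta(. | x_{t-1}^(a_i))
        logw = log_g(x, y[t])               # unnormalised log-weights
        m = logw.max()
        w_un = np.exp(logw - m)
        loglik += m + np.log(w_un.mean())   # log of (1/N) sum of weights
        w = w_un / w_un.sum()
    return loglik
```

Exponentiating `loglik` gives the non-negative unbiased estimator used by the pseudo-marginal algorithms below; note that unbiasedness holds on the natural scale, not the log scale.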
Pseudo-marginal approach

- Idea: replace $p_\theta(y_{1:T})$ with non-negative "unbiased" estimators $\hat{p}_\theta(y_{1:T})$, i.e. such that for some $C > 0$
  \[ \mathbb{E}[\hat{p}_\theta(y_{1:T})] = C\, p_\theta(y_{1:T}), \quad \theta \in \Theta. \]

Pseudo-marginal MH [Andrieu and Roberts, 2009]. Given $\theta_n = \theta$ and $\hat{p}_\theta(y_{1:T})$:
  1. Propose $\theta' \sim q(\theta, \cdot)$ and calculate $\hat{p}_{\theta'}(y_{1:T})$.
  2. Calculate the acceptance ratio
     \[ \hat{r}(\theta, \theta') := \frac{\eta(\theta')\, \hat{p}_{\theta'}(y_{1:T})\, q(\theta', \theta)}{\eta(\theta)\, \hat{p}_\theta(y_{1:T})\, q(\theta, \theta')}. \]
  3. Set $\theta_{n+1} = \theta'$ with probability $\min\{1, \hat{r}(\theta, \theta')\}$; otherwise set $\theta_{n+1} = \theta$.

- Pseudo-marginal algorithms are exact approximations of the marginal MH.

Numerical experiments

- Stochastic volatility (toyish) model:
  \[ X_t = X_{t-1}/2 + 25 X_{t-1}/(1 + X_{t-1}^2) + 8\cos(1.2 t) + V_t, \quad t \geq 2, \]
  \[ Y_t = X_t^2/20 + W_t, \quad t \geq 1, \]
  where $X_1 \sim \mathcal{N}(0, 10)$, $V_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_v^2)$, $W_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$.
- The static parameter of the model is then $\theta = (\sigma_v^2, \sigma_w^2)$ and was ascribed a prior distribution.
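Schematically, the PMMH algorithm compared in the table below has this shape; a generic sketch (assumptions: `loglik_hat(theta)` returns an independent noisy estimate of $\log p_\theta(y_{1:T})$, e.g. from the particle filter sketched above, and in practice the variances would be proposed on the log scale):

```python
import numpy as np

def pmmh(loglik_hat, log_prior, theta0, n_iters, step, rng):
    """Particle marginal MH: an unbiased likelihood estimate replaces
    p_theta(y_{1:T}) in the MH ratio; the current estimate is recycled."""
    theta = np.asarray(theta0, dtype=float)
    ll = loglik_hat(theta)
    chain = np.empty((n_iters, theta.size))
    for n in range(n_iters):
        prop = theta + step * rng.standard_normal(theta.size)
        ll_prop = loglik_hat(prop)          # fresh, independent estimate
        log_r = log_prior(prop) + ll_prop - log_prior(theta) - ll
        if np.log(rng.uniform()) < log_r:
            theta, ll = prop, ll_prop
        chain[n] = theta
    return chain
```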
Results

                AIS MCMC - PGBS       MwG - PGBS           PMMH
                σv²       σw²         σv²       σw²        σv²        σw²
  T = 500       16.583    21.424      18.880    32.686     19.742     19.433
  T = 1000      15.294    19.787      20.895    24.633     43.657     54.389
  T = 2000      17.431    21.680      22.798    26.487     279.009    554.928
  T = 3000      18.336    21.689      20.056    32.555     1202.928   1856.514
  T = 4000      17.239    19.782      24.127    32.678     2602.677   1916.475

Table: Estimated IAC times for $\sigma_v^2$ and $\sigma_w^2$ calculated for the algorithms being compared. All algorithms use N = 500 particles and 1 intermediate step.

What's wrong with pseudo-marginal algorithms?

- In pseudo-marginal algorithms, one estimates $p_\theta(y_{1:T})$ and $p_{\theta'}(y_{1:T})$ independently:
  \[ \hat{r}(\theta, \theta') := \frac{\eta(\theta')\, \hat{p}_{\theta'}(y_{1:T})\, q(\theta', \theta)}{\eta(\theta)\, \hat{p}_\theta(y_{1:T})\, q(\theta, \theta')}. \]
- Variability of the acceptance ratio has a big impact on the performance of the algorithm [CA & M. Vihola 2014, 2015].
- What is happening here?
  - We are in a situation where, for a fixed number of particles, the variability of $\hat{p}_\theta(y_{1:T})$ increases with $T$;
  - we notice that, due to the independence, as $\theta' \to \theta$ the variability in $\hat{p}_{\theta'}(y_{1:T})/\hat{p}_\theta(y_{1:T})$ does not vanish.
- In what follows we motivate the need to introduce dependence between estimates, in order to ensure that as $\theta' \to \theta$ the ratio converges to one.
- We then explain how to estimate $L_T(\theta, \theta') = p_{\theta'}(y_{1:T})/p_\theta(y_{1:T})$ directly and design a correct algorithm.

What's wrong with pseudo-marginal algorithms?

- In order to simplify the discussion, consider the scenario where
  \[ p_\theta(y_{1:T}) := \prod_{t=1}^{T} p_\theta(y_t) = \prod_{t=1}^{T} \int_{\mathsf{X}} p_\theta(x_t, y_t)\, dx_t. \]
- Let $\mathbb{P}_0$ be the actual distribution of the observations $Y_1, Y_2, \ldots$ and consider the tractable scenario.
- For $\theta, \theta' \in \Theta$ given, if
  \[ \int \log\frac{p_{\theta'}(y)}{\mathbb{P}_0(y)}\, \mathbb{P}_0(y)\, dy - \int \log\frac{p_\theta(y)}{\mathbb{P}_0(y)}\, \mathbb{P}_0(y)\, dy \neq 0, \]
  then $\mathbb{P}_0$-a.s. as $T \to \infty$
  \[ \log L_T(\theta, \theta') = \sum_{t=1}^{T} \log \frac{p_{\theta'}(Y_t)}{p_\theta(Y_t)} \to \pm\infty. \]
- As a result the acceptance ratio of the exact algorithm we would want to implement,
  \[ r_T(\theta, \theta') := \frac{\eta(\theta')\, q(\theta', \theta)}{\eta(\theta)\, q(\theta, \theta')}\, L_T(\theta, \theta'), \]
  vanishes or diverges to infinity.
What's wrong with pseudo-marginal algorithms?

- However, if $\theta' = \theta + \epsilon/\sqrt{T}$ for some $\epsilon$ (which corresponds to a random walk Metropolis)...
- Under some regularity assumptions the likelihood ratio satisfies (with $\mathbb{P}_0$ the law of the observations)
  \[ \log L_T(\theta, \theta + \epsilon/\sqrt{T}) = \frac{\epsilon}{\sqrt{T}} \sum_{i=1}^{T} \dot{\ell}_\theta(Y_i) - \frac{\epsilon^2}{2} V(\theta) + o_{\mathbb{P}_0}(1). \]
- We expect a central limit theorem to hold, and the acceptance ratio to have a "non-degenerate" limit.
- This type of strategy exploits the smoothness, for any $\theta \in \Theta$, of
  \[ \theta' \mapsto \frac{p_{\theta'}(y_{1:T})}{p_\theta(y_{1:T})} \]
  and the fact that $p_{\theta'}(y_{1:T})/p_\theta(y_{1:T}) \to 1$ as $\theta' \to \theta$, in order to control the asymptotic behaviour of the ratio.
- This does not work if we use independent estimators $\hat{p}_\theta^{(1)}(y_{1:T})$ and $\hat{p}_{\theta'}^{(2)}(y_{1:T})$.

Estimate of the likelihood ratio

- A classical way of estimating the likelihood ratio consists of exploiting the identity
  \[ \frac{p_{\theta'}(y)}{p_\theta(y)} = \int \frac{p_{\theta'}(x, y)}{p_\theta(x, y)}\, p_\theta(x \mid y)\, dx = \int \frac{p_{\theta'}(x \mid y)\, p_{\theta'}(y)}{p_\theta(x \mid y)\, p_\theta(y)}\, p_\theta(x \mid y)\, dx. \]
- Assuming we can sample $X \sim p_\theta(\cdot \mid y)$, then
  \[ \frac{p_{\theta'}(X, y)}{p_\theta(X, y)} \quad \text{estimates} \quad \frac{p_{\theta'}(y)}{p_\theta(y)}, \]
  and further, with sufficient smoothness, the estimator goes to one as $\theta' \to \theta$.
- Sampling from $p_\theta(x \mid y)$ is an issue, but we could design an algorithm targetting the joint distribution $\pi(\theta, x \mid y) \propto \eta(\theta)\, p_\theta(x, y)$.
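A tiny self-contained check of this identity on a conjugate Gaussian pair (my example, not the slides'): with $x \sim \mathcal{N}(\theta, 1)$ and $y \mid x \sim \mathcal{N}(x, 1)$, both $p_\theta(x \mid y) = \mathcal{N}((\theta + y)/2, 1/2)$ and $p_\theta(y) = \mathcal{N}(y; \theta, 2)$ are available in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta, theta_p, y = 0.0, 0.4, 1.3

def log_joint(th, x):
    # log p_th(x, y) for x ~ N(th, 1), y | x ~ N(x, 1)
    return norm.logpdf(x, loc=th, scale=1.0) + norm.logpdf(y, loc=x, scale=1.0)

x = rng.normal((theta + y) / 2.0, np.sqrt(0.5), size=100_000)  # x ~ p_theta(. | y)
est = np.mean(np.exp(log_joint(theta_p, x) - log_joint(theta, x)))
exact = norm.pdf(y, loc=theta_p, scale=np.sqrt(2.0)) / norm.pdf(y, loc=theta, scale=np.sqrt(2.0))
# est matches exact up to Monte Carlo error, and est -> 1 smoothly as theta_p -> theta
```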
Estimate of the likelihood ratio

- Target $\pi(\theta, x \mid y) \propto \eta(\theta)\, p_\theta(x, y)$.
- One can show reversibility of the algorithm:
  1. Given $(\theta, x)$, sample $\theta' \sim q(\theta, \cdot)$.
  2. With probability
     \[ \min\left\{1, \frac{\eta(\theta')\, p_{\theta'}(x, y)\, q(\theta', \theta)}{\eta(\theta)\, p_\theta(x, y)\, q(\theta, \theta')}\right\} \]
     jump to $(\theta', x)$.
- $x$ is not updated, so we must combine this with an update of $x$ (Metropolis-within-Gibbs idea).
- But we can exploit the continuity of $p_{\theta'}(x, y)$, and there is hope the algorithm scales.

Estimate of the likelihood ratio

- However, for $\tilde{\theta} \in \Theta$, using the identity twice,
  \[ \frac{p_{\theta'}(y)}{p_\theta(y)} = \frac{p_{\tilde{\theta}}(y)}{p_\theta(y)}\, \frac{p_{\theta'}(y)}{p_{\tilde{\theta}}(y)} = \int\!\!\int \frac{p_{\tilde{\theta}}(x, y)}{p_\theta(x, y)}\, \frac{p_{\theta'}(x', y)}{p_{\tilde{\theta}}(x', y)}\, p_\theta(x \mid y)\, p_{\tilde{\theta}}(x' \mid y)\, dx\, dx'. \]
  This was originally proposed as a variance reduction technique (think of $\tilde{\theta} = (\theta + \theta')/2$) [Crooks, 1998, Jarzynski, 1997, Neal, 2001, Neal, 1996].
- Assume for a moment that sampling from $p_{\tilde{\theta}}(x \mid y)$ is possible and let us target $\pi(\theta, x \mid y) \propto \eta(\theta)\, p_\theta(x, y)$:
  1. Given $(\theta, x)$, sample $\theta' \sim q(\theta, \cdot)$, let $\tilde{\theta} = (\theta + \theta')/2$ and sample $x' \sim p_{\tilde{\theta}}(\cdot \mid y)$.
  2. With probability
     \[ \min\left\{1, \frac{\eta(\theta')\, p_{\theta'}(x', y)\, q(\theta', \theta)\, p_{\tilde{\theta}}(x \mid y)}{\eta(\theta)\, p_\theta(x, y)\, q(\theta, \theta')\, p_{\tilde{\theta}}(x' \mid y)} \times \frac{p_{\tilde{\theta}}(y)}{p_{\tilde{\theta}}(y)} \right\} \]
     jump to $(\theta', x')$, otherwise stay at $(\theta, x)$.
- The algorithm requires exact sampling from $p_{\tilde{\theta}}(x' \mid y)$...

Estimate of the likelihood ratio

- Crucially, one can show that if $R_{\tilde{\theta}}$ is a Markov transition probability leaving $p_{\tilde{\theta}}(\cdot \mid y)$ invariant, then also
  \[ \frac{p_{\theta'}(y)}{p_\theta(y)} = \int\!\!\int \frac{p_{\tilde{\theta}}(x, y)}{p_\theta(x, y)}\, \frac{p_{\theta'}(x', y)}{p_{\tilde{\theta}}(x', y)}\, p_\theta(x \mid y)\, R_{\tilde{\theta}}(x, x')\, dx\, dx'. \]
- If further we have reversibility, $p_{\tilde{\theta}}(x \mid y)\, R_{\tilde{\theta}}(x, x') = p_{\tilde{\theta}}(x' \mid y)\, R_{\tilde{\theta}}(x', x)$, then the following is a valid algorithm to sample from $\pi(\theta, x \mid y) \propto \eta(\theta)\, p_\theta(x, y)$:
  1. Given $(\theta, x)$, sample $\theta' \sim q(\theta, \cdot)$, let $\tilde{\theta} = (\theta + \theta')/2$ and sample $x' \sim R_{\tilde{\theta}}(x, \cdot)$.
  2. With probability
     \[ \min\left\{1, \frac{\eta(\theta')\, p_{\theta'}(x', y)\, q(\theta', \theta)\, p_{\tilde{\theta}}(x, y)}{\eta(\theta)\, p_\theta(x, y)\, q(\theta, \theta')\, p_{\tilde{\theta}}(x', y)}\right\} \]
     jump to $(\theta', x')$, otherwise stay at $(\theta, x)$.
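A sketch of one update of this algorithm, assuming a symmetric random walk $q$ (so the $q$ terms cancel) and a user-supplied kernel `R` that is reversible with respect to $p_{\tilde{\theta}}(\cdot \mid y)$, e.g. a Metropolis or cSMC kernel:

```python
import numpy as np

def bridge_mh_step(theta, x, log_eta, log_p, R, step, rng):
    """One update targeting pi(theta, x | y) proportional to eta(theta) p_theta(x, y).
    log_p(th, x) = log p_th(x, y); R(th_mid, x, rng) must be reversible
    w.r.t. p_{th_mid}(. | y). Symmetric random-walk q, so q terms cancel."""
    prop = theta + step * rng.standard_normal()
    mid = 0.5 * (theta + prop)                    # theta_tilde
    x_prop = R(mid, x, rng)                       # x' ~ R_{theta_tilde}(x, .)
    log_r = (log_eta(prop) + log_p(prop, x_prop) + log_p(mid, x)
             - log_eta(theta) - log_p(theta, x) - log_p(mid, x_prop))
    if np.log(rng.uniform()) < log_r:
        return prop, x_prop
    return theta, x
```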
More general AIS...

- It is possible to generalise this and introduce $K$ intermediate steps.
- E.g. consider $\mathcal{P}_{\theta,\theta',K} := \{\pi_{\theta,\theta',k}(x),\ k = 0, \ldots, K+1\}$ with
  \[ \pi_{\theta,\theta',k}(x) := p_{\theta \times k/(K+1) + \theta' \times [1 - k/(K+1)]}(\cdot \mid y) \]
- and the associated Markov kernels $\mathcal{R}_{\theta,\theta',K} := \{R_{\theta,\theta',k}(\cdot, \cdot) : \mathsf{X}^n \times \mathcal{X}^{\otimes n} \to [0, 1],\ k = 1, \ldots, K\}$.
- What is interesting is that in this case the estimator is of the form
  \[ \prod_{k=0}^{K} \frac{\pi_{\theta,\theta',k+1}(x_k)}{\pi_{\theta,\theta',k}(x_k)} \]
- and, with some conditions on $\mathcal{P}_{\theta,\theta',K}$ and $\mathcal{R}_{\theta,\theta',K}$, the estimator is consistent.

The conditional SMC

- Let $\pi_\theta(x_{1:T})$ be the conditional distribution $p_\theta(x_{1:T} \mid y_{1:T})$.
- The conditional sequential Monte Carlo (cSMC) algorithm is a $\pi_\theta$-invariant Markov kernel (in fact $\pi_\theta$-reversible).
- As suggested by its name, it is a particle based algorithm (say $N$ particles; call it $R_{\theta,N}$).
- It can be shown to have very good mixing properties [Andrieu et al., 2013] and Lindsten et al. 2015.

Sequential Monte Carlo

Algorithm 2: cSMC(N, x_{1:T}, M_θ, A_θ), where particle i = 1 is fixed to the reference path x_{1:T}

  for i = 1, ..., N:
      if i ≠ 1, sample x_1^(i) ~ m_θ(·)
      compute w_1^(i) ∝ μ_θ(x_1^(i)) g_θ(x_1^(i), y_1) / m_θ(x_1^(i))
  for t = 2, ..., T:
      for i = 1, ..., N:
          if i ≠ 1, sample a_{t-1}^(i) ~ P(w_{t-1}^(1), ..., w_{t-1}^(N)) and x_t^(i) ~ M_θ(x_{t-1}^(a_{t-1}^(i)), ·)
          compute w_t^(i) = f_θ(x_{t-1}^(a_{t-1}^(i)), x_t^(i)) g_θ(x_t^(i), y_t) / M_θ(x_{t-1}^(a_{t-1}^(i)), x_t^(i))
  sample k_T ~ P(w_T^(1), ..., w_T^(N)) and set x'_T = x_T^(k_T)
  for t = T-1, ..., 1:
      k_t = a_t^(k_{t+1}); set x'_t = x_t^(k_t)
  return x'_{1:T}
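A sketch of Algorithm 2 in its bootstrap flavour ($M_\theta = f_\theta$, scalar state; the callables are illustrative assumptions): the reference path occupies one particle slot throughout, and a new path is returned by tracing the ancestry of a particle drawn from the final weights:

```python
import numpy as np

def csmc(x_ref, y, sample_m0, sample_f, log_g, N, rng):
    """Conditional SMC (bootstrap flavour): slot 0 is pinned to the reference
    path x_ref; the kernel leaves p_theta(x_{1:T} | y_{1:T}) invariant."""
    T = len(y)
    X = np.empty((T, N))
    A = np.zeros((T, N), dtype=int)
    X[0] = sample_m0(N, rng)
    X[0, 0] = x_ref[0]                       # pin the reference particle
    logw = log_g(X[0], y[0])
    for t in range(1, T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        A[t] = rng.choice(N, size=N, p=w)    # ancestors for time t
        X[t] = sample_f(X[t - 1, A[t]], rng)
        A[t, 0] = 0                          # the reference keeps its own ancestry
        X[t, 0] = x_ref[t]
        logw = log_g(X[t], y[t])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    k = rng.choice(N, p=w)                   # select a path by final weight
    path = np.empty(T)
    for t in range(T - 1, -1, -1):           # ancestral tracing
        path[t] = X[t, k]
        if t > 0:
            k = A[t, k]
    return path
```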
Numerical experiments

- Stochastic volatility (toyish) model:
  \[ X_t = X_{t-1}/2 + 25 X_{t-1}/(1 + X_{t-1}^2) + 8\cos(1.2 t) + V_t, \quad t \geq 2, \]
  \[ Y_t = X_t^2/20 + W_t, \quad t \geq 1, \]
  where $X_1 \sim \mathcal{N}(0, 10)$, $V_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_v^2)$, $W_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$.
- The static parameter of the model is then $\theta = (\sigma_v^2, \sigma_w^2)$ and was ascribed a prior distribution.
- We implemented exact approximations of the random walk Metropolis with $q(\theta, \theta') = \mathcal{N}(\theta, \sigma^2/n)$.

Results

                AIS MCMC - PGBS       MwG - PGBS           PMMH
                σv²       σw²         σv²       σw²        σv²        σw²
  T = 500       16.583    21.424      18.880    32.686     19.742     19.433
  T = 1000      15.294    19.787      20.895    24.633     43.657     54.389
  T = 2000      17.431    21.680      22.798    26.487     279.009    554.928
  T = 3000      18.336    21.689      20.056    32.555     1202.928   1856.514
  T = 4000      17.239    19.782      24.127    32.678     2602.677   1916.475

Table: Estimated IAC times for $\sigma_v^2$ and $\sigma_w^2$ calculated for the algorithms being compared. All algorithms use N = 500 particles and 1 intermediate step.

Preliminary results

- Considered the situation
  \[ p_\theta(y_{1:T}) := \prod_{t=1}^{T} p_\theta(y_t) = \prod_{t=1}^{T} \int_{\mathsf{X}} p_\theta(x_t, y_t)\, dx_t. \]
- We make some assumptions, including:
  1. $\Theta \subset \mathbb{R}$ is compact;
  2. for any $x, y \in \mathsf{X} \times \mathsf{Y}$, $\theta \mapsto \log p_\theta(x, y)$ is three times differentiable with derivatives uniformly bounded in $\theta, x, y$...
- The Markov kernel $R_{\theta,N}$ is a product of Markov kernels, each targetting $p_\theta(x_t \mid y_t)$, and $N$ is not necessarily a number of particles.

Some asymptotics

Theorem. Under some assumptions... then $\mathbb{P}$-a.s., for any $\varepsilon_0 > 0$ there exist $T_0, N_0 \in \mathbb{N}$ such that for any $T \geq T_0$ and any sequence $N_T \in \mathbb{N}^{\mathbb{N}}$ with $N_T \geq N_0$ for $T \geq T_0$,
\[ \sup_{T \geq T_0} \left| \mathbb{E}^{\omega}_{T}\big[\min\{1, \tilde{r}_T(\theta, \epsilon; \omega, \xi)\}\big] - \check{\mathbb{E}}^{\omega}_{T}\big[\min\{1, r_T(\theta, \epsilon; \omega)\exp(Z)\}\big] \right| \leq \varepsilon_0 \]
and
\[ \sup_{T \geq T_0} \left| \mathbb{E}^{\omega}_{T}\big[\min\{1, \tilde{r}_T(\theta, \epsilon; \omega, \xi)\}^2\big] - \check{\mathbb{E}}^{\omega}_{T}\big[\min\{1, r_T(\theta, \epsilon; \omega)\exp(Z)\}^2\big] \right| \leq \varepsilon_0, \]
where
\[ Z \mid (\theta, \epsilon, \omega) \sim \mathcal{N}\left( -\frac{\varsigma_T^2(\theta, \epsilon)}{2},\ \varsigma_T^2(\theta, \epsilon) \right) \]
for some $\varsigma_T^2(\theta, \epsilon)$ such that $\lim_{T \to \infty} \varsigma_T^2(\theta, \epsilon)$ exists and is finite!

The end

Thank you!

References

Andrieu, C., Lee, A., and Vihola, M. (2013). Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers. arXiv:1312.6432.

Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725.

Crooks, G. (1998). Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. Journal of Statistical Physics, 90(5–6):1481–1487.

Jarzynski, C. (1997). Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Physical Review E, 56:5018–5035.

Neal, R. (1996). Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366.

Neal, R. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125–139.