Efficient Implementation of MCMC When Using An Unbiased Likelihood Estimator
Michael Pitt, University of Oxford
Joint work with A. Doucet, G. Deligiannidis & R. Kohn
Cambridge, 22/04/14

Organisation of the Talk

1. MCMC with intractable likelihood functions. In physics, this first appeared in Lin, Liu & Sloan (2000); in statistics, Beaumont (2003), Andrieu, Berthelsen, Doucet & Roberts (2006), Andrieu & Roberts (2009).
2. Inefficiency of the MCMC scheme using a likelihood estimator relative to the scheme using the known likelihood.
3. Guidelines on an optimal precision for the estimator of the log-likelihood.

Bayesian Inference

Likelihood function $p(y;\theta)$ where $\theta \in \Theta \subseteq \mathbb{R}^d$.
Prior distribution of density $p(\theta)$.
Bayesian inference relies on the posterior
$$\pi(\theta) = p(\theta \mid y) = \frac{p(y;\theta)\, p(\theta)}{\int_{\Theta} p(y;\theta')\, p(\theta')\, d\theta'}.$$
For non-trivial models, inference typically relies on MCMC.

Metropolis-Hastings algorithm

Set $\vartheta^{(0)}$ and iterate for $i = 1, 2, \ldots$
1. Sample $\vartheta \sim q(\cdot \mid \vartheta^{(i-1)})$.
2. Compute
$$\alpha = 1 \wedge \frac{p(y;\vartheta)\, p(\vartheta)\, q(\vartheta^{(i-1)} \mid \vartheta)}{p(y;\vartheta^{(i-1)})\, p(\vartheta^{(i-1)})\, q(\vartheta \mid \vartheta^{(i-1)})}.$$
3. With probability $\alpha$, set $\vartheta^{(i)} := \vartheta$, and set $\vartheta^{(i)} := \vartheta^{(i-1)}$ otherwise.
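As a concrete reference point, here is a minimal Python sketch of the Metropolis-Hastings iteration above. The helper names (`log_post`, `propose`, `log_q`) are placeholders introduced only for illustration; they are not part of the talk.

```python
import numpy as np

def metropolis_hastings(log_post, propose, log_q, theta0, n_iter, rng=None):
    """Minimal Metropolis-Hastings sketch.

    log_post(theta): log{p(y; theta) p(theta)} up to an additive constant.
    propose(theta, rng): draw a candidate from q(. | theta).
    log_q(a, b): log q(a | b), the proposal density."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        cand = propose(theta, rng)
        lp_cand = log_post(cand)
        # log alpha = log posterior ratio + log proposal correction
        log_alpha = (lp_cand - lp) + (log_q(theta, cand) - log_q(cand, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, lp = cand, lp_cand
        chain[i] = theta
    return chain
```

For a symmetric random-walk proposal the `log_q` correction cancels, which is the variant used later in the talk.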
Intractable Likelihood Function

In numerous scenarios, $p(y;\theta)$ cannot be evaluated pointwise; e.g.
$$p(y;\theta) = \int p(x, y; \theta)\, dx$$
where the integral cannot be evaluated.
A standard "solution" consists of using MCMC to sample from
$$p(\theta, x \mid y) = \frac{p(x, y; \theta)\, p(\theta)}{p(y)}$$
by updating $x$ and $\theta$ iteratively.
Such Gibbs sampling strategies can mix slowly and be difficult to put into practice.

MCMC with Intractable Likelihood Function

Let $\widehat{p}(y;\theta,U)$ be an unbiased non-negative estimator of the likelihood, where $U \sim m_\theta(\cdot)$; i.e.
$$p(y;\theta) = \int_{\mathsf{U}} \widehat{p}(y;\theta,u)\, m_\theta(u)\, du.$$
Introduce a target distribution on $\Theta \times \mathsf{U}$ of density
$$\pi(\theta,u) = \pi(\theta)\, \frac{\widehat{p}(y;\theta,u)\, m_\theta(u)}{p(y;\theta)} = \frac{\widehat{p}(y;\theta,u)\, p(\theta)\, m_\theta(u)}{p(y)};$$
then unbiasedness yields
$$\int_{\mathsf{U}} \pi(\theta,u)\, du = \pi(\theta).$$
Any MCMC algorithm sampling from $\pi(\theta,u)$ yields samples from $\pi(\theta)$.

Pseudo-Marginal Metropolis-Hastings algorithm

Set $(\vartheta^{(0)}, U^{(0)})$ and iterate for $i = 1, 2, \ldots$
1. Sample $\vartheta \sim q(\cdot \mid \vartheta^{(i-1)})$ and $U \sim m_\vartheta(\cdot)$ to obtain $\widehat{p}(y;\vartheta,U)$.
2. Compute
$$\alpha = 1 \wedge \frac{\widehat{p}(y;\vartheta,U)\, p(\vartheta)\, q(\vartheta^{(i-1)} \mid \vartheta)}{\widehat{p}(y;\vartheta^{(i-1)},U^{(i-1)})\, p(\vartheta^{(i-1)})\, q(\vartheta \mid \vartheta^{(i-1)})}.$$
3. With probability $\alpha$, set $\big(\vartheta^{(i)}, \widehat{p}(y;\vartheta^{(i)},U^{(i)})\big) := \big(\vartheta, \widehat{p}(y;\vartheta,U)\big)$; otherwise stay where you are.

Importance Sampling Estimator

For latent variable models, one has
$$p(y;\theta) = \int p(x, y; \theta)\, dx = \int \frac{p(x, y; \theta)}{q_\theta(x)}\, q_\theta(x)\, dx$$
where $q_\theta(x)$ is an importance sampling density.
An unbiased estimator is given by
$$\widehat{p}(y;\theta,U) = \frac{1}{N} \sum_{k=1}^{N} \frac{p(X^k, y; \theta)}{q_\theta(X^k)}, \qquad X^k \overset{\text{i.i.d.}}{\sim} q_\theta(\cdot).$$
Here $U := (X^1, \ldots, X^N)$ and $m_\theta(u) = \prod_{k=1}^{N} q_\theta(x^k)$.
For any $N \geq 1$, the pseudo-marginal MH algorithm admits $\pi(\theta)$ as invariant distribution.
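The two slides above translate almost line by line into code. The sketch below pairs the importance-sampling estimator with a pseudo-marginal random-walk MH step (so the proposal ratio cancels); all function and argument names (`is_log_lik_hat`, `sample_q`, `log_joint`, `rw_scale`, ...) are ours, introduced only for illustration.

```python
import numpy as np

def is_log_lik_hat(y, theta, N, sample_q, log_joint, log_q, rng):
    """Log of the unbiased IS estimator (1/N) sum_k p(X^k, y; theta) / q_theta(X^k)."""
    x = sample_q(theta, N, rng)                        # X^1, ..., X^N i.i.d. ~ q_theta
    log_w = log_joint(x, y, theta) - log_q(x, theta)   # log importance weights
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))      # log-sum-exp for stability

def pseudo_marginal_mh(y, log_prior, log_lik_hat, rw_scale, theta0, n_iter, rng=None):
    """Pseudo-marginal MH with a Gaussian random-walk proposal.

    Key point: the noisy log-likelihood of the current state is stored and
    re-used; re-estimating it would destroy pi(theta) as invariant law."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    ll = log_lik_hat(y, theta, rng)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        cand = theta + rw_scale * rng.standard_normal(theta.size)
        ll_cand = log_lik_hat(y, cand, rng)            # fresh estimate at the candidate only
        log_alpha = (ll_cand + log_prior(cand)) - (ll + log_prior(theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, ll = cand, ll_cand
        chain[i] = theta
    return chain
```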
Sequential Monte Carlo Estimator

$\{X_t\}_{t \geq 0}$ is an $\mathsf{X}$-valued latent Markov process with $X_0 \sim \mu(\cdot\,;\theta)$ and $X_{t+1} \mid X_t \sim f(\cdot \mid X_t; \theta)$.
Observations $\{Y_t\}_{t \geq 1}$ are conditionally independent given $\{X_t\}_{t \geq 0}$ with $Y_t \mid \{X_k\}_{k \geq 0} \sim g(\cdot \mid X_t; \theta)$.
The likelihood of $y_{1:T} = (y_1, \ldots, y_T)$ is
$$p(y_{1:T};\theta) = \int_{\mathsf{X}^{T+1}} p(x_{0:T}, y_{1:T}; \theta)\, dx_{0:T}.$$
SMC provides an unbiased estimator of relative variance $O(T/N)$, where $N$ is the number of particles.
For any $N \geq 1$, the pseudo-marginal MH algorithm admits $\pi(\theta)$ as invariant distribution.

A Nonlinear State-Space Model

Standard non-linear model:
$$X_t = \tfrac{1}{2} X_{t-1} + 25\, \frac{X_{t-1}}{1 + X_{t-1}^2} + 8 \cos(1.2\, t) + V_t, \qquad V_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_V^2),$$
$$Y_t = \tfrac{1}{20} X_t^2 + W_t, \qquad W_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_W^2), \qquad t \geq 1.$$
$T = 200$ data points with $\theta = (\sigma_V^2, \sigma_W^2) = (10, 10)$.
It is difficult to perform standard MCMC as $p(x_{1:T} \mid y_{1:T}, \theta)$ is highly multimodal.
We sample from $p(\theta \mid y_{1:T})$ using a random walk pseudo-marginal MH where $p(y_{1:T};\theta)$ is estimated using SMC with $N$ particles (see the particle filter sketch below).
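To make the SMC estimator concrete for this model, here is a sketch of a bootstrap particle filter returning an estimate of $\log p(y_{1:T};\theta)$. The initial distribution $X_0 \sim \mathcal{N}(0, 5)$ is an assumption made here for illustration (the slides do not specify $\mu(\cdot\,;\theta)$), and multinomial resampling is used at every step for simplicity.

```python
import numpy as np

def bootstrap_pf_loglik(y, sigma2_V, sigma2_W, N, rng=None):
    """Bootstrap particle filter estimate of log p(y_{1:T}; theta) for the
    nonlinear model above. exp(return value) is unbiased for the likelihood."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, np.sqrt(5.0), size=N)   # X_0 ~ N(0, 5): our choice, not from the slides
    loglik = 0.0
    for t, yt in enumerate(y, start=1):
        # propagate particles through the state equation
        x = (0.5 * x + 25.0 * x / (1.0 + x**2) + 8.0 * np.cos(1.2 * t)
             + rng.normal(0.0, np.sqrt(sigma2_V), size=N))
        # weight by the observation density g(y_t | x_t)
        resid = yt - x**2 / 20.0
        log_w = -0.5 * (np.log(2.0 * np.pi * sigma2_W) + resid**2 / sigma2_W)
        m = log_w.max()
        w = np.exp(log_w - m)
        loglik += m + np.log(w.mean())          # log of the average unnormalised weight
        # multinomial resampling
        x = x[rng.choice(N, size=N, p=w / w.sum())]
    return loglik
```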
A Nonlinear State-Space Model

Figure: Autocorrelation of $\{\sigma_V^{(i)}\}$ and $\{\sigma_W^{(i)}\}$ of the MH sampler for various $N$ ($N = 96$, $600$, $6667$; $T = 200$).

How to Select the Number of Samples

A key issue from a practical point of view: how should we select $N$?
If $N$ is too small, the algorithm mixes poorly and will require many MCMC iterations.
If $N$ is too large, each MCMC iteration is expensive.

Pseudo-Marginal Metropolis-Hastings algorithm

Consider the error in the log-likelihood estimator
$$Z = \log \widehat{p}(y;\theta,U) - \log p(y;\theta) \sim g_\theta(\cdot).$$
In the $(\theta, Z)$ parameterization, the target is
$$\pi(\theta,u) = \pi(\theta)\, \frac{\widehat{p}(y;\theta,u)\, m_\theta(u)}{p(y;\theta)} \;\Rightarrow\; \pi(\theta,z) = \pi(\theta)\, \exp(z)\, g_\theta(z).$$
The pseudo-marginal MH algorithm is
$$Q\{(\theta,z), (d\vartheta, dw)\} = q(\vartheta \mid \theta)\, g_\vartheta(w)\, \min\{1, r(\theta,\vartheta) \exp(w - z)\}\, d\vartheta\, dw + \{1 - \varpi_Q(\theta,z)\}\, \delta_{(\theta,z)}(d\vartheta, dw),$$
where $r(\theta,\vartheta) = \pi(\vartheta)\, q(\theta \mid \vartheta) / \{\pi(\theta)\, q(\vartheta \mid \theta)\}$.

Inefficiency Measure

Consider the Markov chain $\{\theta_i, Z_i\}$ of transition $Q$ with invariant distribution $\pi(\theta,z)$ and $h : \Theta \to \mathbb{R}$. Define
$$\pi(h) := \mathbb{E}_\pi[h(\theta)] \qquad \text{and} \qquad \widehat{\pi}_n(h) = n^{-1} \sum_{i=1}^{n} h(\theta_i).$$
Proposition (Kipnis & Varadhan, 1986). If $\{\theta_i, Z_i\}$ is stationary and ergodic, $\mathbb{V}_\pi[h(\theta)] < \infty$ and
$$\mathrm{IF}_h^Q := 1 + 2 \sum_{\tau=1}^{\infty} \mathrm{corr}_{\pi,Q}\{h(\theta_0), h(\theta_\tau)\},$$
then
$$\sqrt{n}\, \{\widehat{\pi}_n(h) - \pi(h)\} \to \mathcal{N}\big(0,\ \mathbb{V}_\pi[h(\theta)]\, \mathrm{IF}_h^Q\big).$$
The integrated autocorrelation time $\mathrm{IF}_h^Q$ is a measure of inefficiency of $Q$, which we want to minimize for a fixed computational budget.
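$\mathrm{IF}_h^Q$ is exactly what one monitors empirically when tuning $N$. A simple truncated-sum estimate from MCMC output is sketched below; the truncation rule (stop at the first negative autocorrelation estimate) is a common heuristic and is not claimed to be the one used in the paper.

```python
import numpy as np

def integrated_autocorrelation_time(h_values, max_lag=None):
    """Estimate IF_h = 1 + 2 * sum_{tau>=1} corr{h(theta_0), h(theta_tau)} from a chain."""
    h = np.asarray(h_values, dtype=float)
    n = h.size
    h = h - h.mean()
    var = np.dot(h, h) / n
    max_lag = n // 10 if max_lag is None else max_lag
    if_h = 1.0
    for tau in range(1, max_lag):
        rho = np.dot(h[:n - tau], h[tau:]) / (n * var)
        if rho < 0.0:              # crude truncation heuristic
            break
        if_h += 2.0 * rho
    return if_h
```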
Aim of the Analysis

Simplifying Assumption: the noise $Z$ is independent of $\theta$ and Gaussian, i.e. $Z \sim \mathcal{N}(-\sigma^2/2, \sigma^2)$:
$$\pi(\theta,z) = \pi(\theta)\, \underbrace{\exp(z)\, g(z)}_{\pi_Z(z)} = \pi(\theta)\, \mathcal{N}(z;\, \sigma^2/2,\, \sigma^2).$$
Aim: minimize the computing time
$$\mathrm{CT}_h^Q(\sigma) = \mathrm{IF}_h^Q(\sigma) / \sigma^2,$$
since $\sigma^2 \propto 1/N$ and the computational effort is proportional to $N$.
Special cases:
1. When $q(\vartheta \mid \theta) = p(\vartheta \mid y)$, $\sigma_{\mathrm{opt}} = 0.92$ (Pitt et al., 2012).
2. When $\pi(\theta) = \prod_{i=1}^{d} f(\theta_i)$ and $q(\vartheta \mid \theta)$ is an isotropic Gaussian random walk then, as $d \to \infty$, $\sigma_{\mathrm{opt}} = 1.81$ (Sherlock et al., 2013).
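As a quick check added here (it is implicit in the slide), completing the square shows why the assumed noise law has mean $-\sigma^2/2$: exponential tilting maps it to the Gaussian $\pi_Z$ displayed above, which integrates to one, i.e. $\mathbb{E}[\exp(Z)] = 1$, matching unbiasedness of the likelihood estimator:
$$\exp(z)\, \mathcal{N}\!\left(z;\, -\tfrac{\sigma^2}{2},\, \sigma^2\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{z - \frac{(z + \sigma^2/2)^2}{2\sigma^2}\right\} = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{-\frac{(z - \sigma^2/2)^2}{2\sigma^2}\right\} = \mathcal{N}\!\left(z;\, \tfrac{\sigma^2}{2},\, \sigma^2\right).$$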
Sketch of the Analysis

For general proposals and targets, direct minimization of $\mathrm{CT}_h^Q(\sigma) = \mathrm{IF}_h^Q(\sigma)/\sigma^2$ is impossible, so we minimize an upper bound on it.
We introduce an auxiliary $\pi(\theta,z)$-reversible kernel
$$\widetilde{Q}\{(\theta,z), (d\vartheta, dw)\} = q(\vartheta \mid \theta)\, g(w)\, \alpha_{\mathrm{EX}}(\theta,\vartheta)\, \alpha_Z(z,w)\, d\vartheta\, dw + \{1 - \varpi_{\mathrm{EX}}(\theta)\, \varpi_Z(z)\}\, \delta_{(\theta,z)}(d\vartheta, dw),$$
where
$$\alpha_{\mathrm{EX}}(\theta,\vartheta) = \min\{1, r(\theta,\vartheta)\}, \qquad \alpha_Z(z,w) = \min\{1, \exp(w - z)\}.$$
Peskun's theorem (1973) guarantees that $\mathrm{IF}_h^Q(\sigma) \leq \mathrm{IF}_h^{\widetilde{Q}}(\sigma)$, so that $\mathrm{CT}_h^Q(\sigma) \leq \mathrm{CT}_h^{\widetilde{Q}}(\sigma)$.

A Bounding Chain

The $\widetilde{Q}$ chain can be interpreted through the following procedure. Let $(\theta, Z)$ be the current state of the Markov chain.
1. Propose $\vartheta \sim q(\vartheta \mid \theta)$ and $w \sim g(\cdot\,; \vartheta)$.
2. Accept $\vartheta$ w.p. $\alpha_{\mathrm{EX}}(\theta,\vartheta) = 1 \wedge \dfrac{\pi(\vartheta)\, q(\theta \mid \vartheta)}{\pi(\theta)\, q(\vartheta \mid \theta)}$.
3. Accept $w$ w.p. $\alpha_Z(z,w) = 1 \wedge \exp(w - z)$.
4. $(\vartheta, w)$ is accepted if and only if there is acceptance in both criteria.

Sketch of the Analysis

Let $(\theta_i, Z_i)_{i \geq 1}$ be generated by $\widetilde{Q}$.
Denote by $(\widetilde{\theta}_i, \widetilde{Z}_i)_{i \geq 1}$ the accepted proposals and by $(\tau_i)_{i \geq 1}$ the associated sojourn times; i.e.
$$(\widetilde{\theta}_1, \widetilde{Z}_1) = (\theta_1, Z_1) = \cdots = (\theta_{\tau_1}, Z_{\tau_1}), \qquad (\widetilde{\theta}_2, \widetilde{Z}_2) = (\theta_{\tau_1+1}, Z_{\tau_1+1}) = \cdots = (\theta_{\tau_2}, Z_{\tau_2}),$$
and so on, where $(\widetilde{\theta}_{i+1}, \widetilde{Z}_{i+1}) \neq (\widetilde{\theta}_i, \widetilde{Z}_i)$ a.s.
$\mathrm{IF}_h^{\widetilde{Q}}(\sigma)$ can be re-expressed in terms of $\mathrm{IF}_{h/(\varpi_{\mathrm{EX}} \varpi_Z)}^{\widetilde{Q}_J}(\sigma)$ where
$$\widetilde{Q}_J\{(\theta,z), (d\vartheta, dw)\} = \frac{q(d\vartheta \mid \theta)\, \alpha_{\mathrm{EX}}(\theta,\vartheta)}{\varpi_{\mathrm{EX}}(\theta)}\, \frac{g(dw)\, \alpha_Z(z,w)}{\varpi_Z(z)}.$$
The proof exploits spectral representations of these kernels together with positivity of $\widetilde{Q}_J^Z(z, dw)$.

Main Results

Proposition:
$$\mathrm{IF}_h^{\widetilde{Q}}(\sigma) \geq \mathrm{IF}_h^{\mathrm{EX}}, \qquad \frac{\mathrm{IF}_h^{\widetilde{Q}}(\sigma)}{\mathrm{IF}_h^{\mathrm{EX}}} \leq \frac{1}{2}\left(1 + \frac{1}{\mathrm{IF}_h^{\mathrm{EX}}}\right)\left(1 + \mathrm{IF}^Z(\sigma)\right) - \frac{1}{\mathrm{IF}_h^{\mathrm{EX}}},$$
and the bound is tight as $\mathrm{IF}_h^{\mathrm{EX}} \to 1$ or $\sigma \to 0$. As $\mathrm{IF}_{J,\, h/\varpi_{\mathrm{EX}}}^{\mathrm{EX}} \to \infty$,
$$\frac{\mathrm{IF}_h^{\widetilde{Q}}(\sigma)}{\mathrm{IF}_h^{\mathrm{EX}}} \to \frac{1}{\pi_Z(\varpi_Z(\sigma))}.$$
These results are used to minimize, with respect to $\sigma$, upper bounds on $\mathrm{CT}_h^Q(\sigma) = \mathrm{IF}_h^Q(\sigma)/\sigma^2$.

Bounds on Relative Computational Costs

Figure: Bounds on $\mathrm{IF}_h^{\widetilde{Q}}(\sigma) / \{\sigma^2\, \mathrm{IF}_h^{\mathrm{EX}}\}$ (relative computing time, RCT) against $\sigma$, for $\mathrm{IF}^{\mathrm{EX}} \in \{2, 4, 10, 40, 10^6\}$ (top panel) and $\mathrm{IF}_J^{\mathrm{EX}} \in \{2, 4, 10, 40, 10^6\}$ (bottom panel).
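Under our reading of the large-$\mathrm{IF}^{\mathrm{EX}}$ limit above, and using the standard calculation that under the Gaussian assumption the average acceptance probability of the noise component is $\pi_Z(\varpi_Z(\sigma)) = 2\Phi(-\sigma/\sqrt{2})$ (as in Pitt et al., 2012), the limiting bound on the relative computing time is proportional to $1/\{\sigma^2 \cdot 2\Phi(-\sigma/\sqrt{2})\}$. The short sketch below minimises this numerically and recovers $\sigma_{\mathrm{opt}} \approx 1.7$, the poor-proposal guideline that follows; it is an illustration we add here, not the paper's own computation.

```python
import numpy as np
from math import erfc

def expected_z_acceptance(sigma):
    """E[min(1, exp(W - Z))] with W ~ N(-sigma^2/2, sigma^2) (proposed noise)
    and Z ~ N(+sigma^2/2, sigma^2) (stationary noise); equals 2*Phi(-sigma/sqrt(2))."""
    return erfc(sigma / 2.0)        # 2 * Phi(-s/sqrt(2)) = erfc(s/2)

def limiting_relative_ct(sigma):
    """Limiting bound on the relative computing time, up to a constant."""
    return 1.0 / (sigma**2 * expected_z_acceptance(sigma))

# crude grid search for the minimiser
grid = np.linspace(0.5, 3.0, 2501)
sigma_opt = grid[np.argmin([limiting_relative_ct(s) for s in grid])]
print(f"sigma_opt ~ {sigma_opt:.2f}")   # ~ 1.68, i.e. the 'sigma ~ 1.7' guideline
```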
Practical Guidelines

For good proposals, select $\sigma \approx 1$, whereas for poor proposals, select $\sigma \approx 1.7$.
When you have no clue about the proposal efficiency:
1. If $\sigma_{\mathrm{opt}} = 1$ and you pick $\sigma = 1.7$, computing time increases by 150%.
2. If $\sigma_{\mathrm{opt}} = 1.7$ and you pick $\sigma = 1$, computing time increases by 50%.
3. If $\sigma_{\mathrm{opt}} = 1$ or $\sigma_{\mathrm{opt}} = 1.7$ and you pick $\sigma = 1.2$, computing time increases by 15%.

Example: Noisy Autoregressive Model

Consider
$$X_t = \mu(1 - \phi) + \phi X_{t-1} + V_t, \qquad V_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\eta^2),$$
$$Y_t = X_t + W_t, \qquad W_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2),$$
where $\theta = (\phi, \mu, \sigma_\eta^2)$.
The likelihood can be computed exactly using the Kalman filter (see the sketch after the figures below).
Autoregressive Metropolis proposal of coefficient $\rho$ for $\vartheta$, based on a multivariate t-distribution.
$N$ is selected so as to keep $\sigma(\bar{\theta})$ constant, where $\bar{\theta}$ is the posterior mean.

Empirical vs Asymptotic Distribution of Log-Likelihood Estimator

Figure: Histograms of proposed (red) and accepted (pink) values of $Z$ in the PMCMC scheme, for the probit example ($N = 120$; $\rho = 0$ and $\rho = 0.97$) and the AR(1) plus noise example ($N = 60$; $\rho = 0.50$ and $\rho = 0.9$). Overlaid are the Gaussian pdfs from our simplifying assumption for a target of $\sigma = 0.92$.

Relative Inefficiency and Computing Time

Figure: From left to right: $\mathrm{RCT}_h^Q$ against $N$, $\mathrm{RCT}_h^Q$ against $\sigma(\bar{\theta})$, $\mathrm{RIF}_h^Q$ against $N$ and $\mathrm{RIF}_h^Q$ against $\sigma(\bar{\theta})$, for various values of $\rho$ and different parameters.
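For the AR(1)-plus-noise model above, the exact log-likelihood used as the benchmark is a one-dimensional Kalman filter. A minimal sketch, assuming $|\phi| < 1$ so the filter can be initialised at the stationary law of the state; function and variable names are ours:

```python
import numpy as np

def kalman_loglik_ar1_plus_noise(y, phi, mu, sigma2_eta, sigma2_eps):
    """Exact log p(y_{1:T}; theta) for
    X_t = mu(1 - phi) + phi X_{t-1} + V_t,  V_t ~ N(0, sigma2_eta),
    Y_t = X_t + W_t,                        W_t ~ N(0, sigma2_eps).
    Assumes |phi| < 1; the filter starts at the stationary distribution of X."""
    m = mu
    P = sigma2_eta / (1.0 - phi**2)
    loglik = 0.0
    for yt in np.asarray(y, dtype=float):
        # predict the next state
        m = mu * (1.0 - phi) + phi * m
        P = phi**2 * P + sigma2_eta
        # one-step-ahead prediction of y_t and its variance
        S = P + sigma2_eps
        v = yt - m
        loglik += -0.5 * (np.log(2.0 * np.pi * S) + v**2 / S)
        # update with the observation
        K = P / S
        m = m + K * v
        P = (1.0 - K) * P
    return loglik
```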
Acceptance Probability

Figure: Acceptance probability against $\sigma$, theoretical (upper bound) vs actual, for $\rho = 0, 0.4, 0.6, 0.9$: Pr(accept) of the exact MH, of the particle MH, and of the bound.

Discussion

Simplified quantitative analysis of pseudo-marginal MCMC.
1. Much weaker assumptions than previous work on the proposal and target.
2. Assumption on orthogonality of the noise.
For good proposals, select $\sigma \approx 1$, whereas for poor proposals, select $\sigma \approx 1.7$.
In scenarios where the proposal efficiency is unknown, select $\sigma \approx 1.2$.