Bayesian inference for doubly-intractable distributions
Anne-Marie Lyne (anne-marie.lyne@curie.fr)
April 2016

Joint work: On Russian Roulette Estimates for Bayesian Inference with Doubly-Intractable Likelihoods.
Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, Daniel Simpson.
Statistical Science, Volume 30, Number 4 (2015).

Talk overview
1 Doubly-intractable models
2 Current Bayesian approaches
3 Our approach using Russian roulette sampling
4 Results

Motivation: Doubly-intractable models

Motivation: Exponential Random Graph Models
- Used extensively in the social networks community.
- View the observed network as one realisation of a random variable.
- The probability of observing a given graph depends on certain 'local' graph properties, for example the edge density or the number of triangles or k-stars.

Motivation: Modelling social networks

  P(Y = y) = Z(θ)^{-1} exp( Σ_k θ_k g_k(y) )

- g(y) is a vector of K graph statistics.
- θ is a K-dimensional parameter indicating the 'importance' of each graph statistic.
- The (intractable) partition function, or normalising term, is

  Z(θ) = Σ_{y ∈ Y} exp( Σ_k θ_k g_k(y) ).

Parameter inference for doubly-intractable models
We want expectations with respect to the posterior distribution,

  E_π[φ(θ)] = ∫_Θ φ(θ) π(θ|y) dθ ≈ (1/N) Σ_{k=1}^N φ(θ^k),   θ^k ∼ π(θ|y).

The simplest function of interest is the posterior mean,

  E_π[θ] = ∫_Θ θ π(θ|y) dθ ≈ (1/N) Σ_{k=1}^N θ^k,   θ^k ∼ π(θ|y).

But we need to sample from the posterior distribution...

The Metropolis-Hastings algorithm
To draw samples from a distribution π(θ): choose an initial θ_0, define a proposal distribution q(θ, ·), set n = 0, and iterate the following for n = 0, ..., N_iters:
1 Propose a new parameter value θ′ from q(θ_n, ·).
2 Set θ_{n+1} = θ′ with probability α(θ_n, θ′), else set θ_{n+1} = θ_n, where

  α(θ_n, θ′) = min{ 1, [π(θ′) q(θ′, θ_n)] / [π(θ_n) q(θ_n, θ′)] }.

3 Set n = n + 1.

Doubly-intractable distributions
Unfortunately, ERGMs are an example of a 'doubly-intractable' distribution:

  π(θ|y) = p(y|θ) π(θ) / p(y) = [f(y; θ) / Z(θ)] · π(θ) / p(y).

The partition function, or normalising term, Z(θ) is intractable and is a function of the parameters, so it does not cancel in the Metropolis-Hastings acceptance ratio:

  α(θ, θ′) = min{ 1, [q(θ′, θ) π(θ′) f(y; θ′) Z(θ)] / [q(θ, θ′) π(θ) f(y; θ) Z(θ′)] }.

As well as ERGMs there are many other examples: Ising and Potts models, spatial models, phylogenetic models.
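To make the Metropolis-Hastings update above concrete, here is a minimal random-walk sketch (my own illustration, not from the talk). It assumes the target log-density, here called log_target, can be evaluated up to an additive constant that does not depend on θ, which is exactly what fails for doubly-intractable models.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iters=10_000, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings. `log_target` must be computable up to a
    constant not depending on theta (not the case when Z(theta) is unknown)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_p = log_target(theta)
    samples = np.empty((n_iters, theta.size))
    for n in range(n_iters):
        prop = theta + step * rng.standard_normal(theta.size)  # symmetric proposal, q terms cancel
        log_p_prop = log_target(prop)
        if np.log(rng.uniform()) < log_p_prop - log_p:          # accept with probability min(1, ratio)
            theta, log_p = prop, log_p_prop
        samples[n] = theta
    return samples

# Toy usage: sample from a standard Gaussian target.
draws = metropolis_hastings(lambda t: -0.5 * np.sum(t ** 2), theta0=0.0)
print(draws.mean(), draws.std())
```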
Current Bayesian approaches
- Approaches which use some kind of approximation: pseudo-likelihoods.
- Exact-approximate MCMC approaches:
  - auxiliary-variable methods such as the exchange algorithm (requires a perfect sample if implemented correctly);
  - pseudo-marginal MCMC (requires an unbiased estimate of the likelihood), with acceptance probability

  α(θ_n, θ′) = min{ 1, [π̂(θ′) q(θ′, θ_n)] / [π̂(θ_n) q(θ_n, θ′)] }.

The exchange algorithm (Murray et al. 2004; Møller et al. 2004)
Expand the state space of our target (posterior) distribution to

  p(x, θ, θ′ | y) = [f(y; θ) / (p(y) Z(θ))] · π(θ) q(θ, θ′) · [f(x; θ′) / Z(θ′)].

- Gibbs sample θ′ ∼ q(θ, ·) and x ∼ f(x; θ′)/Z(θ′) (a perfect sample from the likelihood at θ′).
- Propose to swap θ ↔ θ′ using Metropolis-Hastings:

  α(θ, θ′) = min{ 1, [f(y; θ′) f(x; θ) π(θ′) q(θ′, θ)] / [f(y; θ) f(x; θ′) π(θ) q(θ, θ′)] }.

  All the intractable normalising terms cancel.

Pseudo-marginal MCMC (Andrieu and Roberts, 2009)
- We need an unbiased, positive estimate of the likelihood, p̂(y|θ, u), such that

  ∫ p̂(y|θ, u) p_θ(u) du = p(y|θ).

- Define the joint distribution

  π(θ, u|y) = π(θ) p̂(y|θ, u) p_θ(u) / p(y).

- This integrates to one and has the posterior as its marginal, so we can sample from it:

  α((θ, u), (θ′, u′)) = min{ 1, [p̂(y|θ′, u′) π(θ′) p_{θ′}(u′)] / [p̂(y|θ, u) π(θ) p_θ(u)] × [q(θ′, θ) p_θ(u)] / [q(θ, θ′) p_{θ′}(u′)] }.

What's the problem?
We need an unbiased estimate of

  π(θ|y) = p(y|θ) π(θ) / p(y) = [f(y; θ) / Z(θ)] · π(θ) / p(y).

We can unbiasedly estimate Z(θ), but taking the reciprocal destroys unbiasedness:

  E[Ẑ(θ)] = Z(θ)   but   E[1/Ẑ(θ)] ≠ 1/Z(θ).

Our approach
- Construct an unbiased estimate of the likelihood, based on a series expansion of the likelihood and stochastic truncation.
- Use pseudo-marginal MCMC to sample from the desired posterior distribution.
(A minimal sketch of the pseudo-marginal acceptance step is given below.)
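As a rough illustration of the pseudo-marginal scheme (my own sketch, not code from the paper), the function below runs a random-walk sampler but replaces the exact log-likelihood with a user-supplied noisy estimator est_log_lik(theta). It assumes the estimator is unbiased and strictly positive on the likelihood scale; the key point is that the current estimate is recycled rather than recomputed.

```python
import numpy as np

def pseudo_marginal_mh(est_log_lik, log_prior, theta0, n_iters=10_000, step=0.2, rng=None):
    """Pseudo-marginal Metropolis-Hastings: `est_log_lik(theta)` returns the log of a
    noisy, unbiased, positive likelihood estimate. Reusing the current estimate
    (never re-evaluating it) is what keeps the exact posterior invariant."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_post_hat = est_log_lik(theta) + log_prior(theta)
    samples = np.empty((n_iters, theta.size))
    for n in range(n_iters):
        prop = theta + step * rng.standard_normal(theta.size)
        log_post_hat_prop = est_log_lik(prop) + log_prior(prop)
        if np.log(rng.uniform()) < log_post_hat_prop - log_post_hat:
            theta, log_post_hat = prop, log_post_hat_prop
        samples[n] = theta
    return samples

# Toy usage: exact Gaussian log-likelihood for one datum y = 1.2, corrupted by
# mean-one multiplicative noise so that exp(est_log_lik) is unbiased and positive.
y, noise_sd = 1.2, 0.3
def est_log_lik(theta):
    exact = -0.5 * np.sum((y - theta) ** 2) - 0.5 * np.log(2 * np.pi)
    return exact + noise_sd * np.random.standard_normal() - 0.5 * noise_sd ** 2

draws = pseudo_marginal_mh(est_log_lik, lambda t: -0.5 * np.sum(t ** 2), theta0=0.0)
print(draws.mean())
```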
Proposed methodology
- Construct random variables {V_θ^j, j ≥ 0} such that the series

  π̂(θ|y, {V^j}) = Σ_{j=0}^∞ V_θ^j   has   E[ π̂(θ|y, {V^j}) ] = π(θ|y).

- This infinite series then needs to be truncated unbiasedly, which can be achieved via a number of Russian roulette schemes.
- Define a random time τ_θ such that, writing u := (τ_θ, {V_θ^j, 0 ≤ j ≤ τ_θ}), the truncated estimate

  π̂(θ, u|y) = Σ_{j=0}^{τ_θ} V_θ^j   satisfies   E[ π̂(θ, u|y) | {V_θ^j, j ≥ 0} ] = Σ_{j=0}^∞ V_θ^j.

Implementation example
Rewrite the likelihood as an infinite series, inspired by the work of Booth (2007). A simple manipulation gives

  f(y; θ) / Z(θ) = f(y; θ) / ( Z̃(θ) [1 − (1 − Z(θ)/Z̃(θ))] ) = [f(y; θ) / Z̃(θ)] Σ_{n=0}^∞ κ(θ)^n,   where κ(θ) = 1 − Z(θ)/Z̃(θ).

The series converges for |κ(θ)| < 1.

Implementation example continued
We can unbiasedly estimate each term in the series using n independent estimates of Z(θ):

  f(y; θ) / Z(θ) = [f(y; θ) / Z̃(θ)] Σ_{n=0}^∞ (1 − Z(θ)/Z̃(θ))^n ≈ [f(y; θ) / Z̃(θ)] Σ_{n=0}^∞ ∏_{i=1}^n (1 − Ẑ_i(θ)/Z̃(θ)).

- The estimates Ẑ_i(θ) can be computed using importance sampling (IS) or sequential Monte Carlo (SMC), for example.
- But we can't compute an infinite number of terms...

Unbiased estimates of infinite series
Take an infinite convergent series S = Σ_{k=0}^∞ a_k which we would like to estimate unbiasedly.
- The simplest scheme: draw an integer k with probability p(K = k), where Σ_{k=0}^∞ p(K = k) = 1, and set Ŝ = a_k / p(K = k).
- Then E[Ŝ] = Σ_k p(K = k) a_k / p(K = k) = S. (This is essentially importance sampling.)
- Variance: Σ_{n=0}^∞ a_n² / p(N = n) − S².

Russian roulette
An alternative: Russian roulette. Choose a sequence of probabilities {q_n} and draw a sequence of i.i.d. uniform random variables U_n ∼ U[0, 1], n = 1, 2, 3, ...
- Define the stopping time τ = inf{ k ≥ 1 : U_k ≥ q_k }.
- Then S_τ = Σ_{j=0}^{τ−1} a_j / ∏_{i=1}^{j} q_i is an unbiased estimate of S.
- The {q_n} must be chosen to minimise the variance of the estimator.
(A sketch combining the geometric series with Russian roulette truncation is given below.)
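The following is a minimal sketch (my own, not from the paper) of the geometric-series construction truncated by Russian roulette. It assumes a user-supplied function z_hat() returning independent unbiased estimates of Z(θ), a fixed constant z_tilde playing the role of Z̃(θ), the value f_val = f(y; θ), and a constant survival probability q_n = q, which is just one simple choice of roulette probabilities.

```python
import numpy as np

def roulette_likelihood_estimate(f_val, z_hat, z_tilde, q=0.9, rng=None):
    """Unbiased estimate of f(y; theta) / Z(theta) via the geometric series
        (f / z_tilde) * sum_n prod_{i<=n} (1 - Z_hat_i / z_tilde),
    truncated by Russian roulette with constant survival probability q.
    The result may be negative; that is handled later by the sign trick."""
    rng = np.random.default_rng() if rng is None else rng
    total = 1.0    # n = 0 term of the series
    prod = 1.0     # running product of (1 - Z_hat_i / z_tilde)
    weight = 1.0   # running product of the survival probabilities q_i
    while rng.uniform() < q:        # survive round j with probability q
        weight *= q
        prod *= 1.0 - z_hat() / z_tilde
        total += prod / weight      # j-th term, reweighted to keep the estimate unbiased
    return (f_val / z_tilde) * total

# Toy usage: Z(theta) = 2.0 estimated with noise, f(y; theta) = 0.5, so the true
# likelihood is 0.25; averaging many roulette estimates recovers it.
rng = np.random.default_rng(0)
estimates = [roulette_likelihood_estimate(0.5, lambda: 2.0 + 0.2 * rng.standard_normal(),
                                          z_tilde=3.0, rng=rng) for _ in range(20_000)]
print(np.mean(estimates))   # close to 0.25
```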
Debiasing estimators (McLeish 2011; Rhee and Glynn 2012)
- We want an unbiased estimate of E[Y], but we cannot generate Y.
- We can generate approximations Y_n such that lim_{n→∞} E[Y_n] = E[Y], so that

  E[Y] = E[ Y_0 + Σ_{i=1}^∞ (Y_i − Y_{i−1}) ].

- Define a probability distribution for N, a non-negative integer-valued random variable. Then

  Z = Y_0 + Σ_{i=1}^N (Y_i − Y_{i−1}) / P(N ≥ i)

  is an unbiased estimator of E[Y] and, under suitable conditions, has finite variance.

Negative estimates
- Often we cannot guarantee that the overall estimate will be positive, but we can use a trick from the physics literature.
- Recall that we have an unbiased estimate of the likelihood, p̂(y|θ, u). Writing σ(p̂) for its sign,

  E_π[φ(θ)] = ∫ φ(θ) π(θ|y) dθ = ∫∫ φ(θ) π(θ, u|y) dθ du
            = ∫∫ φ(θ) p̂(y|θ, u) π(θ) p_θ(u) dθ du / ∫∫ p̂(y|θ, u) π(θ) p_θ(u) dθ du
            = ∫∫ φ(θ) σ(p̂) |p̂(y|θ, u)| π(θ) p_θ(u) dθ du / ∫∫ σ(p̂) |p̂(y|θ, u)| π(θ) p_θ(u) dθ du.

- Defining the 'absolute' distribution q(θ, u|y) ∝ |p̂(y|θ, u)| π(θ) p_θ(u), this becomes

  E_π[φ(θ)] = ∫∫ φ(θ) σ(p̂) q(θ, u|y) dθ du / ∫∫ σ(p̂) q(θ, u|y) dθ du,

  so we can get a Monte Carlo estimate of φ(θ) with respect to the posterior using samples from the absolute distribution:

  E_π[φ(θ)] ≈ Σ_k φ(θ^k) σ(p̂_k) / Σ_k σ(p̂_k).

(A sketch of this sign-corrected estimate is given below.)
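A small sketch (my own, not from the talk) of the sign-corrected importance-sampling identity: given MCMC samples of θ from the 'absolute' distribution together with the sign of each likelihood estimate, posterior expectations are recovered as a ratio of sign-weighted averages.

```python
import numpy as np

def sign_corrected_expectation(phi_values, signs):
    """Estimate E_pi[phi(theta)] from samples of the 'absolute' distribution.

    phi_values : phi(theta_k) evaluated at each MCMC sample
    signs      : +1/-1, the sign of the likelihood estimate at each sample
    Returns sum_k phi_k * sigma_k / sum_k sigma_k.
    """
    phi_values = np.asarray(phi_values, dtype=float)
    signs = np.asarray(signs, dtype=float)
    denom = signs.sum()
    if denom == 0:
        raise ValueError("Signs cancel exactly; the estimate is undefined for this sample.")
    return (phi_values * signs).sum() / denom

# Toy usage with hypothetical sampler output: mostly positive signs, a few negative ones.
rng = np.random.default_rng(1)
theta_samples = rng.normal(size=5_000)
signs = np.where(rng.uniform(size=5_000) < 0.95, 1.0, -1.0)
print(sign_corrected_expectation(theta_samples, signs))       # estimate of the posterior mean
print(sign_corrected_expectation(theta_samples ** 2, signs))  # estimate of the second moment
```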
Summary
- We can get an unbiased estimate of a doubly-intractable distribution:
  - draw a random integer k ∼ p(k) (or a Russian roulette stopping time);
  - compute the k-th term of the infinite series;
  - compute the overall unbiased estimate of the likelihood.
- Plug the absolute value into a pseudo-marginal MCMC scheme to sample from the posterior (or close to it).
- Compute expectations with respect to the posterior using the importance sampling (sign) identity.
- However, the methodology is computationally costly, as it needs many low-variance estimates of the partition function.
- But the estimates can be computed in parallel...

Results: Ising models
- We simulated a 40×40 grid of data points from a Gibbs sampler with Jβ = 0.2.
- Geometric construction with Russian roulette sampling; partition function estimates computed with annealed importance sampling (AIS).
- Parallel implementation using Matlab.

Example: Florentine business network
- ERGM model, 16 nodes.
- The graph statistics in the model exponent are the number of edges, the numbers of 2-stars and 3-stars, and the number of triangles.
- Estimates of the normalising term were computed using SMC.
- Series truncation was carried out using Russian roulette.

[Figures: posterior results for the Florentine business network.]

Further work
- Optimise various parts of the methodology (stopping probabilities, etc.).
- Compare with approximate approaches in terms of variance and computation.

Thank you for listening!

References
- Murray, Ghahramani, MacKay (2004). MCMC for doubly-intractable distributions.
- Møller, Pettitt, Berthelsen, Reeves (2004). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants.
- Rhee, Glynn (2012). Unbiased Estimation with Square Root Convergence for SDE Models.
- McLeish (2011). A general method for debiasing a Monte Carlo estimator.
- Booth (2007). Unbiased Monte Carlo Estimation of the Reciprocal of an Integral.