Bayesian inference for doubly-intractable
distributions
Anne-Marie Lyne
anne-marie.lyne@curie.fr
April, 2016
Joint Work
On Russian Roulette Estimates for Bayesian Inference with Doubly-Intractable Likelihoods
Anne-Marie Lyne, Mark Girolami, Yves Atchadé, Heiko Strathmann, Daniel Simpson
Statistical Science, Volume 30, Number 4 (2015)
Talk overview
1. Doubly-intractable models
2. Current Bayesian approaches
3. Our approach using Russian roulette sampling
4. Results
Motivation: Doubly-intractable models
Motivation: Exponential Random Graph Models
Used extensively in the social networks community
View the observed network as one realisation of a random variable
The probability of observing a given graph depends on certain ‘local’ graph properties
For example the edge density, the number of triangles, or the number of k-stars
Motivation: Modelling social networks
\[
P(Y = y) = Z(\theta)^{-1} \exp\Big( \sum_k \theta_k g_k(y) \Big)
\]
g(y) is a vector of K graph statistics
θ is a K-dimensional parameter indicating the ‘importance’ of each graph statistic
(Intractable) partition function or normalising term:
\[
Z(\theta) = \sum_{y \in \mathcal{Y}} \exp\Big( \sum_k \theta_k g_k(y) \Big)
\]
Parameter inference for doubly-intractable models
Expectations with respect to the posterior distribution:
\[
\mathbb{E}_\pi[\phi(\theta)] = \int_\Theta \phi(\theta)\,\pi(\theta|y)\,d\theta \approx \frac{1}{N}\sum_{k=1}^{N} \phi(\theta_k), \qquad \theta_k \sim \pi(\theta|y)
\]
Simplest function of interest:
\[
\mathbb{E}_\pi[\theta] = \int_\Theta \theta\,\pi(\theta|y)\,d\theta \approx \frac{1}{N}\sum_{k=1}^{N} \theta_k, \qquad \theta_k \sim \pi(\theta|y)
\]
But we need to sample from the posterior distribution...
The Metropolis-Hastings algorithm
To draw samples from a distribution π(θ):
Choose an initial θ_0, define a proposal distribution q(θ, ·), set n = 0.
Iterate the following for n = 0 ... N_iters:
1. Propose a new parameter value, θ', from q(θ_n, ·)
2. Set θ_{n+1} = θ' with probability α(θ_n, θ'), else θ_{n+1} = θ_n, where
\[
\alpha(\theta_n, \theta') = \min\left(1, \frac{\pi(\theta')\,q(\theta', \theta_n)}{\pi(\theta_n)\,q(\theta_n, \theta')}\right)
\]
3. n = n + 1
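As a concrete illustration, here is a minimal random-walk Metropolis-Hastings sketch in Python. It assumes a tractable target: `log_target` is a hypothetical function returning log π(θ) up to an additive constant, and the Gaussian step size is illustrative. The symmetric proposal makes the q terms cancel in α.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iters=10_000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iters, theta.size))
    for n in range(n_iters):
        theta_prop = theta + step * rng.standard_normal(theta.shape)
        # Symmetric proposal, so alpha reduces to the target ratio.
        if np.log(rng.uniform()) < log_target(theta_prop) - log_target(theta):
            theta = theta_prop
        samples[n] = theta
    return samples
```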
Doubly-intractable distributions
Unfortunately, ERGMs are an example of a ‘doubly-intractable’ distribution:
\[
\pi(\theta|y) = \frac{p(y|\theta)\,\pi(\theta)}{p(y)} = \frac{f(y;\theta)}{Z(\theta)} \cdot \frac{\pi(\theta)}{p(y)}
\]
The partition function or normalising term, Z(θ), is intractable and a function of the parameters, so it does not cancel in the Metropolis-Hastings acceptance ratio:
\[
\alpha(\theta, \theta') = \min\left(1, \frac{q(\theta', \theta)\,\pi(\theta')\,f(y;\theta')\,Z(\theta)}{q(\theta, \theta')\,\pi(\theta)\,f(y;\theta)\,Z(\theta')}\right)
\]
As well as ERGMs, there are lots of other examples (Ising and Potts models, spatial models, phylogenetic models).
Current Bayesian approaches
Approaches which use some kind of approximation:
pseudo-likelihoods
Exact-approximate MCMC approaches:
auxiliary variable methods such as the Exchange algorithm (requires a perfect sample if implemented correctly)
pseudo-marginal MCMC (requires an unbiased estimate of the likelihood):
\[
\alpha(\theta_n, \theta') = \min\left(1, \frac{\hat{\pi}(\theta')\,q(\theta', \theta_n)}{\hat{\pi}(\theta_n)\,q(\theta_n, \theta')}\right)
\]
The Exchange algorithm (Murray et al. 2006; Møller et al. 2006)
Expand the state space of our target (posterior) distribution to
\[
p(x, \theta, \theta'|y) = \frac{f(y;\theta)}{Z(\theta)\,p(y)}\;\pi(\theta)\;q(\theta, \theta')\;\frac{f(x;\theta')}{Z(\theta')}
\]
Gibbs sample q(θ, θ') and f(x;θ')/Z(θ'), then propose to swap θ ↔ θ' using Metropolis-Hastings:
\[
\alpha(\theta, \theta') = \frac{f(y;\theta')\,f(x;\theta)\,\pi(\theta')\,q(\theta', \theta)}{f(y;\theta)\,f(x;\theta')\,\pi(\theta)\,q(\theta, \theta')}
\]
All the intractable Z terms cancel in this ratio.
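A sketch of one exchange update in Python follows. Here `log_f(z, theta)` (the unnormalised log-likelihood log f(z; θ)) and `perfect_sample(theta)` (an exact draw x ~ f(·; θ)/Z(θ), e.g. by coupling from the past) are hypothetical helpers the user must supply.

```python
import numpy as np

def exchange_step(theta, y, log_f, log_prior, perfect_sample, step=0.1, rng=None):
    """One exchange-algorithm update for theta; Z(theta) is never evaluated."""
    rng = np.random.default_rng() if rng is None else rng
    theta_prop = theta + step * rng.standard_normal(theta.shape)
    x = perfect_sample(theta_prop)  # auxiliary data drawn exactly at theta'
    # All Z(theta) and Z(theta') terms cancel in this acceptance ratio.
    log_alpha = (log_f(y, theta_prop) + log_f(x, theta) + log_prior(theta_prop)
                 - log_f(y, theta) - log_f(x, theta_prop) - log_prior(theta))
    return theta_prop if np.log(rng.uniform()) < log_alpha else theta
```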
Pseudo-marginal MCMC (Andrieu and Roberts, 2009)
Need an unbiased, positive estimate of the likelihood, p̂(y|θ, u), such that
\[
\int \hat{p}(y|\theta, u)\,p_\theta(u)\,du = p(y|\theta)
\]
Define the joint distribution
\[
\pi(\theta, u|y) = \frac{\pi(\theta)\,\hat{p}(y|\theta, u)\,p_\theta(u)}{p(y)}
\]
This integrates to one, and has the posterior as its θ-marginal.
We can sample from this distribution!
\[
\alpha(\theta, \theta') = \min\left(1, \frac{\hat{p}(y|\theta', u')\,\pi(\theta')\,p_{\theta'}(u')}{\hat{p}(y|\theta, u)\,\pi(\theta)\,p_\theta(u)} \times \frac{q(\theta', \theta)\,p_\theta(u)}{q(\theta, \theta')\,p_{\theta'}(u')}\right)
\]
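A minimal pseudo-marginal sketch, under the assumption that `log_lik_hat(theta)` is a hypothetical routine returning the log of a fresh unbiased positive likelihood estimate (drawing new auxiliary variables u internally on each call). The key detail is that the estimate for the current state is stored and reused, never refreshed.

```python
import numpy as np

def pseudo_marginal(log_lik_hat, log_prior, theta0, n_iters=10_000,
                    step=0.5, seed=0):
    """Metropolis-Hastings on the joint (theta, u) space; u stays implicit."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    log_p = log_lik_hat(theta) + log_prior(theta)
    samples = np.empty((n_iters, theta.size))
    for n in range(n_iters):
        theta_prop = theta + step * rng.standard_normal(theta.shape)
        log_p_prop = log_lik_hat(theta_prop) + log_prior(theta_prop)
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = theta_prop, log_p_prop  # keep estimate, do not refresh
        samples[n] = theta
    return samples
```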
What’s the problem?
We need an unbiased estimate of
\[
\pi(\theta|y) = \frac{p(y|\theta)\,\pi(\theta)}{p(y)} = \frac{f(y;\theta)}{Z(\theta)} \cdot \frac{\pi(\theta)}{p(y)}
\]
We can unbiasedly estimate Z(θ), but if we take the reciprocal then the estimate is no longer unbiased:
\[
\mathbb{E}[\hat{Z}(\theta)] = Z(\theta) \quad \text{but} \quad \mathbb{E}\left[\frac{1}{\hat{Z}(\theta)}\right] \neq \frac{1}{Z(\theta)}
\]
Our approach
Construct an unbiased estimate of the likelihood, based on a series expansion and stochastic truncation.
Use pseudo-marginal MCMC to sample from the desired posterior distribution.
Proposed methodology
Construct random variables {V_θ^j, j ≥ 0} such that the series
\[
\hat{\pi}(\theta|y, \{V^j\}) = \sum_{j=0}^{\infty} V_\theta^j
\]
has
\[
\mathbb{E}\left[\hat{\pi}(\theta|y, \{V^j\})\right] = \pi(\theta|y).
\]
This infinite series then needs to be truncated unbiasedly.
This can be achieved via a number of Russian roulette schemes.
Define a random time, τ_θ, and set u := (τ_θ, {V_θ^j, 0 ≤ j ≤ τ_θ}); then
\[
\pi(\theta, u|y) = \sum_{j=0}^{\tau_\theta} V_\theta^j
\]
which satisfies
\[
\mathbb{E}\left[\pi(\theta, u|y) \,\Big|\, \{V_\theta^j, j \geq 0\}\right] = \sum_{j=0}^{\infty} V_\theta^j.
\]
Implementation example
Rewrite the likelihood as an infinite series, inspired by the work of Booth (2007).
Simple manipulation:
\[
\frac{f(y;\theta)}{Z(\theta)} = \frac{f(y;\theta)}{\tilde{Z}(\theta)\left[1 - \left(1 - \frac{Z(\theta)}{\tilde{Z}(\theta)}\right)\right]} = \frac{f(y;\theta)}{\tilde{Z}(\theta)} \sum_{n=0}^{\infty} \kappa(\theta)^n, \qquad \text{where } \kappa(\theta) = 1 - \frac{Z(\theta)}{\tilde{Z}(\theta)}
\]
The series converges for |κ(θ)| < 1.
Implementation example continued
We can unbiasedly estimate each term in the series using n independent estimates of Z(θ):
\[
\frac{f(y;\theta)}{Z(\theta)} = \frac{f(y;\theta)}{\tilde{Z}(\theta)} \sum_{n=0}^{\infty} \left(1 - \frac{Z(\theta)}{\tilde{Z}(\theta)}\right)^n \approx \frac{f(y;\theta)}{\tilde{Z}(\theta)} \sum_{n=0}^{\infty} \prod_{i=1}^{n} \left[1 - \frac{\hat{Z}_i(\theta)}{\tilde{Z}(\theta)}\right]
\]
The estimates Ẑ_i(θ) are computed using importance sampling (IS) or sequential Monte Carlo (SMC), for example.
But we can’t compute an infinite number of terms...
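A minimal sketch of this construction in Python, assuming `z_hat` is a hypothetical routine returning one fresh unbiased estimate of Z(θ) (e.g. by IS or SMC) and `z_tilde` is the fixed constant Z̃(θ). Each new estimate Ẑ_i enters all later terms, matching the product form above; the truncation is deliberately left to the unbiased schemes on the next slides.

```python
import numpy as np

def geometric_series_terms(z_hat, z_tilde, n_terms):
    """Estimated terms prod_{i<=n} [1 - Z_hat_i / Z_tilde], n = 0..n_terms-1."""
    terms = np.empty(n_terms)
    prod = 1.0
    for n in range(n_terms):
        terms[n] = prod                  # term n uses the first n factors
        prod *= 1.0 - z_hat() / z_tilde  # fresh unbiased estimate per factor
    return terms

# Likelihood estimate (before unbiased truncation):
# f_y_theta / z_tilde * geometric_series_terms(z_hat, z_tilde, n_terms).sum()
```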
Unbiased estimates of infinite series
Take an infinite convergent series S = \sum_{k=0}^{\infty} a_k which we would like to estimate unbiasedly.
The simplest scheme: draw an integer k with probability p(K = k), where \sum_{k=0}^{\infty} p(K = k) = 1, then set Ŝ = a_k / p(K = k).
\[
\mathbb{E}[\hat{S}] = \sum_k p(K = k)\,\frac{a_k}{p(K = k)} = S
\]
(This is essentially importance sampling.)
Variance:
\[
\sum_{k=0}^{\infty} \frac{a_k^2}{p(K = k)} - S^2
\]
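A sketch of the single-term estimator, with a geometric distribution over k = 0, 1, 2, ... as an illustrative choice of p(K = k); `a(k)` is a hypothetical function returning the k-th series term.

```python
import numpy as np

def single_term_estimate(a, q=0.5, rng=None):
    """Draw K ~ p, return a_K / p(K): unbiased for S = sum_k a_k."""
    rng = np.random.default_rng() if rng is None else rng
    k = rng.geometric(q) - 1        # P(K = k) = q * (1 - q)**k, k = 0, 1, ...
    return a(k) / (q * (1.0 - q) ** k)
```

Averaging many such draws recovers S, but a poorly matched p(K = k) inflates the variance.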
Russian roulette
Alternative: Russian roulette.
Choose a sequence of probabilities, {q_n}, and draw a sequence of i.i.d. uniform random variables, U_n ~ U[0, 1], n = 1, 2, 3, ...
Define the stopping time τ = inf{k ≥ 1 : U_k ≥ q_k}. Then
\[
S_\tau = \sum_{j=0}^{\tau - 1} \frac{a_j}{\prod_{i=1}^{j} q_i}
\]
is an unbiased estimate of S.
The {q_n} must be chosen to minimise the variance of the estimator.
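A sketch of the roulette truncation in Python; `a(j)` (the j-th series term) and `q(j)` (the continuation probability at round j) are hypothetical callables. A constant q(j) = r gives a geometric stopping time.

```python
import numpy as np

def russian_roulette(a, q, rng=None):
    """S_tau = sum_{j < tau} a_j / prod_{i<=j} q_i, unbiased for sum_j a_j."""
    rng = np.random.default_rng() if rng is None else rng
    total = a(0)                 # the j = 0 term is always included, weight 1
    weight = 1.0
    j = 1
    while rng.uniform() < q(j):  # survive round j with probability q(j)
        weight *= q(j)           # P(tau > j) = q(1) * ... * q(j)
        total += a(j) / weight
        j += 1
    return total
```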
Debiasing estimator, McLeish (2011), Rhee and Glynn (2012)
We want an unbiased estimate of E[Y], but cannot generate Y. We can generate approximations Y_n such that lim_{n→∞} E[Y_n] = E[Y], so that
\[
\mathbb{E}[Y] = \mathbb{E}\left[Y_0 + \sum_{i=1}^{\infty} (Y_i - Y_{i-1})\right]
\]
Define a probability distribution for N, a non-negative, integer-valued random variable; then
\[
Z = Y_0 + \sum_{i=1}^{N} \frac{Y_i - Y_{i-1}}{P(N \geq i)}
\]
is an unbiased estimator of E[Y] and (for suitable choices of N) has finite variance.
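A minimal sketch, assuming `Y(n)` is a hypothetical routine returning the n-th (biased) approximation, and taking N geometric on {0, 1, 2, ...} as an illustrative truncation distribution.

```python
import numpy as np

def debiased_estimate(Y, q=0.5, rng=None):
    """Rhee-Glynn/McLeish estimator: unbiased for lim_n E[Y_n]."""
    rng = np.random.default_rng() if rng is None else rng
    N = rng.geometric(q) - 1                     # P(N = n) = q * (1 - q)**n
    Z = Y(0)
    for i in range(1, N + 1):
        Z += (Y(i) - Y(i - 1)) / (1.0 - q) ** i  # P(N >= i) = (1 - q)**i
    return Z
```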
Negative estimates
Often we cannot guarantee the overall estimate will be positive.
We can use a trick from the physics literature...
Recall we have an unbiased estimate of the likelihood, p̂(y|θ, u):
\begin{align*}
\mathbb{E}_\pi[\phi(\theta)] &= \int \phi(\theta)\,\pi(\theta|y)\,d\theta = \iint \phi(\theta)\,\pi(\theta, u|y)\,d\theta\,du \\
&= \frac{\iint \phi(\theta)\,\hat{p}(y|\theta, u)\,\pi(\theta)\,p_\theta(u)\,d\theta\,du}{\iint \hat{p}(y|\theta, u)\,\pi(\theta)\,p_\theta(u)\,d\theta\,du} \\
&= \frac{\iint \phi(\theta)\,\sigma(\hat{p})\,|\hat{p}(y|\theta, u)|\,\pi(\theta)\,p_\theta(u)\,d\theta\,du}{\iint \sigma(\hat{p})\,|\hat{p}(y|\theta, u)|\,\pi(\theta)\,p_\theta(u)\,d\theta\,du}
\end{align*}
where σ(p̂) denotes the sign of the estimate.
Negative estimates cont.
From the last slide:
\[
\mathbb{E}_\pi[\phi(\theta)] = \frac{\iint \phi(\theta)\,\sigma(\hat{p})\,|\hat{p}(y|\theta, u)|\,\pi(\theta)\,p_\theta(u)\,d\theta\,du}{\iint \sigma(\hat{p})\,|\hat{p}(y|\theta, u)|\,\pi(\theta)\,p_\theta(u)\,d\theta\,du}
= \frac{\iint \phi(\theta)\,\sigma(\hat{p})\,q(\theta, u|y)\,d\theta\,du}{\iint \sigma(\hat{p})\,q(\theta, u|y)\,d\theta\,du}
\]
We can get a Monte Carlo estimate of the expectation of φ(θ) with respect to the posterior using samples from the ‘absolute’ distribution:
\[
\mathbb{E}_\pi[\phi(\theta)] \approx \frac{\sum_k \phi(\theta_k)\,\sigma(\hat{p}_k)}{\sum_k \sigma(\hat{p}_k)}
\]
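The sign correction is a one-liner once the chain has been run. Here `samples` and `signs` are assumed outputs of a pseudo-marginal run on the absolute distribution: the θ_k and the corresponding signs σ(p̂_k).

```python
import numpy as np

def signed_expectation(phi, samples, signs):
    """E_pi[phi] estimated as sum_k phi(theta_k) sigma_k / sum_k sigma_k."""
    vals = np.array([phi(t) for t in samples])
    return np.sum(vals * np.asarray(signs)) / np.sum(signs)
```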
Summary
So, we can get an unbiased estimate of a doubly-intractable distribution:
Draw a random integer, k ~ p(k)
Compute the k-th term in the infinite series
Compute the overall unbiased estimate of the likelihood
Plug the absolute value into a pseudo-marginal MCMC scheme to sample from the posterior (or close to it...).
Compute expectations with respect to the posterior using the importance sampling identity.
However, the methodology is computationally costly, as it needs many low-variance estimates of the partition function.
But we can compute the estimates in parallel...
Results: Ising models
We simulated a 40×40 grid of data points from a Gibbs sampler with Jβ = 0.2.
[Results figures: the geometric construction with Russian roulette sampling, compared with AIS.]
Parallel implementation using Matlab.
Example: Florentine business network
ERGM model, 16 nodes
Graph statistics in the model exponent are the number of edges, the number of 2- and 3-stars, and the number of triangles
Estimates of the normalising term were computed using SMC
Series truncation was carried out using Russian roulette
[Results figures for the Florentine business network.]
Further work
Optimise various parts of the methodology (stopping probabilities, etc.)
Compare with approximate approaches in terms of variance and computation
Thank you for listening!
References
MCMC for doubly-intractable distributions. Murray, Ghahramani, MacKay (2006).
An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Møller, Pettitt, Berthelsen, Reeves (2006).
Unbiased Estimation with Square Root Convergence for SDE Models. Rhee, Glynn (2012).
A general method for debiasing a Monte Carlo estimator. McLeish (2011).
Unbiased Monte Carlo Estimation of the Reciprocal of an Integral. Booth (2007).