Approximate Inference
Magnus Rattray
Machine Learning and Optimization Group, University of Manchester
June 15th, 2010
Talk outline
Full Bayesian inference example
MAP-Laplace approximation
MCMC: Metropolis-Hastings
MCMC: Gibbs sampling
Variational Inference
Model selection
Full Bayesian parameter inference: normal-Gamma example
Model the data as xᵢ ∼ N(µ, σ²) with parameters θ = {µ, σ²}
Gamma distribution prior on the inverse variance: 1/σ² ∼ Gamma(ν₀, σ₀²)
Gaussian distribution prior over the mean: µ ∼ N(µ₀, σ²/λ₀)
The posterior distribution p(θ|D) has the same functional form as the prior
– in this case the prior is called a conjugate prior

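As a concrete illustration of the conjugate update, here is a minimal sketch (Python/NumPy, my own illustration rather than the slides' code). It uses the standard shape/rate parameterisation Gamma(a₀, b₀) for the precision τ = 1/σ², which may differ from the (ν₀, σ₀²) convention used above; the function name and hyperparameter defaults are my own choices.

```python
import numpy as np

def normal_gamma_posterior(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0):
    """Conjugate update for x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0*tau)),
    tau = 1/sigma^2 ~ Gamma(a0, b0) (shape/rate). Returns posterior hyperparameters."""
    n, xbar = len(x), np.mean(x)
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * np.sum((x - xbar) ** 2) + 0.5 * lam0 * n * (xbar - mu0) ** 2 / lam_n
    return mu_n, lam_n, a_n, b_n

# Example: posterior hyperparameters for some simulated data
x = np.random.default_rng(0).normal(5.0, 2.0, size=50)
print(normal_gamma_posterior(x))
```

The posterior is again normal-Gamma with these hyperparameters, which is the conjugacy the slide refers to.
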
Full Bayesian parameter inference
[Figure: the prior p(µ,σ²), the likelihood p(D|µ,σ²), the joint posterior p(µ,σ²|D) and the marginal posterior p(σ²|D), plotted over µ and σ².]

Full Bayesian parameter inference
Posterior p(θ|D) provides more than just point estimates
[Figure: the joint posterior p(µ,σ²|D) and the marginal posterior p(σ²|D).]
Captures uncertainty in parameter estimates and covariance between different parameters
Can usefully propagate uncertainty through the next stage of analysis

Full Bayesian parameter inference
In complex models the posterior distribution can be multi-modal
Girolami et al. in “Learning and Inference in Computational Systems Biology” (MIT Press 2010)
Point estimates are not enough to summarize our belief
Full Bayesian parameter inference
Rather than using point estimates, one can try to evaluate p(θ|D)
Intractable for non-conjugate or very high-dimensional models
Approximate inference schemes include:
Laplace approximation: Gaussian approximation to the posterior centred at the MAP solution
Markov chain Monte Carlo (MCMC): a class of iterative sampling-based algorithms
Variational inference: fitting a simplified approximate function to the posterior
Message passing (loopy BP, EP etc.): belief propagation on intractable graphs

Laplace approximation
$$p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)} \propto \exp\big(\ln(p(D|\theta)p(\theta))\big)$$
$$\simeq \exp\Big(\ln\big(p(D|\theta_{\mathrm{MAP}})p(\theta_{\mathrm{MAP}})\big) - \tfrac{1}{2}(\theta-\theta_{\mathrm{MAP}})^{\mathrm T} A\,(\theta-\theta_{\mathrm{MAP}})\Big)$$
$$= \mathcal{N}\big(\theta\,\big|\,\theta_{\mathrm{MAP}}, A^{-1}\big) \quad \text{where} \quad A_{ij} = -\left.\frac{\partial^2 \ln(p(D|\theta)p(\theta))}{\partial\theta_i\,\partial\theta_j}\right|_{\theta_{\mathrm{MAP}}}$$
[Figure: the exact marginal posterior p(σ²|D) compared with the MAP-Laplace approximation.]

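A minimal sketch of how a MAP-Laplace approximation might be computed numerically (Python, my own illustration rather than the slides' code): find θ_MAP by optimisation and build A from a finite-difference Hessian of the negative log joint. The parameterisation θ = (µ, log σ²) and the broad Gaussian priors are illustrative assumptions, not the slides' normal-Gamma prior.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_joint(theta, x):
    """-ln[p(D|theta) p(theta)] for x_i ~ N(mu, sigma^2), with theta = (mu, log sigma^2).
    Illustrative priors (not the slides' normal-Gamma prior): mu ~ N(0, 10^2),
    log sigma^2 ~ N(0, 2^2). Normalising constants are kept so the same function
    can be reused for a Laplace evidence approximation later."""
    mu, log_var = theta
    var = np.exp(log_var)
    nll = 0.5 * np.sum((x - mu) ** 2) / var + 0.5 * len(x) * np.log(2 * np.pi * var)
    nlp = (0.5 * (mu / 10.0) ** 2 + 0.5 * np.log(2 * np.pi * 100.0)
           + 0.5 * (log_var / 2.0) ** 2 + 0.5 * np.log(2 * np.pi * 4.0))
    return nll + nlp

def numerical_hessian(f, theta, eps=1e-4):
    """Central finite-difference Hessian of f at theta."""
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * eps ** 2)
    return H

x = np.random.default_rng(1).normal(5.0, 2.0, size=50)
res = minimize(neg_log_joint, x0=np.array([0.0, 0.0]), args=(x,))
theta_map = res.x
A = numerical_hessian(lambda t: neg_log_joint(t, x), theta_map)
cov = np.linalg.inv(A)   # Laplace: p(theta|D) is approximated by N(theta_MAP, A^{-1})
print(theta_map, cov)
```
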
Markov chain Monte Carlo: Metropolis-Hastings
A method to draw samples from the posterior θ^(s) ∼ p(θ|D).
We design a 1st-order Markov process
$$\theta^{(s+1)} \sim t\big(\theta^{(s+1)}\,\big|\,\theta^{(s)}\big)$$
so that θ^(s) ∼ p(θ|D) as s → ∞.
A popular approach is Metropolis-Hastings:
Metropolis-Hastings Algorithm
1. Select a proposal θ* from q(θ*|θ^(s))
2. Accept with probability
$$\alpha\big(\theta^*\,\big|\,\theta^{(s)}\big) = \min\left(1,\ \frac{p(\theta^*|D)\,q\big(\theta^{(s)}|\theta^*\big)}{p\big(\theta^{(s)}|D\big)\,q\big(\theta^*|\theta^{(s)}\big)}\right)$$
3. If accepted, θ^(s+1) = θ*, else θ^(s+1) = θ^(s). Repeat until convergence.

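A minimal random-walk Metropolis-Hastings sketch in Python (my own illustration; the names metropolis_hastings, log_post and step are assumptions, not from the slides). With a symmetric Gaussian proposal the q terms cancel in the acceptance ratio, and working with log densities avoids underflow.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples=5000, step=0.5, seed=0):
    """Random-walk MH with a spherical Gaussian proposal q(theta*|theta) = N(theta, step^2 I).
    log_post returns ln p(theta|D) up to an additive constant."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples, lp = [], log_post(theta)
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(proposal)
        # Symmetric proposal, so alpha = min(1, p(theta*|D) / p(theta^(s)|D))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)

# Example: sample a correlated 2D Gaussian "posterior"
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
prec = np.linalg.inv(cov)
log_post = lambda th: -0.5 * th @ prec @ th
samples = metropolis_hastings(log_post, theta0=[3.0, -3.0])
print(samples[1000:].mean(axis=0), np.cov(samples[1000:].T))
```
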
Metropolis-Hastings: sampling from a Gaussian posterior
[Figure: Metropolis-Hastings trajectory over (θ₁, θ₂) for a 2D Gaussian posterior.]
Spherical Gaussian proposal distribution
Solid lines are accepted moves, faint lines are rejected

Metropolis-Hastings: normal-Gamma example
xᵢ ∼ N(µ, σ²),  µ ∼ N(µ₀, σ²/λ₀),  1/σ² ∼ Gamma(ν₀, σ₀²)
[Figures: Metropolis-Hastings sampling of the joint posterior p(µ, σ²|D) and of the marginal posterior p(σ²|D).]

Markov chain Monte Carlo: Gibbs Sampling
Gibbs sampling is an alternative MCMC scheme:
Gibbs sampling
Partition the parameter set into sub-sets: θ = {θ₁, θ₂, …, θ_J}
Repeatedly sample each sub-set conditioned on the rest:
$$\theta_i^{(s+1)} \sim p\big(\theta_i \,\big|\, \theta_1^{(s)}, \theta_2^{(s)}, \ldots, \theta_{i-1}^{(s)}, \theta_{i+1}^{(s)}, \ldots, \theta_J^{(s)}, D\big)$$
Less general since the conditional distributions must be tractable
Special case of Metropolis-Hastings where every move is accepted
Good point: there is no proposal density to choose
Bad point: convergence is slow for highly correlated variables

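A minimal Gibbs sampler sketch for the normal-Gamma example (Python, my own illustration; it uses a shape/rate Gamma(a₀, b₀) prior on the precision τ = 1/σ², which may not match the slides' (ν₀, σ₀²) convention). Both full conditionals, µ | σ², D and 1/σ² | µ, D, are tractable here.

```python
import numpy as np

def gibbs_normal_gamma(x, n_samples=5000, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, seed=0):
    """Gibbs sampling for x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0*tau)),
    tau ~ Gamma(a0, b0) (shape/rate). Returns samples of (mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    n, xbar = len(x), np.mean(x)
    mu, tau = xbar, 1.0 / np.var(x)            # initialise at data-based values
    samples = np.empty((n_samples, 2))
    for s in range(n_samples):
        # mu | tau, D is Gaussian
        lam_n = lam0 + n
        mu_n = (lam0 * mu0 + n * xbar) / lam_n
        mu = rng.normal(mu_n, 1.0 / np.sqrt(lam_n * tau))
        # tau | mu, D is Gamma (shape/rate); numpy's gamma takes scale = 1/rate
        a_n = a0 + 0.5 * (n + 1)
        b_n = b0 + 0.5 * (np.sum((x - mu) ** 2) + lam0 * (mu - mu0) ** 2)
        tau = rng.gamma(a_n, 1.0 / b_n)
        samples[s] = mu, 1.0 / tau
    return samples

x = np.random.default_rng(2).normal(5.0, 2.0, size=50)
samples = gibbs_normal_gamma(x)
print(samples[1000:].mean(axis=0))   # posterior means of (mu, sigma^2)
```
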
Gibbs: normal-Gamma example
[Figure: Gibbs sampling of the joint posterior p(µ, σ²|D).]

Gibbs: sampling from a Gaussian
[Figure: Gibbs sampling trajectories over (θ₁, θ₂) for two 2D Gaussian posteriors.]
On the right we see how Gibbs moves more slowly around a highly correlated posterior

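To see that slow mixing directly, here is a small sketch (my own, not from the slides) of Gibbs sampling for a bivariate Gaussian, where each full conditional is a 1D Gaussian; increasing the correlation rho makes successive samples take ever smaller steps, which shows up as high lag-1 autocorrelation.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_samples=2000, seed=0):
    """Gibbs sampling for (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]]).
    Each conditional is theta_i | theta_j ~ N(rho * theta_j, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    th1, th2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for s in range(n_samples):
        th1 = rng.normal(rho * th2, sd)
        th2 = rng.normal(rho * th1, sd)
        samples[s] = th1, th2
    return samples

# Compare lag-1 autocorrelation of theta1 for a weakly vs highly correlated target
for rho in (0.1, 0.95):
    th1 = gibbs_bivariate_gaussian(rho)[:, 0]
    print(rho, np.corrcoef(th1[:-1], th1[1:])[0, 1])
```
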
Multi-modal landscapes
[Figure: a multi-modal posterior landscape.]
Metropolis-Hastings can become trapped at local optima
Advanced MCMC samplers can perform better, e.g. population MCMC, thermodynamic annealing etc.

Variational Inference
Like Gibbs we partition θ = {θ₁, θ₂, …, θ_J}
Approximate p(θ|D) by Q(θ) = q(θ₁)q(θ₂) · · · q(θ_J).
Iteratively update Q(θ) by optimising an EM-like bound:
$$\log P(D) = \log \int P(D,\theta)\, d\theta = \log \int Q(\theta)\, \frac{P(D,\theta)}{Q(\theta)}\, d\theta$$
$$\geq \int Q(\theta)\, \log \frac{P(D,\theta)}{Q(\theta)}\, d\theta$$
The updates are:
$$q(\theta_k) \propto \exp\big\langle \log p(D|\theta)p(\theta) \big\rangle_{\prod_{j\neq k} q(\theta_j)}$$

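A sketch of these mean-field updates for the univariate normal example (Python, my own derivation following the standard treatment of this model, e.g. Bishop Ch. 10, with a shape/rate Gamma prior on the precision; function and variable names are my own). Here q(µ) is Gaussian, q(τ) is Gamma, and each update plugs in expectations under the other factor.

```python
import numpy as np

def vb_normal_gamma(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iters=50):
    """Mean-field VB, Q(mu, tau) = q(mu) q(tau), for x_i ~ N(mu, 1/tau),
    mu | tau ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0) (shape/rate)."""
    n, xbar = len(x), np.mean(x)
    E_tau = 1.0 / np.var(x)                 # initial guess for <tau>
    a_n = a0 + 0.5 * (n + 1)                # fixed by the model structure
    for _ in range(n_iters):
        # q(mu) = N(mu_n, 1/lam_n)
        lam_n = (lam0 + n) * E_tau
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        E_mu, E_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n
        # q(tau) = Gamma(a_n, b_n); squared terms use E[mu^2] under q(mu)
        b_n = b0 + 0.5 * (np.sum(x ** 2) - 2 * E_mu * np.sum(x) + n * E_mu2
                          + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
        E_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

x = np.random.default_rng(3).normal(5.0, 2.0, size=50)
print(vb_normal_gamma(x))
```

In this conjugate example the factorised approximation is close to the exact posterior; the interest of variational inference is in models where the exact posterior is unavailable.
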
Model selection
Sometimes we are interested in comparing models
Let M be a model and D be a data set
Bayes' theorem gives us
$$p(M|D) = \frac{p(D|M)\,p(M)}{\sum_{M'} p(D|M')\,p(M')}$$
The first term in the numerator is the denominator from parameter inference. It is called the evidence or marginal likelihood
$$p(D|M) = \int p(D|\theta_M)\,p(\theta_M)\, d\theta_M$$
Assuming all models are equally good a priori, we should select the one with the largest evidence

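One simple, if high-variance, way to approximate this integral is Monte Carlo averaging of the likelihood over samples drawn from the prior. The sketch below (my own illustration, again using a shape/rate Gamma prior on the precision) does this for the normal model, with log-sum-exp for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence_prior_mc(x, n_samples=100000, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, seed=0):
    """Estimate ln p(D) = ln E_prior[p(D|theta)] for x_i ~ N(mu, 1/tau) with a
    normal-Gamma prior (shape/rate convention), using samples from the prior."""
    rng = np.random.default_rng(seed)
    tau = rng.gamma(a0, 1.0 / b0, size=n_samples)            # precision samples
    mu = rng.normal(mu0, 1.0 / np.sqrt(lam0 * tau))          # mean samples given tau
    n = len(x)
    # log-likelihood of the whole data set for each prior sample
    ll = (0.5 * n * np.log(tau / (2 * np.pi))
          - 0.5 * tau * ((x[None, :] - mu[:, None]) ** 2).sum(axis=1))
    return logsumexp(ll) - np.log(n_samples)

x = np.random.default_rng(4).normal(5.0, 2.0, size=20)
print(log_evidence_prior_mc(x))
```
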
Model selection
[Figure: the evidence p(D) plotted against possible data sets D for three models M1, M2, M3, with a particular data set D0 marked; M1 is too specific and M3 too general.]
C.M. Bishop "Pattern Recognition and Machine Learning" (Springer 2006)

Laplace approximation of the evidence
The evidence cannot be calculated exactly in many cases
The Laplace approximation can be used again
$$p(D) = \int p(D|\theta)\,p(\theta)\, d\theta \simeq \int \exp\Big(\ln\big(p(D|\theta_{\mathrm{MAP}})p(\theta_{\mathrm{MAP}})\big) - \tfrac{1}{2}(\theta-\theta_{\mathrm{MAP}})^{\mathrm T} A\,(\theta-\theta_{\mathrm{MAP}})\Big)\, d\theta$$
$$= \exp\Big(\ln\big(p(D|\theta_{\mathrm{MAP}})p(\theta_{\mathrm{MAP}})\big)\Big)\, \sqrt{\frac{(2\pi)^p}{\det(A)}}$$
So, working in logs,
$$\ln p(D) \simeq \ln p(D|\theta_{\mathrm{MAP}}) + \ln p(\theta_{\mathrm{MAP}}) + \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln \det(A)$$

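Continuing the earlier Laplace sketch, this hedged fragment (my own illustration) turns θ_MAP and the Hessian A into the approximate log evidence; numpy.linalg.slogdet gives a numerically stable log-determinant. It assumes the log joint includes all normalising constants, as in the earlier neg_log_joint.

```python
import numpy as np

def laplace_log_evidence(log_joint_at_map, A):
    """ln p(D) ~= ln[p(D|theta_MAP) p(theta_MAP)] + (p/2) ln(2*pi) - 0.5 * ln det(A),
    where A is the Hessian of the negative log joint at theta_MAP."""
    p = A.shape[0]
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0, "Hessian at the MAP should be positive definite"
    return log_joint_at_map + 0.5 * p * np.log(2 * np.pi) - 0.5 * logdet

# Usage with the earlier Laplace sketch (res and A as computed there):
# print(laplace_log_evidence(-res.fun, A))
```
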
Model selection – BIC and AIC
The Laplace approximation to the evidence is
$$\ln p(D) \simeq \ln p(D|\theta_{\mathrm{MAP}}) + \ln p(\theta_{\mathrm{MAP}}) + \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln \det(A)$$
Assuming det(A) ∝ nᵖ and large n gives the Bayesian Information Criterion (BIC)
$$-2\ln p(D) \simeq \mathrm{BIC} = -2\ln p(D|\theta_{\mathrm{ML}}) + p \ln n$$
where p is the number of free parameters and n the number of data points
A similar criterion is the Akaike Information Criterion (AIC)
$$\mathrm{AIC} = -2\ln p(D|\theta_{\mathrm{ML}}) + 2p$$
Both criteria penalize models for increases in complexity, to reduce the problem of maximum likelihood over-fitting

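A small sketch (my own) computing BIC and AIC for the Gaussian model fit by maximum likelihood; it follows the definitions above with p = 2 free parameters (µ and σ²).

```python
import numpy as np

def gaussian_bic_aic(x):
    """BIC and AIC for x_i ~ N(mu, sigma^2) fit by maximum likelihood (p = 2 parameters)."""
    n = len(x)
    mu_ml, var_ml = np.mean(x), np.var(x)            # ML estimates
    log_lik = -0.5 * n * (np.log(2 * np.pi * var_ml) + 1.0)
    p = 2
    bic = -2.0 * log_lik + p * np.log(n)
    aic = -2.0 * log_lik + 2.0 * p
    return bic, aic

x = np.random.default_rng(5).normal(5.0, 2.0, size=100)
print(gaussian_bic_aic(x))
```
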
Model selection
AIC/BIC can over-penalize complex models as they do not account for parameter correlations – they cannot be used with regularized or non-parametric models
Laplace's method does account for parameter correlations, but is still a pretty rough approximation
More advanced marginal likelihood approximations exist, mostly based on MCMC but also some promising message passing approaches
For a recent review of MCMC for evidence estimation see Iain Murray's thesis; for Systems Biology applications see work from Mark Girolami's group