Approximate Inference
Magnus Rattray
Machine Learning and Optimization Group, University of Manchester
June 15th, 2010

Talk outline
Full Bayesian inference example
MAP-Laplace approximation
MCMC: Metropolis-Hastings
MCMC: Gibbs sampling
Variational inference
Model selection

Full Bayesian parameter inference: normal-Gamma example
Model the data as xi ∼ N(µ, σ²) with parameters θ = {µ, σ²}
Gamma distribution prior on the inverse variance: 1/σ² ∼ Gamma(ν0, σ0²)
Gaussian distribution prior over the mean: µ ∼ N(µ0, σ²/λ0)
The posterior distribution p(θ|D) has the same functional form as the prior – in this case the prior is called a conjugate prior

[Figure: prior p(µ, σ²), likelihood p(D|µ, σ²), joint posterior p(µ, σ²|D) and marginal posterior p(σ²|D) for the normal-Gamma example]
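As a concrete illustration of the conjugate update above, the sketch below computes the normal-Gamma posterior hyperparameters in closed form. The slides do not spell out the Gamma(ν0, σ0²) parameterization, so a shape/rate parameterization and the hyperparameter names mu0, lam0, a0, b0 are assumptions here.

```python
import numpy as np

def normal_gamma_posterior(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0):
    """Conjugate update for x_i ~ N(mu, sigma^2) with a normal-Gamma prior:
    1/sigma^2 ~ Gamma(shape=a0, rate=b0) and mu | sigma^2 ~ N(mu0, sigma^2 / lam0).
    Returns the posterior hyperparameters (mu_n, lam_n, a_n, b_n)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    a_n = a0 + n / 2.0
    b_n = (b0 + 0.5 * np.sum((x - xbar) ** 2)
           + 0.5 * lam0 * n * (xbar - mu0) ** 2 / lam_n)
    return mu_n, lam_n, a_n, b_n

# Example: the posterior mean of sigma^2 is b_n / (a_n - 1) for a_n > 1.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=1.5, size=50)
mu_n, lam_n, a_n, b_n = normal_gamma_posterior(data)
print(mu_n, b_n / (a_n - 1))
```

The returned (mu_n, lam_n, a_n, b_n) define a posterior in exactly the same family as the prior, which is what makes the prior conjugate.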
Full Bayesian parameter inference
The posterior p(θ|D) provides more than just point estimates
It captures uncertainty in parameter estimates and covariance between different parameters
This uncertainty can usefully be propagated through the next stage of analysis
In complex models the posterior distribution can be multi-modal (Girolami et al. in “Learning and Inference in Computational Systems Biology”, MIT Press 2010)
Point estimates are not enough to summarize our belief

Full Bayesian parameter inference
Rather than using point estimates, one can try to evaluate p(θ|D)
This is intractable for non-conjugate or very high-dimensional models
Approximate inference schemes include:
  Laplace approximation: Gaussian approximation to the posterior centred at the MAP solution
  Markov chain Monte Carlo (MCMC): a class of iterative sampling-based algorithms
  Variational inference: fitting a simplified approximating distribution to the posterior
  Message passing (loopy BP, EP etc.): belief propagation on intractable graphs

Laplace approximation
   p(θ|D) = p(D|θ)p(θ) / p(D) ∝ exp( ln(p(D|θ)p(θ)) )
          ≈ exp( ln(p(D|θMAP)p(θMAP)) − ½ (θ − θMAP)ᵀ A (θ − θMAP) )
          = N(θ | θMAP, A⁻¹),  where  Aij = − ∂² ln(p(D|θ)p(θ)) / ∂θi ∂θj evaluated at θMAP

[Figure: exact posterior p(σ²|D) compared with the MAP-Laplace approximation]
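The Laplace approximation is easy to automate: locate θMAP numerically and take A to be the negative Hessian of the log joint there. The sketch below is a minimal illustration, not the talk's implementation; the function names and the toy model (a Gaussian likelihood with a flat improper prior over θ = (µ, ln σ²)) are assumptions, and the Hessian is formed by central finite differences.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(log_joint, theta_init, eps=1e-4):
    """Gaussian approximation N(theta_MAP, A^-1) to a posterior whose
    unnormalized log density is log_joint(theta) = ln p(D|theta) + ln p(theta)."""
    theta_map = minimize(lambda th: -log_joint(th), theta_init,
                         method="Nelder-Mead").x
    p = theta_map.size
    A = np.zeros((p, p))
    # A_ij = -d^2 log_joint / dtheta_i dtheta_j at theta_MAP (central differences)
    for i in range(p):
        for j in range(p):
            ei, ej = np.zeros(p), np.zeros(p)
            ei[i], ej[j] = eps, eps
            A[i, j] = -(log_joint(theta_map + ei + ej)
                        - log_joint(theta_map + ei - ej)
                        - log_joint(theta_map - ei + ej)
                        + log_joint(theta_map - ei - ej)) / (4 * eps ** 2)
    return theta_map, np.linalg.inv(A)   # mean and covariance of the Gaussian

# Toy example: x_i ~ N(mu, sigma^2) with a flat improper prior, theta = (mu, log sigma^2).
rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=30)

def log_joint(theta):
    mu, log_var = theta
    var = np.exp(log_var)
    return -0.5 * np.sum((x - mu) ** 2) / var - 0.5 * x.size * np.log(2 * np.pi * var)

theta_map, cov = laplace_approximation(log_joint, np.zeros(2))
print("theta_MAP:", theta_map, "covariance A^-1:", cov)
```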
Markov chain Monte Carlo: Metropolis-Hastings
A method to draw samples from the posterior, θ(s) ∼ p(θ|D)
We design a 1st order Markov process θ(s+1) ∼ t(θ(s+1)|θ(s)) so that θ(s) ∼ p(θ|D) as s → ∞
Popular approach is Metropolis-Hastings:

Metropolis-Hastings Algorithm
1. Select a proposal θ* from q(θ*|θ(s))
2. Accept with probability
   α(θ*|θ(s)) = min{ 1, [ p(θ*|D) q(θ(s)|θ*) ] / [ p(θ(s)|D) q(θ*|θ(s)) ] }
3. If accepted, θ(s+1) = θ*, else θ(s+1) = θ(s). Repeat until convergence.

Metropolis-Hastings: sampling from a Gaussian posterior
[Figure: Metropolis-Hastings samples from a 2D Gaussian posterior over (θ1, θ2) with a spherical Gaussian proposal distribution; solid lines are accepted moves, faint lines are rejected]

Metropolis-Hastings: normal-Gamma example
xi ∼ N(µ, σ²), µ ∼ N(µ0, σ²/λ0), 1/σ² ∼ Gamma(ν0, σ0²)
[Figure: Metropolis-Hastings samples from the joint posterior p(µ, σ²|D) and the resulting estimate of the marginal p(σ²|D)]
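A random-walk Metropolis-Hastings sampler following the three steps above can be written in a few lines. This is a sketch under assumed names; the target is a correlated 2D Gaussian like the one in the figure, and because the spherical Gaussian proposal is symmetric the q terms cancel in the acceptance ratio, so only an unnormalized log posterior is needed.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a spherical Gaussian proposal.
    log_post(theta) is the unnormalized log posterior; because the proposal is
    symmetric, the q terms cancel in the acceptance ratio."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    logp = log_post(theta)
    samples, accepted = [], 0
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)   # q(theta*|theta)
        logp_prop = log_post(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept with prob min(1, ratio)
            theta, logp = proposal, logp_prop
            accepted += 1
        samples.append(theta.copy())
    return np.array(samples), accepted / n_samples

# Toy target: a correlated bivariate Gaussian posterior, as in the figure.
mean = np.array([5.0, 5.0])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
log_post = lambda th: -0.5 * (th - mean) @ Sigma_inv @ (th - mean)

samples, acc_rate = metropolis_hastings(log_post, theta0=np.zeros(2))
print(samples[1000:].mean(axis=0), acc_rate)   # discard burn-in before summarizing
```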
Markov chain Monte Carlo: Gibbs sampling
Gibbs sampling is an alternative MCMC scheme:

Gibbs sampling
Partition the parameter set into sub-sets: θ = {θ1, θ2, ..., θJ}
Repeatedly sample each sub-set conditioned on the rest:
   θi(s+1) ∼ p(θi | θ1(s), θ2(s), ..., θi−1(s), θi+1(s), ..., θJ(s), D)

Less general than Metropolis-Hastings, since the conditional distributions must be tractable
A special case of Metropolis-Hastings in which every move is accepted
Good point: there is no proposal density to choose
Bad point: convergence is slow for highly correlated variables

Gibbs: normal-Gamma example
[Figure: Gibbs samples from the joint posterior p(µ, σ²|D)]

Gibbs: sampling from a Gaussian
[Figure: Gibbs sampling trajectories for a spherical (left) and a highly correlated (right) 2D Gaussian posterior over (θ1, θ2)]
On the right we see how Gibbs moves more slowly around a highly correlated posterior
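To make the correlated-Gaussian example concrete, here is a sketch of a two-block Gibbs sampler that draws each coordinate from its exact Gaussian conditional. The function name and the 0.95 correlation are illustrative assumptions; with strong correlation each conditional is narrow, so the chain takes small steps along the ridge, which is the slow mixing noted above.

```python
import numpy as np

def gibbs_bivariate_gaussian(mean, cov, n_samples=5000, seed=0):
    """Gibbs sampler for a bivariate Gaussian "posterior": each coordinate is
    drawn from its exact Gaussian conditional given the current other coordinate."""
    rng = np.random.default_rng(seed)
    m1, m2 = mean
    s11, s12, s22 = cov[0, 0], cov[0, 1], cov[1, 1]
    theta = np.zeros(2)
    samples = np.empty((n_samples, 2))
    for s in range(n_samples):
        # theta_1 | theta_2 ~ N(m1 + s12/s22 (theta_2 - m2), s11 - s12^2/s22)
        theta[0] = rng.normal(m1 + s12 / s22 * (theta[1] - m2),
                              np.sqrt(s11 - s12 ** 2 / s22))
        # theta_2 | theta_1 ~ N(m2 + s12/s11 (theta_1 - m1), s22 - s12^2/s11)
        theta[1] = rng.normal(m2 + s12 / s11 * (theta[0] - m1),
                              np.sqrt(s22 - s12 ** 2 / s11))
        samples[s] = theta
    return samples

# A correlation of 0.95 gives narrow conditionals, so the chain moves slowly
# along the ridge of the posterior, as in the right-hand panel of the figure.
samples = gibbs_bivariate_gaussian(np.array([5.0, 5.0]),
                                   np.array([[1.0, 0.95], [0.95, 1.0]]))
print(samples[500:].mean(axis=0), np.corrcoef(samples[500:].T)[0, 1])
```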
Multi-modal landscapes
[Figure: Metropolis-Hastings samples on a multi-modal posterior landscape]
Metropolis-Hastings can become trapped at local optima
Advanced MCMC samplers can perform better, e.g. population MCMC, thermodynamic annealing etc.

Variational inference
Like Gibbs sampling, we partition θ = {θ1, θ2, ..., θJ}
Approximate p(θ|D) by a factorized distribution Q(θ) = q(θ1) q(θ2) ... q(θJ)
Iteratively update Q(θ) by optimising an EM-like lower bound (Jensen's inequality):
   log P(D) = log ∫ P(D, θ) dθ = log ∫ Q(θ) [P(D, θ) / Q(θ)] dθ ≥ ∫ Q(θ) log [P(D, θ) / Q(θ)] dθ
The updates are:
   q(θk) ∝ exp ⟨log p(D|θ)p(θ)⟩, where the expectation is taken under ∏j≠k q(θj)
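As a sketch of these coordinate-ascent updates, the code below applies a factorization q(µ)q(τ) to the talk's normal-Gamma model, with τ = 1/σ². The closed-form updates follow the standard treatment of this model (e.g. Bishop, chapter 10); the shape/rate Gamma parameterization and the hyperparameter names are assumptions.

```python
import numpy as np

def mean_field_vb(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field variational inference Q(mu, tau) = q(mu) q(tau) for the model
    x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0 tau)), tau ~ Gamma(a0, b0).
    Returns q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    # These two parameters are fixed by the standard updates.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    E_tau = a0 / b0          # initial guess for <tau> under q(tau)
    for _ in range(n_iter):
        lam_n = (lam0 + n) * E_tau                    # q(mu) update
        E_mu, E_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n   # moments of q(mu)
        b_n = b0 + 0.5 * (np.sum(x ** 2) - 2 * np.sum(x) * E_mu + n * E_mu2
                          + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))
        E_tau = a_n / b_n                             # q(tau) update
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=100)
mu_n, lam_n, a_n, b_n = mean_field_vb(x)
print(mu_n, b_n / a_n)   # <mu> under q(mu), and 1/<tau> as a point estimate of sigma^2
```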
Model selection
Sometimes we are interested in comparing models: let M be a model and D be a data set
Bayes' theorem gives us
   p(M|D) = p(D|M) p(M) / ΣM′ p(D|M′) p(M′)
The first term in the numerator is the denominator from parameter inference. It is called the evidence or marginal likelihood:
   p(D|M) = ∫ p(D|θM) p(θM) dθM
Assuming all models are equally probable a priori, we should select the one with the largest evidence

Model selection
[Figure: evidence p(D) as a function of data set D for three models M1, M2, M3; at the observed data set D0, M1 is too specific and M3 is too general. From C.M. Bishop, “Pattern Recognition and Machine Learning” (Springer 2006)]

Laplace approximation of the evidence
The evidence cannot be calculated exactly in many cases
The Laplace approximation can be used again:
   p(D) = ∫ p(D|θ)p(θ) dθ
        ≈ ∫ exp( ln(p(D|θMAP)p(θMAP)) − ½ (θ − θMAP)ᵀ A (θ − θMAP) ) dθ
        = exp( ln(p(D|θMAP)p(θMAP)) ) √( (2π)^p / det(A) )
So, working in logs,
   ln p(D) ≈ ln p(D|θMAP) + ln p(θMAP) + (p/2) ln(2π) − ½ ln det(A)

Model selection – BIC and AIC
The Laplace approximation to the evidence is
   ln p(D) ≈ ln p(D|θMAP) + ln p(θMAP) + (p/2) ln(2π) − ½ ln det(A)
Assuming det(A) ∝ n^p and large n gives the Bayesian Information Criterion (BIC)
   −2 ln p(D) ≈ BIC = −2 ln p(D|θML) + p ln n
where p is the number of free parameters and n is the number of data points
A similar criterion is the Akaike Information Criterion (AIC)
   AIC = −2 ln p(D|θML) + 2p
Both criteria penalize models for increases in complexity, to reduce the problem of maximum-likelihood over-fitting
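As a small worked example of these criteria, the sketch below fits a Gaussian to data by maximum likelihood and evaluates BIC and AIC with p = 2 free parameters; the function name is an assumption.

```python
import numpy as np

def gaussian_ml_criteria(x):
    """Maximum-likelihood fit of N(mu, sigma^2) to data x and the resulting
    BIC = -2 ln p(D|theta_ML) + p ln n and AIC = -2 ln p(D|theta_ML) + 2p,
    with p = 2 free parameters (mu, sigma^2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    var_ml = x.var()          # ML estimates: mean is x.mean(), variance divides by n
    # At the ML solution the sum of squared residuals over var_ml equals n.
    log_lik = -0.5 * n * (np.log(2 * np.pi * var_ml) + 1.0)
    p = 2
    bic = -2 * log_lik + p * np.log(n)
    aic = -2 * log_lik + 2 * p
    return bic, aic

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=200)
print(gaussian_ml_criteria(x))
```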
Model selection
AIC/BIC can over-penalize complex models, since they do not account for parameter correlations – they cannot be used with regularized or non-parametric models
Laplace's method does account for parameter correlations, but it is still a fairly rough approximation
More advanced marginal likelihood approximations exist, mostly based on MCMC but also some promising message-passing approaches
For a recent review of MCMC methods for evidence estimation see Iain Murray's thesis; for Systems Biology applications see work from Mark Girolami's group