PATTERN RECOGNITION AND MACHINE LEARNING, CHAPTER 10
Approximate Bayesian Inference I: Structural Approximations
Falk Lieder, December 2, 2010

Statistical Inference
[Figure: hidden states Z generate observations X; inference maps the observations back to the posterior belief $p(Z|X)$ and posterior expectations $\mathbb{E}[f(Z)|X]$.]

When Do You Need Approximations?
The problem with Bayes' theorem is that it often leads to integrals that you don't know how to solve:
$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}, \qquad p(x) = \int p(x|z)\,p(z)\,dz$$
1. No analytic solution for the evidence $p(x) = \int p(x|z)\,p(z)\,dz$
2. No analytic solution for expectations $\mathbb{E}[f(Z)\mid x] = \int f(z)\,p(z|x)\,dz$
3. In the discrete case, computing $p(x)$ has complexity $O(\exp(\dim Z))$
4. Sequential learning for non-conjugate priors

How to Approximate?
• Sampling: approximate the density by a histogram and expectations by averages.
• Numerical integration: approximate the integrals for the evidence $p(x)$ and for expectations numerically; infeasible if Z is high-dimensional.
• Structural approximation: approximate the posterior by a density of a given form whose evidence and expectations are easy to compute.

Structural Approximations (Variational Inference) vs. Stochastic Approximations (Monte Carlo Methods, Sampling)
Structural approximations:
+ Fast to compute
+ Efficient representation
+ Learning rules give insight
− Systematic error
− Application often requires mathematical derivations
Stochastic approximations:
+ Asymptotically exact
+ Easily applicable general-purpose algorithms
− Time-intensive
− Storage-intensive

Variational Inference—An Intuition
[Figure: within the space of probability distributions, the VB approximation is the member of the target family that is closest in KL divergence to the true posterior.]

What Does Closest Mean?
Intuition: closest means minimal additional surprise on average. The Kullback-Leibler (KL) divergence measures this average additional surprise:
$$\mathrm{KL}[p\,\|\,q] = \mathbb{E}_p[\mathrm{Surprise}_q(X)] - \mathbb{E}_p[\mathrm{Surprise}_p(X)]$$
KL[p||q] measures how much less accurate the belief q is than p, if p is the true belief. Equivalently, KL[p||q] is the largest reduction in average surprise that you can achieve (by replacing q with p), if p is the true belief.

KL-Divergence Illustration
$$\mathrm{KL}[p(\cdot\,|X)\,\|\,q] \triangleq \int p(\theta|X)\,\ln\frac{p(\theta|X)}{q(\theta)}\,d\theta$$

Properties of the KL-Divergence
$$\mathrm{KL}[q\,\|\,p(\cdot\,|X)] \triangleq \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta|X)}\,d\theta$$
1. Zero iff both arguments are identical: $\mathrm{KL}[p\,\|\,q] = 0 \Leftrightarrow p = q$
2. Greater than zero if they differ: $\mathrm{KL}[p\,\|\,q] > 0 \Leftrightarrow p \neq q$
Disadvantage: the KL divergence is not a metric (distance function), because
a) it is not symmetric, and
b) it does not satisfy the triangle inequality.

How to Find the Closest Target Density?
• Intuition: minimize the distance.
• Implementations:
– Variational Bayes: minimize KL[q||p]
– Expectation Propagation: minimize KL[p||q]
(a small numerical illustration of the asymmetry follows below)
• Arbitrariness:
– Different measures ⇒ different algorithms and different results.
– Alternative schemes are being developed, e.g. the Jaakkola-Jordan variational method and Kikuchi approximations.
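To make the properties above concrete, here is a minimal numerical sketch (not part of the original slides): it computes KL[p||q] and KL[q||p] for two arbitrarily chosen discrete distributions, showing non-negativity, zero only for identical arguments, and the asymmetry that distinguishes the VB and EP objectives.

```python
import numpy as np

def kl_divergence(p, q):
    """KL[p||q] = sum_i p_i * ln(p_i / q_i) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only terms with p_i > 0 contribute; q_i must be > 0 wherever p_i > 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two arbitrary example distributions over three states.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print("KL[p||q] =", kl_divergence(p, q))  # > 0
print("KL[q||p] =", kl_divergence(q, p))  # > 0, but a different value: not symmetric
print("KL[p||p] =", kl_divergence(p, p))  # 0 iff the arguments are identical
```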
Minimizing Functionals
• The KL divergence is a functional.
Calculus: functions map vectors to real numbers, $f: \mathbb{R}^n \mapsto \mathbb{R}$. The derivative describes the change of $f(x_1,\dots,x_n)$ for infinitesimal changes in $x_1, x_2, \dots, x_n$.
Variational calculus: functionals map functions to real numbers. The functional derivative describes the change of $F(f(-\infty),\dots,f(+\infty))$ for infinitesimal changes in $f(-\infty),\dots,f(0),\dots,f(+\infty)$.
Minimizing functions: find the root of the derivative. Minimizing functionals: find the root of the functional derivative.

VB and the Free Energy ℱ(q)
Variational Bayes approximates the posterior by
$$q^\ast = \arg\min_{q \in T} \mathrm{KL}[q\,\|\,p(z|x)]$$
Problem: you can't evaluate the KL divergence, because you can't evaluate the posterior.
Solution:
$$\mathrm{KL}[q\,\|\,p(z|x)] = \int q(z)\,\ln\frac{q(z)}{p(z|x)}\,dz = \int q(z)\,\ln\frac{q(z)\,p(x)}{p(z,x)}\,dz = \underbrace{\int q(z)\,\ln\frac{q(z)}{p(z,x)}\,dz}_{-\mathcal{F}(q)} + \underbrace{\ln p(x)}_{\text{const}}$$
Conclusion: you can maximize the free energy ℱ(q) instead.

Minimizing the KL Divergence is Equivalent to Maximizing the Free Energy ℱ(q)
$$\ln p(x) = \mathcal{F}(q) + \mathrm{KL}[q\,\|\,p]$$

Constrained Free-Energy Maximization
$$q^\ast = \arg\max_{q \in T} \mathcal{F}(q)$$
Intuition:
• Maximize a lower bound on the log model evidence.
• The maximization is restricted to tractable target densities.
Definition:
$$\mathcal{F}(q) \triangleq \int q(z)\,\ln\frac{p(x,z)}{q(z)}\,dz$$
Properties:
• $\mathcal{F}(q) \le \ln p(x)$
• The free energy is maximal for the true posterior.

Variational Approximations
1. Factorial approximations (mean-field)
– Independence assumption: $q(z) = \prod_{i=1}^{K} q_i(z_i)$
– Optimization with respect to the factor densities $q_i$
– No restriction on the functional form of the factors
2. Approximation by parametric distributions
– Optimization with respect to the parameters
3. Variational approximations for model comparison
– Variational approximation of the log model evidence

Mean-Field Approximation
Goal:
1. Rewrite ℱ(q) as a function of a single factor $q_j$.
2. Optimize ℱ($q_j$) separately for each factor $q_j$.
Step 1:
$$\mathcal{F}(q) = \int \prod_{i=1}^{K} q_i(z_i)\,\ln p(x,z)\;dz_1\cdots dz_K \;-\; \sum_{i=1}^{K}\int \prod_{k=1}^{K} q_k(z_k)\,\ln q_i(z_i)\;dz_1\cdots dz_K$$
Isolating the factor $q_j$, the first term becomes $\int q_j(z_j)\big[\int \prod_{i\neq j} q_i(z_i)\,\ln p(x,z)\,dz_1\cdots dz_{j-1}\,dz_{j+1}\cdots dz_K\big]\,dz_j$, and everything that does not depend on $q_j$ is absorbed into a constant. With the definition $\ln\tilde{p}(x,z_j) \triangleq \mathbb{E}_{i\neq j}[\ln p(x,z)] + \text{const}$ this gives
$$\mathcal{F}(q_j) = \int q_j(z_j)\,\ln\tilde{p}(x,z_j)\,dz_j - \int q_j(z_j)\,\ln q_j(z_j)\,dz_j + \text{const}$$
Step 2:
Notice that $\mathcal{F}(q_j) = -\mathrm{KL}[q_j\,\|\,\tilde{p}(x,z_j)] + \text{const}$. Hence
$$q_j^\ast = \arg\max_{q_j} \mathcal{F}(q_j) = \tilde{p}(x,z_j), \qquad \ln q_j^\ast(z_j) = \mathbb{E}_{i\neq j}[\ln p(x,z)] + \text{const}$$
The constant is fixed by normalization, because $q_j^\ast$ has to integrate to one. Hence
$$q_j^\ast(z_j) = \frac{\exp\!\big(\mathbb{E}_{i\neq j}[\ln p(x,z)]\big)}{\int \exp\!\big(\mathbb{E}_{i\neq j}[\ln p(x,z)]\big)\,dz_j}$$

Mean-Field Example
True distribution: $p(z_1,z_2) = \mathcal{N}(\mu, \Lambda^{-1})$ with $\mu = (\mu_1,\mu_2)^t$ and precision matrix $\Lambda = \begin{pmatrix}\lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22}\end{pmatrix}$.
Target family: $q(z_1,z_2) = q_1(z_1)\,q_2(z_2)$.
VB mean-field solution:
1. $\ln q_1^\ast(z_1) = \mathbb{E}_{z_2}[\ln p(z_1,z_2)] + \text{const}$
2. $\mathbb{E}_{z_2}[\ln p(z_1,z_2)] = -\tfrac{1}{2}\big[\lambda_{11}(z_1-\mu_1)^2 + 2\lambda_{12}(z_1-\mu_1)(\mathbb{E}[z_2]-\mu_2)\big] + \text{const}$
3. Hence $q_1^\ast(z_1) = \mathcal{N}\!\big(z_1 \mid \mu_1 - \lambda_{11}^{-1}\lambda_{12}(\mathbb{E}[z_2]-\mu_2),\ \lambda_{11}^{-1}\big)$, and analogously for $q_2^\ast(z_2)$. (A numerical sketch of these updates follows below.)

[Figure: true density vs. mean-field approximation.]
Observation: the VB approximation is more compact than the true density.
Reason: KL[q||p] does not penalize deviations where q is close to 0:
$$\mathrm{KL}[q\,\|\,p(\cdot\,|X)] \triangleq \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta|X)}\,d\theta$$
Unreasonable assumptions ⇒ poor approximation.

KL[q||p] vs. KL[p||q]
Variational Bayes (KL[q||p]): analytically easier; the approximation is more compact.
Expectation Propagation (KL[p||q]): more involved; the approximation is wider.
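The closed-form updates from the mean-field example can be run as a simple coordinate-ascent loop. Below is a minimal sketch (not from the slides; the values of μ and Λ are arbitrary illustrative choices) that iterates the two factor-mean updates and compares the factor variances $1/\lambda_{ii}$ with the true marginal variances, reproducing the observation that the mean-field approximation is more compact than the true density.

```python
import numpy as np

# Mean-field VB for a bivariate Gaussian p(z1, z2) = N(mu, Lambda^{-1}).
# The numbers are arbitrary illustrative choices.
mu = np.array([1.0, -1.0])
Lambda = np.array([[2.0, 1.2],
                   [1.2, 2.0]])   # precision matrix

# Coordinate ascent on the factor means m1 = E[z1], m2 = E[z2].
m = np.zeros(2)
for _ in range(50):
    m[0] = mu[0] - Lambda[0, 1] / Lambda[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lambda[1, 0] / Lambda[1, 1] * (m[0] - mu[0])

# The factor variances are fixed by the update: var_i = 1 / lambda_ii.
var_q = 1.0 / np.diag(Lambda)
var_true = np.diag(np.linalg.inv(Lambda))  # true marginal variances

print("mean-field means:       ", m)         # converge to the true mean (1, -1)
print("mean-field variances:   ", var_q)     # 1 / lambda_ii
print("true marginal variances:", var_true)  # larger: q is more compact than p
```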
2. Parametric Approximations
• Problem: you don't know how to integrate prior times likelihood.
• Solution:
– Approximate $p(z|x)$ by some $q \in \{q(\cdot\,;\theta) : \theta \in \Theta\}$.
– The KL divergence and the free energy become functions of the parameters.
– Apply standard optimization techniques.
– Setting the derivatives to zero gives one equation per parameter.
– Solve the system of equations by iterative updating.

Parametric Approximation Example
Goal: learn the reward probability p.
[Figure: graphical model Z → X with X ∈ {0, 1}.]
• Likelihood: $X \sim \mathrm{Bernoulli}(p)$ with $p = 1/(1+\exp(-20z-4))$
• Prior: $Z \sim \mathcal{N}(0,1)$
• Posterior: $p(z \mid x=1) \propto \dfrac{\exp(-z^2/2)}{1+\exp(-20z+4)}$
Problem: you cannot derive a learning rule for the expected reward and its variance, because
a) there is no analytic formula for the expected reward probability, and
b) the form of the prior changes with every observation.
Solution: approximate the posterior by a Gaussian.

Solution
$$(\mu^\ast, \sigma^\ast) = \arg\min_{\mu,\sigma} \mathrm{KL}[q(\mu,\sigma)\,\|\,p(z|x)] = \arg\max_{\mu,\sigma} \mathcal{F}(q(\mu,\sigma))$$
Solve
$$\text{I:}\quad \frac{\partial\,\mathrm{KL}[q(\mu,\sigma)\,\|\,p]}{\partial \mu} = -\frac{\partial\,\mathcal{F}(q(\mu,\sigma))}{\partial \mu} = 0, \qquad \text{II:}\quad \frac{\partial\,\mathrm{KL}[q(\mu,\sigma)\,\|\,p]}{\partial \sigma} = -\frac{\partial\,\mathcal{F}(q(\mu,\sigma))}{\partial \sigma} = 0$$
(A numerical sketch of this fit appears after the summary below.)

Result: A Global Approximation
Learning rules for the expected reward probability and the uncertainty about it ⇒ a sequential learning algorithm.
[Figure: true posterior vs. Laplace approximation vs. variational Bayes.]

VB for Bayesian Model Selection
• $p(m|x) = \dfrac{p(x|m)\,p(m)}{p(x)} \propto p(x|m)\,p(m)$
• Hence, if $p(m)$ is uniform, $p(m|x) \propto p(x|m)$.
• Problem: $p(x|m) = \int p(z|m)\,p(x|z,m)\,dz$ is "intractable".
• Solution:
– $\ln p(x|m) = \mathcal{F}(q) + \mathrm{KL}[q\,\|\,p(z|x)]$
– $\ln p(x|m) \approx \max_{q \in T} \mathcal{F}(q)$
• Justification:
– If $p(z|x) \in T$, then $q_{\max} = p(z|x)$
– $\Rightarrow \mathrm{KL}[q_{\max}\,\|\,p(z|x)] = 0 \Rightarrow \mathcal{F}(q_{\max}) = \ln p(x|m)$

Summary
Approximate Bayesian inference → structural approximations → variational Bayes (ensemble learning) → mean-field and parametric approximations → learning rules and model selection.
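The following sketch (not part of the original slides) illustrates the parametric approximation example numerically: it fits a Gaussian $q(z;\mu,\sigma)$ to the non-conjugate posterior by maximizing the free energy. The unnormalized posterior is taken from the example slide; the grid ranges and the brute-force grid search are illustrative assumptions standing in for the closed-form learning rules derived in the slides.

```python
import numpy as np

# Fit q(z; mu, sigma) = N(mu, sigma^2) to the posterior of the reward example by
# maximizing F(mu, sigma) = E_q[ln p(z, x=1)] - E_q[ln q(z)], with
# p(z, x=1) = N(z; 0, 1) * 1 / (1 + exp(-20 z + 4))  (as on the example slide).

z = np.linspace(-5.0, 5.0, 4001)
dz = z[1] - z[0]

def log_joint(z):
    # ln p(z, x=1) = ln N(z; 0, 1) - ln(1 + exp(-20 z + 4))
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.logaddexp(0.0, -20.0 * z + 4.0)

def free_energy(mu, sigma):
    # F = int q(z) [ln p(z, x=1) - ln q(z)] dz, evaluated on the grid
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    q = np.exp(log_q)
    return np.sum(q * (log_joint(z) - log_q)) * dz

# Coarse grid search over the variational parameters (mu, sigma).
mus = np.linspace(-1.0, 2.0, 121)
sigmas = np.linspace(0.05, 1.5, 120)
F = np.array([[free_energy(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmax(F), F.shape)
print("variational fit: mu* = %.3f, sigma* = %.3f" % (mus[i], sigmas[j]))
print("free energy (lower bound on ln p(x=1)): %.3f" % F[i, j])
```

In a sequential learning setting, the fitted Gaussian would replace the prior before the next observation, which is exactly the role of the learning rules sketched on the "Result: A Global Approximation" slide.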