Approximate Bayesian Inference I

PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 10
Approximate Bayesian Inference I:
Structural Approximations
FALK LIEDER
DECEMBER 2, 2010
Outline: Introduction, Variational Inference, Variational Bayes, Applications
Statistical Inference
(Diagram: hidden states Z, observations X; statistical inference maps observations to a posterior belief, i.e. computes P(Z|X) and expectations 𝔼[f(Z)|X].)
When Do You Need Approximations?
The problem with Bayes' theorem is that it often leads to
integrals that you don't know how to solve.
p(z|X) = p(X|z) ⋅ p(z) / p(X),  where p(X) = ∫ p(X|z) ⋅ p(z) dz

1. No analytic solution for p(X) = ∫ p(X|z) ⋅ p(z) dz
2. No analytic solution for 𝔼[f(Z)|X] = ∫ f(z) ⋅ p(z|X) dz
3. In the discrete case, computing p(X) has complexity O(exp(#dim of Z))
4. Sequential learning for non-conjugate priors
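A concrete instance (the same model that reappears in the parametric-approximation example later in these slides): with a Gaussian prior Z ∼ 𝒩(0, 1) and a binary reward X whose success probability is the logistic function p(X = 1|z) = 1/(1 + exp(−20z − 4)), the evidence

p(X = 1) = ∫ 1/(1 + exp(−20z − 4)) ⋅ 𝒩(z; 0, 1) dz

has no closed form, so neither the posterior p(z|X = 1) nor 𝔼[Z|X = 1] can be written down analytically.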
How to Approximate?
• Samples: approximate the density by a histogram; approximate expectations by sample averages.
• Numerical integration: approximate the integrals numerically: (a) the evidence p(X), (b) expectations. Infeasible if Z is high-dimensional.
• Structural approximation: approximate by a density of a given form; the evidence and expectations of the approximate density are easy to compute.
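A minimal sketch of the sampling branch in Python (a toy posterior of my own choosing, so it can be sampled directly): approximate an expectation 𝔼[f(Z)|X] by the average of f over posterior samples.

import numpy as np

# Toy posterior we can sample from directly (assumption): p(z|X) = N(1.0, 0.5^2).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=100_000)

# Approximate E[f(Z)|X] by a sample average, here with f(z) = z^2.
f = lambda z: z**2
print(f(samples).mean())   # ~1.25, the exact value mu^2 + sigma^2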
How to Approximate?
Structural Approximations (Variational Inference)
+ Fast to Compute
+ Efficient Representation
+ Learning Rules give Insight
− Systematic Error
− Application often requires mathematical derivations

Stochastic Approximations (Monte-Carlo-Methods, Sampling)
+ Asymptotically Exact
+ Easily Applicable General Purpose Algorithms
− Time-Intensive
− Storage Intensive
Variational Inference—An Intuition
(Diagram: within the space of probability distributions, the VB approximation is the member of the target family closest to the true posterior, as measured by the KL-divergence.)
What Does Closest Mean?
Intuition: Closest means minimal additional surprise on average.
Kullback-Leibler (KL) divergence measures average additional surprise.
KL[p||q] = 𝔼_p[Surprise_q(Z)] − 𝔼_p[Surprise_p(Z)], where Surprise_q(z) = −ln q(z).
KL[p||q] measures how much less accurate the belief q is than p, if p is the true belief.
KL[p||q] is the largest reduction in average surprise that you can achieve, if p is the true belief.
KL-Divergence Illustration
KL[p(⋅|X) || q] ≔ ∫ p(Z|X) ⋅ ln [ p(Z|X) / q(Z) ] dZ
Properties of the KL-Divergence
KL[π‘ž| 𝑝(⋅ |𝑋) ≔
π‘ž 𝑍 ⋅ ln
π‘ž(𝑍)
𝑑𝑍
𝑝(𝑍|𝑋)
1. Zero iff both arguments are identical: KL[π‘ž| 𝑝 ⋅ 𝑋 = 0 ⇔
π‘ž=𝑝
2. Greater than zero, if they are different: KL[π‘ž| 𝑝 ⋅ 𝑋 >
0 ⇔π‘ž≠𝑝
Disadvantage
The KL-divergence is not a metric (distance function), because
a) It is not symmetric .
b) It does not satisfy the triangle inequality.
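A small numeric illustration of the asymmetry (a toy pair of Gaussians of my own choosing, using the standard closed form for the KL-divergence between univariate Gaussians):

import numpy as np

def kl_gauss(mu_a, sigma_a, mu_b, sigma_b):
    """KL[a||b] for univariate Gaussians a = N(mu_a, sigma_a^2), b = N(mu_b, sigma_b^2)."""
    return (np.log(sigma_b / sigma_a)
            + (sigma_a**2 + (mu_a - mu_b)**2) / (2 * sigma_b**2)
            - 0.5)

# Swapping the arguments changes the value: the KL-divergence is not symmetric.
print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # KL[q||p] ≈ 0.44
print(kl_gauss(1.0, 2.0, 0.0, 1.0))   # KL[p||q] ≈ 1.31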
How to Find the Closest Target Density?
• Intuition: Minimize Distance
• Implementations:
– Variational Bayes: Minimize KL[q||p]
– Expectation Propagation: Minimize KL[p||q]
• Arbitrariness
– Different Measures ⇒ Different Algorithms &
Different Results
– Alternative schemes are being developed,
e.g. the Jaakkola-Jordan variational method and Kikuchi approximations
Minimizing Functionals
• KL-divergence is a functional
Calculus
Functions map vectors to real numbers: f: ℝⁿ → ℝ
Derivative: change of f(x_1, …, x_n) for infinitesimal changes in x_1, x_2, …, x_n

Variational Calculus
Functionals map functions to real numbers
Functional derivative: change of F(f(−∞), …, f(+∞)) for infinitesimal changes in f(−∞), …, f(0), …, f(+∞)

Minimizing Functions: find the root of the derivative.
Minimizing Functionals: find the root of the functional derivative.
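A minimal worked instance of this recipe (my own illustration, not from the slides): take the negative-entropy functional of a density q on a bounded interval [a, b],

F[q] = ∫ q(z) ln q(z) dz.

Its functional derivative is δF/δq(z) = ln q(z) + 1. Setting ln q(z) + 1 + λ = 0, with a Lagrange multiplier λ enforcing ∫ q(z) dz = 1, gives a constant q, i.e. the uniform density on [a, b]: the stationary point of the functional is found exactly by "finding the root of the functional derivative".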
VB and the Free-Energy ℱ(q)
Variational Bayes: approximate posterior q = argmin_{q∈T} KL[q || p(z|X)]
Problem:
You can’t evaluate the KL-divergence, because you can’t evaluate the
posterior.
Solution:
KL[π‘ž| 𝑝 𝑧 𝑋
=
=
π‘ž 𝑧
𝑝 π‘₯
𝑝 𝑧, π‘₯
π‘ž 𝑍 ⋅ ln
π‘ž 𝑍 ⋅ ln
π‘ž 𝑍
𝑝 𝑍𝑋
𝑑𝑧 =
𝑑𝑍
π‘ž 𝑧 ⋅ ln
π‘ž 𝑧
𝑝 𝑧, π‘₯
𝑑𝑧 + ln 𝑝 π‘₯
−β„± π‘ž = −β„’(π‘ž)
Conclusion:
• You can maximize the free-energy instead.
const
VB: Minimizing KL-Divergence is
equivalent to Maximizing Free-Energy
ln p(X) = ℱ(q) + KL[q||p]

(Diagram: ln p(X) shown decomposed into ℱ(q) plus KL[q||p].)
Constrained Free-Energy Maximization
q = argmax_{q∈T} ℱ(q)

Intuition:
• Maximize a Lower Bound on the Log Model Evidence
• Maximization is restricted to tractable target densities

Definition:
ℱ(q) ≔ ∫ q(z) ⋅ ln [ p(X, z) / q(z) ] dz

Properties:
• ℱ(q) ≤ ln p(X)
• The free-energy is maximal for the true posterior.
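A small numerical check of these properties (a conjugate toy model of my own choosing, so that everything can be computed by quadrature): for several trial densities q, ℱ(q) + KL[q||p(z|X)] equals ln p(X), and ℱ(q) never exceeds ln p(X).

import numpy as np
from scipy.stats import norm

# Toy conjugate model (assumption): prior z ~ N(0,1), likelihood x|z ~ N(z,1), observed x.
x = 2.0
z = np.linspace(-10.0, 10.0, 20001)
log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z, loc=0.0, scale=1.0)
log_evidence = np.log(np.trapz(np.exp(log_joint), z))          # ln p(x)
log_posterior = log_joint - log_evidence                        # ln p(z|x)

def free_energy_and_kl(m, s):
    """F(q) and KL[q||p(z|x)] for a Gaussian trial density q = N(m, s^2)."""
    log_q = norm.logpdf(z, loc=m, scale=s)
    q = np.exp(log_q)
    F = np.trapz(q * (log_joint - log_q), z)
    kl = np.trapz(q * (log_q - log_posterior), z)
    return F, kl

# The last trial density is the exact posterior N(x/2, 1/2): there F(q) = ln p(x), KL = 0.
for m, s in [(0.0, 1.0), (1.5, 0.4), (x / 2, np.sqrt(0.5))]:
    F, kl = free_energy_and_kl(m, s)
    print(f"F(q) = {F:.4f}   KL = {kl:.4f}   F + KL = {F + kl:.4f}   ln p(x) = {log_evidence:.4f}")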
Variational Approximations
1. Factorial Approximations (Meanfield)
– Independence Assumption: q(z) = ∏_{i=1}^{K} q_i(z_i)
– Optimization with respect to the factor densities q_i
– No restriction on the functional form of the factors
2. Approximation by Parametric Distributions
– Optimization w.r.t. Parameters
3. Variational Approximations for Model
Comparison
– Variational Approximation of the Log Model Evidence
Meanfield Approximation
Goal:
1. Rewrite ℱ(q) as a function of q_j and optimize.
2. Optimize ℱ(q_j) separately for each factor q_j.
Step 1:
ℱ(q) = ∫ ∏_{i=1}^{K} q_i(z_i) ⋅ ln p(X, Z) dz_1 ⋯ dz_K − ∫ ∏_{i=1}^{K} q_i(z_i) ⋅ ∑_{i=1}^{K} ln q_i(z_i) dz_1 ⋯ dz_K

Pulling the factor q_j out of the first term gives
∫ q_j(z_j) [ ∫ ∏_{i≠j} q_i(z_i) ⋅ ln p(X, Z) dz_1 ⋯ dz_{j−1} dz_{j+1} ⋯ dz_K ] dz_j,
where the inner integral defines ln p̃(X, z_j) + const ≔ 𝔼_{i≠j}[ln p(X, Z)].
Meanfield Approximation, Step 1
The same rearrangement applies to the entropy term:
∫ ∏_{i=1}^{K} q_i(z_i) ⋅ ∑_{i=1}^{K} ln q_i(z_i) dz_1 ⋯ dz_K = ∑_{i=1}^{K} ∫ q_i(z_i) ⋅ ln q_i(z_i) dz_i.
All summands with i ≠ j are constant with respect to q_j; the remaining one is ∫ q_j(z_j) ⋅ ln q_j(z_j) dz_j. Together:

ℱ(q_j) = ∫ q_j(z_j) ⋅ ln p̃(X, z_j) dz_j − ∫ q_j(z_j) ⋅ ln q_j(z_j) dz_j + const
Meanfield Approximation, Step 2
β„± π‘žπ‘— =
qj 𝑧𝑗 ⋅ ln 𝑝 (𝑋, 𝑍𝑗 ) 𝑑𝑧𝑗 −
Notice that β„± π‘žπ‘— = −𝐾𝐿[π‘žπ‘— | 𝑝 𝑋, 𝑍𝑗
π‘žπ‘— 𝑧𝑗 ⋅ ln π‘žπ‘— 𝑧𝑗 𝑑𝑧𝑗 + const
+ const.
qj = arg max β„± π‘žπ‘— = 𝑝 𝑋, 𝑍𝑗
π‘žπ‘—
= exp 𝔼𝑖≠𝑗 ln 𝑝 𝑋, 𝑍
+ const
The constant must be the evidence, because qj has to integrate to one.
Hence,
exp 𝔼𝑖≠𝑗 ln 𝑝 𝑋, 𝑍
qj =
exp 𝔼𝑖≠𝑗 ln 𝑝 𝑋, 𝑍 𝑑𝑧𝑗
Meanfield Example
True Distribution: p(z1, z2) = 𝒩(μ, Λ⁻¹) with μ = (μ1, μ2)ᵀ and precision matrix Λ = [[λ11, λ12], [λ12, λ22]]
Target Family: q(z1, z2) = q1(z1) ⋅ q2(z2)
VB meanfield solution:
1. ln q1(z1) = 𝔼_z2[ln p(z1, z2)] + const
2. 𝔼_z2[ln p(z1, z2)] = −½ [ λ11 (z1 − μ1)² + 2 λ12 (z1 − μ1)(𝔼[z2] − μ2) ] + const
3. Hence q1(z1) = 𝒩(z1 | m1, λ11⁻¹) with m1 = μ1 − λ11⁻¹ λ12 (𝔼[z2] − μ2); q2(z2) follows by symmetry.
Meanfield Example
(Figure: contour plots of the true density and of the factorized VB approximation.)
Observation:
VB-Approximation is more
compact than true density.
Reason:
KL[q||p] does not penalize
deviations where q is close to 0.
KL[π‘ž| 𝑝(⋅ |𝑋) ≔
π‘ž 𝑍 ⋅ ln
π‘ž(𝑍)
𝑑𝑍
𝑝(𝑍|𝑋)
Unreasonable Assumptions οƒ  Poor Approximation
KL[q||p] vs. KL[p||q]
Variational Bayes
• Analytically Easier
• Approx. is more compact
Expectation Propagation
• More Involved
• Approx. is wider
2. Parametric Approximations
• Problem:
– You don’t know how to integrate prior times
likelihood.
• Solution:
– Approximate p(z|X) by q ∈ { q(⋅; θ) : θ ∈ Θ }.
– KL-divergence and free-energy become functions of
the parameters
– Apply standard optimization techniques.
– Setting derivatives to zero ⇒ one equation per parameter.
– Solve System of Equations by iterative Updating.
Parametric Approximation Example
Goal: Learn the Reward Probability p
• Likelihood: X ∼ B(n, p), p = 1/(1 + exp(−20z − 4))
• Prior: Z ∼ 𝒩(0, 1)
• Posterior: p(z | X = 1) ∝ exp(−z²/2) / (1 + exp(−20z − 4))
(Graphical model: Z ∈ ℝ → X ∈ {0, 1})
Problem
You cannot derive a learning rule for the expected reward and its variance, because…
a) No analytic formula for the expected reward probability
b) The form of the prior changes with every observation
Solution: Approximate the posterior by a Gaussian.
Solution
(πœ‡, 𝜎) = arg min KL[π‘ž(πœ‡, 𝜎)| 𝑝(𝑧|𝑋) = arg max β„±(π‘ž(πœ‡, 𝜎)
πœ‡,𝜎
Solve
πœ‡,𝜎
πœ•KL[π‘ž(πœ‡, 𝜎)||𝑝]
πœ•β„±(π‘ž(πœ‡, 𝜎)
I
=−
=0
πœ•πœ‡
πœ•πœ‡
πœ•KL[π‘ž(πœ‡, 𝜎)||𝑝]
πœ•β„±(π‘ž(πœ‡, 𝜎)
II
=−
=0
πœ•πœŽ
πœ•πœŽ
Result: A Global Approximation
Learning rules for the expected reward probability and the uncertainty about it ⇒ a sequential learning algorithm.
(Figure: the true posterior compared with the Laplace approximation and the Variational Bayes approximation.)
VB for Bayesian Model Selection
• 𝑝 π‘šπ‘‹ =
𝑝
π‘‹π‘š
⋅𝑝(π‘š)
𝑝(𝑋)
∝𝑝 𝑋 π‘š ⋅𝑝 π‘š
• Hence, if 𝑝 π‘š is uniform 𝑝 π‘š 𝑋 ∝ 𝑝(𝑋|π‘š).
• Problem:
– 𝑝 π‘‹π‘š =
𝑝(πœƒ|π‘š) ⋅ 𝑝 𝑋 πœƒ, π‘š π‘‘πœƒ is “intractable”
• Solution:
– ln 𝑝(𝑋|π‘š) = β„± π‘ž + KL[π‘ž||𝑝(πœƒ|𝑋)]
– ln 𝑝(𝑋|π‘š) ≈ max β„± π‘ž
π‘ž∈𝑇
• Justification:
– If 𝑝 πœƒ 𝑋 ∈ 𝑇, then π‘žπ‘šπ‘Žπ‘₯ πœƒ = 𝑝 πœƒ 𝑋
– ⇒ KL[π‘žmax ||𝑝(πœƒ|𝑋)]=0 ⇒ β„± π‘žmax = ln 𝑝(𝑋|π‘š)
Summary
Approximate Bayesian Inference
Structural Approximations
Variational Bayes (Ensemble Learning)
Meanfield
Parametric Approx.
Learning Rules, Model Selection