Approximate Bayesian Inference for Machine Learning

Louis Ellam
l.ellam@warwick.ac.uk
Warwick Centre of Predictive Modelling (WCPM)
The University of Warwick
13 October 2015
http://www2.warwick.ac.uk/wcpm/

Based on Louis Ellam, "Approximate Inference for Machine Learning", MSc Dissertation, Centre for Scientific Computing, University of Warwick.
Presented as part of the Warwick Centre of Predictive Modelling (WCPM) Seminar Series, University of Warwick.
Outline

1. Motivation
   - Machine Learning
   - Classical Approaches to Regression
   - Statistical Approaches to Regression
2. Monte Carlo Inference
   - Importance Sampling
   - Markov Chain Monte Carlo
3. Approximate Inference
   - Laplace Approximation
   - Variational Inference
   - Expectation Propagation
4. Summary of Methods
Machine Learning

Machine learning is the study of algorithms that can learn from data.

Regression
Given D = {x_1, ..., x_N, t_1, ..., t_N}, find a model that allows us to accurately estimate the response of any input.

Clustering
Given data D = {x_1, ..., x_N} and some fixed number of clusters K, assign each data point to a cluster so that points in each cluster are close together.

The Clutter Problem
Given data D = {x_1, ..., x_N} comprising a noisy signal immersed in clutter, recover the original signal θ, i.e. determine E[θ] and p(D). Observations are drawn from

    p(x \mid \theta) = (1 - w)\,\mathcal{N}(x \mid \theta, I) + w\,\mathcal{N}(x \mid 0, aI).
Classical Approach to Regression

Linear Model
Use a linear model f defined over M + 1 linearly independent basis functions φ_0(x), ..., φ_M(x) with weights w_0, ..., w_M:

    f(x, w) = \sum_{m=0}^{M} w_m \phi_m(x).

Tikhonov Regularisation

    E(w) = \frac{1}{2} \| \Phi w - t \|^2 + \frac{\lambda}{2} \| w \|^2,

    w^* = \underset{w}{\operatorname{argmin}}\, E(w) = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t.

C. Bishop (2006), D. Calvetti et al. (2007)
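The closed-form Tikhonov solution above can be evaluated directly. The following is a minimal sketch fitting a regularised polynomial model to toy data; the data, the polynomial basis, the degree M and the regularisation strength λ are illustrative choices, not taken from the slides.

```python
# Minimal sketch of the regularised least-squares solution above, using a
# polynomial basis phi_m(x) = x**m on toy data (all settings illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

M = 5                                        # polynomial degree
Phi = np.vander(x, M + 1, increasing=True)   # design matrix, columns phi_0 .. phi_M
lam = 1e-3                                   # regularisation strength lambda

# w* = (Phi^T Phi + lambda I)^{-1} Phi^T t, solved without forming an explicit inverse
w_star = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
t_fit = Phi @ w_star                         # fitted responses at the training inputs
```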
Statistical Approach to Regression

Bayes' Theorem

    p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}.

Likelihood and Prior
We expect an observation to satisfy

    t = f(x, w) + \eta,

where η denotes mean-centred Gaussian noise with precision parameter β. Thus the likelihood function is defined as

    p(t \mid x, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}),

with prior

    p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I).

Posterior

    \ln p(w \mid x, t, \alpha, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \big(t_n - w^T \phi(x_n)\big)^2 - \frac{\alpha}{2} w^T w + \text{const}.

R. Kass (1995)
Bayesian Approach to Regression

Predictive Distribution

    p(t \mid x, t, \alpha, \beta) = \int p(t \mid w)\, p(w \mid x, t, \alpha, \beta)\, dw = \mathcal{N}\big(t \mid m_N^T \phi(x), \sigma_N^2(x)\big),

where

    \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x),

    S_N^{-1} = \alpha I + \beta \Phi^T \Phi,

    m_N = \beta S_N \Phi^T t.

'More Bayesian' if α is treated as a random variable, but analytically intractable.

Classical integration methods are not suitable for high-dimensional probability density functions.

C. Bishop (2006)
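A minimal sketch of these posterior and predictive formulas is given below, assuming a precomputed design matrix Phi and target vector t as above; the function and variable names are illustrative rather than taken from the slides.

```python
# Sketch of the Gaussian weight posterior and the predictive distribution
# for Bayesian linear regression (illustrative function names).
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Return m_N and S_N for p(w | x, t, alpha, beta)."""
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a single feature vector phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```

Used with the Phi and t from the previous sketch, posterior(Phi, t, alpha, beta) gives the weight posterior and predictive(Phi[n], m_N, S_N, beta) the pointwise predictive moments.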
Monte Carlo Integration

Monte Carlo Estimate

    E_f[q(\theta)] = \int q(\theta) f(\theta)\, d\theta \approx \frac{1}{N} \sum_{n=1}^{N} q(X_n), \quad \text{where } X_n \overset{\text{iid}}{\sim} f(\theta).

Rewriting the expectation in terms of a proposal density g,

    E_f[q(\theta)] = \int q(\theta) \underbrace{\frac{f(\theta)}{g(\theta)}}_{=:\, w(\theta)}\, g(\theta)\, d\theta = \int q(\theta)\, w(\theta)\, g(\theta)\, d\theta.

Importance Sampling Estimate

    E_f[q(\theta)] \approx \frac{1}{N} \sum_{n=1}^{N} w(Z_n)\, q(Z_n), \quad \text{where } Z_n \overset{\text{iid}}{\sim} g(\theta).

Self-Normalised Importance Sampling Estimate

    E_f[q(\theta)] \approx \frac{1}{\sum_{m=1}^{N} \tilde{w}(Z_m)} \sum_{n=1}^{N} \tilde{w}(Z_n)\, q(Z_n), \quad \text{where } Z_n \overset{\text{iid}}{\sim} g(\theta).
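As a sketch of the estimators above, the snippet below compares plain Monte Carlo, importance sampling and its self-normalised variant on a toy one-dimensional problem; the choice of target f, proposal g and integrand q is purely illustrative.

```python
# Sketch of the three estimators above: target f = N(1, 1), proposal
# g = N(0, 4) and integrand q(theta) = theta**2 (all illustrative choices;
# each estimate should be close to E_f[theta^2] = 2).
import numpy as np

rng = np.random.default_rng(1)

def q(theta):
    return theta**2

def f_pdf(theta):                       # target density f = N(1, 1)
    return np.exp(-0.5 * (theta - 1.0)**2) / np.sqrt(2.0 * np.pi)

def g_pdf(theta):                       # proposal density g = N(0, 4)
    return np.exp(-theta**2 / 8.0) / np.sqrt(8.0 * np.pi)

N = 100_000
X = rng.normal(1.0, 1.0, N)             # X_n ~ f
mc_estimate = np.mean(q(X))             # plain Monte Carlo

Z = rng.normal(0.0, 2.0, N)             # Z_n ~ g
w = f_pdf(Z) / g_pdf(Z)                 # importance weights w(theta) = f/g
is_estimate = np.mean(w * q(Z))         # importance sampling
snis_estimate = np.sum(w * q(Z)) / np.sum(w)   # self-normalised IS
```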
Monte Carlo Integration for the Clutter Problem

Define importance weights, taking the prior as the proposal:

    w(\theta) = \frac{\tilde{f}(\theta)}{\tilde{g}(\theta)} = \frac{p(D \mid \theta)\, p(\theta)}{p(\theta)} = p(D \mid \theta).

Expectation (SNIS) and Evidence ('Vanilla' MC)

    E_{p(\theta \mid D)}[\theta] \approx \frac{1}{\sum_{m=1}^{N} p(D \mid X_m)} \sum_{n=1}^{N} X_n\, p(D \mid X_n), \quad \text{where } X_n \overset{\text{iid}}{\sim} p(\theta),

    p(D) = E_{p(\theta)}[p(D \mid \theta)] \approx \frac{1}{N} \sum_{n=1}^{N} p(D \mid X_n), \quad \text{where } X_n \overset{\text{iid}}{\sim} p(\theta).

T. Minka (2001a), J. Liu (2008)
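A sketch of prior-as-proposal importance sampling for a one-dimensional clutter problem follows; the toy data, the settings w = 0.5 and a = 10, and the N(0, 100) prior are illustrative assumptions rather than values taken from the slides.

```python
# Sketch of (self-normalised) importance sampling for a 1-D clutter problem;
# data, w, a and the prior are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
w, a = 0.5, 10.0
data = np.array([1.2, 0.8, -0.3, 1.5, 0.9])            # toy observations

def likelihood(theta):
    """p(D | theta) for a scalar theta under the clutter model."""
    return np.prod((1.0 - w) * norm.pdf(data, loc=theta, scale=1.0)
                   + w * norm.pdf(data, loc=0.0, scale=np.sqrt(a)))

N = 50_000
X = rng.normal(0.0, 10.0, N)                            # X_n ~ p(theta) = N(0, 100)
weights = np.array([likelihood(x) for x in X])          # w(theta) = p(D | theta)

posterior_mean = np.sum(weights * X) / np.sum(weights)  # SNIS estimate of E[theta | D]
evidence = np.mean(weights)                             # 'vanilla' MC estimate of p(D)
```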
Markov Chain Monte Carlo

A discrete Markov chain is defined as a series of random variables Z^(1), ..., Z^(M) such that

    p(Z^{(M)} \mid Z^{(1)}, \ldots, Z^{(M-1)}) = p(Z^{(M)} \mid Z^{(M-1)}).

Metropolis-Hastings: draw a new sample Z from a proposal distribution g(· | Z^(M-1)) and accept it with probability

    A(Z \mid Z^{(M-1)}) = \min\left\{1, \frac{f(Z)\, g(Z^{(M-1)} \mid Z)}{f(Z^{(M-1)})\, g(Z \mid Z^{(M-1)})}\right\}.

Algorithm 1 Metropolis-Hastings
  Initialise Z^(0) = {Z_1^(0), ..., Z_K^(0)}
  for M = 1, ..., T do
    Draw Z ~ g(· | Z^(M-1))
    Compute A(Z | Z^(M-1)) = min{1, f(Z) g(Z^(M-1) | Z) / [f(Z^(M-1)) g(Z | Z^(M-1))]}
    With probability A(Z | Z^(M-1)) set Z^(M) = Z, otherwise set Z^(M) = Z^(M-1)
  end for

Algorithm 2 Gibbs Sampling
  Initialise Z_1^(0), ..., Z_K^(0)
  for M = 1, ..., T do
    draw Z_1^(M) ~ Z_1 | Z_2^(M-1), ..., Z_K^(M-1)
    draw Z_2^(M) ~ Z_2 | Z_1^(M), Z_3^(M-1), ..., Z_K^(M-1)
    ...
    draw Z_K^(M) ~ Z_K | Z_1^(M), ..., Z_{K-1}^(M)
  end for
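A minimal Metropolis-Hastings sketch with a symmetric Gaussian random-walk proposal is given below, targeting the toy density p(z) ∝ exp(−z²/2)σ(20z + 4) used as an example later in the slides; the step size and chain length are arbitrary choices.

```python
# Minimal Metropolis-Hastings sketch (Algorithm 1) with a Gaussian
# random-walk proposal; step size and chain length are arbitrary.
import numpy as np

rng = np.random.default_rng(3)

def f(z):
    """Unnormalised target density p(z) ∝ exp(-z^2/2) * sigmoid(20z + 4)."""
    return np.exp(-0.5 * z**2) / (1.0 + np.exp(-(20.0 * z + 4.0)))

T = 10_000
step = 1.0
samples = np.empty(T)
z = 0.0                                                # Z^(0)
for m in range(T):
    z_prop = z + step * rng.standard_normal()          # symmetric proposal g
    accept = min(1.0, f(z_prop) / f(z))                # g terms cancel for a symmetric proposal
    if rng.random() < accept:
        z = z_prop
    samples[m] = z
```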
Gibbs Sampling for Clustering

Likelihood for GMM

    p(D \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k).

Introduce Latent Variables

    p(z_{nk} = 1 \mid \pi_k) = \pi_k, \qquad p(z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}.

Joint Distribution for GMM (Augmented)

    p(D, z, \pi, \mu, \Sigma) = p(D \mid z, \pi, \mu, \Sigma)\, p(z \mid \pi)\, p(\pi)\, p(\mu)\, p(\Sigma),

where

    p(D \mid \mu, \Sigma, z) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}},

    p(\mu)\, p(\Sigma) = \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid m_0, V_0)\, \mathcal{IW}(\Sigma_k \mid W_0, \nu_0),

    p(\pi) = \text{Dir}(\pi \mid \alpha_0).
Gibbs Sampling for Clustering

Full Conditionals

    p(z_{nk} = 1 \mid x_n, \mu, \Sigma, \pi) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)},

    p(\pi \mid z) = \text{Dir}(\pi \mid \alpha_k),

    p(\mu_k \mid \Sigma_k, z, x) = \mathcal{N}(\mu_k \mid m_k, V_k),

    p(\Sigma_k \mid \mu_k, z, x) = \mathcal{IW}(\Sigma_k \mid W_k, \nu_k).

K. Murphy (2012)
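A sketch of a single Gibbs sweep over the assignments z and mixing weights π for a one-dimensional mixture follows, with the component means and variances held fixed for brevity; the toy data and the hyperparameter alpha0 are illustrative assumptions.

```python
# Sketch of one Gibbs sweep for z and pi in a 1-D mixture, with the
# component means/variances held fixed for brevity (toy data, illustrative
# hyperparameter alpha0).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])
K, alpha0 = 2, 1.0
mu, sigma = np.array([-1.0, 1.0]), np.array([1.0, 1.0])    # fixed in this sketch
pi = np.full(K, 1.0 / K)

# Sample each z_n from its full conditional (a categorical over the K clusters)
probs = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)     # N x K
probs /= probs.sum(axis=1, keepdims=True)
z = np.array([rng.choice(K, p=p) for p in probs])

# Sample pi from its Dirichlet full conditional
Nk = np.bincount(z, minlength=K)
pi = rng.dirichlet(alpha0 + Nk)
```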
Laplace Approximation

Seek to approximate a probability density function p(z) of the form p(z) = (1/Z) f(z). Expand ln f about a point z_0:

    \ln f(z) \approx \ln f(z_0) + (z - z_0)^T \nabla \ln f(z)\big|_{z = z_0} + \frac{1}{2} (z - z_0)^T \nabla\nabla \ln f(z)\big|_{z = z_0} (z - z_0),

and choose z_0 to be a mode, so that \nabla f(z)|_{z = z_0} = 0 and the linear term vanishes.

Laplace Approximation

    q(z) = \mathcal{N}(z \mid z_0, A^{-1}), \quad \text{where} \quad A = -\nabla\nabla \ln f(z)\big|_{z = z_0}.

Example

    p(z) \propto \exp(-z^2/2)\, \sigma(20z + 4), \quad \text{where } \sigma(t) = (1 + e^{-t})^{-1}.

R. Kass et al. (1995), C. Bishop (2006)
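A numerical sketch of the Laplace approximation for the example density is given below; finding the mode with a generic optimiser and approximating the second derivative by a central finite difference are convenient illustrative choices, not the only option.

```python
# Numerical sketch of the Laplace approximation for
# p(z) ∝ exp(-z^2/2) * sigmoid(20z + 4).
import numpy as np
from scipy.optimize import minimize_scalar

def log_f(z):
    """Log of the unnormalised density: -z^2/2 + log sigmoid(20z + 4)."""
    return -0.5 * z**2 - np.log1p(np.exp(-(20.0 * z + 4.0)))

z0 = minimize_scalar(lambda z: -log_f(z)).x     # mode z_0 of f(z)

h = 1e-4                                        # central difference for A = -d^2 log f / dz^2
A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h**2

laplace_mean, laplace_var = z0, 1.0 / A         # q(z) = N(z | z_0, A^{-1})
```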
EM Algorithm (GMM)

Want to find the mode of the GMM log likelihood

    \ln p(D \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}.

The MLE of μ_k is

    \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n,

where

    N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.

The MLE of Σ_k is

    \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T.

A. Dempster et al. (1977), C. Bishop (2006)
EM Algorithm (GMM)

The maximisation for π is constrained by \sum_{k=1}^{K} \pi_k = 1, so find the MLE of

    \ln p(D \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right),

which yields

    \pi_k = \frac{N_k}{N}.

Algorithm 3 EM GMM
  Initialise μ_k, Σ_k, π_k for k = 1, ..., K
  repeat
    E Step: Evaluate responsibilities

        \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.

    M Step: Define

        N_k = \sum_{n=1}^{N} \gamma(z_{nk}),

    then update parameters

        \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n,

        \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T,

        \pi_k = \frac{N_k}{N}.

  until convergence
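A compact sketch of Algorithm 3 for one-dimensional data with K = 2 follows; the toy data, the initialisation and the fixed iteration count are illustrative assumptions.

```python
# Compact EM sketch (Algorithm 3) for a K-component GMM on 1-D data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 1.0, 100)])
K = 2
pi, mu, sigma = np.full(K, 1.0 / K), np.array([-1.0, 1.0]), np.ones(K)

for _ in range(100):
    # E step: responsibilities gamma(z_nk)
    gamma = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)      # N x K
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: update N_k, mu_k, sigma_k, pi_k
    Nk = gamma.sum(axis=0)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (X[:, None] - mu)**2).sum(axis=0) / Nk)
    pi = Nk / X.size
```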
EM Algorithm (GMM)

[Figure: EM iterations for the GMM; link to code / .mp4 in the original slides]
Variational Inference

KL Divergence
Choose a proxy distribution q(θ) and optimise it so that q is as close a fit as possible to p. One such approach is to minimise

    \text{KL}(q \,\|\, p) := -\int q(\theta) \ln \frac{p(\theta \mid D)}{q(\theta)}\, d\theta.

Example

    p(z) \propto \exp(-z^2/2)\, \sigma(20z + 4), \quad \text{where } \sigma(t) = (1 + e^{-t})^{-1}.

H. Attias (2000), Z. Ghahramani et al. (1999), C. Bishop (2006)
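As a small numerical illustration of minimising KL(q||p), the sketch below fits a Gaussian q = N(μ, s²) to the example density on a grid; the grid, the optimiser and the (mean, log-std) parameterisation are arbitrary choices made only for this sketch.

```python
# Numerical sketch of minimising KL(q || p) over a Gaussian family q = N(mu, s^2)
# for the example density, using a simple grid for the integrals.
import numpy as np
from scipy.optimize import minimize

z = np.linspace(-6.0, 6.0, 4001)
dz = z[1] - z[0]
p_unnorm = np.exp(-0.5 * z**2) / (1.0 + np.exp(-(20.0 * z + 4.0)))
p = p_unnorm / (p_unnorm.sum() * dz)                 # normalised target on the grid

def kl_q_p(params):
    """KL(q || p) = ∫ q ln(q / p) dz for q = N(mu, exp(log_s)^2)."""
    mu, log_s = params
    s = np.exp(log_s)
    q = np.exp(-0.5 * ((z - mu) / s)**2) / (s * np.sqrt(2.0 * np.pi))
    return np.sum(q * (np.log(q + 1e-300) - np.log(p))) * dz

res = minimize(kl_q_p, x0=np.array([0.0, 0.0]))      # optimise mean and log-std
mu_star, s_star = res.x[0], np.exp(res.x[1])
```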
Variational Inference

    \ln p(D) = \ln \frac{p(D, \theta)}{p(\theta \mid D)} = \int q(\theta) \ln \left( \frac{p(D, \theta)}{p(\theta \mid D)} \frac{q(\theta)}{q(\theta)} \right) d\theta
             = \int q(\theta) \ln \frac{p(D, \theta)}{q(\theta)}\, d\theta - \int q(\theta) \ln \frac{p(\theta \mid D)}{q(\theta)}\, d\theta
             = \mathcal{L}(q) + \text{KL}(q \,\|\, p).

Lower Bound
Since KL(q||p) ≥ 0, minimising the KL divergence is equivalent to maximising the lower bound

    \mathcal{L}(q) = \int q(\theta) \ln \frac{p(D, \theta)}{q(\theta)}\, d\theta.

Lower Bound Evidence Approximation
It may be shown that ln p(D) ≥ L(q), hence p(D) may be approximated using

    p(D) \to \exp \mathcal{L}(q) \quad \text{as} \quad \text{KL}(q \,\|\, p) \to 0.
Mean Field Approximation

Mean Field Approximation
Assume q(θ) can be expressed in factorised form

    q(\theta) = \prod_{m=1}^{M} q_m(\theta_m).

Mean Field Lower Bound

    \mathcal{L}(q) = \int \prod_{m=1}^{M} q(\theta_m) \ln \frac{p(D, \theta)}{\prod_{m=1}^{M} q(\theta_m)}\, d\theta
                   = \int \prod_{m=1}^{M} q(\theta_m) \left\{ \ln p(D, \theta) - \sum_{m=1}^{M} \ln q(\theta_m) \right\} d\theta

                   = \int q(\theta_j) \left\{ \int \ln p(D, \theta) \prod_{m \neq j} q(\theta_m)\, d\theta_m \right\} d\theta_j - \int q(\theta_j) \ln q(\theta_j)\, d\theta_j + \text{const}

                   = \int q(\theta_j)\, \mathbb{E}_{m \neq j}[\ln p(D, \theta)]\, d\theta_j - \int q(\theta_j) \ln q(\theta_j)\, d\theta_j + \text{const}

                   = -\text{KL}\big(q(\theta_j) \,\|\, \exp \mathbb{E}_{m \neq j}[\ln p(D, \theta)]\big) + \text{const}.

Mean Field Minimised KL Divergence

    \ln q_j^*(\theta_j) = \mathbb{E}_{m \neq j}[\ln p(D, \theta)] + \text{const}.
Variational Inference GMM

Target Posterior

    p(D, z, \pi, \mu, \Lambda) = p(D \mid z, \pi, \mu, \Lambda)\, p(z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda).

Conjugate Prior

    p(\mu \mid \Lambda)\, p(\Lambda) = \prod_{k=1}^{K} p(\mu_k \mid \Lambda_k)\, p(\Lambda_k) = \prod_{k=1}^{K} \mathcal{N}\big(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\big)\, \mathcal{W}(\Lambda_k \mid W_0, \nu_0).

Joint Distribution

    p(D, z, \mu, \Lambda, \pi) \propto \left( \prod_{n=1}^{N} \prod_{k=1}^{K} \big( \pi_k \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) \big)^{z_{nk}} \right) \cdot \text{Dir}(\pi \mid \alpha_0) \cdot \prod_{k=1}^{K} \mathcal{N}\big(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}\big)\, \mathcal{W}(\Lambda_k \mid W_0, \nu_0).
Variational Inference GMM

Mean Field Approximation

    q(z, \pi, \mu, \Lambda) = q(z)\, q(\pi, \mu, \Lambda).

Mean Field Minimised KL Divergence

    \ln q_j^*(\theta_j) = \mathbb{E}_{m \neq j}[\ln p(D, \theta)] + \text{const}.

Expression for Responsibilities

    \ln q^*(z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{const},

where

    \ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \frac{1}{2} \mathbb{E}[\ln |\Lambda_k|] - \frac{D}{2} \ln 2\pi - \frac{1}{2} \mathbb{E}_{\mu_k, \Lambda_k}\big[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\big],

which can be used to give an expression for the responsibilities

    r_{nk} = \mathbb{E}[z_{nk}] = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}.
Variational Inference GMM

Define

    N_k = \sum_{n=1}^{N} r_{nk}, \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n, \qquad S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T.

Expression for Model Parameters

    \ln q^*(\pi, \mu, \Lambda) = \mathbb{E}_z[\ln p(D, z, \pi, \mu, \Lambda)] + \text{const}.

By inspection

    q(\pi, \mu, \Lambda) = q(\pi)\, q(\mu, \Lambda) = q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k),

    q(\pi \mid \alpha) = \text{Dir}(\pi \mid \alpha),

    q^*(\mu_k \mid \Lambda_k) = \mathcal{N}\big(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1}\big),

    q^*(\Lambda_k) = \mathcal{W}(\Lambda_k \mid W_k, \nu_k).
Variational Inference GMM

Expressions for Model Parameters

    \alpha_k = \alpha_0 + N_k, \qquad \beta_k = \beta_0 + N_k, \qquad \nu_k = \nu_0 + N_k,

    m_k = \frac{1}{\beta_k} (\beta_0 m_0 + N_k \bar{x}_k),

    W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T.

Expressions for Responsibilities

    \mathbb{E}_{\mu_k, \Lambda_k}\big[(x_n - \mu_k)^T \Lambda_k (x_n - \mu_k)\big] = \nu_k (x_n - m_k)^T W_k (x_n - m_k) + D \beta_k^{-1},

    \ln \tilde{\pi}_k := \mathbb{E}[\ln \pi_k] = \psi(\alpha_k) - \psi\Big(\sum_{k=1}^{K} \alpha_k\Big),

    \ln \tilde{\Lambda}_k := \mathbb{E}[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\Big(\frac{\nu_k + 1 - i}{2}\Big) + D \ln 2 + \ln |W_k|.

Mixing Coefficients

    \mathbb{E}[\pi_k] = \frac{\alpha_0 + N_k}{K \alpha_0 + N}.
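A sketch of the model-parameter (M-step) updates above, collected into a single function, is given below; the array shapes and the variable names (X, r, alpha0, beta0, nu0, m0, W0) are assumptions made for illustration.

```python
# Sketch of the variational M-step updates: (alpha_k, beta_k, nu_k, m_k, W_k)
# from responsibilities r (N x K) and data X (N x D).
import numpy as np

def vb_m_step(X, r, alpha0, beta0, nu0, m0, W0):
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0)                                         # N_k
    xbar = (r.T @ X) / Nk[:, None]                             # x̄_k, shape (K, D)
    S = np.empty((K, D, D))                                    # S_k
    for k in range(K):
        diff = X - xbar[k]
        S[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    alpha, beta, nu = alpha0 + Nk, beta0 + Nk, nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W_inv = (np.linalg.inv(W0)
             + Nk[:, None, None] * S
             + (beta0 * Nk / (beta0 + Nk))[:, None, None]
               * np.einsum('kd,ke->kde', xbar - m0, xbar - m0))
    W = np.linalg.inv(W_inv)
    return alpha, beta, nu, m, W
```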
Variational Inference GMM

Algorithm 4 Variational Bayes GMM
  Initialise π, μ, Λ, m, W, β_0, ν_0, α_0
  repeat
    E Step: Compute the responsibilities as follows

        r_{nk} \propto \tilde{\pi}_k \tilde{\Lambda}_k^{1/2} \exp\Big\{ -\frac{D}{2\beta_k} - \frac{\nu_k}{2} (x_n - m_k)^T W_k (x_n - m_k) \Big\},

        \ln \tilde{\pi}_k = \psi(\alpha_k) - \psi\Big(\sum_{k=1}^{K} \alpha_k\Big),

        \ln \tilde{\Lambda}_k = \sum_{i=1}^{D} \psi\Big(\frac{\nu_k + 1 - i}{2}\Big) + D \ln 2 + \ln |W_k|.

    M Step: Update model parameters as follows

        N_k = \sum_{n=1}^{N} r_{nk}, \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n, \qquad S_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T,

        \alpha_k = \alpha_0 + N_k, \qquad \beta_k = \beta_0 + N_k, \qquad \nu_k = \nu_0 + N_k,

        m_k = \beta_k^{-1} (\beta_0 m_0 + N_k \bar{x}_k),

        W_k^{-1} = W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T.

  until convergence
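To complement the M-step sketch above, the following is a sketch of the E step of Algorithm 4, computing ln π̃, ln Λ̃ and the responsibilities from the current variational parameters; the array shapes and names are assumptions, and the max-subtraction is added purely for numerical stability.

```python
# Sketch of the variational E step: responsibilities r_nk from the current
# variational parameters (X is N x D; alpha, beta, nu are length-K arrays;
# m is K x D; W is K x D x D).
import numpy as np
from scipy.special import digamma

def vb_e_step(X, alpha, beta, nu, m, W):
    N, D = X.shape
    K = alpha.size
    ln_pi_tilde = digamma(alpha) - digamma(alpha.sum())
    ln_lambda_tilde = np.array([
        digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
        + D * np.log(2.0) + np.log(np.linalg.det(W[k]))
        for k in range(K)])
    # nu_k (x_n - m_k)^T W_k (x_n - m_k), for every n and k
    quad = np.array([nu[k] * np.einsum('nd,de,ne->n', X - m[k], W[k], X - m[k])
                     for k in range(K)]).T                     # N x K
    ln_rho = ln_pi_tilde + 0.5 * ln_lambda_tilde - 0.5 * (D / beta + quad)
    ln_rho -= ln_rho.max(axis=1, keepdims=True)                # numerical stability only
    r = np.exp(ln_rho)
    return r / r.sum(axis=1, keepdims=True)
```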
Variational Inference GMM

[Figures: Variational Bayes GMM iterations; links to code / .mp4 in the original slides]
Variational Inference GMM Lower Bound

Lower Bound

    \mathcal{L}(q) = \sum_{z} \iiint q(z, \pi, \mu, \Lambda) \ln \left\{ \frac{p(D, z, \pi, \mu, \Lambda)}{q(z, \pi, \mu, \Lambda)} \right\} d\pi\, d\mu\, d\Lambda

    = \mathbb{E}[\ln p(D, z, \pi, \mu, \Lambda)] - \mathbb{E}[\ln q(z, \pi, \mu, \Lambda)]

    = \mathbb{E}[\ln p(D \mid z, \mu, \Lambda)] + \mathbb{E}[\ln p(z \mid \pi)] + \mathbb{E}[\ln p(\pi)] + \mathbb{E}[\ln p(\mu, \Lambda)]
      - \mathbb{E}[\ln q(z)] - \mathbb{E}[\ln q(\pi)] - \mathbb{E}[\ln q(\mu, \Lambda)].
Variational Inference GMM Model Selection

Model Selection with the Lower Bound

    p(K \mid D) = \frac{K!\, p(D \mid K)}{\sum_{K'} K'!\, p(D \mid K')},

where

    p(D \mid K) \approx \exp \mathcal{L}(K).

Alternative View

    \mathcal{L}_M(K) \approx \mathcal{L}(K) + \ln K!.

A. Corduneanu (2001), C. Bishop (2006), K. Murphy (2012)
Expectation Propagation

Minimise Reverse KL Divergence

    \text{KL}(p \,\|\, q) = -\int p(\theta \mid D) \ln \frac{q(\theta)}{p(\theta \mid D)}\, d\theta.

Example

    p(z) \propto \exp(-z^2/2)\, \sigma(20z + 4), \quad \text{where } \sigma(t) = (1 + e^{-t})^{-1}.

T. Minka (2001a), T. Minka (2001b)
Expectation Propagation

Exponential Family

    q(\theta) = h(\theta)\, g(\eta)\, \exp\{\eta^T u(\theta)\}.

Reverse KL Divergence for the Exponential Family

    \text{KL}(p \,\|\, q) = \int p(\theta) \ln \left\{ \frac{p(\theta)}{q(\theta)} \right\} d\theta
                          = \int p(\theta) \big[ \ln p(\theta) - \ln\big(h(\theta) g(\eta)\big) - \eta^T u(\theta) \big]\, d\theta
                          = -\ln g(\eta) - \eta^T \mathbb{E}_{p(\theta)}[u(\theta)] + \text{const}.

Minimised Reverse KL Divergence for the Exponential Family
Setting the gradient with respect to η to zero gives

    -\nabla \ln g(\eta) = \mathbb{E}_{p(\theta)}[u(\theta)],

i.e. the divergence is minimised when

    \mathbb{E}_{q(\theta)}[u(\theta)] = \mathbb{E}_{p(\theta)}[u(\theta)].

This is difficult, as it requires taking expectations with respect to the unknown posterior.
Expectation Propagation

Assumed Form of Joint Distribution

    p(\theta, D) = \prod_{n} f_n(\theta).

Approximating Distribution
An approximating posterior belonging to the exponential family is sought of the following form, where Z is a normalisation constant:

    q(\theta) = \frac{1}{Z} \prod_{n} \tilde{f}_n(\theta).

The idea is to refine each factor \tilde{f}_n by making the approximating distribution

    q^{\text{new}}(\theta) \propto \tilde{f}_n(\theta) \prod_{j \neq n} \tilde{f}_j(\theta)

a closer approximation to

    f_n(\theta) \prod_{j \neq n} \tilde{f}_j(\theta).
Expectation Propagation

To proceed, factor n is first removed from the approximating distribution:

    q^{\backslash n}(\theta) = \frac{q(\theta)}{\tilde{f}_n(\theta)}.

A new distribution is defined by

    p^{n}(\theta) = \frac{1}{Z_n} f_n(\theta)\, q^{\backslash n}(\theta),

with normalisation constant

    Z_n = \int f_n(\theta)\, q^{\backslash n}(\theta)\, d\theta.

The factor \tilde{f}_n is updated such that the sufficient statistics of q are matched to those of p^n.

Algorithm 5 Expectation Propagation
  Initialise the approximating factors \tilde{f}_n(\theta)
  Initialise the approximating function q(\theta) \propto \prod_n \tilde{f}_n(\theta)
  repeat
    Choose a factor \tilde{f}_n(\theta) for refinement
    Set q^{\backslash n}(\theta) = q(\theta) / \tilde{f}_n(\theta)
    Set the sufficient statistics of q^{\text{new}}(\theta) to those of q^{\backslash n}(\theta) f_n(\theta), including Z_n = \int q^{\backslash n}(\theta) f_n(\theta)\, d\theta
    Set \tilde{f}_n(\theta) = Z_n\, q^{\text{new}}(\theta) / q^{\backslash n}(\theta)
  until convergence
Expectation Propagation for the Clutter Problem

Factorised Posterior Distribution

    f_n(\theta) = (1 - w)\, \mathcal{N}(x_n \mid \theta, I) + w\, \mathcal{N}(x_n \mid 0, aI) \quad \text{for } n = 1, \ldots, N,

    f_0(\theta) = \mathcal{N}(\theta \mid 0, bI).

Approximating Distribution
The approximating distribution is taken to be an isotropic Gaussian of the form

    q(\theta) = \mathcal{N}(\theta \mid m, vI),

treated as a product of N + 1 approximating factors

    q(\theta) \propto \prod_{n=0}^{N} \tilde{f}_n(\theta).

Each factor is defined as a scaled Gaussian, where \tilde{f}_n has scaling constant s_n, mean m_n and variance v_n I:

    \tilde{f}_n(\theta) = s_n\, \mathcal{N}(\theta \mid m_n, v_n I).

The approximating distribution is therefore Gaussian, q(\theta) = \mathcal{N}(\theta \mid m, vI), and the evidence approximation

    p(D) \approx \int \prod_{n=0}^{N} \tilde{f}_n(\theta)\, d\theta

is obtained using results for products of Gaussians.
Expectation Propagation for the Clutter Problem

Algorithm 6 Expectation Propagation for the Clutter Problem
  Initialise the prior factor \tilde{f}_0(\theta) = \mathcal{N}(\theta \mid 0, bI)
  Initialise the remaining factors: \tilde{f}_n = 1, v_n = \infty, m_n = 0
  Initialise the approximating function: m = 0, v = b
  repeat
    for n = 1, ..., N do
      Remove the current factor from q(\theta):

          (v^{\backslash n})^{-1} = v^{-1} - v_n^{-1},
          m^{\backslash n} = m + v_n^{-1} v^{\backslash n} (m - m_n).

      Evaluate the new posterior q(\theta) by matching sufficient statistics:

          Z_n = (1 - w)\, \mathcal{N}\big(x_n \mid m^{\backslash n}, (v^{\backslash n} + 1)I\big) + w\, \mathcal{N}(x_n \mid 0, aI),
          \rho_n = 1 - \frac{w}{Z_n} \mathcal{N}(x_n \mid 0, aI),
          m = m^{\backslash n} + \rho_n \frac{v^{\backslash n}}{v^{\backslash n} + 1} (x_n - m^{\backslash n}),
          v = v^{\backslash n} - \rho_n \frac{(v^{\backslash n})^2}{v^{\backslash n} + 1} + \rho_n (1 - \rho_n) \frac{(v^{\backslash n})^2 \| x_n - m^{\backslash n} \|^2}{D (v^{\backslash n} + 1)^2}.

      Evaluate and store the new factor \tilde{f}_n(\theta):

          v_n^{-1} = v^{-1} - (v^{\backslash n})^{-1},
          m_n = m^{\backslash n} + (v_n + v^{\backslash n})(v^{\backslash n})^{-1} (m - m^{\backslash n}),
          s_n = \frac{Z_n}{\mathcal{N}\big(m_n \mid m^{\backslash n}, (v_n + v^{\backslash n}) I\big)}.

    end for
  until convergence
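A one-dimensional sketch of Algorithm 6 on synthetic data follows; the values w = 0.5, a = 10, b = 100, the toy data, the fixed number of sweeps and the handling of the uninitialised (v_n = ∞) factors are illustrative choices, and the scaling constants s_n (and hence p(D)) are omitted for brevity.

```python
# 1-D sketch of EP for the clutter problem (Algorithm 6); all numerical
# settings are illustrative, and p(D) via the s_n constants is omitted.
import numpy as np

def gauss(x, mean, var):
    return np.exp(-0.5 * (x - mean)**2 / var) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(6)
w, a, b = 0.5, 10.0, 100.0
theta_true, N = 2.0, 20
clutter = rng.random(N) < w
x = np.where(clutter, rng.normal(0.0, np.sqrt(a), N), rng.normal(theta_true, 1.0, N))

v_site = np.full(N, np.inf)          # v_n = inf encodes f~_n = 1
m_site = np.zeros(N)
m, v = 0.0, b                        # q(theta) = N(theta | m, v), prior factor included

for _ in range(20):                  # a fixed number of sweeps for the sketch
    for n in range(N):
        # Remove the current factor (cavity distribution q^\n); 1/inf = 0
        v_cav = 1.0 / (1.0 / v - 1.0 / v_site[n])
        m_cav = m + v_cav / v_site[n] * (m - m_site[n])
        # Match sufficient statistics against f_n(theta) q^\n(theta)
        Zn = (1.0 - w) * gauss(x[n], m_cav, v_cav + 1.0) + w * gauss(x[n], 0.0, a)
        rho = 1.0 - w * gauss(x[n], 0.0, a) / Zn
        m_new = m_cav + rho * v_cav / (v_cav + 1.0) * (x[n] - m_cav)
        v_new = (v_cav - rho * v_cav**2 / (v_cav + 1.0)
                 + rho * (1.0 - rho) * v_cav**2 * (x[n] - m_cav)**2 / (v_cav + 1.0)**2)
        # Store the refined site parameters and update q
        v_site[n] = 1.0 / (1.0 / v_new - 1.0 / v_cav)
        m_site[n] = m_cav + (v_site[n] + v_cav) / v_cav * (m_new - m_cav)
        m, v = m_new, v_new
# m and v now approximate the posterior mean and variance of theta
```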
Expectation Propagation for the Clutter Problem

[Figure: EP posterior approximation for the clutter problem]
Summary of Methods for the Clutter Problem

[Figure/table: comparison of the inference methods on the clutter problem]

T. Minka (2001a), T. Minka (2001b)
Further Reading and Acknowledgments
MSc Dissertation
L. Ellam.
Approximate Bayesian Inference for Machine Learning.
The MSc dissertation may be downloaded from here.
Software
The Python code for implementing the various algorithms reported in this presentation may be obtained by
clicking on the respective figures. All software and datasets may be obtained in zip form from here.
Acknowledgments
Professor N. Zabaras (Thesis Advisor) and Dr P. Brommer (Second Marker)
EPSRC Strategic Package Project EP/L027682/1 for research at WCPM