
Derivations for Generalized Gaussian Process Models
Antoni B. Chan
Daxiang Dong
Department of Computer Science
City University of Hong Kong
abchan@cityu.edu.hk, daxidong@cityu.edu.hk
Abstract
This is the supplemental material for the CVPR 2011 paper, “Generalized Gaussian Process
Models” [1]. It contains derivations for the Taylor and Laplace approximations, and a summary
of EP.
I. CLOSED-FORM TAYLOR APPROXIMATION
In this section, we derive the closed-form Taylor approximation proposed in Section 5.1
of the paper [1].
A. Joint likelihood approximation
The joint likelihood of data and parameters is,
log p(y, η|X) = log p(y|θ(η)) + log p(η|X)   (S.1)
= Σ_{i=1}^n { (1/a(φ)) [y_i θ(η_i) − b(θ(η_i))] + log h(y_i, φ) } − (1/2) η^T K^{-1} η − (1/2) log|K| − (n/2) log 2π.   (S.2)
Note that the terms that prevent the posterior of η from being Gaussian are the functions b(θ(η_i)) and θ(η_i). Let us consider a second-order Taylor expansion of the likelihood term at the point η̃_i,
log p(y_i|θ(η_i)) = (1/a(φ)) [y_i θ(η_i) − b(θ(η_i))] + log h(y_i, φ)   (S.3)
≈ log p(y_i|θ(η̃_i)) + ũ_i (η_i − η̃_i) − (1/2) w̃_i^{-1} (η_i − η̃_i)²,   (S.4)

where ũ_i = u(η̃_i, y_i) and w̃_i = w(η̃_i, y_i) as defined in (19). Summing over (S.4), we have the approximation

Σ_{i=1}^n log p(y_i|θ(η_i)) ≈ Σ_{i=1}^n [ log p(y_i|θ(η̃_i)) + ũ_i (η_i − η̃_i) − (1/2) w̃_i^{-1} (η_i − η̃_i)² ]   (S.5)
= − (1/2) (η − η̃)^T W̃^{-1} (η − η̃) + ũ^T (η − η̃) + log p(y|θ(η̃)),   (S.6)
where ũ is the vector of ũi , and W̃ = diag(w̃1 , . . . , w̃n ). Substituting the Taylor approximation into (S.2), we obtain an approximation to the joint posterior,
log q(y, η|X) = log p(y|θ(η̃)) − (1/2) log|K| − (n/2) log 2π − (1/2) (η − η̃)^T W̃^{-1} (η − η̃) + ũ^T (η − η̃) − (1/2) η^T K^{-1} η   (S.7)
= log p(y|θ(η̃)) − (1/2) log|K| − (n/2) log 2π − (1/2) (η − η̃ − W̃ũ)^T W̃^{-1} (η − η̃ − W̃ũ) − (1/2) η^T K^{-1} η + (1/2) ũ^T W̃ ũ.   (S.8)
Next, we note that the first two terms of the second line are of the form
(x − a)^T B (x − a) + x^T C x = x^T D x − 2 x^T B a + a^T B a   (S.9)
= x^T D x − 2 x^T D D^{-1} B a + a^T B^T D^{-1} B a − a^T B^T D^{-1} B a + a^T B a   (S.10)
= ‖x − D^{-1} B a‖²_{D^{-1}} + a^T (B − B^T D^{-1} B) a   (S.11)
= ‖x − D^{-1} B a‖²_{D^{-1}} + ‖a‖²_{B^{-1}+C^{-1}},   (S.12)

where D = B + C, the norm is ‖x‖²_M = x^T M^{-1} x, and the last line uses the matrix inversion lemma. Using this property (with x = η, a = η̃ + W̃ũ, B = W̃^{-1}, C = K^{-1}), the joint likelihood can be rewritten as
log q(y, η|X) = log p(y|θ(η̃)) − (1/2) log|K| − (n/2) log 2π − (1/2) ‖η − A^{-1} W̃^{-1} t̃‖²_{A^{-1}} − (1/2) ‖t̃‖²_{W̃+K} + (1/2) ũ^T W̃ ũ,   (S.13)
where A = W̃−1 + K−1 , and t̃ = η̃ + W̃ũ is the target vector.
B. Approximate Posterior
Removing terms in (S.13) that are not dependent on η, the approximate posterior of η is

log q(η|X, y) ∝ log q(y, η|X) ∝ − (1/2) ‖η − A^{-1} W̃^{-1} t̃‖²_{A^{-1}}   (S.14)
⇒ q(η|X, y) = N(η | A^{-1} W̃^{-1} t̃, A^{-1}),   (S.15)

and hence

m̂ = V̂ W̃^{-1} t̃,   V̂ = (W̃^{-1} + K^{-1})^{-1}.   (S.16)
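For concreteness, the computation in (S.16) can be sketched in a few lines of numpy. This is only an illustrative sketch, not code from the paper: the callables u_fn and w_fn are hypothetical placeholders for the model-specific functions u(η̃_i, y_i) and w(η̃_i, y_i) of (19), and the matrix inversion lemma is used so that only (K + W̃) is factorized.

```python
import numpy as np

def taylor_posterior(K, y, eta_tilde, u_fn, w_fn):
    """Posterior mean and covariance (S.16) for the closed-form Taylor approximation."""
    u = u_fn(eta_tilde, y)            # vector of u_i = u(eta_tilde_i, y_i)
    w = w_fn(eta_tilde, y)            # vector of w_i (diagonal of W_tilde)
    t = eta_tilde + w * u             # target vector t = eta_tilde + W u
    KW = K + np.diag(w)               # K + W_tilde
    # V_hat = (W^-1 + K^-1)^-1 = K - K (K + W)^-1 K   (matrix inversion lemma)
    V_hat = K - K @ np.linalg.solve(KW, K)
    # m_hat = V_hat W^-1 t = K (K + W)^-1 t
    m_hat = K @ np.linalg.solve(KW, t)
    return m_hat, V_hat
```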
C. Approximate Marginal
The approximate marginal is obtained by substituting the approximate joint in (S.13)
log p(y|X) = log ∫ exp(log p(y, η|X)) dη ≈ log ∫ exp(log q(y, η|X)) dη   (S.17)
= log p(y|θ(η̃)) − (1/2) log|K| − (n/2) log 2π − (1/2) ‖t̃‖²_{W̃+K} + (1/2) ũ^T W̃ ũ + log ∫ exp( − (1/2) ‖η − A^{-1} W̃^{-1} t̃‖²_{A^{-1}} ) dη   (S.18)
= log p(y|θ(η̃)) − (1/2) log|K| − (n/2) log 2π − (1/2) ‖t̃‖²_{W̃+K} + (1/2) ũ^T W̃ ũ + log( (2π)^{n/2} |A^{-1}|^{1/2} )   (S.19)
= log p(y|θ(η̃)) − (1/2) log(|K| |A|) − (1/2) ‖t̃‖²_{W̃+K} + (1/2) ũ^T W̃ ũ.   (S.20)

Looking at the determinant term,

log(|A| |K|) = log|(W̃^{-1} + K^{-1}) K| = log|I + W̃^{-1} K| = log(|W̃ + K| |W̃^{-1}|).   (S.21)
Hence, the approximate marginal is
log q(y|X) = − (1/2) t̃^T (W̃ + K)^{-1} t̃ − (1/2) log|W̃ + K| + r(φ),   (S.22)

where

r(φ) = log p(y|θ(η̃)) + (1/2) ũ^T W̃ ũ + (1/2) log|W̃|.   (S.23)

For an individual data point, we have

r_i(φ) = log p(y_i|θ(η̃_i)) + (1/2) w̃_i ũ_i² + (1/2) log|w̃_i|,   (S.24)

and hence

r(φ) = Σ_{i=1}^n r_i(φ).   (S.25)
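A corresponding numerical sketch of (S.22)-(S.25) is given below, again with hypothetical model callables (log_lik_fn returning the vector of log p(y_i|θ(η̃_i))); a Cholesky factor of W̃ + K supplies both the quadratic term and the log-determinant, and W̃ + K is assumed positive definite.

```python
import numpy as np

def taylor_log_marginal(K, y, eta_tilde, u, w, log_lik_fn):
    """Approximate log-marginal (S.22): -0.5 t'(W+K)^-1 t - 0.5 log|W+K| + r(phi)."""
    t = eta_tilde + w * u
    L = np.linalg.cholesky(K + np.diag(w))               # assumes W_tilde + K is positive definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))  # (W + K)^-1 t
    # r(phi) from (S.23)-(S.25)
    r = np.sum(log_lik_fn(eta_tilde, y)) + 0.5 * np.sum(w * u**2) \
        + 0.5 * np.sum(np.log(np.abs(w)))
    return -0.5 * t @ alpha - np.sum(np.log(np.diag(L))) + r
```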
1) Derivatives wrt hyperparameters: The derivative of (S.22) with respect to the kernel
hyperparameter αj is
∂/∂α_j log q(y|X) = (1/2) t̃^T (K + W̃)^{-1} (∂K/∂α_j) (K + W̃)^{-1} t̃ − (1/2) tr( (K + W̃)^{-1} (∂K/∂α_j) )   (S.26)
= (1/2) tr( [z z^T − (K + W̃)^{-1}] (∂K/∂α_j) ),   z = (K + W̃)^{-1} t̃,   (S.27)

where ∂K/∂α_j is the element-wise derivative of the kernel matrix with respect to the kernel hyperparameter α_j, and we use the derivative properties

∂/∂α A^{-1} = − A^{-1} (∂A/∂α) A^{-1},   ∂/∂α log|A| = tr( A^{-1} (∂A/∂α) ).   (S.28)
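In code, (S.27) reduces to a single trace; a minimal sketch, where dK is the matrix ∂K/∂α_j assumed to be supplied by the kernel implementation:

```python
import numpy as np

def dlogq_dalpha(K, w, t, dK):
    """Kernel hyperparameter gradient (S.27) of the Taylor approximate log-marginal."""
    Binv = np.linalg.inv(K + np.diag(w))   # (K + W_tilde)^-1; a solve is cheaper in practice
    z = Binv @ t                           # z = (K + W)^-1 t
    return 0.5 * np.trace((np.outer(z, z) - Binv) @ dK)
```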
2) Derivative wrt dispersion: For the derivative with respect to the dispersion parameter,
we first note that
∂/∂φ w̃_i = a′(φ) [ b″(θ(η)) θ′(η)² − (y − b′(θ(η))) θ″(η) ]^{-1} = (a′(φ)/a(φ)) w̃_i,   (S.29)
∂/∂φ W̃ = (a′(φ)/a(φ)) W̃,   (S.30)
∂/∂φ ũ_i = − (a′(φ)/a(φ)²) θ′(η) [y − b′(θ(η))] = − (a′(φ)/a(φ)) ũ_i,   (S.31)
∂/∂φ (w̃_i ũ_i²) = (a′(φ)/a(φ)) w̃_i ũ_i² − 2 (a′(φ)/a(φ)) w̃_i ũ_i² = − (a′(φ)/a(φ)) w̃_i ũ_i²,   (S.32)
∂/∂φ log p(y_i|θ(η̃_i)) = ∂/∂φ { (1/a(φ)) [y_i θ(η̃_i) − b(θ(η̃_i))] + log h(y_i, φ) }   (S.33)
= − (a′(φ)/a(φ)²) [y_i θ(η̃_i) − b(θ(η̃_i))] + ∂/∂φ log h(y_i, φ)   (S.34)
= − (a′(φ)/a(φ)) ṽ_i + ∂/∂φ log h(y_i, φ),   (S.35)

where ṽ_i = (1/a(φ)) [y_i θ(η̃_i) − b(θ(η̃_i))]. Thus,
∂/∂φ r_i(φ) = − (a′(φ)/a(φ)) ṽ_i + ∂/∂φ log h(y_i, φ) − (1/2) (a′(φ)/a(φ)) w̃_i ũ_i² + (1/2) (1/w̃_i) (a′(φ)/a(φ)) w̃_i   (S.36)
= (a′(φ)/a(φ)) [ 1/2 − ṽ_i − (1/2) w̃_i ũ_i² ] + ∂/∂φ log h(y_i, φ).   (S.37)

Summing over i,

∂/∂φ r(φ) = Σ_i { (a′(φ)/a(φ)) [ 1/2 − ṽ_i − (1/2) w̃_i ũ_i² ] + ∂/∂φ log h(y_i, φ) }   (S.38)
= (a′(φ)/a(φ)) [ n/2 − 1^T ṽ − (1/2) ũ^T W̃ ũ ] + Σ_{i=1}^n ∂/∂φ log h(y_i, φ).   (S.39)
Also note that t̃ is not a function of φ, as the term cancels out in W̃ũ. Finally,
∂/∂φ log q(y|X)   (S.40)
= (1/2) t̃^T (K + W̃)^{-1} (∂W̃/∂φ) (K + W̃)^{-1} t̃ − (1/2) tr( (K + W̃)^{-1} (∂W̃/∂φ) ) + ∂/∂φ r(φ)
= (1/2) tr( [z z^T − (K + W̃)^{-1}] (∂W̃/∂φ) ) + (a′(φ)/a(φ)) [ n/2 − 1^T ṽ − (1/2) ũ^T W̃ ũ ] + Σ_{i=1}^n ∂/∂φ log h(y_i, φ)   (S.41)
= (a′(φ)/a(φ)) { (1/2) tr( [z z^T − (K + W̃)^{-1}] W̃ ) + n/2 − 1^T ṽ − (1/2) ũ^T W̃ ũ } + Σ_{i=1}^n ∂/∂φ log h(y_i, φ)   (S.42)
= (a′(φ)/a(φ)) { (1/2) z^T W̃ z − (1/2) tr( (K + W̃)^{-1} W̃ ) + n/2 − 1^T ṽ − (1/2) ũ^T W̃ ũ } + Σ_{i=1}^n ∂/∂φ log h(y_i, φ).   (S.43)
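A sketch of the dispersion gradient (S.43) follows; the vectors v, u, w, the target vector t, the scalars a and da (values of a(φ) and a′(φ)), and the vector dlogh of ∂/∂φ log h(y_i, φ) are all assumed to be supplied by the specific GGPM, so this is only an illustration of the assembly of the terms.

```python
import numpy as np

def dlogq_dphi(K, w, u, v, t, a, da, dlogh):
    """Dispersion gradient (S.43); v_i = [y_i theta(eta_i) - b(theta(eta_i))]/a(phi)."""
    n = len(u)
    Binv = np.linalg.inv(K + np.diag(w))          # (K + W_tilde)^-1
    z = Binv @ t                                  # z = (K + W)^-1 t
    inner = 0.5 * z @ (w * z) - 0.5 * np.sum(np.diag(Binv) * w) \
            + 0.5 * n - np.sum(v) - 0.5 * np.sum(w * u**2)
    return (da / a) * inner + np.sum(dlogh)
```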
3) Derivative wrt dispersion when bφ (θ): We next look at the special case where the
term bφ (θ) is also a function of φ. We first note that
∂/∂φ w̃_i = a′(φ) [ b″_φ(θ(η)) θ′(η)² − (y − b′_φ(θ(η))) θ″(η) ]^{-1}
  − a(φ) [ θ′(η)² ∂/∂φ b″_φ(θ(η)) + θ″(η) ∂/∂φ b′_φ(θ(η)) ] / [ b″_φ(θ(η)) θ′(η)² − (y − b′_φ(θ(η))) θ″(η) ]²   (S.44)
= (a′(φ)/a(φ)) w̃_i − (1/a(φ)) w̃_i² [ θ′(η)² ∂/∂φ b″_φ(θ(η)) + θ″(η) ∂/∂φ b′_φ(θ(η)) ],   (S.45)

∂/∂φ ũ_i = − (a′(φ)/a(φ)²) θ′(η) [ y − b′_φ(θ(η)) ] − (θ′(η)/a(φ)) ∂/∂φ b′_φ(θ(η))   (S.46)
= − (a′(φ)/a(φ)) ũ_i − (θ′(η)/a(φ)) ∂/∂φ b′_φ(θ(η)),   (S.47)

∂/∂φ (w̃_i ũ_i²) = (∂w̃_i/∂φ) ũ_i² + 2 w̃_i ũ_i (∂ũ_i/∂φ),   (S.48)

∂/∂φ t̃_i = ∂/∂φ (η̃_i + w̃_i ũ_i) = w̃_i (∂ũ_i/∂φ) + ũ_i (∂w̃_i/∂φ),   (S.49)

∂/∂φ log p(y_i|θ(η̃_i)) = ∂/∂φ { (1/a(φ)) [y_i θ(η̃_i) − b_φ(θ(η̃_i))] + log h(y_i, φ) }   (S.50)
= − (a′(φ)/a(φ)²) [y_i θ(η̃_i) − b_φ(θ(η̃_i))] − (1/a(φ)) ∂/∂φ b_φ(θ(η̃_i)) + ∂/∂φ log h(y_i, φ)   (S.51)
= − (a′(φ)/a(φ)) ṽ_i + ∂/∂φ log h(y_i, φ) − (1/a(φ)) ∂/∂φ b_φ(θ(η̃_i)).   (S.52)
Thus,

∂/∂φ r_i(φ) = ∂/∂φ log p(y_i|θ(η̃_i)) + (1/2) [ (∂w̃_i/∂φ) ũ_i² + 2 w̃_i ũ_i (∂ũ_i/∂φ) ] + (1/2) (1/w̃_i) (∂w̃_i/∂φ).   (S.53)
For the first term in (S.22),
∂/∂φ [ t̃^T (W̃ + K)^{-1} t̃ ] = t̃^T [ ∂(W̃ + K)^{-1}/∂φ ] t̃ + 2 t̃^T (W̃ + K)^{-1} (∂t̃/∂φ)   (S.54)
= − t̃^T (K + W̃)^{-1} (∂W̃/∂φ) (K + W̃)^{-1} t̃ + 2 t̃^T (W̃ + K)^{-1} (∂t̃/∂φ).   (S.55)

For the second term in (S.22),

∂/∂φ log|K + W̃| = tr( (K + W̃)^{-1} (∂W̃/∂φ) ).   (S.56)

Finally, we have

∂/∂φ log q(y|X) = (1/2) t̃^T (K + W̃)^{-1} (∂W̃/∂φ) (K + W̃)^{-1} t̃ − t̃^T (W̃ + K)^{-1} (∂t̃/∂φ) − (1/2) tr( (K + W̃)^{-1} (∂W̃/∂φ) ) + Σ_i ∂/∂φ r_i(φ).   (S.57)
II. LAPLACE APPROXIMATION
In this section, we present more details of the Laplace approximation for GGPMs, which is discussed in Section 5.2 of the paper. The Laplace approximation is a Gaussian approximation of the posterior p(η|X, y) using a second-order Taylor expansion at its maximum (mode). Hence, the Laplace approximation is a special case of the closed-form Taylor approximation in the previous section, where the expansion points η̃_i are selected so that m̂ is the maximum of the true posterior p(η|X, y).
A. Approximate Posterior
The maximum of the posterior is given by
η̂ = argmax_η log p(η|X, y) = argmax_η [ log p(y|θ(η)) + log p(η|X) ].   (S.58)

In vector notation, the derivatives of the posterior are

∂/∂η log p(η|X, y) = u − K^{-1} η,   (S.59)
∂²/∂η ∂η^T log p(η|X, y) = − W^{-1} − K^{-1}.   (S.60)

At the maximum of the posterior, we have the condition

∂/∂η log p(η|X, y) = û − K^{-1} η̂ = 0   (S.61)
⇒ û = K^{-1} η̂,   (S.62)

where û is u evaluated at η̂. The maximum can be obtained iteratively using the Newton-Raphson method. In particular, each iteration is

η̂^(new) = η̂ − [ ∂²/∂η ∂η^T log p(η̂|X, y) ]^{-1} ∂/∂η log p(η̂|X, y)   (S.63)
= η̂ + (Ŵ^{-1} + K^{-1})^{-1} (û − K^{-1} η̂)   (S.64)
= (Ŵ^{-1} + K^{-1})^{-1} (û + Ŵ^{-1} η̂)   (S.65)
= (Ŵ^{-1} + K^{-1})^{-1} Ŵ^{-1} (Ŵ û + η̂),   (S.66)

where Ŵ is W evaluated at η̂. Note that this is exactly the same form as the closed-form Taylor expansion in the previous section. In particular, at each iteration the expansion point η̂ is moved closer to the maximum. As a result, the target vector t̂^(new) = Ŵ û + η̂ is also updated.
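A sketch of the Newton-Raphson iteration (S.66) is given below. As before, u_fn and w_fn are hypothetical placeholders for the model's u(η, y) and w(η, y); no step-size control or other safeguards are included, so this only illustrates the basic update.

```python
import numpy as np

def laplace_mode(K, y, u_fn, w_fn, max_iter=100, tol=1e-8):
    """Newton-Raphson iteration (S.66) for the posterior mode eta_hat."""
    eta = np.zeros(len(y))
    for _ in range(max_iter):
        u = u_fn(eta, y)
        w = w_fn(eta, y)
        t = eta + w * u                                      # target vector t = W u + eta
        eta_new = K @ np.linalg.solve(K + np.diag(w), t)     # (W^-1 + K^-1)^-1 W^-1 t
        if np.max(np.abs(eta_new - eta)) < tol:
            return eta_new
        eta = eta_new
    return eta
```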
B. Approximate Marginal
The marginal likelihood is approximated by applying a Taylor approximation to the
marginal likelihood at the mode of the posterior, η̂. The marginal distribution is
p(y|X) = ∫ p(y|θ(η)) p(η|X) dη = ∫ e^{κ(η)} dη,   (S.67)

where

κ(η) = log p(y|θ(η)) + log p(η|X).   (S.68)

The second-order Taylor approximation of κ(η) around η̂ is

κ(η) ≈ κ(η̂) − (1/2) ‖η − η̂‖²_{V̂}   (S.69)
= log p(y|θ(η̂)) + log p(η̂|X) − (1/2) ‖η − η̂‖²_{V̂}.   (S.70)

Substituting into (S.67), we obtain the approximate marginal distribution

log q(y|X) = log p(y|θ(η̂)) + log p(η̂|X) + log ∫ exp( − (1/2) ‖η − η̂‖²_{V̂} ) dη   (S.71)
= log p(y|θ(η̂)) − (1/2) η̂^T K^{-1} η̂ − (1/2) log|K| − (n/2) log 2π + (n/2) log 2π + (1/2) log|V̂|   (S.72)
= log p(y|θ(η̂)) − (1/2) η̂^T K^{-1} η̂ − (1/2) log|(Ŵ^{-1} + K^{-1}) K|   (S.73)
= log p(y|θ(η̂)) − (1/2) η̂^T K^{-1} η̂ − (1/2) log|Ŵ^{-1} K + I|.   (S.74)

Note that η̂ is dependent on the kernel matrix K and a(φ). Hence, at each iteration during optimization of q(y|X), we have to recompute its value.
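Given the mode, (S.74) can be evaluated directly; a sketch with the same hypothetical helpers as above, assuming all ŵ_i > 0 so that log|Ŵ| is real:

```python
import numpy as np

def laplace_log_marginal(K, y, eta_hat, w_fn, log_lik_fn):
    """Laplace approximate log-marginal (S.74)."""
    w = w_fn(eta_hat, y)
    fit = np.sum(log_lik_fn(eta_hat, y))
    quad = -0.5 * eta_hat @ np.linalg.solve(K, eta_hat)
    # log|W^-1 K + I| = log|K + W| - log|W|
    _, logdet_KW = np.linalg.slogdet(K + np.diag(w))
    return fit + quad - 0.5 * (logdet_KW - np.sum(np.log(w)))
```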
1) Derivative wrt hyperparameters: We next look at the derivatives of each term in
(S.74). Note that η̂ is a function of the kernel hyperparameters. Using the chain rule, the
derivative of the first term is
∂/∂α_j log p(y|θ(η̂)) = [ ∂/∂η̂^T log p(y|θ(η̂)) ] (∂η̂/∂α_j) = û^T (∂η̂/∂α_j).   (S.75)
For the second term, we use the product rule ∂(AB)/∂α = A (∂B/∂α) + (∂A/∂α) B:

∂/∂α_j [ η̂^T K^{-1} η̂ ] = ∂/∂α_j [ η̂^T (K^{-1} η̂) ]   (S.76)
= η̂^T ∂/∂α_j (K^{-1} η̂) + (∂η̂^T/∂α_j) K^{-1} η̂   (S.77)
= η̂^T K^{-1} (∂η̂/∂α_j) + η̂^T (∂K^{-1}/∂α_j) η̂ + (∂η̂^T/∂α_j) K^{-1} η̂   (S.78)
= − η̂^T K^{-1} (∂K/∂α_j) K^{-1} η̂ + 2 η̂^T K^{-1} (∂η̂/∂α_j).   (S.79)

For the third term,

∂/∂α_j log|Ŵ^{-1} K + I| = tr( (Ŵ^{-1} K + I)^{-1} ∂/∂α_j (Ŵ^{-1} K) )   (S.80)
= tr( (Ŵ^{-1} K + I)^{-1} [ Ŵ^{-1} (∂K/∂α_j) + (∂Ŵ^{-1}/∂α_j) K ] )   (S.81)
= tr( (K + Ŵ)^{-1} (∂K/∂α_j) ) + tr( (Ŵ^{-1} + K^{-1})^{-1} (∂Ŵ^{-1}/∂α_j) ).   (S.82)
Putting it all together, we have
∂/∂α_j log q(y|X) = û^T (∂η̂/∂α_j) + (1/2) η̂^T K^{-1} (∂K/∂α_j) K^{-1} η̂ − η̂^T K^{-1} (∂η̂/∂α_j)
  − (1/2) tr( (K + Ŵ)^{-1} (∂K/∂α_j) ) − (1/2) tr( (Ŵ^{-1} + K^{-1})^{-1} (∂Ŵ^{-1}/∂α_j) )   (S.83)
= (1/2) tr( [û û^T − (K + Ŵ)^{-1}] (∂K/∂α_j) ) − (1/2) tr( (Ŵ^{-1} + K^{-1})^{-1} (∂Ŵ^{-1}/∂α_j) ),   (S.84)
which follows from û = K−1 η̂.
Next, we note that Ŵ is a function of η̂, which in turn is a function of the kernel
hyperparameter αj , and thus
∂Ŵ^{-1}/∂α_j = Σ_{i=1}^n (∂Ŵ^{-1}/∂η̂_i) (∂η̂_i/∂α_j).   (S.85)

The first term is

∂Ŵ^{-1}/∂η̂_i = [ − ∂³/∂η̂_i³ log p(y_i|θ(η̂_i)) ] e_i e_i^T.   (S.86)

Elements of the second term are given by

∂η̂/∂α_j = ∂/∂α_j [ (Ŵ^{-1} + K^{-1})^{-1} Ŵ^{-1} t̂ ]   (S.87)
= − (Ŵ^{-1} + K^{-1})^{-1} (∂K^{-1}/∂α_j) (Ŵ^{-1} + K^{-1})^{-1} Ŵ^{-1} t̂   (S.88)
= (Ŵ^{-1} + K^{-1})^{-1} K^{-1} (∂K/∂α_j) K^{-1} (Ŵ^{-1} + K^{-1})^{-1} Ŵ^{-1} t̂   (S.89)
= Ŵ (Ŵ + K)^{-1} (∂K/∂α_j) û.   (S.90)

Hence, (S.85) can be expressed as

∂Ŵ^{-1}/∂α_j = diag( A (∂η̂/∂α_j) ),   (S.91)

where A is a diagonal matrix with entries

A = diag( − ∂³/∂η̂_i³ log p(y_i|θ(η̂_i)) ),   (S.92)

and

− ∂³/∂η̂_i³ log p(y_i|θ(η̂_i)) = ∂/∂η̂_i { (1/a(φ)) [ θ′(η̂_i) [g^{-1}]′(η̂_i) − θ″(η̂_i) (y − g^{-1}(η̂_i)) ] }   (S.93)
= (1/a(φ)) [ θ′(η̂_i) [g^{-1}]″(η̂_i) + θ″(η̂_i) [g^{-1}]′(η̂_i) + θ″(η̂_i) [g^{-1}]′(η̂_i) − θ‴(η̂_i) (y − g^{-1}(η̂_i)) ]   (S.94)
= (1/a(φ)) [ θ′(η̂_i) [g^{-1}]″(η̂_i) + 2 θ″(η̂_i) [g^{-1}]′(η̂_i) − θ‴(η̂_i) (y − g^{-1}(η̂_i)) ].   (S.95)
2) Derivative wrt dispersion: Looking at the derivatives of each term in (S.74), for the
first term,
∂/∂φ log p(y_i|θ(η̂_i)) = ∂/∂φ { (1/a(φ)) [y_i θ(η̂_i) − b(θ(η̂_i))] + log h(y_i, φ) }   (S.96)
= − (a′(φ)/a(φ)²) [y_i θ(η̂_i) − b(θ(η̂_i))] + (1/a(φ)) ∂/∂φ [y_i θ(η̂_i) − b(θ(η̂_i))] + ∂/∂φ log h(y_i, φ)   (S.97)
= − (a′(φ)/a(φ)) v̂_i + (1/a(φ)) ∂/∂η̂_i [y_i θ(η̂_i) − b(θ(η̂_i))] (∂η̂_i/∂φ) + ∂/∂φ log h(y_i, φ)   (S.98)
= − (a′(φ)/a(φ)) v̂_i + û_i (∂η̂_i/∂φ) + ∂/∂φ log h(y_i, φ),   (S.99)

where v̂_i = v(η̂_i, y_i) and û_i = u(η̂_i, y_i). Hence,

∂/∂φ log p(y|θ(η̂)) = Σ_{i=1}^n [ − (a′(φ)/a(φ)) v̂_i + û_i (∂η̂_i/∂φ) + ∂/∂φ log h(y_i, φ) ]   (S.100)
= − (a′(φ)/a(φ)) 1^T v̂ + û^T (∂η̂/∂φ) + Σ_{i=1}^n ∂/∂φ log h(y_i, φ).   (S.101)

The derivative of η̂ is

∂η̂/∂φ = ∂/∂φ [ (Ŵ^{-1} + K^{-1})^{-1} Ŵ^{-1} t̂ ] = ∂/∂φ [ (I + Ŵ K^{-1})^{-1} t̂ ]   (S.102)
= − (I + Ŵ K^{-1})^{-1} [ ∂/∂φ (Ŵ K^{-1}) ] (I + Ŵ K^{-1})^{-1} t̂   (S.103)
= − (a′(φ)/a(φ)) (I + Ŵ K^{-1})^{-1} Ŵ K^{-1} (I + Ŵ K^{-1})^{-1} t̂   (S.104)
= − (a′(φ)/a(φ)) (Ŵ^{-1} + K^{-1})^{-1} (K + Ŵ)^{-1} t̂   (S.105)
= − (a′(φ)/a(φ)) (Ŵ^{-1} + K^{-1})^{-1} û.   (S.106)

Hence,

∂/∂φ log p(y|θ(η̂)) = − (a′(φ)/a(φ)) [ 1^T v̂ + û^T (Ŵ^{-1} + K^{-1})^{-1} û ] + Σ_{i=1}^n ∂/∂φ log h(y_i, φ).   (S.107)

For the second term,

∂/∂φ [ (1/2) η̂^T K^{-1} η̂ ] = η̂^T K^{-1} (∂η̂/∂φ) = û^T (∂η̂/∂φ)   (S.108)
= − (a′(φ)/a(φ)) û^T (Ŵ^{-1} + K^{-1})^{-1} û.   (S.109)
For the third term,
∂/∂φ log|Ŵ^{-1} K + I| = tr( (Ŵ^{-1} K + I)^{-1} ∂/∂φ (Ŵ^{-1} K) )   (S.110)
= − tr( (Ŵ^{-1} K + I)^{-1} Ŵ^{-1} (∂Ŵ/∂φ) Ŵ^{-1} K )   (S.111)
= − (a′(φ)/a(φ)) tr( (Ŵ^{-1} K + I)^{-1} Ŵ^{-1} K )   (S.112)
= − (a′(φ)/a(φ)) tr( (K + Ŵ)^{-1} K ).   (S.113)

Finally, putting it all together,

∂/∂φ log q(y|X) = (a′(φ)/a(φ)) [ − 1^T v̂ + (1/2) tr( (K + Ŵ)^{-1} K ) ] + Σ_{i=1}^n ∂/∂φ log h(y_i, φ).   (S.114)
III. EXPECTATION PROPAGATION
In this section, we present more details on expectation propagation (EP) for GGPMs, which is discussed in Section 5.3 of the paper.
A. Computing the site parameters
Instead of computing the optimal site parameters all at once, EP works by iteratively
updating each individual site using the other site approximations [2], [3]. In particular,
assume that we are updating site ti . We first compute the cavity distribution, which is the
marginalization over all sites except ti ,
q_{¬i}(η_i) ∝ ∫ p(η|X) ∏_{j≠i} t_j(η_j | Z̃_j, μ̃_j, σ̃_j²) dη_j,   (S.115)

where the notation ¬i indicates the sites without t_i. q_{¬i}(η_i) is an approximation to the posterior distribution of η_i, given all observations except y_i. Since both terms are Gaussian, this integral can be computed in closed form, leading to the cavity distribution

q_{¬i}(η_i) = N(η_i | μ_{¬i}, σ²_{¬i}),   (S.116)
μ_{¬i} = σ²_{¬i} (σ_i^{-2} μ_i − σ̃_i^{-2} μ̃_i),   σ²_{¬i} = (σ_i^{-2} − σ̃_i^{-2})^{-1},   (S.117)

where σ_i² = [V̂]_{ii} and μ_i = [m̂]_i.
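In natural (precision) parameters the cavity computation (S.117) is a simple subtraction; a small sketch, assuming the resulting cavity precision is positive:

```python
def cavity(mu_i, sigma2_i, mu_site, sigma2_site):
    """Cavity distribution (S.117): removes site i from the marginal posterior of eta_i."""
    tau = 1.0 / sigma2_i - 1.0 / sigma2_site          # cavity precision, assumed > 0
    nu = mu_i / sigma2_i - mu_site / sigma2_site      # cavity precision-mean
    sigma2_cav = 1.0 / tau
    mu_cav = sigma2_cav * nu
    return mu_cav, sigma2_cav
```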
Next, multiplying the cavity distribution by the true data likelihood of yi gives an
approximation to the unnormalized posterior of ηi , given the observations,
q(η_i|X, y) = q_{¬i}(η_i) p(y_i|θ(η_i)).   (S.118)

Note that this incorporates the true likelihood of y_i and the approximate likelihoods of the remaining observations. On the other hand, the approximation to the posterior of η_i using the site function is

q̂(η_i) = q_{¬i}(η_i) t_i(η_i | Z̃_i, μ̃_i, σ̃_i²).   (S.119)

Hence, the new parameters for site t_i can be computed by minimizing the KL divergence between (S.118) and (S.119), i.e., between the approximate posterior using the true likelihood and that using the site t_i,

{Z̃_i*, μ̃_i*, σ̃_i*²} = argmin_{Z̃_i, μ̃_i, σ̃_i²} KL( q(η_i|X, y) ‖ q̂(η_i) )   (S.120)
= argmin_{Z̃_i, μ̃_i, σ̃_i²} KL( q_{¬i}(η_i) p(y_i|θ(η_i)) ‖ q_{¬i}(η_i) t_i(η_i | Z̃_i, μ̃_i, σ̃_i²) ).   (S.121)
Note that q̂(ηi ) in (S.119) is an unnormalized Gaussian, i.e. q̂(ηi ) = Ẑi N (ηi |µ̂i , σ̂i2 ), with
parameters (given by the product of Gaussians)
μ̂_i = σ̂_i² (σ_{¬i}^{-2} μ_{¬i} + σ̃_i^{-2} μ̃_i),   (S.122)
σ̂_i² = (σ_{¬i}^{-2} + σ̃_i^{-2})^{-1},   (S.123)
Ẑ_i = Z̃_i (2π)^{-1/2} (σ̃_i² + σ²_{¬i})^{-1/2} exp( − (μ_{¬i} − μ̃_i)² / (2 (σ²_{¬i} + σ̃_i²)) ).   (S.124)
Hence, to compute (S.121), it suffices to first find the parameters of the unnormalized
Gaussian that minimizes the KL divergence to q(ηi |X, y), and then “subtract” q¬i (ηi ) from
this Gaussian. To find q̂(ηi ) we minimize the KL divergence,
{Ẑ_i*, μ̂_i*, σ̂_i*²} = argmin_{Ẑ_i, μ̂_i, σ̂_i²} KL( q_{¬i}(η_i) p(y_i|θ(η_i)) ‖ Ẑ_i N(η_i | μ̂_i, σ̂_i²) ).   (S.125)

In particular, it is well known that the KL divergence in (S.125) is minimized when the moments of the Gaussian match those of the first argument. In addition to the mean and variance (1st and 2nd moments), we must also match the normalization constant (0th moment), since q̂(η_i) is unnormalized. The optimal parameters are

μ̂_i = E_q[η_i] = (1/Ẑ_i) ∫ η_i q_{¬i}(η_i) p(y_i|θ(η_i)) dη_i,   (S.126)
σ̂_i² = var_q(η_i) = (1/Ẑ_i) ∫ (η_i − μ̂_i)² q_{¬i}(η_i) p(y_i|θ(η_i)) dη_i,   (S.127)
Ẑ_i = Z_q = ∫ q_{¬i}(η_i) p(y_i|θ(η_i)) dη_i,   (S.128)

where E_q and var_q are the expectation and variance with respect to q(η_i|X, y). These moments are discussed later in this section. Finally, the new parameters for site t_i are obtained by "subtracting" the two Gaussians, q_{¬i}(η_i) from q̂(η_i), leading to the site updates

μ̃_i = σ̃_i² (σ̂_i^{-2} μ̂_i − σ_{¬i}^{-2} μ_{¬i}),   (S.129)
σ̃_i² = (σ̂_i^{-2} − σ_{¬i}^{-2})^{-1},   (S.130)
Z̃_i = Ẑ_i √(2π(σ²_{¬i} + σ̃_i²)) exp( (μ_{¬i} − μ̃_i)² / (2 (σ²_{¬i} + σ̃_i²)) ).   (S.131)
EP iterates these site updates over all sites t_i, i.e., over all observations y_i, until convergence.
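Given the matched moments (S.126)-(S.128), the site update (S.129)-(S.131) is again a subtraction in natural parameters; a sketch, working with log Z̃_i rather than Z̃_i for numerical stability:

```python
import numpy as np

def update_site(Z_hat, mu_hat, sigma2_hat, mu_cav, sigma2_cav):
    """Site update (S.129)-(S.131) from the moments of q(eta_i|X,y)."""
    sigma2_site = 1.0 / (1.0 / sigma2_hat - 1.0 / sigma2_cav)                 # (S.130)
    mu_site = sigma2_site * (mu_hat / sigma2_hat - mu_cav / sigma2_cav)       # (S.129)
    logZ_site = (np.log(Z_hat)                                                # (S.131)
                 + 0.5 * np.log(2.0 * np.pi * (sigma2_cav + sigma2_site))
                 + (mu_cav - mu_site) ** 2 / (2.0 * (sigma2_cav + sigma2_site)))
    return logZ_site, mu_site, sigma2_site
```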
1) Computing the moments: Each EP iteration requires the moments of q(η_i) in (S.126)-(S.128), where

q(η_i) = (1/Ẑ_i) p(y_i|θ(η_i)) N(η_i | μ_{¬i}, σ²_{¬i}),   (S.132)
Ẑ_i = ∫ p(y_i|θ(η_i)) N(η_i | μ_{¬i}, σ²_{¬i}) dη_i.   (S.133)

Depending on the form of the likelihood p(y_i|θ(η_i)), the integrals may not be analytically tractable. Hence, approximate integration is required. In this paper, we use numerical integration with 10,000 equally spaced points between μ_{¬i} ± 5σ_{¬i}. We also tried Gaussian-Hermite quadrature to approximate the integral, but this caused EP to fail to converge in many cases, due to inaccuracy in the approximation.
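The grid-based integration described above might look as follows; log_lik_fn(eta, y_i) is a hypothetical helper, assumed to return log p(y_i|θ(η)) element-wise over a vector of η values:

```python
import numpy as np

def moments_by_grid(y_i, mu_cav, sigma2_cav, log_lik_fn, n_grid=10000):
    """Moments (S.126)-(S.128) of q(eta_i) by numerical integration on mu_cav +/- 5 sigma_cav."""
    s = np.sqrt(sigma2_cav)
    eta = np.linspace(mu_cav - 5.0 * s, mu_cav + 5.0 * s, n_grid)
    d_eta = eta[1] - eta[0]
    cav = np.exp(-0.5 * (eta - mu_cav) ** 2 / sigma2_cav) / np.sqrt(2.0 * np.pi * sigma2_cav)
    f = np.exp(log_lik_fn(eta, y_i)) * cav
    Z_hat = np.sum(f) * d_eta                                       # (S.128)
    mu_hat = np.sum(eta * f) * d_eta / Z_hat                        # (S.126)
    sigma2_hat = np.sum((eta - mu_hat) ** 2 * f) * d_eta / Z_hat    # (S.127)
    return Z_hat, mu_hat, sigma2_hat
```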
B. Marginal Likelihood
The EP approximation to the marginal log-likelihood is
log p(y|X) ≈ log q(y|X) = log Z_EP   (S.134)
= log ∫ q(y|θ(η)) p(η|X) dη   (S.135)
= log ∫ N(η | μ̃, Σ̃) ∏_{i=1}^n Z̃_i N(η | 0, K) dη   (S.136)
= Σ_{i=1}^n log Z̃_i + log ∫ N(μ̃ | η, Σ̃) N(η | 0, K) dη.   (S.137)

Hence, we have

log q(y|X) = Σ_{i=1}^n log Z̃_i + log N(μ̃ | 0, Σ̃ + K)   (S.138)
= − (1/2) μ̃^T (K + Σ̃)^{-1} μ̃ − (1/2) log|K + Σ̃| − (n/2) log 2π
  + Σ_i [ log Ẑ_i + (1/2) log 2π + (1/2) log(σ²_{¬i} + σ̃_i²) + (μ_{¬i} − μ̃_i)² / (2 (σ²_{¬i} + σ̃_i²)) ].   (S.139)

Finally,

log q(y|X) = − (1/2) μ̃^T (K + Σ̃)^{-1} μ̃ − (1/2) log|K + Σ̃| + Σ_i [ log Ẑ_i + (1/2) log(σ²_{¬i} + σ̃_i²) + (μ_{¬i} − μ̃_i)² / (2 (σ²_{¬i} + σ̃_i²)) ].   (S.140)
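A sketch of evaluating (S.140) from the per-site quantities; all arguments except K are assumed to be length-n vectors:

```python
import numpy as np

def ep_log_marginal(K, mu_site, sigma2_site, logZ_hat, mu_cav, sigma2_cav):
    """EP approximation (S.140) to the log-marginal likelihood."""
    L = np.linalg.cholesky(K + np.diag(sigma2_site))          # K + Sigma_tilde
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, mu_site))
    out = -0.5 * mu_site @ alpha - np.sum(np.log(np.diag(L)))
    out += np.sum(logZ_hat
                  + 0.5 * np.log(sigma2_cav + sigma2_site)
                  + (mu_cav - mu_site) ** 2 / (2.0 * (sigma2_cav + sigma2_site)))
    return out
```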
1) Derivatives wrt hyperparameters: Note that μ̃, Σ̃, μ_¬, and Σ_¬ are all implicitly functions of the kernel hyperparameter, due to the EP update steps. Hence, the derivative of the marginal is composed of the explicit derivative wrt the hyperparameter (∂/∂α_j) as well as several implicit derivatives (e.g., (∂μ̃/∂α_j)(∂/∂μ̃)). It can be shown that the implicit derivatives are all zero [4], and hence

∂ log q(y|X)/∂α_j = ∂ log q(y|X)/∂α_j |_explicit   (S.141)
= (1/2) μ̃^T (K + Σ̃)^{-1} (∂K/∂α_j) (K + Σ̃)^{-1} μ̃ − (1/2) tr( (K + Σ̃)^{-1} (∂K/∂α_j) )   (S.142)
= (1/2) tr( [z̃ z̃^T − (K + Σ̃)^{-1}] (∂K/∂α_j) ),   z̃ = (K + Σ̃)^{-1} μ̃.   (S.143)
2) Derivative wrt dispersion: The only term in the marginal that depends on φ directly
is log Ẑi . Its derivative is
∂/∂φ log Ẑ_i = ∂/∂φ log ∫ p(y_i|θ(η_i)) N(η_i | μ_{¬i}, σ²_{¬i}) dη_i   (S.144)
= (1/Ẑ_i) ∫ ∂/∂φ [ p(y_i|θ(η_i)) ] N(η_i | μ_{¬i}, σ²_{¬i}) dη_i   (S.145)
= (1/Ẑ_i) ∫ [ h′(y_i, φ)/h(y_i, φ) − (a′(φ)/a(φ)²) (θ(η_i) y_i − b(θ(η_i))) ] p(y_i|θ(η_i)) N(η_i | μ_{¬i}, σ²_{¬i}) dη_i   (S.146)
= ∂/∂φ log h(y_i, φ) − (a′(φ)/a(φ)²) ( y_i E_q[θ(η_i)] − E_q[b(θ(η_i))] ),   (S.147)

where E_q is the expectation under q(η_i|X, y). Hence, the derivative of the marginal wrt the dispersion parameter is

∂ log q(y|X)/∂φ = ∂ log q(y|X)/∂φ |_explicit = Σ_{i=1}^n ∂/∂φ log Ẑ_i   (S.148)
= Σ_{i=1}^n ∂/∂φ log h(y_i, φ) − (a′(φ)/a(φ)²) Σ_{i=1}^n ( y_i E_q[θ(η_i)] − E_q[b(θ(η_i))] ).   (S.149)
Hence, additional expectations Eq [θ(ηi )] and Eq [b(θ(ηi ))] are required.
REFERENCES
[1] A. B. Chan and D. Dong, "Generalized Gaussian process models," in IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[2] T. Minka, A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[4] M. Seeger, "Expectation propagation for exponential families," tech. rep., 2005.