Derivations for Generalized Gaussian Process Models

Antoni B. Chan and Daxiang Dong
Department of Computer Science, City University of Hong Kong
abchan@cityu.edu.hk, daxidong@cityu.edu.hk

Abstract: This is the supplemental material for the CVPR 2011 paper, "Generalized Gaussian Process Models" [1]. It contains derivations for the Taylor and Laplace approximations, and a summary of EP.

I. CLOSED-FORM TAYLOR APPROXIMATION

In this section, we derive the closed-form Taylor approximation proposed in Section 5.1 of the paper [1].

A. Joint likelihood approximation

The joint likelihood of the data and parameters is
\[
\log p(y,\eta|X) = \log p(y|\theta(\eta)) + \log p(\eta|X) \tag{S.1}
\]
\[
= \sum_{i=1}^n \left[ \frac{1}{a(\phi)}\left( y_i\theta(\eta_i) - b(\theta(\eta_i)) \right) + \log h(y_i,\phi) \right] - \frac{1}{2}\eta^T K^{-1}\eta - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi. \tag{S.2}
\]
Note that the terms that prevent the posterior of $\eta$ from being Gaussian are the functions $b(\theta(\eta_i))$ and $\theta(\eta_i)$. Let us consider a second-order Taylor expansion of the likelihood term at the point $\tilde\eta_i$,
\[
\log p(y_i|\theta(\eta_i)) = \frac{1}{a(\phi)}\left[ y_i\theta(\eta_i) - b(\theta(\eta_i)) \right] + \log h(y_i,\phi) \tag{S.3}
\]
\[
\approx \log p(y_i|\theta(\tilde\eta_i)) + \tilde u_i(\eta_i - \tilde\eta_i) - \frac{1}{2}\tilde w_i^{-1}(\eta_i - \tilde\eta_i)^2, \tag{S.4}
\]
where $\tilde u_i = u(\tilde\eta_i, y_i)$ and $\tilde w_i = w(\tilde\eta_i, y_i)$ as defined in (19). Summing over (S.4), we have the approximation
\[
\sum_{i=1}^n \log p(y_i|\theta(\eta_i)) \approx \sum_{i=1}^n \log p(y_i|\theta(\tilde\eta_i)) + \tilde u_i(\eta_i - \tilde\eta_i) - \frac{1}{2}\tilde w_i^{-1}(\eta_i - \tilde\eta_i)^2 \tag{S.5}
\]
\[
= -\frac{1}{2}(\eta-\tilde\eta)^T\tilde W^{-1}(\eta-\tilde\eta) + \tilde u^T(\eta-\tilde\eta) + \log p(y|\theta(\tilde\eta)), \tag{S.6}
\]
where $\tilde u$ is the vector of $\tilde u_i$, and $\tilde W = \mathrm{diag}(\tilde w_1,\dots,\tilde w_n)$. Substituting the Taylor approximation into (S.2), we obtain an approximation to the joint posterior,
\[
\log q(y,\eta|X) = \log p(y|\theta(\tilde\eta)) - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi - \frac{1}{2}(\eta-\tilde\eta)^T\tilde W^{-1}(\eta-\tilde\eta) + \tilde u^T(\eta-\tilde\eta) - \frac{1}{2}\eta^T K^{-1}\eta \tag{S.7}
\]
\[
= \log p(y|\theta(\tilde\eta)) - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi - \frac{1}{2}(\eta-\tilde\eta-\tilde W\tilde u)^T\tilde W^{-1}(\eta-\tilde\eta-\tilde W\tilde u) - \frac{1}{2}\eta^T K^{-1}\eta + \frac{1}{2}\tilde u^T\tilde W\tilde u. \tag{S.8}
\]
Next, we note that the first two terms of the second line are of the form
\[
(x-a)^T B(x-a) + x^T C x = x^T D x - 2x^T B a + a^T B a \tag{S.9}
\]
\[
= x^T D x - 2x^T D D^{-1} B a + a^T B^T D^{-1} B a - a^T B^T D^{-1} B a + a^T B a \tag{S.10}
\]
\[
= \left\| x - D^{-1} B a \right\|^2_{D^{-1}} + a^T\left(B - B^T D^{-1} B\right)a \tag{S.11}
\]
\[
= \left\| x - D^{-1} B a \right\|^2_{D^{-1}} + \|a\|^2_{B^{-1}+C^{-1}}, \tag{S.12}
\]
where $D = B + C$, and the last line uses the matrix inversion lemma. Using this property (with $x=\eta$, $a=\tilde\eta+\tilde W\tilde u$, $B=\tilde W^{-1}$, $C=K^{-1}$), the joint likelihood can be rewritten as
\[
\log q(y,\eta|X) = \log p(y|\theta(\tilde\eta)) - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi - \frac{1}{2}\left\|\tilde t\right\|^2_{\tilde W+K} + \frac{1}{2}\tilde u^T\tilde W\tilde u - \frac{1}{2}\left\|\eta - A^{-1}\tilde W^{-1}\tilde t\right\|^2_{A^{-1}}, \tag{S.13}
\]
where $A = \tilde W^{-1} + K^{-1}$, and $\tilde t = \tilde\eta + \tilde W\tilde u$ is the target vector.

B. Approximate Posterior

Removing terms in (S.13) that are not dependent on $\eta$, the approximate posterior of $\eta$ is
\[
\log q(\eta|X,y) \propto \log q(y,\eta|X) \propto -\frac{1}{2}\left\|\eta - A^{-1}\tilde W^{-1}\tilde t\right\|^2_{A^{-1}} \tag{S.14}
\]
\[
\Rightarrow\quad q(\eta|X,y) = \mathcal{N}\!\left(\eta \,\middle|\, A^{-1}\tilde W^{-1}\tilde t,\; A^{-1}\right), \tag{S.15}
\]
and hence
\[
\hat m = \hat V\tilde W^{-1}\tilde t, \qquad \hat V = \left(\tilde W^{-1} + K^{-1}\right)^{-1}. \tag{S.16}
\]

C. Approximate Marginal

The approximate marginal is obtained by substituting the approximate joint in (S.13),
\[
\log p(y|X) = \log\int \exp(\log p(y,\eta|X))\,d\eta \approx \log\int \exp(\log q(y,\eta|X))\,d\eta \tag{S.17}
\]
\[
= \log p(y|\theta(\tilde\eta)) - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi - \frac{1}{2}\left\|\tilde t\right\|^2_{\tilde W+K} + \frac{1}{2}\tilde u^T\tilde W\tilde u + \log\int e^{-\frac{1}{2}\|\eta - A^{-1}\tilde W^{-1}\tilde t\|^2_{A^{-1}}}\,d\eta \tag{S.18}
\]
\[
= \log p(y|\theta(\tilde\eta)) - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi - \frac{1}{2}\left\|\tilde t\right\|^2_{\tilde W+K} + \frac{1}{2}\tilde u^T\tilde W\tilde u + \frac{n}{2}\log 2\pi + \frac{1}{2}\log\left|A^{-1}\right| \tag{S.19}
\]
\[
= \log p(y|\theta(\tilde\eta)) - \frac{1}{2}\log\left(|K|\,|A|\right) - \frac{1}{2}\left\|\tilde t\right\|^2_{\tilde W+K} + \frac{1}{2}\tilde u^T\tilde W\tilde u. \tag{S.20}
\]
Looking at the determinant term,
\[
\log\left(|A|\,|K|\right) = \log\left|\left(\tilde W^{-1}+K^{-1}\right)K\right| = \log\left|I + \tilde W^{-1}K\right| = \log\left|\tilde W + K\right|\left|\tilde W^{-1}\right|. \tag{S.21}
\]
Hence, the approximate marginal is
\[
\log q(y|X) = -\frac{1}{2}\tilde t^T(\tilde W + K)^{-1}\tilde t - \frac{1}{2}\log\left|\tilde W + K\right| + r(\phi), \tag{S.22}
\]
where
\[
r(\phi) = \log p(y|\theta(\tilde\eta)) + \frac{1}{2}\tilde u^T\tilde W\tilde u + \frac{1}{2}\log\left|\tilde W\right|. \tag{S.23}
\]
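As a rough numerical illustration of (S.16) and (S.22)-(S.25), the following NumPy sketch assembles the approximate posterior and marginal from a kernel matrix and a set of expansion points. The functions u_fn, w_fn, and loglik_fn are hypothetical placeholders for the model-specific $u(\tilde\eta_i, y_i)$, $w(\tilde\eta_i, y_i)$, and $\log p(y_i|\theta(\tilde\eta_i))$ of the chosen GGPM likelihood; a practical implementation would also prefer Cholesky factorizations over the explicit inverses used here.

```python
import numpy as np

def taylor_approximation(K, eta_tilde, y, u_fn, w_fn, loglik_fn):
    """Closed-form Taylor approximation: posterior (S.16) and marginal (S.22)-(S.25).

    u_fn, w_fn, loglik_fn are placeholders for the model-specific
    u(eta, y), w(eta, y), and log p(y | theta(eta)) of the GGPM.
    """
    u = u_fn(eta_tilde, y)                    # u_i = u(eta_i, y_i)
    w = w_fn(eta_tilde, y)                    # w_i = w(eta_i, y_i)
    W = np.diag(w)
    t = eta_tilde + w * u                     # target vector t = eta + W u

    # posterior moments (S.16): V = (W^{-1} + K^{-1})^{-1}, m = V W^{-1} t
    V = np.linalg.inv(np.diag(1.0 / w) + np.linalg.inv(K))
    m = V @ (t / w)

    # marginal (S.22)-(S.25): -(1/2) t'(W+K)^{-1} t - (1/2) log|W+K| + r(phi)
    B = W + K
    r = np.sum(loglik_fn(eta_tilde, y) + 0.5 * w * u ** 2 + 0.5 * np.log(np.abs(w)))
    log_marg = -0.5 * t @ np.linalg.solve(B, t) - 0.5 * np.linalg.slogdet(B)[1] + r
    return m, V, log_marg
```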
For an individual data point, we have
\[
r_i(\phi) = \log p(y_i|\theta(\tilde\eta_i)) + \frac{1}{2}\tilde w_i\tilde u_i^2 + \frac{1}{2}\log|\tilde w_i|, \tag{S.24}
\]
and hence
\[
r(\phi) = \sum_{i=1}^n r_i(\phi). \tag{S.25}
\]

1) Derivatives wrt hyperparameters: The derivative of (S.22) with respect to the kernel hyperparameter $\alpha_j$ is
\[
\frac{\partial}{\partial\alpha_j}\log q(y|X) = \frac{1}{2}\tilde t^T(K+\tilde W)^{-1}\frac{\partial K}{\partial\alpha_j}(K+\tilde W)^{-1}\tilde t - \frac{1}{2}\mathrm{tr}\!\left((K+\tilde W)^{-1}\frac{\partial K}{\partial\alpha_j}\right) \tag{S.26}
\]
\[
= \frac{1}{2}\mathrm{tr}\!\left(\left(zz^T - (K+\tilde W)^{-1}\right)\frac{\partial K}{\partial\alpha_j}\right), \qquad z = (K+\tilde W)^{-1}\tilde t, \tag{S.27}
\]
where $\frac{\partial K}{\partial\alpha_j}$ is the element-wise derivative of the kernel matrix with respect to the kernel hyperparameter $\alpha_j$, and we use the derivative properties
\[
\frac{\partial}{\partial\alpha}A^{-1} = -A^{-1}\frac{\partial A}{\partial\alpha}A^{-1}, \qquad \frac{\partial}{\partial\alpha}\log|A| = \mathrm{tr}\!\left(A^{-1}\frac{\partial A}{\partial\alpha}\right). \tag{S.28}
\]

2) Derivative wrt dispersion: For the derivative with respect to the dispersion parameter, we first note that
\[
\frac{\partial}{\partial\phi}\tilde w_i = a'(\phi)\left[ b''(\theta(\eta))\theta'(\eta)^2 - \left( y - b'(\theta(\eta)) \right)\theta''(\eta) \right]^{-1} = \frac{a'(\phi)}{a(\phi)}\tilde w_i, \tag{S.29}
\]
\[
\frac{\partial}{\partial\phi}\tilde W = \frac{a'(\phi)}{a(\phi)}\tilde W, \tag{S.30}
\]
\[
\frac{\partial}{\partial\phi}\tilde u_i = -\frac{a'(\phi)}{a(\phi)^2}\theta'(\eta)\left[ y - b'(\theta(\eta)) \right] = -\frac{a'(\phi)}{a(\phi)}\tilde u_i, \tag{S.31}
\]
\[
\frac{\partial}{\partial\phi}\left(\tilde w_i\tilde u_i^2\right) = \frac{a'(\phi)}{a(\phi)}\tilde w_i\tilde u_i^2 + 2\tilde w_i\tilde u_i\left(-\frac{a'(\phi)}{a(\phi)}\tilde u_i\right) = -\frac{a'(\phi)}{a(\phi)}\tilde w_i\tilde u_i^2, \tag{S.32}
\]
\[
\frac{\partial}{\partial\phi}\log p(y_i|\theta(\tilde\eta_i)) = \frac{\partial}{\partial\phi}\left[ \frac{1}{a(\phi)}\left( y_i\theta(\tilde\eta_i) - b(\theta(\tilde\eta_i)) \right) \right] + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.33}
\]
\[
= -\frac{a'(\phi)}{a(\phi)^2}\left[ y_i\theta(\tilde\eta_i) - b(\theta(\tilde\eta_i)) \right] + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.34}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\tilde v_i + \frac{\partial}{\partial\phi}\log h(y_i,\phi), \tag{S.35}
\]
where $\tilde v_i = \frac{1}{a(\phi)}\left[ y_i\theta(\tilde\eta_i) - b(\theta(\tilde\eta_i)) \right]$. Thus,
\[
\frac{\partial}{\partial\phi}r_i(\phi) = -\frac{a'(\phi)}{a(\phi)}\tilde v_i + \frac{\partial}{\partial\phi}\log h(y_i,\phi) - \frac{1}{2}\frac{a'(\phi)}{a(\phi)}\tilde w_i\tilde u_i^2 + \frac{1}{2}\frac{1}{\tilde w_i}\frac{a'(\phi)}{a(\phi)}\tilde w_i \tag{S.36}
\]
\[
= \frac{a'(\phi)}{a(\phi)}\left( \frac{1}{2} - \tilde v_i - \frac{1}{2}\tilde w_i\tilde u_i^2 \right) + \frac{\partial}{\partial\phi}\log h(y_i,\phi). \tag{S.37}
\]
Summing over $i$,
\[
\frac{\partial}{\partial\phi}r(\phi) = \sum_i \frac{a'(\phi)}{a(\phi)}\left( \frac{1}{2} - \tilde v_i - \frac{1}{2}\tilde w_i\tilde u_i^2 \right) + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.38}
\]
\[
= \frac{a'(\phi)}{a(\phi)}\left( \frac{n}{2} - \mathbf{1}^T\tilde v - \frac{1}{2}\tilde u^T\tilde W\tilde u \right) + \sum_{i=1}^n \frac{\partial}{\partial\phi}\log h(y_i,\phi). \tag{S.39}
\]
Also note that $\tilde t$ is not a function of $\phi$, as the term cancels out in $\tilde W\tilde u$. Finally,
\[
\frac{\partial}{\partial\phi}\log q(y|X) \tag{S.40}
\]
\[
= \frac{1}{2}\tilde t^T(K+\tilde W)^{-1}\frac{\partial\tilde W}{\partial\phi}(K+\tilde W)^{-1}\tilde t - \frac{1}{2}\mathrm{tr}\!\left((K+\tilde W)^{-1}\frac{\partial\tilde W}{\partial\phi}\right) + \frac{\partial}{\partial\phi}r(\phi)
\]
\[
= \frac{1}{2}\mathrm{tr}\!\left(\left(zz^T-(K+\tilde W)^{-1}\right)\frac{\partial\tilde W}{\partial\phi}\right) + \frac{a'(\phi)}{a(\phi)}\left(\frac{n}{2}-\mathbf{1}^T\tilde v-\frac{1}{2}\tilde u^T\tilde W\tilde u\right) + \sum_{i=1}^n\frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.41}
\]
\[
= \frac{a'(\phi)}{a(\phi)}\left(\frac{1}{2}\mathrm{tr}\!\left(\left(zz^T-(K+\tilde W)^{-1}\right)\tilde W\right)+\frac{n}{2}-\mathbf{1}^T\tilde v-\frac{1}{2}\tilde u^T\tilde W\tilde u\right) + \sum_{i=1}^n\frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.42}
\]
\[
= \frac{a'(\phi)}{a(\phi)}\left(\frac{1}{2}z^T\tilde W z-\frac{1}{2}\mathrm{tr}\!\left((K+\tilde W)^{-1}\tilde W\right)+\frac{n}{2}-\mathbf{1}^T\tilde v-\frac{1}{2}\tilde u^T\tilde W\tilde u\right) + \sum_{i=1}^n\frac{\partial}{\partial\phi}\log h(y_i,\phi). \tag{S.43}
\]

3) Derivative wrt dispersion when $b_\phi(\theta)$: We next look at the special case where the term $b_\phi(\theta)$ is also a function of $\phi$. We first note that
\[
\frac{\partial}{\partial\phi}\tilde w_i = a'(\phi)\left[ b''_\phi(\theta(\eta))\theta'(\eta)^2 - \left( y - b'_\phi(\theta(\eta)) \right)\theta''(\eta) \right]^{-1} - a(\phi)\,\frac{\theta'(\eta)^2\frac{\partial}{\partial\phi}b''_\phi(\theta(\eta)) + \theta''(\eta)\frac{\partial}{\partial\phi}b'_\phi(\theta(\eta))}{\left[ b''_\phi(\theta(\eta))\theta'(\eta)^2 - \left( y - b'_\phi(\theta(\eta)) \right)\theta''(\eta) \right]^2} \tag{S.44}
\]
\[
= \frac{a'(\phi)}{a(\phi)}\tilde w_i - \frac{1}{a(\phi)}\tilde w_i^2\left( \theta'(\eta)^2\frac{\partial}{\partial\phi}b''_\phi(\theta(\eta)) + \theta''(\eta)\frac{\partial}{\partial\phi}b'_\phi(\theta(\eta)) \right), \tag{S.45}
\]
\[
\frac{\partial}{\partial\phi}\tilde u_i = -\frac{a'(\phi)}{a(\phi)^2}\theta'(\eta)\left[ y - b'_\phi(\theta(\eta)) \right] - \frac{\theta'(\eta)}{a(\phi)}\frac{\partial}{\partial\phi}b'_\phi(\theta(\eta)) \tag{S.46}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\tilde u_i - \frac{\theta'(\eta)}{a(\phi)}\frac{\partial}{\partial\phi}b'_\phi(\theta(\eta)), \tag{S.47}
\]
\[
\frac{\partial}{\partial\phi}\left(\tilde w_i\tilde u_i^2\right) = \frac{\partial\tilde w_i}{\partial\phi}\tilde u_i^2 + 2\tilde w_i\tilde u_i\frac{\partial\tilde u_i}{\partial\phi}, \tag{S.48}
\]
\[
\frac{\partial}{\partial\phi}\tilde t_i = \frac{\partial}{\partial\phi}\left(\tilde\eta_i + \tilde w_i\tilde u_i\right) = \tilde w_i\frac{\partial\tilde u_i}{\partial\phi} + \tilde u_i\frac{\partial\tilde w_i}{\partial\phi}, \tag{S.49}
\]
\[
\frac{\partial}{\partial\phi}\log p(y_i|\theta(\tilde\eta_i)) = \frac{\partial}{\partial\phi}\left[ \frac{1}{a(\phi)}\left( y_i\theta(\tilde\eta_i) - b_\phi(\theta(\tilde\eta_i)) \right) \right] + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.50}
\]
\[
= -\frac{a'(\phi)}{a(\phi)^2}\left[ y_i\theta(\tilde\eta_i) - b_\phi(\theta(\tilde\eta_i)) \right] - \frac{1}{a(\phi)}\frac{\partial}{\partial\phi}b_\phi(\theta(\tilde\eta_i)) + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.51}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\tilde v_i + \frac{\partial}{\partial\phi}\log h(y_i,\phi) - \frac{1}{a(\phi)}\frac{\partial}{\partial\phi}b_\phi(\theta(\tilde\eta_i)). \tag{S.52}
\]
Thus,
\[
\frac{\partial}{\partial\phi}r_i(\phi) = \frac{\partial}{\partial\phi}\log p(y_i|\theta(\tilde\eta_i)) + \frac{1}{2}\left( \frac{\partial\tilde w_i}{\partial\phi}\tilde u_i^2 + 2\tilde w_i\tilde u_i\frac{\partial\tilde u_i}{\partial\phi} \right) + \frac{1}{2}\frac{1}{\tilde w_i}\frac{\partial\tilde w_i}{\partial\phi}. \tag{S.53}
\]
For the first term in (S.22),
\[
\frac{\partial}{\partial\phi}\left[ \tilde t^T(\tilde W+K)^{-1}\tilde t \right] = \tilde t^T\frac{\partial(\tilde W+K)^{-1}}{\partial\phi}\tilde t + 2\tilde t^T(\tilde W+K)^{-1}\frac{\partial\tilde t}{\partial\phi} \tag{S.54}
\]
\[
= -\tilde t^T(K+\tilde W)^{-1}\frac{\partial\tilde W}{\partial\phi}(K+\tilde W)^{-1}\tilde t + 2\tilde t^T(\tilde W+K)^{-1}\frac{\partial\tilde t}{\partial\phi}. \tag{S.55}
\]
For the second term in (S.22),
\[
\frac{\partial}{\partial\phi}\log\left|K+\tilde W\right| = \mathrm{tr}\!\left((K+\tilde W)^{-1}\frac{\partial\tilde W}{\partial\phi}\right). \tag{S.56}
\]
Finally, we have
\[
\frac{\partial}{\partial\phi}\log q(y|X) = \frac{1}{2}\tilde t^T(K+\tilde W)^{-1}\frac{\partial\tilde W}{\partial\phi}(K+\tilde W)^{-1}\tilde t - \tilde t^T(\tilde W+K)^{-1}\frac{\partial\tilde t}{\partial\phi} - \frac{1}{2}\mathrm{tr}\!\left((K+\tilde W)^{-1}\frac{\partial\tilde W}{\partial\phi}\right) + \sum_i\frac{\partial}{\partial\phi}r_i(\phi). \tag{S.57}
\]
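To make the gradient expressions concrete, here is a small sketch of (S.27) and (S.43) for the standard case where $b$ does not depend on $\phi$. The inputs u, w, v, and dlogh_dphi stand in for the vectors of $\tilde u_i$, $\tilde w_i$, $\tilde v_i$, and $\frac{\partial}{\partial\phi}\log h(y_i,\phi)$, and a, da_dphi for $a(\phi)$ and $a'(\phi)$; these names are assumptions for illustration only.

```python
import numpy as np

def taylor_gradients(K, dK_dalpha, w, t, u, v, dlogh_dphi, a, da_dphi):
    """Gradient sketches for the Taylor marginal: (S.27) wrt a kernel
    hyperparameter and (S.43) wrt the dispersion parameter."""
    n = len(w)
    B = K + np.diag(w)                     # K + W
    Binv = np.linalg.inv(B)
    z = Binv @ t                           # z = (K + W)^{-1} t

    # (S.27): 0.5 * tr[(z z^T - (K+W)^{-1}) dK/dalpha_j]
    grad_alpha = 0.5 * np.trace((np.outer(z, z) - Binv) @ dK_dalpha)

    # (S.43): (a'/a) [ 0.5 z^T W z - 0.5 tr((K+W)^{-1} W) + n/2
    #                  - 1^T v - 0.5 u^T W u ] + sum_i d/dphi log h(y_i, phi)
    grad_phi = (da_dphi / a) * (0.5 * z @ (w * z)
                                - 0.5 * np.sum(np.diag(Binv) * w)
                                + 0.5 * n
                                - np.sum(v)
                                - 0.5 * np.sum(w * u ** 2)) + np.sum(dlogh_dphi)
    return grad_alpha, grad_phi
```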
II. LAPLACE APPROXIMATION

In this section, we present more details of the Laplace approximation for GGPMs, which is discussed in Section 5.2 of the paper. The Laplace approximation is a Gaussian approximation of the posterior $p(\eta|X,y)$ using a 2nd-order Taylor expansion at its maximum (mode). Hence, the Laplace approximation is a special case of the closed-form Taylor approximation in the previous section, where the expansion points $\tilde\eta_i$ are selected so that $\hat m$ is the maximum of the true posterior $p(\eta|X,y)$.

A. Approximate Posterior

The maximum of the posterior is given by
\[
\hat\eta = \operatorname*{argmax}_\eta \log p(\eta|X,y) = \operatorname*{argmax}_\eta \log p(y|\theta(\eta)) + \log p(\eta|X). \tag{S.58}
\]
In vector notation, the derivatives of the posterior are
\[
\frac{\partial}{\partial\eta}\log p(\eta|X,y) = u - K^{-1}\eta, \tag{S.59}
\]
\[
\frac{\partial^2}{\partial\eta\,\partial\eta^T}\log p(\eta|X,y) = -W^{-1} - K^{-1}. \tag{S.60}
\]
At the maximum of the posterior, we have the condition
\[
\frac{\partial}{\partial\eta}\log p(\eta|X,y) = \hat u - K^{-1}\hat\eta = 0 \tag{S.61}
\]
\[
\Rightarrow\quad \hat u = K^{-1}\hat\eta, \tag{S.62}
\]
where $\hat u$ is $u$ evaluated at $\hat\eta$. The maximum can be obtained iteratively using the Newton-Raphson method. In particular, each iteration is
\[
\hat\eta^{(new)} = \hat\eta - \left( \frac{\partial^2}{\partial\eta\,\partial\eta^T}\log p(\hat\eta|X,y) \right)^{-1}\frac{\partial}{\partial\eta}\log p(\hat\eta|X,y) \tag{S.63}
\]
\[
= \hat\eta + \left(\hat W^{-1}+K^{-1}\right)^{-1}\left(\hat u - K^{-1}\hat\eta\right) \tag{S.64}
\]
\[
= \left(\hat W^{-1}+K^{-1}\right)^{-1}\left(\hat u + \hat W^{-1}\hat\eta\right) \tag{S.65}
\]
\[
= \left(\hat W^{-1}+K^{-1}\right)^{-1}\hat W^{-1}\left(\hat W\hat u + \hat\eta\right), \tag{S.66}
\]
where $\hat W$ is $W$ evaluated at $\hat\eta$. Note that this is exactly the same form as the closed-form Taylor expansion in the previous section. In particular, at each iteration the expansion point $\hat\eta$ is moved closer to the maximum. As a result, the target vector $\hat t^{(new)} = \hat W\hat u + \hat\eta$ is also updated.

B. Approximate Marginal

The marginal likelihood is approximated by applying a Taylor approximation to the marginal likelihood at the mode of the posterior, $\hat\eta$. The marginal distribution is
\[
p(y|X) = \int p(y|\theta(\eta))\,p(\eta|X)\,d\eta = \int e^{\kappa(\eta)}\,d\eta, \tag{S.67}
\]
where
\[
\kappa(\eta) = \log p(y|\theta(\eta)) + \log p(\eta|X). \tag{S.68}
\]
The second-order Taylor approximation of $\kappa(\eta)$ around $\hat\eta$ is
\[
\kappa(\eta) \approx \kappa(\hat\eta) - \frac{1}{2}\left\|\eta-\hat\eta\right\|^2_{\hat V} \tag{S.69}
\]
\[
= \log p(y|\theta(\hat\eta)) + \log p(\hat\eta|X) - \frac{1}{2}\left\|\eta-\hat\eta\right\|^2_{\hat V}. \tag{S.70}
\]
Substituting into (S.67), we obtain the approximate marginal distribution
\[
\log q(y|X) = \log p(y|\theta(\hat\eta)) + \log p(\hat\eta|X) + \log\int e^{-\frac{1}{2}\|\eta-\hat\eta\|^2_{\hat V}}\,d\eta \tag{S.71}
\]
\[
= \log p(y|\theta(\hat\eta)) - \frac{1}{2}\hat\eta^T K^{-1}\hat\eta - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi + \frac{n}{2}\log 2\pi + \frac{1}{2}\log\left|\hat V\right| \tag{S.72}
\]
\[
= \log p(y|\theta(\hat\eta)) - \frac{1}{2}\hat\eta^T K^{-1}\hat\eta - \frac{1}{2}\log\left|\left(\hat W^{-1}+K^{-1}\right)K\right| \tag{S.73}
\]
\[
= \log p(y|\theta(\hat\eta)) - \frac{1}{2}\hat\eta^T K^{-1}\hat\eta - \frac{1}{2}\log\left|\hat W^{-1}K + I\right|. \tag{S.74}
\]
Note that $\hat\eta$ is dependent on the kernel matrix $K$ and $a(\phi)$. Hence, at each iteration during optimization of $q(y|X)$, we have to recompute its value.
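The Newton-Raphson update (S.66) and the resulting marginal (S.74) can be sketched as below. As in the earlier sketch, u_fn, w_fn, and loglik_fn are hypothetical stand-ins for the GGPM-specific $u(\eta,y)$, $w(\eta,y)$, and $\log p(y|\theta(\eta))$; a robust implementation would follow the numerically stable Laplace recursion of Rasmussen and Williams [3] rather than forming explicit inverses.

```python
import numpy as np

def laplace_mode_and_marginal(K, y, u_fn, w_fn, loglik_fn, n_iter=20):
    """Newton-Raphson mode finding (S.66) and Laplace marginal (S.74), sketch."""
    n = K.shape[0]
    Kinv = np.linalg.inv(K)
    eta = np.zeros(n)                            # initial expansion point
    for _ in range(n_iter):
        u = u_fn(eta, y)                         # u evaluated at current eta
        w = w_fn(eta, y)                         # diagonal of W at current eta
        t = eta + w * u                          # target t = W u + eta
        # (S.66): eta <- (W^{-1} + K^{-1})^{-1} W^{-1} t
        eta = np.linalg.solve(np.diag(1.0 / w) + Kinv, t / w)
    w = w_fn(eta, y)
    # (S.74): log p(y|theta(eta)) - (1/2) eta^T K^{-1} eta - (1/2) log|W^{-1}K + I|
    log_marg = (np.sum(loglik_fn(eta, y))
                - 0.5 * eta @ (Kinv @ eta)
                - 0.5 * np.linalg.slogdet(np.diag(1.0 / w) @ K + np.eye(n))[1])
    return eta, log_marg
```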
1) Derivative wrt hyperparameters: We next look at the derivatives of each term in (S.74). Note that $\hat\eta$ is a function of the kernel hyperparameters. Using the chain rule, the derivative of the first term is
\[
\frac{\partial}{\partial\alpha_j}\log p(y|\theta(\hat\eta)) = \frac{\partial\hat\eta^T}{\partial\alpha_j}\frac{\partial}{\partial\hat\eta}\log p(y|\theta(\hat\eta)) = \hat u^T\frac{\partial\hat\eta}{\partial\alpha_j}. \tag{S.75}
\]
For the second term, we use the product rule, $\frac{\partial AB}{\partial\alpha} = A\frac{\partial B}{\partial\alpha} + \frac{\partial A}{\partial\alpha}B$,
\[
\frac{\partial}{\partial\alpha_j}\hat\eta^T K^{-1}\hat\eta \tag{S.76}
\]
\[
= \frac{\partial\hat\eta^T}{\partial\alpha_j}K^{-1}\hat\eta + \hat\eta^T\frac{\partial}{\partial\alpha_j}\left(K^{-1}\hat\eta\right) \tag{S.77}
\]
\[
= \frac{\partial\hat\eta^T}{\partial\alpha_j}K^{-1}\hat\eta + \hat\eta^T\frac{\partial K^{-1}}{\partial\alpha_j}\hat\eta + \hat\eta^T K^{-1}\frac{\partial\hat\eta}{\partial\alpha_j} \tag{S.78}
\]
\[
= -\hat\eta^T K^{-1}\frac{\partial K}{\partial\alpha_j}K^{-1}\hat\eta + 2\hat\eta^T K^{-1}\frac{\partial\hat\eta}{\partial\alpha_j}. \tag{S.79}
\]
For the third term,
\[
\frac{\partial}{\partial\alpha_j}\log\left|\hat W^{-1}K + I\right| = \mathrm{tr}\!\left(\left(\hat W^{-1}K+I\right)^{-1}\frac{\partial}{\partial\alpha_j}\left(\hat W^{-1}K\right)\right) \tag{S.80}
\]
\[
= \mathrm{tr}\!\left(\left(\hat W^{-1}K+I\right)^{-1}\left(\hat W^{-1}\frac{\partial K}{\partial\alpha_j} + \frac{\partial\hat W^{-1}}{\partial\alpha_j}K\right)\right) \tag{S.81}
\]
\[
= \mathrm{tr}\!\left((K+\hat W)^{-1}\frac{\partial K}{\partial\alpha_j}\right) + \mathrm{tr}\!\left(\left(\hat W^{-1}+K^{-1}\right)^{-1}\frac{\partial\hat W^{-1}}{\partial\alpha_j}\right). \tag{S.82}
\]
Putting it all together, we have
\[
\frac{\partial}{\partial\alpha_j}\log q(y|X) = \hat u^T\frac{\partial\hat\eta}{\partial\alpha_j} + \frac{1}{2}\hat\eta^T K^{-1}\frac{\partial K}{\partial\alpha_j}K^{-1}\hat\eta - \hat\eta^T K^{-1}\frac{\partial\hat\eta}{\partial\alpha_j} - \frac{1}{2}\mathrm{tr}\!\left((K+\hat W)^{-1}\frac{\partial K}{\partial\alpha_j}\right) - \frac{1}{2}\mathrm{tr}\!\left(\left(\hat W^{-1}+K^{-1}\right)^{-1}\frac{\partial\hat W^{-1}}{\partial\alpha_j}\right) \tag{S.83}
\]
\[
= \frac{1}{2}\mathrm{tr}\!\left(\left(\hat u\hat u^T-(K+\hat W)^{-1}\right)\frac{\partial K}{\partial\alpha_j}\right) - \frac{1}{2}\mathrm{tr}\!\left(\left(\hat W^{-1}+K^{-1}\right)^{-1}\frac{\partial\hat W^{-1}}{\partial\alpha_j}\right), \tag{S.84}
\]
which follows from $\hat u = K^{-1}\hat\eta$. Next, we note that $\hat W$ is a function of $\hat\eta$, which in turn is a function of the kernel hyperparameter $\alpha_j$, and thus
\[
\frac{\partial\hat W^{-1}}{\partial\alpha_j} = \sum_{i=1}^n \frac{\partial\hat W^{-1}}{\partial\hat\eta_i}\frac{\partial\hat\eta_i}{\partial\alpha_j}. \tag{S.85}
\]
The first term is
\[
\frac{\partial\hat W^{-1}}{\partial\hat\eta_i} = \left( -\frac{\partial^3}{\partial\hat\eta_i^3}\log p(y_i|\theta(\hat\eta_i)) \right) e_i e_i^T. \tag{S.86}
\]
Elements of the second term are given by
\[
\frac{\partial\hat\eta}{\partial\alpha_j} = \frac{\partial}{\partial\alpha_j}\left(\hat W^{-1}+K^{-1}\right)^{-1}\hat W^{-1}\hat t \tag{S.87}
\]
\[
= -\left(\hat W^{-1}+K^{-1}\right)^{-1}\frac{\partial K^{-1}}{\partial\alpha_j}\left(\hat W^{-1}+K^{-1}\right)^{-1}\hat W^{-1}\hat t \tag{S.88}
\]
\[
= \left(\hat W^{-1}+K^{-1}\right)^{-1}K^{-1}\frac{\partial K}{\partial\alpha_j}K^{-1}\left(\hat W^{-1}+K^{-1}\right)^{-1}\hat W^{-1}\hat t \tag{S.89}
\]
\[
= \hat W(\hat W+K)^{-1}\frac{\partial K}{\partial\alpha_j}\hat u. \tag{S.90}
\]
Hence, (S.85) can be expressed as
\[
\frac{\partial\hat W^{-1}}{\partial\alpha_j} = \mathrm{diag}\!\left( A\frac{\partial\hat\eta}{\partial\alpha_j} \right), \tag{S.91}
\]
where $A$ is a diagonal matrix with entries
\[
A = \mathrm{diag}\!\left( -\frac{\partial^3}{\partial\hat\eta_i^3}\log p(y_i|\theta(\hat\eta_i)) \right) \tag{S.92}
\]
and
\[
-\frac{\partial^3}{\partial\hat\eta_i^3}\log p(y_i|\theta(\hat\eta_i)) = \frac{\partial}{\partial\hat\eta_i}\frac{1}{a(\phi)}\left( \theta'(\hat\eta_i)[g^{-1}]'(\hat\eta_i) - \theta''(\hat\eta_i)\left( y_i - g^{-1}(\hat\eta_i) \right) \right) \tag{S.93}
\]
\[
= \frac{1}{a(\phi)}\left( \theta'(\hat\eta_i)[g^{-1}]''(\hat\eta_i) + \theta''(\hat\eta_i)[g^{-1}]'(\hat\eta_i) + \theta''(\hat\eta_i)[g^{-1}]'(\hat\eta_i) - \theta'''(\hat\eta_i)\left( y_i - g^{-1}(\hat\eta_i) \right) \right) \tag{S.94}
\]
\[
= \frac{1}{a(\phi)}\left( \theta'(\hat\eta_i)[g^{-1}]''(\hat\eta_i) + 2\theta''(\hat\eta_i)[g^{-1}]'(\hat\eta_i) - \theta'''(\hat\eta_i)\left( y_i - g^{-1}(\hat\eta_i) \right) \right). \tag{S.95}
\]

2) Derivative wrt dispersion: Looking at the derivatives of each term in (S.74), for the first term,
\[
\frac{\partial}{\partial\phi}\log p(y_i|\theta(\hat\eta_i)) = \frac{\partial}{\partial\phi}\left[ \frac{1}{a(\phi)}\left( y_i\theta(\hat\eta_i) - b(\theta(\hat\eta_i)) \right) \right] + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.96}
\]
\[
= -\frac{a'(\phi)}{a(\phi)^2}\left[ y_i\theta(\hat\eta_i) - b(\theta(\hat\eta_i)) \right] + \frac{1}{a(\phi)}\frac{\partial}{\partial\phi}\left[ y_i\theta(\hat\eta_i) - b(\theta(\hat\eta_i)) \right] + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.97}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\hat v_i + \frac{1}{a(\phi)}\frac{\partial}{\partial\hat\eta_i}\left[ y_i\theta(\hat\eta_i) - b(\theta(\hat\eta_i)) \right]\frac{\partial\hat\eta_i}{\partial\phi} + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \tag{S.98}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\hat v_i + \hat u_i\frac{\partial\hat\eta_i}{\partial\phi} + \frac{\partial}{\partial\phi}\log h(y_i,\phi), \tag{S.99}
\]
where $\hat v_i = v(\hat\eta_i, y_i)$ and $\hat u_i = u(\hat\eta_i, y_i)$. Hence,
\[
\frac{\partial}{\partial\phi}\log p(y|\theta(\hat\eta)) = \sum_{i=1}^n\left( -\frac{a'(\phi)}{a(\phi)}\hat v_i + \hat u_i\frac{\partial\hat\eta_i}{\partial\phi} + \frac{\partial}{\partial\phi}\log h(y_i,\phi) \right) \tag{S.100}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\mathbf{1}^T\hat v + \hat u^T\frac{\partial\hat\eta}{\partial\phi} + \sum_{i=1}^n\frac{\partial}{\partial\phi}\log h(y_i,\phi). \tag{S.101}
\]
The derivative of $\hat\eta$ is
\[
\frac{\partial}{\partial\phi}\hat\eta = \frac{\partial}{\partial\phi}\left(\hat W^{-1}+K^{-1}\right)^{-1}\hat W^{-1}\hat t = \frac{\partial}{\partial\phi}\left(I+\hat W K^{-1}\right)^{-1}\hat t \tag{S.102}
\]
\[
= -\left(I+\hat W K^{-1}\right)^{-1}\frac{\partial}{\partial\phi}\left(\hat W K^{-1}\right)\left(I+\hat W K^{-1}\right)^{-1}\hat t \tag{S.103}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\left(I+\hat W K^{-1}\right)^{-1}\hat W K^{-1}\left(I+\hat W K^{-1}\right)^{-1}\hat t \tag{S.104}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\left(\hat W^{-1}+K^{-1}\right)^{-1}(K+\hat W)^{-1}\hat t \tag{S.105}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\left(\hat W^{-1}+K^{-1}\right)^{-1}\hat u. \tag{S.106}
\]
Hence,
\[
\frac{\partial}{\partial\phi}\log p(y|\theta(\hat\eta)) = -\frac{a'(\phi)}{a(\phi)}\left( \mathbf{1}^T\hat v + \hat u^T\left(\hat W^{-1}+K^{-1}\right)^{-1}\hat u \right) + \sum_{i=1}^n\frac{\partial}{\partial\phi}\log h(y_i,\phi). \tag{S.107}
\]
For the second term,
\[
\frac{\partial}{\partial\phi}\frac{1}{2}\hat\eta^T K^{-1}\hat\eta = \hat\eta^T K^{-1}\frac{\partial\hat\eta}{\partial\phi} = \hat u^T\frac{\partial\hat\eta}{\partial\phi} \tag{S.108}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\hat u^T\left(\hat W^{-1}+K^{-1}\right)^{-1}\hat u. \tag{S.109}
\]
For the third term,
\[
\frac{\partial}{\partial\phi}\log\left|\hat W^{-1}K+I\right| = \mathrm{tr}\!\left(\left(\hat W^{-1}K+I\right)^{-1}\frac{\partial}{\partial\phi}\left(\hat W^{-1}K\right)\right) \tag{S.110}
\]
\[
= -\mathrm{tr}\!\left(\left(\hat W^{-1}K+I\right)^{-1}\hat W^{-1}\frac{\partial\hat W}{\partial\phi}\hat W^{-1}K\right) \tag{S.111}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\mathrm{tr}\!\left(\left(\hat W^{-1}K+I\right)^{-1}\hat W^{-1}K\right) \tag{S.112}
\]
\[
= -\frac{a'(\phi)}{a(\phi)}\mathrm{tr}\!\left((K+\hat W)^{-1}K\right). \tag{S.113}
\]
Finally, putting it all together,
\[
\frac{\partial}{\partial\phi}\log q(y|X) = \frac{a'(\phi)}{a(\phi)}\left( -\mathbf{1}^T\hat v + \frac{1}{2}\mathrm{tr}\!\left((K+\hat W)^{-1}K\right) \right) + \sum_{i=1}^n\frac{\partial}{\partial\phi}\log h(y_i,\phi). \tag{S.114}
\]
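For completeness, a short sketch of the dispersion derivative (S.114) follows. Here v_hat collects the $v(\hat\eta_i, y_i)$, W_hat is the diagonal matrix $\hat W$, and a, da_dphi, dlogh_dphi are placeholder names for $a(\phi)$, $a'(\phi)$, and the $\frac{\partial}{\partial\phi}\log h(y_i,\phi)$ terms.

```python
import numpy as np

def laplace_dispersion_gradient(K, W_hat, v_hat, dlogh_dphi, a, da_dphi):
    """Derivative (S.114) of the Laplace marginal wrt the dispersion, sketch."""
    # (S.114): (a'/a) [ -1^T v_hat + 0.5 tr((K + W_hat)^{-1} K) ]
    #          + sum_i d/dphi log h(y_i, phi)
    trace_term = 0.5 * np.trace(np.linalg.solve(K + W_hat, K))
    return (da_dphi / a) * (-np.sum(v_hat) + trace_term) + np.sum(dlogh_dphi)
```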
III. EXPECTATION PROPAGATION

In this section, we present more details on expectation propagation (EP) for GGPMs, which is discussed in Section 5.3 of the paper.

A. Computing the site parameters

Instead of computing the optimal site parameters all at once, EP works by iteratively updating each individual site using the other site approximations [2], [3]. In particular, assume that we are updating site $t_i$. We first compute the cavity distribution, which is the marginalization over all sites except $t_i$,
\[
q_{\neg i}(\eta_i) \propto \int p(\eta|X)\prod_{j\neq i} t_j(\eta_j|\tilde Z_j,\tilde\mu_j,\tilde\sigma_j^2)\,d\eta_j, \tag{S.115}
\]
where the notation $\neg i$ indicates the sites without $t_i$. $q_{\neg i}(\eta_i)$ is an approximation to the posterior distribution of $\eta_i$, given all observations except $y_i$. Since both terms are Gaussian, this integral can be computed in closed form, leading to the cavity distribution
\[
q_{\neg i}(\eta_i) = \mathcal{N}\!\left(\eta_i \,\middle|\, \mu_{\neg i},\,\sigma_{\neg i}^2\right), \tag{S.116}
\]
\[
\mu_{\neg i} = \sigma_{\neg i}^2\left(\sigma_i^{-2}\mu_i - \tilde\sigma_i^{-2}\tilde\mu_i\right), \qquad \sigma_{\neg i}^2 = \left(\sigma_i^{-2}-\tilde\sigma_i^{-2}\right)^{-1}, \tag{S.117}
\]
where $\sigma_i^2 = [\hat V]_{ii}$ and $\mu_i = [\hat m]_i$. Next, multiplying the cavity distribution by the true data likelihood of $y_i$ gives an approximation to the unnormalized posterior of $\eta_i$, given the observations,
\[
q(\eta_i|X,y) = q_{\neg i}(\eta_i)\,p(y_i|\theta(\eta_i)). \tag{S.118}
\]
Note that this incorporates the true likelihood of $y_i$ and the approximate likelihoods of the remaining observations. On the other hand, the approximation to the posterior of $\eta_i$ using the site function is
\[
\hat q(\eta_i) = q_{\neg i}(\eta_i)\,t_i(\eta_i|\tilde Z_i,\tilde\mu_i,\tilde\sigma_i^2). \tag{S.119}
\]
Hence, the new parameters for site $t_i$ can be computed by minimizing the KL divergence between (S.118) and (S.119), i.e., between the approximate posterior using the true likelihood and that using the site $t_i$,
\[
\{\tilde Z_i^*,\tilde\mu_i^*,\tilde\sigma_i^{*2}\} = \operatorname*{argmin}_{\tilde Z_i,\tilde\mu_i,\tilde\sigma_i^2} \mathrm{KL}\!\left( q(\eta_i|X,y) \,\big\|\, \hat q(\eta_i) \right) \tag{S.120}
\]
\[
= \operatorname*{argmin}_{\tilde Z_i,\tilde\mu_i,\tilde\sigma_i^2} \mathrm{KL}\!\left( q_{\neg i}(\eta_i)\,p(y_i|\theta(\eta_i)) \,\big\|\, q_{\neg i}(\eta_i)\,t_i(\eta_i|\tilde Z_i,\tilde\mu_i,\tilde\sigma_i^2) \right). \tag{S.121}
\]
Note that $\hat q(\eta_i)$ in (S.119) is an unnormalized Gaussian, i.e., $\hat q(\eta_i) = \hat Z_i\,\mathcal{N}(\eta_i|\hat\mu_i,\hat\sigma_i^2)$, with parameters (given by the product of Gaussians)
\[
\hat\mu_i = \hat\sigma_i^2\left(\sigma_{\neg i}^{-2}\mu_{\neg i} + \tilde\sigma_i^{-2}\tilde\mu_i\right), \tag{S.122}
\]
\[
\hat\sigma_i^2 = \left(\sigma_{\neg i}^{-2}+\tilde\sigma_i^{-2}\right)^{-1}, \tag{S.123}
\]
\[
\hat Z_i = \tilde Z_i\,(2\pi)^{-\frac{1}{2}}\left(\tilde\sigma_i^2+\sigma_{\neg i}^2\right)^{-\frac{1}{2}}\exp\!\left( -\frac{(\mu_{\neg i}-\tilde\mu_i)^2}{2(\sigma_{\neg i}^2+\tilde\sigma_i^2)} \right). \tag{S.124}
\]
Hence, to compute (S.121), it suffices to first find the parameters of the unnormalized Gaussian that minimizes the KL divergence to $q(\eta_i|X,y)$, and then "subtract" $q_{\neg i}(\eta_i)$ from this Gaussian. To find $\hat q(\eta_i)$ we minimize the KL divergence,
\[
\{\hat Z_i^*,\hat\mu_i^*,\hat\sigma_i^{*2}\} = \operatorname*{argmin}_{\hat Z_i,\hat\mu_i,\hat\sigma_i^2} \mathrm{KL}\!\left( q_{\neg i}(\eta_i)\,p(y_i|\theta(\eta_i)) \,\big\|\, \hat Z_i\,\mathcal{N}(\eta_i|\hat\mu_i,\hat\sigma_i^2) \right). \tag{S.125}
\]
In particular, it is well known that the KL divergence in (S.125) is minimized when the moments of the Gaussian match those of the first argument. In addition to the mean and variance (1st and 2nd moments), we must also match the normalization constant (0th moment), since $\hat q(\eta_i)$ is unnormalized. The optimal parameters are
\[
\hat\mu_i = \mathbb{E}_q[\eta_i] = \frac{1}{\hat Z_i}\int \eta_i\,q_{\neg i}(\eta_i)\,p(y_i|\theta(\eta_i))\,d\eta_i, \tag{S.126}
\]
\[
\hat\sigma_i^2 = \mathrm{var}_q(\eta_i) = \frac{1}{\hat Z_i}\int (\eta_i-\hat\mu_i)^2\,q_{\neg i}(\eta_i)\,p(y_i|\theta(\eta_i))\,d\eta_i, \tag{S.127}
\]
\[
\hat Z_i = Z_q = \int q_{\neg i}(\eta_i)\,p(y_i|\theta(\eta_i))\,d\eta_i, \tag{S.128}
\]
where $\mathbb{E}_q$ and $\mathrm{var}_q$ are the expectation and variance with respect to $q(\eta_i|X,y)$. These moments are discussed later in this section. Finally, the new parameters for site $t_i$ are obtained by "subtracting" the two Gaussians, $q_{\neg i}(\eta_i)$ from $\hat q(\eta_i)$, leading to the site updates
\[
\tilde\mu_i = \tilde\sigma_i^2\left(\hat\sigma_i^{-2}\hat\mu_i - \sigma_{\neg i}^{-2}\mu_{\neg i}\right), \tag{S.129}
\]
\[
\tilde\sigma_i^2 = \left(\hat\sigma_i^{-2}-\sigma_{\neg i}^{-2}\right)^{-1}, \tag{S.130}
\]
\[
\tilde Z_i = \hat Z_i\sqrt{2\pi\left(\sigma_{\neg i}^2+\tilde\sigma_i^2\right)}\exp\!\left( \frac{(\mu_{\neg i}-\tilde\mu_i)^2}{2(\sigma_{\neg i}^2+\tilde\sigma_i^2)} \right). \tag{S.131}
\]
EP iterates over each site $t_i$, i.e., each observation $y_i$, until convergence.

1) Computing the moments: Each EP iteration requires the moments of $q(\eta_i)$ in (S.126)-(S.128), where
\[
q(\eta_i) = \frac{1}{\hat Z_i}\,p(y_i|\theta(\eta_i))\,\mathcal{N}\!\left(\eta_i\,\middle|\,\mu_{\neg i},\sigma_{\neg i}^2\right), \tag{S.132}
\]
\[
\hat Z_i = \int p(y_i|\theta(\eta_i))\,\mathcal{N}\!\left(\eta_i\,\middle|\,\mu_{\neg i},\sigma_{\neg i}^2\right)d\eta_i. \tag{S.133}
\]
Depending on the form of the likelihood $p(y_i|\theta(\eta_i))$, the integrals may not be analytically tractable, so approximate integration is required. In this paper, we use numerical integration with 10,000 equally spaced points between $\mu_{\neg i}\pm 5\sigma_{\neg i}$. We also tried Gauss-Hermite quadrature to approximate the integral, but this caused EP to fail to converge in many cases, due to inaccuracy in the approximation.
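The grid-based moment matching described above can be sketched as follows; loglik_fn is again a hypothetical placeholder for the log-likelihood $\log p(y_i|\theta(\eta))$ of the chosen GGPM, evaluated elementwise on the grid.

```python
import numpy as np

def ep_moments(mu_cav, sigma2_cav, y_i, loglik_fn, n_grid=10000):
    """Moments (S.126)-(S.128) of q(eta_i) by numerical integration on an
    equally spaced grid over mu_cav +/- 5 sigma_cav."""
    sigma_cav = np.sqrt(sigma2_cav)
    eta = np.linspace(mu_cav - 5 * sigma_cav, mu_cav + 5 * sigma_cav, n_grid)
    d_eta = eta[1] - eta[0]
    # integrand: p(y_i | theta(eta)) * N(eta | mu_cav, sigma2_cav)
    cavity = np.exp(-0.5 * (eta - mu_cav) ** 2 / sigma2_cav) / np.sqrt(2 * np.pi * sigma2_cav)
    f = np.exp(loglik_fn(eta, y_i)) * cavity
    Z_hat = np.sum(f) * d_eta                                     # (S.128)
    mu_hat = np.sum(eta * f) * d_eta / Z_hat                      # (S.126)
    sigma2_hat = np.sum((eta - mu_hat) ** 2 * f) * d_eta / Z_hat  # (S.127)
    return Z_hat, mu_hat, sigma2_hat
```

The resulting moments are then converted into the site updates via (S.129)-(S.131).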
B. Marginal Likelihood

The EP approximation to the marginal log-likelihood is
\[
\log p(y|X) \approx \log q(y|X) = \log Z_{EP} \tag{S.134}
\]
\[
= \log\int q(y|\theta(\eta))\,p(\eta|X)\,d\eta \tag{S.135}
\]
\[
= \log\int \mathcal{N}\!\left(\eta\,\middle|\,\tilde\mu,\tilde\Sigma\right)\left(\prod_{i=1}^n\tilde Z_i\right)\mathcal{N}(\eta|0,K)\,d\eta \tag{S.136}
\]
\[
= \sum_{i=1}^n\log\tilde Z_i + \log\int \mathcal{N}\!\left(\tilde\mu\,\middle|\,\eta,\tilde\Sigma\right)\mathcal{N}(\eta|0,K)\,d\eta \tag{S.137}
\]
\[
= \sum_{i=1}^n\log\tilde Z_i + \log\mathcal{N}\!\left(\tilde\mu\,\middle|\,0,\tilde\Sigma+K\right). \tag{S.138}
\]
Hence, we have
\[
\log q(y|X) = -\frac{1}{2}\tilde\mu^T(K+\tilde\Sigma)^{-1}\tilde\mu - \frac{1}{2}\log\left|K+\tilde\Sigma\right| - \frac{n}{2}\log 2\pi + \sum_i\left( \log\hat Z_i + \frac{1}{2}\log 2\pi + \frac{1}{2}\log(\sigma_{\neg i}^2+\tilde\sigma_i^2) + \frac{(\mu_{\neg i}-\tilde\mu_i)^2}{2(\sigma_{\neg i}^2+\tilde\sigma_i^2)} \right). \tag{S.139}
\]
Finally,
\[
\log q(y|X) = -\frac{1}{2}\tilde\mu^T(K+\tilde\Sigma)^{-1}\tilde\mu - \frac{1}{2}\log\left|K+\tilde\Sigma\right| + \sum_i\left( \log\hat Z_i + \frac{1}{2}\log(\sigma_{\neg i}^2+\tilde\sigma_i^2) + \frac{(\mu_{\neg i}-\tilde\mu_i)^2}{2(\sigma_{\neg i}^2+\tilde\sigma_i^2)} \right). \tag{S.140}
\]

1) Derivatives wrt hyperparameters: Note that $\tilde\mu$, $\tilde\Sigma$, $\mu_{\neg}$, and $\Sigma_{\neg}$ are all implicitly functions of the kernel hyperparameters, due to the EP update steps. Hence, the derivative of the marginal is composed of the explicit derivative with respect to the hyperparameter ($\frac{\partial}{\partial\alpha_j}$) as well as several implicit derivatives (e.g., $\frac{\partial}{\partial\tilde\mu}\frac{\partial\tilde\mu}{\partial\alpha_j}$). It can be shown that the implicit derivatives are all zero [4], and hence
\[
\frac{\partial}{\partial\alpha_j}\log q(y|X) = \left.\frac{\partial\log q(y|X)}{\partial\alpha_j}\right|_{\text{explicit}} \tag{S.141}
\]
\[
= \frac{1}{2}\tilde\mu^T(K+\tilde\Sigma)^{-1}\frac{\partial K}{\partial\alpha_j}(K+\tilde\Sigma)^{-1}\tilde\mu - \frac{1}{2}\mathrm{tr}\!\left((K+\tilde\Sigma)^{-1}\frac{\partial K}{\partial\alpha_j}\right) \tag{S.142}
\]
\[
= \frac{1}{2}\mathrm{tr}\!\left(\left(\tilde z\tilde z^T - (K+\tilde\Sigma)^{-1}\right)\frac{\partial K}{\partial\alpha_j}\right), \qquad \tilde z = (K+\tilde\Sigma)^{-1}\tilde\mu. \tag{S.143}
\]

2) Derivative wrt dispersion: The only term in the marginal that depends on $\phi$ directly is $\log\hat Z_i$. Its derivative is
\[
\frac{\partial}{\partial\phi}\log\hat Z_i = \frac{\partial}{\partial\phi}\log\int p(y_i|\theta(\eta_i))\,\mathcal{N}\!\left(\eta_i\,\middle|\,\mu_{\neg i},\sigma_{\neg i}^2\right)d\eta_i \tag{S.144}
\]
\[
= \frac{1}{\hat Z_i}\int \frac{\partial}{\partial\phi}p(y_i|\theta(\eta_i))\,\mathcal{N}\!\left(\eta_i\,\middle|\,\mu_{\neg i},\sigma_{\neg i}^2\right)d\eta_i \tag{S.145}
\]
\[
= \frac{1}{\hat Z_i}\int\left( \frac{h'(y_i,\phi)}{h(y_i,\phi)} - \frac{a'(\phi)}{a(\phi)^2}\left( \theta(\eta_i)y_i - b(\theta(\eta_i)) \right) \right) p(y_i|\theta(\eta_i))\,\mathcal{N}\!\left(\eta_i\,\middle|\,\mu_{\neg i},\sigma_{\neg i}^2\right)d\eta_i \tag{S.146}
\]
\[
= \frac{\partial}{\partial\phi}\log h(y_i,\phi) - \frac{a'(\phi)}{a(\phi)^2}\left( y_i\,\mathbb{E}_q[\theta(\eta_i)] - \mathbb{E}_q[b(\theta(\eta_i))] \right), \tag{S.147}
\]
where $\mathbb{E}_q$ is the expectation under $q(\eta_i|X,y)$. Hence, the derivative of the marginal with respect to the dispersion parameter is
\[
\frac{\partial}{\partial\phi}\log q(y|X) = \left.\frac{\partial\log q(y|X)}{\partial\phi}\right|_{\text{explicit}} = \sum_{i=1}^n\frac{\partial}{\partial\phi}\log\hat Z_i \tag{S.148}
\]
\[
= \sum_{i=1}^n\frac{\partial}{\partial\phi}\log h(y_i,\phi) - \frac{a'(\phi)}{a(\phi)^2}\sum_{i=1}^n\left( y_i\,\mathbb{E}_q[\theta(\eta_i)] - \mathbb{E}_q[b(\theta(\eta_i))] \right). \tag{S.149}
\]
Hence, the additional expectations $\mathbb{E}_q[\theta(\eta_i)]$ and $\mathbb{E}_q[b(\theta(\eta_i))]$ are required.

REFERENCES

[1] A. B. Chan and D. Dong, "Generalized Gaussian process models," in IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[2] T. Minka, A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[4] M. Seeger, "Expectation propagation for exponential families," tech. rep., 2005.