Solution Manual For PRML

SOLUTION MANUAL FOR PATTERN RECOGNITION AND MACHINE LEARNING

Edited by Zhengqi Gao, the State Key Lab. of ASIC and System, School of Microelectronics, Fudan University, Nov. 2017

0.1 Introduction

Problem 1.1 Solution

We set the derivative of the error function $E$ with respect to the vector $\mathbf{w}$ to zero (i.e. $\partial E/\partial \mathbf{w} = 0$); the solution $\mathbf{w} = \{w_i\}$ is the one that minimizes $E$. Equivalently, we calculate the derivative of $E$ with respect to every $w_i$ and set each of them to zero. Based on (1.1) and (1.2) we obtain:

$$
\begin{aligned}
\frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^{i} &= 0 \\
\Rightarrow\quad \sum_{n=1}^{N} y(x_n, \mathbf{w})\, x_n^{i} &= \sum_{n=1}^{N} x_n^{i} t_n \\
\Rightarrow\quad \sum_{n=1}^{N} \Big( \sum_{j=0}^{M} w_j x_n^{j} \Big) x_n^{i} &= \sum_{n=1}^{N} x_n^{i} t_n \\
\Rightarrow\quad \sum_{n=1}^{N} \sum_{j=0}^{M} w_j x_n^{(j+i)} &= \sum_{n=1}^{N} x_n^{i} t_n \\
\Rightarrow\quad \sum_{j=0}^{M} \Big( \sum_{n=1}^{N} x_n^{(j+i)} \Big) w_j &= \sum_{n=1}^{N} x_n^{i} t_n
\end{aligned}
$$

If we denote $A_{ij} = \sum_{n=1}^{N} x_n^{i+j}$ and $T_i = \sum_{n=1}^{N} x_n^{i} t_n$, the last equation is exactly (1.122). Therefore the problem is solved.

Problem 1.2 Solution

This problem is similar to Prob.1.1; the only difference is the last term on the right side of (1.4), the penalty term. We proceed as in Prob.1.1:

$$
\begin{aligned}
\frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^{i} + \lambda w_i &= 0 \\
\Rightarrow\quad \sum_{j=0}^{M} \sum_{n=1}^{N} x_n^{(j+i)} w_j + \lambda w_i &= \sum_{n=1}^{N} x_n^{i} t_n \\
\Rightarrow\quad \sum_{j=0}^{M} \Big\{ \sum_{n=1}^{N} x_n^{(j+i)} + \delta_{ji} \lambda \Big\} w_j &= \sum_{n=1}^{N} x_n^{i} t_n
\end{aligned}
$$

where

$$
\delta_{ji} = \begin{cases} 0 & j \neq i \\ 1 & j = i \end{cases}
$$

Problem 1.3 Solution

This problem can be solved with the sum rule and Bayes' theorem. The probability of selecting an apple, $P(a)$, is:

$$
P(a) = P(a|r)P(r) + P(a|b)P(b) + P(a|g)P(g) = \frac{3}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.34
$$

By Bayes' theorem, the probability that a selected orange came from the green box is:

$$
P(g|o) = \frac{P(o|g)P(g)}{P(o)}
$$

We first calculate the probability of selecting an orange, $P(o)$:

$$
P(o) = P(o|r)P(r) + P(o|b)P(b) + P(o|g)P(g) = \frac{4}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.36
$$

Therefore:

$$
P(g|o) = \frac{P(o|g)P(g)}{P(o)} = \frac{\frac{3}{10} \times 0.6}{0.36} = 0.5
$$

Problem 1.4 Solution

This problem needs calculus, in particular the chain rule. We calculate the derivative of $p_y(y)$ with respect to $y$, according to (1.27):

$$
\frac{d p_y(y)}{dy} = \frac{d \big( p_x(g(y))\, |g'(y)| \big)}{dy} = \frac{d p_x(g(y))}{dy}\, |g'(y)| + p_x(g(y))\, \frac{d |g'(y)|}{dy} \qquad (*)
$$

The first term in the above equation can be simplified further:

$$
\frac{d p_x(g(y))}{dy}\, |g'(y)| = \frac{d p_x(g(y))}{d g(y)}\, \frac{d g(y)}{dy}\, |g'(y)| \qquad (**)
$$

If $\hat{x}$ is the maximum of the density over $x$, we have:

$$
\frac{d p_x(x)}{dx} \Big|_{\hat{x}} = 0
$$

Therefore, when $y = \hat{y}$ with $\hat{x} = g(\hat{y})$, the right side of $(**)$ is 0, so the first term in $(*)$ equals 0. However, because of the second term in $(*)$, the derivative need not equal 0. When a linear transformation is applied (e.g. $x = ay + b$), the second term in $(*)$ vanishes. A simple example is:

$$
p_x(x) = 2x, \quad x \in [0, 1] \quad \Rightarrow \quad \hat{x} = 1
$$

Given that $x = \sin(y)$, we have $p_y(y) = 2 \sin(y)\, |\cos(y)|$ for $y \in [0, \frac{\pi}{2}]$, which simplifies to:

$$
p_y(y) = \sin(2y), \quad y \in [0, \tfrac{\pi}{2}] \quad \Rightarrow \quad \hat{y} = \frac{\pi}{4}
$$

However, it is clear that:

$$
\hat{x} \neq \sin(\hat{y})
$$

Problem 1.5 Solution

This problem uses the linearity of expectation:

$$
\begin{aligned}
\mathrm{var}[f] &= \mathbb{E}\big[ ( f(x) - \mathbb{E}[f(x)] )^2 \big] \\
&= \mathbb{E}\big[ f(x)^2 - 2 f(x)\, \mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2 \big] \\
&= \mathbb{E}[f(x)^2] - 2\, \mathbb{E}[f(x)]^2 + \mathbb{E}[f(x)]^2 \\
\Rightarrow\quad \mathrm{var}[f] &= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\end{aligned}
$$

Problem 1.6 Solution

Based on (1.41), we only need to prove that when $x$ and $y$ are independent, $\mathbb{E}_{x,y}[xy] = \mathbb{E}[x]\,\mathbb{E}[y]$. Because $x$ and $y$ are independent, we have:

$$
p(x, y) = p_x(x)\, p_y(y)
$$

Therefore:

$$
\begin{aligned}
\iint xy\, p(x, y)\, dx\, dy &= \iint xy\, p_x(x)\, p_y(y)\, dx\, dy \\
&= \Big( \int x\, p_x(x)\, dx \Big) \Big( \int y\, p_y(y)\, dy \Big) \\
\Rightarrow\quad \mathbb{E}_{x,y}[xy] &= \mathbb{E}[x]\, \mathbb{E}[y]
\end{aligned}
$$

Problem 1.7 Solution

This problem uses integration by substitution.
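$$
\begin{aligned}
I^2 &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} x^2 - \frac{1}{2\sigma^2} y^2 \Big)\, dx\, dy \\
&= \int_{0}^{2\pi} \int_{0}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} r^2 \Big)\, r\, dr\, d\theta
\end{aligned}
$$

Here we use the substitution $x = r\cos\theta$, $y = r\sin\theta$. Based on the fact that:

$$
\int_{0}^{+\infty} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)\, r\, dr = -\sigma^2 \exp\Big( -\frac{r^2}{2\sigma^2} \Big) \Big|_{0}^{+\infty} = -\sigma^2 (0 - 1) = \sigma^2
$$

$I$ can be solved:

$$
I^2 = \int_{0}^{2\pi} \sigma^2\, d\theta = 2\pi\sigma^2 \quad \Rightarrow \quad I = \sqrt{2\pi}\, \sigma
$$

Next, we show that the Gaussian distribution $\mathcal{N}(x|\mu, \sigma^2)$ is normalized, i.e. $\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = 1$:

$$
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx &= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Big\}\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \qquad (y = x - \mu) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \\
&= 1
\end{aligned}
$$

Problem 1.8 Solution

The first part needs the result of Prob.1.7:

$$
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, x\, dx &= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Big\}\, x\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, (y + \mu)\, dy \qquad (y = x - \mu) \\
&= \mu \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy + \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, y\, dy \\
&= \mu + 0 = \mu
\end{aligned}
$$

The second integral vanishes because its integrand is odd. For the second part, a hint is already given in the problem description. Given the product rule:

$$
\frac{d(fg)}{dx} = f \frac{dg}{dx} + g \frac{df}{dx}
$$

we differentiate both sides of (1.127) with respect to $\sigma^2$ and obtain:

$$
\int_{-\infty}^{+\infty} \Big( -\frac{1}{2\sigma^2} + \frac{(x - \mu)^2}{2\sigma^4} \Big)\, \mathcal{N}(x|\mu, \sigma^2)\, dx = 0
$$

Provided that $\sigma \neq 0$, we get:

$$
\int_{-\infty}^{+\infty} (x - \mu)^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \sigma^2 \int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = \sigma^2
$$

This equation proves (1.51), according to the definition:

$$
\mathrm{var}[x] = \int_{-\infty}^{+\infty} (x - \mathbb{E}[x])^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx
$$

where $\mathbb{E}[x] = \mu$ has already been proved. Therefore:

$$
\mathrm{var}[x] = \sigma^2
$$

Finally,

$$
\mathbb{E}[x^2] = \mathrm{var}[x] + \mathbb{E}[x]^2 = \sigma^2 + \mu^2
$$

Problem 1.9 Solution

Here we only focus on (1.52), because (1.52) is the general form of (1.46).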
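Based on the definition that the maximum of a distribution is known as its mode, and on (1.52), we obtain:

$$
\begin{aligned}
\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \mathbf{x}} &= -\frac{1}{2} \big[ \boldsymbol{\Sigma}^{-1} + (\boldsymbol{\Sigma}^{-1})^T \big] (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \\
&= -\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})
\end{aligned}
$$

where we use:

$$
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\, \mathbf{x} \quad \text{and} \quad (\boldsymbol{\Sigma}^{-1})^T = \boldsymbol{\Sigma}^{-1}
$$

Therefore, only when $\mathbf{x} = \boldsymbol{\mu}$:

$$
\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \mathbf{x}} = 0
$$

Note: Strictly, the Hessian matrix should also be computed to prove that this point is a maximum. Here, however, the first derivative has only one root, and based on the description in the problem this point must be the maximum.

Problem 1.10 Solution

We solve this problem using the definitions of expectation, variance and independence.

$$
\begin{aligned}
\mathbb{E}[x + z] &= \iint (x + z)\, p(x, z)\, dx\, dz \\
&= \iint (x + z)\, p(x)\, p(z)\, dx\, dz \\
&= \iint x\, p(x)\, p(z)\, dx\, dz + \iint z\, p(x)\, p(z)\, dx\, dz \\
&= \Big( \int p(z)\, dz \Big) \int x\, p(x)\, dx + \Big( \int p(x)\, dx \Big) \int z\, p(z)\, dz \\
&= \int x\, p(x)\, dx + \int z\, p(z)\, dz \\
&= \mathbb{E}[x] + \mathbb{E}[z]
\end{aligned}
$$

$$
\begin{aligned}
\mathrm{var}[x + z] &= \iint ( x + z - \mathbb{E}[x + z] )^2\, p(x, z)\, dx\, dz \\
&= \iint \big\{ (x + z)^2 - 2 (x + z)\, \mathbb{E}[x + z] + \mathbb{E}^2[x + z] \big\}\, p(x, z)\, dx\, dz \\
&= \iint (x + z)^2\, p(x, z)\, dx\, dz - 2\, \mathbb{E}[x + z] \iint (x + z)\, p(x, z)\, dx\, dz + \mathbb{E}^2[x + z] \\
&= \iint (x + z)^2\, p(x, z)\, dx\, dz - \mathbb{E}^2[x + z] \\
&= \iint (x^2 + 2xz + z^2)\, p(x)\, p(z)\, dx\, dz - \mathbb{E}^2[x + z] \\
&= \Big( \int p(z)\, dz \Big) \int x^2 p(x)\, dx + 2 \iint xz\, p(x)\, p(z)\, dx\, dz + \Big( \int p(x)\, dx \Big) \int z^2 p(z)\, dz - \mathbb{E}^2[x + z] \\
&= \mathbb{E}[x^2] + \mathbb{E}[z^2] - \mathbb{E}^2[x + z] + 2 \iint xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathbb{E}[x^2] + \mathbb{E}[z^2] - ( \mathbb{E}[x] + \mathbb{E}[z] )^2 + 2 \iint xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathbb{E}[x^2] - \mathbb{E}^2[x] + \mathbb{E}[z^2] - \mathbb{E}^2[z] - 2\, \mathbb{E}[x]\, \mathbb{E}[z] + 2 \iint xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathrm{var}[x] + \mathrm{var}[z] - 2\, \mathbb{E}[x]\, \mathbb{E}[z] + 2 \Big( \int x\, p(x)\, dx \Big) \Big( \int z\, p(z)\, dz \Big) \\
&= \mathrm{var}[x] + \mathrm{var}[z]
\end{aligned}
$$

Problem 1.11 Solution

We use the prior knowledge that the solutions for $\mu_{ML}$ and $\sigma^2_{ML}$ decouple.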
We first calculate $\mu_{ML}$:

$$
\frac{d \big( \ln p(\mathbf{x}|\mu, \sigma^2) \big)}{d\mu} = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu)
$$

Setting this derivative to zero gives:

$$
\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n
$$

Similarly, because:

$$
\frac{d \big( \ln p(\mathbf{x}|\mu, \sigma^2) \big)}{d\sigma^2} = \frac{1}{2\sigma^4} \Big( \sum_{n=1}^{N} (x_n - \mu)^2 - N\sigma^2 \Big)
$$

setting this derivative to zero gives:

$$
\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
$$

Problem 1.12 Solution

The result for $\mathbb{E}[\mu_{ML}]$ is straightforward, given that the $x_n$ are i.i.d. and follow the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$:

$$
\mathbb{E}[\mu_{ML}] = \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} x_n \Big] = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[x_n] = \mu
$$

For $\mathbb{E}[\sigma^2_{ML}]$, we use (1.56) and what is given in the problem:

$$
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} ( x_n^2 - 2 x_n \mu_{ML} + \mu_{ML}^2 ) \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big] - \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu_{ML} \Big] + \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} \mu_{ML}^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big( \frac{1}{N} \sum_{m=1}^{N} x_m \Big) \Big] + \mathbb{E}[\mu_{ML}^2] \\
&= \mu^2 + \sigma^2 - \frac{2}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] + \frac{1}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2} \big[ N ( N\mu^2 + \sigma^2 ) \big]
\end{aligned}
$$

Therefore we have:

$$
\mathbb{E}[\sigma^2_{ML}] = \Big( \frac{N - 1}{N} \Big) \sigma^2
$$

Problem 1.13 Solution

This problem can be solved with the same method used in Prob.1.12, except that here $\mu$ replaces $\mu_{ML}$:

$$
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2 \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} ( x_n^2 - 2 x_n \mu + \mu^2 ) \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big] - \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu \Big] + \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} \mu^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2\mu}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] + \mu^2 \\
&= \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \\
&= \sigma^2
\end{aligned}
$$

Note: The key difference between Prob.1.12 and Prob.1.13 is whether the mean of the Gaussian distribution is known in advance (Prob.1.13) or not (Prob.1.12). The difference shows up in the following equations:

$$
\mathbb{E}[\mu^2] = \mu^2
$$

(here $\mu$ is a fixed quantity, so its expectation is itself, and the same holds for $\mu^2$), whereas

$$
\mathbb{E}[\mu_{ML}^2] = \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big)^2 \Big] = \frac{1}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] = \frac{1}{N^2}\, N ( N\mu^2 + \sigma^2 ) = \mu^2 + \frac{\sigma^2}{N}
$$

Problem 1.14 Solution

This problem is analogous to the fact that any function $f(x)$ can be written as the sum of an odd function and an even function. If we let:

$$
w^S_{ij} = \frac{w_{ij} + w_{ji}}{2} \quad \text{and} \quad w^A_{ij} = \frac{w_{ij} - w_{ji}}{2}
$$

then they clearly satisfy the constraints described in the problem:

$$
w_{ij} = w^S_{ij} + w^A_{ij}, \quad w^S_{ij} = w^S_{ji}, \quad w^A_{ij} = -w^A_{ji}
$$

To prove (1.132), we only need to simplify it:

$$
\begin{aligned}
\sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j &= \sum_{i=1}^{D} \sum_{j=1}^{D} ( w^S_{ij} + w^A_{ij} )\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^S_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j
\end{aligned}
$$

Therefore, we only need to prove that the second term equals 0. We use a simple trick and prove that twice the second term equals 0:

$$
\begin{aligned}
2 \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j &= \sum_{i=1}^{D} \sum_{j=1}^{D} ( w^A_{ij} + w^A_{ij} )\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} ( w^A_{ij} - w^A_{ji} )\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j - \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ji} x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j - \sum_{j=1}^{D} \sum_{i=1}^{D} w^A_{ji} x_j x_i \\
&= 0
\end{aligned}
$$

Therefore, we can choose the coefficient matrix to be symmetric, as described in the problem. By symmetry, the whole matrix is determined if and only if $w_{ij}$ is given for $i = 1, 2, ..., D$ and $i \leq j$. Hence, the number of independent parameters is:

$$
D + (D - 1) + ... + 1 = \frac{D(D + 1)}{2}
$$

Note: Intuitively, once the upper triangular part of a symmetric matrix is given, the whole matrix is determined.

Problem 1.15 Solution

This problem is a more general form of Prob.1.14, so the same method can be used: we find a way to express $\widetilde{w}_{i_1 i_2 ... i_M}$ in terms of $w_{i_1 i_2 ... i_M}$. We begin by introducing a mapping function:

$$
F( x_{i_1} x_{i_2} ... x_{i_M} ) = x_{j_1} x_{j_2} ... x_{j_M}
$$

$$
\text{s.t.} \quad \bigcup_{k=1}^{M} x_{i_k} = \bigcup_{k=1}^{M} x_{j_k} \ \text{(counting repeated factors)}, \quad \text{and} \quad j_1 \geq j_2 \geq j_3 ... \geq j_M
$$

It is complicated to write $F$ in closed mathematical form.
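Operationally, though, $F$ amounts to sorting the subscripts of a monomial in decreasing order. The following is a minimal Python sketch of that idea, not part of the original solution: the names `canonical_form` and `count_independent_terms` are hypothetical, and a monomial $x_{i_1} ... x_{i_M}$ is assumed to be represented by its tuple of subscripts.

```python
from itertools import product


def canonical_form(indices):
    """Hypothetical stand-in for F: sort the subscripts in decreasing order."""
    return tuple(sorted(indices, reverse=True))


def count_independent_terms(D, M):
    """Count the distinct images of F over all tuples (i_1, ..., i_M), 1 <= i_k <= D.

    Brute-force enumeration, intended only for small D and M.
    """
    all_index_tuples = product(range(1, D + 1), repeat=M)
    return len({canonical_form(idx) for idx in all_index_tuples})


print(canonical_form((5, 2, 3, 2)))   # (5, 3, 2, 2)
print(count_independent_terms(3, 2))  # 6, matching D(D+1)/2 from Prob.1.14
```

Each distinct image of $F$ corresponds to one independent coefficient, so for $M = 2$ the brute-force count agrees with the $D(D+1)/2$ result of Prob.1.14.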
In words, $F$ does a simple job: it rearranges the factors in decreasing order of their subscripts. Several examples are given below for $D = 5$, $M = 4$:

$$
\begin{aligned}
F( x_5 x_2 x_3 x_2 ) &= x_5 x_3 x_2 x_2 \\
F( x_1 x_3 x_3 x_2 ) &= x_3 x_3 x_2 x_1 \\
F( x_1 x_4 x_2 x_3 ) &= x_4 x_3 x_2 x_1 \\
F( x_1 x_1 x_5 x_2 ) &= x_5 x_2 x_1 x_1
\end{aligned}
$$

After introducing $F$, the solution is simple, because $F$ does not change the value of a term but only rearranges it:

$$
\sum_{i_1=1}^{D} \sum_{i_2=1}^{D} ... \sum_{i_M=1}^{D} w_{i_1 i_2 ... i_M}\, x_{i_1} x_{i_2} ... x_{i_M} = \sum_{j_1=1}^{D} \sum_{j_2=1}^{j_1} ... \sum_{j_M=1}^{j_{M-1}} \widetilde{w}_{j_1 j_2 ... j_M}\, x_{j_1} x_{j_2} ... x_{j_M}
$$

where

$$
\widetilde{w}_{j_1 j_2 ... j_M} = \sum_{w \in \Omega} w, \qquad \Omega = \{ w_{i_1 i_2 ... i_M} \mid F( x_{i_1} x_{i_2} ... x_{i_M} ) = x_{j_1} x_{j_2} ... x_{j_M},\ \forall\, x_{i_1} x_{i_2} ... x_{i_M} \}
$$

So far, we have proven (1.134). Mathematical induction will be used to prove (1.135), and we begin with the base case $D = 1$, i.e. $n(1, M) = n(1, M - 1)$. When $D = 1$, (1.134) degenerates into $\widetilde{w} x_1^M$, i.e. it has only one term, whose single coefficient $\widetilde{w}$ is present regardless of the value of $M$. Therefore $n(1, M) = 1$ for every $M$, and the base case holds.

Suppose (1.135) holds for $D$; we prove that it also holds for $D + 1$, which establishes (1.135) by induction. We begin with (1.134):

$$
\sum_{i_1=1}^{D+1} \sum_{i_2=1}^{i_1} ... \sum_{i_M=1}^{i_{M-1}} \widetilde{w}_{i_1 i_2 ... i_M}\, x_{i_1} x_{i_2} ... x_{i_M} \qquad (*)
$$

We divide $(*)$ into two parts based on the first summation: the first part consists of $i_1 = 1, 2, ..., D$ and the second part of $i_1 = D + 1$. After the division, the first part corresponds to $n(D, M)$ and the second part to $n(D + 1, M - 1)$. Therefore we obtain:

$$
n(D + 1, M) = n(D, M) + n(D + 1, M - 1) \qquad (**)
$$

Given that (1.135) holds for $D$:

$$
n(D, M) = \sum_{i=1}^{D} n(i, M - 1)
$$

we substitute it into $(**)$:

$$
n(D + 1, M) = \sum_{i=1}^{D} n(i, M - 1) + n(D + 1, M - 1) = \sum_{i=1}^{D+1} n(i, M - 1)
$$

We will prove (1.136) in a different but simple way.
We rewrite (1.136) in terms of binomial coefficients:

$$
\sum_{i=1}^{D} C_{i+M-2}^{M-1} = C_{D+M-1}^{M}
$$

Firstly, we expand the summation:

$$
C_{M-1}^{M-1} + C_{M}^{M-1} + ... + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}
$$

Secondly, we rewrite the first term on the left side as $C_{M}^{M}$, because $C_{M-1}^{M-1} = C_{M}^{M} = 1$. In other words, we only need to prove:

$$
C_{M}^{M} + C_{M}^{M-1} + ... + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}
$$

Thirdly, we use the property $C_{N}^{r} = C_{N-1}^{r} + C_{N-1}^{r-1}$. We can then recursively combine the first and second terms on the left side, and the sum ultimately equals the right side.

(1.137) gives the closed form of $n(D, M)$, and we need all the conclusions above to prove it. We first build some intuition from $M = 0, 1, 2$. When $M = 0$, (1.134) consists of only a constant term, which means $n(D, 0) = 1$. When $M = 1$, clearly $n(D, 1) = D$, because in this case (1.134) has only $D$ terms when expanded. When $M = 2$, it reduces to Prob.1.14, so $n(D, 2) = \frac{D(D+1)}{2}$. Suppose (1.137) holds for $M - 1$; we prove that it also holds for $M$:

$$
\begin{aligned}
n(D, M) &= \sum_{i=1}^{D} n(i, M - 1) && \text{(based on (1.135))} \\
&= \sum_{i=1}^{D} C_{i+M-2}^{M-1} && \text{(because (1.137) holds for } M - 1 \text{)} \\
&= C_{M-1}^{M-1} + C_{M}^{M-1} + C_{M+1}^{M-1} + ... + C_{D+M-2}^{M-1} \\
&= ( C_{M}^{M} + C_{M}^{M-1} ) + C_{M+1}^{M-1} + ... + C_{D+M-2}^{M-1} \\
&= ( C_{M+1}^{M} + C_{M+1}^{M-1} ) + ... + C_{D+M-2}^{M-1} \\
&= C_{M+2}^{M} + ... + C_{D+M-2}^{M-1} \\
&= ... = C_{D+M-1}^{M}
\end{aligned}
$$

This completes the proof.

Problem 1.16 Solution

This problem can be solved in the same way as Prob.1.15. Firstly, we write the expression consisting of all the independent terms up to $M$th order, corresponding to $N(D, M)$. By adding a summation over $m$ on the left side of (1.134), we obtain:

$$
\sum_{m=0}^{M} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} ... \sum_{i_m=1}^{i_{m-1}} \widetilde{w}_{i_1 i_2 ... i_m}\, x_{i_1} x_{i_2} ... x_{i_m} \qquad (*)
$$

(1.138) is clear if we view $m$ as a looping variable that iterates through all orders less than or equal to $M$: for every order $m$, the number of independent parameters is $n(D, m)$. We now prove (1.138) formally by mathematical induction. When $M = 1$, $(*)$ reduces to two parts: $m = 0$, corresponding to $n(D, 0)$, and $m = 1$, corresponding to $n(D, 1)$. Therefore $N(D, 1) = n(D, 0) + n(D, 1)$. Suppose (1.138) holds for $M$; we show that it also holds for $M + 1$. We begin by writing all the independent terms based on $(*)$:

$$
\sum_{m=0}^{M+1} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} ... \sum_{i_m=1}^{i_{m-1}} \widetilde{w}_{i_1 i_2 ... i_m}\, x_{i_1} x_{i_2} ... x_{i_m} \qquad (**)
$$

Using the same technique as in Prob.1.15, we divide $(**)$ into two parts based on the summation over $m$: the first part consists of $m = 0, 1, ..., M$ and the second part of $m = M + 1$. Hence, the first part corresponds to $N(D, M)$ and the second part to $n(D, M + 1)$. So we obtain:

$$
N(D, M + 1) = N(D, M) + n(D, M + 1)
$$

Then we substitute (1.138) into the equation above:

$$
N(D, M + 1) = \sum_{m=0}^{M} n(D, m) + n(D, M + 1) = \sum_{m=0}^{M+1} n(D, m)
$$

To prove (1.139), we use the same telescoping technique as in Prob.1.15 instead of mathematical induction. We begin with the already proved (1.138):

$$
N(D, M) = \sum_{m=0}^{M} n(D, m)
$$

We then use (1.137):

$$
\begin{aligned}
N(D, M) &= \sum_{m=0}^{M} C_{D+m-1}^{m} \\
&= C_{D-1}^{0} + C_{D}^{1} + C_{D+1}^{2} + ... + C_{D+M-1}^{M} \\
&= ( C_{D}^{0} + C_{D}^{1} ) + C_{D+1}^{2} + ... + C_{D+M-1}^{M} \\
&= ( C_{D+1}^{1} + C_{D+1}^{2} ) + ... + C_{D+M-1}^{M} \\
&= ... = C_{D+M}^{M}
\end{aligned}
$$

As asked by the problem, we now examine how fast $N(D, M)$ grows. In $N(D, M)$, $D$ and $M$ are symmetric, so we only need to prove that it grows like $D^M$ when $D \gg M$; the case $M \gg D$ then follows by symmetry.

$$
\begin{aligned}
N(D, M) &= \frac{(D + M)!}{D!\, M!} \approx \frac{(D + M)^{D+M}}{D^D M^M} \\
&= \frac{1}{M^M} \Big( \frac{D + M}{D} \Big)^{D} (D + M)^M \\
&= \frac{1}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^{M} (D + M)^M \\
&\approx \Big( \frac{e}{M} \Big)^{M} (D + M)^M \\
&= \frac{e^M}{M^M} \Big( 1 + \frac{M}{D} \Big)^{M} D^M \\
&= \frac{e^M}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^{\frac{M^2}{D}} D^M \\
&\approx \frac{e^{M + \frac{M^2}{D}}}{M^M}\, D^M \approx \frac{e^M}{M^M}\, D^M
\end{aligned}
$$

where we use Stirling's approximation, $\lim_{n \to +\infty} (1 + \frac{1}{n})^n = e$, and $e^{\frac{M^2}{D}} \approx e^0 = 1$. According to the description in the problem, when $D \gg M$ we can view $\frac{e^M}{M^M}$ as a constant, so $N(D, M)$ grows like $D^M$ in this case. By symmetry, $N(D, M)$ grows like $M^D$ when $M \gg D$.

Finally, we are asked to calculate $N(10, 3)$ and $N(100, 3)$:

$$
N(10, 3) = C_{13}^{3} = 286, \qquad N(100, 3) = C_{103}^{3} = 176851
$$

Problem 1.17 Solution

$$
\begin{aligned}
\Gamma(x + 1) &= \int_{0}^{+\infty} u^x e^{-u}\, du \\
&= \int_{0}^{+\infty} -u^x\, d e^{-u} \\
&= -u^x e^{-u} \Big|_{0}^{+\infty} - \int_{0}^{+\infty} e^{-u}\, d(-u^x) \\
&= -u^x e^{-u} \Big|_{0}^{+\infty} + x \int_{0}^{+\infty} e^{-u} u^{x-1}\, du \\
&= -u^x e^{-u} \Big|_{0}^{+\infty} + x\, \Gamma(x)
\end{aligned}
$$

Here we have used integration by parts, and according to the equation above we only need to prove that the first term equals 0. By L'Hospital's rule:

$$
\lim_{u \to +\infty} -\frac{u^x}{e^u} = \lim_{u \to +\infty} -\frac{x!}{e^u} = 0
$$

Also, when $u = 0$, $-u^x e^{-u} = 0$, so we have proved $\Gamma(x + 1) = x\, \Gamma(x)$. Based on the definition of $\Gamma(x)$, we can write:

$$
\Gamma(1) = \int_{0}^{+\infty} e^{-u}\, du = -e^{-u} \Big|_{0}^{+\infty} = -(0 - 1) = 1
$$

Therefore, when $x$ is a positive integer:

$$
\Gamma(x + 1) = x\, \Gamma(x) = x (x - 1)\, \Gamma(x - 1) = ... = x!\, \Gamma(1) = x!
$$

Problem 1.18 Solution

Based on (1.124) and (1.126), and by substituting $x$ with $\sqrt{2}\sigma y$, we obtain:

$$
\int_{-\infty}^{+\infty} e^{-x_i^2}\, dx_i = \sqrt{\pi}
$$

Therefore, the left side of (1.142) equals $\pi^{\frac{D}{2}}$. For the right side of (1.142):

$$
\begin{aligned}
S_D \int_{0}^{+\infty} e^{-r^2} r^{D-1}\, dr &= S_D \int_{0}^{+\infty} e^{-u} u^{\frac{D-1}{2}}\, d\sqrt{u} \qquad (u = r^2) \\
&= \frac{S_D}{2} \int_{0}^{+\infty} e^{-u} u^{\frac{D}{2} - 1}\, du \\
&= \frac{S_D}{2}\, \Gamma\Big( \frac{D}{2} \Big)
\end{aligned}
$$

Hence, we obtain:

$$
\pi^{\frac{D}{2}} = \frac{S_D}{2}\, \Gamma\Big( \frac{D}{2} \Big) \quad \Rightarrow \quad S_D = \frac{2\pi^{\frac{D}{2}}}{\Gamma( \frac{D}{2} )}
$$

$S_D$ is the surface area of the sphere with radius 1 in dimension $D$. We can extend the conclusion: the surface area of the sphere with radius $r$ in dimension $D$ equals $S_D \cdot r^{D-1}$, which reduces to $S_D$ when $r = 1$.
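The formula for $S_D$ can be checked against the familiar cases $D = 2$ (circumference $2\pi$) and $D = 3$ (area $4\pi$). The following is a small Python sketch of that check, using only the standard library; the function names are illustrative and not from the text.

```python
import math


def unit_sphere_surface_area(D):
    """S_D = 2 * pi^(D/2) / Gamma(D/2), the result derived in Prob.1.18."""
    return 2.0 * math.pi ** (D / 2) / math.gamma(D / 2)


def sphere_surface_area(D, r):
    """Surface area of a radius-r sphere in D dimensions: S_D * r^(D-1)."""
    return unit_sphere_surface_area(D) * r ** (D - 1)


# D = 2 should give the circumference of the unit circle, D = 3 the area of the unit sphere.
print(unit_sphere_surface_area(2), 2 * math.pi)
print(unit_sphere_surface_area(3), 4 * math.pi)
```

Both printed pairs should agree up to floating-point rounding.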
The scaling $S_D \cdot r^{D-1}$ is intuitive once you note that the surface area of a sphere in dimension $D$ is proportional to the $(D-1)$th power of its radius, i.e. $r^{D-1}$. Considering the relationship between the volume $V$ and the surface area $S$ of a sphere with arbitrary radius in dimension $D$, $\frac{dV}{dr} = S$, we obtain:

$$
V = \int S\, dr = \int S_D r^{D-1}\, dr = \frac{S_D}{D}\, r^D
$$

The equation above gives the volume of a sphere with radius $r$ in dimension $D$, so letting $r = 1$:

$$
V_D = \frac{S_D}{D}
$$

For $D = 2$ and $D = 3$:

$$
V_2 = \frac{S_2}{2} = \frac{1}{2} \cdot \frac{2\pi}{\Gamma(1)} = \pi
$$

$$
V_3 = \frac{S_3}{3} = \frac{1}{3} \cdot \frac{2\pi^{\frac{3}{2}}}{\Gamma( \frac{3}{2} )} = \frac{1}{3} \cdot \frac{2\pi^{\frac{3}{2}}}{\frac{\sqrt{\pi}}{2}} = \frac{4}{3}\pi
$$

Problem 1.19 Solution

A hint was already given in the solution of Prob.1.18, and here we state it more clearly: the volume of a sphere with radius $r$ is $V_D \cdot r^D$. This is similar to the conclusion obtained in Prob.1.18 about the surface area, except that the volume is proportional to the $D$th power of the radius, i.e. $r^D$ rather than $r^{D-1}$.

$$
\frac{\text{volume of sphere}}{\text{volume of cube}} = \frac{V_D a^D}{(2a)^D} = \frac{S_D}{2^D D} = \frac{\pi^{\frac{D}{2}}}{2^{D-1} D\, \Gamma( \frac{D}{2} )} \qquad (*)
$$

where we have used the result (1.143). To show that $(*)$ converges to 0 when $D \to +\infty$, we rewrite it:

$$
(*) = \frac{2}{D} \cdot \Big( \frac{\pi}{4} \Big)^{\frac{D}{2}} \cdot \frac{1}{\Gamma( \frac{D}{2} )}
$$

All three factors converge to 0 when $D \to +\infty$, therefore their product also converges to 0. The last part is simple:

$$
\frac{\text{center to one corner}}{\text{center to one side}} = \frac{\sqrt{a^2 \cdot D}}{a} = \sqrt{D} \quad \text{and} \quad \lim_{D \to +\infty} \sqrt{D} = +\infty
$$

Problem 1.20 Solution

The probability density in a thin shell with radius $r$ and thickness $\epsilon$ can be viewed as a constant.
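A sphere in dimension $D$ with radius $r$ has surface area $S_D r^{D-1}$, which was already shown in Prob.1.18, so:

$$
\int_{shell} p(\mathbf{x})\, d\mathbf{x} = p(\mathbf{x}) \int_{shell} d\mathbf{x} = \frac{\exp( -\frac{r^2}{2\sigma^2} )}{(2\pi\sigma^2)^{\frac{D}{2}}} \cdot V(\text{shell}) = \frac{\exp( -\frac{r^2}{2\sigma^2} )}{(2\pi\sigma^2)^{\frac{D}{2}}}\, S_D r^{D-1} \epsilon
$$

Thus we denote:

$$
p(r) = \frac{S_D r^{D-1}}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)
$$

We calculate the derivative of (1.148) with respect to $r$:

$$
\frac{d p(r)}{dr} = \frac{S_D}{(2\pi\sigma^2)^{\frac{D}{2}}}\, r^{D-2} \Big( D - 1 - \frac{r^2}{\sigma^2} \Big) \exp\Big( -\frac{r^2}{2\sigma^2} \Big) \qquad (*)
$$

Setting the derivative to 0 gives its unique positive root (stationary point) $\hat{r} = \sqrt{D - 1}\, \sigma$, because $r \in [0, +\infty)$. When $r < \hat{r}$, the derivative is larger than 0 and $p(r)$ increases with $r$; when $r > \hat{r}$, the derivative is less than 0 and $p(r)$ decreases with $r$. Therefore $\hat{r}$ is the only maximum point. Clearly, when $D \gg 1$, $\hat{r} \approx \sqrt{D}\, \sigma$.

$$
\begin{aligned}
\frac{p(\hat{r} + \epsilon)}{p(\hat{r})} &= \frac{(\hat{r} + \epsilon)^{D-1} \exp( -\frac{(\hat{r} + \epsilon)^2}{2\sigma^2} )}{\hat{r}^{D-1} \exp( -\frac{\hat{r}^2}{2\sigma^2} )} \\
&= \Big( 1 + \frac{\epsilon}{\hat{r}} \Big)^{D-1} \exp\Big( -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} \Big) \\
&= \exp\Big( -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big) \Big)
\end{aligned}
$$

We expand the exponent using a Taylor series, together with $D - 1 = \hat{r}^2 / \sigma^2$:

$$
\begin{aligned}
-\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big) &\approx -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \Big( \frac{\epsilon}{\hat{r}} - \frac{\epsilon^2}{2\hat{r}^2} \Big) \\
&= -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + \frac{2\hat{r}\epsilon - \epsilon^2}{2\sigma^2} \\
&= -\frac{\epsilon^2}{\sigma^2}
\end{aligned}
$$

Therefore, $p(\hat{r} + \epsilon) = p(\hat{r}) \exp( -\frac{\epsilon^2}{\sigma^2} )$.

Note: Here I draw a different conclusion compared with (1.149), but I do not think there is any mistake in my deduction.

Finally, we see from (1.147):

$$
p(\mathbf{x}) \Big|_{\mathbf{x} = 0} = \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}}
$$

$$
p(\mathbf{x}) \Big|_{||\mathbf{x}||^2 = \hat{r}^2} = \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{\hat{r}^2}{2\sigma^2} \Big) \approx \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{D}{2} \Big)
$$

Problem 1.21 Solution

The first part is simple:

$$
(ab)^{\frac{1}{2}} - a = a^{\frac{1}{2}} ( b^{\frac{1}{2}} - a^{\frac{1}{2}} ) \geq 0
$$

where we have used $b \geq a \geq 0$. Based on (1.78):

$$
\begin{aligned}
p(\text{mistake}) &= p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) \\
&= \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x}
\end{aligned}
$$

Recall the decision rule that minimizes misclassification: if $p(\mathbf{x}, \mathcal{C}_1) > p(\mathbf{x}, \mathcal{C}_2)$ for a given value of $\mathbf{x}$, we assign that $\mathbf{x}$ to class $\mathcal{C}_1$.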
Hence in the decision region $\mathcal{R}_1$ we have $p(\mathbf{x}, \mathcal{C}_1) > p(\mathbf{x}, \mathcal{C}_2)$. Applying what we have proved with $a = p(\mathbf{x}, \mathcal{C}_2)$ and $b = p(\mathbf{x}, \mathcal{C}_1)$, we obtain:

$$
\int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} \leq \int_{\mathcal{R}_1} \{ p(\mathbf{x}, \mathcal{C}_1)\, p(\mathbf{x}, \mathcal{C}_2) \}^{\frac{1}{2}}\, d\mathbf{x}
$$

The same holds for the decision region $\mathcal{R}_2$. Therefore we obtain:

$$
p(\text{mistake}) \leq \int \{ p(\mathbf{x}, \mathcal{C}_1)\, p(\mathbf{x}, \mathcal{C}_2) \}^{\frac{1}{2}}\, d\mathbf{x}
$$

Problem 1.22 Solution

We need to understand (1.81) in depth. When $L_{kj} = 1 - I_{kj}$:

$$
\sum_{k} L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) = \sum_{k} p(\mathcal{C}_k|\mathbf{x}) - p(\mathcal{C}_j|\mathbf{x})
$$

Given a specific $\mathbf{x}$, the first term on the right side is a constant equal to 1, no matter which class $\mathcal{C}_j$ we assign $\mathbf{x}$ to. Therefore, to minimize the loss we maximize $p(\mathcal{C}_j|\mathbf{x})$. Hence, we assign $\mathbf{x}$ to the class $\mathcal{C}_j$ with the largest posterior probability $p(\mathcal{C}_j|\mathbf{x})$.

The interpretation of the loss matrix is simple. If we label correctly, there is no loss. Otherwise we incur a loss, of the same size whichever wrong class we choose. The loss matrix is shown below to give an intuitive view:

$$
\begin{pmatrix}
0 & 1 & 1 & \dots & 1 \\
1 & 0 & 1 & \dots & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 1 & 1 & \dots & 0
\end{pmatrix}
$$

Problem 1.23 Solution

$$
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x} = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathcal{C}_k)\, p(\mathbf{x}|\mathcal{C}_k)\, d\mathbf{x}
$$

If we denote a new loss matrix by $L^{\star}_{kj} = L_{kj}\, p(\mathcal{C}_k)$, we obtain a new equation:

$$
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L^{\star}_{kj}\, p(\mathbf{x}|\mathcal{C}_k)\, d\mathbf{x}
$$

Problem 1.24 Solution

The description of this problem is a little confusing. What it really means is that $\lambda$ is the parameter governing the loss, just as $\theta$ governs the posterior probability $p(\mathcal{C}_k|\mathbf{x})$ when we introduce the reject option. Therefore the reject option can be written in a new way, in terms of $\lambda$ and the loss:

$$
\text{choice} = \begin{cases} \text{class } \mathcal{C}_j & \min_{l} \sum_{k} L_{kl}\, p(\mathcal{C}_k|\mathbf{x}) < \lambda \\ \text{reject} & \text{else} \end{cases}
$$

where $\mathcal{C}_j$ is the class that attains the minimum.
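The decision rule just stated can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `decide_with_reject`, the nested-list layout `loss[k][l]` for $L_{kl}$, and the use of `None` to signal rejection are all assumptions made here, not part of the original solution.

```python
def decide_with_reject(loss, posterior, lam):
    """Return the index of the chosen class, or None to reject.

    loss[k][l]   : L_kl, the loss for assigning class l when the true class is k
    posterior[k] : p(C_k | x) for the input under consideration
    lam          : the rejection loss lambda
    """
    n_classes = len(posterior)
    # Expected loss of assigning x to class l: sum_k L_kl * p(C_k | x).
    expected_loss = [
        sum(loss[k][l] * posterior[k] for k in range(n_classes))
        for l in range(n_classes)
    ]
    best = min(range(n_classes), key=lambda l: expected_loss[l])
    return best if expected_loss[best] < lam else None


# Example with a three-class zero-one loss matrix (0 on the diagonal, 1 elsewhere).
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(decide_with_reject(zero_one, [0.7, 0.2, 0.1], lam=0.4))
print(decide_with_reject(zero_one, [0.4, 0.35, 0.25], lam=0.4))
```

In the first call the smallest expected loss is about 0.3, below $\lambda$, so a class is chosen; in the second it is about 0.6, so the input is rejected.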
If $L_{kj} = 1 - I_{kj}$, then according to what we proved in Prob.1.22:

$$
\sum_{k} L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) = \sum_{k} p(\mathcal{C}_k|\mathbf{x}) - p(\mathcal{C}_j|\mathbf{x}) = 1 - p(\mathcal{C}_j|\mathbf{x})
$$

Therefore, the criterion for not rejecting, expressed in terms of $\lambda$, is equivalent to the largest posterior probability being larger than $1 - \lambda$:

$$
\min_{l} \sum_{k} L_{kl}\, p(\mathcal{C}_k|\mathbf{x}) < \lambda \quad \Leftrightarrow \quad \max_{l}\, p(\mathcal{C}_l|\mathbf{x}) > 1 - \lambda
$$

In terms of $\theta$ and the posterior probability, we assign a class to $\mathbf{x}$ (i.e. we do not reject) under the constraint:

$$
\max_{l}\, p(\mathcal{C}_l|\mathbf{x}) > \theta
$$

Hence, comparing the two views, $\lambda$ and $\theta$ are related by:

$$
\lambda + \theta = 1
$$

Problem 1.25 Solution

We could prove this informally by dealing with one dimension at a time, following the same process as in (1.87) - (1.89) until all dimensions are handled, because the total loss can be written as the sum of the losses on each dimension, and these are independent. Here we use a more direct argument. In this case, the expected loss can be written:

$$
\mathbb{E}[L] = \iint \| \mathbf{y}(\mathbf{x}) - \mathbf{t} \|^2\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}\, d\mathbf{x}
$$

Therefore, following the same process as in (1.87) - (1.89):

$$
\frac{\partial \mathbb{E}[L]}{\partial \mathbf{y}(\mathbf{x})} = 2 \int \{ \mathbf{y}(\mathbf{x}) - \mathbf{t} \}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t} = 0
$$

$$
\Rightarrow \quad \mathbf{y}(\mathbf{x}) = \frac{\int \mathbf{t}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}}{p(\mathbf{x})} = \mathbb{E}_{\mathbf{t}}[\mathbf{t}|\mathbf{x}]
$$

Problem 1.26 Solution

The process is identical to the deduction conducted for (1.90), so we do not repeat it here. What should be emphasized is that $\mathbb{E}[t|\mathbf{x}]$ is a function of $\mathbf{x}$, not of $t$. Thus the cross term vanishes once the integral over $t$ is carried out first, and that is how we obtain (1.90).

Note: There is a mistake in (1.90): the second term on the right side is wrong. See (3.37) on P148 for reference. It should be:

$$
\mathbb{E}[L] = \int \{ y(\mathbf{x}) - \mathbb{E}[t|\mathbf{x}] \}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{ \mathbb{E}[t|\mathbf{x}] - t \}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt
$$

This mistake has already been corrected in the errata.

Problem 1.27 Solution

We deal with this problem using the calculus of variations.
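$$
\frac{\partial \mathbb{E}[L_q]}{\partial y(\mathbf{x})} = q \int | y(\mathbf{x}) - t |^{q-1}\, \mathrm{sign}( y(\mathbf{x}) - t )\, p(\mathbf{x}, t)\, dt = 0
$$

$$
\Rightarrow \quad \int_{-\infty}^{y(\mathbf{x})} | y(\mathbf{x}) - t |^{q-1}\, p(\mathbf{x}, t)\, dt = \int_{y(\mathbf{x})}^{+\infty} | y(\mathbf{x}) - t |^{q-1}\, p(\mathbf{x}, t)\, dt
$$

$$
\Rightarrow \quad \int_{-\infty}^{y(\mathbf{x})} | y(\mathbf{x}) - t |^{q-1}\, p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{+\infty} | y(\mathbf{x}) - t |^{q-1}\, p(t|\mathbf{x})\, dt
$$

where we use $p(\mathbf{x}, t) = p(t|\mathbf{x})\, p(\mathbf{x})$ and the property of the sign function. Hence, when $q = 1$, the equation above reduces to:

$$
\int_{-\infty}^{y(\mathbf{x})} p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{+\infty} p(t|\mathbf{x})\, dt
$$

In other words, when $q = 1$ the optimal $y(\mathbf{x})$ is given by the conditional median.

The case $q \to 0$ is non-trivial. We need to rewrite (1.91):

$$
\begin{aligned}
\mathbb{E}[L_q] &= \int \Big\{ \int | y(\mathbf{x}) - t |^q\, p(t|\mathbf{x})\, p(\mathbf{x})\, dt \Big\}\, d\mathbf{x} \\
&= \int p(\mathbf{x}) \Big\{ \int | y(\mathbf{x}) - t |^q\, p(t|\mathbf{x})\, dt \Big\}\, d\mathbf{x} \qquad (*)
\end{aligned}
$$

To minimize $\mathbb{E}[L_q]$, we only need to minimize the inner integral of $(*)$ for each $\mathbf{x}$:

$$
\int | y(\mathbf{x}) - t |^q\, p(t|\mathbf{x})\, dt \qquad (**)
$$

When $q \to 0$, $| y(\mathbf{x}) - t |^q$ is close to 1 everywhere except in a small neighborhood around $t = y(\mathbf{x})$ (this can be seen from Fig 1.29). Therefore:

$$
(**) \approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} ( 1 - | y(\mathbf{x}) - t |^q )\, p(t|\mathbf{x})\, dt \approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} p(t|\mathbf{x})\, dt
$$

where $\epsilon$ denotes the small neighborhood around $t = y(\mathbf{x})$ and $U$ denotes the whole space in which $t$ lies. Note that the first term does not depend on $y(\mathbf{x})$, but the second term does, because the choice of $y(\mathbf{x})$ determines the location of $\epsilon$. Hence we place $\epsilon$ where $p(t|\mathbf{x})$ achieves its largest value, i.e. at the mode, because this yields the largest reduction. Therefore, it is natural to choose $y(\mathbf{x})$ equal to the $t$ that maximizes $p(t|\mathbf{x})$ for every $\mathbf{x}$.

Problem 1.28 Solution

This problem is about the definition of information content, i.e. $h(x)$. We first restate the problem more precisely. In information theory, $h(\cdot)$ is also called the information content and is denoted $I(\cdot)$. Here we continue to use $h(\cdot)$ for consistency. The whole problem concerns the properties of $h(x)$.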
Based on our knowledge that $h(\cdot)$ is a monotonic function of the probability $p(x)$, we can write:

$$
h(x) = f( p(x) )
$$

This equation says that the information we obtain from a specific value of a random variable $x$ depends on its probability of occurrence $p(x)$, and the relationship is given by a mapping function $f(\cdot)$. Suppose $C$ is the intersection of two independent events $A$ and $B$; then the information from event $C$ occurring is the combined message of both independent events $A$ and $B$ occurring:

$$
h(C) = h(A \cap B) = h(A) + h(B) \qquad (*)
$$

Because $A$ and $B$ are independent:

$$
P(C) = P(A) \cdot P(B)
$$

We apply the function $f(\cdot)$ to both sides:

$$
f( P(C) ) = f( P(A) \cdot P(B) ) \qquad (**)
$$

Moreover, the left sides of $(*)$ and $(**)$ are equal by definition, so we obtain:

$$
h(A) + h(B) = f( P(A) \cdot P(B) ) \quad \Rightarrow \quad f( P(A) ) + f( P(B) ) = f( P(A) \cdot P(B) )
$$

We thus obtain an important property of the function $f(\cdot)$: $f(x \cdot y) = f(x) + f(y)$.

Note: What Problem 1.28 really asks us to prove concerns the form and properties of the function $f$ in our formulation, because the problem description contains the sentence: "In this exercise, we derive the relation between h and p in the form of a function h(p)" (i.e. $f(\cdot)$ in our formulation is equivalent to $h(p)$ in the description).

At present, what we know is the following property of $f(\cdot)$:

$$
f(xy) = f(x) + f(y) \qquad (*)
$$

Firstly, we choose $x = y$, and then clearly $f(x^2) = 2 f(x)$. Secondly, $f(x^n) = n f(x)$, $n \in \mathbb{N}$, is true for $n = 1$ and $n = 2$. Suppose it is true for $n$; we prove it is true for $n + 1$:

$$
f(x^{n+1}) = f(x^n) + f(x) = n f(x) + f(x) = (n + 1) f(x)
$$

Therefore, $f(x^n) = n f(x)$, $n \in \mathbb{N}$, has been proved. For a positive integer $m$, we rewrite $x^n$ as $( x^{\frac{n}{m}} )^m$ and use what we have proved to obtain:

$$
f(x^n) = f\big( ( x^{\frac{n}{m}} )^m \big) = m f( x^{\frac{n}{m}} )
$$

Because $f(x^n)$ also equals $n f(x)$, we have $n f(x) = m f( x^{\frac{n}{m}} )$.
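We simplify the equation and obtain:

$$
f( x^{\frac{n}{m}} ) = \frac{n}{m} f(x)
$$

For an arbitrary positive $x \in \mathbb{R}^+$, we can find two sequences of positive rationals $\{ y_N \}$ and $\{ z_N \}$ which satisfy:

$$
y_1 < y_2 < ... < y_N < x, \quad \text{and} \quad \lim_{N \to +\infty} y_N = x
$$

$$
z_1 > z_2 > ... > z_N > x, \quad \text{and} \quad \lim_{N \to +\infty} z_N = x
$$

Because the function $f(\cdot)$ is monotonic:

$$
y_N f(p) = f( p^{y_N} ) \leq f( p^x ) \leq f( p^{z_N} ) = z_N f(p)
$$

When $N \to +\infty$, we obtain $f(p^x) = x f(p)$, $x \in \mathbb{R}^+$. Letting $p = e$, this can be rewritten as $f(e^x) = x f(e)$. Finally, we denote $y = e^x$:

$$
f(y) = \ln(y)\, f(e)
$$

where $f(e)$ is a constant once the function $f(\cdot)$ is fixed. Therefore $f(x) \propto \ln(x)$.

Problem 1.29 Solution

This problem is a little tricky. The entropy of an $M$-state discrete random variable $x$ can be written as:

$$
H[x] = -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i)
$$

where $\lambda_i$ is the probability that $x$ takes state $i$. We choose the concave function $f(\cdot) = \ln(\cdot)$ and rewrite Jensen's inequality, i.e. (1.115):

$$
\ln\Big( \sum_{i=1}^{M} \lambda_i x_i \Big) \geq \sum_{i=1}^{M} \lambda_i \ln(x_i)
$$

Choosing $x_i = \frac{1}{\lambda_i}$ and simplifying the equation above, we obtain:

$$
\ln M \geq -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i) = H[x]
$$

Problem 1.30 Solution

Based on the definitions:

$$
\begin{aligned}
\ln\Big\{ \frac{p(x)}{q(x)} \Big\} &= \ln\Big( \frac{s}{\sigma} \Big) - \Big[ \frac{1}{2\sigma^2} (x - \mu)^2 - \frac{1}{2s^2} (x - m)^2 \Big] \\
&= \ln\Big( \frac{s}{\sigma} \Big) - \Big[ \Big( \frac{1}{2\sigma^2} - \frac{1}{2s^2} \Big) x^2 - \Big( \frac{\mu}{\sigma^2} - \frac{m}{s^2} \Big) x + \Big( \frac{\mu^2}{2\sigma^2} - \frac{m^2}{2s^2} \Big) \Big]
\end{aligned}
$$

We use the following equations to solve this problem:

$$
\mathbb{E}[x^2] = \int x^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu^2 + \sigma^2
$$

$$
\mathbb{E}[x] = \int x\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu
$$

$$
\int \mathcal{N}(x|\mu, \sigma^2)\, dx = 1
$$

Given the equations above, it is easy to see:

$$
\begin{aligned}
KL(p\|q) &= -\int p(x) \ln\Big\{ \frac{q(x)}{p(x)} \Big\}\, dx \\
&= \int \mathcal{N}(x|\mu, \sigma^2) \ln\Big\{ \frac{p(x)}{q(x)} \Big\}\, dx \\
&= \ln\Big( \frac{s}{\sigma} \Big) - \Big( \frac{1}{2\sigma^2} - \frac{1}{2s^2} \Big) (\mu^2 + \sigma^2) + \Big( \frac{\mu}{\sigma^2} - \frac{m}{s^2} \Big) \mu - \Big( \frac{\mu^2}{2\sigma^2} - \frac{m^2}{2s^2} \Big) \\
&= \ln\Big( \frac{s}{\sigma} \Big) + \frac{\sigma^2 + (\mu - m)^2}{2s^2} - \frac{1}{2}
\end{aligned}
$$

We discuss this result in more detail. Firstly, if the KL distance is defined with base-2 logarithms, as is common in information theory, the whole expression is divided by $\ln 2$, so the first term becomes $\log_2( \frac{s}{\sigma} )$ instead of $\ln( \frac{s}{\sigma} )$.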
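Secondly, if we denote $x = \frac{s}{\sigma}$, the KL distance can be rewritten as:

$$
KL(p\|q) = \ln(x) + \frac{1}{2x^2} - \frac{1}{2} + a, \quad \text{where } a = \frac{(\mu - m)^2}{2s^2}
$$

We calculate the derivative of $KL$ with respect to $x$ and set it to 0:

$$
\frac{d(KL)}{dx} = \frac{1}{x} - x^{-3} = 0 \quad \Rightarrow \quad x = 1 \quad ( \because s, \sigma > 0 )
$$

When $x < 1$ the derivative is less than 0, and when $x > 1$ it is greater than 0, which makes $x = 1$ the global minimum. When $x = 1$, $KL(p\|q) = a$. Moreover, when $\mu = m$, $a$ achieves its minimum 0. In this way, we have shown that the KL distance between two Gaussian distributions is not less than 0, and that it equals 0 only when the two Gaussian distributions are identical, i.e. have the same mean and variance.

Problem 1.31 Solution

We evaluate $H[\mathbf{x}] + H[\mathbf{y}] - H[\mathbf{x}, \mathbf{y}]$ by definition. Firstly, we calculate $H[\mathbf{x}, \mathbf{y}]$:

$$
\begin{aligned}
H[\mathbf{x}, \mathbf{y}] &= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} - \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x} - \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= H[\mathbf{x}] + H[\mathbf{y}|\mathbf{x}]
\end{aligned}
$$

where we use $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})\, p(\mathbf{y}|\mathbf{x})$, $\int p(\mathbf{x}, \mathbf{y})\, d\mathbf{y} = p(\mathbf{x})$ and (1.111). In doing so, we have actually solved Prob.1.37 as well. We continue the proof for this problem, based on what we have proved:

$$
\begin{aligned}
H[\mathbf{x}] + H[\mathbf{y}] - H[\mathbf{x}, \mathbf{y}] &= H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}] \\
&= -\int p(\mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{y} + \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y} + \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln\Big( \frac{p(\mathbf{x})\, p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})} \Big)\, d\mathbf{x}\, d\mathbf{y} \\
&= KL\big( p(\mathbf{x}, \mathbf{y})\, \|\, p(\mathbf{x})\, p(\mathbf{y}) \big) = I(\mathbf{x}, \mathbf{y}) \geq 0
\end{aligned}
$$

where we use the following properties:

$$
p(\mathbf{y}) = \int p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}, \qquad \frac{p(\mathbf{y})}{p(\mathbf{y}|\mathbf{x})} = \frac{p(\mathbf{x})\, p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}
$$

Moreover, the equality holds if and only if $\mathbf{x}$ and $\mathbf{y}$ are statistically independent, due to the property of the KL distance.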
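The identity $H[\mathbf{x}] + H[\mathbf{y}] - H[\mathbf{x}, \mathbf{y}] = KL( p(\mathbf{x}, \mathbf{y}) \| p(\mathbf{x}) p(\mathbf{y}) )$ can be checked numerically. The following Python sketch makes a simplifying assumption that the text does not: it uses small discrete joint tables (sums instead of integrals), and the function names and example tables are hypothetical.

```python
import math


def entropy(probs):
    """Discrete entropy -sum p ln p, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(joint):
    """Compute I(x, y) two ways for a table joint[i][j] = p(x_i, y_j).

    Returns (H[x] + H[y] - H[x, y], KL(p(x, y) || p(x) p(y))).
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_joint = entropy([p for row in joint for p in row])
    via_entropies = entropy(p_x) + entropy(p_y) - h_joint
    via_kl = sum(
        p * math.log(p / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    return via_entropies, via_kl


dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.06, 0.14], [0.24, 0.56]]  # outer product of (0.2, 0.8) and (0.3, 0.7)
print(mutual_information(dependent))
print(mutual_information(independent))
```

For the dependent table both values should agree and be strictly positive; for the independent table both should be zero up to rounding.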
When $\mathbf{x}$ and $\mathbf{y}$ are independent, i.e. $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x}) p(\mathbf{y})$, the equality can also be seen directly:

$$ \begin{aligned} H[\mathbf{x}, \mathbf{y}] &= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}, \mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \\ &= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}) \, d\mathbf{x} \, d\mathbf{y} - \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \\ &= -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x} - \int p(\mathbf{y}) \ln p(\mathbf{y}) \, d\mathbf{y} \\ &= H[\mathbf{x}] + H[\mathbf{y}] \end{aligned} $$

Problem 1.32 Solution

This follows directly from the definition. Note that when we change variables in an integral, we have to introduce an additional factor, the Jacobian determinant.

$$ \begin{aligned} H[\mathbf{y}] &= -\int p(\mathbf{y}) \ln p(\mathbf{y}) \, d\mathbf{y} \\ &= -\int \frac{p(\mathbf{x})}{|A|} \ln \frac{p(\mathbf{x})}{|A|} \, \Big| \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \Big| \, d\mathbf{x} \\ &= -\int p(\mathbf{x}) \ln \frac{p(\mathbf{x})}{|A|} \, d\mathbf{x} \\ &= -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x} - \int p(\mathbf{x}) \ln \frac{1}{|A|} \, d\mathbf{x} \\ &= H[\mathbf{x}] + \ln |A| \end{aligned} $$

where we have taken advantage of the following equations:

$$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = A, \qquad p(\mathbf{x}) = p(\mathbf{y}) \Big| \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \Big| = p(\mathbf{y}) |A|, \qquad \int p(\mathbf{x}) \, d\mathbf{x} = 1 $$

Problem 1.33 Solution

Based on the definition of entropy, we write:

$$ H[y|x] = -\sum_{x_i} \sum_{y_j} p(x_i, y_j) \ln p(y_j | x_i) $$

By the properties of probability, $0 \le p(y_j|x_i) \le 1$ and $0 \le p(x_i, y_j) \le 1$. Therefore $-p(x_i, y_j) \ln p(y_j|x_i) \ge 0$ when $0 < p(y_j|x_i) \le 1$. When $p(y_j|x_i) = 0$, the fact that $\lim_{p \to 0} p \ln p = 0$ gives $-p(x_i, y_j) \ln p(y_j|x_i) = -p(x_i) \, p(y_j|x_i) \ln p(y_j|x_i) = 0$ (here we view $p(x_i)$ as a constant). Hence no term in the sum above can be less than 0. In other words, $H[y|x]$ equals 0 if and only if every term of $H[y|x]$ equals 0. Therefore, for each possible value of the random variable $x$, denoted as $x_i$:

$$ -\sum_{y_j} p(x_i, y_j) \ln p(y_j | x_i) = 0 \qquad (*) $$

Suppose that, given $x = x_i$, there is more than one possible value $y_j$ of the random variable $y$ such that $p(y_j|x_i) \ne 0$ (because $x_i$, $y_j$ are both "possible", $p(x_i, y_j)$ will also not equal 0). Constrained by $0 \le p(y_j|x_i) \le 1$ and $\sum_j p(y_j|x_i) = 1$, there must then be at least two values of $y$ satisfying $0 < p(y_j|x_i) < 1$, which makes the left side of $(*)$ strictly positive, a contradiction. Therefore, for each possible value of $x$, there will only be one $y$ such that $p(y|x) \ne 0$.
In other words, $y$ is determined by $x$. Note: This result is quite intuitive. If $y$ is a function of $x$, we know the value of $y$ as soon as we observe $x$. Therefore we obtain no additional information when observing $y_j$ given an already observed $x$.

Problem 1.34 Solution

This problem is complicated, so we will explain it in detail. According to Appendix D, we have the relation (D.3):

$$ F[y(x) + \epsilon \eta(x)] = F[y(x)] + \int \frac{\partial F}{\partial y} \, \epsilon \eta(x) \, dx \qquad (**) $$

where $y(x)$ can be viewed as an operator that gives an output value $y$ for any input $x$, and, equivalently, $F[y(x)]$ can be viewed as a functional operator that gives an output value $F[y(x)]$ for any input function $y(x)$. Then we consider the functional:

$$ I[p(x)] = \int p(x) f(x) \, dx $$

Under a small variation $p(x) \to p(x) + \epsilon \eta(x)$, we obtain:

$$ I[p(x) + \epsilon \eta(x)] = \int p(x) f(x) \, dx + \int \epsilon \eta(x) f(x) \, dx $$

Comparing the equation above with $(**)$, we conclude:

$$ \frac{\partial I}{\partial p(x)} = f(x) $$

Similarly, let's consider another functional:

$$ J[p(x)] = \int p(x) \ln p(x) \, dx $$

Then under a small variation $p(x) \to p(x) + \epsilon \eta(x)$:

$$ \begin{aligned} J[p(x) + \epsilon \eta(x)] &= \int \big( p(x) + \epsilon \eta(x) \big) \ln\big( p(x) + \epsilon \eta(x) \big) \, dx \\ &= \int p(x) \ln\big( p(x) + \epsilon \eta(x) \big) \, dx + \int \epsilon \eta(x) \ln\big( p(x) + \epsilon \eta(x) \big) \, dx \end{aligned} $$

Since $\epsilon \eta(x)$ is much smaller than $p(x)$, we write the Taylor expansion at the point $p(x)$:

$$ \ln\big( p(x) + \epsilon \eta(x) \big) = \ln p(x) + \frac{\epsilon \eta(x)}{p(x)} + O\big( (\epsilon \eta(x))^2 \big) $$

Substituting the expansion above into $J[p(x) + \epsilon \eta(x)]$:

$$ J[p(x) + \epsilon \eta(x)] = \int p(x) \ln p(x) \, dx + \int \epsilon \eta(x) \big( \ln p(x) + 1 \big) \, dx + O(\epsilon^2) $$

Therefore, we also obtain:

$$ \frac{\partial J}{\partial p(x)} = \ln p(x) + 1 $$

Now we can go back to (1.108). Based on $\frac{\partial J}{\partial p(x)}$ and $\frac{\partial I}{\partial p(x)}$, we calculate the derivative of the expression just before (1.108) and let it equal 0:

$$ -\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0 $$

Rearranging gives (1.108). From (1.108) we can see that $p(x)$ should take the form of a Gaussian distribution.
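So we rewrite it in Gaussian form and compare it with a Gaussian distribution with mean $\mu$ and variance $\sigma^2$:

$$ \exp(-1 + \lambda_1) = \frac{1}{(2\pi\sigma^2)^{1/2}}, \qquad \exp\big( \lambda_2 x + \lambda_3 (x - \mu)^2 \big) = \exp\Big\{ -\frac{(x - \mu)^2}{2\sigma^2} \Big\} $$

Finally, we obtain:

$$ \lambda_1 = 1 - \frac{1}{2} \ln(2\pi\sigma^2), \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2} $$

Note that there is a typo in the official solution manual about $\lambda_3$. Moreover, in the following part, we will substitute $p(x)$ back into the three constraints and analytically prove that $p(x)$ is Gaussian. You can skip the following part. (The writer would especially thank Dr. Spyridon Chavlis from IMBB, FORTH for this analysis.)

We already know:

$$ p(x) = \exp\big( -1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 \big) $$

where the exponent is equal to:

$$ -1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = \lambda_3 x^2 + (\lambda_2 - 2\lambda_3 \mu) x + (\lambda_3 \mu^2 + \lambda_1 - 1) $$

Completing the square, we obtain:

$$ a x^2 + b x + c = a (x - d)^2 + f, \qquad d = -\frac{b}{2a}, \qquad f = c - \frac{b^2}{4a} $$

Using this quadratic form, the constraints can be written as

1. $\int_{-\infty}^{\infty} p(x) \, dx = \int_{-\infty}^{\infty} e^{[a(x-d)^2 + f]} \, dx = 1$

2. $\int_{-\infty}^{\infty} x \, p(x) \, dx = \int_{-\infty}^{\infty} x \, e^{[a(x-d)^2 + f]} \, dx = \mu$

3. $\int_{-\infty}^{\infty} (x - \mu)^2 p(x) \, dx = \int_{-\infty}^{\infty} (x - \mu)^2 e^{[a(x-d)^2 + f]} \, dx = \sigma^2$

All three integrals converge only if $a < 0$, which we assume from here on. The first constraint can be written as:

$$ I_1 = \int_{-\infty}^{\infty} e^{[a(x-d)^2 + f]} \, dx = e^f \int_{-\infty}^{\infty} e^{a(x-d)^2} \, dx $$

Let $u = x - d$, which gives $du = dx$, and thus:

$$ I_1 = e^f \int_{-\infty}^{\infty} e^{a u^2} \, du $$

Let $-w^2 = a u^2 \Rightarrow w = \sqrt{-a} \, u \Rightarrow dw = \sqrt{-a} \, du$, and thus:

$$ I_1 = \frac{e^f}{\sqrt{-a}} \int_{-\infty}^{\infty} e^{-w^2} \, dw $$

As $e^{-w^2}$ is an even function, the integral can be written as:

$$ I_1 = \frac{2 e^f}{\sqrt{-a}} \int_{0}^{\infty} e^{-w^2} \, dw $$

Let $w^2 = t \Rightarrow w = \sqrt{t} \Rightarrow dw = \frac{1}{2\sqrt{t}} \, dt$, and thus:

$$ I_1 = \frac{2 e^f}{\sqrt{-a}} \int_{0}^{\infty} \frac{1}{2} t^{-\frac{1}{2}} e^{-t} \, dt = \frac{e^f}{\sqrt{-a}} \int_{0}^{\infty} t^{\frac{1}{2} - 1} e^{-t} \, dt = \frac{e^f}{\sqrt{-a}} \, \Gamma\Big(\frac{1}{2}\Big) = e^f \sqrt{\frac{\pi}{-a}} $$

Here the Gamma function is used. The Gamma function is defined as

$$ \Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t} \, dt $$

where for non-negative integer values of $n$, we have:

$$ \Gamma\Big(\frac{1}{2} + n\Big) = \frac{(2n)!}{4^n \, n!} \sqrt{\pi} $$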
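Thus, the first constraint can be rewritten as:

$$ e^f \sqrt{\frac{\pi}{-a}} = 1 \qquad (*) $$

The second constraint can be written as:

$$ I_2 = \int_{-\infty}^{\infty} x \, e^{[a(x-d)^2 + f]} \, dx = e^f \int_{-\infty}^{\infty} x \, e^{a(x-d)^2} \, dx $$

Let $u = x - d \Rightarrow x = u + d \Rightarrow du = dx$, and thus:

$$ I_2 = e^f \int_{-\infty}^{\infty} (u + d) \, e^{a u^2} \, du $$

Using integral additivity, we have:

$$ I_2 = e^f \int_{-\infty}^{\infty} u \, e^{a u^2} \, du + e^f \int_{-\infty}^{\infty} d \, e^{a u^2} \, du $$

We first deal with the first term on the right hand side. Here we denote it as $I_{21}$:

$$ I_{21} = e^f \int_{-\infty}^{\infty} u \, e^{a u^2} \, du = e^f \Big( \int_{-\infty}^{0} u \, e^{a u^2} \, du + \int_{0}^{\infty} u \, e^{a u^2} \, du \Big) $$

Swapping the integration limits of the first integral and then substituting $u \to -u$ in it, we obtain:

$$ \begin{aligned} I_{21} &= e^f \Big( -\int_{0}^{-\infty} u \, e^{a u^2} \, du + \int_{0}^{\infty} u \, e^{a u^2} \, du \Big) \\ &= e^f \Big( -\int_{0}^{\infty} (-u) \, e^{a(-u)^2} \, (-du) + \int_{0}^{\infty} u \, e^{a u^2} \, du \Big) \\ &= e^f \Big( -\int_{0}^{\infty} u \, e^{a u^2} \, du + \int_{0}^{\infty} u \, e^{a u^2} \, du \Big) = 0 \end{aligned} $$

Then we deal with the second term $I_{22}$:

$$ I_{22} = e^f \int_{-\infty}^{\infty} d \, e^{a u^2} \, du $$

Let $-w^2 = a u^2 \Rightarrow w = \sqrt{-a} \, u \Rightarrow dw = \sqrt{-a} \, du$, and thus:

$$ I_{22} = \frac{e^f d}{\sqrt{-a}} \int_{-\infty}^{\infty} e^{-w^2} \, dw $$

As $e^{-w^2}$ is an even function, the integral can be written as:

$$ I_{22} = \frac{2 e^f d}{\sqrt{-a}} \int_{0}^{\infty} e^{-w^2} \, dw $$

Let $w^2 = t \Rightarrow w = \sqrt{t} \Rightarrow dw = \frac{1}{2\sqrt{t}} \, dt$, and thus:

$$ I_{22} = \frac{2 e^f d}{\sqrt{-a}} \int_{0}^{\infty} \frac{1}{2} t^{-\frac{1}{2}} e^{-t} \, dt = \frac{e^f d}{\sqrt{-a}} \int_{0}^{\infty} t^{\frac{1}{2} - 1} e^{-t} \, dt = \frac{e^f d}{\sqrt{-a}} \, \Gamma\Big(\frac{1}{2}\Big) = e^f d \sqrt{\frac{\pi}{-a}} $$

Thus, the second constraint can be rewritten as:

$$ e^f d \sqrt{\frac{\pi}{-a}} = \mu \qquad (**) $$

Combining $(*)$ and $(**)$, we obtain $d = \mu$. Recall that:

$$ d = -\frac{b}{2a} = -\frac{\lambda_2 - 2\lambda_3 \mu}{2\lambda_3} = \mu \quad \Rightarrow \quad \lambda_2 - 2\lambda_3 \mu = -2\lambda_3 \mu \quad \Rightarrow \quad \lambda_2 = 0 $$

So far, we have:

$$ b = -2\lambda_3 \mu $$

and

$$ f = c - \frac{b^2}{4a} = \lambda_3 \mu^2 + \lambda_1 - 1 - \frac{4\lambda_3^2 \mu^2}{4\lambda_3} = \lambda_1 - 1 $$

Finally, we deal with the third and last constraint. Substituting $\lambda_2 = 0$ into the last constraint, we have:

$$ I_3 = \int_{-\infty}^{\infty} (x - \mu)^2 e^{[\lambda_3 (x-\mu)^2 + \lambda_1 - 1]} \, dx = e^{\lambda_1 - 1} \int_{-\infty}^{\infty} (x - \mu)^2 e^{\lambda_3 (x-\mu)^2} \, dx $$

Let $u = x - \mu \Rightarrow du = dx$, and thus:

$$ I_3 = e^{\lambda_1 - 1} \int_{-\infty}^{\infty} u^2 e^{\lambda_3 u^2} \, du $$

Let $-w^2 = \lambda_3 u^2 \Rightarrow w = \sqrt{-\lambda_3} \, u \Rightarrow dw = \sqrt{-\lambda_3} \, du$, and thus:

$$ I_3 = e^{\lambda_1 - 1} \int_{-\infty}^{\infty} \frac{w^2}{-\lambda_3} \, e^{-w^2} \frac{dw}{\sqrt{-\lambda_3}} = \frac{e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \int_{-\infty}^{\infty} w^2 e^{-w^2} \, dw $$

Because the integrand is an even function, we can further obtain:

$$ I_3 = \frac{2 e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \int_{0}^{\infty} w^2 e^{-w^2} \, dw $$

Let $w^2 = t \Rightarrow w = \sqrt{t} \Rightarrow dw = \frac{1}{2\sqrt{t}} \, dt$, and thus:

$$ \begin{aligned} I_3 &= \frac{2 e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \int_{0}^{\infty} t \, e^{-t} \frac{1}{2\sqrt{t}} \, dt = \frac{e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \int_{0}^{\infty} t^{1 - \frac{1}{2}} e^{-t} \, dt \\ &= \frac{e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \int_{0}^{\infty} t^{\frac{3}{2} - 1} e^{-t} \, dt = \frac{e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \, \Gamma\Big(\frac{3}{2}\Big) = \frac{e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \frac{\sqrt{\pi}}{2} \end{aligned} $$

Thus, the third constraint can be rewritten as:

$$ \frac{e^{\lambda_1 - 1}}{(-\lambda_3)^{3/2}} \frac{\sqrt{\pi}}{2} = \sigma^2 \qquad (***) $$

Rewriting $(*)$ with $f = \lambda_1 - 1$, $d = \mu$ and $a = \lambda_3$, we obtain the following equation:

$$ e^{\lambda_1 - 1} \sqrt{\frac{\pi}{-\lambda_3}} = 1 \qquad (****) $$

Substituting the equation above back into $(***)$, we obtain:

$$ \sqrt{\frac{-\lambda_3}{\pi}} \cdot \frac{1}{(-\lambda_3)^{3/2}} \frac{\sqrt{\pi}}{2} = \sigma^2 \quad \Leftrightarrow \quad -\frac{1}{\lambda_3} = 2\sigma^2 \quad \Leftrightarrow \quad \lambda_3 = -\frac{1}{2\sigma^2} $$

Substituting $\lambda_3$ back into $(****)$, we obtain:

$$ e^{\lambda_1 - 1} \sqrt{\frac{\pi}{\frac{1}{2\sigma^2}}} = 1 \quad \Leftrightarrow \quad e^{\lambda_1 - 1} \sqrt{2\pi\sigma^2} = 1 \quad \Leftrightarrow \quad e^{\lambda_1 - 1} = \frac{1}{\sqrt{2\pi\sigma^2}} \quad \Leftrightarrow \quad \lambda_1 - 1 = \ln\Big( \frac{1}{\sqrt{2\pi\sigma^2}} \Big) $$

Thus, we obtain:

$$ \lambda_1 = 1 - \frac{1}{2} \ln(2\pi\sigma^2) $$

So far, we have obtained $\lambda_i$, where $i = 1, 2, 3$. We substitute them back into $p(x)$, yielding:

$$ \begin{aligned} p(x) &= \exp\Big( -1 + 1 - \frac{1}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} (x - \mu)^2 \Big) \\ &= \exp\Big( -\frac{1}{2} \ln(2\pi\sigma^2) \Big) \exp\Big( -\frac{1}{2\sigma^2} (x - \mu)^2 \Big) \\ &= \exp\Big( \ln\Big( \frac{1}{\sqrt{2\pi\sigma^2}} \Big) \Big) \exp\Big( -\frac{1}{2\sigma^2} (x - \mu)^2 \Big) \end{aligned} $$

Thus,

$$ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{1}{2\sigma^2} (x - \mu)^2 \Big) $$

just as required.

Problem 1.35 Solution

If $p(x) = \mathcal{N}(x|\mu, \sigma^2)$, we write its entropy:

$$ \begin{aligned} H[x] &= -\int p(x) \ln p(x) \, dx \\ &= -\int p(x) \ln\Big\{ \frac{1}{\sqrt{2\pi\sigma^2}} \Big\} dx - \int p(x) \Big\{ -\frac{(x - \mu)^2}{2\sigma^2} \Big\} dx \\ &= -\ln\Big\{ \frac{1}{\sqrt{2\pi\sigma^2}} \Big\} + \frac{\sigma^2}{2\sigma^2} \\ &= \frac{1}{2} \big\{ 1 + \ln(2\pi\sigma^2) \big\} \end{aligned} $$

where we have taken advantage of the following properties of a Gaussian distribution:

$$ \int p(x) \, dx = 1 \qquad \text{and} \qquad \int (x - \mu)^2 p(x) \, dx = \sigma^2 $$

Problem 1.36 Solution

Here we should make it clear that if the second derivative is strictly positive, the function must be strictly convex.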
However, the converse may not be true. For example, $f(x) = x^4$, $x \in \mathbb{R}$, is strictly convex by definition, but its second derivative at $x = 0$ is 0 (see the keyword convex function on Wikipedia or Page 71 of the book Convex Optimization written by Boyd and Vandenberghe for more details). Hence, more precisely, we will prove that a function is convex if and only if its second derivative is non-negative. We first consider the Taylor expansions:

$$ f(x + \epsilon) = f(x) + \frac{f'(x)}{1!} \epsilon + \frac{f''(x)}{2!} \epsilon^2 + \frac{f'''(x)}{3!} \epsilon^3 + \dots $$

$$ f(x - \epsilon) = f(x) - \frac{f'(x)}{1!} \epsilon + \frac{f''(x)}{2!} \epsilon^2 - \frac{f'''(x)}{3!} \epsilon^3 + \dots $$

Adding the two expansions and neglecting $O(\epsilon^4)$, we obtain the expression of $f''(x)$:

$$ f''(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) + f(x - \epsilon) - 2 f(x)}{\epsilon^2} $$

If $f(x)$ is convex, we have:

$$ f(x) = f\Big( \frac{1}{2}(x + \epsilon) + \frac{1}{2}(x - \epsilon) \Big) \le \frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) $$

Hence $f''(x) \ge 0$. The converse is a little more involved. We use the Lagrange form of Taylor's theorem to rewrite the expansion above:

$$ f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^\star)}{2} (x - x_0)^2 $$

where $x^\star$ lies between $x$ and $x_0$. By hypothesis $f''(x) \ge 0$, so the last term is non-negative for all $x$. We let $x_0 = \lambda x_1 + (1 - \lambda) x_2$ and $x = x_1$:

$$ f(x_1) \ge f(x_0) + (1 - \lambda)(x_1 - x_2) f'(x_0) \qquad (*) $$

And then, we let $x = x_2$:

$$ f(x_2) \ge f(x_0) + \lambda (x_2 - x_1) f'(x_0) \qquad (**) $$

We multiply $(*)$ by $\lambda$ and $(**)$ by $1 - \lambda$ and add them together; the terms in $f'(x_0)$ cancel and we see:

$$ \lambda f(x_1) + (1 - \lambda) f(x_2) \ge f(\lambda x_1 + (1 - \lambda) x_2) $$

Problem 1.37 Solution

See Prob.1.31.

Problem 1.38 Solution

When $M = 2$, (1.115) reduces to (1.114). We suppose (1.115) holds for $M$, and we will prove that it also holds for $M + 1$.

$$ \begin{aligned} f\Big( \sum_{m=1}^{M+1} \lambda_m x_m \Big) &= f\Big( \lambda_{M+1} x_{M+1} + (1 - \lambda_{M+1}) \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} x_m \Big) \\ &\le \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1}) f\Big( \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} x_m \Big) \\ &\le \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1}) \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} f(x_m) \\ &= \sum_{m=1}^{M+1} \lambda_m f(x_m) \end{aligned} $$

The first inequality is (1.114), and the second is the induction hypothesis, which applies because the weights $\lambda_m / (1 - \lambda_{M+1})$, $m = 1, \dots, M$, are non-negative and sum to 1. Hence, Jensen's Inequality, i.e. (1.115), has been proved.

Problem 1.39 Solution

It is quite straightforward based on the definitions.

$$ H[x] = -\sum_i p(x_i) \ln p(x_i) = -\frac{2}{3} \ln\frac{2}{3} - \frac{1}{3} \ln\frac{1}{3} = 0.6365 $$

$$ H[y] = -\sum_i p(y_i) \ln p(y_i) = -\frac{2}{3} \ln\frac{2}{3} - \frac{1}{3} \ln\frac{1}{3} = 0.6365 $$

$$ H[x, y] = -\sum_{i,j} p(x_i, y_j) \ln p(x_i, y_j) = -3 \cdot \frac{1}{3} \ln\frac{1}{3} - 0 = 1.0986 $$

$$ H[x|y] = -\sum_{i,j} p(x_i, y_j) \ln p(x_i | y_j) = -\frac{1}{3} \ln 1 - \frac{1}{3} \ln\frac{1}{2} - \frac{1}{3} \ln\frac{1}{2} = 0.4621 $$

$$ H[y|x] = -\sum_{i,j} p(x_i, y_j) \ln p(y_j | x_i) = -\frac{1}{3} \ln\frac{1}{2} - \frac{1}{3} \ln\frac{1}{2} - \frac{1}{3} \ln 1 = 0.4621 $$

$$ I[x, y] = -\sum_{i,j} p(x_i, y_j) \ln \frac{p(x_i) p(y_j)}{p(x_i, y_j)} = -\frac{1}{3} \ln \frac{\frac{2}{3} \cdot \frac{1}{3}}{1/3} - \frac{1}{3} \ln \frac{\frac{2}{3} \cdot \frac{2}{3}}{1/3} - \frac{1}{3} \ln \frac{\frac{1}{3} \cdot \frac{2}{3}}{1/3} = 0.1744 $$

Their relations are given below, diagrams omitted.

$$ I[x, y] = H[x] - H[x|y] = H[y] - H[y|x] $$

$$ H[x, y] = H[y|x] + H[x] = H[x|y] + H[y] $$

Problem 1.40 Solution

$f(x) = \ln x$ is a strictly concave function, therefore we take advantage of Jensen's Inequality to obtain:

$$ f\Big( \sum_{m=1}^{M} \lambda_m x_m \Big) \ge \sum_{m=1}^{M} \lambda_m f(x_m) $$

We let $\lambda_m = \frac{1}{M}$, $m = 1, 2, \dots, M$. Hence we obtain:

$$ \ln\Big( \frac{x_1 + x_2 + \dots + x_M}{M} \Big) \ge \frac{1}{M} \big[ \ln(x_1) + \ln(x_2) + \dots + \ln(x_M) \big] = \frac{1}{M} \ln(x_1 x_2 \dots x_M) $$

We take advantage of the fact that $f(x) = \ln x$ is strictly increasing and then obtain:

$$ \frac{x_1 + x_2 + \dots + x_M}{M} \ge \sqrt[M]{x_1 x_2 \dots x_M} $$

Problem 1.41 Solution

Based on the definition of $I[\mathbf{x}, \mathbf{y}]$, i.e. (1.120), we obtain:

$$ \begin{aligned} I[\mathbf{x}, \mathbf{y}] &= -\iint p(\mathbf{x}, \mathbf{y}) \ln \frac{p(\mathbf{x}) p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})} \, d\mathbf{x} \, d\mathbf{y} \\ &= -\iint p(\mathbf{x}, \mathbf{y}) \ln \frac{p(\mathbf{x})}{p(\mathbf{x}|\mathbf{y})} \, d\mathbf{x} \, d\mathbf{y} \\ &= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}) \, d\mathbf{x} \, d\mathbf{y} + \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}|\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \\ &= -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x} + \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}|\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \\ &= H[\mathbf{x}] - H[\mathbf{x}|\mathbf{y}] \end{aligned} $$

where we have taken advantage of the facts $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}) p(\mathbf{x}|\mathbf{y})$ and $\int p(\mathbf{x}, \mathbf{y}) \, d\mathbf{y} = p(\mathbf{x})$. The same process can be used for proving $I[\mathbf{x}, \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}]$, if we substitute $p(\mathbf{x}, \mathbf{y})$ with $p(\mathbf{x}) p(\mathbf{y}|\mathbf{x})$ in the second step.
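The numbers in Problem 1.39 and the relations in Problem 1.41 can be checked with a few lines of code. The sketch below assumes the joint table from the problem statement is $p(0,0) = p(0,1) = p(1,1) = 1/3$ and $p(1,0) = 0$, which is consistent with the marginals and values used above; the helper names are illustrative only.

```python
import math

# Joint distribution p(x, y); assumed from the statement of Problem 1.39.
joint = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 0.0, (1, 1): 1 / 3}

def marginal(joint, axis):
    """Sum the joint table over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    """Entropy in nats, using the convention 0 * ln 0 = 0."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def conditional_entropy(joint, given_axis):
    """H of the other variable given the variable on `given_axis`."""
    cond = marginal(joint, given_axis)
    return -sum(p * math.log(p / cond[key[given_axis]])
                for key, p in joint.items() if p > 0)

px, py = marginal(joint, 0), marginal(joint, 1)
H_x, H_y, H_xy = entropy(px), entropy(py), entropy(joint)
H_x_given_y = conditional_entropy(joint, given_axis=1)
H_y_given_x = conditional_entropy(joint, given_axis=0)
# Mutual information from its definition as a KL divergence.
I_xy = -sum(p * math.log(px[x] * py[y] / p)
            for (x, y), p in joint.items() if p > 0)

if __name__ == "__main__":
    for name, value in [("H[x]", H_x), ("H[y]", H_y), ("H[x,y]", H_xy),
                        ("H[x|y]", H_x_given_y), ("H[y|x]", H_y_given_x),
                        ("I[x,y]", I_xy)]:
        print(f"{name} = {value:.4f}")
```

Under the assumed table, the printed values reproduce the six numbers listed in Problem 1.39, and $I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]$ holds up to floating-point rounding.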
0.2 Probability Distribution

Problem 2.1 Solution

Based on the definitions, we obtain:

$$ \sum_{x_i = 0, 1} p(x_i) = \mu + (1 - \mu) = 1 $$

$$ E[x] = \sum_{x_i = 0, 1} x_i \, p(x_i) = 0 \cdot (1 - \mu) + 1 \cdot \mu = \mu $$

$$ var[x] = \sum_{x_i = 0, 1} (x_i - E[x])^2 p(x_i) = (0 - \mu)^2 (1 - \mu) + (1 - \mu)^2 \cdot \mu = \mu (1 - \mu) $$

$$ H[x] = -\sum_{x_i = 0, 1} p(x_i) \ln p(x_i) = -\mu \ln \mu - (1 - \mu) \ln(1 - \mu) $$

Problem 2.2 Solution

The approach of Prob.2.1 can also be used here.

$$ \sum_{x_i = -1, 1} p(x_i) = \frac{1 - \mu}{2} + \frac{1 + \mu}{2} = 1 $$

$$ E[x] = \sum_{x_i = -1, 1} x_i \cdot p(x_i) = -1 \cdot \frac{1 - \mu}{2} + 1 \cdot \frac{1 + \mu}{2} = \mu $$

$$ var[x] = \sum_{x_i = -1, 1} (x_i - E[x])^2 \cdot p(x_i) = (-1 - \mu)^2 \cdot \frac{1 - \mu}{2} + (1 - \mu)^2 \cdot \frac{1 + \mu}{2} = 1 - \mu^2 $$

$$ H[x] = -\sum_{x_i = -1, 1} p(x_i) \cdot \ln p(x_i) = -\frac{1 - \mu}{2} \cdot \ln \frac{1 - \mu}{2} - \frac{1 + \mu}{2} \cdot \ln \frac{1 + \mu}{2} $$

Problem 2.3 Solution

(2.262) is an important property of combinations, which we have used before, such as in Prob.1.15. We will use the 'old fashioned' notation $C_N^m$ to represent choosing $m$ objects from a total of $N$. With the prior knowledge:

$$ C_N^m = \frac{N!}{m! \, (N - m)!} $$

we evaluate the left side of (2.262):

$$ \begin{aligned} C_N^m + C_N^{m-1} &= \frac{N!}{m! \, (N - m)!} + \frac{N!}{(m - 1)! \, (N - (m - 1))!} \\ &= \frac{N!}{(m - 1)! \, (N - m)!} \Big( \frac{1}{m} + \frac{1}{N - m + 1} \Big) \\ &= \frac{(N + 1)!}{m! \, (N + 1 - m)!} = C_{N+1}^m \end{aligned} $$

To prove (2.263), we will prove a more general form:

$$ (x + y)^N = \sum_{m=0}^{N} C_N^m x^m y^{N-m} \qquad (*) $$

If we let $y = 1$, $(*)$ reduces to (2.263). We prove it by induction. First, it is obvious that $(*)$ holds when $N = 1$. We assume that it holds for $N$, and prove that it also holds for $N + 1$.

$$ \begin{aligned} (x + y)^{N+1} &= (x + y) \sum_{m=0}^{N} C_N^m x^m y^{N-m} \\ &= x \sum_{m=0}^{N} C_N^m x^m y^{N-m} + y \sum_{m=0}^{N} C_N^m x^m y^{N-m} \\ &= \sum_{m=0}^{N} C_N^m x^{m+1} y^{N-m} + \sum_{m=0}^{N} C_N^m x^m y^{N+1-m} \\ &= \sum_{m=1}^{N+1} C_N^{m-1} x^m y^{N+1-m} + \sum_{m=0}^{N} C_N^m x^m y^{N+1-m} \\ &= \sum_{m=1}^{N} (C_N^{m-1} + C_N^m) x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\ &= \sum_{m=1}^{N} C_{N+1}^m x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\ &= \sum_{m=0}^{N+1} C_{N+1}^m x^m y^{N+1-m} \end{aligned} $$

So far, we have proved $(*)$. Therefore, if we let $y = 1$ in $(*)$, (2.263) has been proved. If we let $x = \mu$ and $y = 1 - \mu$, (2.264) has been proved.
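The three identities of Problem 2.3 are easy to spot-check numerically. The following is a minimal sketch using only the standard library; the helper names `pascal_holds` and `binomial_expansion` and the chosen test values are illustrative assumptions.

```python
from math import comb, isclose

def pascal_holds(N, m):
    """Check (2.262): C(N, m) + C(N, m-1) == C(N+1, m), for 1 <= m <= N."""
    return comb(N, m) + comb(N, m - 1) == comb(N + 1, m)

def binomial_expansion(x, y, N):
    """Right-hand side of (*): sum over m of C(N, m) * x^m * y^(N-m)."""
    return sum(comb(N, m) * x**m * y**(N - m) for m in range(N + 1))

if __name__ == "__main__":
    # (2.262) for a range of N and m.
    print(all(pascal_holds(N, m) for N in range(1, 21) for m in range(1, N + 1)))
    # (2.263): y = 1 gives (1 + x)^N.
    print(isclose(binomial_expansion(2.0, 1.0, 5), 3.0**5))
    # (2.264): x = mu, y = 1 - mu gives the binomial normalization.
    print(isclose(binomial_expansion(0.3, 0.7, 10), 1.0))
```

Exact integer arithmetic is used for (2.262), while the two floating-point checks use `isclose` to allow for rounding.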
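Problem 2.4 Solution

A solution has already been outlined in the problem, but we will solve it in a more intuitive way, beginning from the definition:

$$ \begin{aligned} E[m] &= \sum_{m=0}^{N} m \, C_N^m \mu^m (1 - \mu)^{N-m} \\ &= \sum_{m=1}^{N} m \, C_N^m \mu^m (1 - \mu)^{N-m} \\ &= \sum_{m=1}^{N} \frac{N!}{(m - 1)! \, (N - m)!} \mu^m (1 - \mu)^{N-m} \\ &= N \mu \sum_{m=1}^{N} \frac{(N - 1)!}{(m - 1)! \, (N - m)!} \mu^{m-1} (1 - \mu)^{N-m} \\ &= N \mu \sum_{m=1}^{N} C_{N-1}^{m-1} \mu^{m-1} (1 - \mu)^{N-m} \\ &= N \mu \sum_{k=0}^{N-1} C_{N-1}^{k} \mu^{k} (1 - \mu)^{N-1-k} \\ &= N \mu \, [\mu + (1 - \mu)]^{N-1} = N \mu \end{aligned} $$

Some details should be explained here. The term $m = 0$ does not contribute to the expectation, so we let the summation begin from $m = 1$ (this is what we have done from the first step to the second step). Moreover, in the second last step, we re-index the summation by letting $k = m - 1$. And in the last step, we have taken advantage of (2.264).

The variance is straightforward once the expectation has been calculated.

$$ \begin{aligned} var[m] &= E[m^2] - E[m]^2 \\ &= \sum_{m=0}^{N} m^2 C_N^m \mu^m (1 - \mu)^{N-m} - E[m] \cdot E[m] \\ &= \sum_{m=0}^{N} m^2 C_N^m \mu^m (1 - \mu)^{N-m} - (N\mu) \cdot \sum_{m=0}^{N} m \, C_N^m \mu^m (1 - \mu)^{N-m} \\ &= \sum_{m=1}^{N} m^2 C_N^m \mu^m (1 - \mu)^{N-m} - N\mu \cdot \sum_{m=1}^{N} m \, C_N^m \mu^m (1 - \mu)^{N-m} \\ &= \sum_{m=1}^{N} m \frac{N!}{(m - 1)! \, (N - m)!} \mu^m (1 - \mu)^{N-m} - (N\mu) \cdot \sum_{m=1}^{N} m \, C_N^m \mu^m (1 - \mu)^{N-m} \\ &= N\mu \sum_{m=1}^{N} m \frac{(N - 1)!}{(m - 1)! \, (N - m)!} \mu^{m-1} (1 - \mu)^{N-m} - N\mu \cdot \sum_{m=1}^{N} m \, C_N^m \mu^m (1 - \mu)^{N-m} \\ &= N\mu \sum_{m=1}^{N} m \, \mu^{m-1} (1 - \mu)^{N-m} \big( C_{N-1}^{m-1} - \mu \, C_N^m \big) \end{aligned} $$

Here we will use a little trick, $-\mu = -1 + (1 - \mu)$, and then take advantage of the property $C_N^m = C_{N-1}^m + C_{N-1}^{m-1}$.

$$ \begin{aligned} var[m] &= N\mu \sum_{m=1}^{N} m \, \mu^{m-1} (1 - \mu)^{N-m} \big[ C_{N-1}^{m-1} - C_N^m + (1 - \mu) C_N^m \big] \\ &= N\mu \sum_{m=1}^{N} m \, \mu^{m-1} (1 - \mu)^{N-m} \big[ (1 - \mu) C_N^m + C_{N-1}^{m-1} - C_N^m \big] \\ &= N\mu \sum_{m=1}^{N} m \, \mu^{m-1} (1 - \mu)^{N-m} \big[ (1 - \mu) C_N^m - C_{N-1}^m \big] \\ &= N\mu \Big\{ \sum_{m=1}^{N} m \, \mu^{m-1} (1 - \mu)^{N-m+1} C_N^m - \sum_{m=1}^{N} m \, \mu^{m-1} (1 - \mu)^{N-m} C_{N-1}^m \Big\} \\ &= N\mu \Big\{ N (1 - \mu) [\mu + (1 - \mu)]^{N-1} - (N - 1)(1 - \mu) [\mu + (1 - \mu)]^{N-2} \Big\} \\ &= N\mu \big\{ N (1 - \mu) - (N - 1)(1 - \mu) \big\} = N\mu (1 - \mu) \end{aligned} $$

Here the two sums in braces are evaluated in the same way as $E[m]$ above, using $C_{N-1}^{N} = 0$ for the second one.

Problem 2.5 Solution

Hints have already been given in the description. Let's make a little improvement by introducing $t = y + x$ and $x = t\mu$ at the same time, i.e. we make the following change of variables:

$$ \begin{cases} x = t\mu \\ y = t(1 - \mu) \end{cases} \qquad \text{and} \qquad \begin{cases} t = x + y \\ \mu = \frac{x}{x + y} \end{cases} $$

Note that $t \in [0, +\infty)$ and $\mu \in (0, 1)$, and that when we change variables in an integral, we have to introduce an additional factor, the Jacobian determinant:

$$ \frac{\partial(x, y)}{\partial(\mu, t)} = \begin{vmatrix} \frac{\partial x}{\partial \mu} & \frac{\partial x}{\partial t} \\ \frac{\partial y}{\partial \mu} & \frac{\partial y}{\partial t} \end{vmatrix} = \begin{vmatrix} t & \mu \\ -t & 1 - \mu \end{vmatrix} = t $$

Now we can calculate the integral.

$$ \begin{aligned} \Gamma(a) \Gamma(b) &= \int_{0}^{+\infty} \exp(-x) \, x^{a-1} \, dx \int_{0}^{+\infty} \exp(-y) \, y^{b-1} \, dy \\ &= \int_{0}^{+\infty} \int_{0}^{+\infty} \exp(-x) \, x^{a-1} \exp(-y) \, y^{b-1} \, dy \, dx \\ &= \int_{0}^{+\infty} \int_{0}^{+\infty} \exp(-x - y) \, x^{a-1} y^{b-1} \, dy \, dx \\ &= \int_{0}^{1} \int_{0}^{+\infty} \exp(-t) \, (t\mu)^{a-1} (t(1 - \mu))^{b-1} \, t \, dt \, d\mu \\ &= \int_{0}^{+\infty} \exp(-t) \, t^{a+b-1} \, dt \cdot \int_{0}^{1} \mu^{a-1} (1 - \mu)^{b-1} \, d\mu \\ &= \Gamma(a + b) \cdot \int_{0}^{1} \mu^{a-1} (1 - \mu)^{b-1} \, d\mu \end{aligned} $$

Therefore, we have obtained:

$$ \int_{0}^{1} \mu^{a-1} (1 - \mu)^{b-1} \, d\mu = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a + b)} $$

Problem 2.6 Solution

We will solve this problem based on the definitions.

$$ \begin{aligned} E[\mu] &= \int_{0}^{1} \mu \, Beta(\mu|a, b) \, d\mu \\ &= \int_{0}^{1} \frac{\Gamma(a + b)}{\Gamma(a) \Gamma(b)} \mu^{a} (1 - \mu)^{b-1} \, d\mu \\ &= \frac{\Gamma(a + b) \Gamma(a + 1)}{\Gamma(a + 1 + b) \Gamma(a)} \int_{0}^{1} \frac{\Gamma(a + 1 + b)}{\Gamma(a + 1) \Gamma(b)} \mu^{a} (1 - \mu)^{b-1} \, d\mu \\ &= \frac{\Gamma(a + b) \Gamma(a + 1)}{\Gamma(a + 1 + b) \Gamma(a)} \int_{0}^{1} Beta(\mu|a + 1, b) \, d\mu \\ &= \frac{\Gamma(a + b)}{\Gamma(a + 1 + b)} \cdot \frac{\Gamma(a + 1)}{\Gamma(a)} \\ &= \frac{a}{a + b} \end{aligned} $$

where we have taken advantage of the property $\Gamma(z + 1) = z \Gamma(z)$. The variance is obtained in a similar way. We first evaluate $E[\mu^2]$.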
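$$ \begin{aligned} E[\mu^2] &= \int_{0}^{1} \mu^2 \, Beta(\mu|a, b) \, d\mu \\ &= \int_{0}^{1} \frac{\Gamma(a + b)}{\Gamma(a) \Gamma(b)} \mu^{a+1} (1 - \mu)^{b-1} \, d\mu \\ &= \frac{\Gamma(a + b) \Gamma(a + 2)}{\Gamma(a + 2 + b) \Gamma(a)} \int_{0}^{1} \frac{\Gamma(a + 2 + b)}{\Gamma(a + 2) \Gamma(b)} \mu^{a+1} (1 - \mu)^{b-1} \, d\mu \\ &= \frac{\Gamma(a + b) \Gamma(a + 2)}{\Gamma(a + 2 + b) \Gamma(a)} \int_{0}^{1} Beta(\mu|a + 2, b) \, d\mu \\ &= \frac{\Gamma(a + b)}{\Gamma(a + 2 + b)} \cdot \frac{\Gamma(a + 2)}{\Gamma(a)} \\ &= \frac{a (a + 1)}{(a + b)(a + b + 1)} \end{aligned} $$

Then we use the formula $var[\mu] = E[\mu^2] - E[\mu]^2$:

$$ var[\mu] = \frac{a (a + 1)}{(a + b)(a + b + 1)} - \Big( \frac{a}{a + b} \Big)^2 = \frac{ab}{(a + b)^2 (a + b + 1)} $$

Problem 2.7 Solution

The maximum likelihood estimate for $\mu$, i.e. (2.8), can be written as:

$$ \mu_{ML} = \frac{m}{m + l} $$

where $m$ is the number of times we observe 'head' and $l$ is the number of times we observe 'tail'. The prior mean of $\mu$ is given by (2.15), and the posterior mean of $\mu$ is given by (2.20). Therefore, we need to prove that $(m + a)/(m + a + l + b)$ lies between $m/(m + l)$ and $a/(a + b)$. We have:

$$ \lambda \frac{a}{a + b} + (1 - \lambda) \frac{m}{m + l} = \frac{m + a}{m + a + l + b}, \qquad \text{where } \lambda = \frac{a + b}{m + l + a + b} $$

Since $0 < \lambda < 1$, the posterior mean is a convex combination of the prior mean and the maximum likelihood estimate, which solves the problem. Note: you can also solve it in a simpler way by proving that:

$$ \Big( \frac{m + a}{m + a + l + b} - \frac{a}{a + b} \Big) \cdot \Big( \frac{m + a}{m + a + l + b} - \frac{m}{m + l} \Big) \le 0 $$

The expression above can be proved by reducing the fractions to a common denominator.

Problem 2.8 Solution

We solve it based on the definitions.

$$ \begin{aligned} E_y[E_x[x|y]] &= \int E_x[x|y] \, p(y) \, dy \\ &= \int \Big( \int x \, p(x|y) \, dx \Big) p(y) \, dy \\ &= \iint x \, p(x|y) \, p(y) \, dx \, dy \\ &= \iint x \, p(x, y) \, dx \, dy \\ &= \int x \, p(x) \, dx = E[x] \end{aligned} $$

(2.271) is more complicated, and we will calculate every term separately.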
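$$ \begin{aligned} E_y[var_x[x|y]] &= \int var_x[x|y] \, p(y) \, dy \\ &= \int \Big( \int (x - E_x[x|y])^2 p(x|y) \, dx \Big) p(y) \, dy \\ &= \iint (x - E_x[x|y])^2 p(x, y) \, dx \, dy \\ &= \iint \big( x^2 - 2x E_x[x|y] + E_x[x|y]^2 \big) p(x, y) \, dx \, dy \\ &= \int x^2 p(x) \, dx - \iint 2x E_x[x|y] \, p(x, y) \, dx \, dy + \int E_x[x|y]^2 p(y) \, dy \end{aligned} $$

We further simplify the second term in the equation above:

$$ \begin{aligned} \iint 2x E_x[x|y] \, p(x, y) \, dx \, dy &= 2 \int E_x[x|y] \Big( \int x \, p(x, y) \, dx \Big) dy \\ &= 2 \int E_x[x|y] \, p(y) \Big( \int x \, p(x|y) \, dx \Big) dy \\ &= 2 \int E_x[x|y]^2 p(y) \, dy \end{aligned} $$

Therefore, we obtain a simple expression for the first term on the right side of (2.271):

$$ E_y[var_x[x|y]] = \int x^2 p(x) \, dx - \int E_x[x|y]^2 p(y) \, dy \qquad (*) $$

Then we proceed to the second term.

$$ \begin{aligned} var_y[E_x[x|y]] &= \int \big( E_x[x|y] - E_y[E_x[x|y]] \big)^2 p(y) \, dy \\ &= \int \big( E_x[x|y] - E[x] \big)^2 p(y) \, dy \\ &= \int E_x[x|y]^2 p(y) \, dy - 2 \int E[x] E_x[x|y] \, p(y) \, dy + \int E[x]^2 p(y) \, dy \\ &= \int E_x[x|y]^2 p(y) \, dy - 2 E[x] \int E_x[x|y] \, p(y) \, dy + E[x]^2 \end{aligned} $$

Following the same procedure, we deal with the second term of the equation above:

$$ 2 E[x] \cdot \int E_x[x|y] \, p(y) \, dy = 2 E[x] \cdot E_y[E_x[x|y]] = 2 E[x]^2 $$

Therefore, we obtain a simple expression for the second term on the right side of (2.271):

$$ var_y[E_x[x|y]] = \int E_x[x|y]^2 p(y) \, dy - E[x]^2 \qquad (**) $$

Finally, we add $(*)$ and $(**)$ to obtain:

$$ E_y[var_x[x|y]] + var_y[E_x[x|y]] = E[x^2] - E[x]^2 = var[x] $$

Problem 2.9 Solution

This problem is complex, but hints have already been given in the description. Let's begin by integrating (2.272) over $\mu_{M-1}$. (Note: by integrating over $\mu_{M-1}$, we actually obtain a Dirichlet distribution with $M - 1$ variables.)

$$ \begin{aligned} p_{M-1}(\mu_1, \mu_2, \dots, \mu_{M-2}) &= \int_{0}^{1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}} C_M \prod_{k=1}^{M-1} \mu_k^{\alpha_k - 1} \Big( 1 - \sum_{j=1}^{M-1} \mu_j \Big)^{\alpha_M - 1} d\mu_{M-1} \\ &= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \int_{0}^{1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}} \mu_{M-1}^{\alpha_{M-1} - 1} \Big( 1 - \sum_{j=1}^{M-1} \mu_j \Big)^{\alpha_M - 1} d\mu_{M-1} \end{aligned} $$

We change variable by:

$$ t = \frac{\mu_{M-1}}{1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}} $$

The reason we do so is that $\mu_{M-1} \in [0, 1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}]$, so after this change of variable we have $t \in [0, 1]$. Writing $S = 1 - \sum_{j=1}^{M-2} \mu_j$, we have $\mu_{M-1} = tS$, $1 - \sum_{j=1}^{M-1} \mu_j = S(1 - t)$ and $d\mu_{M-1} = S \, dt$. Then we can further simplify the expression.

$$ \begin{aligned} p_{M-1} &= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \int_{0}^{1} (tS)^{\alpha_{M-1} - 1} \big( S(1 - t) \big)^{\alpha_M - 1} S \, dt \\ &= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \Big( 1 - \sum_{j=1}^{M-2} \mu_j \Big)^{\alpha_{M-1} + \alpha_M - 1} \int_{0}^{1} t^{\alpha_{M-1} - 1} (1 - t)^{\alpha_M - 1} \, dt \\ &= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \Big( 1 - \sum_{j=1}^{M-2} \mu_j \Big)^{\alpha_{M-1} + \alpha_M - 1} \frac{\Gamma(\alpha_{M-1}) \Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)} \end{aligned} $$

where the last step uses the result of Prob.2.5. Comparing the expression above with a normalized Dirichlet distribution with $M - 1$ variables and parameters $\alpha_1, \dots, \alpha_{M-2}, \alpha_{M-1} + \alpha_M$, and supposing that (2.272) holds for $M - 1$, we obtain:

$$ C_M \frac{\Gamma(\alpha_{M-1}) \Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)} = \frac{\Gamma(\alpha_1 + \alpha_2 + \dots + \alpha_M)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_{M-2}) \Gamma(\alpha_{M-1} + \alpha_M)} $$

Therefore, we obtain

$$ C_M = \frac{\Gamma(\alpha_1 + \alpha_2 + \dots + \alpha_M)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_{M-1}) \Gamma(\alpha_M)} $$

as required.

Problem 2.10 Solution

Based on the definition of expectation and (2.38), we can write:

$$ \begin{aligned} E[\mu_j] &= \int \mu_j \, Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu} \\ &= \int \mu_j \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \, d\boldsymbol{\mu} \\ &= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_K)} \int \mu_j \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \, d\boldsymbol{\mu} \\ &= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_K)} \cdot \frac{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_{j-1}) \Gamma(\alpha_j + 1) \Gamma(\alpha_{j+1}) \dots \Gamma(\alpha_K)}{\Gamma(\alpha_0 + 1)} \\ &= \frac{\Gamma(\alpha_0) \Gamma(\alpha_j + 1)}{\Gamma(\alpha_j) \Gamma(\alpha_0 + 1)} = \frac{\alpha_j}{\alpha_0} \end{aligned} $$

The variance is obtained in much the same way. Let's begin by calculating $E[\mu_j^2]$.

$$ \begin{aligned} E[\mu_j^2] &= \int \mu_j^2 \, Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu} \\ &= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_K)} \int \mu_j^2 \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \, d\boldsymbol{\mu} \\ &= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_K)} \cdot \frac{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_{j-1}) \Gamma(\alpha_j + 2) \Gamma(\alpha_{j+1}) \dots \Gamma(\alpha_K)}{\Gamma(\alpha_0 + 2)} \\ &= \frac{\Gamma(\alpha_0) \Gamma(\alpha_j + 2)}{\Gamma(\alpha_j) \Gamma(\alpha_0 + 2)} = \frac{\alpha_j (\alpha_j + 1)}{\alpha_0 (\alpha_0 + 1)} \end{aligned} $$

Hence, we obtain:

$$ var[\mu_j] = E[\mu_j^2] - E[\mu_j]^2 = \frac{\alpha_j (\alpha_j + 1)}{\alpha_0 (\alpha_0 + 1)} - \Big( \frac{\alpha_j}{\alpha_0} \Big)^2 = \frac{\alpha_j (\alpha_0 - \alpha_j)}{\alpha_0^2 (\alpha_0 + 1)} $$

The covariance is handled in the same way.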
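$$ \begin{aligned} cov[\mu_j \mu_l] &= \int (\mu_j - E[\mu_j])(\mu_l - E[\mu_l]) \, Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu} \\ &= \int \big( \mu_j \mu_l - E[\mu_j] \mu_l - E[\mu_l] \mu_j + E[\mu_j] E[\mu_l] \big) \, Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu} \\ &= \frac{\Gamma(\alpha_0) \Gamma(\alpha_j + 1) \Gamma(\alpha_l + 1)}{\Gamma(\alpha_j) \Gamma(\alpha_l) \Gamma(\alpha_0 + 2)} - 2 E[\mu_j] E[\mu_l] + E[\mu_j] E[\mu_l] \\ &= \frac{\alpha_j \alpha_l}{\alpha_0 (\alpha_0 + 1)} - E[\mu_j] E[\mu_l] \\ &= \frac{\alpha_j \alpha_l}{\alpha_0 (\alpha_0 + 1)} - \frac{\alpha_j \alpha_l}{\alpha_0^2} \\ &= -\frac{\alpha_j \alpha_l}{\alpha_0^2 (\alpha_0 + 1)} \qquad (j \ne l) \end{aligned} $$

Note: when $j = l$, $cov[\mu_j \mu_l]$ reduces to $var[\mu_j]$. However, we cannot simply replace $l$ with $j$ in the expression for $cov[\mu_j \mu_l]$ to get the right result, because $\int \mu_j \mu_l \, Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu}$ reduces to $\int \mu_j^2 \, Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu}$ in this case.

Problem 2.11 Solution

Based on the definition of expectation and (2.38), we first denote:

$$ K(\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \dots \Gamma(\alpha_K)} $$

Then we can write:

$$ \begin{aligned} \frac{\partial Dir(\boldsymbol{\mu}|\boldsymbol{\alpha})}{\partial \alpha_j} &= \frac{\partial}{\partial \alpha_j} \Big( K(\boldsymbol{\alpha}) \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} \Big) \\ &= \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} + K(\boldsymbol{\alpha}) \frac{\partial \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}}{\partial \alpha_j} \\ &= \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} + \ln \mu_j \cdot Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \end{aligned} $$

Then let us integrate both sides:

$$ \int \frac{\partial Dir(\boldsymbol{\mu}|\boldsymbol{\alpha})}{\partial \alpha_j} \, d\boldsymbol{\mu} = \int \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} \, d\boldsymbol{\mu} + \int \ln \mu_j \cdot Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu} $$

The left side can be further simplified as:

$$ \text{left side} = \frac{\partial \int Dir(\boldsymbol{\mu}|\boldsymbol{\alpha}) \, d\boldsymbol{\mu}}{\partial \alpha_j} = \frac{\partial 1}{\partial \alpha_j} = 0 $$

The right side can be further simplified as:

$$ \begin{aligned} \text{right side} &= \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \int \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} \, d\boldsymbol{\mu} + E[\ln \mu_j] \\ &= \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \frac{1}{K(\boldsymbol{\alpha})} + E[\ln \mu_j] \\ &= \frac{\partial \ln K(\boldsymbol{\alpha})}{\partial \alpha_j} + E[\ln \mu_j] \end{aligned} $$

Therefore, we obtain:

$$ \begin{aligned} E[\ln \mu_j] &= -\frac{\partial \ln K(\boldsymbol{\alpha})}{\partial \alpha_j} \\ &= -\frac{\partial \big\{ \ln \Gamma(\alpha_0) - \sum_{i=1}^{K} \ln \Gamma(\alpha_i) \big\}}{\partial \alpha_j} \\ &= \frac{\partial \ln \Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln \Gamma(\alpha_0)}{\partial \alpha_j} \\ &= \frac{\partial \ln \Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln \Gamma(\alpha_0)}{\partial \alpha_0} \frac{\partial \alpha_0}{\partial \alpha_j} \\ &= \frac{\partial \ln \Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln \Gamma(\alpha_0)}{\partial \alpha_0} \\ &= \psi(\alpha_j) - \psi(\alpha_0) \end{aligned} $$

where we have used $\partial \alpha_0 / \partial \alpha_j = 1$, since $\alpha_0 = \sum_k \alpha_k$. Therefore, the problem has been solved.

Problem 2.12 Solution

Since we have:

$$ \int_{a}^{b} \frac{1}{b - a} \, dx = 1 $$

the distribution is normalized. Then we calculate its mean:

$$ E[x] = \int_{a}^{b} x \frac{1}{b - a} \, dx = \frac{x^2}{2(b - a)} \Big|_{a}^{b} = \frac{a + b}{2} $$

Then we calculate its variance.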
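$$ var[x] = E[x^2] - E[x]^2 = \int_{a}^{b} \frac{x^2}{b - a} \, dx - \Big( \frac{a + b}{2} \Big)^2 = \frac{x^3}{3(b - a)} \Big|_{a}^{b} - \Big( \frac{a + b}{2} \Big)^2 $$

Hence we obtain:

$$ var[x] = \frac{(b - a)^2}{12} $$

Problem 2.13 Solution

This problem is an extension of Prob.1.30, and we can follow the same procedure to solve it. With $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$ and $q(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\mathbf{m}, L)$, let's begin by calculating $\ln \frac{p(\mathbf{x})}{q(\mathbf{x})}$:

$$ \ln\Big( \frac{p(\mathbf{x})}{q(\mathbf{x})} \Big) = \frac{1}{2} \ln\Big( \frac{|L|}{|\Sigma|} \Big) + \frac{1}{2} (\mathbf{x} - \mathbf{m})^T L^{-1} (\mathbf{x} - \mathbf{m}) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) $$

If $\mathbf{x} \sim p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$, we can take advantage of the following properties:

$$ \int p(\mathbf{x}) \, d\mathbf{x} = 1 $$

$$ E[\mathbf{x}] = \int \mathbf{x} \, p(\mathbf{x}) \, d\mathbf{x} = \boldsymbol{\mu} $$

$$ E[(\mathbf{x} - \mathbf{a})^T A (\mathbf{x} - \mathbf{a})] = \mathrm{tr}(A \Sigma) + (\boldsymbol{\mu} - \mathbf{a})^T A (\boldsymbol{\mu} - \mathbf{a}) $$

We obtain:

$$ \begin{aligned} KL &= \int \Big\{ \frac{1}{2} \ln \frac{|L|}{|\Sigma|} - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) + \frac{1}{2} (\mathbf{x} - \mathbf{m})^T L^{-1} (\mathbf{x} - \mathbf{m}) \Big\} p(\mathbf{x}) \, d\mathbf{x} \\ &= \frac{1}{2} \ln \frac{|L|}{|\Sigma|} - \frac{1}{2} E[(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})] + \frac{1}{2} E[(\mathbf{x} - \mathbf{m})^T L^{-1} (\mathbf{x} - \mathbf{m})] \\ &= \frac{1}{2} \ln \frac{|L|}{|\Sigma|} - \frac{1}{2} \mathrm{tr}\{ I_D \} + \frac{1}{2} (\boldsymbol{\mu} - \mathbf{m})^T L^{-1} (\boldsymbol{\mu} - \mathbf{m}) + \frac{1}{2} \mathrm{tr}\{ L^{-1} \Sigma \} \\ &= \frac{1}{2} \Big[ \ln \frac{|L|}{|\Sigma|} - D + \mathrm{tr}\{ L^{-1} \Sigma \} + (\mathbf{m} - \boldsymbol{\mu})^T L^{-1} (\mathbf{m} - \boldsymbol{\mu}) \Big] \end{aligned} $$

Problem 2.14 Solution

The hint given in the problem is straightforward, but the calculation it leads to is somewhat difficult. Here we will use a simpler method, taking advantage of the property of the Kullback-Leibler distance. Let $g(\mathbf{x})$ be a Gaussian PDF with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, and $f(\mathbf{x})$ an arbitrary PDF with the same mean and covariance.

$$ 0 \le KL(f \| g) = -\int f(\mathbf{x}) \ln\Big\{ \frac{g(\mathbf{x})}{f(\mathbf{x})} \Big\} d\mathbf{x} = -H(f) - \int f(\mathbf{x}) \ln g(\mathbf{x}) \, d\mathbf{x} \qquad (*) $$

Let's calculate the second term of the equation above.

$$ \begin{aligned} \int f(\mathbf{x}) \ln g(\mathbf{x}) \, d\mathbf{x} &= \int f(\mathbf{x}) \ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big] \Big\} d\mathbf{x} \\ &= \int f(\mathbf{x}) \ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \Big\} d\mathbf{x} + \int f(\mathbf{x}) \Big[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big] d\mathbf{x} \\ &= \ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \Big\} - \frac{1}{2} E\big[ (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \big] \\ &= \ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \Big\} - \frac{1}{2} \mathrm{tr}\{ I_D \} \\ &= -\Big\{ \frac{1}{2} \ln |\Sigma| + \frac{D}{2} (1 + \ln(2\pi)) \Big\} \\ &= -H(g) \end{aligned} $$

We take advantage of two properties of the PDF $f(\mathbf{x})$, with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, as listed below. What's more, we also use the result of Prob.2.15, which we will prove later.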
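$$ \int f(\mathbf{x}) \, d\mathbf{x} = 1 $$

$$ E[(\mathbf{x} - \mathbf{a})^T A (\mathbf{x} - \mathbf{a})] = \mathrm{tr}(A \Sigma) + (\boldsymbol{\mu} - \mathbf{a})^T A (\boldsymbol{\mu} - \mathbf{a}) $$

Now we can further simplify $(*)$ to obtain:

$$ H(g) \ge H(f) $$

In other words, we have proved that the entropy of an arbitrary PDF $f(\mathbf{x})$ with the same mean and covariance as a Gaussian PDF $g(\mathbf{x})$ cannot be greater than that of the Gaussian PDF.

Problem 2.15 Solution

We have already used the result of this problem to solve Prob.2.14, and now we will prove it. Suppose $\mathbf{x} \sim p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$:

$$ \begin{aligned} H[\mathbf{x}] &= -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x} \\ &= -\int p(\mathbf{x}) \ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big] \Big\} d\mathbf{x} \\ &= -\int p(\mathbf{x}) \ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \Big\} d\mathbf{x} - \int p(\mathbf{x}) \Big[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big] d\mathbf{x} \\ &= -\ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \Big\} + \frac{1}{2} E\big[ (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \big] \\ &= -\ln\Big\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \Big\} + \frac{1}{2} \mathrm{tr}\{ I_D \} \\ &= \frac{1}{2} \ln |\Sigma| + \frac{D}{2} (1 + \ln(2\pi)) \end{aligned} $$

where we have taken advantage of:

$$ \int p(\mathbf{x}) \, d\mathbf{x} = 1 $$

$$ E[(\mathbf{x} - \mathbf{a})^T A (\mathbf{x} - \mathbf{a})] = \mathrm{tr}(A \Sigma) + (\boldsymbol{\mu} - \mathbf{a})^T A (\boldsymbol{\mu} - \mathbf{a}) $$

Note: In Prob.2.14 we have effectively already solved this problem. You can see this by replacing the integrand $f(\mathbf{x}) \ln g(\mathbf{x})$ with $g(\mathbf{x}) \ln g(\mathbf{x})$; the same procedure as in Prob.2.14 then calculates $\int g(\mathbf{x}) \ln g(\mathbf{x}) \, d\mathbf{x}$.

Problem 2.16 Solution

Let us consider a more general conclusion about the Probability Density Function (PDF) of the sum of two independent random variables. We denote two random variables by $X$ and $Y$. Their sum $Z = X + Y$ is still a random variable. We also denote by $f(\cdot)$ a PDF, and by $F(\cdot)$ a Cumulative Distribution Function (CDF). We can obtain:

$$ F_Z(z) = P(Z \le z) = \iint_{x + y \le z} f_{X,Y}(x, y) \, dx \, dy $$

where $z$ represents an arbitrary real number. We rewrite the double integral as an iterated integral:

$$ F_Z(z) = \int_{-\infty}^{+\infty} \Big[ \int_{-\infty}^{z - y} f_{X,Y}(x, y) \, dx \Big] dy $$

We fix $z$ and $y$, and then make the change of variable $x = u - y$ in the inner integral.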
$$ F_Z(z) = \int_{-\infty}^{+\infty} \Big[ \int_{-\infty}^{z - y} f_{X,Y}(x, y) \, dx \Big] dy = \int_{-\infty}^{+\infty} \Big[ \int_{-\infty}^{z} f_{X,Y}(u - y, y) \, du \Big] dy $$

Note: $f_{X,Y}(\cdot)$ is the joint PDF of $X$ and $Y$. Exchanging the order of integration, we obtain:

$$ F_Z(z) = \int_{-\infty}^{z} \Big[ \int_{-\infty}^{+\infty} f_{X,Y}(u - y, y) \, dy \Big] du $$

Compare the equation above with the definition of the CDF:

$$ F_Z(z) = \int_{-\infty}^{z} f_Z(u) \, du $$

We can obtain:

$$ f_Z(u) = \int_{-\infty}^{+\infty} f_{X,Y}(u - y, y) \, dy $$

And if $X$ and $Y$ are independent, which means $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, we can simplify $f_Z$:

$$ f_Z(u) = \int_{-\infty}^{+\infty} f_X(u - y) f_Y(y) \, dy, \qquad \text{i.e. } f_Z = f_X * f_Y $$

So far we have proved that the PDF of the sum of two independent random variables is the convolution of their PDFs. Hence, in this problem, where the random variable $x$ is the sum of the random variables $x_1$ and $x_2$, the PDF of $x$ is the convolution of the PDFs of $x_1$ and $x_2$. To find the entropy of $x$, we will use a simple method, taking advantage of (2.113)-(2.117). With the knowledge:

$$ p(x_2) = \mathcal{N}(x_2|\mu_2, \tau_2^{-1}) $$

$$ p(x|x_2) = \mathcal{N}(x|\mu_1 + x_2, \tau_1^{-1}) $$

we make the following analogies: $x_2$ in this problem corresponds to $x$ in (2.113), and $x$ in this problem corresponds to $y$ in (2.114). Hence, by using (2.115), we see that $p(x)$ is still a normal distribution, and since the entropy of a Gaussian is fully decided by its variance, there is no need to calculate the mean. Still by using (2.115), the variance of $x$ is $\tau_1^{-1} + \tau_2^{-1}$, which finally gives its entropy:

$$ H[x] = \frac{1}{2} \Big[ 1 + \ln 2\pi (\tau_1^{-1} + \tau_2^{-1}) \Big] $$

Problem 2.17 Solution

This is an extension of Prob.1.14, and the same procedure can be used here. We suppose an arbitrary precision matrix $\Lambda$ can be written as $\Lambda^S + \Lambda^A$, where they satisfy:

$$ \Lambda^S_{ij} = \frac{\Lambda_{ij} + \Lambda_{ji}}{2}, \qquad \Lambda^A_{ij} = \frac{\Lambda_{ij} - \Lambda_{ji}}{2} $$

Hence it is straightforward that $\Lambda^S_{ij} = \Lambda^S_{ji}$ and $\Lambda^A_{ij} = -\Lambda^A_{ji}$.
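If we expand the quadratic form in the exponent, we obtain:

$$ (\mathbf{x} - \boldsymbol{\mu})^T \Lambda (\mathbf{x} - \boldsymbol{\mu}) = \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda_{ij} (x_j - \mu_j) \qquad (*) $$

It is then straightforward that:

$$ \begin{aligned} (*) &= \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda^S_{ij} (x_j - \mu_j) + \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda^A_{ij} (x_j - \mu_j) \\ &= \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda^S_{ij} (x_j - \mu_j) \end{aligned} $$

where the antisymmetric part vanishes because its $(i, j)$ and $(j, i)$ terms cancel. Therefore, we can assume the precision matrix is symmetric, and so is the covariance matrix.

Problem 2.18 Solution

We will just follow the hint given in the problem. Firstly, we take the complex conjugate of both sides of (2.45):

$$ \Sigma \mathbf{u}_i = \lambda_i \mathbf{u}_i \quad \Rightarrow \quad \Sigma \bar{\mathbf{u}}_i = \bar{\lambda}_i \bar{\mathbf{u}}_i $$

where we have taken advantage of the fact that $\Sigma$ is a real matrix, i.e., $\bar{\Sigma} = \Sigma$. Then, using that $\Sigma$ is symmetric, i.e., $\Sigma^T = \Sigma$:

$$ \bar{\mathbf{u}}_i^T \Sigma \mathbf{u}_i = \bar{\mathbf{u}}_i^T (\Sigma \mathbf{u}_i) = \bar{\mathbf{u}}_i^T (\lambda_i \mathbf{u}_i) = \lambda_i \bar{\mathbf{u}}_i^T \mathbf{u}_i $$

$$ \bar{\mathbf{u}}_i^T \Sigma \mathbf{u}_i = (\Sigma \bar{\mathbf{u}}_i)^T \mathbf{u}_i = (\bar{\lambda}_i \bar{\mathbf{u}}_i)^T \mathbf{u}_i = \bar{\lambda}_i \bar{\mathbf{u}}_i^T \mathbf{u}_i $$

Since $\mathbf{u}_i \ne 0$, we have $\bar{\mathbf{u}}_i^T \mathbf{u}_i \ne 0$. Thus $\bar{\lambda}_i = \lambda_i$, which means $\lambda_i$ is real. Next we will prove that two eigenvectors corresponding to different eigenvalues are orthogonal.

$$ \lambda_i \langle \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \lambda_i \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \Sigma \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \mathbf{u}_i, \Sigma^T \mathbf{u}_j \rangle = \lambda_j \langle \mathbf{u}_i, \mathbf{u}_j \rangle $$

where we have taken advantage of $\Sigma^T = \Sigma$ and the fact that for an arbitrary real matrix $A$ and vectors $\mathbf{x}$, $\mathbf{y}$, we have:

$$ \langle A\mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{x}, A^T \mathbf{y} \rangle $$

Provided $\lambda_i \ne \lambda_j$, we have $\langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0$, i.e., $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal. If we then normalize every eigenvector so that its Euclidean norm equals 1, (2.46) follows. By normalization, I mean multiplying the eigenvector by a real number $a$ so that its Euclidean norm (length) equals 1; the corresponding eigenvalue is unchanged, since $\Sigma (a \mathbf{u}_i) = \lambda_i (a \mathbf{u}_i)$.

Problem 2.19 Solution

For every $N \times N$ real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen such that they are orthogonal to each other. Thus a real symmetric matrix $\Sigma$ can be decomposed as $\Sigma = U \Lambda U^T$, where $U$ is an orthogonal matrix whose columns are the eigenvectors $\mathbf{u}_k$, and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $\Sigma$.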
Hence for an arbitrary vector $x$, we have:

$$\Sigma x = U \Lambda U^T x = U \Lambda \begin{bmatrix} u_1^T x \\ \vdots \\ u_D^T x \end{bmatrix} = U \begin{bmatrix} \lambda_1 u_1^T x \\ \vdots \\ \lambda_D u_D^T x \end{bmatrix} = \Big(\sum_{k=1}^{D} \lambda_k u_k u_k^T\Big) x$$

Since $x$ is arbitrary, this gives (2.48). And since $\Sigma^{-1} = U \Lambda^{-1} U^T$, the same procedure can be used to prove (2.49).

Problem 2.20 Solution

Since $u_1, u_2, \dots, u_D$ constitute a basis for $\mathbb{R}^D$, we can expand $a$ in this basis:

$$a = a_1 u_1 + a_2 u_2 + \dots + a_D u_D$$

We substitute the expression above into $a^T \Sigma a$ and use the property that $u_i^T u_j = 1$ if $i = j$ and $0$ otherwise:

$$\begin{aligned}
a^T \Sigma a &= (a_1 u_1 + \dots + a_D u_D)^T \Sigma (a_1 u_1 + \dots + a_D u_D) \\
&= (a_1 u_1^T + \dots + a_D u_D^T)\, \Sigma\, (a_1 u_1 + \dots + a_D u_D) \\
&= (a_1 u_1^T + \dots + a_D u_D^T)(a_1 \lambda_1 u_1 + \dots + a_D \lambda_D u_D) \\
&= \lambda_1 a_1^2 + \lambda_2 a_2^2 + \dots + \lambda_D a_D^2
\end{aligned}$$

Since $a$ is real, the expression above is strictly positive for any nonzero $a$ if all eigenvalues are strictly positive. It is also clear that if an eigenvalue $\lambda_i$ is zero or negative, there exists a vector $a$ (e.g. $a = u_i$) for which this expression is no greater than 0. Thus, for a real symmetric matrix, having eigenvalues that are all strictly positive is a necessary and sufficient condition for the matrix to be positive definite.

Problem 2.21 Solution

It is straightforward. For a symmetric matrix $\Lambda$ of size $D \times D$, once the lower triangular part (including the diagonal) is fixed, the whole matrix is determined by symmetry. Hence the number of independent parameters is $D + (D-1) + \dots + 1$, which equals $D(D+1)/2$.

Problem 2.22 Solution

Suppose $A$ is a symmetric matrix. We need to prove that $A^{-1}$ is also symmetric, i.e. $A^{-1} = (A^{-1})^T$.
Since the identity matrix $I$ is also symmetric, we have:

$$A A^{-1} = (A A^{-1})^T$$

And since $(AB)^T = B^T A^T$ holds for arbitrary matrices $A$ and $B$, we obtain:

$$A A^{-1} = (A^{-1})^T A^T$$

Since $A = A^T$, we substitute on the right side:

$$A A^{-1} = (A^{-1})^T A$$

Noting that $A A^{-1} = A^{-1} A = I$, we rearrange the left side:

$$A^{-1} A = (A^{-1})^T A$$

Finally, multiplying both sides by $A^{-1}$ on the right, we obtain:

$$A^{-1} A A^{-1} = (A^{-1})^T A A^{-1}$$

Using $A A^{-1} = I$, we get what we are asked:

$$A^{-1} = (A^{-1})^T$$

Problem 2.23 Solution

Let's reformulate the problem. What we are asked to prove is that the volume of the hyperellipsoid defined by $(x-\mu)^T \Sigma^{-1} (x-\mu) = r^2$, where $r^2$ is a constant, equals $V_D |\Sigma|^{1/2} r^D$. Note that this hyperellipsoid is centered at $\mu$, and a translation does not change its volume, so we only need to prove that the volume of the hyperellipsoid defined by $x^T \Sigma^{-1} x = r^2$, centered at $0$, equals $V_D |\Sigma|^{1/2} r^D$.

This problem can be viewed in two parts. Firstly, consider $V_D$, the volume of the unit sphere in dimension $D$. Its expression has already been given in the solution of Prob.1.18, i.e. (1.144):

$$V_D = \frac{S_D}{D} = \frac{2\pi^{D/2}}{D\,\Gamma(D/2)} = \frac{\pi^{D/2}}{\Gamma(\frac{D}{2} + 1)}$$

In that solution we also showed that a $D$-dimensional sphere with radius $r$, i.e. $y^T y = r^2$, has volume $V(r) = V_D r^D$. We now move a step forward and apply the linear transformation $x = \Sigma^{1/2} y$, so that $y^T y = x^T \Sigma^{-1} x = r^2$. The image of the sphere under this transformation is exactly the hyperellipsoid centered at $0$, and its volume is obtained by multiplying $V(r)$ by the determinant of the transformation matrix, which gives $|\Sigma|^{1/2} V_D r^D$, just as required.

Problem 2.24 Solution

We just follow the hint, and firstly calculate:

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} \times \begin{bmatrix} M & -M B D^{-1} \\ -D^{-1} C M & D^{-1} + D^{-1} C M B D^{-1} \end{bmatrix}$$

The result can also be partitioned into four blocks.
The top-left block equals:

$$A M - B D^{-1} C M = (A - B D^{-1} C)(A - B D^{-1} C)^{-1} = I$$

where we have used (2.77). The top-right block equals:

$$-A M B D^{-1} + B D^{-1} + B D^{-1} C M B D^{-1} = (I - A M + B D^{-1} C M) B D^{-1} = 0$$

where we have used the result for the top-left block. The bottom-left block equals:

$$C M - D D^{-1} C M = 0$$

And the bottom-right block equals:

$$-C M B D^{-1} + D D^{-1} + D D^{-1} C M B D^{-1} = I$$

So we have proved what we are asked. Note: to be more precise, one should also multiply by the block matrix on the right side of (2.76) from the left and prove that the product is again the identity matrix. The procedure above can be used there as well, so we omit the proof; moreover, if two square matrices $X$ and $Y$ satisfy $XY = I$, it can be shown that $YX = I$ also holds.

Problem 2.25 Solution

We take advantage of the results (2.94)-(2.98). We begin by grouping $x_a$ and $x_b$ together, and rewrite what has been given as:

$$x = \begin{pmatrix} x_{a,b} \\ x_c \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_{a,b} \\ \mu_c \end{pmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{(a,b)(a,b)} & \Sigma_{(a,b)c} \\ \Sigma_{c(a,b)} & \Sigma_{cc} \end{bmatrix}$$

Then, using (2.98), we obtain:

$$p(x_{a,b}) = N(x_{a,b} \mid \mu_{a,b}, \Sigma_{(a,b)(a,b)})$$

where we have defined:

$$\mu_{a,b} = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad \Sigma_{(a,b)(a,b)} = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}$$

Now that we have the joint distribution of $x_a$ and $x_b$, we use (2.96) and (2.97) to obtain the conditional distribution, which gives:

$$p(x_a \mid x_b) = N(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1})$$

where we have defined:

$$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b)$$

The expressions for $\Lambda_{aa}^{-1}$ and $\Lambda_{ab}$ can be obtained from (2.76) and (2.77) once we notice that the following relation exists:

$$\begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1}$$

Problem 2.26 Solution

This problem is quite straightforward if we just follow the hint.
$$\begin{aligned}
&(A + BCD)\left(A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1}\right) \\
&= A A^{-1} - A A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} + BCD A^{-1} - BCD A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= I - B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} + BCD A^{-1} + B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} - BCD A^{-1} \\
&= I
\end{aligned}$$

where we have used:

$$\begin{aligned}
&-BCD A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= -BC\left(-C^{-1} + C^{-1} + D A^{-1} B\right)(C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= (-BC)(-C^{-1})(C^{-1} + D A^{-1} B)^{-1} D A^{-1} + (-BC)(C^{-1} + D A^{-1} B)(C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} - BCD A^{-1}
\end{aligned}$$

Here we also give another solution by calculating the inverse matrix directly. We begin by introducing two useful formulas. The first is:

$$(I + P)^{-1} = (I + P)^{-1}(I + P - P) = I - (I + P)^{-1} P$$

And since

$$P + PQP = P(I + QP) = (I + PQ)P$$

the second formula is:

$$(I + PQ)^{-1} P = P (I + QP)^{-1}$$

Now let's directly calculate $(A + BCD)^{-1}$:

$$\begin{aligned}
(A + BCD)^{-1} &= [A(I + A^{-1} BCD)]^{-1} \\
&= (I + A^{-1} BCD)^{-1} A^{-1} \\
&= \left[I - (I + A^{-1} BCD)^{-1} A^{-1} BCD\right] A^{-1} \\
&= A^{-1} - (I + A^{-1} BCD)^{-1} A^{-1} BCD A^{-1}
\end{aligned}$$

where we have assumed that $A$ is invertible and used the first formula. We then also assume that $C$ is invertible and apply the second formula repeatedly:

$$\begin{aligned}
(A + BCD)^{-1} &= A^{-1} - (I + A^{-1} BCD)^{-1} A^{-1} BCD A^{-1} \\
&= A^{-1} - A^{-1} (I + BCD A^{-1})^{-1} BCD A^{-1} \\
&= A^{-1} - A^{-1} B (I + CD A^{-1} B)^{-1} CD A^{-1} \\
&= A^{-1} - A^{-1} B \left[C (C^{-1} + D A^{-1} B)\right]^{-1} CD A^{-1} \\
&= A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} C^{-1} C D A^{-1} \\
&= A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1}
\end{aligned}$$

Just as required.

Problem 2.27 Solution

The same procedure used in Prob.1.10 can be used here.

$$\begin{aligned}
E[x + z] &= \iint (x + z)\, p(x, z)\, dx\, dz \\
&= \iint (x + z)\, p(x) p(z)\, dx\, dz \\
&= \iint x\, p(x) p(z)\, dx\, dz + \iint z\, p(x) p(z)\, dx\, dz \\
&= \Big(\int p(z)\, dz\Big) \int x\, p(x)\, dx + \Big(\int p(x)\, dx\Big) \int z\, p(z)\, dz \\
&= \int x\, p(x)\, dx + \int z\, p(z)\, dz \\
&= E[x] + E[z]
\end{aligned}$$

For the covariance matrix, we use the matrix integral:

$$\mathrm{cov}[x + z] = \iint (x + z - E[x + z])(x + z - E[x + z])^T p(x, z)\, dx\, dz$$

The same procedure can also be used here.
We omit the proof for simplicity.

Problem 2.28 Solution

It is quite straightforward when we compare the problem with (2.94)-(2.98). We treat $x$ in (2.94) as $z$ in this problem, $x_a$ in (2.94) as $x$ in this problem, and $x_b$ in (2.94) as $y$ in this problem. In other words, we rewrite the problem in the form of (2.94)-(2.98), which gives:

$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad E(z) = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}, \qquad \mathrm{cov}(z) = \begin{bmatrix} \Lambda^{-1} & \Lambda^{-1} A^T \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^T \end{bmatrix}$$

By using (2.98), we obtain:

$$p(x) = N(x \mid \mu, \Lambda^{-1})$$

And by using (2.96) and (2.97), we obtain:

$$p(y \mid x) = N(y \mid \mu_{y|x}, \Lambda_{yy}^{-1})$$

where $\Lambda_{yy}$ is the bottom-right block of (2.104), which gives $\Lambda_{yy} = L$ and hence $\Lambda_{yy}^{-1} = L^{-1}$; you can also calculate it using (2.105) combined with (2.78) and (2.79). Finally, the conditional mean is given by (2.97):

$$\mu_{y|x} = A\mu + b - L^{-1}(-LA)(x - \mu) = Ax + b$$

Problem 2.29 Solution

It is straightforward. Firstly, we calculate the top-left block:

$$\text{top left} = \left[(\Lambda + A^T L A) - (-A^T L)(L^{-1})(-LA)\right]^{-1} = \Lambda^{-1}$$

Then the top-right block:

$$\text{top right} = -\Lambda^{-1}(-A^T L) L^{-1} = \Lambda^{-1} A^T$$

Then the bottom-left block:

$$\text{bottom left} = -L^{-1}(-LA)\Lambda^{-1} = A \Lambda^{-1}$$

Finally the bottom-right block:

$$\text{bottom right} = L^{-1} + L^{-1}(-LA)\Lambda^{-1}(-A^T L) L^{-1} = L^{-1} + A \Lambda^{-1} A^T$$

Problem 2.30 Solution

It is straightforward by multiplying (2.105) and (2.107), which gives:

$$\begin{pmatrix} \Lambda^{-1} & \Lambda^{-1} A^T \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^T \end{pmatrix} \begin{pmatrix} \Lambda\mu - A^T L b \\ L b \end{pmatrix} = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}$$

Just as required in the problem.

Problem 2.31 Solution

According to the problem, we can write two expressions:

$$p(x) = N(x \mid \mu_x, \Sigma_x), \qquad p(y \mid x) = N(y \mid \mu_z + x, \Sigma_z)$$

By comparing the expressions above with (2.113)-(2.117), we can write down $p(y)$:

$$p(y) = N(y \mid \mu_x + \mu_z, \Sigma_x + \Sigma_z)$$

Problem 2.32 Solution

Let's first make this problem clearer.
The derivation in the main text, i.e. (2.101)-(2.110), first introduces a new random variable $z$ corresponding to the joint distribution, then completes the square with respect to $z$, i.e. (2.103), and obtains the precision matrix $R$ by comparing (2.103) with the PDF of a multivariate Gaussian distribution. It then inverts the precision matrix to obtain the covariance matrix, and finally uses the linear term, i.e. (2.106), to calculate the mean. In this problem we are asked to proceed from another perspective: we write the joint distribution $p(x, y)$ and then integrate over $x$ to obtain the marginal distribution $p(y)$. We begin by writing the quadratic form in the exponent of $p(x, y)$:

$$-\frac{1}{2}(x - \mu)^T \Lambda (x - \mu) - \frac{1}{2}(y - Ax - b)^T L (y - Ax - b)$$

We extract the terms involving $x$:

$$\begin{aligned}
&= -\frac{1}{2} x^T (\Lambda + A^T L A) x + x^T\left[\Lambda\mu + A^T L (y - b)\right] + \text{const} \\
&= -\frac{1}{2}(x - m)^T (\Lambda + A^T L A)(x - m) + \frac{1}{2} m^T (\Lambda + A^T L A) m + \text{const}
\end{aligned}$$

where we have defined:

$$m = (\Lambda + A^T L A)^{-1}\left[\Lambda\mu + A^T L (y - b)\right]$$

If we now integrate over $x$, the first term integrates to a constant independent of $y$. Extracting the terms involving $y$ from the remaining parts (including the $-\frac{1}{2}(y-b)^T L (y-b)$ term absorbed into "const" above), we obtain:

$$= -\frac{1}{2} y^T\left[L - L A (\Lambda + A^T L A)^{-1} A^T L\right] y + y^T\left\{\left[L - L A (\Lambda + A^T L A)^{-1} A^T L\right] b + L A (\Lambda + A^T L A)^{-1} \Lambda\mu\right\}$$

From the quadratic term we read off the precision matrix, and then by using (2.289) we obtain (2.110). Finally, using the linear term combined with the now known covariance matrix, we obtain (2.109).

Problem 2.33 Solution

According to Bayes' theorem, we can write $p(x \mid y) = p(x, y)/p(y)$, where the joint distribution $p(x, y)$ is already known from (2.105) and (2.108), and the marginal distribution $p(y)$ from Prob.2.32. We can follow the same procedure as in Prob.2.32, i.e. first obtain the covariance matrix from the quadratic term and then obtain the mean from the linear term. The details are omitted here.
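The marginalisation result of Prob.2.32 can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated test matrices (the names `random_spd`, `prec_y` and `lin_y` are illustrative and not from the text): it builds the precision matrix and linear coefficient of $y$ read off from the exponent, and compares the implied covariance and mean with (2.110) and (2.109).

```python
import numpy as np

rng = np.random.default_rng(0)
Dx, Dy = 3, 2

def random_spd(d):
    """Random symmetric positive definite matrix (hypothetical test helper)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

Lam = random_spd(Dx)               # precision of p(x)
L = random_spd(Dy)                 # precision of p(y|x)
A = rng.standard_normal((Dy, Dx))
mu = rng.standard_normal(Dx)
b = rng.standard_normal(Dy)

# Quadratic and linear coefficients of y after integrating out x.
M = Lam + A.T @ L @ A
prec_y = L - L @ A @ np.linalg.solve(M, A.T @ L)
lin_y = prec_y @ b + L @ A @ np.linalg.solve(M, Lam @ mu)

# Covariance is the inverse precision; mean is covariance times linear term.
cov_y = np.linalg.inv(prec_y)
mean_y = cov_y @ lin_y

# Closed forms (2.110) and (2.109).
cov_expected = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T
mean_expected = A @ mu + b

print(np.allclose(cov_y, cov_expected), np.allclose(mean_y, mean_expected))
```

The agreement of `cov_y` with `cov_expected` is an instance of the Woodbury identity (2.289) used in the solution.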
Problem 2.34 Solution

Let's follow the hint by first calculating the derivative of (2.118) with respect to $\Sigma$ and setting it equal to 0:

$$-\frac{N}{2}\frac{\partial}{\partial \Sigma} \ln|\Sigma| - \frac{1}{2}\frac{\partial}{\partial \Sigma}\sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) = 0$$

By using (C.28), the first term can be reduced to:

$$-\frac{N}{2}\frac{\partial}{\partial \Sigma} \ln|\Sigma| = -\frac{N}{2}(\Sigma^{-1})^T = -\frac{N}{2}\Sigma^{-1}$$

Anticipating the result that the optimal covariance matrix is the sample covariance, we denote the sample covariance matrix $S$ as:

$$S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T$$

We rewrite the second term:

$$\begin{aligned}
\text{second term} &= -\frac{1}{2}\frac{\partial}{\partial \Sigma}\sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) \\
&= -\frac{N}{2}\frac{\partial}{\partial \Sigma} \mathrm{Tr}[\Sigma^{-1} S] \\
&= \frac{N}{2}\Sigma^{-1} S \Sigma^{-1}
\end{aligned}$$

where we have used the following property, combined with the fact that $S$ and $\Sigma$ are symmetric. (Note: this property can be found in The Matrix Cookbook.)

$$\frac{\partial}{\partial X} \mathrm{Tr}(A X^{-1} B) = -(X^{-1} B A X^{-1})^T = -(X^{-1})^T A^T B^T (X^{-1})^T$$

Thus we obtain:

$$-\frac{N}{2}\Sigma^{-1} + \frac{N}{2}\Sigma^{-1} S \Sigma^{-1} = 0$$

Multiplying by $\Sigma$ on both the left and the right gives $\Sigma = S$, just as required.

Problem 2.35 Solution

The proof of (2.62) is quite clear in the main text, i.e. from page 82 to page 83, and hence we won't repeat it here. Let's prove (2.124). We first begin by proving (2.123):

$$E[\mu_{ML}] = \frac{1}{N} E\Big[\sum_{n=1}^{N} x_n\Big] = \frac{1}{N}\cdot N\mu = \mu$$

where we have used the fact that the $x_n$ are independent and identically distributed (i.i.d.).
Then we use the expression in (2.122):

$$\begin{aligned}
E[\Sigma_{ML}] &= \frac{1}{N} E\Big[\sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T\Big] \\
&= \frac{1}{N}\sum_{n=1}^{N} E\left[(x_n - \mu_{ML})(x_n - \mu_{ML})^T\right] \\
&= \frac{1}{N}\sum_{n=1}^{N} E\left[x_n x_n^T - 2\mu_{ML} x_n^T + \mu_{ML}\mu_{ML}^T\right] \\
&= \frac{1}{N}\sum_{n=1}^{N} E[x_n x_n^T] - 2\,\frac{1}{N}\sum_{n=1}^{N} E[\mu_{ML} x_n^T] + \frac{1}{N}\sum_{n=1}^{N} E[\mu_{ML}\mu_{ML}^T]
\end{aligned}$$

Here the two cross terms $-\mu_{ML} x_n^T$ and $-x_n \mu_{ML}^T$ have been combined, since their expectations summed over $n$ coincide. By using (2.291), the first term equals:

$$\text{first term} = \frac{1}{N}\cdot N(\mu\mu^T + \Sigma) = \mu\mu^T + \Sigma$$

The second term equals:

$$\begin{aligned}
\text{second term} &= -2\,\frac{1}{N}\sum_{n=1}^{N} E[\mu_{ML} x_n^T] \\
&= -2\,\frac{1}{N}\sum_{n=1}^{N} E\Big[\Big(\frac{1}{N}\sum_{m=1}^{N} x_m\Big) x_n^T\Big] \\
&= -2\,\frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} E[x_m x_n^T] \\
&= -2\,\frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} (\mu\mu^T + I_{nm}\Sigma) \\
&= -2\,\frac{1}{N^2}(N^2 \mu\mu^T + N\Sigma) \\
&= -2\Big(\mu\mu^T + \frac{1}{N}\Sigma\Big)
\end{aligned}$$

Similarly, the third term equals:

$$\begin{aligned}
\text{third term} &= \frac{1}{N}\sum_{n=1}^{N} E[\mu_{ML}\mu_{ML}^T] \\
&= \frac{1}{N}\sum_{n=1}^{N} E\Big[\Big(\frac{1}{N}\sum_{j=1}^{N} x_j\Big)\Big(\frac{1}{N}\sum_{i=1}^{N} x_i\Big)^T\Big] \\
&= \frac{1}{N^3}\sum_{n=1}^{N}\sum_{j=1}^{N}\sum_{i=1}^{N} E[x_j x_i^T] \\
&= \frac{1}{N^3}\sum_{n=1}^{N} (N^2 \mu\mu^T + N\Sigma) \\
&= \mu\mu^T + \frac{1}{N}\Sigma
\end{aligned}$$

Finally, we combine the three terms, which gives:

$$E[\Sigma_{ML}] = \frac{N-1}{N}\Sigma$$

Note: the same procedure as in (2.59)-(2.62) can be carried out to prove (2.291); the only difference is that we need to introduce the indices $m$ and $n$ to label the samples. (2.291) is quite straightforward if we see it in this way: if $m = n$, which means $x_n$ and $x_m$ are actually the same sample, (2.291) reduces to (2.62) (i.e. the correlation between different dimensions exists); and if $m \neq n$, which means $x_n$ and $x_m$ are different i.i.d. samples, no correlation exists between them and $E[x_n x_m^T] = E[x_n] E[x_m]^T = \mu\mu^T$.

Problem 2.36 Solution

Let's follow the hint. However, we first derive the sequential expression from the definition, which will make it easier to find the coefficient $a_{N-1}$ later.
Suppose we have $N$ observations in total. Then we can write:

$$\begin{aligned}
\sigma^{2\,(N)}_{ML} &= \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu^{(N)}_{ML})^2 \\
&= \frac{1}{N}\Big[\sum_{n=1}^{N-1} (x_n - \mu^{(N)}_{ML})^2 + (x_N - \mu^{(N)}_{ML})^2\Big] \\
&= \frac{N-1}{N}\,\frac{1}{N-1}\sum_{n=1}^{N-1} (x_n - \mu^{(N)}_{ML})^2 + \frac{1}{N}(x_N - \mu^{(N)}_{ML})^2 \\
&= \frac{N-1}{N}\,\sigma^{2\,(N-1)}_{ML} + \frac{1}{N}(x_N - \mu^{(N)}_{ML})^2 \\
&= \sigma^{2\,(N-1)}_{ML} + \frac{1}{N}\Big[(x_N - \mu^{(N)}_{ML})^2 - \sigma^{2\,(N-1)}_{ML}\Big]
\end{aligned}$$

Next we write down the condition satisfied by $\sigma_{ML}$:

$$\left\{\frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial \sigma^2} \ln p(x_n \mid \mu, \sigma)\right\}\Bigg|_{\sigma_{ML}} = 0$$

By exchanging the summation and the derivative, and letting $N \to +\infty$, we obtain:

$$\lim_{N \to +\infty} \frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial \sigma^2} \ln p(x_n \mid \mu, \sigma) = E_x\Big[\frac{\partial}{\partial \sigma^2} \ln p(x \mid \mu, \sigma)\Big]$$

Comparing this with (2.127), we obtain the sequential formula for estimating $\sigma_{ML}$:

$$\begin{aligned}
\sigma^{2\,(N)}_{ML} &= \sigma^{2\,(N-1)}_{ML} + a_{N-1}\,\frac{\partial}{\partial \sigma^{2\,(N-1)}_{ML}} \ln p(x_N \mid \mu^{(N)}_{ML}, \sigma^{(N-1)}_{ML}) \qquad (*) \\
&= \sigma^{2\,(N-1)}_{ML} + a_{N-1}\Big[-\frac{1}{2\sigma^{2\,(N-1)}_{ML}} + \frac{(x_N - \mu^{(N)}_{ML})^2}{2\sigma^{4\,(N-1)}_{ML}}\Big]
\end{aligned}$$

where we use $\sigma^{2\,(N)}_{ML}$ to denote the $N$th estimate of $\sigma^2_{ML}$, i.e. the estimate of $\sigma^2_{ML}$ after the $N$th observation. If we choose:

$$a_{N-1} = \frac{2\sigma^{4\,(N-1)}_{ML}}{N}$$

then we obtain:

$$\sigma^{2\,(N)}_{ML} = \sigma^{2\,(N-1)}_{ML} + \frac{1}{N}\Big[-\sigma^{2\,(N-1)}_{ML} + (x_N - \mu^{(N)}_{ML})^2\Big]$$

We can see that the two results are the same. One important point should be noticed: in maximum likelihood, when estimating the variance $\sigma^{2\,(N)}_{ML}$, we first estimate the mean $\mu^{(N)}_{ML}$ and then calculate the variance $\sigma^{2\,(N)}_{ML}$. In other words, they are decoupled. The same holds for the sequential method. For instance, if we want to estimate both the mean and the variance sequentially, then after observing the $N$th sample (i.e. $x_N$), we first use $\mu^{(N-1)}_{ML}$ together with (2.126) to estimate $\mu^{(N)}_{ML}$, and then use the conclusion of this problem to obtain $\sigma^{(N)}_{ML}$. That is why in $(*)$ we write $\ln p(x_N \mid \mu^{(N)}_{ML}, \sigma^{(N-1)}_{ML})$ instead of $\ln p(x_N \mid \mu^{(N-1)}_{ML}, \sigma^{(N-1)}_{ML})$.

Problem 2.37 Solution (Wait for revising)

We follow the same procedure as in Prob.2.36 to solve this problem. Firstly, we can obtain the sequential formula based on the definition.
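$$\begin{aligned}
\Sigma^{(N)}_{ML} &= \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu^{(N)}_{ML})(x_n - \mu^{(N)}_{ML})^T \\
&= \frac{1}{N}\Big[\sum_{n=1}^{N-1} (x_n - \mu^{(N)}_{ML})(x_n - \mu^{(N)}_{ML})^T + (x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T\Big] \\
&= \frac{N-1}{N}\,\Sigma^{(N-1)}_{ML} + \frac{1}{N}(x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T \\
&= \Sigma^{(N-1)}_{ML} + \frac{1}{N}\Big[(x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T - \Sigma^{(N-1)}_{ML}\Big]
\end{aligned}$$

If we use the Robbins-Monro sequential estimation formula, i.e. (2.135), we obtain:

$$\begin{aligned}
\Sigma^{(N)}_{ML} &= \Sigma^{(N-1)}_{ML} + a_{N-1}\,\frac{\partial}{\partial \Sigma^{(N-1)}_{ML}} \ln p(x_N \mid \mu^{(N)}_{ML}, \Sigma^{(N-1)}_{ML}) \\
&= \Sigma^{(N-1)}_{ML} + a_{N-1}\Big[-\frac{1}{2}[\Sigma^{(N-1)}_{ML}]^{-1} + \frac{1}{2}[\Sigma^{(N-1)}_{ML}]^{-1}(x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T [\Sigma^{(N-1)}_{ML}]^{-1}\Big]
\end{aligned}$$

where we have used the procedure carried out in Prob.2.34 to calculate the derivative. If we choose:

$$a_{N-1} = \frac{2}{N}\big(\Sigma^{(N-1)}_{ML}\big)^2$$

understood as scaling by $2/N$ and multiplying the bracket by $\Sigma^{(N-1)}_{ML}$ on both the left and the right, the equation above becomes identical to our previous conclusion based on the definition.

Problem 2.38 Solution

It is straightforward. Based on (2.137), (2.138) and (2.139), we focus on the exponent of the posterior distribution $p(\mu \mid X)$, which gives, up to terms independent of $\mu$:

$$-\frac{1}{2\sigma^2}\sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 = -\frac{1}{2\sigma_N^2}(\mu - \mu_N)^2$$

We collect the terms on the left side by powers of $\mu$:

$$\text{quadratic term} = -\Big(\frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2}\Big)\mu^2, \qquad \text{linear term} = \Big(\frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\Big)\mu$$

Doing the same on the right side and matching coefficients, we obtain:

$$-\Big(\frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2}\Big)\mu^2 = -\frac{1}{2\sigma_N^2}\mu^2, \qquad \Big(\frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\Big)\mu = \frac{\mu_N}{\sigma_N^2}\mu$$

Then we obtain:

$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \qquad \sigma_N^2 = \Big(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\Big)^{-1} = \frac{\sigma_0^2 \sigma^2}{\sigma^2 + N\sigma_0^2}$$

And using $\sum_{n=1}^{N} x_n = N \mu_{ML}$, we can write:

$$\begin{aligned}
\mu_N &= \sigma_N^2 \cdot \Big(\frac{N\mu_{ML}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\Big) \\
&= \frac{\sigma_0^2 \sigma^2}{\sigma^2 + N\sigma_0^2} \cdot \frac{N\mu_{ML}\sigma_0^2 + \mu_0\sigma^2}{\sigma^2 \sigma_0^2} \\
&= \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}
\end{aligned}$$

Problem 2.39 Solution

Let's follow the hint.

$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} = \frac{1}{\sigma_0^2} + \frac{N-1}{\sigma^2} + \frac{1}{\sigma^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}$$

However, it is complicated to derive a sequential formula for $\mu_N$ directly.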
Based on (2.142), we see that the denominator in (2.141) can be eliminated if we multiply both sides of (2.141) by $1/\sigma_N^2$. Therefore we derive a sequential formula for $\mu_N/\sigma_N^2$ instead.

$$\begin{aligned}
\frac{\mu_N}{\sigma_N^2} &= \frac{\sigma^2 + N\sigma_0^2}{\sigma_0^2 \sigma^2}\Big(\frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu^{(N)}_{ML}\Big) \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{N\mu^{(N)}_{ML}}{\sigma^2} \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N} x_n}{\sigma^2} \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N-1} x_n}{\sigma^2} + \frac{x_N}{\sigma^2} \\
&= \frac{\mu_{N-1}}{\sigma_{N-1}^2} + \frac{x_N}{\sigma^2}
\end{aligned}$$

Another possible solution is suggested in the problem: completing the square.

$$-\frac{1}{2\sigma^2}(x_N - \mu)^2 - \frac{1}{2\sigma_{N-1}^2}(\mu - \mu_{N-1})^2 = -\frac{1}{2\sigma_N^2}(\mu - \mu_N)^2 + \text{const}$$

By comparing the quadratic and linear terms in $\mu$, we obtain:

$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_{N-1}^2}$$

and:

$$\frac{\mu_N}{\sigma_N^2} = \frac{x_N}{\sigma^2} + \frac{\mu_{N-1}}{\sigma_{N-1}^2}$$

This is the same as the previous result. Note: after obtaining the $N$th observation, we first use the sequential formula to calculate $\sigma_N^2$, and then $\mu_N$. This is because the sequential formula for $\mu_N$ depends on $\sigma_N^2$.

Problem 2.40 Solution

Based on Bayes' theorem, we can write:

$$p(\mu \mid X) \propto p(X \mid \mu)\, p(\mu)$$

We focus on the exponent of the right side and rearrange it with respect to $\mu$:

$$\begin{aligned}
\text{right} &= -\frac{1}{2}\sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) - \frac{1}{2}(\mu - \mu_0)^T \Sigma_0^{-1} (\mu - \mu_0) \\
&= -\frac{1}{2}\mu^T (\Sigma_0^{-1} + N\Sigma^{-1})\mu + \mu^T\Big(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_{n=1}^{N} x_n\Big) + \text{const}
\end{aligned}$$

where "const" represents all the terms independent of $\mu$. From the quadratic term, we obtain the posterior covariance matrix:

$$\Sigma_N^{-1} = \Sigma_0^{-1} + N\Sigma^{-1}$$

Then, using the linear term, we obtain:

$$\Sigma_N^{-1}\mu_N = \Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_{n=1}^{N} x_n$$

Finally we obtain the posterior mean:

$$\mu_N = (\Sigma_0^{-1} + N\Sigma^{-1})^{-1}\Big(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_{n=1}^{N} x_n\Big)$$

which can also be written as:

$$\mu_N = (\Sigma_0^{-1} + N\Sigma^{-1})^{-1}(\Sigma_0^{-1}\mu_0 + \Sigma^{-1} N \mu_{ML})$$

Problem 2.41 Solution

Let's compute the integral of (2.146) over $\lambda$.
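$$\begin{aligned}
\int_0^{+\infty} \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)\, d\lambda &= \frac{b^a}{\Gamma(a)}\int_0^{+\infty} \lambda^{a-1}\exp(-b\lambda)\, d\lambda \\
&= \frac{b^a}{\Gamma(a)}\int_0^{+\infty} \Big(\frac{u}{b}\Big)^{a-1}\exp(-u)\,\frac{1}{b}\, du \\
&= \frac{1}{\Gamma(a)}\int_0^{+\infty} u^{a-1}\exp(-u)\, du \\
&= \frac{1}{\Gamma(a)}\cdot\Gamma(a) = 1
\end{aligned}$$

where we first perform the change of variable $b\lambda = u$, and then use the definition of the gamma function:

$$\Gamma(x) = \int_0^{+\infty} u^{x-1} e^{-u}\, du$$

Problem 2.42 Solution

We first calculate its mean.

$$\begin{aligned}
\int_0^{+\infty} \lambda\,\frac{1}{\Gamma(a)} b^a \lambda^{a-1}\exp(-b\lambda)\, d\lambda &= \frac{b^a}{\Gamma(a)}\int_0^{+\infty} \lambda^{a}\exp(-b\lambda)\, d\lambda \\
&= \frac{b^a}{\Gamma(a)}\int_0^{+\infty} \Big(\frac{u}{b}\Big)^{a}\exp(-u)\,\frac{1}{b}\, du \\
&= \frac{1}{\Gamma(a)\cdot b}\int_0^{+\infty} u^{a}\exp(-u)\, du \\
&= \frac{1}{\Gamma(a)\cdot b}\cdot\Gamma(a+1) = \frac{a}{b}
\end{aligned}$$

where we have used the property $\Gamma(a+1) = a\Gamma(a)$. Then we calculate $E[\lambda^2]$.

$$\begin{aligned}
\int_0^{+\infty} \lambda^2\,\frac{1}{\Gamma(a)} b^a \lambda^{a-1}\exp(-b\lambda)\, d\lambda &= \frac{b^a}{\Gamma(a)}\int_0^{+\infty} \lambda^{a+1}\exp(-b\lambda)\, d\lambda \\
&= \frac{b^a}{\Gamma(a)}\int_0^{+\infty} \Big(\frac{u}{b}\Big)^{a+1}\exp(-u)\,\frac{1}{b}\, du \\
&= \frac{1}{\Gamma(a)\cdot b^2}\int_0^{+\infty} u^{a+1}\exp(-u)\, du \\
&= \frac{1}{\Gamma(a)\cdot b^2}\cdot\Gamma(a+2) = \frac{a(a+1)}{b^2}
\end{aligned}$$

Therefore, according to $\mathrm{var}[\lambda] = E[\lambda^2] - E[\lambda]^2$, we obtain:

$$\mathrm{var}[\lambda] = E[\lambda^2] - E[\lambda]^2 = \frac{a(a+1)}{b^2} - \Big(\frac{a}{b}\Big)^2 = \frac{a}{b^2}$$

For the mode of a gamma distribution, we need to find where the maximum of the PDF occurs, and hence we calculate the derivative of the gamma distribution with respect to $\lambda$.

$$\frac{d}{d\lambda}\Big[\frac{1}{\Gamma(a)} b^a \lambda^{a-1}\exp(-b\lambda)\Big] = [(a-1) - b\lambda]\,\frac{1}{\Gamma(a)} b^a \lambda^{a-2}\exp(-b\lambda)$$

The derivative vanishes at $\lambda = (a-1)/b$, is positive before it and negative after it, so $\mathrm{Gam}(\lambda \mid a, b)$ has its maximum there. In other words, the gamma distribution $\mathrm{Gam}(\lambda \mid a, b)$ has mode $(a-1)/b$ (for $a \geq 1$).

Problem 2.43 Solution

Let's first calculate the following integral.

$$\begin{aligned}
\int_{-\infty}^{+\infty} \exp\Big(-\frac{|x|^q}{2\sigma^2}\Big)\, dx &= 2\int_0^{+\infty} \exp\Big(-\frac{x^q}{2\sigma^2}\Big)\, dx \\
&= 2\int_0^{+\infty} \exp(-u)\,\frac{(2\sigma^2)^{1/q}}{q}\, u^{\frac{1}{q}-1}\, du \\
&= 2\,\frac{(2\sigma^2)^{1/q}}{q}\int_0^{+\infty} \exp(-u)\, u^{\frac{1}{q}-1}\, du \\
&= 2\,\frac{(2\sigma^2)^{1/q}}{q}\,\Gamma\Big(\frac{1}{q}\Big)
\end{aligned}$$

where we have made the change of variable $u = x^q/(2\sigma^2)$. It is then obvious that (2.293) is normalized. Next, we consider the log likelihood function.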
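Since $\epsilon = t - y(x, w)$ and $\epsilon \sim p(\epsilon \mid \sigma^2, q)$, we can write:

$$\begin{aligned}
\ln p(t \mid X, w, \sigma^2) &= \sum_{n=1}^{N} \ln p\big(y(x_n, w) - t_n \mid \sigma^2, q\big) \\
&= -\frac{1}{2\sigma^2}\sum_{n=1}^{N} |y(x_n, w) - t_n|^q + N\cdot\ln\Big[\frac{q}{2(2\sigma^2)^{1/q}\,\Gamma(1/q)}\Big] \\
&= -\frac{1}{2\sigma^2}\sum_{n=1}^{N} |y(x_n, w) - t_n|^q - \frac{N}{q}\ln(2\sigma^2) + \text{const}
\end{aligned}$$

Problem 2.44 Solution

Here we use a simple method to solve this problem by taking advantage of (2.152) and (2.153). By writing the prior distribution in the form of (2.153), i.e. $p(\mu, \lambda \mid \beta, c, d)$, we can easily obtain the posterior distribution.

$$\begin{aligned}
p(\mu, \lambda \mid X) &\propto p(X \mid \mu, \lambda)\cdot p(\mu, \lambda) \\
&\propto \Big[\lambda^{1/2}\exp\Big(-\frac{\lambda\mu^2}{2}\Big)\Big]^{N+\beta}\exp\Big[\Big(c + \sum_{n=1}^{N} x_n\Big)\lambda\mu - \Big(d + \sum_{n=1}^{N}\frac{x_n^2}{2}\Big)\lambda\Big]
\end{aligned}$$

Therefore, we can see that the posterior distribution has parameters $\beta' = \beta + N$, $c' = c + \sum_{n=1}^{N} x_n$, $d' = d + \sum_{n=1}^{N} \frac{x_n^2}{2}$. The prior distribution is actually the product of a Gaussian distribution and a Gamma distribution:

$$p(\mu, \lambda \mid \mu_0, \beta, a, b) = N\big[\mu \mid \mu_0, (\beta\lambda)^{-1}\big]\,\mathrm{Gam}(\lambda \mid a, b)$$

where $\mu_0 = c/\beta$, $a = (1 + \beta)/2$, $b = d - c^2/2\beta$ (the value of $a$ follows from matching the power of $\lambda$: $\lambda^{\beta/2} = \lambda^{1/2}\lambda^{a-1}$). Hence the posterior distribution can also be written as the product of a Gaussian distribution and a Gamma distribution.

$$p(\mu, \lambda \mid X) = N\big[\mu \mid \mu_0', (\beta'\lambda)^{-1}\big]\,\mathrm{Gam}(\lambda \mid a', b')$$

where we have defined:

$$\mu_0' = c'/\beta' = \Big(c + \sum_{n=1}^{N} x_n\Big)\Big/(N + \beta)$$

$$a' = (1 + \beta')/2 = (1 + N + \beta)/2$$

$$b' = d' - c'^2/2\beta' = d + \sum_{n=1}^{N}\frac{x_n^2}{2} - \Big(c + \sum_{n=1}^{N} x_n\Big)^2\Big/\big(2(\beta + N)\big)$$

Problem 2.45 Solution

Let's begin by writing down the dependency of the prior distribution $W(\Lambda \mid W, \nu)$ and the likelihood function $p(X \mid \mu, \Lambda)$ on $\Lambda$.

$$p(X \mid \mu, \Lambda) \propto |\Lambda|^{N/2}\exp\Big[\sum_{n=1}^{N} -\frac{1}{2}(x_n - \mu)^T \Lambda (x_n - \mu)\Big]$$

If we denote (note that, unlike in Prob.2.34, there is no factor $1/N$ here):

$$S = \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T$$

then we can rewrite the equation above as:

$$p(X \mid \mu, \Lambda) \propto |\Lambda|^{N/2}\exp\Big[-\frac{1}{2}\mathrm{Tr}(S\Lambda)\Big]$$

This is just what we did in Prob.2.34. Comparing this problem with Prob.2.34, one important thing should be noticed: since $S$ and $\Lambda$ are both symmetric, we have $\mathrm{Tr}(S\Lambda) = \mathrm{Tr}\big((S\Lambda)^T\big) = \mathrm{Tr}(\Lambda^T S^T) = \mathrm{Tr}(\Lambda S)$.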
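The trace rewriting used for the likelihood can be checked numerically. The sketch below assumes NumPy and random test data, and takes $S$ to be the unnormalised scatter matrix $\sum_n (x_n - \mu)(x_n - \mu)^T$ (an assumption of this sketch; if $S$ carries a factor $1/N$, the trace picks up a factor $N$).

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 3
X = rng.standard_normal((N, D))        # rows are the observations x_n
mu = rng.standard_normal(D)
B = rng.standard_normal((D, D))
Lam = B @ B.T + D * np.eye(D)          # a symmetric positive definite precision

diff = X - mu                          # rows are x_n - mu

# Sum of quadratic forms, as it appears in the Gaussian exponent.
quad_sum = sum(d @ Lam @ d for d in diff)

# Unnormalised scatter matrix (assumption: no 1/N factor).
S = diff.T @ diff

trace_form = np.trace(S @ Lam)
trace_swapped = np.trace(Lam @ S)

print(np.isclose(quad_sum, trace_form), np.isclose(trace_form, trace_swapped))
```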
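And we can also write down the prior distribution as:

$$W(\Lambda \mid W, \nu) \propto |\Lambda|^{(\nu - D - 1)/2}\exp\Big[-\frac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\Big]$$

Therefore, the posterior distribution can be obtained:

$$\begin{aligned}
p(\Lambda \mid X, W, \nu) &\propto p(X \mid \mu, \Lambda)\cdot W(\Lambda \mid W, \nu) \\
&\propto |\Lambda|^{(N + \nu - D - 1)/2}\exp\Big\{-\frac{1}{2}\mathrm{Tr}\big[(W^{-1} + S)\Lambda\big]\Big\}
\end{aligned}$$

Therefore, $p(\Lambda \mid X, W, \nu)$ is also a Wishart distribution, with parameters:

$$\nu_N = N + \nu, \qquad W_N = (W^{-1} + S)^{-1}$$

Problem 2.46 Solution

It is quite straightforward.

$$\begin{aligned}
p(x \mid \mu, a, b) &= \int_0^{\infty} N(x \mid \mu, \tau^{-1})\,\mathrm{Gam}(\tau \mid a, b)\, d\tau \\
&= \int_0^{\infty} \frac{b^a \exp(-b\tau)\tau^{a-1}}{\Gamma(a)}\Big(\frac{\tau}{2\pi}\Big)^{1/2}\exp\Big\{-\frac{\tau}{2}(x - \mu)^2\Big\}\, d\tau \\
&= \frac{b^a}{\Gamma(a)}\Big(\frac{1}{2\pi}\Big)^{1/2}\int_0^{\infty} \tau^{a-1/2}\exp\Big\{-b\tau - \frac{\tau}{2}(x - \mu)^2\Big\}\, d\tau
\end{aligned}$$

If we make the change of variable $z = \tau[b + (x - \mu)^2/2]$, the integral above can be written as:

$$\begin{aligned}
p(x \mid \mu, a, b) &= \frac{b^a}{\Gamma(a)}\Big(\frac{1}{2\pi}\Big)^{1/2}\int_0^{\infty} \tau^{a-1/2}\exp\Big\{-b\tau - \frac{\tau}{2}(x - \mu)^2\Big\}\, d\tau \\
&= \frac{b^a}{\Gamma(a)}\Big(\frac{1}{2\pi}\Big)^{1/2}\int_0^{\infty} \Big[\frac{z}{b + (x - \mu)^2/2}\Big]^{a-1/2}\exp\{-z\}\,\frac{1}{b + (x - \mu)^2/2}\, dz \\
&= \frac{b^a}{\Gamma(a)}\Big(\frac{1}{2\pi}\Big)^{1/2}\Big[\frac{1}{b + (x - \mu)^2/2}\Big]^{a+1/2}\int_0^{\infty} z^{a-1/2}\exp\{-z\}\, dz \\
&= \frac{b^a}{\Gamma(a)}\Big(\frac{1}{2\pi}\Big)^{1/2}\Big[b + \frac{(x - \mu)^2}{2}\Big]^{-a-1/2}\Gamma(a + 1/2)
\end{aligned}$$

If we substitute $a = \nu/2$ and $b = \nu/2\lambda$, we obtain (2.159).

Problem 2.47 Solution

We focus on the dependency of (2.159) on $x$.

$$\begin{aligned}
\mathrm{St}(x \mid \mu, \lambda, \nu) &\propto \Big[1 + \frac{\lambda(x - \mu)^2}{\nu}\Big]^{-\nu/2 - 1/2} \\
&\propto \exp\Big[\frac{-\nu - 1}{2}\ln\Big(1 + \frac{\lambda(x - \mu)^2}{\nu}\Big)\Big] \\
&\propto \exp\Big[\frac{-\nu - 1}{2}\Big(\frac{\lambda(x - \mu)^2}{\nu} + O(\nu^{-2})\Big)\Big] \\
&\approx \exp\Big[-\frac{\lambda(x - \mu)^2}{2}\Big] \qquad (\nu \to \infty)
\end{aligned}$$

where we have used the Taylor expansion $\ln(1 + \epsilon) = \epsilon + O(\epsilon^2)$. We see that this, up to an overall constant, is a Gaussian distribution with mean $\mu$ and precision $\lambda$.

Problem 2.48 Solution

The same steps as in Prob.2.46 can be used here.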
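$$\begin{aligned}
\mathrm{St}(x \mid \mu, \Lambda, \nu) &= \int_0^{+\infty} N\big(x \mid \mu, (\eta\Lambda)^{-1}\big)\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta \\
&= \int_0^{+\infty} \frac{1}{(2\pi)^{D/2}}|\eta\Lambda|^{1/2}\exp\Big\{-\frac{1}{2}(x - \mu)^T(\eta\Lambda)(x - \mu) - \frac{\nu\eta}{2}\Big\}\frac{1}{\Gamma(\nu/2)}\Big(\frac{\nu}{2}\Big)^{\nu/2}\eta^{\nu/2 - 1}\, d\eta \\
&= \frac{(\nu/2)^{\nu/2}|\Lambda|^{1/2}}{(2\pi)^{D/2}\,\Gamma(\nu/2)}\int_0^{+\infty} \exp\Big\{-\frac{1}{2}(x - \mu)^T(\eta\Lambda)(x - \mu) - \frac{\nu\eta}{2}\Big\}\eta^{D/2 + \nu/2 - 1}\, d\eta
\end{aligned}$$

where we have used the property $|\eta\Lambda| = \eta^D|\Lambda|$. If we denote:

$$\Delta^2 = (x - \mu)^T\Lambda(x - \mu) \quad\text{and}\quad z = \frac{\eta}{2}(\Delta^2 + \nu)$$

the expression above reduces to:

$$\begin{aligned}
\mathrm{St}(x \mid \mu, \Lambda, \nu) &= \frac{(\nu/2)^{\nu/2}|\Lambda|^{1/2}}{(2\pi)^{D/2}\,\Gamma(\nu/2)}\int_0^{+\infty} \exp(-z)\Big(\frac{2z}{\Delta^2 + \nu}\Big)^{D/2 + \nu/2 - 1}\cdot\frac{2}{\Delta^2 + \nu}\, dz \\
&= \frac{(\nu/2)^{\nu/2}|\Lambda|^{1/2}}{(2\pi)^{D/2}\,\Gamma(\nu/2)}\Big(\frac{2}{\Delta^2 + \nu}\Big)^{D/2 + \nu/2}\int_0^{+\infty} \exp(-z)\cdot z^{D/2 + \nu/2 - 1}\, dz \\
&= \frac{(\nu/2)^{\nu/2}|\Lambda|^{1/2}}{(2\pi)^{D/2}\,\Gamma(\nu/2)}\Big(\frac{2}{\Delta^2 + \nu}\Big)^{D/2 + \nu/2}\Gamma(D/2 + \nu/2)
\end{aligned}$$

If we rearrange the expression above, we obtain (2.162), just as required.

Problem 2.49 Solution

Firstly, we notice that $\Delta^2$ equals 0 if and only if $x = \mu$, at which point $\mathrm{St}(x \mid \mu, \Lambda, \nu)$ achieves its maximum. In other words, the mode of $\mathrm{St}(x \mid \mu, \Lambda, \nu)$ is $\mu$. Then we consider its mean $E[x]$.

$$\begin{aligned}
E[x] &= \int_{x \in \mathbb{R}^D} \mathrm{St}(x \mid \mu, \Lambda, \nu)\cdot x\, dx \\
&= \int_{x \in \mathbb{R}^D}\Big[\int_0^{+\infty} N\big(x \mid \mu, (\eta\Lambda)^{-1}\big)\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta\Big] x\, dx \\
&= \int_{x \in \mathbb{R}^D}\int_0^{+\infty} x\, N\big(x \mid \mu, (\eta\Lambda)^{-1}\big)\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta\, dx \\
&= \int_0^{+\infty}\Big[\int_{x \in \mathbb{R}^D} x\, N\big(x \mid \mu, (\eta\Lambda)^{-1}\big)\, dx\Big]\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta \\
&= \int_0^{+\infty} \mu\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta \\
&= \mu\int_0^{+\infty} \mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta = \mu
\end{aligned}$$

where we have used the following property:

$$\int_{x \in \mathbb{R}^D} x\, N\big(x \mid \mu, (\eta\Lambda)^{-1}\big)\, dx = E[x] = \mu$$

Then we calculate $E[xx^T]$. The steps above can also be used here.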
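$$\begin{aligned}
E[xx^T] &= \int_{x \in \mathbb{R}^D} \mathrm{St}(x \mid \mu, \Lambda, \nu)\cdot xx^T\, dx \\
&= \int_{x \in \mathbb{R}^D}\Big[\int_0^{+\infty} N\big(x \mid \mu, (\eta\Lambda)^{-1}\big)\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta\Big] xx^T\, dx \\
&= \int_0^{+\infty}\Big[\int_{x \in \mathbb{R}^D} xx^T N\big(x \mid \mu, (\eta\Lambda)^{-1}\big)\, dx\Big]\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta \\
&= \int_0^{+\infty} \big[\mu\mu^T + (\eta\Lambda)^{-1}\big]\,\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta \\
&= \mu\mu^T + \int_0^{+\infty} (\eta\Lambda)^{-1}\cdot\mathrm{Gam}\Big(\eta \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\Big)\, d\eta \\
&= \mu\mu^T + \int_0^{+\infty} (\eta\Lambda)^{-1}\cdot\frac{1}{\Gamma(\nu/2)}\Big(\frac{\nu}{2}\Big)^{\nu/2}\eta^{\nu/2 - 1}\exp\Big(-\frac{\nu}{2}\eta\Big)\, d\eta \\
&= \mu\mu^T + \Lambda^{-1}\frac{1}{\Gamma(\nu/2)}\Big(\frac{\nu}{2}\Big)^{\nu/2}\int_0^{+\infty} \eta^{\nu/2 - 2}\exp\Big(-\frac{\nu}{2}\eta\Big)\, d\eta
\end{aligned}$$

where the inner integral over $x$ is the second moment of the Gaussian $N(x \mid \mu, (\eta\Lambda)^{-1})$, i.e. $\mu\mu^T + (\eta\Lambda)^{-1}$. If we denote $z = \frac{\nu\eta}{2}$, the equation above reduces to:

$$\begin{aligned}
E[xx^T] &= \mu\mu^T + \Lambda^{-1}\frac{1}{\Gamma(\nu/2)}\Big(\frac{\nu}{2}\Big)^{\nu/2}\int_0^{+\infty} \Big(\frac{2z}{\nu}\Big)^{\nu/2 - 2}\frac{2}{\nu}\exp(-z)\, dz \\
&= \mu\mu^T + \Lambda^{-1}\frac{1}{\Gamma(\nu/2)}\cdot\frac{\nu}{2}\int_0^{+\infty} z^{\nu/2 - 2}\exp(-z)\, dz \\
&= \mu\mu^T + \Lambda^{-1}\frac{\Gamma(\nu/2 - 1)}{\Gamma(\nu/2)}\cdot\frac{\nu}{2} \\
&= \mu\mu^T + \Lambda^{-1}\frac{1}{\nu/2 - 1}\cdot\frac{\nu}{2} \\
&= \mu\mu^T + \frac{\nu}{\nu - 2}\Lambda^{-1}
\end{aligned}$$

where we have used the property $\Gamma(x + 1) = x\Gamma(x)$ and assumed $\nu > 2$ so that the integral converges. Since $\mathrm{cov}[x] = E\big[(x - E[x])(x - E[x])^T\big]$, together with $E[x] = \mu$, we obtain:

$$\mathrm{cov}[x] = \frac{\nu}{\nu - 2}\Lambda^{-1}$$

Problem 2.50 Solution

The same steps as in Prob.2.47 can be used here.

$$\begin{aligned}
\mathrm{St}(x \mid \mu, \Lambda, \nu) &\propto \Big[1 + \frac{\Delta^2}{\nu}\Big]^{-D/2 - \nu/2} \\
&\propto \exp\Big[(-D/2 - \nu/2)\cdot\ln\Big(1 + \frac{\Delta^2}{\nu}\Big)\Big] \\
&\propto \exp\Big[-\frac{D + \nu}{2}\cdot\Big(\frac{\Delta^2}{\nu} + O(\nu^{-2})\Big)\Big] \\
&\approx \exp\Big(-\frac{\Delta^2}{2}\Big) \qquad (\nu \to \infty)
\end{aligned}$$

where we have used the Taylor expansion $\ln(1 + \epsilon) = \epsilon + O(\epsilon^2)$. Since $\Delta^2 = (x - \mu)^T\Lambda(x - \mu)$, we see that this, up to an overall constant, is a Gaussian distribution with mean $\mu$ and precision $\Lambda$.

Problem 2.51 Solution

We first prove (2.177). Since $\exp(iA)\cdot\exp(-iA) = 1$ and $\exp(iA) = \cos A + i\sin A$, we obtain:

$$(\cos A + i\sin A)\cdot(\cos A - i\sin A) = 1$$

which gives $\cos^2 A + \sin^2 A = 1$. Then we prove (2.178) using the hint.

$$\begin{aligned}
\cos(A - B) &= \Re[\exp(i(A - B))] \\
&= \Re[\exp(iA)/\exp(iB)] \\
&= \Re\Big[\frac{\cos A + i\sin A}{\cos B + i\sin B}\Big] \\
&= \Re\Big[\frac{(\cos A + i\sin A)(\cos B - i\sin B)}{(\cos B + i\sin B)(\cos B - i\sin B)}\Big] \\
&= \Re[(\cos A + i\sin A)(\cos B - i\sin B)] \\
&= \cos A\cos B + \sin A\sin B
\end{aligned}$$

It is quite similar for (2.183).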
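$$\begin{aligned}
\sin(A - B) &= \Im[\exp(i(A - B))] \\
&= \Im[(\cos A + i\sin A)(\cos B - i\sin B)] \\
&= \sin A\cos B - \cos A\sin B
\end{aligned}$$

Problem 2.52 Solution

Let's follow the hint. We first derive an approximation for $\exp[m\cos(\theta - \theta_0)]$.

$$\begin{aligned}
\exp\{m\cos(\theta - \theta_0)\} &= \exp\Big\{m\Big[1 - \frac{(\theta - \theta_0)^2}{2} + O((\theta - \theta_0)^4)\Big]\Big\} \\
&= \exp\Big\{m - m\frac{(\theta - \theta_0)^2}{2} + mO((\theta - \theta_0)^4)\Big\} \\
&= \exp(m)\cdot\exp\Big\{-m\frac{(\theta - \theta_0)^2}{2}\Big\}\cdot\exp\big\{mO((\theta - \theta_0)^4)\big\}
\end{aligned}$$

The same holds for $\exp(m\cos\theta)$:

$$\exp\{m\cos\theta\} = \exp(m)\cdot\exp\Big(-m\frac{\theta^2}{2}\Big)\cdot\exp\big\{mO(\theta^4)\big\}$$

Now we rearrange (2.179):

$$\begin{aligned}
p(\theta \mid \theta_0, m) &= \frac{1}{2\pi I_0(m)}\exp\{m\cos(\theta - \theta_0)\} \\
&= \frac{\exp\{m\cos(\theta - \theta_0)\}}{\int_0^{2\pi}\exp\{m\cos\theta\}\, d\theta} \\
&= \frac{\exp(m)\cdot\exp\big\{-m\frac{(\theta - \theta_0)^2}{2}\big\}\cdot\exp\big\{mO((\theta - \theta_0)^4)\big\}}{\int_0^{2\pi}\exp(m)\cdot\exp\big(-m\frac{\theta^2}{2}\big)\cdot\exp\big\{mO(\theta^4)\big\}\, d\theta} \\
&\approx \frac{\exp\big\{-m\frac{(\theta - \theta_0)^2}{2}\big\}}{\int_0^{2\pi}\exp\big(-m\frac{\theta^2}{2}\big)\, d\theta}
\end{aligned}$$

where we have used the fact that, when $m \to \infty$, the distribution concentrates around $\theta_0$ so that the higher-order factors can be neglected:

$$\exp\big\{mO((\theta - \theta_0)^4)\big\} \approx \exp\big\{mO(\theta^4)\big\} \qquad (\text{when } m \to \infty)$$

Therefore, when $m \to \infty$, (2.179) reduces to a Gaussian distribution with mean $\theta_0$ and precision $m$.

Problem 2.53 Solution

Let's rearrange (2.182) according to (2.183).

$$\begin{aligned}
\sum_{n=1}^{N}\sin(\theta_n - \theta_0) &= \sum_{n=1}^{N}(\sin\theta_n\cos\theta_0 - \cos\theta_n\sin\theta_0) \\
&= \cos\theta_0\sum_{n=1}^{N}\sin\theta_n - \sin\theta_0\sum_{n=1}^{N}\cos\theta_n
\end{aligned}$$

where we have used (2.183). Together with (2.182), we obtain:

$$\cos\theta_0\sum_{n=1}^{N}\sin\theta_n - \sin\theta_0\sum_{n=1}^{N}\cos\theta_n = 0$$

which gives:

$$\theta_0^{ML} = \tan^{-1}\Big\{\frac{\sum_n\sin\theta_n}{\sum_n\cos\theta_n}\Big\}$$

Problem 2.54 Solution

We calculate the first and second derivatives of (2.179) with respect to $\theta$.

$$p(\theta \mid \theta_0, m)' = \frac{1}{2\pi I_0(m)}\big[-m\sin(\theta - \theta_0)\big]\exp\{m\cos(\theta - \theta_0)\}$$

$$p(\theta \mid \theta_0, m)'' = \frac{1}{2\pi I_0(m)}\big[-m\cos(\theta - \theta_0) + (-m\sin(\theta - \theta_0))^2\big]\exp\{m\cos(\theta - \theta_0)\}$$

If we set $p(\theta \mid \theta_0, m)'$ equal to 0, we obtain its roots:

$$\theta = \theta_0 + k\pi \qquad (k \in \mathbb{Z})$$

When $k \equiv 0 \pmod 2$, i.e. $\theta \equiv \theta_0 \pmod{2\pi}$, we have:

$$p(\theta \mid \theta_0, m)'' = \frac{-m\exp(m)}{2\pi I_0(m)} < 0$$

Therefore, when $\theta = \theta_0$, (2.179) attains its maximum. And when $k \equiv 1 \pmod 2$, i.e.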
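$\theta \equiv \theta_0 + \pi \pmod{2\pi}$, we have:

$$p(\theta \mid \theta_0, m)'' = \frac{m\exp(-m)}{2\pi I_0(m)} > 0$$

Therefore, when $\theta = \theta_0 + \pi \pmod{2\pi}$, (2.179) attains its minimum.

Problem 2.55 Solution

According to (2.185), we have:

$$A(m_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\cos(\theta_n - \theta_0^{ML})$$

By using (2.178), we can write:

$$\begin{aligned}
A(m_{ML}) &= \frac{1}{N}\sum_{n=1}^{N}\cos(\theta_n - \theta_0^{ML}) \\
&= \frac{1}{N}\sum_{n=1}^{N}\big(\cos\theta_n\cos\theta_0^{ML} + \sin\theta_n\sin\theta_0^{ML}\big) \\
&= \Big(\frac{1}{N}\sum_{n=1}^{N}\cos\theta_n\Big)\cos\theta_0^{ML} + \Big(\frac{1}{N}\sum_{n=1}^{N}\sin\theta_n\Big)\sin\theta_0^{ML}
\end{aligned}$$

By using (2.168), we can further derive:

$$\begin{aligned}
A(m_{ML}) &= \Big(\frac{1}{N}\sum_{n=1}^{N}\cos\theta_n\Big)\cos\theta_0^{ML} + \Big(\frac{1}{N}\sum_{n=1}^{N}\sin\theta_n\Big)\sin\theta_0^{ML} \\
&= \bar{r}\cos\bar{\theta}\cdot\cos\theta_0^{ML} + \bar{r}\sin\bar{\theta}\cdot\sin\theta_0^{ML} \\
&= \bar{r}\cos(\bar{\theta} - \theta_0^{ML})
\end{aligned}$$

Then, by using (2.169) and (2.184), it is obvious that $\bar{\theta} = \theta_0^{ML}$, and hence $A(m_{ML}) = \bar{r}$.

Problem 2.56 Solution

Recall that the distributions belonging to the exponential family have the form:

$$p(x \mid \eta) = h(x)\, g(\eta)\exp(\eta^T u(x))$$

According to (2.13), the beta distribution can be written as:

$$\begin{aligned}
\mathrm{Beta}(x \mid a, b) &= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1 - x)^{b-1} \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\exp\big[(a - 1)\ln x + (b - 1)\ln(1 - x)\big] \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,\frac{\exp\big[a\ln x + b\ln(1 - x)\big]}{x(1 - x)}
\end{aligned}$$

Comparing it with the standard form of the exponential family, we obtain:

$$\begin{cases}
\eta = [a, b]^T \\
u(x) = [\ln x, \ln(1 - x)]^T \\
g(\eta) = \Gamma(\eta_1 + \eta_2)\big/\big[\Gamma(\eta_1)\Gamma(\eta_2)\big] \\
h(x) = 1\big/\big(x(1 - x)\big)
\end{cases}$$

where $\eta_1$ denotes the first element of $\eta$, i.e. $\eta_1 = a$, and $\eta_2$ denotes the second element of $\eta$, i.e. $\eta_2 = b$.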
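The exponential-family form of the beta distribution given above can be checked pointwise. This is a minimal sketch using only the Python standard library; it takes $\eta = (a, b)$ as in the displayed parameterisation, and the function names `beta_pdf` and `beta_expfam` are illustrative, not from the text.

```python
import math

def beta_pdf(x, a, b):
    """Beta density written directly as in (2.13)."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_expfam(x, a, b):
    """Same density as h(x) * g(eta) * exp(eta^T u(x)), with eta = (a, b)."""
    eta = (a, b)
    u = (math.log(x), math.log(1 - x))
    h = 1.0 / (x * (1 - x))
    g = math.gamma(eta[0] + eta[1]) / (math.gamma(eta[0]) * math.gamma(eta[1]))
    return h * g * math.exp(eta[0] * u[0] + eta[1] * u[1])

# Compare the two forms at a few interior points.
for x, a, b in [(0.2, 2.0, 3.0), (0.5, 0.7, 1.3), (0.9, 4.0, 0.5)]:
    print(x, a, b, beta_pdf(x, a, b), beta_expfam(x, a, b))
```

Choosing $\eta = (a - 1, b - 1)$ with $h(x) = 1$ would work equally well; the two parameterisations differ only in where the factor $1/(x(1-x))$ is placed.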
According to (2.146), the Gamma distribution can be written as:

$$\mathrm{Gam}(x \mid a, b) = \frac{1}{\Gamma(a)} b^a x^{a-1}\exp(-bx) = \frac{b^a}{\Gamma(a)}\exp\big[(a - 1)\ln x - bx\big]$$

Comparing it with the standard form of the exponential family, and requiring that $h(x)$ does not depend on the parameters, we obtain:

$$\begin{cases}
\eta = [a - 1, -b]^T \\
u(x) = [\ln x, x]^T \\
g(\eta) = (-\eta_2)^{\eta_1 + 1}\big/\Gamma(\eta_1 + 1) \\
h(x) = 1
\end{cases}$$

According to (2.179), the von Mises distribution can be written as:

$$\begin{aligned}
p(x \mid \theta_0, m) &= \frac{1}{2\pi I_0(m)}\exp(m\cos(x - \theta_0)) \\
&= \frac{1}{2\pi I_0(m)}\exp\big[m(\cos x\cos\theta_0 + \sin x\sin\theta_0)\big]
\end{aligned}$$

Comparing it with the standard form of the exponential family, we obtain:

$$\begin{cases}
\eta = [m\cos\theta_0, m\sin\theta_0]^T \\
u(x) = [\cos x, \sin x]^T \\
g(\eta) = 1\Big/\Big(2\pi I_0\big(\sqrt{\eta_1^2 + \eta_2^2}\big)\Big) \\
h(x) = 1
\end{cases}$$

Note: a given distribution can be written in exponential family form in several ways, with different natural parameters.

Problem 2.57 Solution

Recall that the distributions belonging to the exponential family have the form:

$$p(x \mid \eta) = h(x)\, g(\eta)\exp(\eta^T u(x))$$

And the multivariate Gaussian distribution has the form:

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\Big\{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\Big\}$$

We expand the exponent with respect to $\mu$.

$$\begin{aligned}
N(x \mid \mu, \Sigma) &= \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\Big\{-\frac{1}{2}\big(x^T\Sigma^{-1}x - 2\mu^T\Sigma^{-1}x + \mu^T\Sigma^{-1}\mu\big)\Big\} \\
&= \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\Big\{-\frac{1}{2}x^T\Sigma^{-1}x + \mu^T\Sigma^{-1}x\Big\}\exp\Big\{-\frac{1}{2}\mu^T\Sigma^{-1}\mu\Big\}
\end{aligned}$$

Comparing it with the standard form of the exponential family, we obtain:

$$\begin{cases}
\eta = \big[\Sigma^{-1}\mu,\; -\frac{1}{2}\mathrm{vec}(\Sigma^{-1})\big]^T \\
u(x) = \big[x,\; \mathrm{vec}(xx^T)\big] \\
g(\eta) = \exp\big(\frac{1}{4}\eta_1^T\eta_2^{-1}\eta_1\big)\cdot|-2\eta_2|^{1/2} \\
h(x) = (2\pi)^{-D/2}
\end{cases}$$

where $\eta_1$ denotes the first block of $\eta$ and $\eta_2$ the second block of $\eta$ (in $g(\eta)$, $\eta_2$ is understood as the matrix $-\frac{1}{2}\Sigma^{-1}$). We have also used the vectorizing operator $\mathrm{vec}(\cdot)$. The vectorization of a matrix is a linear transformation which converts the matrix into a column vector.
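This can be seen in an example:

$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \quad\Rightarrow\quad \mathrm{vec}(\mathbf{A}) = [a, c, b, d]^T$$

Note: by introducing the vectorizing operator, we actually have $\mathrm{vec}(\boldsymbol{\Sigma}^{-1})^T\mathrm{vec}(\mathbf{x}\mathbf{x}^T) = \mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}$.

Problem 2.58 Solution

Based on (2.226), we have already obtained:

$$-\nabla\ln g(\boldsymbol{\eta}) = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})\,d\mathbf{x}$$

Then we differentiate both sides of the equation above with respect to $\boldsymbol{\eta}$, using the product rule:

$$-\nabla\nabla\ln g(\boldsymbol{\eta}) = \nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})^T d\mathbf{x} + g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^T d\mathbf{x}$$

One thing needs attention here: mind the transposes, since $-\nabla\nabla\ln g(\boldsymbol{\eta})$ must be a matrix. Using the relationship $\nabla\ln g(\boldsymbol{\eta}) = \nabla g(\boldsymbol{\eta})/g(\boldsymbol{\eta})$, the first term on the right-hand side can be simplified as:

$$\text{(first term)} = \nabla\ln g(\boldsymbol{\eta})\cdot g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})^T d\mathbf{x} = \nabla\ln g(\boldsymbol{\eta})\cdot\mathbb{E}[\mathbf{u}(\mathbf{x})^T] = -\mathbb{E}[\mathbf{u}(\mathbf{x})]\cdot\mathbb{E}[\mathbf{u}(\mathbf{x})^T]$$

By definition, the second term on the right-hand side is $\mathbb{E}[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^T]$. Therefore, combining these two terms, we obtain:

$$-\nabla\nabla\ln g(\boldsymbol{\eta}) = -\mathbb{E}[\mathbf{u}(\mathbf{x})]\cdot\mathbb{E}[\mathbf{u}(\mathbf{x})^T] + \mathbb{E}[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^T] = \mathrm{cov}[\mathbf{u}(\mathbf{x})]$$

Problem 2.59 Solution

It is straightforward:

$$\int p(x|\sigma)\,dx = \int\frac{1}{\sigma}f\left(\frac{x}{\sigma}\right)dx = \int\frac{1}{\sigma}f(u)\,\sigma\,du = \int f(u)\,du = 1$$

where we have denoted $u = x/\sigma$.

Problem 2.60 Solution

Firstly, we write down the log likelihood function:

$$\sum_{n=1}^{N}\ln p(x_n) = \sum_{i=1}^{M}n_i\ln(h_i)$$

Some details should be explained here. If $x_n$ falls into region $\Delta_i$, then $p(x_n)$ equals $h_i$. Since we are given that, among all $N$ observations, $n_i$ samples fall into region $\Delta_i$, we can write down the log likelihood as in the equation above, where $M$ denotes the number of regions. An implicit equation therefore holds:

$$\sum_{i=1}^{M}n_i = N$$

We now need to take account of the constraint that $p(x)$ must integrate to unity, which can be written as $\sum_{j=1}^{M}h_j\Delta_j = 1$.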
We introduce a Lagrange multiplier, and then we need to maximize:

$$\sum_{i=1}^{M}n_i\ln(h_i) + \lambda\left(\sum_{j=1}^{M}h_j\Delta_j - 1\right)$$

We calculate its derivative with respect to $h_i$ and set it equal to 0:

$$\frac{n_i}{h_i} + \lambda\Delta_i = 0$$

Multiplying both sides by $h_i$, summing over $i$ and then using the constraint, we can obtain:

$$N + \lambda = 0$$

In other words, $\lambda = -N$. Substituting this back into the stationarity condition gives:

$$h_i = \frac{n_i}{N}\frac{1}{\Delta_i}$$

Problem 2.61 Solution

It is straightforward. In K nearest neighbours (KNN), when we want to estimate the probability density at a point $x_i$, we consider a small sphere centered on $x_i$ and allow its radius to grow until it contains $K$ data points; then $p(x_i)$ equals $K/(NV_i)$, where $N$ is the total number of observations and $V_i$ is the volume of the sphere centered on $x_i$. We can assume that $V_i$ is small enough that $p(x)$ is roughly constant inside it. In this way, we can write down the integral:

$$\int p(x)\,dx \approx \sum_{i=1}^{N}p(x_i)\cdot V_i = \sum_{i=1}^{N}\frac{K}{NV_i}\cdot V_i = K \neq 1$$

We also see that if we use "1NN" ($K = 1$), the probability density will be well normalized. Note that the approximation above holds only if the volumes of all the spheres are small enough and $N$ is large enough. Fortunately, these two conditions can be satisfied in KNN.

0.3 Probability Distribution

Problem 3.1 Solution

Based on (3.6), we can write:

$$2\sigma(2a) - 1 = \frac{2}{1+\exp(-2a)} - 1 = \frac{1-\exp(-2a)}{1+\exp(-2a)} = \frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)}$$

which is exactly $\tanh(a)$. Then we find the relation between the parameters $w_j$ in (3.101) and $u_j$ in (3.102). Let's start from (3.101), using $\sigma(a) = \tfrac{1}{2}\left[\tanh(a/2) + 1\right]$:

$$y(x,\mathbf{w}) = w_0 + \sum_{j=1}^{M}w_j\,\sigma\left(\frac{x-\mu_j}{s}\right) = w_0 + \sum_{j=1}^{M}\frac{w_j}{2}\left[\tanh\left(\frac{x-\mu_j}{2s}\right) + 1\right] = w_0 + \frac{1}{2}\sum_{j=1}^{M}w_j + \sum_{j=1}^{M}\frac{w_j}{2}\tanh\left(\frac{x-\mu_j}{2s}\right)$$

Hence the relation is given by:

$$u_0 = w_0 + \frac{1}{2}\sum_{j=1}^{M}w_j \qquad\text{and}\qquad u_j = \frac{w_j}{2}$$

Note: there is a typo in (3.102); the denominator should be $2s$ instead of $s$. Alternatively, you can view it as a new $s'$, which equals $2s$.
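The identity $2\sigma(2a) - 1 = \tanh(a)$ and the resulting reparametrisation can be checked numerically. This is a minimal sketch under stated assumptions: the function names and the demo values `w_demo`, `mu_demo` are hypothetical, and the tanh model uses the denominator $2s$ discussed in the note above.

```python
import math

def sigmoid(a):
    # Logistic sigmoid, as in (3.6).
    return 1.0 / (1.0 + math.exp(-a))

def y_sigmoid(x, w0, w, mu, s):
    # Linear combination of sigmoidal basis functions, as in (3.101).
    return w0 + sum(wj * sigmoid((x - mj) / s) for wj, mj in zip(w, mu))

def y_tanh(x, w0, w, mu, s):
    # Equivalent tanh model with u0 = w0 + sum(w)/2 and uj = wj/2,
    # using the scale 2s in the tanh argument.
    u0 = w0 + 0.5 * sum(w)
    return u0 + sum(0.5 * wj * math.tanh((x - mj) / (2.0 * s)) for wj, mj in zip(w, mu))

# Illustrative values only.
w_demo = [0.4, -1.2, 2.0]
mu_demo = [-1.0, 0.0, 1.5]
print(2 * sigmoid(2 * 0.7) - 1, math.tanh(0.7))
print(y_sigmoid(0.3, 0.5, w_demo, mu_demo, 0.8), y_tanh(0.3, 0.5, w_demo, mu_demo, 0.8))
```

Each printed pair should match up to rounding, since the two parametrisations describe the same function.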
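Problem 3.2 Solution

We first need to show that $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is invertible. Suppose, for the sake of contradiction, that $\mathbf{c}$ is a nonzero vector in the kernel (null space) of $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$. Then $\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{c} = \mathbf{0}$ and so we have:

$$0 = \mathbf{c}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{c} = (\boldsymbol{\Phi}\mathbf{c})^T\boldsymbol{\Phi}\mathbf{c} = ||\boldsymbol{\Phi}\mathbf{c}||^2$$

The equation above shows that $\boldsymbol{\Phi}\mathbf{c} = \mathbf{0}$. However, $\boldsymbol{\Phi}\mathbf{c} = c_1\boldsymbol{\phi}_1 + c_2\boldsymbol{\phi}_2 + ... + c_M\boldsymbol{\phi}_M$, where $\boldsymbol{\phi}_j$ is the $j$-th column of $\boldsymbol{\Phi}$. Since $\{\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, ..., \boldsymbol{\phi}_M\}$ is a basis for the column space of $\boldsymbol{\Phi}$, the $\boldsymbol{\phi}_j$ are linearly independent and we cannot have $c_1\boldsymbol{\phi}_1 + c_2\boldsymbol{\phi}_2 + ... + c_M\boldsymbol{\phi}_M = \mathbf{0}$ with $\mathbf{c} \neq \mathbf{0}$. This is the contradiction. Hence $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is invertible.

Then let's first prove two specific cases.

Case 1: $\mathbf{w}_1$ is in the column space of $\boldsymbol{\Phi}$. In this case, we have $\boldsymbol{\Phi}\mathbf{c} = \mathbf{w}_1$ for some $\mathbf{c}$. So we have:

$$\boldsymbol{\Phi}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{w}_1 = \boldsymbol{\Phi}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{c} = \boldsymbol{\Phi}\mathbf{c} = \mathbf{w}_1$$

Case 2: $\mathbf{w}_2$ is in $\boldsymbol{\Phi}^{\perp}$, where $\boldsymbol{\Phi}^{\perp}$ denotes the orthogonal complement of the column space of $\boldsymbol{\Phi}$. Then we have $\boldsymbol{\Phi}^T\mathbf{w}_2 = \mathbf{0}$, which leads to:

$$\boldsymbol{\Phi}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{w}_2 = \mathbf{0}$$

Recall that any vector $\mathbf{w} \in \mathbb{R}^N$ can be written as the sum of two vectors $\mathbf{w}_1$ and $\mathbf{w}_2$, where $\mathbf{w}_1$ lies in the column space of $\boldsymbol{\Phi}$ and $\mathbf{w}_2 \in \boldsymbol{\Phi}^{\perp}$. And so we have:

$$\boldsymbol{\Phi}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{w} = \boldsymbol{\Phi}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\mathbf{w}_1 + \mathbf{w}_2) = \mathbf{w}_1$$

which is exactly what an orthogonal projection is supposed to do.

Problem 3.3 Solution

Let's calculate the derivative of (3.104) with respect to $\mathbf{w}$:

$$\nabla E_D(\mathbf{w}) = -\sum_{n=1}^{N}r_n\left\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}\boldsymbol{\phi}(\mathbf{x}_n)^T$$

We set the derivative equal to 0:

$$0 = \sum_{n=1}^{N}r_n t_n\boldsymbol{\phi}(\mathbf{x}_n)^T - \mathbf{w}^T\left(\sum_{n=1}^{N}r_n\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\right)$$

If we denote $\sqrt{r_n}\,\boldsymbol{\phi}(\mathbf{x}_n) = \boldsymbol{\phi}'(\mathbf{x}_n)$ and $\sqrt{r_n}\,t_n = t'_n$, we can obtain:

$$0 = \sum_{n=1}^{N}t'_n\boldsymbol{\phi}'(\mathbf{x}_n)^T - \mathbf{w}^T\left(\sum_{n=1}^{N}\boldsymbol{\phi}'(\mathbf{x}_n)\boldsymbol{\phi}'(\mathbf{x}_n)^T\right)$$

Taking advantage of (3.11) – (3.17), we can derive a similar result, i.e. $\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$. But here, we define $\mathbf{t}$ as:

$$\mathbf{t} = \left[\sqrt{r_1}\,t_1, \sqrt{r_2}\,t_2, ..., \sqrt{r_N}\,t_N\right]^T$$

We also define $\boldsymbol{\Phi}$ as an $N \times M$ matrix, with element $\Phi(i,j) = \sqrt{r_i}\,\phi_j(\mathbf{x}_i)$. The interpretation is twofold: (1) Examining (3.10)-(3.12), we see that if we replace the precision $\beta$ by $r_n\beta$ inside the summation (i.e. a data-dependent noise variance $\beta^{-1}/r_n$), (3.12) becomes the expression in Exercise 3.3.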
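(2) $r_n$ can also be viewed as the effective number of observations of $(\mathbf{x}_n, t_n)$. In other words, you can treat $(\mathbf{x}_n, t_n)$ as occurring $r_n$ times.

Problem 3.4 Solution

Firstly, we rearrange $E_D(\mathbf{w})$, writing $x_{ni}$ and $\epsilon_{ni}$ for the $i$-th input component of sample $n$ and the noise added to it:

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{w_0 + \sum_{i=1}^{D}w_i(x_{ni} + \epsilon_{ni}) - t_n\right\}^2 = \frac{1}{2}\sum_{n=1}^{N}\left\{\left(w_0 + \sum_{i=1}^{D}w_i x_{ni} - t_n\right) + \sum_{i=1}^{D}w_i\epsilon_{ni}\right\}^2$$

$$= \frac{1}{2}\sum_{n=1}^{N}\left\{y(\mathbf{x}_n,\mathbf{w}) - t_n + \sum_{i=1}^{D}w_i\epsilon_{ni}\right\}^2 = \frac{1}{2}\sum_{n=1}^{N}\left\{\left(y(\mathbf{x}_n,\mathbf{w}) - t_n\right)^2 + \left(\sum_{i=1}^{D}w_i\epsilon_{ni}\right)^2 + 2\left(\sum_{i=1}^{D}w_i\epsilon_{ni}\right)\left(y(\mathbf{x}_n,\mathbf{w}) - t_n\right)\right\}$$

where $y(\mathbf{x}_n,\mathbf{w})$ denotes the output of the linear model when the input is $\mathbf{x}_n$, without noise added. For the second term in the equation above, we can obtain:

$$\mathbb{E}_{\epsilon}\left[\left(\sum_{i=1}^{D}w_i\epsilon_{ni}\right)^2\right] = \mathbb{E}_{\epsilon}\left[\sum_{i=1}^{D}\sum_{j=1}^{D}w_i w_j\epsilon_{ni}\epsilon_{nj}\right] = \sum_{i=1}^{D}\sum_{j=1}^{D}w_i w_j\,\mathbb{E}_{\epsilon}[\epsilon_{ni}\epsilon_{nj}] = \sigma^2\sum_{i=1}^{D}\sum_{j=1}^{D}w_i w_j\delta_{ij}$$

which gives

$$\mathbb{E}_{\epsilon}\left[\left(\sum_{i=1}^{D}w_i\epsilon_{ni}\right)^2\right] = \sigma^2\sum_{i=1}^{D}w_i^2$$

For the third term, we can obtain:

$$\mathbb{E}_{\epsilon}\left[2\left(\sum_{i=1}^{D}w_i\epsilon_{ni}\right)\left(y(\mathbf{x}_n,\mathbf{w}) - t_n\right)\right] = 2\left(y(\mathbf{x}_n,\mathbf{w}) - t_n\right)\sum_{i=1}^{D}\mathbb{E}_{\epsilon}[w_i\epsilon_{ni}] = 0$$

Therefore, taking the expectation of $E_D(\mathbf{w})$ with respect to $\epsilon$ and summing the second term over the $N$ samples, we obtain:

$$\mathbb{E}_{\epsilon}[E_D(\mathbf{w})] = \frac{1}{2}\sum_{n=1}^{N}\left(y(\mathbf{x}_n,\mathbf{w}) - t_n\right)^2 + \frac{N\sigma^2}{2}\sum_{i=1}^{D}w_i^2$$

i.e. the noise-free sum-of-squares error plus a weight-decay regularizer that does not involve the bias $w_0$.

Problem 3.5 Solution

We can first rewrite the constraint (3.30) as:

$$\frac{1}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right) \leq 0$$

where we deliberately introduce the scaling factor $1/2$ for convenience. Then it is straightforward to obtain the Lagrange function:

$$L(\mathbf{w},\lambda) = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\left(\sum_{j=1}^{M}|w_j|^q - \eta\right)$$

It is obvious that $L(\mathbf{w},\lambda)$ and (3.29) have the same dependence on $\mathbf{w}$. Meanwhile, if we denote the optimal $\mathbf{w}$ that minimizes $L(\mathbf{w},\lambda)$ as $\mathbf{w}^{\star}(\lambda)$, we can see that

$$\eta = \sum_{j=1}^{M}|w_j^{\star}(\lambda)|^q$$

Problem 3.6 Solution

Firstly, we write down the log likelihood function:

$$\ln p(\mathbf{T}|\mathbf{X},\mathbf{W},\boldsymbol{\Sigma}) = -\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N}\left[\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right]^T\boldsymbol{\Sigma}^{-1}\left[\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right]$$

where we have omitted the constant term.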
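We set the derivative of the equation above with respect to $\mathbf{W}$ equal to zero:

$$0 = -\sum_{n=1}^{N}\boldsymbol{\Sigma}^{-1}\left[\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right]\boldsymbol{\phi}(\mathbf{x}_n)^T$$

Therefore, we obtain a result for $\mathbf{W}$ similar to (3.15). For $\boldsymbol{\Sigma}$, comparing with (2.118) – (2.124), we can write down a similar result:

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}\left[\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right]\left[\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right]^T$$

We can see that the solutions for $\mathbf{W}$ and $\boldsymbol{\Sigma}$ are decoupled.

Problem 3.7 Solution

Let's begin by writing down the prior distribution $p(\mathbf{w})$ and the likelihood function $p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)$:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0),\qquad p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1})$$

The posterior PDF equals the product of the prior PDF and the likelihood function, up to a normalizing constant, so we focus on the exponent of the product:

$$\text{exponent} = -\frac{\beta}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 - \frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)$$

$$= -\frac{\beta}{2}\sum_{n=1}^{N}\left\{t_n^2 - 2t_n\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w}\right\} - \frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)$$

$$= -\frac{1}{2}\mathbf{w}^T\left[\sum_{n=1}^{N}\beta\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T + \mathbf{S}_0^{-1}\right]\mathbf{w} - \frac{1}{2}\left[-2\mathbf{m}_0^T\mathbf{S}_0^{-1} - \sum_{n=1}^{N}2\beta t_n\boldsymbol{\phi}(\mathbf{x}_n)^T\right]\mathbf{w} + \text{const}$$

Hence, by comparing the quadratic term with the standard Gaussian distribution, we can obtain $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$. Then, comparing the linear term, we can obtain:

$$-2\mathbf{m}_N^T\mathbf{S}_N^{-1} = -2\mathbf{m}_0^T\mathbf{S}_0^{-1} - \sum_{n=1}^{N}2\beta t_n\boldsymbol{\phi}(\mathbf{x}_n)^T$$

If we multiply both sides by $-0.5$ and then transpose both sides, we see that $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t})$.

Problem 3.8 Solution

Firstly, we write down the prior:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$$

where $\mathbf{m}_N$, $\mathbf{S}_N$ are given by (3.50) and (3.51). If we now observe another sample $(\mathbf{x}_{N+1}, t_{N+1})$, we can write down the likelihood function:

$$p(t_{N+1}|\mathbf{x}_{N+1},\mathbf{w}) = \mathcal{N}(t_{N+1}|y(\mathbf{x}_{N+1},\mathbf{w}),\beta^{-1})$$

Since the posterior equals the product of the likelihood function and the prior, up to a constant, we focus on the exponent, multiplied by $-2$ for convenience.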
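$$-2\times\text{exponent} = (\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N) + \beta\left(t_{N+1} - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_{N+1})\right)^2$$

$$= \mathbf{w}^T\left[\mathbf{S}_N^{-1} + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\right]\mathbf{w} - 2\mathbf{w}^T\left[\mathbf{S}_N^{-1}\mathbf{m}_N + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})t_{N+1}\right] + \text{const}$$

Therefore, after observing $(\mathbf{x}_{N+1}, t_{N+1})$, we have $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_{N+1},\mathbf{S}_{N+1})$, where we have defined:

$$\mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T$$

and

$$\mathbf{m}_{N+1} = \mathbf{S}_{N+1}\left(\mathbf{S}_N^{-1}\mathbf{m}_N + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})t_{N+1}\right)$$

Problem 3.9 Solution

We know that the prior $p(\mathbf{w})$ can be written as:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$$

And the likelihood function $p(t_{N+1}|\mathbf{x}_{N+1},\mathbf{w})$ can be written as:

$$p(t_{N+1}|\mathbf{x}_{N+1},\mathbf{w}) = \mathcal{N}(t_{N+1}|y(\mathbf{x}_{N+1},\mathbf{w}),\beta^{-1})$$

Since $y(\mathbf{x}_{N+1},\mathbf{w}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_{N+1}) = \boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{w}$, the likelihood can be further written as:

$$p(t_{N+1}|\mathbf{x}_{N+1},\mathbf{w}) = \mathcal{N}(t_{N+1}|\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{w},\beta^{-1})$$

Then we take advantage of (2.113), (2.114) and (2.116), which gives:

$$p(\mathbf{w}|\mathbf{x}_{N+1},t_{N+1}) = \mathcal{N}\left(\mathbf{w}\,\middle|\,\boldsymbol{\Sigma}\left\{\boldsymbol{\phi}(\mathbf{x}_{N+1})\beta t_{N+1} + \mathbf{S}_N^{-1}\mathbf{m}_N\right\},\ \boldsymbol{\Sigma}\right)$$

where $\boldsymbol{\Sigma} = \left(\mathbf{S}_N^{-1} + \boldsymbol{\phi}(\mathbf{x}_{N+1})\beta\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\right)^{-1}$, and we can see that the result is exactly the same as the one obtained in the previous problem.

Problem 3.10 Solution

We already know:

$$p(t|\mathbf{w},\beta) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}),\beta^{-1})$$

and

$$p(\mathbf{w}|\mathbf{t},\alpha,\beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$$

where $\mathbf{m}_N$, $\mathbf{S}_N$ are given by (3.53) and (3.54). As in the previous problem, we can rewrite $p(t|\mathbf{w},\beta)$ as:

$$p(t|\mathbf{w},\beta) = \mathcal{N}(t|\boldsymbol{\phi}(\mathbf{x})^T\mathbf{w},\beta^{-1})$$

Then, taking advantage of (2.113), (2.114) and (2.115), we can obtain:

$$p(t|\mathbf{t},\alpha,\beta) = \mathcal{N}\left(t\,\middle|\,\boldsymbol{\phi}(\mathbf{x})^T\mathbf{m}_N,\ \beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})\right)$$

which is exactly the same as (3.58), if we notice that $\boldsymbol{\phi}(\mathbf{x})^T\mathbf{m}_N = \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x})$.

Problem 3.11 Solution

We need to use the result obtained in Prob.3.8. In Prob.3.8, we have derived a formula for $\mathbf{S}_{N+1}^{-1}$:

$$\mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T$$

Then, using (3.110) with $\mathbf{M} = \mathbf{S}_N^{-1}$ and $\mathbf{v} = \sqrt{\beta}\,\boldsymbol{\phi}(\mathbf{x}_{N+1})$, we can obtain:

$$\mathbf{S}_{N+1} = \left[\mathbf{S}_N^{-1} + \sqrt{\beta}\boldsymbol{\phi}(\mathbf{x}_{N+1})\sqrt{\beta}\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\right]^{-1} = \mathbf{S}_N - \frac{\mathbf{S}_N\left(\sqrt{\beta}\boldsymbol{\phi}(\mathbf{x}_{N+1})\right)\left(\sqrt{\beta}\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\right)\mathbf{S}_N}{1 + \left(\sqrt{\beta}\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\right)\mathbf{S}_N\left(\sqrt{\beta}\boldsymbol{\phi}(\mathbf{x}_{N+1})\right)}$$

$$= \mathbf{S}_N - \frac{\beta\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N}{1 + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})}$$

Now we calculate $\sigma_N^2(\mathbf{x}) - \sigma_{N+1}^2(\mathbf{x})$ according to (3.59):

$$\sigma_N^2(\mathbf{x}) - \sigma_{N+1}^2(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^T(\mathbf{S}_N - \mathbf{S}_{N+1})\boldsymbol{\phi}(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^T\frac{\beta\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N}{1 + \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})}\boldsymbol{\phi}(\mathbf{x})$$

$$= \frac{\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})}{1/\beta + \boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})} = \frac{\left[\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})\right]^2}{1/\beta + \boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_{N+1})}\qquad(*)$$

Since $\mathbf{S}_N$ is positive definite, the denominator of $(*)$ is positive, and the numerator is a square, so $(*)$ is non-negative. Therefore, we have proved that $\sigma_N^2(\mathbf{x}) - \sigma_{N+1}^2(\mathbf{x}) \geq 0$.

Problem 3.12 Solution

Let's begin by writing down the prior PDF $p(\mathbf{w},\beta)$:

$$p(\mathbf{w},\beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta|a_0,b_0) \propto \frac{\beta^{M/2}}{|\mathbf{S}_0|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\beta\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right)b_0^{a_0}\beta^{a_0-1}\exp(-b_0\beta)\qquad(*)$$

And then we write down the likelihood function $p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)$:

$$p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1}) \propto \prod_{n=1}^{N}\beta^{1/2}\exp\left[-\frac{\beta}{2}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2\right]\qquad(**)$$

According to Bayesian inference, we have $p(\mathbf{w},\beta|\mathbf{t}) \propto p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)\times p(\mathbf{w},\beta)$. We first focus on the quadratic term with regard to $\mathbf{w}$ in the exponent:

$$\text{quadratic term} = -\frac{\beta}{2}\mathbf{w}^T\mathbf{S}_0^{-1}\mathbf{w} + \sum_{n=1}^{N}-\frac{\beta}{2}\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w} = -\frac{\beta}{2}\mathbf{w}^T\left[\mathbf{S}_0^{-1} + \sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\right]\mathbf{w}$$

where the first term is generated by $(*)$, and the second by $(**)$. By now, we know that:

$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T$$

We then focus on the linear term with regard to $\mathbf{w}$ in the exponent:

$$\text{linear term} = \beta\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{w} + \sum_{n=1}^{N}\beta t_n\boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w} = \beta\left[\mathbf{m}_0^T\mathbf{S}_0^{-1} + \sum_{n=1}^{N}t_n\boldsymbol{\phi}(\mathbf{x}_n)^T\right]\mathbf{w}$$

Again, the first term is generated by $(*)$, and the second by $(**)$.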
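We can also obtain:

$$\mathbf{m}_N^T\mathbf{S}_N^{-1} = \mathbf{m}_0^T\mathbf{S}_0^{-1} + \sum_{n=1}^{N}t_n\boldsymbol{\phi}(\mathbf{x}_n)^T$$

which gives:

$$\mathbf{m}_N = \mathbf{S}_N\left[\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}t_n\boldsymbol{\phi}(\mathbf{x}_n)\right]$$

Then we focus on the term in the exponent that is constant with regard to $\mathbf{w}$:

$$\text{constant term} = \left(-\frac{\beta}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - b_0\beta\right) - \frac{\beta}{2}\sum_{n=1}^{N}t_n^2 = -\beta\left[\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 + b_0 + \frac{1}{2}\sum_{n=1}^{N}t_n^2\right]$$

Therefore, we can obtain:

$$\frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + b_N = \frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 + b_0 + \frac{1}{2}\sum_{n=1}^{N}t_n^2$$

which gives:

$$b_N = \frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 + b_0 + \frac{1}{2}\sum_{n=1}^{N}t_n^2 - \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N$$

Finally, we collect the powers of $\beta$. The Gaussian prior contributes $\beta^{M/2}$, the Gamma prior $\beta^{a_0-1}$ and the likelihood $\beta^{N/2}$, while the posterior $\mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)\,\mathrm{Gam}(\beta|a_N,b_N)$ contains $\beta^{M/2+a_N-1}$. Matching the exponents gives:

$$\frac{M}{2} + a_N - 1 = \frac{M}{2} + a_0 - 1 + \frac{N}{2}$$

Hence,

$$a_N = a_0 + \frac{N}{2}$$

Problem 3.13 Solution

Similar to (3.57), we write down the expression for the predictive distribution $p(t|\mathbf{X},\mathbf{t})$:

$$p(t|\mathbf{X},\mathbf{t}) = \int\!\!\int p(t|\mathbf{w},\beta)\,p(\mathbf{w},\beta|\mathbf{X},\mathbf{t})\,d\mathbf{w}\,d\beta\qquad(*)$$

We know that:

$$p(t|\mathbf{w},\beta) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}),\beta^{-1}) = \mathcal{N}(t|\boldsymbol{\phi}(\mathbf{x})^T\mathbf{w},\beta^{-1})$$

and that:

$$p(\mathbf{w},\beta|\mathbf{X},\mathbf{t}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)\,\mathrm{Gam}(\beta|a_N,b_N)$$

We go back to $(*)$, and first deal with the integral with regard to $\mathbf{w}$:

$$p(t|\mathbf{X},\mathbf{t}) = \int\left[\int\mathcal{N}(t|\boldsymbol{\phi}(\mathbf{x})^T\mathbf{w},\beta^{-1})\,\mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)\,d\mathbf{w}\right]\mathrm{Gam}(\beta|a_N,b_N)\,d\beta$$

$$= \int\mathcal{N}\left(t\,\middle|\,\boldsymbol{\phi}(\mathbf{x})^T\mathbf{m}_N,\ \beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^T\beta^{-1}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})\right)\mathrm{Gam}(\beta|a_N,b_N)\,d\beta$$

$$= \int\mathcal{N}\left[t\,\middle|\,\boldsymbol{\phi}(\mathbf{x})^T\mathbf{m}_N,\ \beta^{-1}\left(1 + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})\right)\right]\mathrm{Gam}(\beta|a_N,b_N)\,d\beta$$

where we have used (2.113), (2.114) and (2.115). Then, following (2.158)-(2.160), we can see that $p(t|\mathbf{X},\mathbf{t}) = \mathrm{St}(t|\mu,\lambda,\nu)$, where we have defined:

$$\mu = \boldsymbol{\phi}(\mathbf{x})^T\mathbf{m}_N,\qquad \lambda = \frac{a_N}{b_N}\cdot\left[1 + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})\right]^{-1},\qquad \nu = 2a_N$$

Problem 3.14 Solution (waiting for update)

Firstly, according to (3.16), if we use the new orthonormal basis set specified in the problem to construct $\boldsymbol{\Phi}$, we obtain an important property: $\boldsymbol{\Phi}^T\boldsymbol{\Phi} = \mathbf{I}$. Hence, if $\alpha = 0$, together with (3.54), we know that $\mathbf{S}_N = \beta^{-1}\mathbf{I}$.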
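Finally, according to (3.62), we can obtain:

$$k(\mathbf{x},\mathbf{x}') = \beta\boldsymbol{\psi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\psi}(\mathbf{x}') = \boldsymbol{\psi}(\mathbf{x})^T\boldsymbol{\psi}(\mathbf{x}')$$

Problem 3.15 Solution

This follows directly if we substitute (3.92) and (3.95) into (3.82), which gives:

$$E(\mathbf{m}_N) = \frac{\beta}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N||^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N = \frac{N-\gamma}{2} + \frac{\gamma}{2} = \frac{N}{2}$$

Problem 3.16 Solution (waiting for update)

We know that

$$p(\mathbf{t}|\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}(t_n|\boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w},\beta^{-1}) = \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})$$

and

$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I})$$

Comparing them with (2.113), (2.114) and (2.115), we can obtain:

$$p(\mathbf{t}|\alpha,\beta) = \mathcal{N}(\mathbf{t}|\mathbf{0},\beta^{-1}\mathbf{I} + \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T)$$

Problem 3.17 Solution

We know that:

$$p(\mathbf{t}|\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}(t_n|\boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w},\beta^{-1}) = \prod_{n=1}^{N}\frac{1}{(2\pi\beta^{-1})^{1/2}}\exp\left\{-\frac{1}{2\beta^{-1}}\left(t_n - \boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w}\right)^2\right\}$$

$$= \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left\{\sum_{n=1}^{N}-\frac{\beta}{2}\left(t_n - \boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w}\right)^2\right\} = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left\{-\frac{\beta}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}||^2\right\}$$

And that:

$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I}) = \frac{\alpha^{M/2}}{(2\pi)^{M/2}}\exp\left\{-\frac{\alpha}{2}||\mathbf{w}||^2\right\}$$

If we substitute the expressions above into (3.77), we obtain (3.78), just as required.

Problem 3.18 Solution

We expand (3.79) as follows:

$$E(\mathbf{w}) = \frac{\beta}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}||^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} = \frac{\beta}{2}\left(\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}\right) + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$

$$= \frac{1}{2}\left[\mathbf{w}^T(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I})\mathbf{w} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right]$$

Observing the equation above, we see that $E(\mathbf{w})$ contains the following term:

$$\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\qquad(*)$$

Now, we need to solve for $\mathbf{A}$ and $\mathbf{m}_N$.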
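We expand $(*)$ and obtain:

$$(*) = \frac{1}{2}\left(\mathbf{w}^T\mathbf{A}\mathbf{w} - 2\mathbf{m}_N^T\mathbf{A}\mathbf{w} + \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right)$$

We first compare the quadratic term, which gives:

$$\mathbf{A} = \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I}$$

And then we compare the linear term, which gives:

$$\mathbf{m}_N^T\mathbf{A} = \beta\mathbf{t}^T\boldsymbol{\Phi}$$

Noticing that $\mathbf{A} = \mathbf{A}^T$, which implies that $\mathbf{A}^{-1}$ is also symmetric, we transpose and then multiply both sides by $\mathbf{A}^{-1}$, which gives:

$$\mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

Now we rewrite $E(\mathbf{w})$:

$$E(\mathbf{w}) = \frac{1}{2}\left[\mathbf{w}^T(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I})\mathbf{w} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right]$$

$$= \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N) + \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right)$$

$$= \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N) + \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\mathbf{m}_N^T\mathbf{A}\mathbf{m}_N + \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right)$$

$$= \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N) + \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\mathbf{m}_N^T\mathbf{A}\mathbf{m}_N + \mathbf{m}_N^T(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I})\mathbf{m}_N\right)$$

$$= \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N) + \frac{1}{2}\left[\beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{m}_N + \mathbf{m}_N^T(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{m}_N\right] + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$$

$$= \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N) + \frac{\beta}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N||^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$$

Just as required.

Problem 3.19 Solution

Based on the standard form of a multivariate normal distribution with mean $\mathbf{m}_N$ and covariance $\mathbf{A}^{-1}$, we know that

$$\int\frac{1}{(2\pi)^{M/2}}\frac{1}{|\mathbf{A}^{-1}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\right\}d\mathbf{w} = 1$$

Hence,

$$\int\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\right\}d\mathbf{w} = (2\pi)^{M/2}|\mathbf{A}|^{-1/2}$$

Since $E(\mathbf{m}_N)$ does not depend on $\mathbf{w}$, (3.85) follows directly. Then we substitute (3.85) into (3.78), which immediately gives (3.86).

Problem 3.20 Solution

You can just follow the steps from (3.87) to (3.92), which are already very clear.

Problem 3.21 Solution

Let's first prove (3.117). According to (C.47) and (C.48), we know that if $\mathbf{A}$ is an $M \times M$ real symmetric matrix with eigenvalues $\lambda_i$, $i = 1, 2, ..., M$, then $|\mathbf{A}|$ and $\mathrm{Tr}(\mathbf{A})$ can be written as:

$$|\mathbf{A}| = \prod_{i=1}^{M}\lambda_i,\qquad \mathrm{Tr}(\mathbf{A}) = \sum_{i=1}^{M}\lambda_i$$

Back to this problem: according to section 3.5.2, we know that $\mathbf{A}$ has eigenvalues $\alpha + \lambda_i$, $i = 1, 2, ..., M$.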
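Hence the left side of (3.117) equals:

$$\text{left side} = \frac{d}{d\alpha}\ln\left[\prod_{i=1}^{M}(\alpha + \lambda_i)\right] = \sum_{i=1}^{M}\frac{d}{d\alpha}\ln(\alpha + \lambda_i) = \sum_{i=1}^{M}\frac{1}{\alpha + \lambda_i}$$

And according to (3.81), we can obtain:

$$\mathbf{A}^{-1}\frac{d}{d\alpha}\mathbf{A} = \mathbf{A}^{-1}\mathbf{I} = \mathbf{A}^{-1}$$

For the symmetric matrix $\mathbf{A}$, its inverse $\mathbf{A}^{-1}$ has eigenvalues $1/(\alpha + \lambda_i)$, $i = 1, 2, ..., M$. Therefore,

$$\mathrm{Tr}\left(\mathbf{A}^{-1}\frac{d}{d\alpha}\mathbf{A}\right) = \sum_{i=1}^{M}\frac{1}{\alpha + \lambda_i}$$

Hence the two sides are equal, which proves (3.117) in this case, and (3.92) then follows directly.

Problem 3.22 Solution

Let's differentiate (3.86) with regard to $\beta$. The first term dependent on $\beta$ in (3.86) gives:

$$\frac{d}{d\beta}\left(\frac{N}{2}\ln\beta\right) = \frac{N}{2\beta}$$

The second term is:

$$\frac{d}{d\beta}E(\mathbf{m}_N) = \frac{1}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N||^2 + \frac{\beta}{2}\frac{d}{d\beta}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N||^2 + \frac{d}{d\beta}\frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$$

The last two terms in the equation above can be further written as:

$$\frac{\beta}{2}\frac{d}{d\beta}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N||^2 + \frac{d}{d\beta}\frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N = \left\{\frac{\beta}{2}\frac{d}{d\mathbf{m}_N}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N||^2 + \frac{d}{d\mathbf{m}_N}\frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N\right\}\cdot\frac{d\mathbf{m}_N}{d\beta}$$

$$= \left\{\frac{\beta}{2}\left[-2\boldsymbol{\Phi}^T(\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N)\right] + \frac{\alpha}{2}2\mathbf{m}_N\right\}\cdot\frac{d\mathbf{m}_N}{d\beta} = \left\{-\beta\boldsymbol{\Phi}^T(\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N) + \alpha\mathbf{m}_N\right\}\cdot\frac{d\mathbf{m}_N}{d\beta}$$

$$= \left\{-\beta\boldsymbol{\Phi}^T\mathbf{t} + (\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{m}_N\right\}\cdot\frac{d\mathbf{m}_N}{d\beta} = \left\{-\beta\boldsymbol{\Phi}^T\mathbf{t} + \mathbf{A}\mathbf{m}_N\right\}\cdot\frac{d\mathbf{m}_N}{d\beta} = 0$$

where we have taken advantage of (3.83) and (3.84). Hence

$$\frac{d}{d\beta}E(\mathbf{m}_N) = \frac{1}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N||^2 = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2$$

The last term dependent on $\beta$ in (3.86) gives:

$$\frac{d}{d\beta}\left(\frac{1}{2}\ln|\mathbf{A}|\right) = \frac{\gamma}{2\beta}$$

Therefore, if we combine all those expressions, we obtain (3.94). Rearranging it then gives (3.95).

Problem 3.23 Solution

First, according to (3.10), we know that $p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)$ can be written as $p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})$, and we are given that $p(\mathbf{w}|\beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)$ and $p(\beta) = \mathrm{Gam}(\beta|a_0,b_0)$. Therefore, we just follow the hint in the problem.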
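$$p(\mathbf{t}) = \int\!\!\int p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)\,p(\mathbf{w}|\beta)\,d\mathbf{w}\,p(\beta)\,d\beta$$

$$= \int\!\!\int\left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left\{-\frac{\beta}{2}(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})\right\}\cdot\left(\frac{\beta}{2\pi}\right)^{M/2}|\mathbf{S}_0|^{-1/2}\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right\}d\mathbf{w}\ \Gamma(a_0)^{-1}b_0^{a_0}\beta^{a_0-1}\exp(-b_0\beta)\,d\beta$$

$$= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2}|\mathbf{S}_0|^{1/2}}\int\!\!\int\exp\left\{-\frac{\beta}{2}(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})\right\}\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right\}d\mathbf{w}\ \beta^{a_0-1+N/2+M/2}\exp(-b_0\beta)\,d\beta$$

$$= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2}|\mathbf{S}_0|^{1/2}}\int\!\!\int\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\right\}d\mathbf{w}\ \exp\left\{-\frac{\beta}{2}\left(\mathbf{t}^T\mathbf{t} + \mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\right)\right\}\beta^{a_N-1+M/2}\exp(-b_0\beta)\,d\beta$$

where we have defined

$$\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \boldsymbol{\Phi}^T\mathbf{t}),\qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

$$a_N = a_0 + \frac{N}{2},\qquad b_N = b_0 + \frac{1}{2}\left(\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N}t_n^2\right)$$

which are exactly the same as those in Prob.3.12. Then we evaluate the integral, taking advantage of the normalization of the multivariate Gaussian distribution and the Gamma distribution:

$$p(\mathbf{t}) = \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2}|\mathbf{S}_0|^{1/2}}\int\left(\frac{2\pi}{\beta}\right)^{M/2}|\mathbf{S}_N|^{1/2}\beta^{a_N-1+M/2}\exp(-b_N\beta)\,d\beta$$

$$= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2}|\mathbf{S}_0|^{1/2}}(2\pi)^{M/2}|\mathbf{S}_N|^{1/2}\int\beta^{a_N-1}\exp(-b_N\beta)\,d\beta$$

$$= \frac{1}{(2\pi)^{N/2}}\frac{|\mathbf{S}_N|^{1/2}}{|\mathbf{S}_0|^{1/2}}\frac{b_0^{a_0}}{b_N^{a_N}}\frac{\Gamma(a_N)}{\Gamma(a_0)}$$

Just as required.

Problem 3.24 Solution

Let's follow the hint and begin by writing down expressions for the likelihood, prior and posterior PDF. We know that $p(\mathbf{t}|\mathbf{w},\beta) = \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})$. What's more, the forms of the prior and posterior are quite similar:

$$p(\mathbf{w},\beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta|a_0,b_0)$$

and

$$p(\mathbf{w},\beta|\mathbf{t}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)\,\mathrm{Gam}(\beta|a_N,b_N)$$

where the relationships among those parameters are shown in Prob.3.12 and Prob.3.23.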
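The parameter relationships from Prob.3.12 and Prob.3.23 can be sanity-checked numerically. The sketch below is illustrative only: the function name `posterior_params` and the random test data are hypothetical. It computes $\mathbf{m}_N$, $\mathbf{S}_N$, $a_N$, $b_N$ and then verifies, for an arbitrary $\mathbf{w}$, the completed-square identity $(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w}) + (\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) = (\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N) + 2(b_N - b_0)$ implied by those definitions.

```python
import numpy as np

def posterior_params(Phi, t, m0, S0, a0, b0):
    # Normal-Gamma posterior parameters, following the definitions in Prob.3.12 / Prob.3.23.
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + Phi.T @ Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ t)
    aN = a0 + len(t) / 2.0
    bN = b0 + 0.5 * (m0 @ S0_inv @ m0 - mN @ SN_inv @ mN + t @ t)
    return mN, SN, aN, bN

# Hypothetical random problem instance, used only for the check.
rng = np.random.default_rng(0)
N, M = 20, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
m0 = rng.normal(size=M)
B = rng.normal(size=(M, M))
S0 = B @ B.T + np.eye(M)          # symmetric positive definite prior scale matrix
a0, b0 = 2.0, 1.5

mN, SN, aN, bN = posterior_params(Phi, t, m0, S0, a0, b0)

# Compare both sides of the completed-square identity at an arbitrary w.
w = rng.normal(size=M)
lhs = (t - Phi @ w) @ (t - Phi @ w) + (w - m0) @ np.linalg.solve(S0, w - m0)
rhs = (w - mN) @ np.linalg.solve(SN, w - mN) + 2.0 * (bN - b0)
print(np.isclose(lhs, rhs))
```

If the identity holds for arbitrary `w`, the quadratic, linear and constant terms have all been matched correctly.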
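Now according to (3.119), we can write:

$$p(\mathbf{t}) = \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})\,\frac{\mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta|a_0,b_0)}{\mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)\,\mathrm{Gam}(\beta|a_N,b_N)}$$

$$= \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})\,\frac{\mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)}{\mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)}\,\frac{b_0^{a_0}\beta^{a_0-1}\exp(-b_0\beta)/\Gamma(a_0)}{b_N^{a_N}\beta^{a_N-1}\exp(-b_N\beta)/\Gamma(a_N)}$$

$$= \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})\,\frac{\mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)}{\mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)}\,\frac{b_0^{a_0}}{b_N^{a_N}}\frac{\Gamma(a_N)}{\Gamma(a_0)}\,\beta^{a_0-a_N}\exp\left\{-(b_0-b_N)\beta\right\}$$

$$= \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})\,\frac{\mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)}{\mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)}\,\frac{b_0^{a_0}}{b_N^{a_N}}\frac{\Gamma(a_N)}{\Gamma(a_0)}\,\beta^{-N/2}\exp\left\{-(b_0-b_N)\beta\right\}$$

where we have used $a_N = a_0 + \frac{N}{2}$. Now we deal with the terms expressed in the form of Gaussian distributions:

$$\text{Gaussian terms} = \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})\,\frac{\mathcal{N}(\mathbf{w}|\mathbf{m}_0,\beta^{-1}\mathbf{S}_0)}{\mathcal{N}(\mathbf{w}|\mathbf{m}_N,\beta^{-1}\mathbf{S}_N)}$$

$$= \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left\{-\frac{\beta}{2}(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})\right\}\cdot\frac{|\beta^{-1}\mathbf{S}_N|^{1/2}}{|\beta^{-1}\mathbf{S}_0|^{1/2}}\,\frac{\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right\}}{\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\right\}}$$

$$= \left(\frac{\beta}{2\pi}\right)^{N/2}\frac{|\mathbf{S}_N|^{1/2}}{|\mathbf{S}_0|^{1/2}}\exp\left\{-\frac{\beta}{2}(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})\right\}\cdot\frac{\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right\}}{\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\right\}}$$

Looking back at the previous problem, we notice that in the last step of the derivation of $p(\mathbf{t})$ we completed the square with respect to $\mathbf{w}$. Comparing the two sides of that step, we obtain:

$$\exp\left\{-\frac{\beta}{2}(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{w})\right\}\exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right\} = \exp\left\{-\frac{\beta}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\right\}\exp\left\{-(b_N-b_0)\beta\right\}$$

Hence, we go back to the Gaussian terms:

$$\text{Gaussian terms} = \left(\frac{\beta}{2\pi}\right)^{N/2}\frac{|\mathbf{S}_N|^{1/2}}{|\mathbf{S}_0|^{1/2}}\exp\left\{-(b_N-b_0)\beta\right\}$$

If we substitute the expression above into $p(\mathbf{t})$, the factors $\beta^{N/2}\beta^{-N/2}$ and $\exp\{-(b_N-b_0)\beta\}\exp\{-(b_0-b_N)\beta\}$ cancel, and we obtain (3.118) immediately.

0.4 Linear Models Classification

Problem 4.1 Solution

If the convex hulls of $\{\mathbf{x}_n\}$ and $\{\mathbf{y}_n\}$ intersect, we know that there is a point $\mathbf{z}$ which can be written as $\mathbf{z} = \sum_n\alpha_n\mathbf{x}_n$ and also as $\mathbf{z} = \sum_n\beta_n\mathbf{y}_n$. Hence we can obtain:

$$\hat{\mathbf{w}}^T\mathbf{z} + w_0 = \hat{\mathbf{w}}^T\left(\sum_n\alpha_n\mathbf{x}_n\right) + w_0 = \left(\sum_n\alpha_n\hat{\mathbf{w}}^T\mathbf{x}_n\right) + \left(\sum_n\alpha_n\right)w_0 = \sum_n\alpha_n\left(\hat{\mathbf{w}}^T\mathbf{x}_n + w_0\right)\qquad(*)$$

where we have used $\sum_n\alpha_n = 1$. If $\{\mathbf{x}_n\}$ and $\{\mathbf{y}_n\}$ are linearly separable, we have $\hat{\mathbf{w}}^T\mathbf{x}_n + w_0 > 0$ and $\hat{\mathbf{w}}^T\mathbf{y}_n + w_0 < 0$ for all $\mathbf{x}_n$, $\mathbf{y}_n$. Together with $\alpha_n \geq 0$ and $(*)$, we know that $\hat{\mathbf{w}}^T\mathbf{z} + w_0 > 0$. If we calculate $\hat{\mathbf{w}}^T\mathbf{z} + w_0$ from the perspective of $\{\mathbf{y}_n\}$ following the same procedure, we obtain $\hat{\mathbf{w}}^T\mathbf{z} + w_0 < 0$. Hence a contradiction occurs. In other words, the two sets are not linearly separable if their convex hulls intersect.

We have now proved the first statement, i.e., "convex hulls intersect" implies "not linearly separable". What the second part asks us to prove is that "linearly separable" implies "convex hulls do not intersect", which is simply the contrapositive of the first statement. The true converse of the first statement would be: if the convex hulls do not intersect, the data sets are linearly separable. This is exactly what the Hyperplane Separation Theorem shows.

Problem 4.2 Solution

Let's make the dependency of $E_D(\widetilde{\mathbf{W}})$ on $\mathbf{w}_0$ explicit:

$$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2}\mathrm{Tr}\left\{(\mathbf{X}\mathbf{W} + \mathbf{1}\mathbf{w}_0^T - \mathbf{T})^T(\mathbf{X}\mathbf{W} + \mathbf{1}\mathbf{w}_0^T - \mathbf{T})\right\}$$

Then we calculate the derivative of $E_D(\widetilde{\mathbf{W}})$ with respect to $\mathbf{w}_0$:

$$\frac{\partial E_D(\widetilde{\mathbf{W}})}{\partial\mathbf{w}_0} = N\mathbf{w}_0 + (\mathbf{X}\mathbf{W} - \mathbf{T})^T\mathbf{1}$$

where we have used the property:

$$\frac{\partial}{\partial\mathbf{X}}\mathrm{Tr}\left[(\mathbf{A}\mathbf{X}\mathbf{B} + \mathbf{C})(\mathbf{A}\mathbf{X}\mathbf{B} + \mathbf{C})^T\right] = 2\mathbf{A}^T(\mathbf{A}\mathbf{X}\mathbf{B} + \mathbf{C})\mathbf{B}^T$$

We set the derivative equal to 0, which gives:

$$\mathbf{w}_0 = -\frac{1}{N}(\mathbf{X}\mathbf{W} - \mathbf{T})^T\mathbf{1} = \bar{\mathbf{t}} - \mathbf{W}^T\bar{\mathbf{x}}$$

where we have denoted:

$$\bar{\mathbf{t}} = \frac{1}{N}\mathbf{T}^T\mathbf{1},\qquad\text{and}\qquad \bar{\mathbf{x}} = \frac{1}{N}\mathbf{X}^T\mathbf{1}$$

If we substitute the equations above into $E_D(\widetilde{\mathbf{W}})$, we can obtain:

$$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2}\mathrm{Tr}\left\{(\mathbf{X}\mathbf{W} + \bar{\mathbf{T}} - \bar{\mathbf{X}}\mathbf{W} - \mathbf{T})^T(\mathbf{X}\mathbf{W} + \bar{\mathbf{T}} - \bar{\mathbf{X}}\mathbf{W} - \mathbf{T})\right\}$$

where we further denote $\bar{\mathbf{T}} = \mathbf{1}\bar{\mathbf{t}}^T$ and $\bar{\mathbf{X}} = \mathbf{1}\bar{\mathbf{x}}^T$. Then we set the derivative of $E_D(\widetilde{\mathbf{W}})$ with regard to $\mathbf{W}$ to 0, which gives:

$$\mathbf{W} = \widehat{\mathbf{X}}^{\dagger}\widehat{\mathbf{T}}$$

where we have defined:

$$\widehat{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}},\qquad\text{and}\qquad \widehat{\mathbf{T}} = \mathbf{T} - \bar{\mathbf{T}}$$

Now consider the prediction for a new given $\mathbf{x}$:

$$\mathbf{y}(\mathbf{x}) = \mathbf{W}^T\mathbf{x} + \mathbf{w}_0 = \mathbf{W}^T\mathbf{x} + \bar{\mathbf{t}} - \mathbf{W}^T\bar{\mathbf{x}} = \bar{\mathbf{t}} + \mathbf{W}^T(\mathbf{x} - \bar{\mathbf{x}})$$

If we know that $\mathbf{a}^T\mathbf{t}_n + b = 0$ holds for some $\mathbf{a}$ and $b$, we can obtain:

$$\mathbf{a}^T\bar{\mathbf{t}} = \frac{1}{N}\mathbf{a}^T\mathbf{T}^T\mathbf{1} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{a}^T\mathbf{t}_n = -b$$

Therefore,

$$\mathbf{a}^T\mathbf{y}(\mathbf{x}) = \mathbf{a}^T\left[\bar{\mathbf{t}} + \mathbf{W}^T(\mathbf{x} - \bar{\mathbf{x}})\right] = \mathbf{a}^T\bar{\mathbf{t}} + \mathbf{a}^T\mathbf{W}^T(\mathbf{x} - \bar{\mathbf{x}}) = -b + \mathbf{a}^T\widehat{\mathbf{T}}^T(\widehat{\mathbf{X}}^{\dagger})^T(\mathbf{x} - \bar{\mathbf{x}}) = -b$$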
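where we have used:

$$\mathbf{a}^T\widehat{\mathbf{T}}^T = \mathbf{a}^T(\mathbf{T} - \bar{\mathbf{T}})^T = \mathbf{a}^T\left(\mathbf{T} - \frac{1}{N}\mathbf{1}\mathbf{1}^T\mathbf{T}\right)^T = \mathbf{a}^T\mathbf{T}^T - \frac{1}{N}\mathbf{a}^T\mathbf{T}^T\mathbf{1}\mathbf{1}^T = -b\mathbf{1}^T + b\mathbf{1}^T = \mathbf{0}^T$$

Problem 4.3 Solution

Suppose there are $Q$ constraints in total. We can write $\mathbf{a}_q^T\mathbf{t}_n + b_q = 0$, $q = 1, 2, ..., Q$, for all the target vectors $\mathbf{t}_n$, $n = 1, 2, ..., N$. Alternatively, we can group them together:

$$\mathbf{A}^T\mathbf{t}_n + \mathbf{b} = \mathbf{0}$$

where $\mathbf{A}$ is a $K \times Q$ matrix ($K$ being the dimension of the target vectors) whose $q$-th column is $\mathbf{a}_q$, and $\mathbf{b}$ is a $Q \times 1$ column vector whose $q$-th element is $b_q$. For every pair $\{\mathbf{a}_q, b_q\}$ we can follow the same procedure as in the previous problem to show that $\mathbf{a}_q^T\mathbf{y}(\mathbf{x}) + b_q = 0$. In other words, the proofs do not affect each other. Therefore:

$$\mathbf{A}^T\mathbf{y}(\mathbf{x}) + \mathbf{b} = \mathbf{0}$$

Problem 4.4 Solution

We use a Lagrange multiplier to enforce the constraint $\mathbf{w}^T\mathbf{w} = 1$. We now need to maximize:

$$L(\lambda,\mathbf{w}) = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1) + \lambda(\mathbf{w}^T\mathbf{w} - 1)$$

We calculate the derivatives:

$$\frac{\partial L(\lambda,\mathbf{w})}{\partial\lambda} = \mathbf{w}^T\mathbf{w} - 1$$

and

$$\frac{\partial L(\lambda,\mathbf{w})}{\partial\mathbf{w}} = \mathbf{m}_2 - \mathbf{m}_1 + 2\lambda\mathbf{w}$$

We set the derivatives above equal to 0, which gives:

$$\mathbf{w} = -\frac{1}{2\lambda}(\mathbf{m}_2 - \mathbf{m}_1) \propto (\mathbf{m}_2 - \mathbf{m}_1)$$

Problem 4.5 Solution

We expand (4.25) using (4.22), (4.23) and (4.24):

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{||\mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)||^2}{\sum_{n\in C_1}(\mathbf{w}^T\mathbf{x}_n - m_1)^2 + \sum_{n\in C_2}(\mathbf{w}^T\mathbf{x}_n - m_2)^2}$$

The numerator can be further written as:

$$\text{numerator} = \left[\mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)\right]\left[\mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)\right]^T = \mathbf{w}^T\mathbf{S}_B\mathbf{w}$$

where we have defined:

$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$

And it is the same for the denominator:

$$\text{denominator} = \sum_{n\in C_1}\left[\mathbf{w}^T(\mathbf{x}_n - \mathbf{m}_1)\right]^2 + \sum_{n\in C_2}\left[\mathbf{w}^T(\mathbf{x}_n - \mathbf{m}_2)\right]^2 = \mathbf{w}^T\mathbf{S}_{w1}\mathbf{w} + \mathbf{w}^T\mathbf{S}_{w2}\mathbf{w} = \mathbf{w}^T\mathbf{S}_W\mathbf{w}$$

where we have defined:

$$\mathbf{S}_W = \sum_{n\in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n\in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$$

Just as required.

Problem 4.6 Solution

Let's follow the hint, beginning by expanding (4.33).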
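Using $w_0 = -\mathbf{w}^T\mathbf{m}$ and the target coding $t_n = N/N_1$ for class $C_1$, $t_n = -N/N_2$ for class $C_2$, the left side of (4.33) becomes:

$$\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right)\mathbf{x}_n = \sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T\mathbf{w} - \mathbf{w}^T\mathbf{m}\sum_{n=1}^{N}\mathbf{x}_n - \left(\sum_{n\in C_1}t_n\mathbf{x}_n + \sum_{n\in C_2}t_n\mathbf{x}_n\right)$$

$$= \sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T\mathbf{w} - \mathbf{w}^T\mathbf{m}\cdot(N\mathbf{m}) - \left(\sum_{n\in C_1}\frac{N}{N_1}\mathbf{x}_n + \sum_{n\in C_2}\frac{-N}{N_2}\mathbf{x}_n\right)$$

$$= \sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T\mathbf{w} - N\mathbf{m}\mathbf{m}^T\mathbf{w} - N\left(\sum_{n\in C_1}\frac{1}{N_1}\mathbf{x}_n - \sum_{n\in C_2}\frac{1}{N_2}\mathbf{x}_n\right)$$

$$= \left[\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T - N\mathbf{m}\mathbf{m}^T\right]\mathbf{w} - N(\mathbf{m}_1 - \mathbf{m}_2)$$

Setting this expression equal to 0, as (4.33) requires, we see that:

$$\left[\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T - N\mathbf{m}\mathbf{m}^T\right]\mathbf{w} = N(\mathbf{m}_1 - \mathbf{m}_2)$$

Therefore, we now need to prove:

$$\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T - N\mathbf{m}\mathbf{m}^T = \mathbf{S}_W + \frac{N_1N_2}{N}\mathbf{S}_B$$

Let's expand the left side of the equation above, using $\mathbf{m} = \frac{N_1}{N}\mathbf{m}_1 + \frac{N_2}{N}\mathbf{m}_2$:

$$\text{left} = \sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T - N\left(\frac{N_1}{N}\mathbf{m}_1 + \frac{N_2}{N}\mathbf{m}_2\right)\left(\frac{N_1}{N}\mathbf{m}_1 + \frac{N_2}{N}\mathbf{m}_2\right)^T$$

$$= \sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T - \frac{N_1^2}{N}\mathbf{m}_1\mathbf{m}_1^T - \frac{N_2^2}{N}\mathbf{m}_2\mathbf{m}_2^T - \frac{N_1N_2}{N}\left(\mathbf{m}_1\mathbf{m}_2^T + \mathbf{m}_2\mathbf{m}_1^T\right)$$

Since $N_1^2/N = N_1 - N_1N_2/N$ and $N_2^2/N = N_2 - N_1N_2/N$, this equals:

$$\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^T - N_1\mathbf{m}_1\mathbf{m}_1^T - N_2\mathbf{m}_2\mathbf{m}_2^T + \frac{N_1N_2}{N}(\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$

$$= \sum_{n\in C_1}\left(\mathbf{x}_n\mathbf{x}_n^T - \mathbf{m}_1\mathbf{x}_n^T - \mathbf{x}_n\mathbf{m}_1^T + \mathbf{m}_1\mathbf{m}_1^T\right) + \sum_{n\in C_2}\left(\mathbf{x}_n\mathbf{x}_n^T - \mathbf{m}_2\mathbf{x}_n^T - \mathbf{x}_n\mathbf{m}_2^T + \mathbf{m}_2\mathbf{m}_2^T\right) + \frac{N_1N_2}{N}\mathbf{S}_B$$

$$= \sum_{n\in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n\in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T + \frac{N_1N_2}{N}\mathbf{S}_B = \mathbf{S}_W + \frac{N_1N_2}{N}\mathbf{S}_B$$

where the second-to-last line uses $\sum_{n\in C_k}\mathbf{x}_n = N_k\mathbf{m}_k$. Just as required.

Problem 4.7 Solution

This problem is quite simple. We can solve it by definition. We know that the logistic sigmoid function has the form:

$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$

Therefore, we can obtain:

$$\sigma(a) + \sigma(-a) = \frac{1}{1 + \exp(-a)} + \frac{1}{1 + \exp(a)} = \frac{2 + \exp(a) + \exp(-a)}{[1 + \exp(-a)][1 + \exp(a)]} = \frac{2 + \exp(a) + \exp(-a)}{2 + \exp(a) + \exp(-a)} = 1$$

Next we exchange the dependent and independent variables to obtain its inverse.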
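$$a = \frac{1}{1 + \exp(-y)}$$

We first rearrange the equation above, which gives:

$$\exp(-y) = \frac{1-a}{a}$$

Then we take the logarithm of both sides, which gives:

$$y = \ln\left(\frac{a}{1-a}\right)$$

Just as required.

Problem 4.8 Solution

According to (4.58) and (4.64), we can write:

$$a = \ln\frac{p(\mathbf{x}|C_1)p(C_1)}{p(\mathbf{x}|C_2)p(C_2)} = \ln p(\mathbf{x}|C_1) - \ln p(\mathbf{x}|C_2) + \ln\frac{p(C_1)}{p(C_2)}$$

$$= -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) + \ln\frac{p(C_1)}{p(C_2)}$$

$$= \left[\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\right]^T\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)} = \mathbf{w}^T\mathbf{x} + w_0$$

where in the second-to-last step we collect the terms by their order in $\mathbf{x}$: the quadratic terms cancel because the two classes share the same covariance, leaving a linear term and a constant term. We have also defined:

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

and

$$w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}$$

Finally, since $p(C_1|\mathbf{x}) = \sigma(a)$ as stated in (4.57), we have $p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$, just as required.

Problem 4.9 Solution

We begin by writing down the likelihood function:

$$p(\{\boldsymbol{\phi}_n, \mathbf{t}_n\}|\pi_1, \pi_2, ..., \pi_K) = \prod_{n=1}^{N}\prod_{k=1}^{K}\left[p(\boldsymbol{\phi}_n|C_k)p(C_k)\right]^{t_{nk}} = \prod_{n=1}^{N}\prod_{k=1}^{K}\left[\pi_k\,p(\boldsymbol{\phi}_n|C_k)\right]^{t_{nk}}$$

Hence we can obtain the expression for the log likelihood, keeping only the terms that depend on $\pi_k$:

$$\ln p = \sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\left[\ln\pi_k + \ln p(\boldsymbol{\phi}_n|C_k)\right] = \sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln\pi_k + \text{const}$$

Since there is a constraint on $\pi_k$, we need to add a Lagrange multiplier to the expression, which becomes:

$$L = \sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln\pi_k + \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right)$$

We calculate the derivative of the expression above with regard to $\pi_k$:

$$\frac{\partial L}{\partial\pi_k} = \sum_{n=1}^{N}\frac{t_{nk}}{\pi_k} + \lambda$$

If we set the derivative equal to 0, we can obtain:

$$\pi_k = -\left(\sum_{n=1}^{N}t_{nk}\right)\Big/\lambda = -\frac{N_k}{\lambda}\qquad(*)$$

If we sum both sides over $k$, we can see that:

$$1 = -\left(\sum_{k=1}^{K}N_k\right)\Big/\lambda = -\frac{N}{\lambda}$$

which gives $\lambda = -N$. Substituting it into $(*)$, we obtain $\pi_k = N_k/N$.

Problem 4.10 Solution

This time, we focus on the terms in the log likelihood that depend on $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$.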
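$$\ln p = \sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\left[\ln\pi_k + \ln p(\boldsymbol{\phi}_n|C_k)\right] = \sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln p(\boldsymbol{\phi}_n|C_k) + \text{const}$$

Provided $p(\boldsymbol{\phi}|C_k) = \mathcal{N}(\boldsymbol{\phi}|\boldsymbol{\mu}_k,\boldsymbol{\Sigma})$, we can further derive (up to an additive constant):

$$\ln p = \sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\left[-\frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)\right]$$

We first calculate the derivative of the expression above with regard to $\boldsymbol{\mu}_k$:

$$\frac{\partial\ln p}{\partial\boldsymbol{\mu}_k} = \sum_{n=1}^{N}t_{nk}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)$$

We set the derivative equal to 0, which gives:

$$\sum_{n=1}^{N}t_{nk}\boldsymbol{\Sigma}^{-1}\boldsymbol{\phi}_n = \sum_{n=1}^{N}t_{nk}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k = N_k\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k$$

Therefore, if we multiply both sides by $\boldsymbol{\Sigma}/N_k$, we obtain (4.161). Now let's calculate the derivative of $\ln p$ with regard to $\boldsymbol{\Sigma}$, which gives:

$$\frac{\partial\ln p}{\partial\boldsymbol{\Sigma}} = \sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\left(-\frac{1}{2}\boldsymbol{\Sigma}^{-1}\right) - \frac{1}{2}\frac{\partial}{\partial\boldsymbol{\Sigma}}\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)$$

$$= -\frac{N}{2}\boldsymbol{\Sigma}^{-1} - \frac{1}{2}\frac{\partial}{\partial\boldsymbol{\Sigma}}\sum_{k=1}^{K}N_k\mathrm{Tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}_k) = -\frac{N}{2}\boldsymbol{\Sigma}^{-1} + \frac{1}{2}\sum_{k=1}^{K}N_k\boldsymbol{\Sigma}^{-1}\mathbf{S}_k\boldsymbol{\Sigma}^{-1}$$

where we have used $\sum_{n}\sum_{k}t_{nk} = N$ and denoted

$$\mathbf{S}_k = \frac{1}{N_k}\sum_{n=1}^{N}t_{nk}(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)^T$$

Now we set the derivative equal to 0 and rearrange the equation, which gives:

$$\boldsymbol{\Sigma} = \sum_{k=1}^{K}\frac{N_k}{N}\mathbf{S}_k$$

Problem 4.11 Solution

Based on the definition, we can write down

$$p(\boldsymbol{\phi}|C_k) = \prod_{m=1}^{M}\prod_{l=1}^{L}\mu_{kml}^{\phi_{ml}}$$

Note that only one of the values $\phi_{m1}, \phi_{m2}, ..., \phi_{mL}$ is 1 and the others are all 0, because we have used a 1-of-$L$ binary coding scheme. We have also taken advantage of the assumption that the $M$ components of $\boldsymbol{\phi}$ are independent conditioned on the class $C_k$. We substitute the expression above into (4.63), which gives:

$$a_k = \sum_{m=1}^{M}\sum_{l=1}^{L}\phi_{ml}\cdot\ln\mu_{kml} + \ln p(C_k)$$

Hence $a_k$ is a linear function of the components of $\boldsymbol{\phi}$.

Problem 4.12 Solution

Based on the definition, i.e., (4.59), we know that the logistic sigmoid has the form:

$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$

Now, we calculate its derivative with regard to $a$:

$$\frac{d\sigma(a)}{da} = \frac{\exp(-a)}{[1 + \exp(-a)]^2} = \frac{\exp(-a)}{1 + \exp(-a)}\cdot\frac{1}{1 + \exp(-a)} = [1 - \sigma(a)]\cdot\sigma(a)$$

Just as required.

Problem 4.13 Solution

Let's follow the hint.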
\[
\begin{aligned}
\nabla E(\mathbf{w}) &= -\nabla\sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} \\
&= -\sum_{n=1}^{N}\nabla\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} \\
&= -\sum_{n=1}^{N}\frac{d\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\}}{dy_n}\,\frac{dy_n}{da_n}\,\frac{da_n}{d\mathbf{w}} \\
&= -\sum_{n=1}^{N}\left(\frac{t_n}{y_n} - \frac{1-t_n}{1-y_n}\right)\cdot y_n(1-y_n)\cdot\boldsymbol{\phi}_n \\
&= -\sum_{n=1}^{N}\frac{t_n-y_n}{y_n(1-y_n)}\cdot y_n(1-y_n)\cdot\boldsymbol{\phi}_n \\
&= -\sum_{n=1}^{N}(t_n-y_n)\boldsymbol{\phi}_n \\
&= \sum_{n=1}^{N}(y_n-t_n)\boldsymbol{\phi}_n
\end{aligned}
\]

where we have used $y_n = \sigma(a_n)$, $a_n = \mathbf{w}^T\boldsymbol{\phi}_n$, the chain rule and (4.88).

Problem 4.14 Solution

By definition, if a data set is linearly separable, we can find a vector $\mathbf{w}$ such that $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) > 0$ for some points $\mathbf{x}_n$ and $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_m) < 0$ for the others. The decision boundary is then given by $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = 0$. Note that for any point $\mathbf{x}_0$ in the data set, the value of $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_0)$ is either positive or negative; it cannot equal 0.

Suppose that $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) > 0$ for the points $\mathbf{x}_n$ belonging to class $\mathcal{C}_1$ and $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_m) < 0$ for those belonging to class $\mathcal{C}_2$. According to (4.87), if $|\mathbf{w}| \to \infty$, we have

\[ p(\mathcal{C}_1|\boldsymbol{\phi}(\mathbf{x}_n)) = \sigma(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)) \to 1 \]

where we have used $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) \to +\infty$. And since $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_m) \to -\infty$, we also obtain:

\[ p(\mathcal{C}_2|\boldsymbol{\phi}(\mathbf{x}_m)) = 1 - p(\mathcal{C}_1|\boldsymbol{\phi}(\mathbf{x}_m)) = 1 - \sigma(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_m)) \to 1 \]

In other words, if $|\mathbf{w}| \to \infty$ and we label all the points lying on one side of the boundary as class $\mathcal{C}_1$ and those on the other side as class $\mathcal{C}_2$, then every term in the likelihood function (4.89) approaches its maximum value, i.e., 1, and so the likelihood itself approaches its maximum. Hence, for a linearly separable data set, the maximum likelihood solution is obtained by a separating boundary with $|\mathbf{w}| \to \infty$, which can cause a severe over-fitting problem.

Problem 4.15 Solution(Waiting for update)

Since $y_n$ is the output of the logistic sigmoid function, we know that $0 < y_n < 1$ and hence $y_n(1-y_n) > 0$.
Then we use (4.97). For an arbitrary non-zero real vector $\mathbf{a} \neq \mathbf{0}$, we have:

\[
\begin{aligned}
\mathbf{a}^T\mathbf{H}\mathbf{a} &= \mathbf{a}^T\left[\sum_{n=1}^{N} y_n(1-y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\right]\mathbf{a} \\
&= \sum_{n=1}^{N} y_n(1-y_n)\,(\boldsymbol{\phi}_n^T\mathbf{a})^T(\boldsymbol{\phi}_n^T\mathbf{a}) \\
&= \sum_{n=1}^{N} y_n(1-y_n)\,b_n^2
\end{aligned}
\]

where we have denoted $b_n = \boldsymbol{\phi}_n^T\mathbf{a}$. At least one of $\{b_1, b_2, \ldots, b_N\}$ must be non-zero, so the expression above is larger than 0 and hence $\mathbf{H}$ is positive definite. To see this, suppose instead that all $b_n = 0$. Then $\mathbf{a} = [a_1, a_2, \ldots, a_M]^T$ lies in the null space of the $N \times M$ matrix $\boldsymbol{\Phi}$. However, by the rank-nullity theorem we know that $\mathrm{Rank}(\boldsymbol{\Phi}) + \mathrm{Nullity}(\boldsymbol{\Phi}) = M$, and we have already assumed that the $M$ features are linearly independent, i.e., $\mathrm{Rank}(\boldsymbol{\Phi}) = M$, which means the null space contains only $\mathbf{0}$. This contradicts $\mathbf{a} \neq \mathbf{0}$.

Problem 4.16 Solution

We still denote $y_n = p(t=1|\boldsymbol{\phi}_n)$, and then we can write down the log likelihood by replacing $t_n$ with $\pi_n$ in (4.89) and (4.90).

\[
\ln p(\mathbf{t}|\mathbf{w}) = \sum_{n=1}^{N}\{\pi_n\ln y_n + (1-\pi_n)\ln(1-y_n)\}
\]

Problem 4.17 Solution

We discuss two situations separately, namely $j = k$ and $j \neq k$. When $j \neq k$, we have:

\[
\frac{\partial y_k}{\partial a_j} = -\frac{\exp(a_k)\cdot\exp(a_j)}{\left[\sum_i\exp(a_i)\right]^2} = -y_k\cdot y_j
\]

And when $j = k$, we have:

\[
\frac{\partial y_k}{\partial a_k} = \frac{\exp(a_k)\sum_i\exp(a_i) - \exp(a_k)\exp(a_k)}{\left[\sum_i\exp(a_i)\right]^2} = y_k - y_k^2 = y_k(1-y_k)
\]

Therefore, we can obtain:

\[
\frac{\partial y_k}{\partial a_j} = y_k(I_{kj} - y_j)
\]

where $I_{kj}$ are the elements of the identity matrix.

Problem 4.18 Solution

We differentiate each term $t_{nk}\ln y_{nk}$ with regard to $\mathbf{w}_j$.

\[
\begin{aligned}
\frac{\partial\, t_{nk}\ln y_{nk}}{\partial\mathbf{w}_j} &= \frac{\partial\, t_{nk}\ln y_{nk}}{\partial y_{nk}}\,\frac{\partial y_{nk}}{\partial a_j}\,\frac{\partial a_j}{\partial\mathbf{w}_j} \\
&= t_{nk}\frac{1}{y_{nk}}\cdot y_{nk}(I_{kj}-y_{nj})\cdot\boldsymbol{\phi}_n \\
&= t_{nk}(I_{kj}-y_{nj})\,\boldsymbol{\phi}_n
\end{aligned}
\]

where we have used (4.105) and (4.106). Next we perform summation over $n$ and $k$.
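\[
\begin{aligned}
\nabla_{\mathbf{w}_j}E &= -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}(I_{kj}-y_{nj})\,\boldsymbol{\phi}_n \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,y_{nj}\,\boldsymbol{\phi}_n - \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,I_{kj}\,\boldsymbol{\phi}_n \\
&= \sum_{n=1}^{N}\left[\left(\sum_{k=1}^{K} t_{nk}\right)y_{nj}\,\boldsymbol{\phi}_n\right] - \sum_{n=1}^{N} t_{nj}\,\boldsymbol{\phi}_n \\
&= \sum_{n=1}^{N} y_{nj}\,\boldsymbol{\phi}_n - \sum_{n=1}^{N} t_{nj}\,\boldsymbol{\phi}_n \\
&= \sum_{n=1}^{N}(y_{nj}-t_{nj})\,\boldsymbol{\phi}_n
\end{aligned}
\]

where we have used the fact that for arbitrary $n$, we have $\sum_{k=1}^{K} t_{nk} = 1$.

Problem 4.19 Solution

We write down the log likelihood.

\[
\ln p(\mathbf{t}|\mathbf{w}) = \sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\}
\]

We work with the error function $E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w})$, whose gradient and Hessian are the negatives of those of the log likelihood. We can then obtain:

\[
\begin{aligned}
\nabla_{\mathbf{w}}E &= -\sum_{n=1}^{N}\frac{\partial\ln p}{\partial y_n}\cdot\frac{\partial y_n}{\partial a_n}\cdot\frac{\partial a_n}{\partial\mathbf{w}} \\
&= -\sum_{n=1}^{N}\left(\frac{t_n}{y_n} - \frac{1-t_n}{1-y_n}\right)\Phi'(a_n)\,\boldsymbol{\phi}_n \\
&= \sum_{n=1}^{N}\frac{y_n-t_n}{y_n(1-y_n)}\,\Phi'(a_n)\,\boldsymbol{\phi}_n
\end{aligned}
\]

where we have used $y = p(t=1|a) = \Phi(a)$ and $a_n = \mathbf{w}^T\boldsymbol{\phi}_n$. According to (4.114), we can obtain:

\[
\Phi'(a) = \mathcal{N}(\theta|0,1)\big|_{\theta=a} = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}a^2\right)
\]

Hence, we can obtain:

\[
\nabla_{\mathbf{w}}E = \sum_{n=1}^{N}\frac{y_n-t_n}{y_n(1-y_n)}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\,\boldsymbol{\phi}_n
\]

To calculate the Hessian matrix, we first need to evaluate several derivatives.

\[
\begin{aligned}
\frac{\partial}{\partial\mathbf{w}}\left\{\frac{y_n-t_n}{y_n(1-y_n)}\right\} &= \frac{\partial}{\partial y_n}\left\{\frac{y_n-t_n}{y_n(1-y_n)}\right\}\cdot\frac{\partial y_n}{\partial a_n}\cdot\frac{\partial a_n}{\partial\mathbf{w}} \\
&= \frac{y_n(1-y_n) - (y_n-t_n)(1-2y_n)}{[y_n(1-y_n)]^2}\,\Phi'(a_n)\,\boldsymbol{\phi}_n \\
&= \frac{y_n^2 + t_n - 2y_nt_n}{y_n^2(1-y_n)^2}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\,\boldsymbol{\phi}_n
\end{aligned}
\]

And

\[
\frac{\partial}{\partial\mathbf{w}}\left\{\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\right\} = \frac{\partial}{\partial a_n}\left\{\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\right\}\frac{\partial a_n}{\partial\mathbf{w}} = -\frac{a_n}{\sqrt{2\pi}}\exp\left(-\frac{a_n^2}{2}\right)\boldsymbol{\phi}_n
\]

Therefore, using the product rule, we can obtain:

\[
\begin{aligned}
\frac{\partial}{\partial\mathbf{w}}\left\{\frac{y_n-t_n}{y_n(1-y_n)}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\right\} &= \frac{\partial}{\partial\mathbf{w}}\left\{\frac{y_n-t_n}{y_n(1-y_n)}\right\}\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}} + \frac{y_n-t_n}{y_n(1-y_n)}\,\frac{\partial}{\partial\mathbf{w}}\left\{\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\right\} \\
&= \left[\frac{y_n^2+t_n-2y_nt_n}{y_n(1-y_n)}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}} - a_n(y_n-t_n)\right]\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}\,y_n(1-y_n)}\,\boldsymbol{\phi}_n
\end{aligned}
\]

Finally, if we perform summation over $n$, we obtain the Hessian matrix:

\[
\begin{aligned}
\mathbf{H} = \nabla\nabla_{\mathbf{w}}E &= \sum_{n=1}^{N}\frac{\partial}{\partial\mathbf{w}}\left\{\frac{y_n-t_n}{y_n(1-y_n)}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\right\}\boldsymbol{\phi}_n^T \\
&= \sum_{n=1}^{N}\left[\frac{y_n^2+t_n-2y_nt_n}{y_n(1-y_n)}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}} - a_n(y_n-t_n)\right]\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}\,y_n(1-y_n)}\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T
\end{aligned}
\]

Problem 4.20 Solution(waiting for update)

We know that the Hessian matrix is of size $MK \times MK$, and the $(j,k)$th block with size $M \times M$ is given by (4.110), where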
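$j, k = 1, 2, \ldots, K$. Therefore, we can obtain:

\[
\mathbf{u}^T\mathbf{H}\mathbf{u} = \sum_{j=1}^{K}\sum_{k=1}^{K}\mathbf{u}_j^T\mathbf{H}_{j,k}\mathbf{u}_k \qquad (*)
\]

where we use $\mathbf{u}_k$ to denote the $k$th block vector of $\mathbf{u}$ with size $M \times 1$, and $\mathbf{H}_{j,k}$ to denote the $(j,k)$th block matrix of $\mathbf{H}$ with size $M \times M$. Then, based on (4.110), we further expand $(*)$:

\[
\begin{aligned}
(*) &= \sum_{j=1}^{K}\sum_{k=1}^{K}\mathbf{u}_j^T\left\{-\sum_{n=1}^{N} y_{nk}(I_{kj}-y_{nj})\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\right\}\mathbf{u}_k \\
&= \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_j^T\left\{-y_{nk}(I_{kj}-y_{nj})\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\right\}\mathbf{u}_k \\
&= \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_j^T\left\{-y_{nk}I_{kj}\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\right\}\mathbf{u}_k + \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_j^T\left\{y_{nk}y_{nj}\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\right\}\mathbf{u}_k \\
&= \sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_k^T\left\{-y_{nk}\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\right\}\mathbf{u}_k + \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N} y_{nj}\mathbf{u}_j^T\left\{\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\right\}y_{nk}\mathbf{u}_k
\end{aligned}
\]

Problem 4.21 Solution

It is quite obvious.

\[
\begin{aligned}
\Phi(a) &= \int_{-\infty}^{a}\mathcal{N}(\theta|0,1)\,d\theta \\
&= \frac{1}{2} + \int_{0}^{a}\mathcal{N}(\theta|0,1)\,d\theta \\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta \\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\,\frac{\sqrt{\pi}}{2}\,\frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta \\
&= \frac{1}{2}\left(1 + \frac{1}{\sqrt{2}}\,\frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta\right) \\
&= \frac{1}{2}\left\{1 + \frac{1}{\sqrt{2}}\,\mathrm{erf}(a)\right\}
\end{aligned}
\]

where $\mathrm{erf}(\cdot)$ is the function defined in (4.115) and we have used

\[
\int_{-\infty}^{0}\mathcal{N}(\theta|0,1)\,d\theta = \frac{1}{2}
\]

Problem 4.22 Solution

If we denote $f(\boldsymbol{\theta}) = p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})$, we can write:

\[
\begin{aligned}
p(D) &= \int p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \int f(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&\approx f(\boldsymbol{\theta}_{MAP})\frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}} = p(D|\boldsymbol{\theta}_{MAP})\,p(\boldsymbol{\theta}_{MAP})\frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}}
\end{aligned}
\]

where $\boldsymbol{\theta}_{MAP}$ is the value of $\boldsymbol{\theta}$ at the mode of $f(\boldsymbol{\theta})$, $\mathbf{A}$ is the Hessian matrix of $-\ln f(\boldsymbol{\theta})$ and we have also used (4.135). Therefore,

\[
\ln p(D) \approx \ln p(D|\boldsymbol{\theta}_{MAP}) + \ln p(\boldsymbol{\theta}_{MAP}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{A}|
\]

Just as required.

Problem 4.23 Solution

According to (4.137), we can write:

\[
\begin{aligned}
\ln p(D) &= \ln p(D|\boldsymbol{\theta}_{MAP}) + \ln p(\boldsymbol{\theta}_{MAP}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{A}| \\
&= \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{V}_0| - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{A}| \\
&= \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}\ln|\mathbf{V}_0| - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{A}|
\end{aligned}
\]

where we have used the definition of the multivariate Gaussian distribution.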
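Then, from (4.138), we can write:

\[
\begin{aligned}
\mathbf{A} &= -\nabla\nabla\ln\left[p(D|\boldsymbol{\theta}_{MAP})\,p(\boldsymbol{\theta}_{MAP})\right] \\
&= -\nabla\nabla\ln p(D|\boldsymbol{\theta}_{MAP}) - \nabla\nabla\ln p(\boldsymbol{\theta}_{MAP}) \\
&= \mathbf{H} - \nabla\nabla\left\{-\frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m})\right\} \\
&= \mathbf{H} + \nabla\left\{\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m})\right\} \\
&= \mathbf{H} + \mathbf{V}_0^{-1}
\end{aligned}
\]

where we have denoted $\mathbf{H} = -\nabla\nabla\ln p(D|\boldsymbol{\theta}_{MAP})$. Therefore, the equation above becomes:

\[
\begin{aligned}
\ln p(D) &= \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{1}{2}\ln\left\{|\mathbf{V}_0|\cdot|\mathbf{H}+\mathbf{V}_0^{-1}|\right\} \\
&= \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{V}_0\mathbf{H}+\mathbf{I}| \\
&\approx \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{V}_0| - \frac{1}{2}\ln|\mathbf{H}| \\
&\approx \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{H}| + \text{const}
\end{aligned}
\]

where we have used the determinant property $|\mathbf{A}|\cdot|\mathbf{B}| = |\mathbf{A}\mathbf{B}|$ and the fact that the prior is broad, i.e., $\mathbf{I}$ can be neglected compared with $\mathbf{V}_0\mathbf{H}$. What's more, since the prior is given in advance, we can treat $\mathbf{V}_0$ as constant. If the data set is large, we can write:

\[
\mathbf{H} = \sum_{n=1}^{N}\mathbf{H}_n = N\widehat{\mathbf{H}}
\]

where $\widehat{\mathbf{H}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{H}_n$, and then

\[
\begin{aligned}
\ln p(D) &\approx \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{H}| + \text{const} \\
&\approx \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|N\widehat{\mathbf{H}}| + \text{const} \\
&\approx \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{1}{2}(\boldsymbol{\theta}_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\boldsymbol{\theta}_{MAP}-\mathbf{m}) - \frac{M}{2}\ln N - \frac{1}{2}\ln|\widehat{\mathbf{H}}| + \text{const} \\
&\approx \ln p(D|\boldsymbol{\theta}_{MAP}) - \frac{M}{2}\ln N
\end{aligned}
\]

This is because when $N \gg 1$, the other terms can be neglected.

Problem 4.24 Solution(Waiting for updating)

Problem 4.25 Solution

We first need to obtain the first derivative of the probit function $\Phi(\lambda a)$ with regard to $a$. According to (4.114), we can write down:

\[
\frac{d}{da}\Phi(\lambda a) = \frac{d\Phi(\lambda a)}{d(\lambda a)}\cdot\frac{d(\lambda a)}{da} = \frac{\lambda}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(\lambda a)^2\right\}
\]

which further gives:

\[
\frac{d}{da}\Phi(\lambda a)\Big|_{a=0} = \frac{\lambda}{\sqrt{2\pi}}
\]

And for the logistic sigmoid function, according to (4.88), we have at $a = 0$

\[
\frac{d\sigma}{da} = \sigma(1-\sigma) = 0.5 \times 0.5 = \frac{1}{4}
\]

where we have used $\sigma(0) = 0.5$. Setting the two derivatives at the origin equal, we have:

\[
\frac{\lambda}{\sqrt{2\pi}} = \frac{1}{4}
\]

i.e., $\lambda = \sqrt{2\pi}/4$, and hence $\lambda^2 = \pi/8$.

Problem 4.26 Solution

We will prove (4.152) in a simpler and more intuitive way.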
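But first, we need to prove a trivial yet useful statement: suppose we have a normally distributed random variable $X \sim \mathcal{N}(X|\mu, \sigma^2)$; then the probability of $X \leq x$ is $P(X \leq x) = \Phi\left(\frac{x-\mu}{\sigma}\right)$, where $x$ is a given real number. We can see this by writing down the integral:

\[
\begin{aligned}
P(X \leq x) &= \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}(X-\mu)^2\right]dX \\
&= \int_{-\infty}^{\frac{x-\mu}{\sigma}}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\gamma^2\right)\sigma\,d\gamma \\
&= \int_{-\infty}^{\frac{x-\mu}{\sigma}}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\gamma^2\right)d\gamma \\
&= \Phi\left(\frac{x-\mu}{\sigma}\right)
\end{aligned}
\]

where we have changed the variable via $X = \mu + \sigma\gamma$. Now consider two independent random variables $X \sim \mathcal{N}(0, \lambda^{-2})$ and $Y \sim \mathcal{N}(\mu, \sigma^2)$. We first calculate the conditional probability $P(X \leq Y \mid Y = a)$:

\[
P(X \leq Y \mid Y = a) = P(X \leq a) = \Phi\left(\frac{a-0}{\lambda^{-1}}\right) = \Phi(\lambda a)
\]

Together with the law of total probability, we can obtain:

\[
P(X \leq Y) = \int_{-\infty}^{+\infty}P(X \leq Y \mid Y = a)\,\mathrm{pdf}(Y = a)\,da = \int_{-\infty}^{+\infty}\Phi(\lambda a)\,\mathcal{N}(a|\mu, \sigma^2)\,da
\]

where $\mathrm{pdf}(\cdot)$ denotes the probability density function and we have also used $\mathrm{pdf}(Y) = \mathcal{N}(\mu, \sigma^2)$. What's more, we know that $X - Y$ is also normally distributed, with:

\[
\mathbb{E}[X-Y] = \mathbb{E}[X] - \mathbb{E}[Y] = 0 - \mu = -\mu
\]

and

\[
\mathrm{var}[X-Y] = \mathrm{var}[X] + \mathrm{var}[Y] = \lambda^{-2} + \sigma^2
\]

Therefore, $X - Y \sim \mathcal{N}(-\mu, \lambda^{-2}+\sigma^2)$ and it follows that:

\[
P(X - Y \leq 0) = \Phi\left(\frac{0-(-\mu)}{\sqrt{\lambda^{-2}+\sigma^2}}\right) = \Phi\left(\frac{\mu}{\sqrt{\lambda^{-2}+\sigma^2}}\right)
\]

Since $P(X \leq Y) = P(X - Y \leq 0)$, we obtain what has been required.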
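The identity proved in Problem 4.26 can also be checked numerically. The sketch below is only an illustration under stated assumptions: the function names, the midpoint-rule integrator and the truncation of the integration range to twelve standard deviations are all choices made here, not part of the original solution. It uses the standard error function from Python's `math` module, which differs from the erf defined in (4.115).

```python
import math


def std_normal_cdf(x):
    # Probit function Phi(x), written with the standard error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(a, mu, sigma):
    # Density of N(a | mu, sigma^2).
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def probit_gaussian_integral(lam, mu, sigma, half_width=12.0, steps=20000):
    # Midpoint-rule approximation of the integral of
    # Phi(lam * a) * N(a | mu, sigma^2) over a, truncated to
    # mu +/- half_width * sigma (an assumption of this sketch).
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        a = lo + (i + 0.5) * h
        total += std_normal_cdf(lam * a) * normal_pdf(a, mu, sigma)
    return total * h


def closed_form(lam, mu, sigma):
    # Right-hand side: Phi(mu / sqrt(lam^-2 + sigma^2)).
    return std_normal_cdf(mu / math.sqrt(lam ** -2 + sigma ** 2))
```

For example, `probit_gaussian_integral(0.7, 0.5, 1.3)` and `closed_form(0.7, 0.5, 1.3)` should agree to within the accuracy of the quadrature.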
0.5 Neural Networks

Problem 5.1 Solution

Based on the definition of $\tanh(\cdot)$, we can obtain:

\[
\begin{aligned}
\tanh(a) &= \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} \\
&= -1 + \frac{2e^{a}}{e^{a}+e^{-a}} \\
&= -1 + 2\,\frac{1}{1+e^{-2a}} \\
&= 2\sigma(2a) - 1
\end{aligned}
\]

Suppose we have parameters $w_{ji}^{(1s)}$, $w_{j0}^{(1s)}$, $w_{kj}^{(2s)}$ and $w_{k0}^{(2s)}$ for a network whose hidden units use the logistic sigmoid function as activation, and $w_{ji}^{(1t)}$, $w_{j0}^{(1t)}$, $w_{kj}^{(2t)}$ and $w_{k0}^{(2t)}$ for another one using $\tanh(\cdot)$. For the network using $\tanh(\cdot)$ as activation, we can write down the following expression by using (5.4):

\[
\begin{aligned}
a_k^{(t)} &= \sum_{j=1}^{M} w_{kj}^{(2t)}\tanh(a_j^{(t)}) + w_{k0}^{(2t)} \\
&= \sum_{j=1}^{M} w_{kj}^{(2t)}\left[2\sigma(2a_j^{(t)}) - 1\right] + w_{k0}^{(2t)} \\
&= \sum_{j=1}^{M} 2w_{kj}^{(2t)}\sigma(2a_j^{(t)}) + \left[-\sum_{j=1}^{M} w_{kj}^{(2t)} + w_{k0}^{(2t)}\right]
\end{aligned}
\]

What's more, we also have:

\[
a_k^{(s)} = \sum_{j=1}^{M} w_{kj}^{(2s)}\sigma(a_j^{(s)}) + w_{k0}^{(2s)}
\]

To make the two networks equivalent, i.e., $a_k^{(s)} = a_k^{(t)}$, we should make sure that:

\[
\begin{cases}
a_j^{(s)} = 2a_j^{(t)} \\
w_{kj}^{(2s)} = 2w_{kj}^{(2t)} \\
w_{k0}^{(2s)} = -\sum_{j=1}^{M} w_{kj}^{(2t)} + w_{k0}^{(2t)}
\end{cases}
\]

Note that the first condition can be achieved by simply enforcing:

\[
w_{ji}^{(1s)} = 2w_{ji}^{(1t)}, \quad\text{and}\quad w_{j0}^{(1s)} = 2w_{j0}^{(1t)}
\]

Therefore, these two networks are equivalent under a linear transformation of their parameters.

Problem 5.2 Solution

It is obvious. We write down the likelihood.

\[
p(\mathbf{T}|\mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}(\mathbf{t}_n|\mathbf{y}(\mathbf{x}_n, \mathbf{w}), \beta^{-1}\mathbf{I})
\]

Taking the negative logarithm, we can obtain:

\[
E(\mathbf{w}, \beta) = -\ln p(\mathbf{T}|\mathbf{X}, \mathbf{w}) = \frac{\beta}{2}\sum_{n=1}^{N}\left[(\mathbf{y}(\mathbf{x}_n, \mathbf{w})-\mathbf{t}_n)^T(\mathbf{y}(\mathbf{x}_n, \mathbf{w})-\mathbf{t}_n)\right] - \frac{NK}{2}\ln\beta + \text{const}
\]

Here we have used const to denote the terms independent of both $\mathbf{w}$ and $\beta$, and we have used the definition of the multivariate Gaussian distribution. What's more, we see that the covariance matrix $\beta^{-1}\mathbf{I}$ and the weight parameter $\mathbf{w}$ have decoupled, which is distinct from the next problem. We can first solve for $\mathbf{w}_{ML}$ by minimizing the first term on the right of the equation above, or equivalently (5.11), i.e., treating $\beta$ as fixed. Then, by setting the derivative of $E(\mathbf{w}, \beta)$ with regard to $\beta$ to zero, we obtain (5.17) and hence $\beta_{ML}$.
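The weight mapping derived in Problem 5.1 can be illustrated with a small numerical sketch. The two-layer network layout, the function names `forward` and `tanh_to_sigmoid_params`, and the list-of-lists weight representation are assumptions made here for illustration only.

```python
import math


def sigmoid(a):
    # Logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w1, b1, w2, b2, act):
    # Two-layer network with linear outputs: a_k = sum_j w2[k][j] * act(a_j) + b2[k].
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * z for w, z in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def tanh_to_sigmoid_params(w1, b1, w2, b2):
    # Map the parameters of a tanh network to an equivalent sigmoid network:
    # first-layer weights and biases are doubled, second-layer weights are
    # doubled, and each output bias is reduced by the sum of the original
    # second-layer weights feeding that output.
    w1s = [[2.0 * w for w in row] for row in w1]
    b1s = [2.0 * b for b in b1]
    w2s = [[2.0 * w for w in row] for row in w2]
    b2s = [b - sum(row) for row, b in zip(w2, b2)]
    return w1s, b1s, w2s, b2s
```

Calling `forward(x, w1, b1, w2, b2, math.tanh)` and `forward(x, *tanh_to_sigmoid_params(w1, b1, w2, b2), sigmoid)` on the same input should give the same outputs up to rounding.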
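Problem 5.3 Solution

Following the process in the previous question, we first write down the negative logarithm of the likelihood function.

\[
E(\mathbf{w}, \boldsymbol{\Sigma}) = \frac{1}{2}\sum_{n=1}^{N}\left\{[\mathbf{y}(\mathbf{x}_n, \mathbf{w})-\mathbf{t}_n]^T\boldsymbol{\Sigma}^{-1}[\mathbf{y}(\mathbf{x}_n, \mathbf{w})-\mathbf{t}_n]\right\} + \frac{N}{2}\ln|\boldsymbol{\Sigma}| + \text{const} \qquad (*)
\]

Note that here we have assumed $\boldsymbol{\Sigma}$ is unknown, and const denotes the terms independent of both $\mathbf{w}$ and $\boldsymbol{\Sigma}$. In the first situation, if $\boldsymbol{\Sigma}$ is fixed and known, the equation above reduces to:

\[
E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{[\mathbf{y}(\mathbf{x}_n, \mathbf{w})-\mathbf{t}_n]^T\boldsymbol{\Sigma}^{-1}[\mathbf{y}(\mathbf{x}_n, \mathbf{w})-\mathbf{t}_n]\right\} + \text{const}
\]

We can simply solve for $\mathbf{w}_{ML}$ by minimizing it. If $\boldsymbol{\Sigma}$ is unknown, then since $\boldsymbol{\Sigma}$ appears in the first term on the right of $(*)$, solving for $\mathbf{w}_{ML}$ will involve $\boldsymbol{\Sigma}$. Note that in the previous problem, the main reason they decouple is the independence assumption, i.e., $\boldsymbol{\Sigma}$ reduces to $\beta^{-1}\mathbf{I}$, so that we can bring $\beta$ to the front and view it as a fixed multiplicative factor when solving for $\mathbf{w}_{ML}$.

Problem 5.4 Solution

Based on (5.20), the conditional distribution of targets given input $\mathbf{x}$ and weight $\mathbf{w}$, taking mislabelling into account, is:

\[
p(t=1|\mathbf{x}, \mathbf{w}) = (1-\epsilon)\cdot p(t_r=1|\mathbf{x}, \mathbf{w}) + \epsilon\cdot p(t_r=0|\mathbf{x}, \mathbf{w})
\]

Note that here we use $t$ to denote the observed target label and $t_r$ to denote its real label, and that our network aims to predict the real label $t_r$, not $t$, i.e., $p(t_r=1|\mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})$. Hence we see that:

\[
p(t=1|\mathbf{x}, \mathbf{w}) = (1-\epsilon)\cdot y(\mathbf{x}, \mathbf{w}) + \epsilon\cdot\left[1-y(\mathbf{x}, \mathbf{w})\right] \qquad (*)
\]

The same holds for $p(t=0|\mathbf{x}, \mathbf{w})$:

\[
p(t=0|\mathbf{x}, \mathbf{w}) = (1-\epsilon)\cdot\left[1-y(\mathbf{x}, \mathbf{w})\right] + \epsilon\cdot y(\mathbf{x}, \mathbf{w}) \qquad (**)
\]

Combining $(*)$ and $(**)$, we can obtain:

\[
p(t|\mathbf{x}, \mathbf{w}) = (1-\epsilon)\cdot y^{t}(1-y)^{1-t} + \epsilon\cdot(1-y)^{t}y^{1-t}
\]

where $y$ is short for $y(\mathbf{x}, \mathbf{w})$. Therefore, taking the negative logarithm, we obtain the error function:

\[
E(\mathbf{w}) = -\sum_{n=1}^{N}\ln\left\{(1-\epsilon)\cdot y_n^{t_n}(1-y_n)^{1-t_n} + \epsilon\cdot(1-y_n)^{t_n}y_n^{1-t_n}\right\}
\]

When $\epsilon = 0$, the equation above reduces to (5.21).

Problem 5.5 Solution

It is obvious by using (5.22).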
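\[
\begin{aligned}
E(\mathbf{w}) &= -\ln\prod_{n=1}^{N}p(\mathbf{t}|\mathbf{x}_n, \mathbf{w}) \\
&= -\ln\prod_{n=1}^{N}\prod_{k=1}^{K}y_k(\mathbf{x}_n, \mathbf{w})^{t_{nk}}\left[1-y_k(\mathbf{x}_n, \mathbf{w})\right]^{1-t_{nk}} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\ln\left\{y_k(\mathbf{x}_n, \mathbf{w})^{t_{nk}}\left[1-y_k(\mathbf{x}_n, \mathbf{w})\right]^{1-t_{nk}}\right\} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\ln\left[y_{nk}^{t_{nk}}(1-y_{nk})^{1-t_{nk}}\right] \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\left\{t_{nk}\ln y_{nk} + (1-t_{nk})\ln(1-y_{nk})\right\}
\end{aligned}
\]

where we have denoted $y_{nk} = y_k(\mathbf{x}_n, \mathbf{w})$.

Problem 5.6 Solution

We know that $y_k = \sigma(a_k)$, where $\sigma(\cdot)$ represents the logistic sigmoid function. Moreover,

\[ \frac{d\sigma}{da} = \sigma(1-\sigma) \]

Therefore,

\[
\begin{aligned}
\frac{dE(\mathbf{w})}{da_k} &= -t_k\frac{1}{y_k}\left[y_k(1-y_k)\right] + (1-t_k)\frac{1}{1-y_k}\left[y_k(1-y_k)\right] \\
&= \left[y_k(1-y_k)\right]\left[\frac{1-t_k}{1-y_k} - \frac{t_k}{y_k}\right] \\
&= (1-t_k)y_k - t_k(1-y_k) \\
&= y_k - t_k
\end{aligned}
\]

Just as required.

Problem 5.7 Solution

It is similar to the previous problem. First we denote $y_{kn} = y_k(\mathbf{x}_n, \mathbf{w})$. If we use the softmax function as activation for the output units, then according to (4.106) we have:

\[
\frac{dy_{kn}}{da_j} = y_{kn}(I_{kj}-y_{jn})
\]

Therefore,

\[
\begin{aligned}
\frac{dE(\mathbf{w})}{da_j} &= \frac{d}{da_j}\left\{-\sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}\ln y_k(\mathbf{x}_n, \mathbf{w})\right\} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{d}{da_j}\left\{t_{kn}\ln y_{kn}\right\} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}\frac{1}{y_{kn}}\left[y_{kn}(I_{kj}-y_{jn})\right] \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}(t_{kn}I_{kj} - t_{kn}y_{jn}) \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}I_{kj} + \sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}y_{jn} \\
&= -\sum_{n=1}^{N}t_{jn} + \sum_{n=1}^{N}y_{jn} \\
&= \sum_{n=1}^{N}(y_{jn}-t_{jn})
\end{aligned}
\]

where we have used the fact that $I_{kj} = 1$ only when $k = j$ (and is 0 otherwise), and that $\sum_{k=1}^{K}t_{kn} = 1$.

Problem 5.8 Solution

It is obvious based on the definition of tanh, i.e., (5.59).

\[
\begin{aligned}
\frac{d}{da}\tanh(a) &= \frac{(e^{a}+e^{-a})(e^{a}+e^{-a}) - (e^{a}-e^{-a})(e^{a}-e^{-a})}{(e^{a}+e^{-a})^2} \\
&= 1 - \frac{(e^{a}-e^{-a})^2}{(e^{a}+e^{-a})^2} \\
&= 1 - \tanh(a)^2
\end{aligned}
\]

Problem 5.9 Solution

We know that the logistic sigmoid function satisfies $\sigma(a) \in [0, 1]$. Therefore, if we perform the linear transformation $h(a) = 2\sigma(a) - 1$, we obtain a mapping $h(a)$ from $(-\infty, +\infty)$ to $[-1, 1]$. In this case, the conditional distribution of targets given inputs can be similarly written as:

\[
p(t|\mathbf{x}, \mathbf{w}) = \left[\frac{1+y(\mathbf{x}, \mathbf{w})}{2}\right]^{(1+t)/2}\left[\frac{1-y(\mathbf{x}, \mathbf{w})}{2}\right]^{(1-t)/2}
\]

where $\left[1+y(\mathbf{x}, \mathbf{w})\right]/2$ represents the conditional probability $p(\mathcal{C}_1|\mathbf{x})$.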
Since now $y(\mathbf{x}, \mathbf{w}) \in [-1, 1]$, we need this linear transformation so that the result satisfies the constraints on a probability. Then we can further obtain:

\[
\begin{aligned}
E(\mathbf{w}) &= -\sum_{n=1}^{N}\left\{\frac{1+t_n}{2}\ln\frac{1+y_n}{2} + \frac{1-t_n}{2}\ln\frac{1-y_n}{2}\right\} \\
&= -\frac{1}{2}\sum_{n=1}^{N}\left\{(1+t_n)\ln(1+y_n) + (1-t_n)\ln(1-y_n)\right\} + N\ln 2
\end{aligned}
\]

Problem 5.10 Solution

It is obvious. Suppose $\mathbf{H}$ is positive definite, i.e., (5.37) holds. We set $\mathbf{v}$ equal to an eigenvector of $\mathbf{H}$, i.e., $\mathbf{v} = \mathbf{u}_i$, which gives:

\[
\mathbf{v}^T\mathbf{H}\mathbf{v} = \mathbf{v}^T(\mathbf{H}\mathbf{v}) = \mathbf{u}_i^T\lambda_i\mathbf{u}_i = \lambda_i\|\mathbf{u}_i\|^2
\]

Therefore, every $\lambda_i$ must be positive. On the other hand, if all the eigenvalues $\lambda_i$ are positive, we see from (5.38) and (5.39) that $\mathbf{H}$ is positive definite.

Problem 5.11 Solution

It is obvious. We follow (5.35) and then write the error function in the form of (5.36). To obtain a contour, we set $E(\mathbf{w})$ equal to a constant $C$.

\[
E(\mathbf{w}) = E(\mathbf{w}^*) + \frac{1}{2}\sum_i\lambda_i\alpha_i^2 = C
\]

We rearrange the equation above, and then obtain:

\[
\sum_i\lambda_i\alpha_i^2 = B
\]

where $B = 2C - 2E(\mathbf{w}^*)$ is a constant. Therefore, the contours of constant error are ellipses whose axes are aligned with the eigenvectors $\mathbf{u}_i$ of the Hessian matrix $\mathbf{H}$. The length of the $j$th axis is given by setting $\alpha_i = 0$ for all $i \neq j$:

\[
\alpha_j = \sqrt{\frac{B}{\lambda_j}}
\]

In other words, the length is inversely proportional to the square root of the corresponding eigenvalue $\lambda_j$.

Problem 5.12 Solution

If $\mathbf{H}$ is positive definite, we know the second term on the right side of (5.32) is positive for arbitrary $\mathbf{w} \neq \mathbf{w}^*$. Therefore, $E(\mathbf{w}^*)$ is a local minimum. On the other hand, if $\mathbf{w}^*$ is a local minimum, we have

\[
E(\mathbf{w}^*) - E(\mathbf{w}) = -\frac{1}{2}(\mathbf{w}-\mathbf{w}^*)^T\mathbf{H}(\mathbf{w}-\mathbf{w}^*) < 0
\]

In other words, for arbitrary $\mathbf{w} \neq \mathbf{w}^*$ we have $(\mathbf{w}-\mathbf{w}^*)^T\mathbf{H}(\mathbf{w}-\mathbf{w}^*) > 0$, which by the definition used in the previous problems means that $\mathbf{H}$ is positive definite.

Problem 5.13 Solution

It is obvious. Suppose that there are $W$ adaptive parameters in the network. Then $\mathbf{b}$ has $W$ independent parameters. Since $\mathbf{H}$ is symmetric, it has $W(W+1)/2$ independent parameters.
Therefore, there are $W + W(W+1)/2 = W(W+3)/2$ parameters in total.

Problem 5.14 Solution

It is obvious. Since we have

\[
E_n(w_{ji}+\epsilon) = E_n(w_{ji}) + \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)
\]

and

\[
E_n(w_{ji}-\epsilon) = E_n(w_{ji}) - \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)
\]

we subtract the second equation from the first, which gives

\[
E_n(w_{ji}+\epsilon) - E_n(w_{ji}-\epsilon) = 2\epsilon E_n'(w_{ji}) + O(\epsilon^3)
\]

Rearranging the equation above, we obtain what has been required.

Problem 5.15 Solution

It is obvious. The back propagation formalism starts from performing summation near the input, as shown in (5.73). By symmetry, the forward propagation formalism should start near the output.

\[
J_{ki} = \frac{\partial y_k}{\partial x_i} = \frac{\partial h(a_k)}{\partial x_i} = h'(a_k)\frac{\partial a_k}{\partial x_i} \qquad (*)
\]

where $h(\cdot)$ is the activation function at the output node $a_k$. Considering all the units $j$ which have links to unit $k$:

\[
\frac{\partial a_k}{\partial x_i} = \sum_j\frac{\partial a_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} = \sum_j w_{kj}h'(a_j)\frac{\partial a_j}{\partial x_i} \qquad (**)
\]

where we have used:

\[
a_k = \sum_j w_{kj}z_j, \qquad z_j = h(a_j)
\]

It is similar for $\partial a_j/\partial x_i$. In this way we have obtained a recursive formula starting from the input nodes:

\[
\frac{\partial a_l}{\partial x_i} =
\begin{cases}
w_{li}, & \text{if there is a link from input unit } i \text{ to } l \\
0, & \text{if there isn't a link from input unit } i \text{ to } l
\end{cases}
\]

Using the recursive formula $(**)$ and then $(*)$, we can obtain the Jacobian matrix.

Problem 5.16 Solution

It is obvious. We begin by writing down the error function.

\[
E = \frac{1}{2}\sum_{n=1}^{N}\|\mathbf{y}_n-\mathbf{t}_n\|^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})^2
\]

where the subscript $m$ denotes the $m$th element of the vector. Then we can write down the Hessian matrix as before.

\[
\mathbf{H} = \nabla\nabla E = \sum_{n=1}^{N}\sum_{m=1}^{M}\nabla y_{n,m}(\nabla y_{n,m})^T + \sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})\nabla\nabla y_{n,m}
\]

Similarly, we now know that the Hessian matrix can be approximated as:

\[
\mathbf{H} \simeq \sum_{n=1}^{N}\sum_{m=1}^{M}\mathbf{b}_{n,m}\mathbf{b}_{n,m}^T
\]

where we have defined:

\[
\mathbf{b}_{n,m} = \nabla y_{n,m}
\]

Problem 5.17 Solution

It is obvious.
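\[
\begin{aligned}
\frac{\partial^2E}{\partial w_r\partial w_s} &= \frac{\partial}{\partial w_r}\,\frac{1}{2}\iint 2(y-t)\frac{\partial y}{\partial w_s}\,p(\mathbf{x}, t)\,d\mathbf{x}\,dt \\
&= \iint\left[(y-t)\frac{\partial^2y}{\partial w_r\partial w_s} + \frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\right]p(\mathbf{x}, t)\,d\mathbf{x}\,dt
\end{aligned}
\]

Since we know that

\[
\begin{aligned}
\iint(y-t)\frac{\partial^2y}{\partial w_r\partial w_s}\,p(\mathbf{x}, t)\,d\mathbf{x}\,dt &= \iint(y-t)\frac{\partial^2y}{\partial w_r\partial w_s}\,p(t|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}\,dt \\
&= \int\frac{\partial^2y}{\partial w_r\partial w_s}\left\{\int(y-t)\,p(t|\mathbf{x})\,dt\right\}p(\mathbf{x})\,d\mathbf{x} \\
&= 0
\end{aligned}
\]

Note that in the last step, we have used $y = \int t\,p(t|\mathbf{x})\,dt$. Then we substitute it into the second derivative, which gives

\[
\begin{aligned}
\frac{\partial^2E}{\partial w_r\partial w_s} &= \iint\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(\mathbf{x}, t)\,d\mathbf{x}\,dt \\
&= \int\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(\mathbf{x})\,d\mathbf{x}
\end{aligned}
\]

Problem 5.18 Solution

By analogy with section 5.3.2, we denote by $w_{ki}^{\text{skip}}$ the parameters corresponding to skip-layer connections, i.e., $w_{ki}^{\text{skip}}$ connects the input unit $i$ with the output unit $k$. Note that the discussion in section 5.3.2 is still correct, and now we only need to obtain the derivative of the error function with respect to the additional parameters $w_{ki}^{\text{skip}}$.

\[
\frac{\partial E_n}{\partial w_{ki}^{\text{skip}}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ki}^{\text{skip}}} = \delta_kx_i
\]

where we have used $a_k = y_k$ due to the linear activation at the output units and:

\[
y_k = \sum_{j=0}^{M}w_{kj}^{(2)}z_j + \sum_i w_{ki}^{\text{skip}}x_i
\]

where the first term on the right side corresponds to the information conveyed from the hidden units to the output, and the second term corresponds to the information conveyed directly from the input to the output.

Problem 5.19 Solution

The error function is given by (5.21). Therefore, we can obtain:

\[
\begin{aligned}
\nabla E(\mathbf{w}) &= \sum_{n=1}^{N}\frac{\partial E}{\partial a_n}\nabla a_n \\
&= -\sum_{n=1}^{N}\frac{\partial}{\partial a_n}\left[t_n\ln y_n + (1-t_n)\ln(1-y_n)\right]\nabla a_n \\
&= -\sum_{n=1}^{N}\left\{\frac{\partial(t_n\ln y_n)}{\partial y_n}\frac{\partial y_n}{\partial a_n} + \frac{\partial\left[(1-t_n)\ln(1-y_n)\right]}{\partial y_n}\frac{\partial y_n}{\partial a_n}\right\}\nabla a_n \\
&= -\sum_{n=1}^{N}\left[\frac{t_n}{y_n}\cdot y_n(1-y_n) + (1-t_n)\frac{-1}{1-y_n}\cdot y_n(1-y_n)\right]\nabla a_n \\
&= -\sum_{n=1}^{N}\left[t_n(1-y_n) - (1-t_n)y_n\right]\nabla a_n \\
&= \sum_{n=1}^{N}(y_n-t_n)\nabla a_n
\end{aligned}
\]

where we have used the conclusion of Problem 5.6. Now we calculate the second derivative.

\[
\nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N}\left\{y_n(1-y_n)\nabla a_n(\nabla a_n)^T + (y_n-t_n)\nabla\nabla a_n\right\}
\]

Similarly, we can drop the last term, which gives exactly what has been asked.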
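The gradient and outer-product Hessian from Problem 5.19 can be illustrated on the simplest possible case. The sketch below assumes a linear output activation $a_n = \mathbf{w}^T\mathbf{x}_n$, so that $\nabla a_n = \mathbf{x}_n$ and $\nabla\nabla a_n = 0$ and the outer-product form is exact rather than approximate; the function names and the list-based data layout are choices made here for illustration.

```python
import math


def sigmoid(a):
    # Logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-a))


def activation(w, x):
    # Assumed linear pre-activation a_n = w^T x_n.
    return sum(wi * xi for wi, xi in zip(w, x))


def cross_entropy(w, xs, ts):
    # Binary cross-entropy error for sigmoid outputs.
    total = 0.0
    for x, t in zip(xs, ts):
        y = sigmoid(activation(w, x))
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total


def gradient(w, xs, ts):
    # grad E = sum_n (y_n - t_n) * grad a_n, with grad a_n = x_n here.
    g = [0.0] * len(w)
    for x, t in zip(xs, ts):
        y = sigmoid(activation(w, x))
        for i, xi in enumerate(x):
            g[i] += (y - t) * xi
    return g


def outer_product_hessian(w, xs):
    # H = sum_n y_n (1 - y_n) * grad a_n * grad a_n^T.
    size = len(w)
    h = [[0.0] * size for _ in range(size)]
    for x in xs:
        y = sigmoid(activation(w, x))
        for i in range(size):
            for j in range(size):
                h[i][j] += y * (1 - y) * x[i] * x[j]
    return h
```

A central finite difference of `cross_entropy` should reproduce `gradient`, and a finite difference of `gradient` should reproduce `outer_product_hessian`. For a genuinely nonlinear network the dropped term $(y_n - t_n)\nabla\nabla a_n$ would make the second comparison only approximate.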
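Problem 5.20 Solution(waiting for update)

We begin by writing down the error function.

\[
E(\mathbf{w}) = -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{nk}
\]

Here we assume that the output of the network has $K$ units in total and that there are $W$ weight parameters in the network. We first calculate the first derivative:

\[
\begin{aligned}
\nabla E &= \sum_{n=1}^{N}\frac{dE}{d\mathbf{a}_n}\cdot\nabla\mathbf{a}_n \\
&= -\sum_{n=1}^{N}\left[\frac{d}{d\mathbf{a}_n}\sum_{k=1}^{K}(t_{nk}\ln y_{nk})\right]\cdot\nabla\mathbf{a}_n \\
&= \sum_{n=1}^{N}\mathbf{c}_n\cdot\nabla\mathbf{a}_n
\end{aligned}
\]

Note that here $\mathbf{c}_n = dE/d\mathbf{a}_n$ is a vector with size $K \times 1$ and $\nabla\mathbf{a}_n$ is a matrix with size $K \times W$. Moreover, the operator $\cdot$ denotes the inner product, which gives $\nabla E$ as a vector with size $1 \times W$. According to (4.106), we can obtain the $j$th element of $\mathbf{c}_n$:

\[
\begin{aligned}
c_{n,j} &= -\frac{\partial}{\partial a_j}\left(\sum_{k=1}^{K}t_{nk}\ln y_{nk}\right) \\
&= -\sum_{k=1}^{K}\frac{\partial}{\partial a_j}(t_{nk}\ln y_{nk}) \\
&= -\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}\,y_{nk}(I_{kj}-y_{nj}) \\
&= -\sum_{k=1}^{K}t_{nk}I_{kj} + \sum_{k=1}^{K}t_{nk}y_{nj} \\
&= -t_{nj} + y_{nj}\left(\sum_{k=1}^{K}t_{nk}\right) \\
&= y_{nj} - t_{nj}
\end{aligned}
\]

Now we calculate the second derivative:

\[
\nabla\nabla E = \sum_{n=1}^{N}\left[\left(\frac{d\mathbf{c}_n}{d\mathbf{a}_n}\nabla\mathbf{a}_n\right)\cdot\nabla\mathbf{a}_n + \mathbf{c}_n\nabla\nabla\mathbf{a}_n\right]
\]

Here $d\mathbf{c}_n/d\mathbf{a}_n$ is a matrix with size $K \times K$. The second term can be neglected as before, which gives:

\[
\mathbf{H} = \sum_{n=1}^{N}\left(\frac{d\mathbf{c}_n}{d\mathbf{a}_n}\nabla\mathbf{a}_n\right)\cdot\nabla\mathbf{a}_n
\]

Problem 5.21 Solution

We first write down the expression of the Hessian matrix in the case of $K$ outputs.

\[
\mathbf{H}_{N,K} = \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbf{b}_{n,k}\mathbf{b}_{n,k}^T
\]

where $\mathbf{b}_{n,k} = \nabla_{\mathbf{w}}a_{n,k}$. Therefore, we have:

\[
\mathbf{H}_{N+1,K} = \mathbf{H}_{N,K} + \sum_{k=1}^{K}\mathbf{b}_{N+1,k}\mathbf{b}_{N+1,k}^T = \mathbf{H}_{N,K} + \mathbf{B}_{N+1}\mathbf{B}_{N+1}^T
\]

where $\mathbf{B}_{N+1} = [\mathbf{b}_{N+1,1}, \mathbf{b}_{N+1,2}, \ldots, \mathbf{b}_{N+1,K}]$ is a matrix with size $W \times K$, and here $W$ is the total number of parameters in the network. By analogy with (5.88)-(5.89), and noting that $\mathbf{B}_{N+1}^T\mathbf{H}_{N,K}^{-1}\mathbf{B}_{N+1}$ is now a $K \times K$ matrix so that the scalar division becomes a matrix inverse, we can obtain:

\[
\mathbf{H}_{N+1,K}^{-1} = \mathbf{H}_{N,K}^{-1} - \mathbf{H}_{N,K}^{-1}\mathbf{B}_{N+1}\left(\mathbf{I} + \mathbf{B}_{N+1}^T\mathbf{H}_{N,K}^{-1}\mathbf{B}_{N+1}\right)^{-1}\mathbf{B}_{N+1}^T\mathbf{H}_{N,K}^{-1} \qquad (*)
\]

Furthermore, similarly, we have:

\[
\mathbf{H}_{N+1,K+1} = \mathbf{H}_{N+1,K} + \sum_{n=1}^{N+1}\mathbf{b}_{n,K+1}\mathbf{b}_{n,K+1}^T = \mathbf{H}_{N+1,K} + \mathbf{B}_{K+1}\mathbf{B}_{K+1}^T
\]

where $\mathbf{B}_{K+1} = [\mathbf{b}_{1,K+1}, \mathbf{b}_{2,K+1}, \ldots, \mathbf{b}_{N+1,K+1}]$ is a matrix with size $W \times (N+1)$. Also, we can obtain:

\[
\mathbf{H}_{N+1,K+1}^{-1} = \mathbf{H}_{N+1,K}^{-1} - \mathbf{H}_{N+1,K}^{-1}\mathbf{B}_{K+1}\left(\mathbf{I} + \mathbf{B}_{K+1}^T\mathbf{H}_{N+1,K}^{-1}\mathbf{B}_{K+1}\right)^{-1}\mathbf{B}_{K+1}^T\mathbf{H}_{N+1,K}^{-1}
\]

where $\mathbf{H}_{N+1,K}^{-1}$ is defined by $(*)$.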
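If we substitute $(*)$ into the expression above, we can obtain the relationship between $\mathbf{H}_{N+1,K+1}^{-1}$ and $\mathbf{H}_{N,K}^{-1}$.

Problem 5.22 Solution

We begin by handling the first case.

\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{kj}^{(2)}\partial w_{k'j'}^{(2)}} &= \frac{\partial}{\partial w_{kj}^{(2)}}\left(\frac{\partial E_n}{\partial w_{k'j'}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\left(\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{k'j'}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\left(\frac{\partial E_n}{\partial a_{k'}}\frac{\partial\sum_{j'}w_{k'j'}^{(2)}z_{j'}}{\partial w_{k'j'}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\left(\frac{\partial E_n}{\partial a_{k'}}z_{j'}\right) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\left(\frac{\partial E_n}{\partial a_{k'}}\right)z_{j'} + \frac{\partial E_n}{\partial a_{k'}}\frac{\partial z_{j'}}{\partial w_{kj}^{(2)}} \\
&= \frac{\partial}{\partial a_k}\left(\frac{\partial E_n}{\partial a_{k'}}\right)\frac{\partial a_k}{\partial w_{kj}^{(2)}}z_{j'} + 0 \\
&= \frac{\partial^2E_n}{\partial a_k\partial a_{k'}}z_jz_{j'} \\
&= z_jz_{j'}M_{kk'}
\end{aligned}
\]

Then we focus on the second case, first with $j \neq j'$:

\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{j'i'}^{(1)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial w_{j'i'}^{(1)}}\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{j'i'}^{(1)}}\right) \\
&= \sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}h'(a_{j'})x_{i'}\right) \\
&= \sum_{k'}h'(a_{j'})x_{i'}\frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}\right) \\
&= \sum_{k'}h'(a_{j'})x_{i'}\sum_k\frac{\partial}{\partial a_k}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}\right)\frac{\partial a_k}{\partial w_{ji}^{(1)}} \\
&= \sum_{k'}h'(a_{j'})x_{i'}\sum_kM_{kk'}w_{k'j'}^{(2)}\cdot\left(w_{kj}^{(2)}h'(a_j)x_i\right) \\
&= x_{i'}x_ih'(a_{j'})h'(a_j)\sum_{k'}\sum_kw_{k'j'}^{(2)}w_{kj}^{(2)}M_{kk'}
\end{aligned}
\]

When $j = j'$, similarly we have:

\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{ji'}^{(1)}} &= \sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}h'(a_j)x_{i'}\right) \\
&= x_{i'}\sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\right)h'(a_j) + x_{i'}\sum_{k'}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\right)\frac{\partial h'(a_j)}{\partial w_{ji}^{(1)}} \\
&= x_{i'}x_ih'(a_j)h'(a_j)\sum_{k'}\sum_kw_{k'j}^{(2)}w_{kj}^{(2)}M_{kk'} + x_{i'}\sum_{k'}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\right)\frac{\partial h'(a_j)}{\partial w_{ji}^{(1)}} \\
&= x_{i'}x_ih'(a_j)h'(a_j)\sum_{k'}\sum_kw_{k'j}^{(2)}w_{kj}^{(2)}M_{kk'} + x_{i'}\sum_{k'}\left(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\right)h''(a_j)x_i \\
&= x_{i'}x_ih'(a_j)h'(a_j)\sum_{k'}\sum_kw_{k'j}^{(2)}w_{kj}^{(2)}M_{kk'} + h''(a_j)x_ix_{i'}\sum_{k'}\delta_{k'}w_{k'j}^{(2)}
\end{aligned}
\]

It seems that what we have obtained is slightly different from (5.94) when $j = j'$.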
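However, this is not the case, since the summation over $k'$ in the second term of our formulation and the summation over $k$ in the first term of (5.94) are actually the same (i.e., they both represent the summation over all the output units). Combining the cases $j = j'$ and $j \neq j'$, we obtain (5.94) just as required. Finally, we deal with the third case. Similarly, we first focus on $j \neq j'$:

\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{kj'}^{(2)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial w_{kj'}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj'}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}\frac{\partial\sum_{j'}w_{kj'}^{(2)}z_{j'}}{\partial w_{kj'}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}z_{j'}\right) \\
&= \sum_{k'}\frac{\partial}{\partial a_{k'}}\left(\frac{\partial E_n}{\partial a_k}\right)\frac{\partial a_{k'}}{\partial w_{ji}^{(1)}}z_{j'} \\
&= \sum_{k'}z_{j'}M_{kk'}w_{k'j}^{(2)}h'(a_j)x_i \\
&= x_ih'(a_j)z_{j'}\sum_{k'}M_{kk'}w_{k'j}^{(2)}
\end{aligned}
\]

Note that in (5.95) there are two typos: (i) $H_{kk'}$ should be $M_{kk'}$; (ii) $j$ should exchange position with $j'$ on the right side of (5.95). When $j = j'$, we have:

\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{kj}^{(2)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial w_{kj}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}\frac{\partial\sum_jw_{kj}^{(2)}z_j}{\partial w_{kj}^{(2)}}\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}z_j\right) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\left(\frac{\partial E_n}{\partial a_k}\right)z_j + \frac{\partial E_n}{\partial a_k}\frac{\partial z_j}{\partial w_{ji}^{(1)}} \\
&= x_ih'(a_j)z_j\sum_{k'}M_{kk'}w_{k'j}^{(2)} + \frac{\partial E_n}{\partial a_k}\frac{\partial z_j}{\partial w_{ji}^{(1)}} \\
&= x_ih'(a_j)z_j\sum_{k'}M_{kk'}w_{k'j}^{(2)} + \delta_kh'(a_j)x_i
\end{aligned}
\]

Combining these two situations, we obtain (5.95) just as required.

Problem 5.23 Solution

It is similar to the previous problem.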
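\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{k'i'}\partial w_{kj}} &= \frac{\partial}{\partial w_{k'i'}}\left(\frac{\partial E_n}{\partial w_{kj}}\right) \\
&= \frac{\partial}{\partial w_{k'i'}}\left(\frac{\partial E_n}{\partial a_k}z_j\right) \\
&= z_j\frac{\partial a_{k'}}{\partial w_{k'i'}}\frac{\partial}{\partial a_{k'}}\left(\frac{\partial E_n}{\partial a_k}\right) \\
&= z_jx_{i'}M_{kk'}
\end{aligned}
\]

And

\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{k'i'}\partial w_{ji}} &= \frac{\partial}{\partial w_{k'i'}}\left(\sum_k\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}}\right) \\
&= \frac{\partial}{\partial w_{k'i'}}\left(\sum_k\frac{\partial E_n}{\partial a_k}w_{kj}h'(a_j)x_i\right) \\
&= \sum_kh'(a_j)x_iw_{kj}\frac{\partial}{\partial w_{k'i'}}\left(\frac{\partial E_n}{\partial a_k}\right) \\
&= \sum_kh'(a_j)x_iw_{kj}\frac{\partial}{\partial a_{k'}}\left(\frac{\partial E_n}{\partial a_k}\right)\frac{\partial a_{k'}}{\partial w_{k'i'}} \\
&= \sum_kh'(a_j)x_iw_{kj}M_{kk'}x_{i'} \\
&= x_ix_{i'}h'(a_j)\sum_kw_{kj}M_{kk'}
\end{aligned}
\]

Finally, we have

\[
\begin{aligned}
\frac{\partial^2E_n}{\partial w_{k'i'}\partial w_{ki}} &= \frac{\partial}{\partial w_{k'i'}}\left(\frac{\partial E_n}{\partial w_{ki}}\right) \\
&= \frac{\partial}{\partial w_{k'i'}}\left(\frac{\partial E_n}{\partial a_k}x_i\right) \\
&= \frac{\partial}{\partial a_{k'}}\left(\frac{\partial E_n}{\partial a_k}\right)\frac{\partial a_{k'}}{\partial w_{k'i'}}x_i \\
&= x_ix_{i'}M_{kk'}
\end{aligned}
\]

Problem 5.24 Solution

It is obvious. According to (5.113), we have:

\[
\begin{aligned}
\widetilde{a}_j &= \sum_i\widetilde{w}_{ji}\widetilde{x}_i + \widetilde{w}_{j0} \\
&= \sum_i\frac{1}{a}w_{ji}\cdot(ax_i+b) + w_{j0} - \frac{b}{a}\sum_iw_{ji} \\
&= \sum_iw_{ji}x_i + w_{j0} = a_j
\end{aligned}
\]

where we have used (5.115), (5.116) and (5.117). So far we have proved that under the transformation the hidden unit activation $a_j$ is unchanged. If the activation function at the hidden unit is also unchanged, we have $\widetilde{z}_j = z_j$. Now we deal with the output unit $\widetilde{y}_k$:

\[
\begin{aligned}
\widetilde{y}_k &= \sum_j\widetilde{w}_{kj}\widetilde{z}_j + \widetilde{w}_{k0} \\
&= \sum_jcw_{kj}\cdot z_j + cw_{k0} + d \\
&= c\left[\sum_jw_{kj}\cdot z_j + w_{k0}\right] + d \\
&= cy_k + d
\end{aligned}
\]

where we have used (5.114), (5.119) and (5.120). To be more specific, here we have proved that the linear transformation between $\widetilde{y}_k$ and $y_k$ can be achieved by making the transformations (5.119) and (5.120).

Problem 5.25 Solution

Since we know the gradient of the error function with respect to $\mathbf{w}$ is:

\[
\nabla E = \mathbf{H}(\mathbf{w}-\mathbf{w}^*)
\]

Together with (5.196), we can obtain:

\[
\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} - \rho\nabla E = \mathbf{w}^{(\tau-1)} - \rho\mathbf{H}(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*)
\]

Multiplying both sides by $\mathbf{u}_j^T$ and using $w_j = \mathbf{w}^T\mathbf{u}_j$, we can obtain:

\[
\begin{aligned}
w_j^{(\tau)} &= \mathbf{u}_j^T\left[\mathbf{w}^{(\tau-1)} - \rho\mathbf{H}(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*)\right] \\
&= w_j^{(\tau-1)} - \rho\,\mathbf{u}_j^T\mathbf{H}(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*) \\
&= w_j^{(\tau-1)} - \rho\eta_j\mathbf{u}_j^T(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*) \\
&= w_j^{(\tau-1)} - \rho\eta_j(w_j^{(\tau-1)}-w_j^*) \\
&= (1-\rho\eta_j)w_j^{(\tau-1)} + \rho\eta_jw_j^*
\end{aligned}
\]

where we have used (5.198).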
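The per-component recursion just derived is easy to iterate numerically. The sketch below is an illustration only: the function names are hypothetical, and it assumes the initial value $w_j^{(0)} = 0$. It compares the iterated value with the closed form $\left[1-(1-\rho\eta_j)^{\tau}\right]w_j^*$ stated in (5.197).

```python
def iterate_component(rho, eta, w_star, steps):
    # Apply w <- (1 - rho * eta) * w + rho * eta * w_star, starting from w = 0.
    w = 0.0
    for _ in range(steps):
        w = (1.0 - rho * eta) * w + rho * eta * w_star
    return w


def closed_form_component(rho, eta, w_star, steps):
    # Closed form [1 - (1 - rho * eta)^steps] * w_star.
    return (1.0 - (1.0 - rho * eta) ** steps) * w_star
```

With a large eigenvalue (for instance `rho=0.1, eta=2.0, steps=25`) the component is already close to `w_star`, while with a very small eigenvalue (for instance `eta=1e-4, steps=10`) it stays close to `steps * rho * eta * w_star`, which is much smaller in magnitude than `w_star`.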
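Then we use mathematical induction to prove (5.197), beginning by calculating $w_j^{(1)}$ with $w_j^{(0)} = 0$:

\[
\begin{aligned}
w_j^{(1)} &= (1-\rho\eta_j)w_j^{(0)} + \rho\eta_jw_j^* \\
&= \rho\eta_jw_j^* \\
&= \left[1-(1-\rho\eta_j)\right]w_j^*
\end{aligned}
\]

Suppose (5.197) holds for $\tau$; we now prove that it also holds for $\tau+1$.

\[
\begin{aligned}
w_j^{(\tau+1)} &= (1-\rho\eta_j)w_j^{(\tau)} + \rho\eta_jw_j^* \\
&= (1-\rho\eta_j)\left[1-(1-\rho\eta_j)^{\tau}\right]w_j^* + \rho\eta_jw_j^* \\
&= \left\{(1-\rho\eta_j)\left[1-(1-\rho\eta_j)^{\tau}\right] + \rho\eta_j\right\}w_j^* \\
&= \left[1-(1-\rho\eta_j)^{\tau+1}\right]w_j^*
\end{aligned}
\]

Hence (5.197) holds for $\tau = 1, 2, \ldots$. Provided $|1-\rho\eta_j| < 1$, we have $(1-\rho\eta_j)^{\tau} \to 0$ as $\tau \to \infty$ and thus $\mathbf{w}^{(\tau)} \to \mathbf{w}^*$. If $\tau$ is finite and $\eta_j \gg (\rho\tau)^{-1}$, the above argument still holds since $\tau$ is still relatively large. Conversely, when $\eta_j \ll (\rho\tau)^{-1}$, we expand the expression above:

\[
|w_j^{(\tau)}| = \left|\left[1-(1-\rho\eta_j)^{\tau}\right]w_j^*\right| \approx |\tau\rho\eta_jw_j^*| \ll |w_j^*|
\]

We can see that $(\rho\tau)^{-1}$ plays the role of the regularization parameter $\alpha$ in section 3.5.3.

Problem 5.26 Solution

Based on the definition, or by analogy with (5.128), we have:

\[
\begin{aligned}
\Omega_n &= \frac{1}{2}\sum_k\left(\frac{\partial y_{nk}}{\partial\xi}\Big|_{\xi=0}\right)^2 \\
&= \frac{1}{2}\sum_k\left(\sum_i\frac{\partial y_{nk}}{\partial x_i}\frac{\partial x_i}{\partial\xi}\Big|_{\xi=0}\right)^2 \\
&= \frac{1}{2}\sum_k\left(\sum_i\tau_i\frac{\partial}{\partial x_i}y_{nk}\right)^2
\end{aligned}
\]

where we have denoted

\[
\tau_i = \frac{\partial x_i}{\partial\xi}\Big|_{\xi=0}
\]

This is exactly the form given in (5.201) and (5.202) if the $n$th observation $y_{nk}$ is written as $y_k$ for short. Next, we define $\alpha_j$ and $\beta_j$ as in (5.205), where $z_j$ and $a_j$ are given by (5.203). Then we prove that (5.204) holds:

\[
\begin{aligned}
\alpha_j &= \sum_i\tau_i\frac{\partial z_j}{\partial x_i} = \sum_i\tau_i\frac{\partial h(a_j)}{\partial x_i} \\
&= \sum_i\tau_i\frac{\partial h(a_j)}{\partial a_j}\frac{\partial a_j}{\partial x_i} \\
&= h'(a_j)\sum_i\tau_i\frac{\partial}{\partial x_i}a_j = h'(a_j)\beta_j
\end{aligned}
\]

Moreover,

\[
\begin{aligned}
\beta_j &= \sum_i\tau_i\frac{\partial a_j}{\partial x_i} = \sum_i\tau_i\frac{\partial\sum_{i'}w_{ji'}z_{i'}}{\partial x_i} \\
&= \sum_i\tau_i\sum_{i'}\frac{\partial w_{ji'}z_{i'}}{\partial x_i} = \sum_i\tau_i\sum_{i'}w_{ji'}\frac{\partial z_{i'}}{\partial x_i} \\
&= \sum_{i'}w_{ji'}\sum_i\tau_i\frac{\partial z_{i'}}{\partial x_i} = \sum_{i'}w_{ji'}\alpha_{i'}
\end{aligned}
\]

So far we have proved that (5.204) holds, and now we aim to find a forward propagation formula to calculate $\Omega_n$.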
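We begin by evaluating $\{\beta_j\}$ at the input units, and then use the first equation in (5.204) to obtain $\{\alpha_j\}$ at the input units, then the second equation to evaluate $\{\beta_j\}$ at the first hidden layer, and again the first equation to evaluate $\{\alpha_j\}$ at the first hidden layer. We repeatedly evaluate $\{\beta_j\}$ and $\{\alpha_j\}$ in this way until reaching the output layer. Then we deal with (5.206):

\[
\begin{aligned}
\frac{\partial\Omega_n}{\partial w_{rs}} &= \frac{\partial}{\partial w_{rs}}\left\{\frac{1}{2}\sum_k(\mathcal{G}y_k)^2\right\} = \frac{1}{2}\sum_k\frac{\partial(\mathcal{G}y_k)^2}{\partial w_{rs}} \\
&= \frac{1}{2}\sum_k\frac{\partial(\mathcal{G}y_k)^2}{\partial(\mathcal{G}y_k)}\frac{\partial(\mathcal{G}y_k)}{\partial w_{rs}} = \sum_k\mathcal{G}y_k\frac{\partial\mathcal{G}y_k}{\partial w_{rs}} \\
&= \sum_k\mathcal{G}y_k\,\mathcal{G}\left[\frac{\partial y_k}{\partial w_{rs}}\right] = \sum_k\alpha_k\,\mathcal{G}\left[\frac{\partial y_k}{\partial a_r}\frac{\partial a_r}{\partial w_{rs}}\right] \\
&= \sum_k\alpha_k\,\mathcal{G}\left[\delta_{kr}z_s\right] = \sum_k\alpha_k\left\{\mathcal{G}[\delta_{kr}]z_s + \mathcal{G}[z_s]\delta_{kr}\right\} \\
&= \sum_k\alpha_k\left\{\phi_{kr}z_s + \alpha_s\delta_{kr}\right\}
\end{aligned}
\]

Following the idea in section 5.3, the backward propagation formula is easy to derive. We can simply replace $E_n$ with $y_k$ to obtain a backward equation, so we omit it here.

Problem 5.27 Solution

Following the procedure in section 5.5.5, we can obtain:

\[
\Omega = \frac{1}{2}\int\left(\boldsymbol{\tau}^T\nabla y(\mathbf{x})\right)^2p(\mathbf{x})\,d\mathbf{x}
\]

Since we have $\boldsymbol{\tau} = \partial\mathbf{s}(\mathbf{x}, \boldsymbol{\xi})/\partial\boldsymbol{\xi}$ and $\mathbf{s} = \mathbf{x} + \boldsymbol{\xi}$, we have $\boldsymbol{\tau} = \mathbf{I}$. Therefore, substituting $\boldsymbol{\tau}$ into the equation above, we can obtain:

\[
\Omega = \frac{1}{2}\int\|\nabla y(\mathbf{x})\|^2p(\mathbf{x})\,d\mathbf{x}
\]

Just as required.

Problem 5.28 Solution

The modifications only affect derivatives with respect to the weights in the convolutional layer. The units within a feature map (indexed $m$) have different inputs, but all share a common weight vector, $\mathbf{w}^{(m)}$. Therefore, we can write:

\[
\frac{\partial E_n}{\partial w_i^{(m)}} = \sum_j\frac{\partial E_n}{\partial a_j^{(m)}}\frac{\partial a_j^{(m)}}{\partial w_i^{(m)}} = \sum_j\delta_j^{(m)}z_{ji}^{(m)}
\]

Here $a_j^{(m)}$ denotes the activation of the $j$th unit in the $m$th feature map, $w_i^{(m)}$ denotes the $i$th element of the corresponding weight vector, and $z_{ji}^{(m)}$ denotes the $i$th input of the $j$th unit in the $m$th feature map. Note that $\delta_j^{(m)}$ can be computed recursively from the units in the following layer.

Problem 5.29 Solution

It is obvious.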
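Firstly, we know that:

$$\frac{\partial}{\partial w_i}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} = -\pi_j\,\mathcal{N}(w_i|\mu_j,\sigma_j^2)\,\frac{w_i-\mu_j}{\sigma_j^2}$$

We now differentiate the error function with respect to $w_i$:

$$\begin{aligned}
\frac{\partial \widetilde{E}}{\partial w_i} &= \frac{\partial E}{\partial w_i} + \lambda\frac{\partial \Omega(\mathbf{w})}{\partial w_i} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{\partial}{\partial w_i}\left\{\sum_i \ln\left(\sum_{j=1}^M \pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)\right\} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{\partial}{\partial w_i}\left\{\ln\left(\sum_{j=1}^M \pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)\right\} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{1}{\sum_{j=1}^M \pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}\frac{\partial}{\partial w_i}\left\{\sum_{j=1}^M \pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} \\
&= \frac{\partial E}{\partial w_i} + \lambda\frac{1}{\sum_{k} \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\sum_{j=1}^M \pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2) \\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^M \frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_k \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{w_i-\mu_j}{\sigma_j^2} \\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^M \gamma_j(w_i)\frac{w_i-\mu_j}{\sigma_j^2}
\end{aligned}$$

where we have used (5.138) and the definition (5.140).

Problem 5.30 Solution

It is similar to the previous problem. Since we know that:

$$\frac{\partial}{\partial \mu_j}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} = \pi_j\,\mathcal{N}(w_i|\mu_j,\sigma_j^2)\,\frac{w_i-\mu_j}{\sigma_j^2}$$

we can derive:

$$\begin{aligned}
\frac{\partial \widetilde{E}}{\partial \mu_j} &= \lambda\frac{\partial \Omega(\mathbf{w})}{\partial \mu_j} = -\lambda\frac{\partial}{\partial \mu_j}\left\{\sum_i \ln\left(\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right\} \\
&= -\lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\partial}{\partial \mu_j}\left\{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\,\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\frac{w_i-\mu_j}{\sigma_j^2} \\
&= \lambda\sum_i \frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\mu_j-w_i}{\sigma_j^2} = \lambda\sum_i \gamma_j(w_i)\frac{\mu_j-w_i}{\sigma_j^2}
\end{aligned}$$

Note that there is a typo in (5.142). The numerator should be $\mu_j - w_i$ instead of $\mu_i - w_j$. This can be seen from the fact that the mean and variance of the Gaussian distribution should carry the same subindex: since $\sigma_j$ is in the denominator, $\mu_j$ should occur in the numerator instead of $\mu_i$.

Problem 5.31 Solution

It is similar to the previous problem.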
Since we know that:

$$\frac{\partial}{\partial \sigma_j}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} = \left(-\frac{1}{\sigma_j} + \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right)\pi_j\,\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$

we can derive:

$$\begin{aligned}
\frac{\partial \widetilde{E}}{\partial \sigma_j} &= \lambda\frac{\partial \Omega(\mathbf{w})}{\partial \sigma_j} = -\lambda\frac{\partial}{\partial \sigma_j}\left\{\sum_i \ln\left(\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right\} \\
&= -\lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\partial}{\partial \sigma_j}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} \\
&= \lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\,\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\left(\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right) \\
&= \lambda\sum_i \gamma_j(w_i)\left(\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right)
\end{aligned}$$

Just as required.

Problem 5.32 Solution

It is trivial. We begin by verifying (5.208) when $j \neq k$:

$$\frac{\partial \pi_k}{\partial \eta_j} = \frac{\partial}{\partial \eta_j}\left\{\frac{\exp(\eta_k)}{\sum_{k'}\exp(\eta_{k'})}\right\} = \frac{-\exp(\eta_k)\exp(\eta_j)}{\left[\sum_{k'}\exp(\eta_{k'})\right]^2} = -\pi_j\pi_k$$

And if $j = k$:

$$\frac{\partial \pi_k}{\partial \eta_k} = \frac{\partial}{\partial \eta_k}\left\{\frac{\exp(\eta_k)}{\sum_{k'}\exp(\eta_{k'})}\right\} = \frac{\exp(\eta_k)\left[\sum_{k'}\exp(\eta_{k'})\right] - \exp(\eta_k)\exp(\eta_k)}{\left[\sum_{k'}\exp(\eta_{k'})\right]^2} = \pi_k - \pi_k\pi_k$$

Combining these two cases, we see that (5.208) holds. Now we prove (5.147):

$$\begin{aligned}
\frac{\partial \widetilde{E}}{\partial \eta_j} &= \lambda\frac{\partial \Omega(\mathbf{w})}{\partial \eta_j} = -\lambda\frac{\partial}{\partial \eta_j}\left\{\sum_i \ln\left\{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\}\right\} \\
&= -\lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\partial}{\partial \eta_j}\left\{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\sum_{k=1}^M \frac{\partial}{\partial \pi_k}\left\{\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\}\frac{\partial \pi_k}{\partial \eta_j} \\
&= -\lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\sum_{k=1}^M \mathcal{N}(w_i|\mu_k,\sigma_k^2)\,(\delta_{jk}\pi_j - \pi_j\pi_k) \\
&= -\lambda\sum_i \frac{1}{\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2) - \pi_j\sum_{k=1}^M \pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i \left\{\gamma_j(w_i) - \pi_j\right\} = \lambda\sum_i \left\{\pi_j - \gamma_j(w_i)\right\}
\end{aligned}$$

Just as required.
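The softmax derivative (5.208) used above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the test vector `eta` are illustrative choices, not part of the textbook. It compares the analytic Jacobian $\delta_{jk}\pi_j - \pi_j\pi_k$ with central finite differences.

```python
import numpy as np

def softmax(eta):
    # Subtract the maximum for numerical stability; this does not change the result.
    z = np.exp(eta - np.max(eta))
    return z / z.sum()

def softmax_jacobian(eta):
    # Analytic Jacobian J[k, j] = d pi_k / d eta_j = delta_jk * pi_j - pi_j * pi_k.
    pi = softmax(eta)
    return np.diag(pi) - np.outer(pi, pi)

def numerical_jacobian(eta, h=1e-6):
    # Central finite differences, one column per perturbed eta_j.
    J = np.zeros((eta.size, eta.size))
    for j in range(eta.size):
        step = np.zeros_like(eta)
        step[j] = h
        J[:, j] = (softmax(eta + step) - softmax(eta - step)) / (2 * h)
    return J

eta = np.array([0.3, -1.2, 2.0, 0.5])  # arbitrary illustrative values
max_abs_error = np.max(np.abs(softmax_jacobian(eta) - numerical_jacobian(eta)))
```

Because the mixing coefficients sum to one, each column of the Jacobian should also sum to zero, which gives a second quick sanity check.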
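Problem 5.33 Solution

It is trivial. We set the attachment point of the lower arm with the ground as the origin of the coordinate system. We first find the vertical distance from the origin to the target point, which is also the value of $x_2$:

$$x_2 = L_1\sin(\pi-\theta_1) + L_2\sin(\theta_2-(\pi-\theta_1)) = L_1\sin\theta_1 - L_2\sin(\theta_1+\theta_2)$$

Similarly, we calculate the horizontal distance from the origin to the target point:

$$x_1 = -L_1\cos(\pi-\theta_1) + L_2\cos(\theta_2-(\pi-\theta_1)) = L_1\cos\theta_1 - L_2\cos(\theta_1+\theta_2)$$

These two equations give the 'forward kinematics' of the robot arm.

Problem 5.34 Solution

By analogy with (5.208), we can write:

$$\frac{\partial \pi_k(\mathbf{x})}{\partial a_j^{\pi}} = \delta_{jk}\pi_j(\mathbf{x}) - \pi_j(\mathbf{x})\pi_k(\mathbf{x})$$

Using (5.153), we can see that:

$$E_n = -\ln\left\{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\}$$

Therefore, we can derive:

$$\begin{aligned}
\frac{\partial E_n}{\partial a_j^{\pi}} &= -\frac{\partial}{\partial a_j^{\pi}}\ln\left\{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\frac{\partial}{\partial a_j^{\pi}}\left\{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\sum_{k=1}^K \frac{\partial \pi_k}{\partial a_j^{\pi}}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\sum_{k=1}^K \left[\delta_{jk}\pi_j(\mathbf{x}_n) - \pi_j(\mathbf{x}_n)\pi_k(\mathbf{x}_n)\right]\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\left\{\pi_j(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_j,\sigma_j^2) - \pi_j(\mathbf{x}_n)\sum_{k=1}^K \pi_k(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= \frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\left\{-\pi_j(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_j,\sigma_j^2) + \pi_j(\mathbf{x}_n)\sum_{k=1}^K \pi_k(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\}
\end{aligned}$$

Using the definition (5.154), we have:

$$\frac{\partial E_n}{\partial a_j^{\pi}} = -\gamma_j + \pi_j$$

Note that our result differs from (5.155) only in the subindex. The two are actually the same if we replace index $j$ by index $k$ in the final expression.

Problem 5.35 Solution

We deal with the derivative of the error function with respect to $\boldsymbol{\mu}_k$ instead, which gives a vector as a result. The $l$th element of this vector is what we are required to find. Since we know that:

$$\frac{\partial}{\partial \boldsymbol{\mu}_k}\left\{\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} = \frac{\mathbf{t}_n-\boldsymbol{\mu}_k}{\sigma_k^2}\,\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)$$

One thing worth noticing is that here we focus on the isotropic case as stated on page 273 of the textbook.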
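To be more precise, $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)$ should be $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2\mathbf{I})$. Given the equation above, we can further obtain:

$$\begin{aligned}
\frac{\partial E_n}{\partial \boldsymbol{\mu}_k} &= \frac{\partial}{\partial \boldsymbol{\mu}_k}\left\{-\ln\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\frac{\partial}{\partial \boldsymbol{\mu}_k}\left\{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\cdot\frac{\mathbf{t}_n-\boldsymbol{\mu}_k}{\sigma_k^2}\,\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\gamma_k\frac{\mathbf{t}_n-\boldsymbol{\mu}_k}{\sigma_k^2}
\end{aligned}$$

Hence, noticing (5.152), the $l$th element of the result above is what we are required to find (here $t_l$ is the $l$th element of $\mathbf{t}_n$):

$$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \frac{\partial E_n}{\partial \mu_{kl}} = \gamma_k\frac{\mu_{kl}-t_l}{\sigma_k^2}$$

Problem 5.36 Solution

Similarly, we know that:

$$\frac{\partial}{\partial \sigma_k}\left\{\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} = \left\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n-\boldsymbol{\mu}_k\|^2}{\sigma_k^3}\right\}\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)$$

Therefore, we can obtain:

$$\begin{aligned}
\frac{\partial E_n}{\partial \sigma_k} &= \frac{\partial}{\partial \sigma_k}\left\{-\ln\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\frac{\partial}{\partial \sigma_k}\left\{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\cdot\left\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n-\boldsymbol{\mu}_k\|^2}{\sigma_k^3}\right\}\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\gamma_k\left\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n-\boldsymbol{\mu}_k\|^2}{\sigma_k^3}\right\}
\end{aligned}$$

Note that there is a typo in (5.157), and the underlying reason is that:

$$|\sigma_k^2\mathbf{I}_{D\times D}| = (\sigma_k^2)^D$$

Problem 5.37 Solution

First we recall two properties of the Gaussian distribution $\mathcal{N}(\mathbf{t}|\boldsymbol{\mu},\sigma^2\mathbf{I})$:

$$\mathbb{E}[\mathbf{t}] = \int \mathbf{t}\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu},\sigma^2\mathbf{I})\,d\mathbf{t} = \boldsymbol{\mu}$$

and

$$\mathbb{E}[\|\mathbf{t}\|^2] = \int \|\mathbf{t}\|^2\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu},\sigma^2\mathbf{I})\,d\mathbf{t} = L\sigma^2 + \|\boldsymbol{\mu}\|^2$$

where we have used $\mathbb{E}[\mathbf{t}^T\mathbf{A}\mathbf{t}] = \mathrm{Tr}[\mathbf{A}\sigma^2\mathbf{I}] + \boldsymbol{\mu}^T\mathbf{A}\boldsymbol{\mu}$ with $\mathbf{A} = \mathbf{I}$. This property can be found in The Matrix Cookbook, Eq (378). Here $L$ is the dimension of $\mathbf{t}$. Noticing (5.148), we can write:

$$\mathbb{E}[\mathbf{t}|\mathbf{x}] = \int \mathbf{t}\,p(\mathbf{t}|\mathbf{x})\,d\mathbf{t} = \int \mathbf{t}\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} = \sum_{k=1}^K \pi_k\int \mathbf{t}\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} = \sum_{k=1}^K \pi_k\boldsymbol{\mu}_k$$

Then we prove (5.160).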
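$$\begin{aligned}
s^2(\mathbf{x}) &= \mathbb{E}\left[\|\mathbf{t}-\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2\,\big|\,\mathbf{x}\right] = \mathbb{E}\left[\left(\|\mathbf{t}\|^2 - 2\mathbf{t}^T\mathbb{E}[\mathbf{t}|\mathbf{x}] + \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2\right)\big|\,\mathbf{x}\right] \\
&= \mathbb{E}[\|\mathbf{t}\|^2|\mathbf{x}] - \mathbb{E}\left[2\mathbf{t}^T\mathbb{E}[\mathbf{t}|\mathbf{x}]\,\big|\,\mathbf{x}\right] + \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 = \mathbb{E}[\|\mathbf{t}\|^2|\mathbf{x}] - \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 \\
&= \int \|\mathbf{t}\|^2\sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} - \Big\|\sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K \pi_k\int \|\mathbf{t}\|^2\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} - \Big\|\sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K \pi_k\left(L\sigma_k^2 + \|\boldsymbol{\mu}_k\|^2\right) - \Big\|\sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K \pi_k\sigma_k^2 + \sum_{k=1}^K \pi_k\|\boldsymbol{\mu}_k\|^2 - 2\,\Big\|\sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2 + 1\times\Big\|\sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K \pi_k\sigma_k^2 + \sum_{k=1}^K \pi_k\|\boldsymbol{\mu}_k\|^2 - 2\Big(\sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big)^T\Big(\sum_{k=1}^K \pi_k\boldsymbol{\mu}_k\Big) + \Big(\sum_{k=1}^K \pi_k\Big)\Big\|\sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K \pi_k\sigma_k^2 + \sum_{k=1}^K \pi_k\Big\|\boldsymbol{\mu}_k - \sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K \pi_k\left(L\sigma_k^2 + \Big\|\boldsymbol{\mu}_k - \sum_{l=1}^K \pi_l\boldsymbol{\mu}_l\Big\|^2\right)
\end{aligned}$$

Note that there is a typo in (5.160), i.e., the coefficient $L$ in front of $\sigma_k^2$ is missing.

Problem 5.38 Solution

From (5.167) and (5.171), we can write down the expression for the predictive distribution:

$$\begin{aligned}
p(t|\mathbf{x},\mathcal{D},\alpha,\beta) &= \int p(\mathbf{w}|\mathcal{D},\alpha,\beta)\,p(t|\mathbf{x},\mathbf{w},\beta)\,d\mathbf{w} \\
&\approx \int q(\mathbf{w}|\mathcal{D})\,p(t|\mathbf{x},\mathbf{w},\beta)\,d\mathbf{w} \\
&= \int \mathcal{N}(\mathbf{w}|\mathbf{w}_{\mathrm{MAP}},\mathbf{A}^{-1})\,\mathcal{N}(t|\mathbf{g}^T\mathbf{w} - \mathbf{g}^T\mathbf{w}_{\mathrm{MAP}} + y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}),\beta^{-1})\,d\mathbf{w}
\end{aligned}$$

Note that here $p(t|\mathbf{x},\mathbf{w},\beta)$ is given by (5.171) and $q(\mathbf{w}|\mathcal{D})$ is the approximation to the posterior $p(\mathbf{w}|\mathcal{D},\alpha,\beta)$, which is given by (5.167). Then by analogy with (2.115), we first deal with the mean of the predictive distribution:

$$\text{mean} = \left.\mathbf{g}^T\mathbf{w} - \mathbf{g}^T\mathbf{w}_{\mathrm{MAP}} + y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})\right|_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}} = y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$$

Then we deal with the variance (a scalar here, since $t$ is one-dimensional):

$$\text{variance} = \beta^{-1} + \mathbf{g}^T\mathbf{A}^{-1}\mathbf{g}$$

Just as required.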
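The predictive mean and variance of Problem 5.38 can be sanity-checked by sampling. The sketch below is only an illustration under stated assumptions: the precision matrix `A`, the vectors `w_map` and `g`, and the scalars `y_map` and `beta` are arbitrary made-up values rather than quantities from a trained network, and NumPy is assumed to be available. It draws $\mathbf{w}$ from the Gaussian posterior approximation, pushes it through the linearized model with additive noise, and compares the sample moments with $y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$ and $\beta^{-1} + \mathbf{g}^T\mathbf{A}^{-1}\mathbf{g}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 3                                   # number of weights (illustrative)
M = rng.normal(size=(W, W))
A = M @ M.T + W * np.eye(W)             # hypothetical posterior precision, symmetric positive definite
w_map = rng.normal(size=W)              # hypothetical MAP weight vector
g = rng.normal(size=W)                  # hypothetical gradient of y(x, w) at w_map
y_map = 0.7                             # hypothetical network output y(x, w_MAP)
beta = 4.0                              # hypothetical noise precision

# Closed-form predictive moments from the linear-Gaussian result.
analytic_mean = y_map
analytic_var = 1.0 / beta + g @ np.linalg.solve(A, g)

# Monte Carlo estimate: sample w ~ N(w_map, A^{-1}), then t ~ N(linearized output, 1/beta).
n = 400_000
w = rng.multivariate_normal(w_map, np.linalg.inv(A), size=n)
t = y_map + (w - w_map) @ g + rng.normal(scale=beta ** -0.5, size=n)
sample_mean, sample_var = t.mean(), t.var()
```

With this many samples the two estimates should agree with the closed-form values to within a couple of percent; the agreement is statistical, not exact.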
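Problem 5.39 Solution

Using the Laplace approximation, we can obtain:

$$p(\mathcal{D}|\mathbf{w},\beta)\,p(\mathbf{w}|\alpha) \approx p(\mathcal{D}|\mathbf{w}_{\mathrm{MAP}},\beta)\,p(\mathbf{w}_{\mathrm{MAP}}|\alpha)\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}})^T\mathbf{A}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}})\right\}$$

Then using (5.174), (5.162) and (5.163), we can obtain:

$$\begin{aligned}
p(\mathcal{D}|\alpha,\beta) &= \int p(\mathcal{D}|\mathbf{w},\beta)\,p(\mathbf{w}|\alpha)\,d\mathbf{w} \\
&\approx \int p(\mathcal{D}|\mathbf{w}_{\mathrm{MAP}},\beta)\,p(\mathbf{w}_{\mathrm{MAP}}|\alpha)\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}})^T\mathbf{A}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}})\right\}d\mathbf{w} \\
&= p(\mathcal{D}|\mathbf{w}_{\mathrm{MAP}},\beta)\,p(\mathbf{w}_{\mathrm{MAP}}|\alpha)\,\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}} \\
&= \prod_{n=1}^N \mathcal{N}(t_n|y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}}),\beta^{-1})\;\mathcal{N}(\mathbf{w}_{\mathrm{MAP}}|\mathbf{0},\alpha^{-1}\mathbf{I})\,\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}}
\end{aligned}$$

If we take the logarithm of both sides, we obtain (5.175) just as required.

Problem 5.40 Solution

For a $K$-class classification problem, we need to use the softmax activation function, and the error function is now given by (5.24). Therefore, the Hessian matrix should be derived from (5.24), and the cross entropy in (5.184) will also be replaced by (5.24).

Problem 5.41 Solution

By analogy to Prob.5.39, we can write:

$$p(\mathcal{D}|\alpha) = p(\mathcal{D}|\mathbf{w}_{\mathrm{MAP}})\,p(\mathbf{w}_{\mathrm{MAP}}|\alpha)\,\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}}$$

Since the prior $p(\mathbf{w}|\alpha)$ is Gaussian, i.e., (5.162), as stated in the text, we can obtain:

$$\begin{aligned}
\ln p(\mathcal{D}|\alpha) &= \ln p(\mathcal{D}|\mathbf{w}_{\mathrm{MAP}}) + \ln p(\mathbf{w}_{\mathrm{MAP}}|\alpha) - \frac{1}{2}\ln|\mathbf{A}| + \text{const} \\
&= \ln p(\mathcal{D}|\mathbf{w}_{\mathrm{MAP}}) - \frac{\alpha}{2}\mathbf{w}_{\mathrm{MAP}}^T\mathbf{w}_{\mathrm{MAP}} + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|\mathbf{A}| + \text{const} \\
&= -E(\mathbf{w}_{\mathrm{MAP}}) + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|\mathbf{A}| + \text{const}
\end{aligned}$$

Just as required.

0.6 Kernel Methods

Problem 6.1 Solution

Recall that in section 6.1, $a_n$ can be written as (6.4). We can derive:

$$\begin{aligned}
a_n &= -\frac{1}{\lambda}\left\{\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) - t_n\right\} \\
&= -\frac{1}{\lambda}\left\{w_1\phi_1(\mathbf{x}_n) + w_2\phi_2(\mathbf{x}_n) + \dots + w_M\phi_M(\mathbf{x}_n) - t_n\right\} \\
&= -\frac{w_1}{\lambda}\phi_1(\mathbf{x}_n) - \frac{w_2}{\lambda}\phi_2(\mathbf{x}_n) - \dots - \frac{w_M}{\lambda}\phi_M(\mathbf{x}_n) + \frac{t_n}{\lambda} \\
&= \left(c_n - \frac{w_1}{\lambda}\right)\phi_1(\mathbf{x}_n) + \left(c_n - \frac{w_2}{\lambda}\right)\phi_2(\mathbf{x}_n) + \dots + \left(c_n - \frac{w_M}{\lambda}\right)\phi_M(\mathbf{x}_n)
\end{aligned}$$

Here we have defined:

$$c_n = \frac{t_n/\lambda}{\phi_1(\mathbf{x}_n) + \phi_2(\mathbf{x}_n) + \dots + \phi_M(\mathbf{x}_n)}$$

From what we have derived above, we can see that $a_n$ is a linear combination of the elements of $\boldsymbol{\phi}(\mathbf{x}_n)$. What's more, if we first substitute $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^T$ into (6.7), we obtain (6.5). Next, substituting (6.3) into (6.5), we obtain (6.2) just as required.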
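The equivalence between the primal and dual forms discussed in Problem 6.1 can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a randomly generated design matrix `Phi` and target vector `t` (both made up for illustration). It solves the regularized least-squares problem once in the primal, $\mathbf{w} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$, and once through the dual variables $\mathbf{a} = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{t}$ with $\mathbf{w} = \boldsymbol{\Phi}^T\mathbf{a}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 8, 3, 0.5                   # illustrative sizes and regularization coefficient
Phi = rng.normal(size=(N, M))           # made-up design matrix, rows are phi(x_n)^T
t = rng.normal(size=N)                  # made-up targets

# Primal solution of regularized least squares.
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Dual solution: a = (K + lam I)^{-1} t with Gram matrix K = Phi Phi^T, then w = Phi^T a.
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
w_dual = Phi.T @ a
```

The dual variables also satisfy $a_n = -\frac{1}{\lambda}\{\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) - t_n\}$, which is the relation the derivation above starts from.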
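Problem 6.2 Solution

If we set $\mathbf{w}^{(0)} = \mathbf{0}$ in (4.55), we can obtain:

$$\mathbf{w}^{(\tau+1)} = \sum_{n=1}^N \eta\,c_n t_n\boldsymbol{\phi}_n$$

where $N$ is the total number of samples and $c_n$ is the number of times that $t_n\boldsymbol{\phi}_n$ has been added from step 0 to step $\tau+1$. Therefore, we have:

$$\mathbf{w} = \sum_{n=1}^N \alpha_n t_n\boldsymbol{\phi}_n$$

We further substitute the expression above into (4.55), which gives:

$$\sum_{n=1}^N \alpha_n^{(\tau+1)} t_n\boldsymbol{\phi}_n = \sum_{n=1}^N \alpha_n^{(\tau)} t_n\boldsymbol{\phi}_n + \eta\,t_n\boldsymbol{\phi}_n$$

In other words, the update process adds the learning rate $\eta$ to the coefficient $\alpha_n$ corresponding to the misclassified pattern $\mathbf{x}_n$, i.e.,

$$\alpha_n^{(\tau+1)} = \alpha_n^{(\tau)} + \eta$$

Now we similarly substitute it into (4.52):

$$y(\mathbf{x}) = f\left(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\right) = f\left(\sum_{n=1}^N \alpha_n t_n\boldsymbol{\phi}_n^T\boldsymbol{\phi}(\mathbf{x})\right) = f\left(\sum_{n=1}^N \alpha_n t_n k(\mathbf{x}_n,\mathbf{x})\right)$$

Problem 6.3 Solution

We begin by expanding the Euclidean metric:

$$\|\mathbf{x}-\mathbf{x}_n\|^2 = (\mathbf{x}-\mathbf{x}_n)^T(\mathbf{x}-\mathbf{x}_n) = (\mathbf{x}^T-\mathbf{x}_n^T)(\mathbf{x}-\mathbf{x}_n) = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}_n^T\mathbf{x} + \mathbf{x}_n^T\mathbf{x}_n$$

Similar to (6.24)-(6.26), we use a nonlinear kernel $k(\mathbf{x}_n,\mathbf{x})$ to replace $\mathbf{x}_n^T\mathbf{x}$, which gives a general nonlinear nearest-neighbor classifier with cost function defined as:

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}_n,\mathbf{x}_n) - 2k(\mathbf{x}_n,\mathbf{x})$$

Problem 6.4 Solution

To construct such a matrix, let us suppose the two eigenvalues are 1 and 2, and the matrix has the form:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

Therefore, based on the definition of an eigenvalue, we have two equations:

$$\begin{cases} (a-2)(d-2) = bc & (1) \\ (a-1)(d-1) = bc & (2) \end{cases}$$

Subtracting (1) from (2) yields:

$$a + d = 3$$

Therefore, we set $a = 4$ and $d = -1$. Substituting them into (1), we see:

$$bc = -6$$

Finally, we choose $b = 3$ and $c = -2$. The constructed matrix is:

$$\begin{bmatrix} 4 & 3 \\ -2 & -1 \end{bmatrix}$$

Problem 6.5 Solution

Since $k_1(\mathbf{x},\mathbf{x}')$ is a valid kernel, it can be written as:

$$k_1(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$$

We can obtain:

$$k(\mathbf{x},\mathbf{x}') = c\,k_1(\mathbf{x},\mathbf{x}') = \left[\sqrt{c}\,\boldsymbol{\phi}(\mathbf{x})\right]^T\left[\sqrt{c}\,\boldsymbol{\phi}(\mathbf{x}')\right]$$

Therefore, (6.13) is a valid kernel. It is similar for (6.14):

$$k(\mathbf{x},\mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{x}')\,f(\mathbf{x}') = \left[f(\mathbf{x})\boldsymbol{\phi}(\mathbf{x})\right]^T\left[f(\mathbf{x}')\boldsymbol{\phi}(\mathbf{x}')\right]$$

Just as required.

Problem 6.6 Solution

We suppose $q(x)$ can be written as:

$$q(x) = a_n x^n + a_{n-1}x^{n-1} + \dots + a_1 x + a_0$$

We now obtain:

$$k(\mathbf{x},\mathbf{x}') = a_n k_1(\mathbf{x},\mathbf{x}')^n + a_{n-1}k_1(\mathbf{x},\mathbf{x}')^{n-1} + \dots + a_1 k_1(\mathbf{x},\mathbf{x}') + a_0$$

By repeatedly using (6.13), (6.17) and (6.18), we can verify that $k(\mathbf{x},\mathbf{x}')$ is a valid kernel. For (6.16), we can use the Taylor expansion of the exponential, and since the coefficients of this expansion are all positive, we can similarly prove its validity.

Problem 6.7 Solution

To prove (6.17), we will use the property stated below (6.12). Since $k_1(\mathbf{x},\mathbf{x}')$ and $k_2(\mathbf{x},\mathbf{x}')$ are valid kernels, their Gram matrices $\mathbf{K}_1$ and $\mathbf{K}_2$ are both positive semidefinite. Given the relation (6.12), it can be shown that $\mathbf{K} = \mathbf{K}_1 + \mathbf{K}_2$ is also positive semidefinite and thus $k(\mathbf{x},\mathbf{x}')$ is also a valid kernel.

To prove (6.18), we assume the feature map for kernel $k_1(\mathbf{x},\mathbf{x}')$ is $\boldsymbol{\phi}^{(1)}(\mathbf{x})$, and similarly $\boldsymbol{\phi}^{(2)}(\mathbf{x})$ for $k_2(\mathbf{x},\mathbf{x}')$. Moreover, we assume the dimension of $\boldsymbol{\phi}^{(1)}(\mathbf{x})$ is $M$, and that of $\boldsymbol{\phi}^{(2)}(\mathbf{x})$ is $N$. We expand $k(\mathbf{x},\mathbf{x}')$ based on (6.18):

$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= k_1(\mathbf{x},\mathbf{x}')\,k_2(\mathbf{x},\mathbf{x}') \\
&= \boldsymbol{\phi}^{(1)}(\mathbf{x})^T\boldsymbol{\phi}^{(1)}(\mathbf{x}')\;\boldsymbol{\phi}^{(2)}(\mathbf{x})^T\boldsymbol{\phi}^{(2)}(\mathbf{x}') \\
&= \sum_{i=1}^M \phi_i^{(1)}(\mathbf{x})\phi_i^{(1)}(\mathbf{x}')\sum_{j=1}^N \phi_j^{(2)}(\mathbf{x})\phi_j^{(2)}(\mathbf{x}') \\
&= \sum_{i=1}^M\sum_{j=1}^N \left[\phi_i^{(1)}(\mathbf{x})\phi_j^{(2)}(\mathbf{x})\right]\left[\phi_i^{(1)}(\mathbf{x}')\phi_j^{(2)}(\mathbf{x}')\right] \\
&= \sum_{k=1}^{MN} \phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

where $\phi_i^{(1)}(\mathbf{x})$ is the $i$th element of $\boldsymbol{\phi}^{(1)}(\mathbf{x})$, and $\phi_j^{(2)}(\mathbf{x})$ is the $j$th element of $\boldsymbol{\phi}^{(2)}(\mathbf{x})$. To be more specific, we have proved that $k(\mathbf{x},\mathbf{x}')$ can be written as $\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$. Here $\boldsymbol{\phi}(\mathbf{x})$ is an $MN\times 1$ column vector, and the $k$th ($k = 1, 2, \dots, MN$) element is given by $\phi_i^{(1)}(\mathbf{x})\times\phi_j^{(2)}(\mathbf{x})$. What's more, we can also express $i, j$ in terms of $k$:

$$i = (k-1)\oslash N + 1 \quad\text{and}\quad j = (k-1)\odot N + 1$$

where $\oslash$ and $\odot$ denote integer division and remainder, respectively.

Problem 6.8 Solution

For (6.19) we suppose $k_3(\mathbf{x},\mathbf{x}') = \mathbf{g}(\mathbf{x})^T\mathbf{g}(\mathbf{x}')$, and thus we have:

$$k(\mathbf{x},\mathbf{x}') = k_3(\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{x}')) = \mathbf{g}(\boldsymbol{\phi}(\mathbf{x}))^T\mathbf{g}(\boldsymbol{\phi}(\mathbf{x}')) = \mathbf{f}(\mathbf{x})^T\mathbf{f}(\mathbf{x}')$$

where we have denoted $\mathbf{g}(\boldsymbol{\phi}(\mathbf{x})) = \mathbf{f}(\mathbf{x})$, and now it is obvious that (6.19) holds.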
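To prove (6.20), we suppose $\mathbf{x}$ is an $N\times 1$ column vector and $\mathbf{A}$ is an $N\times N$ symmetric positive semidefinite matrix. We know that $\mathbf{A}$ can be decomposed as $\mathbf{Q}\mathbf{B}\mathbf{Q}^T$. Here $\mathbf{Q}$ is an $N\times N$ orthogonal matrix, and $\mathbf{B}$ is an $N\times N$ diagonal matrix whose elements are no less than 0. Now we can derive:

$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= \mathbf{x}^T\mathbf{A}\mathbf{x}' = \mathbf{x}^T\mathbf{Q}\mathbf{B}\mathbf{Q}^T\mathbf{x}' = (\mathbf{Q}^T\mathbf{x})^T\mathbf{B}(\mathbf{Q}^T\mathbf{x}') = \mathbf{y}^T\mathbf{B}\mathbf{y}' \\
&= \sum_{i=1}^N B_{ii}\,y_i y_i' = \sum_{i=1}^N \left(\sqrt{B_{ii}}\,y_i\right)\left(\sqrt{B_{ii}}\,y_i'\right) = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

To be more specific, we have proved that $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$, where $\boldsymbol{\phi}(\mathbf{x})$ is an $N\times 1$ column vector whose $i$th ($i = 1, 2, \dots, N$) element is given by $\sqrt{B_{ii}}\,y_i$, i.e., $\sqrt{B_{ii}}\,(\mathbf{Q}^T\mathbf{x})_i$.

Problem 6.9 Solution

To prove (6.21), let's first expand the expression:

$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= k_a(\mathbf{x}_a,\mathbf{x}_a') + k_b(\mathbf{x}_b,\mathbf{x}_b') \\
&= \sum_{i=1}^M \phi_i^{(a)}(\mathbf{x}_a)\phi_i^{(a)}(\mathbf{x}_a') + \sum_{j=1}^N \phi_j^{(b)}(\mathbf{x}_b)\phi_j^{(b)}(\mathbf{x}_b') \\
&= \sum_{k=1}^{M+N} \phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

where we have assumed that the feature vector $\boldsymbol{\phi}^{(a)}(\mathbf{x}_a)$ has dimension $M$ and $\boldsymbol{\phi}^{(b)}(\mathbf{x}_b)$ has dimension $N$. The mapping function $\boldsymbol{\phi}(\mathbf{x})$ is an $(M+N)\times 1$ column vector, whose $k$th ($k = 1, 2, \dots, M+N$) element $\phi_k(\mathbf{x})$ is:

$$\phi_k(\mathbf{x}) = \begin{cases} \phi_k^{(a)}(\mathbf{x}_a) & 1 \le k \le M \\ \phi_{k-M}^{(b)}(\mathbf{x}_b) & M+1 \le k \le M+N \end{cases}$$

(6.22) is quite similar to (6.18). We follow the same procedure:

$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= k_a(\mathbf{x}_a,\mathbf{x}_a')\,k_b(\mathbf{x}_b,\mathbf{x}_b') \\
&= \sum_{i=1}^M \phi_i^{(a)}(\mathbf{x}_a)\phi_i^{(a)}(\mathbf{x}_a')\sum_{j=1}^N \phi_j^{(b)}(\mathbf{x}_b)\phi_j^{(b)}(\mathbf{x}_b') \\
&= \sum_{i=1}^M\sum_{j=1}^N \left[\phi_i^{(a)}(\mathbf{x}_a)\phi_j^{(b)}(\mathbf{x}_b)\right]\left[\phi_i^{(a)}(\mathbf{x}_a')\phi_j^{(b)}(\mathbf{x}_b')\right] \\
&= \sum_{k=1}^{MN} \phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

By analogy to (6.18), the mapping function $\boldsymbol{\phi}(\mathbf{x})$ is an $MN\times 1$ column vector, whose $k$th ($k = 1, 2, \dots, MN$) element $\phi_k(\mathbf{x})$ is:

$$\phi_k(\mathbf{x}) = \phi_i^{(a)}(\mathbf{x}_a)\times\phi_j^{(b)}(\mathbf{x}_b)$$

Here $\mathbf{x}_a$ and $\mathbf{x}_b$ are the two sub-vectors of $\mathbf{x}$ on which $k_a$ and $k_b$ act. What's more, we can also express $i, j$ in terms of $k$:

$$i = (k-1)\oslash N + 1 \quad\text{and}\quad j = (k-1)\odot N + 1$$

where $\oslash$ and $\odot$ denote integer division and remainder, respectively.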
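The index mapping $k \mapsto (i, j)$ used for the product kernel can be made concrete in a few lines. The sketch below is an illustration under assumptions: the feature vectors are random placeholders standing in for $\boldsymbol{\phi}^{(a)}(\mathbf{x}_a)$ and $\boldsymbol{\phi}^{(b)}(\mathbf{x}_b)$, the helper name `product_features` is hypothetical, and NumPy is assumed. It builds the $MN$-dimensional feature vector with the integer-division and remainder rule above and checks that its inner product reproduces the product of the two kernels.

```python
import numpy as np

def product_features(phi1, phi2):
    # Build the MN-dimensional feature vector whose k-th element (1-based k)
    # is phi1[i] * phi2[j] with i = (k-1) // N + 1 and j = (k-1) % N + 1.
    M, N = len(phi1), len(phi2)
    phi = np.empty(M * N)
    for k in range(1, M * N + 1):
        i = (k - 1) // N + 1
        j = (k - 1) % N + 1
        phi[k - 1] = phi1[i - 1] * phi2[j - 1]
    return phi

rng = np.random.default_rng(1)
# Placeholder feature vectors for two inputs x and x' (M = 3, N = 4).
phi1_x, phi1_xp = rng.normal(size=3), rng.normal(size=3)
phi2_x, phi2_xp = rng.normal(size=4), rng.normal(size=4)

k_product = (phi1_x @ phi1_xp) * (phi2_x @ phi2_xp)
k_feature = product_features(phi1_x, phi2_x) @ product_features(phi1_xp, phi2_xp)
```

With this ordering the constructed vector coincides with the Kronecker product of the two feature vectors, so `np.kron` would be a more compact way to write the same map.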
Problem 6.10 Solution

According to (6.9), we have:

$$y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^T(\mathbf{K}+\lambda\mathbf{I}_N)^{-1}\mathbf{t} = \mathbf{k}(\mathbf{x})^T\mathbf{a} = \sum_{n=1}^N f(\mathbf{x}_n)\,f(\mathbf{x})\,a_n = \left[\sum_{n=1}^N f(\mathbf{x}_n)\,a_n\right]f(\mathbf{x})$$

We see that if we choose $k(\mathbf{x},\mathbf{x}') = f(\mathbf{x})f(\mathbf{x}')$, we will always find a solution $y(\mathbf{x})$ proportional to $f(\mathbf{x})$.

Problem 6.11 Solution

We follow the hint.

$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= \exp(-\mathbf{x}^T\mathbf{x}/2\sigma^2)\cdot\exp(\mathbf{x}^T\mathbf{x}'/\sigma^2)\cdot\exp(-(\mathbf{x}')^T\mathbf{x}'/2\sigma^2) \\
&= \exp(-\mathbf{x}^T\mathbf{x}/2\sigma^2)\cdot\left(1 + \frac{\mathbf{x}^T\mathbf{x}'}{\sigma^2} + \frac{\left(\mathbf{x}^T\mathbf{x}'/\sigma^2\right)^2}{2!} + \cdots\right)\cdot\exp(-(\mathbf{x}')^T\mathbf{x}'/2\sigma^2) \\
&= \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

where $\boldsymbol{\phi}(\mathbf{x})$ is a column vector with infinite dimension. To be more specific, (6.12) gives a simple example of how to decompose $(\mathbf{x}^T\mathbf{x}')^2$. In our case, we can also decompose $(\mathbf{x}^T\mathbf{x}')^k$, $k = 1, 2, \dots$ in a similar way. Since $k$ is unbounded, the decomposition contains monomials of arbitrarily high degree. Thus, there are infinitely many terms in the decomposition and the feature mapping function $\boldsymbol{\phi}(\mathbf{x})$ has infinite dimension.

Problem 6.12 Solution

First, let's explain the problem a little bit. According to (6.27), what we need to prove here is:

$$k(A_1,A_2) = 2^{|A_1\cap A_2|} = \boldsymbol{\phi}(A_1)^T\boldsymbol{\phi}(A_2)$$

The biggest difference from the previous problem is that $\boldsymbol{\phi}(A)$ is a $2^{|D|}\times 1$ column vector, and instead of being indexed by $1, 2, \dots, 2^{|D|}$, here it is indexed by $\{U\,|\,U\subseteq D\}$ (note that $\{U\,|\,U\subseteq D\}$ is the set of all possible subsets of $D$ and thus there are $2^{|D|}$ elements in total). Therefore, according to (6.95), we can obtain:

$$\boldsymbol{\phi}(A_1)^T\boldsymbol{\phi}(A_2) = \sum_{U\subseteq D}\phi_U(A_1)\,\phi_U(A_2)$$

The summation iterates through all possible subsets of $D$. The current term equals 1 if and only if the current subset $U$ is a subset of both $A_1$ and $A_2$ simultaneously. Therefore, we are actually counting how many subsets of $D$ lie in the intersection of $A_1$ and $A_2$. Moreover, since $A_1$ and $A_2$ are both subsets of $D$, what we have deduced above can be written as:

$$\boldsymbol{\phi}(A_1)^T\boldsymbol{\phi}(A_2) = 2^{|A_1\cap A_2|}$$

Just as required.
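The counting argument in Problem 6.12 is easy to check by brute force on a small set. The following is a minimal sketch using only the Python standard library; the set `D`, the subsets `A1` and `A2`, and the helper names are illustrative assumptions. It enumerates every $U \subseteq D$, builds the indicator feature vector $\phi_U(A)$, and compares the inner product with $2^{|A_1 \cap A_2|}$.

```python
from itertools import combinations

def subsets(s):
    # Enumerate all subsets of s in a fixed order, as frozensets.
    items = sorted(s)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def phi(A, D):
    # Feature vector indexed by the subsets U of D: phi_U(A) = 1 if U is a subset of A, else 0.
    return [1 if U <= A else 0 for U in subsets(D)]

D = {1, 2, 3, 4}                        # small illustrative ground set
A1, A2 = {1, 2, 3}, {2, 3, 4}           # two illustrative subsets of D

inner = sum(u * v for u, v in zip(phi(A1, D), phi(A2, D)))
closed_form = 2 ** len(A1 & A2)
```

Here the intersection is $\{2, 3\}$, which has four subsets, so both quantities evaluate to 4.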
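Problem 6.13 Solution

Wait for update.

Problem 6.14 Solution

Since the covariance matrix $\mathbf{S}$ is fixed, according to (6.32) we can obtain:

$$\mathbf{g}(\boldsymbol{\mu},\mathbf{x}) = \nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}|\boldsymbol{\mu}) = \frac{\partial}{\partial \boldsymbol{\mu}}\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) = \mathbf{S}^{-1}(\mathbf{x}-\boldsymbol{\mu})$$

Therefore, according to (6.34), we can obtain:

$$\mathbf{F} = \mathbb{E}_{\mathbf{x}}\left[\mathbf{g}(\boldsymbol{\mu},\mathbf{x})\,\mathbf{g}(\boldsymbol{\mu},\mathbf{x})^T\right] = \mathbf{S}^{-1}\,\mathbb{E}_{\mathbf{x}}\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]\mathbf{S}^{-1}$$

Since $\mathbf{x} \sim \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\mathbf{S})$, we have:

$$\mathbb{E}_{\mathbf{x}}\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right] = \mathbf{S}$$

So we obtain $\mathbf{F} = \mathbf{S}^{-1}$ and then according to (6.33), we have:

$$k(\mathbf{x},\mathbf{x}') = \mathbf{g}(\boldsymbol{\mu},\mathbf{x})^T\mathbf{F}^{-1}\mathbf{g}(\boldsymbol{\mu},\mathbf{x}') = (\mathbf{x}-\boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x}'-\boldsymbol{\mu})$$

Problem 6.15 Solution

We rewrite the problem. What we are required to prove is that the Gram matrix $\mathbf{K}$:

$$\mathbf{K} = \begin{bmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{bmatrix}$$

where $k_{ij}$ ($i, j = 1, 2$) is short for $k(x_i, x_j)$, should be positive semidefinite. A positive semidefinite matrix has a non-negative determinant, i.e., $k_{12}k_{21} \le k_{11}k_{22}$. Using the symmetry of the kernel, i.e., $k_{12} = k_{21}$, we obtain what has been required.

Problem 6.16 Solution

Based on the total derivative of the function $f$, to first order in $\Delta\mathbf{w}$ we have:

$$f\left((\mathbf{w}+\Delta\mathbf{w})^T\boldsymbol{\phi}_1,\dots,(\mathbf{w}+\Delta\mathbf{w})^T\boldsymbol{\phi}_N\right) - f\left(\mathbf{w}^T\boldsymbol{\phi}_1,\dots,\mathbf{w}^T\boldsymbol{\phi}_N\right) \approx \sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\Delta\mathbf{w}^T\boldsymbol{\phi}_n$$

The right-hand side can be further written as:

$$\left[\sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n^T\right]\Delta\mathbf{w}$$

Note that here $\boldsymbol{\phi}_n$ is short for $\boldsymbol{\phi}(\mathbf{x}_n)$. Based on the equation above, we can obtain:

$$\nabla_{\mathbf{w}}f = \sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n^T$$

Now we focus on the derivative of the function $g$ with respect to $\mathbf{w}$:

$$\nabla_{\mathbf{w}}g = \frac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}\cdot 2\mathbf{w}^T$$

In order to find the optimal $\mathbf{w}$, we set the derivative of $J$ with respect to $\mathbf{w}$ equal to 0, yielding:

$$\nabla_{\mathbf{w}}J = \nabla_{\mathbf{w}}f + \nabla_{\mathbf{w}}g = \sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n^T + \frac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}\cdot 2\mathbf{w}^T = 0$$

Rearranging the equation above, we can obtain:

$$\mathbf{w} = -\frac{a}{2}\sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n$$

where we have defined $a = 1\Big/\dfrac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}$, and since $g$ is a monotonically increasing function, we have $a > 0$. Hence $\mathbf{w}$ is a linear combination of the $\boldsymbol{\phi}_n$.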
Problem 6.17 Solution

We consider a variation in the function $y(\mathbf{x})$ of the form:

$$y(\mathbf{x}) \to y(\mathbf{x}) + \epsilon\eta(\mathbf{x})$$

Substituting it into (6.39) yields:

$$\begin{aligned}
E[y+\epsilon\eta] &= \frac{1}{2}\sum_{n=1}^N\int\left\{y+\epsilon\eta-t_n\right\}^2 v(\boldsymbol{\xi})\,d\boldsymbol{\xi} \\
&= \frac{1}{2}\sum_{n=1}^N\int\left\{(y-t_n)^2 + 2\cdot(\epsilon\eta)\cdot(y-t_n) + (\epsilon\eta)^2\right\}v(\boldsymbol{\xi})\,d\boldsymbol{\xi} \\
&= E[y] + \epsilon\sum_{n=1}^N\int\left\{y-t_n\right\}\eta\,v\,d\boldsymbol{\xi} + O(\epsilon^2)
\end{aligned}$$

Note that here $y$ is short for $y(\mathbf{x}_n+\boldsymbol{\xi})$, $\eta$ is short for $\eta(\mathbf{x}_n+\boldsymbol{\xi})$ and $v$ is short for $v(\boldsymbol{\xi})$.

Several clarifications must be made here. We vary the function $y$ by a small amount (i.e., $\epsilon\eta$) and then expand the corresponding error with respect to the small variation $\epsilon$. The coefficient of $\epsilon$ is the first derivative of the error $E[y+\epsilon\eta]$ with respect to $\epsilon$ at $\epsilon = 0$. Since $y$ is the optimal function that makes $E$ smallest, this first derivative should equal zero at $\epsilon = 0$, which gives:

$$\sum_{n=1}^N\int\left\{y(\mathbf{x}_n+\boldsymbol{\xi})-t_n\right\}\eta(\mathbf{x}_n+\boldsymbol{\xi})\,v(\boldsymbol{\xi})\,d\boldsymbol{\xi} = 0$$

Now we are required to find a function $y$ that satisfies the equation above no matter what $\eta$ is. We choose:

$$\eta(\mathbf{x}) = \delta(\mathbf{x}-\mathbf{z})$$

This allows us to evaluate the integral:

$$\sum_{n=1}^N\int\left\{y(\mathbf{x}_n+\boldsymbol{\xi})-t_n\right\}\eta(\mathbf{x}_n+\boldsymbol{\xi})\,v(\boldsymbol{\xi})\,d\boldsymbol{\xi} = \sum_{n=1}^N\left\{y(\mathbf{z})-t_n\right\}v(\mathbf{z}-\mathbf{x}_n)$$

We set it to zero and rearrange it, which finally gives (6.40) just as required.
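Setting the last sum to zero gives $y(\mathbf{z}) = \sum_n t_n v(\mathbf{z}-\mathbf{x}_n) / \sum_n v(\mathbf{z}-\mathbf{x}_n)$, a kernel-weighted average of the targets. The sketch below illustrates this under stated assumptions: the inputs are one-dimensional, the noise density $v$ is taken to be a zero-mean Gaussian with a made-up width `sigma`, the training data are invented, and the function name `nadaraya_watson` is a label chosen here, with NumPy assumed available.

```python
import numpy as np

def nadaraya_watson(z, x_train, t_train, sigma=0.5):
    # y(z) = sum_n t_n v(z - x_n) / sum_n v(z - x_n), with v assumed Gaussian.
    # The Gaussian normalization constant cancels in the ratio, so it is omitted.
    z = np.atleast_1d(z)
    diff = z[:, None] - x_train[None, :]
    v = np.exp(-0.5 * (diff / sigma) ** 2)
    weights = v / v.sum(axis=1, keepdims=True)   # each row sums to one
    return weights @ t_train

x_train = np.array([0.0, 1.0, 2.0, 3.0])         # made-up training inputs
t_train = np.array([1.0, 3.0, 2.0, 5.0])         # made-up training targets
y_mid = nadaraya_watson(1.5, x_train, t_train)
```

Because the weights are non-negative and sum to one, every prediction lies between the smallest and largest target, and as `sigma` shrinks the prediction at a training input approaches that input's own target.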
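Problem 6.18 Solution

According to the main text below Eq (6.48), we know that $f(x,t)$, i.e., $f(\mathbf{z})$, is a zero-mean isotropic Gaussian:

$$f(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0},\sigma^2\mathbf{I})$$

Then $f(x-x_m,t-t_m)$, i.e., $f(\mathbf{z}-\mathbf{z}_m)$, should also be a Gaussian distribution:

$$f(\mathbf{z}-\mathbf{z}_m) = \mathcal{N}(\mathbf{z}|\mathbf{z}_m,\sigma^2\mathbf{I})$$

where we have defined:

$$\mathbf{z}_m = (x_m,t_m)$$

The integral $\int f(\mathbf{z}-\mathbf{z}_m)\,dt$ corresponds to the marginal distribution with respect to the remaining variable $x$ and, thus, we obtain:

$$\int f(\mathbf{z}-\mathbf{z}_m)\,dt = \mathcal{N}(x|x_m,\sigma^2)$$

We substitute all the expressions into Eq (6.48), which gives:

$$\begin{aligned}
p(t|x) &= \frac{p(t,x)}{\int p(t,x)\,dt} = \frac{\sum_n \mathcal{N}(\mathbf{z}|\mathbf{z}_n,\sigma^2\mathbf{I})}{\sum_m \mathcal{N}(x|x_m,\sigma^2)} \\
&= \frac{\sum_n \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2}(\mathbf{z}-\mathbf{z}_n)^T(\sigma^2\mathbf{I})^{-1}(\mathbf{z}-\mathbf{z}_n)\right)}{\sum_m \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)} \\
&= \frac{\sum_n \frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}(x-x_n)^2\right)\exp\left(-\frac{1}{2\sigma^2}(t-t_n)^2\right)}{\sum_m \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)} \\
&= \sum_n \frac{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_n)^2\right)}{\sum_m \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)}\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(t-t_n)^2\right) \\
&= \sum_n \pi_n\cdot\mathcal{N}(t|t_n,\sigma^2)
\end{aligned}$$

where we have defined:

$$\pi_n = \frac{\exp\left(-\frac{1}{2\sigma^2}(x-x_n)^2\right)}{\sum_m \exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)}$$

We also observe that:

$$\sum_n \pi_n = 1$$

Therefore, the conditional distribution $p(t|x)$ is given by a Gaussian mixture.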
Similarly, we attempt to find a specific form for Eq (6.46):

$$k(x,x_n) = \frac{\int f(x-x_n,t)\,dt}{\sum_m\int f(x-x_m,t)\,dt} = \frac{\mathcal{N}(x|x_n,\sigma^2)}{\sum_m \mathcal{N}(x|x_m,\sigma^2)} = \pi_n$$

In other words, the conditional distribution can be more precisely written as:

$$p(t|x) = \sum_n k(x,x_n)\cdot\mathcal{N}(t|t_n,\sigma^2)$$

Thus its mean is given by:

$$\mathbb{E}[t|x] = \sum_n k(x,x_n)\cdot t_n$$

Its variance is given by:

$$\mathrm{var}[t|x] = \mathbb{E}[t^2|x] - \mathbb{E}[t|x]^2 = \sum_n k(x,x_n)\cdot(t_n^2+\sigma^2) - \left(\sum_n k(x,x_n)\cdot t_n\right)^2$$

Problem 6.19 Solution

Similar to Prob.6.17, it is straightforward to show that:

$$y(\mathbf{x}) = \sum_n t_n\,k(\mathbf{x},\mathbf{x}_n)$$

where we have defined:

$$k(\mathbf{x},\mathbf{x}_n) = \frac{g(\mathbf{x}_n-\mathbf{x})}{\sum_m g(\mathbf{x}_m-\mathbf{x})}$$

Problem 6.20 Solution

We know that $\mathbf{t}_{N+1} = (t_1,t_2,\dots,t_N,t_{N+1})^T$ follows a Gaussian distribution, i.e., $\mathbf{t}_{N+1} \sim \mathcal{N}(\mathbf{t}_{N+1}|\mathbf{0},\mathbf{C}_{N+1})$ as given in Eq (6.64). If we rearrange its order by putting the last element (i.e., $t_{N+1}$) in the first position, and denote the result as $\bar{\mathbf{t}}_{N+1}$, it should also follow a Gaussian distribution:

$$\bar{\mathbf{t}}_{N+1} = (t_{N+1},t_1,t_2,\dots,t_N)^T \sim \mathcal{N}(\bar{\mathbf{t}}_{N+1}|\mathbf{0},\bar{\mathbf{C}}_{N+1})$$

where we have defined:

$$\bar{\mathbf{C}}_{N+1} = \begin{pmatrix} c & \mathbf{k}^T \\ \mathbf{k} & \mathbf{C}_N \end{pmatrix}$$

Here $\mathbf{k}$ and $c$ are given in the main text below Eq (6.65). The conditional distribution $p(t_{N+1}|\mathbf{t}_N)$ should also be a Gaussian. By analogy to Eq (2.94)-(2.98), we can simply treat $t_{N+1}$ as $\mathbf{x}_a$, $\mathbf{t}_N$ as $\mathbf{x}_b$, $c$ as $\boldsymbol{\Sigma}_{aa}$, $\mathbf{k}$ as $\boldsymbol{\Sigma}_{ba}$, $\mathbf{k}^T$ as $\boldsymbol{\Sigma}_{ab}$ and $\mathbf{C}_N$ as $\boldsymbol{\Sigma}_{bb}$.
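Substituting them into Eq (2.79) and Eq (2.80) yields:

$$\boldsymbol{\Lambda}_{aa} = (c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}$$

and

$$\boldsymbol{\Lambda}_{ab} = -(c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}$$

Then we substitute them into Eq (2.96) and (2.97), which yields:

$$p(t_{N+1}|\mathbf{t}_N) = \mathcal{N}(\mu_{a|b},\boldsymbol{\Lambda}_{aa}^{-1})$$

For its mean $\mu_{a|b}$, we have:

$$\mu_{a|b} = 0 - \left(c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}\right)\cdot\left[-(c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}\right]\cdot(\mathbf{t}_N-\mathbf{0}) = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}_N = m(\mathbf{x}_{N+1})$$

Similarly, for its variance $\boldsymbol{\Lambda}_{aa}^{-1}$ (note that since $t_{N+1}$ is a scalar, the mean and the covariance matrix reduce to the one-dimensional case), we have:

$$\boldsymbol{\Lambda}_{aa}^{-1} = c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} = \sigma^2(\mathbf{x}_{N+1})$$

Problem 6.21 Solution

We follow the hint, beginning by verifying the mean. We write Eq (6.62) in matrix form:

$$\mathbf{C}_N = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N$$

where we have used Eq (6.54). Here $\boldsymbol{\Phi}$ is the design matrix defined below Eq (6.51) and $\mathbf{I}_N$ is an identity matrix. Before we use Eq (6.66), we need to obtain $\mathbf{k}$:

$$\begin{aligned}
\mathbf{k} &= [k(\mathbf{x}_1,\mathbf{x}_{N+1}),k(\mathbf{x}_2,\mathbf{x}_{N+1}),\dots,k(\mathbf{x}_N,\mathbf{x}_{N+1})]^T \\
&= \frac{1}{\alpha}[\boldsymbol{\phi}(\mathbf{x}_1)^T\boldsymbol{\phi}(\mathbf{x}_{N+1}),\boldsymbol{\phi}(\mathbf{x}_2)^T\boldsymbol{\phi}(\mathbf{x}_{N+1}),\dots,\boldsymbol{\phi}(\mathbf{x}_N)^T\boldsymbol{\phi}(\mathbf{x}_{N+1})]^T \\
&= \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\phi}(\mathbf{x}_{N+1})
\end{aligned}$$

Now we substitute all the expressions into Eq (6.66), yielding:

$$m(\mathbf{x}_{N+1}) = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\boldsymbol{\Phi}^T\left[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N\right]^{-1}\mathbf{t}$$

Next, using matrix identity (C.6), we obtain:

$$\boldsymbol{\Phi}^T\left[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N\right]^{-1} = \alpha\beta\left[\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I}_M\right]^{-1}\boldsymbol{\Phi}^T = \alpha\beta\,\mathbf{S}_N\boldsymbol{\Phi}^T$$

where we have used Eq (3.54). Substituting it into $m(\mathbf{x}_{N+1})$, we obtain:

$$m(\mathbf{x}_{N+1}) = \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = \left\langle\boldsymbol{\phi}(\mathbf{x}_{N+1}),\;\beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}\right\rangle$$

where $\langle\cdot,\cdot\rangle$ represents the inner product. Comparing the result above with Eq (3.58), (3.54) and (3.53), we conclude that the means are equal. It is similar for the variance. We substitute $c$, $\mathbf{k}$ and $\mathbf{C}_N$ into Eq (6.67). Then we simplify the expression using matrix identity (C.7). Finally, we will observe that it is equal to Eq (3.59).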
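The equivalence claimed in Problem 6.21 can also be confirmed numerically. The sketch below is a minimal illustration, assuming NumPy and a randomly generated design matrix, targets and test feature vector (all invented for the check, as are the values of $\alpha$ and $\beta$). It evaluates the Gaussian-process predictive mean and variance with kernel $k(\mathbf{x},\mathbf{x}') = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$, and compares them with the Bayesian linear regression results $\beta\boldsymbol{\phi}^T\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$ and $\beta^{-1} + \boldsymbol{\phi}^T\mathbf{S}_N\boldsymbol{\phi}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, alpha, beta = 6, 3, 2.0, 25.0     # illustrative sizes and hyperparameters
Phi = rng.normal(size=(N, M))           # made-up design matrix
t = rng.normal(size=N)                  # made-up targets
phi_new = rng.normal(size=M)            # made-up feature vector phi(x_{N+1})

# Gaussian process view: C_N, k and c built from the kernel alpha^{-1} phi^T phi'.
C_N = Phi @ Phi.T / alpha + np.eye(N) / beta
k = Phi @ phi_new / alpha
c = phi_new @ phi_new / alpha + 1.0 / beta
gp_mean = k @ np.linalg.solve(C_N, t)
gp_var = c - k @ np.linalg.solve(C_N, k)

# Bayesian linear regression view: S_N = (alpha I + beta Phi^T Phi)^{-1}.
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
blr_mean = beta * phi_new @ S_N @ Phi.T @ t
blr_var = 1.0 / beta + phi_new @ S_N @ phi_new
```

The two means and the two variances agree up to floating-point error, which is the numerical counterpart of applying the matrix identities (C.6) and (C.7).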
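Problem 6.22 Solution

Based on Eq (6.64) and (6.65), we first write down the joint distribution for $\mathbf{t}_{N+L} = [t_1(\mathbf{x}),t_2(\mathbf{x}),\dots,t_{N+L}(\mathbf{x})]^T$:

$$p(\mathbf{t}_{N+L}) = \mathcal{N}(\mathbf{t}_{N+L}|\mathbf{0},\mathbf{C}_{N+L})$$

where $\mathbf{C}_{N+L}$ is similarly given by:

$$\mathbf{C}_{N+L} = \begin{pmatrix} \mathbf{C}_{1,N} & \mathbf{K} \\ \mathbf{K}^T & \mathbf{C}_{N+1,N+L} \end{pmatrix}$$

The expression above has already implicitly divided the vector $\mathbf{t}_{N+L}$ into two parts. Similar to Prob.6.20, for later simplicity we rearrange the order of $\mathbf{t}_{N+L}$, denoted as $\bar{\mathbf{t}}_{N+L} = [t_{N+1},\dots,t_{N+L},t_1,\dots,t_N]^T$. Moreover, $\bar{\mathbf{t}}_{N+L}$ should also follow a Gaussian distribution:

$$p(\bar{\mathbf{t}}_{N+L}) = \mathcal{N}(\bar{\mathbf{t}}_{N+L}|\mathbf{0},\bar{\mathbf{C}}_{N+L})$$

where we have defined:

$$\bar{\mathbf{C}}_{N+L} = \begin{pmatrix} \mathbf{C}_{N+1,N+L} & \mathbf{K}^T \\ \mathbf{K} & \mathbf{C}_{1,N} \end{pmatrix}$$

Now we use Eq (2.94)-(2.98) and Eq (2.79)-(2.80) to derive the conditional distribution, beginning by calculating $\boldsymbol{\Lambda}_{aa}$:

$$\boldsymbol{\Lambda}_{aa} = (\mathbf{C}_{N+1,N+L} - \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{K})^{-1}$$

and $\boldsymbol{\Lambda}_{ab}$:

$$\boldsymbol{\Lambda}_{ab} = -(\mathbf{C}_{N+1,N+L} - \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{K})^{-1}\cdot\mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}$$

Now we can obtain:

$$p(t_{N+1},\dots,t_{N+L}|\mathbf{t}_N) = \mathcal{N}(\boldsymbol{\mu}_{a|b},\boldsymbol{\Lambda}_{aa}^{-1})$$

where we have defined:

$$\boldsymbol{\mu}_{a|b} = \mathbf{0} + \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{t}_N = \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{t}_N$$

If we now want to find the conditional distribution $p(t_j|\mathbf{t}_N)$, where $N+1 \le j \le N+L$, we only need to find the corresponding entry in the mean (i.e., the $(j-N)$-th entry) and covariance matrix (i.e., the $(j-N)$-th diagonal entry) of $p(t_{N+1},\dots,t_{N+L}|\mathbf{t}_N)$. In this case, it reduces to Eq (6.66) and (6.67) just as required.

Problem 6.24 Solution

By definition, we only need to prove that for an arbitrary vector $\mathbf{x} \neq \mathbf{0}$, $\mathbf{x}^T\mathbf{W}\mathbf{x}$ is positive. Here suppose that $\mathbf{W}$ is an $M\times M$ matrix. We expand the multiplication:

$$\mathbf{x}^T\mathbf{W}\mathbf{x} = \sum_{i=1}^M\sum_{j=1}^M W_{ij}\cdot x_i\cdot x_j = \sum_{i=1}^M W_{ii}\cdot x_i^2$$

where we have used the fact that $\mathbf{W}$ is a diagonal matrix. Since $W_{ii} > 0$, we obtain $\mathbf{x}^T\mathbf{W}\mathbf{x} > 0$ just as required. Suppose we have two positive definite matrices, denoted as $\mathbf{A}_1$ and $\mathbf{A}_2$, i.e., for an arbitrary vector $\mathbf{x} \neq \mathbf{0}$, we have $\mathbf{x}^T\mathbf{A}_1\mathbf{x} > 0$ and $\mathbf{x}^T\mathbf{A}_2\mathbf{x} > 0$. Therefore, we can obtain:

$$\mathbf{x}^T(\mathbf{A}_1+\mathbf{A}_2)\mathbf{x} = \mathbf{x}^T\mathbf{A}_1\mathbf{x} + \mathbf{x}^T\mathbf{A}_2\mathbf{x} > 0$$

Just as required.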
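Problem 6.25 Solution

Based on the Newton-Raphson formula, Eq (6.81) and Eq (6.82), we have:

$$\begin{aligned}
\mathbf{a}_N^{\text{new}} &= \mathbf{a}_N - (-\mathbf{W}_N - \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N) \\
&= \mathbf{a}_N + (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N) \\
&= (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}\left[(\mathbf{W}_N + \mathbf{C}_N^{-1})\mathbf{a}_N + \mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N\right] \\
&= \mathbf{C}_N\mathbf{C}_N^{-1}(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N) \\
&= \mathbf{C}_N(\mathbf{I} + \mathbf{W}_N\mathbf{C}_N)^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N)
\end{aligned}$$

Just as required.

Problem 6.26 Solution

Using Eq (6.77), (6.78) and (6.86), we can obtain:

$$\begin{aligned}
p(a_{N+1}|\mathbf{t}_N) &= \int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N|\mathbf{t}_N)\,d\mathbf{a}_N \\
&= \int \mathcal{N}(a_{N+1}|\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N,\;c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})\cdot\mathcal{N}(\mathbf{a}_N|\mathbf{a}_N^{\star},\mathbf{H}^{-1})\,d\mathbf{a}_N
\end{aligned}$$

By analogy to Eq (2.115), i.e.,

$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$

we can obtain:

$$p(a_{N+1}|\mathbf{t}_N) = \mathcal{N}(\mathbf{A}\boldsymbol{\mu}+\mathbf{b},\;\mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T) \qquad (*)$$

where we have defined:

$$\mathbf{A} = \mathbf{k}^T\mathbf{C}_N^{-1},\quad \mathbf{b} = \mathbf{0},\quad \mathbf{L}^{-1} = c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}$$

and

$$\boldsymbol{\mu} = \mathbf{a}_N^{\star},\quad \boldsymbol{\Lambda} = \mathbf{H}$$

Therefore, the mean is given by:

$$\mathbf{A}\boldsymbol{\mu}+\mathbf{b} = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N^{\star} = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{C}_N(\mathbf{t}_N-\boldsymbol{\sigma}_N) = \mathbf{k}^T(\mathbf{t}_N-\boldsymbol{\sigma}_N)$$

where we have used Eq (6.84). The covariance matrix is given by:

$$\begin{aligned}
\mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T &= c-\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} + \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{H}^{-1}(\mathbf{k}^T\mathbf{C}_N^{-1})^T \\
&= c-\mathbf{k}^T\left(\mathbf{C}_N^{-1}-\mathbf{C}_N^{-1}\mathbf{H}^{-1}\mathbf{C}_N^{-1}\right)\mathbf{k} \\
&= c-\mathbf{k}^T\left(\mathbf{C}_N^{-1}-\mathbf{C}_N^{-1}(\mathbf{W}_N+\mathbf{C}_N^{-1})^{-1}\mathbf{C}_N^{-1}\right)\mathbf{k} \\
&= c-\mathbf{k}^T\left(\mathbf{C}_N^{-1}-(\mathbf{C}_N\mathbf{W}_N\mathbf{C}_N+\mathbf{C}_N)^{-1}\right)\mathbf{k}
\end{aligned}$$

where we have used Eq (6.85) and the fact that $\mathbf{C}_N$ is symmetric. Then we use matrix identity (C.7) to further reduce the expression, which finally gives Eq (6.88).

Problem 6.27 Solution

(Wait for update) This problem is really complicated. What's more, I find that Eq (6.91) seems not right.

0.7 Sparse Kernel Machines

Problem 7.1 Solution

By analogy to Eq (2.249), we can obtain:

$$p(\mathbf{x}|t) = \begin{cases} \dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}\dfrac{1}{Z_k}\cdot k(\mathbf{x},\mathbf{x}_n) & t = +1 \\[2ex] \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}\dfrac{1}{Z_k}\cdot k(\mathbf{x},\mathbf{x}_n) & t = -1 \end{cases}$$

where $N_{+1}$ represents the number of samples with label $t = +1$, and similarly for $N_{-1}$; each sum runs over the samples of the corresponding class. $Z_k$ is a normalization constant representing the volume of the hypercube. Since we have equal priors for the classes, i.e.,

$$p(t) = \begin{cases} 0.5 & t = +1 \\ 0.5 & t = -1 \end{cases}$$

Based on Bayes' theorem, we have $p(t|\mathbf{x}) \propto p(\mathbf{x}|t)\cdot p(t)$, yielding:

$$p(t|\mathbf{x}) = \begin{cases} \dfrac{1}{Z}\cdot\dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}k(\mathbf{x},\mathbf{x}_n) & t = +1 \\[2ex] \dfrac{1}{Z}\cdot\dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}k(\mathbf{x},\mathbf{x}_n) & t = -1 \end{cases}$$

where $1/Z$ is a normalization constant that guarantees the posterior sums to 1 over $t$. To classify a new sample $\mathbf{x}$, we try to find the value $t^{\star}$ that maximizes $p(t|\mathbf{x})$. Therefore, we can obtain:

$$t^{\star} = \begin{cases} +1 & \text{if } \dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}k(\mathbf{x},\mathbf{x}_n) \ge \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}k(\mathbf{x},\mathbf{x}_n) \\[2ex] -1 & \text{if } \dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}k(\mathbf{x},\mathbf{x}_n) \le \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}k(\mathbf{x},\mathbf{x}_n) \end{cases} \qquad (*)$$

If we now choose the kernel function as $k(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{x}'$, we have:

$$\frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}k(\mathbf{x},\mathbf{x}_n) = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}\mathbf{x}^T\mathbf{x}_n = \mathbf{x}^T\tilde{\mathbf{x}}_{+1}$$

where we have denoted:

$$\tilde{\mathbf{x}}_{+1} = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}\mathbf{x}_n$$

and similarly for $\tilde{\mathbf{x}}_{-1}$. Therefore, the classification criterion $(*)$ can be written as:

$$t^{\star} = \begin{cases} +1 & \text{if } \mathbf{x}^T\tilde{\mathbf{x}}_{+1} \ge \mathbf{x}^T\tilde{\mathbf{x}}_{-1} \\ -1 & \text{if } \mathbf{x}^T\tilde{\mathbf{x}}_{+1} \le \mathbf{x}^T\tilde{\mathbf{x}}_{-1} \end{cases}$$

When we choose the kernel function as $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$, we can similarly obtain the classification criterion:

$$t^{\star} = \begin{cases} +1 & \text{if } \boldsymbol{\phi}(\mathbf{x})^T\tilde{\boldsymbol{\phi}}(\mathbf{x}_{+1}) \ge \boldsymbol{\phi}(\mathbf{x})^T\tilde{\boldsymbol{\phi}}(\mathbf{x}_{-1}) \\ -1 & \text{if } \boldsymbol{\phi}(\mathbf{x})^T\tilde{\boldsymbol{\phi}}(\mathbf{x}_{+1}) \le \boldsymbol{\phi}(\mathbf{x})^T\tilde{\boldsymbol{\phi}}(\mathbf{x}_{-1}) \end{cases}$$

where we have defined:

$$\tilde{\boldsymbol{\phi}}(\mathbf{x}_{+1}) = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}\boldsymbol{\phi}(\mathbf{x}_n)$$

Problem 7.2 Solution

Suppose we have found $\mathbf{w}_0$ and $b_0$, which let all points satisfy Eq (7.5) and simultaneously minimize Eq (7.3). The hyperplane determined by $\mathbf{w}_0$ and $b_0$ is the optimal classification hyperplane. Now if the constraint in Eq (7.5) becomes:

$$t_n(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)+b) \ge \gamma$$

we can conclude that if we perform the change of variables $\mathbf{w}_0 \to \gamma\mathbf{w}_0$ and $b_0 \to \gamma b_0$, the constraint will still be satisfied and Eq (7.3) will be minimized. In other words, if the right side of the constraint changes from 1 to $\gamma$, the new hyperplane determined by $\gamma\mathbf{w}_0$ and $\gamma b_0$ is the optimal classification hyperplane. However, the minimum distance from the points to the hyperplane is still the same.

Problem 7.3 Solution

Suppose $\mathbf{x}_1$ belongs to class one, with target value $t_1 = 1$, and similarly $\mathbf{x}_2$ belongs to class two, with target value $t_2 = -1$.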
Since we only have two points, they must have $t_i\cdot y(\mathbf{x}_i) = 1$ as shown in Fig. 7.1. Therefore, we have an equality-constrained optimization problem:

$$\text{minimize}\quad \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.}\quad \begin{cases} \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_1)+b = 1 \\ \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_2)+b = -1 \end{cases}$$

This is a convex optimization problem, and it has been proved that a global optimum exists.

Problem 7.4 Solution

Since we know that

$$\rho = \frac{1}{\|\mathbf{w}\|}$$

we have:

$$\frac{1}{\rho^2} = \|\mathbf{w}\|^2$$

In other words, we only need to prove that

$$\|\mathbf{w}\|^2 = \sum_{n=1}^N a_n$$

When we find the optimal solution, the second term on the right hand side of Eq (7.7) vanishes. Based on Eq (7.8) and Eq (7.10), we also observe that its dual is given by:

$$\widetilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2}\|\mathbf{w}\|^2$$

Therefore, we have:

$$\frac{1}{2}\|\mathbf{w}\|^2 = L(\mathbf{a}) = \widetilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2}\|\mathbf{w}\|^2$$

Rearranging it, we obtain what is required.

Problem 7.5 Solution

We have already proved this in the previous problem.

Problem 7.6 Solution

If the target variable can only take values in $\{-1, 1\}$, and we know that

$$p(t = 1|y) = \sigma(y)$$

we can obtain:

$$p(t = -1|y) = 1 - p(t = 1|y) = 1 - \sigma(y) = \sigma(-y)$$

Therefore, combining these two cases, we can derive:

$$p(t|y) = \sigma(yt)$$

Consequently, we can obtain the negative log likelihood:

$$-\ln p(\mathcal{D}) = -\ln\prod_{n=1}^N \sigma(y_n t_n) = -\sum_{n=1}^N \ln\sigma(y_n t_n) = \sum_{n=1}^N E_{LR}(y_n t_n)$$

Here $\mathcal{D}$ represents the dataset, i.e., $\mathcal{D} = \{(\mathbf{x}_n,t_n);\,n = 1, 2, \dots, N\}$, and $E_{LR}(yt)$ is given by Eq (7.48). With the addition of a quadratic regularization term, we obtain exactly Eq (7.47).

Problem 7.7 Solution

The derivatives are easy to obtain. Our main task is to derive Eq (7.61) using Eq (7.57)-(7.60).
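$$\begin{aligned}
L &= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(\mu_n\xi_n + \hat{\mu}_n\hat{\xi}_n) - \sum_{n=1}^{N} a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^{N} \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \\
&= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \mu_n)\xi_n - \sum_{n=1}^{N}(\hat{a}_n + \hat{\mu}_n)\hat{\xi}_n - \sum_{n=1}^{N} a_n(\epsilon + y_n - t_n) - \sum_{n=1}^{N} \hat{a}_n(\epsilon - y_n + t_n) \\
&= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - C\sum_{n=1}^{N}\xi_n - C\sum_{n=1}^{N}\hat{\xi}_n - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon - \sum_{n=1}^{N}(a_n - \hat{a}_n)(y_n - t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon - \sum_{n=1}^{N}(a_n - \hat{a}_n)(y_n - t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n - \hat{a}_n)(\mathbf{w}^T\phi(\mathbf{x}_n) + b - t_n) - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n - \hat{a}_n)(\mathbf{w}^T\phi(\mathbf{x}_n) + b) - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n - \hat{a}_n)\mathbf{w}^T\phi(\mathbf{x}_n) - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \\
&= -\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n
\end{aligned}$$

The third line uses Eq (7.59) and (7.60), i.e., $a_n + \mu_n = C$ and $\hat{a}_n + \hat{\mu}_n = C$; the seventh line uses Eq (7.58), i.e., $\sum_n (a_n - \hat{a}_n) = 0$; and the eighth line uses Eq (7.57), i.e., $\mathbf{w} = \sum_n (a_n - \hat{a}_n)\phi(\mathbf{x}_n)$. Substituting Eq (7.57) into $\|\mathbf{w}\|^2$ gives Eq (7.61), just as required.

Problem 7.8 Solution

This follows directly from the KKT conditions described in Eq (7.67) and (7.68).

Problem 7.9 Solution

The prior is given by Eq (7.80):

$$p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(w_i|0, \alpha_i^{-1}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, A^{-1})$$

Where we have defined:

$$A = \mathrm{diag}(\alpha_i)$$

The likelihood is given by Eq (7.79):

$$p(\mathbf{t}|X, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \beta^{-1}) = \mathcal{N}(\mathbf{t}|\Phi\mathbf{w}, \beta^{-1}I)$$

Where we have defined:

$$\Phi = [\phi(\mathbf{x}_1), \phi(\mathbf{x}_2), ..., \phi(\mathbf{x}_N)]^T$$

Our definitions of $\Phi$ and $A$ are consistent with the main text. Therefore, according to Eq (2.113)-Eq (2.117), we have:

$$p(\mathbf{w}|\mathbf{t}, X, \boldsymbol{\alpha}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \Sigma)$$

Where we have defined:

$$\Sigma = (A + \beta\Phi^T\Phi)^{-1}$$

And

$$\mathbf{m} = \beta\Sigma\Phi^T\mathbf{t}$$

Just as required.

Problem 7.10&7.11 Solution

It is quite similar to the previous problem.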
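We begin by writing down the prior:

$$p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(w_i|0, \alpha_i^{-1}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, A^{-1})$$

Then we write down the likelihood:

$$p(\mathbf{t}|X, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \beta^{-1}) = \mathcal{N}(\mathbf{t}|\Phi\mathbf{w}, \beta^{-1}I)$$

We know that:

$$p(\mathbf{t}|X, \boldsymbol{\alpha}, \beta) = \int p(\mathbf{t}|X, \mathbf{w}, \beta)\, p(\mathbf{w}|\boldsymbol{\alpha})\, d\mathbf{w}$$

First, as required by Prob.7.10, we will solve it by completing the square. We begin by writing down the expression for $p(\mathbf{t}|X, \boldsymbol{\alpha}, \beta)$:

$$p(\mathbf{t}|X, \boldsymbol{\alpha}, \beta) = \int \mathcal{N}(\mathbf{w}|\mathbf{0}, A^{-1})\, \mathcal{N}(\mathbf{t}|\Phi\mathbf{w}, \beta^{-1}I)\, d\mathbf{w} = \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}$$

Where we have defined:

$$E(\mathbf{w}) = \frac{1}{2}\mathbf{w}^TA\mathbf{w} + \frac{\beta}{2}\|\mathbf{t} - \Phi\mathbf{w}\|^2$$

We expand $E(\mathbf{w})$ with respect to $\mathbf{w}$:

$$\begin{aligned}
E(\mathbf{w}) &= \frac{1}{2}\left\{\mathbf{w}^T(A + \beta\Phi^T\Phi)\mathbf{w} - 2\beta\mathbf{t}^T\Phi\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right\} \\
&= \frac{1}{2}\left\{\mathbf{w}^T\Sigma^{-1}\mathbf{w} - 2\mathbf{m}^T\Sigma^{-1}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right\} \\
&= \frac{1}{2}\left\{(\mathbf{w} - \mathbf{m})^T\Sigma^{-1}(\mathbf{w} - \mathbf{m}) + \beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m}\right\}
\end{aligned}$$

Where we have used Eq (7.82) and Eq (7.83). Substituting $E(\mathbf{w})$ into the integral, we obtain:

$$\begin{aligned}
p(\mathbf{t}|X, \boldsymbol{\alpha}, \beta) &= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot (2\pi)^{M/2}|\Sigma|^{1/2} \exp\left\{-\frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m})\right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot |\Sigma|^{1/2} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \exp\left\{-\frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m})\right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot |\Sigma|^{1/2} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \exp\{-E(\mathbf{t})\}
\end{aligned}$$

We further expand $E(\mathbf{t})$:

$$\begin{aligned}
E(\mathbf{t}) &= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m}) \\
&= \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - (\beta\Sigma\Phi^T\mathbf{t})^T\Sigma^{-1}(\beta\Sigma\Phi^T\mathbf{t})\right) \\
&= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\Phi\Sigma\Sigma^{-1}\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T(\beta I - \beta^2\Phi\Sigma\Phi^T)\mathbf{t} \\
&= \frac{1}{2}\mathbf{t}^T\left[\beta I - \beta\Phi(A + \beta\Phi^T\Phi)^{-1}\Phi^T\beta\right]\mathbf{t} \\
&= \frac{1}{2}\mathbf{t}^T(\beta^{-1}I + \Phi A^{-1}\Phi^T)^{-1}\mathbf{t} = \frac{1}{2}\mathbf{t}^TC^{-1}\mathbf{t}
\end{aligned}$$

Note that in the last step we have used the matrix identity Eq (C.7). Since the density is Gaussian and its exponent is given by $E(\mathbf{t})$, we can write down Eq (7.85) once the normalization constants are taken into account. Moreover, as required by Prob.7.11, the integral can also be evaluated directly using Eq (2.113)-Eq (2.117).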
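The identity $E(\mathbf{t}) = \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m}) = \frac{1}{2}\mathbf{t}^TC^{-1}\mathbf{t}$ can be checked numerically. The following is only a minimal sketch in plain Python: the design matrix, targets and hyperparameter values are made up for illustration, and the helper names (`solve`, `e_t_two_ways`) are hypothetical rather than part of any library.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def e_t_two_ways(Phi, t, alpha, beta):
    n, m_dim = len(Phi), len(alpha)
    # Sigma^{-1} = A + beta * Phi^T Phi, Eq (7.83)
    sigma_inv = [[(alpha[i] if i == j else 0.0)
                  + beta * sum(Phi[k][i] * Phi[k][j] for k in range(n))
                  for j in range(m_dim)] for i in range(m_dim)]
    # m = beta * Sigma * Phi^T t, Eq (7.82), via Sigma^{-1} m = beta * Phi^T t
    phi_t = [sum(Phi[k][i] * t[k] for k in range(n)) for i in range(m_dim)]
    m = solve(sigma_inv, [beta * v for v in phi_t])
    e_posterior = 0.5 * (beta * dot(t, t) - dot(m, matvec(sigma_inv, m)))
    # C = beta^{-1} I + Phi A^{-1} Phi^T, Eq (7.86)
    C = [[(1.0 / beta if r == c else 0.0)
          + sum(Phi[r][i] * Phi[c][i] / alpha[i] for i in range(m_dim))
          for c in range(n)] for r in range(n)]
    e_marginal = 0.5 * dot(t, solve(C, t))
    return e_posterior, e_marginal

# Made-up example with N = 3 data points and M = 2 basis functions.
Phi = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
t = [0.3, -1.2, 2.0]
e1, e2 = e_t_two_ways(Phi, t, alpha=[0.7, 1.5], beta=2.0)
print(e1, e2)
```

Both printed values should agree up to floating-point rounding, which is what the matrix identity Eq (C.7) predicts.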
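Problem 7.12 Solution

According to the previous problem, we can explicitly write down the log marginal likelihood in an alternative form:

$$\ln p(\mathbf{t}|X, \boldsymbol{\alpha}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi + \frac{1}{2}\ln|\Sigma| + \frac{1}{2}\sum_{i=1}^{M}\ln\alpha_i - E(\mathbf{t})$$

We first derive:

$$\begin{aligned}
\frac{dE(\mathbf{t})}{d\alpha_i} &= -\frac{1}{2}\frac{d}{d\alpha_i}(\mathbf{m}^T\Sigma^{-1}\mathbf{m}) \\
&= -\frac{1}{2}\frac{d}{d\alpha_i}(\beta^2\mathbf{t}^T\Phi\Sigma\Sigma^{-1}\Sigma\Phi^T\mathbf{t}) \\
&= -\frac{1}{2}\frac{d}{d\alpha_i}(\beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= -\frac{1}{2}\mathrm{Tr}\left[\frac{d}{d\Sigma^{-1}}(\beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \cdot \frac{d\Sigma^{-1}}{d\alpha_i}\right] \\
&= \frac{1}{2}\beta^2\,\mathrm{Tr}\left[\Sigma(\Phi^T\mathbf{t})(\Phi^T\mathbf{t})^T\Sigma \cdot I_i\right] = \frac{1}{2}m_i^2
\end{aligned}$$

In the last step, we have utilized the following equation:

$$\frac{d}{dX}\mathrm{Tr}(AX^{-1}B) = -X^{-T}A^TB^TX^{-T}$$

Moreover, here $I_i$ is a matrix with all elements equal to zero, except the $i$-th diagonal element, which equals 1. Then we utilize the matrix identity Eq (C.22) to derive:

$$\frac{d\ln|\Sigma|}{d\alpha_i} = -\frac{d\ln|\Sigma^{-1}|}{d\alpha_i} = -\mathrm{Tr}\left[\Sigma\frac{d}{d\alpha_i}(A + \beta\Phi^T\Phi)\right] = -\Sigma_{ii}$$

Therefore, we can obtain:

$$\frac{d\ln p}{d\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}m_i^2 - \frac{1}{2}\Sigma_{ii}$$

Setting it to zero, we obtain:

$$\alpha_i = \frac{1 - \alpha_i\Sigma_{ii}}{m_i^2} = \frac{\gamma_i}{m_i^2}$$

Then we calculate the derivative of $\ln p$ with respect to $\beta$, beginning with:

$$\frac{d\ln|\Sigma|}{d\beta} = -\frac{d\ln|\Sigma^{-1}|}{d\beta} = -\mathrm{Tr}\left[\Sigma\frac{d}{d\beta}(A + \beta\Phi^T\Phi)\right] = -\mathrm{Tr}\left[\Sigma\Phi^T\Phi\right]$$

Then we continue:

$$\begin{aligned}
\frac{dE(\mathbf{t})}{d\beta} &= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\mathbf{m}^T\Sigma^{-1}\mathbf{m}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\beta^2\mathbf{t}^T\Phi\Sigma\Sigma^{-1}\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \beta\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t} - \frac{1}{2}\beta^2\frac{d}{d\beta}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t} - \beta^2\frac{d}{d\beta}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t})\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) - \beta^2\frac{d}{d\beta}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t})\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) - \beta^2\,\mathrm{Tr}\left[\frac{d}{d\Sigma^{-1}}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \cdot \frac{d\Sigma^{-1}}{d\beta}\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) + \beta^2\,\mathrm{Tr}\left[\Sigma(\Phi^T\mathbf{t})(\Phi^T\mathbf{t})^T\Sigma \cdot \Phi^T\Phi\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) + \mathrm{Tr}\left[\mathbf{m}\mathbf{m}^T \cdot \Phi^T\Phi\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) + \mathrm{Tr}\left[\Phi\mathbf{m}\mathbf{m}^T\Phi^T\right]\right\} \\
&= \frac{1}{2}\|\mathbf{t} - \Phi\mathbf{m}\|^2
\end{aligned}$$

Therefore, we have obtained:

$$\frac{d\ln p}{d\beta} = \frac{1}{2}\left(\frac{N}{\beta} - \|\mathbf{t} - \Phi\mathbf{m}\|^2 - \mathrm{Tr}[\Sigma\Phi^T\Phi]\right)$$

Using Eq (7.83), we can obtain:

$$\begin{aligned}
\Sigma\Phi^T\Phi &= \Sigma\Phi^T\Phi + \beta^{-1}\Sigma A - \beta^{-1}\Sigma A \\
&= \Sigma(\beta\Phi^T\Phi + A)\beta^{-1} - \beta^{-1}\Sigma A \\
&= I\beta^{-1} - \beta^{-1}\Sigma A \\
&= (I - \Sigma A)\beta^{-1}
\end{aligned}$$

Setting the derivative equal to zero, we can obtain:

$$\beta^{-1} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2}{N - \mathrm{Tr}(I - \Sigma A)} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2}{N - \sum_i\gamma_i}$$

Just as required.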
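The two gradients derived in Problem 7.12 can be sanity-checked against finite differences. The sketch below assumes a single basis function ($M = 1$), so that $\Sigma$, $\mathbf{m}$ and $A$ reduce to scalars; the data and hyperparameter values are invented for illustration, and the function names are hypothetical.

```python
import math

def log_evidence(alpha, beta, phi, t):
    """ln p(t | alpha, beta) for a single basis function (M = 1)."""
    n = len(t)
    s = sum(p * p for p in phi)                # Phi^T Phi (a scalar here)
    u = sum(p * x for p, x in zip(phi, t))     # Phi^T t (a scalar here)
    sigma = 1.0 / (alpha + beta * s)           # Eq (7.83)
    m = beta * sigma * u                       # Eq (7.82)
    e_t = 0.5 * (beta * sum(x * x for x in t) - m * m / sigma)
    return (0.5 * n * math.log(beta) - 0.5 * n * math.log(2 * math.pi)
            + 0.5 * math.log(sigma) + 0.5 * math.log(alpha) - e_t)

def analytic_gradients(alpha, beta, phi, t):
    """Gradients as derived in the solution above, specialised to M = 1."""
    n = len(t)
    s = sum(p * p for p in phi)
    u = sum(p * x for p, x in zip(phi, t))
    sigma = 1.0 / (alpha + beta * s)
    m = beta * sigma * u
    resid = sum((x - p * m) ** 2 for p, x in zip(phi, t))   # ||t - Phi m||^2
    d_alpha = 1.0 / (2 * alpha) - 0.5 * m * m - 0.5 * sigma
    d_beta = 0.5 * (n / beta - resid - sigma * s)
    return d_alpha, d_beta

# Made-up data and hyperparameters.
phi = [1.0, -0.5, 2.0, 0.3]
t = [0.9, -0.2, 2.3, 0.1]
alpha, beta, h = 0.8, 1.7, 1e-6

# Central finite differences of the log evidence.
num_alpha = (log_evidence(alpha + h, beta, phi, t)
             - log_evidence(alpha - h, beta, phi, t)) / (2 * h)
num_beta = (log_evidence(alpha, beta + h, phi, t)
            - log_evidence(alpha, beta - h, phi, t)) / (2 * h)
d_alpha, d_beta = analytic_gradients(alpha, beta, phi, t)
print(d_alpha, num_alpha)
print(d_beta, num_beta)
```

Each printed pair should agree to several decimal places. This checks the derivatives only; it is not an implementation of the full re-estimation loop.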
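Problem 7.13 Solution

This problem is quite confusing. In my view, the posterior should be denoted as $p(\mathbf{w}|\mathbf{t}, X, \{a_i, b_i\}, a_\beta, b_\beta)$, where $a_\beta, b_\beta$ control the Gamma distribution of $\beta$, and $a_i, b_i$ control the Gamma distribution of $\alpha_i$. What we should do is maximize the marginal likelihood $p(\mathbf{t}|X, \{a_i, b_i\}, a_\beta, b_\beta)$ with respect to $\{a_i, b_i\}, a_\beta, b_\beta$. We no longer have a point estimate for the hyperparameters $\beta$ and $\alpha_i$; instead we have a distribution over them, controlled by the hyperpriors, i.e., $\{a_i, b_i\}, a_\beta, b_\beta$.

Problem 7.14 Solution

We begin by writing down $p(t|\mathbf{x}, \mathbf{w}, \beta^*)$. Using Eq (7.76) and Eq (7.77), we can obtain:

$$p(t|\mathbf{x}, \mathbf{w}, \beta^*) = \mathcal{N}(t|\mathbf{w}^T\phi(\mathbf{x}), (\beta^*)^{-1})$$

Then we write down $p(\mathbf{w}|X, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*)$. Using Eq (7.81), (7.82) and (7.83), we can obtain:

$$p(\mathbf{w}|X, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \Sigma)$$

Where $\mathbf{m}$ and $\Sigma$ are evaluated using Eq (7.82) and (7.83) given $\boldsymbol{\alpha} = \boldsymbol{\alpha}^*$ and $\beta = \beta^*$. Then we utilize Eq (7.90) and obtain:

$$\begin{aligned}
p(t|\mathbf{x}, X, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) &= \int \mathcal{N}(t|\mathbf{w}^T\phi(\mathbf{x}), (\beta^*)^{-1})\, \mathcal{N}(\mathbf{w}|\mathbf{m}, \Sigma)\, d\mathbf{w} \\
&= \int \mathcal{N}(t|\phi(\mathbf{x})^T\mathbf{w}, (\beta^*)^{-1})\, \mathcal{N}(\mathbf{w}|\mathbf{m}, \Sigma)\, d\mathbf{w}
\end{aligned}$$

Using Eq (2.113)-(2.117), we can obtain:

$$p(t|\mathbf{x}, X, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) = \mathcal{N}(t|\mu, \sigma^2)$$

Where we have defined:

$$\mu = \mathbf{m}^T\phi(\mathbf{x})$$

And

$$\sigma^2 = (\beta^*)^{-1} + \phi(\mathbf{x})^T\Sigma\phi(\mathbf{x})$$

Just as required.

Problem 7.15 Solution

We just follow the hint.

$$\begin{aligned}
L(\boldsymbol{\alpha}) &= -\frac{1}{2}\left\{N\ln 2\pi + \ln|C| + \mathbf{t}^TC^{-1}\mathbf{t}\right\} \\
&= -\frac{1}{2}\left\{N\ln 2\pi + \ln|C_{-i}| + \ln|1 + \alpha_i^{-1}\varphi_i^TC_{-i}^{-1}\varphi_i| + \mathbf{t}^T\left(C_{-i}^{-1} - \frac{C_{-i}^{-1}\varphi_i\varphi_i^TC_{-i}^{-1}}{\alpha_i + \varphi_i^TC_{-i}^{-1}\varphi_i}\right)\mathbf{t}\right\} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln|1 + \alpha_i^{-1}\varphi_i^TC_{-i}^{-1}\varphi_i| + \frac{1}{2}\mathbf{t}^T\frac{C_{-i}^{-1}\varphi_i\varphi_i^TC_{-i}^{-1}}{\alpha_i + \varphi_i^TC_{-i}^{-1}\varphi_i}\mathbf{t} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln|1 + \alpha_i^{-1}s_i| + \frac{1}{2}\frac{q_i^2}{\alpha_i + s_i} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln\frac{\alpha_i + s_i}{\alpha_i} + \frac{1}{2}\frac{q_i^2}{\alpha_i + s_i} \\
&= L(\boldsymbol{\alpha}_{-i}) + \frac{1}{2}\left[\ln\alpha_i - \ln(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i}\right] = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i)
\end{aligned}$$

Where we have defined $\lambda(\alpha_i)$, $s_i$ and $q_i$ as shown in Eq (7.97)-(7.99).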
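The decomposition $L(\boldsymbol{\alpha}) = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i)$ from Problem 7.15 can be checked on a tiny example. The sketch below assumes $N = 2$ so that determinants and inverses have closed forms; the basis vectors, targets and hyperparameters are invented, and all helper names are hypothetical.

```python
import math

def det2(C):
    return C[0][0] * C[1][1] - C[0][1] * C[1][0]

def inv2(C):
    d = det2(C)
    return [[C[1][1] / d, -C[0][1] / d], [-C[1][0] / d, C[0][0] / d]]

def quad(u, M, v):
    """Bilinear form u^T M v for 2-vectors and a 2x2 matrix."""
    return sum(u[r] * M[r][c] * v[c] for r in range(2) for c in range(2))

def build_C(beta, alphas, phis, skip=None):
    """C = beta^{-1} I + sum_i alpha_i^{-1} phi_i phi_i^T, optionally without term `skip`."""
    C = [[1.0 / beta, 0.0], [0.0, 1.0 / beta]]
    for i, (a, phi) in enumerate(zip(alphas, phis)):
        if i == skip:
            continue
        for r in range(2):
            for c in range(2):
                C[r][c] += phi[r] * phi[c] / a
    return C

def log_marginal(C, t):
    """L = -1/2 {N ln 2 pi + ln|C| + t^T C^{-1} t} with N = 2."""
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det2(C)) + quad(t, inv2(C), t))

# Made-up values: two basis vectors phi_i (columns of Phi), each of length N = 2.
beta, alphas = 2.0, [0.7, 1.5]
phis = [[1.0, 0.5], [-1.0, 2.0]]
t = [0.3, -1.2]

i = 0
C_full = build_C(beta, alphas, phis)
C_minus = build_C(beta, alphas, phis, skip=i)
Cm_inv = inv2(C_minus)
s_i = quad(phis[i], Cm_inv, phis[i])     # sparsity, Eq (7.98)
q_i = quad(phis[i], Cm_inv, t)           # quality, Eq (7.99)
lam = 0.5 * (math.log(alphas[i]) - math.log(alphas[i] + s_i)
             + q_i ** 2 / (alphas[i] + s_i))   # Eq (7.97)

lhs = log_marginal(C_full, t)
rhs = log_marginal(C_minus, t) + lam
print(lhs, rhs)
```

The two printed numbers should coincide up to rounding, reflecting the determinant and inverse identities used in the hint.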
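Problem 7.16 Solution

We first calculate the first derivative of Eq (7.97) with respect to $\alpha_i$:

$$\frac{\partial\lambda}{\partial\alpha_i} = \frac{1}{2}\left[\frac{1}{\alpha_i} - \frac{1}{\alpha_i + s_i} - \frac{q_i^2}{(\alpha_i + s_i)^2}\right]$$

Then we calculate the second derivative:

$$\frac{\partial^2\lambda}{\partial\alpha_i^2} = \frac{1}{2}\left[-\frac{1}{\alpha_i^2} + \frac{1}{(\alpha_i + s_i)^2} + \frac{2q_i^2}{(\alpha_i + s_i)^3}\right]$$

Next we aim to prove that when $\alpha_i$ is given by Eq (7.101), i.e., when the first derivative equals 0, the second derivative above is negative. First we can obtain:

$$\alpha_i + s_i = \frac{s_i^2}{q_i^2 - s_i} + s_i = \frac{s_iq_i^2}{q_i^2 - s_i}$$

Therefore, substituting $\alpha_i + s_i$ and $\alpha_i$ into the second derivative, we can obtain:

$$\begin{aligned}
\frac{\partial^2\lambda}{\partial\alpha_i^2} &= \frac{1}{2}\left[-\frac{(q_i^2 - s_i)^2}{s_i^4} + \frac{(q_i^2 - s_i)^2}{s_i^2q_i^4} + \frac{2q_i^2(q_i^2 - s_i)^3}{s_i^3q_i^6}\right] \\
&= \frac{1}{2}\left[-\frac{q_i^4(q_i^2 - s_i)^2}{q_i^4s_i^4} + \frac{s_i^2(q_i^2 - s_i)^2}{s_i^4q_i^4} + \frac{2s_i(q_i^2 - s_i)^3}{s_i^4q_i^4}\right] \\
&= \frac{1}{2}\frac{(q_i^2 - s_i)^2}{q_i^4s_i^4}\left[-q_i^4 + s_i^2 + 2s_i(q_i^2 - s_i)\right] \\
&= \frac{1}{2}\frac{(q_i^2 - s_i)^2}{q_i^4s_i^4}\left[-(q_i^2 - s_i)^2\right] \\
&= -\frac{1}{2}\frac{(q_i^2 - s_i)^4}{q_i^4s_i^4} < 0
\end{aligned}$$

Just as required.

Problem 7.17 Solution

We just follow the hint. According to Eq (7.102), Eq (7.86) and the matrix identity (C.7), we have:

$$\begin{aligned}
Q_i &= \varphi_i^TC^{-1}\mathbf{t} \\
&= \varphi_i^T(\beta^{-1}I + \Phi A^{-1}\Phi^T)^{-1}\mathbf{t} \\
&= \varphi_i^T\left(\beta I - \beta I\Phi(A + \Phi^T\beta I\Phi)^{-1}\Phi^T\beta I\right)\mathbf{t} \\
&= \varphi_i^T\left(\beta I - \beta^2\Phi(A + \beta\Phi^T\Phi)^{-1}\Phi^T\right)\mathbf{t} \\
&= \varphi_i^T(\beta I - \beta^2\Phi\Sigma\Phi^T)\mathbf{t} \\
&= \beta\varphi_i^T\mathbf{t} - \beta^2\varphi_i^T\Phi\Sigma\Phi^T\mathbf{t}
\end{aligned}$$

Similarly, we can obtain:

$$\begin{aligned}
S_i &= \varphi_i^TC^{-1}\varphi_i \\
&= \varphi_i^T(\beta I - \beta^2\Phi\Sigma\Phi^T)\varphi_i \\
&= \beta\varphi_i^T\varphi_i - \beta^2\varphi_i^T\Phi\Sigma\Phi^T\varphi_i
\end{aligned}$$

Just as required.

Problem 7.18 Solution

We begin by differentiating the first term in Eq (7.109) with respect to $\mathbf{w}$. This can be easily evaluated based on Eq (4.90)-(4.91).

$$\frac{\partial}{\partial\mathbf{w}}\left\{\sum_{n=1}^{N} t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right\} = \sum_{n=1}^{N}(t_n - y_n)\phi_n = \Phi^T(\mathbf{t} - \mathbf{y})$$

The derivative of the second term in Eq (7.109) with respect to $\mathbf{w}$ is simple to obtain: it is $-A\mathbf{w}$.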
Therefore, the first derivative of Eq (7.109) with respect to $\mathbf{w}$ is:

$$\frac{\partial\ln p}{\partial\mathbf{w}} = \Phi^T(\mathbf{t} - \mathbf{y}) - A\mathbf{w}$$

For the Hessian matrix, we can first obtain:

$$\begin{aligned}
\frac{\partial}{\partial\mathbf{w}}\left\{\Phi^T(\mathbf{t} - \mathbf{y})\right\} &= \frac{\partial}{\partial\mathbf{w}}\left\{\sum_{n=1}^{N}(t_n - y_n)\phi_n\right\} \\
&= -\frac{\partial}{\partial\mathbf{w}}\left\{\sum_{n=1}^{N} y_n \cdot \phi_n\right\} \\
&= -\sum_{n=1}^{N}\frac{\partial\sigma(\mathbf{w}^T\phi_n)}{\partial\mathbf{w}} \cdot \phi_n^T \\
&= -\sum_{n=1}^{N}\frac{\partial\sigma(a)}{\partial a} \cdot \frac{\partial a}{\partial\mathbf{w}} \cdot \phi_n^T
\end{aligned}$$

Where we have defined $a = \mathbf{w}^T\phi_n$. Then we can utilize Eq (4.88) to derive:

$$\frac{\partial}{\partial\mathbf{w}}\left\{\Phi^T(\mathbf{t} - \mathbf{y})\right\} = -\sum_{n=1}^{N}\sigma(1 - \sigma) \cdot \phi_n \cdot \phi_n^T = -\Phi^TB\Phi$$

Where $B$ is a diagonal $N \times N$ matrix with elements $b_n = y_n(1 - y_n)$. Therefore, we can obtain the Hessian matrix:

$$H = \frac{\partial}{\partial\mathbf{w}}\left\{\frac{\partial\ln p}{\partial\mathbf{w}}\right\} = -(\Phi^TB\Phi + A)$$

Just as required.

Problem 7.19 Solution

We begin from Eq (7.114).

$$\begin{aligned}
p(\mathbf{t}|\boldsymbol{\alpha}) &\simeq p(\mathbf{t}|\mathbf{w}^*)\, p(\mathbf{w}^*|\boldsymbol{\alpha})\, (2\pi)^{M/2}|\Sigma|^{1/2} \\
&= \left[\prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w})\right]\left[\prod_{i=1}^{M}\mathcal{N}(w_i|0, \alpha_i^{-1})\right](2\pi)^{M/2}|\Sigma|^{1/2}\,\Big|_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w})\right] \cdot \mathcal{N}(\mathbf{w}|\mathbf{0}, A^{-1}) \cdot (2\pi)^{M/2}|\Sigma|^{1/2}\,\Big|_{\mathbf{w} = \mathbf{w}^*}
\end{aligned}$$

We further take the logarithm of both sides.

$$\begin{aligned}
\ln p(\mathbf{t}|\boldsymbol{\alpha}) &\simeq \left[\sum_{n=1}^{N}\ln p(t_n|\mathbf{x}_n, \mathbf{w}) + \ln\mathcal{N}(\mathbf{w}|\mathbf{0}, A^{-1}) + \frac{M}{2}\ln 2\pi + \frac{1}{2}\ln|\Sigma|\right]_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\sum_{n=1}^{N}\left[t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right] - \frac{1}{2}\mathbf{w}^TA\mathbf{w} + \frac{1}{2}\ln|A| + \frac{1}{2}\ln|\Sigma| + \text{const}\right]_{\mathbf{w} = \mathbf{w}^*}
\end{aligned}$$

The first two terms are exactly Eq (7.109) up to a constant, and their gradient with respect to $\mathbf{w}$, Eq (7.110), vanishes at $\mathbf{w}^*$. By the chain rule, the dependence of these terms on $\alpha_i$ through $\mathbf{w}^*$ therefore contributes nothing to the derivative, and we only need the explicit dependence on $\alpha_i$. Since $\mathbf{w}^TA\mathbf{w} = \sum_i\alpha_iw_i^2$ and $\ln|A| = \sum_i\ln\alpha_i$, it is easy to obtain:

$$\frac{\partial}{\partial\alpha_i}\left[-\frac{1}{2}\mathbf{w}^TA\mathbf{w} + \frac{1}{2}\ln|A|\right]_{\mathbf{w} = \mathbf{w}^*} = -\frac{1}{2}(w_i^*)^2 + \frac{1}{2\alpha_i}$$

Then, following the same procedure as in Prob.7.12 (and neglecting the dependence of $\Sigma$ on $\alpha_i$ through $\mathbf{w}^*$), we can obtain:

$$\frac{\partial}{\partial\alpha_i}\left[\frac{1}{2}\ln|\Sigma|\right] = -\frac{1}{2}\Sigma_{ii}$$

Therefore, we obtain:

$$\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i} = -\frac{1}{2}(w_i^*)^2 + \frac{1}{2\alpha_i} - \frac{1}{2}\Sigma_{ii}$$

Setting it to zero gives Eq (7.115). The contribution of the prior can also be verified in another way. We can write the prior as the product of $\mathcal{N}(w_i|0, \alpha_i^{-1})$ instead of $\mathcal{N}(\mathbf{w}|\mathbf{0}, A^{-1})$. In this form, we know that:

$$\frac{\partial}{\partial\alpha_i}\sum_{j=1}^{M}\ln\mathcal{N}(w_j|0, \alpha_j^{-1}) = \frac{\partial}{\partial\alpha_i}\left(\frac{1}{2}\ln\alpha_i - \frac{\alpha_i}{2}w_i^2\right) = \frac{1}{2\alpha_i} - \frac{1}{2}(w_i^*)^2$$

which matches the derivative of $-\frac{1}{2}\mathbf{w}^TA\mathbf{w} + \frac{1}{2}\ln|A|$ obtained above.

0.8 Graphical Models

Problem 8.1 Solution

We are required to prove:

$$\int_{\mathbf{x}} p(\mathbf{x})\, d\mathbf{x} = \int_{\mathbf{x}}\prod_{k=1}^{K} p(x_k|\mathrm{pa}_k)\, d\mathbf{x} = 1$$

Here we adopt the same assumption as in the main text: no arrows lead from a higher numbered node to a lower numbered node. According to Eq (8.5), we can write:

$$\begin{aligned}
\int_{\mathbf{x}} p(\mathbf{x})\, d\mathbf{x} &= \int_{\mathbf{x}}\prod_{k=1}^{K} p(x_k|\mathrm{pa}_k)\, d\mathbf{x} \\
&= \int_{\mathbf{x}} p(x_K|\mathrm{pa}_K)\prod_{k=1}^{K-1} p(x_k|\mathrm{pa}_k)\, d\mathbf{x} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\left[\int_{x_K} p(x_K|\mathrm{pa}_K)\prod_{k=1}^{K-1} p(x_k|\mathrm{pa}_k)\, dx_K\right] dx_1dx_2...dx_{K-1} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\left[\prod_{k=1}^{K-1} p(x_k|\mathrm{pa}_k)\int_{x_K} p(x_K|\mathrm{pa}_K)\, dx_K\right] dx_1dx_2...dx_{K-1} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\prod_{k=1}^{K-1} p(x_k|\mathrm{pa}_k)\, dx_1dx_2...dx_{K-1}
\end{aligned}$$

Note that from the third line to the fourth line, we have used the fact that the factors for $x_1, x_2, ..., x_{K-1}$ do not depend on $x_K$, so the product from $k = 1$ to $K - 1$ can be moved outside the integral with respect to $x_K$. From the fourth line to the fifth line, we have used the fact that the conditional probability is correctly normalized. This procedure is repeated $K$ times until all the variables have been integrated out.

Problem 8.2 Solution

This statement is obvious. Suppose that there exists an ordered numbering of the nodes such that for each node there are no links going to a lower-numbered node, and that there is a directed cycle in the graph:

$$a_1 \to a_2 \to ... \to a_N$$

To make it a real cycle, we also require $a_N \to a_1$. According to the assumption, we have $a_1 < a_2 < ... < a_N$.
Therefore, the last link $a_N \to a_1$ is invalid since $a_N > a_1$, which is a contradiction.

Problem 8.3 Solution

Based on the definition, we can obtain:

$$p(a, b) = p(a, b, c = 0) + p(a, b, c = 1) = \begin{cases} 0.336, & \text{if } a = 0, b = 0 \\ 0.264, & \text{if } a = 0, b = 1 \\ 0.256, & \text{if } a = 1, b = 0 \\ 0.144, & \text{if } a = 1, b = 1 \end{cases}$$

Similarly, we can obtain:

$$p(a) = p(a, b = 0) + p(a, b = 1) = \begin{cases} 0.6, & \text{if } a = 0 \\ 0.4, & \text{if } a = 1 \end{cases}$$

And

$$p(b) = p(a = 0, b) + p(a = 1, b) = \begin{cases} 0.592, & \text{if } b = 0 \\ 0.408, & \text{if } b = 1 \end{cases}$$

Therefore, we conclude that $p(a, b) \neq p(a)p(b)$. For instance, we have $p(a = 1, b = 1) = 0.144$, $p(a = 1) = 0.4$ and $p(b = 1) = 0.408$. It is obvious that:

$$0.144 = p(a = 1, b = 1) \neq p(a = 1)\, p(b = 1) = 0.4 \times 0.408 = 0.1632$$

To prove the conditional independence, we first calculate $p(c)$:

$$p(c) = \sum_{a, b = 0, 1} p(a, b, c) = \begin{cases} 0.480, & \text{if } c = 0 \\ 0.520, & \text{if } c = 1 \end{cases}$$

According to Bayes' Theorem, we have:

$$p(a, b|c) = \frac{p(a, b, c)}{p(c)} = \begin{cases} 0.400, & \text{if } a = 0, b = 0, c = 0 \\ 0.277, & \text{if } a = 0, b = 0, c = 1 \\ 0.100, & \text{if } a = 0, b = 1, c = 0 \\ 0.415, & \text{if } a = 0, b = 1, c = 1 \\ 0.400, & \text{if } a = 1, b = 0, c = 0 \\ 0.123, & \text{if } a = 1, b = 0, c = 1 \\ 0.100, & \text{if } a = 1, b = 1, c = 0 \\ 0.185, & \text{if } a = 1, b = 1, c = 1 \end{cases}$$

Similarly, we also have:

$$p(a|c) = \frac{p(a, c)}{p(c)} = \begin{cases} 0.240/0.480 = 0.500, & \text{if } a = 0, c = 0 \\ 0.360/0.520 = 0.692, & \text{if } a = 0, c = 1 \\ 0.240/0.480 = 0.500, & \text{if } a = 1, c = 0 \\ 0.160/0.520 = 0.308, & \text{if } a = 1, c = 1 \end{cases}$$

Where we have used $p(a, c) = p(a, b = 0, c) + p(a, b = 1, c)$. Similarly, we can obtain:

$$p(b|c) = \frac{p(b, c)}{p(c)} = \begin{cases} 0.384/0.480 = 0.800, & \text{if } b = 0, c = 0 \\ 0.208/0.520 = 0.400, & \text{if } b = 0, c = 1 \\ 0.096/0.480 = 0.200, & \text{if } b = 1, c = 0 \\ 0.312/0.520 = 0.600, & \text{if } b = 1, c = 1 \end{cases}$$

Now we can easily verify the statement $p(a, b|c) = p(a|c)\, p(b|c)$. For instance, we have:

$$0.1 = p(a = 1, b = 1|c = 0) = p(a = 1|c = 0)\, p(b = 1|c = 0) = 0.5 \times 0.2 = 0.1$$

Problem 8.4 Solution

This problem follows the previous one. We have already calculated $p(a)$ and $p(b|c)$, and we restate them here.
$$p(a) = p(a, b = 0) + p(a, b = 1) = \begin{cases} 0.6, & \text{if } a = 0 \\ 0.4, & \text{if } a = 1 \end{cases}$$

And

$$p(b|c) = \frac{p(b, c)}{p(c)} = \begin{cases} 0.384/0.480 = 0.800, & \text{if } b = 0, c = 0 \\ 0.208/0.520 = 0.400, & \text{if } b = 0, c = 1 \\ 0.096/0.480 = 0.200, & \text{if } b = 1, c = 0 \\ 0.312/0.520 = 0.600, & \text{if } b = 1, c = 1 \end{cases}$$

We can also obtain $p(c|a)$:

$$p(c|a) = \frac{p(a, c)}{p(a)} = \begin{cases} 0.24/0.6 = 0.4, & \text{if } a = 0, c = 0 \\ 0.36/0.6 = 0.6, & \text{if } a = 0, c = 1 \\ 0.24/0.4 = 0.6, & \text{if } a = 1, c = 0 \\ 0.16/0.4 = 0.4, & \text{if } a = 1, c = 1 \end{cases}$$

Now we can easily verify the statement that $p(a, b, c) = p(a)\, p(c|a)\, p(b|c)$ given Table 8.2. The directed graph looks like:

$$a \to c \to b$$

Problem 8.5 Solution

It looks quite like Figure 8.6. The difference is that we introduce $\alpha_i$ for each $w_i$, where $i = 1, 2, ..., M$.

Figure 1: probabilistic graphical model corresponding to the RVM described in (7.79) and (7.80).

Problem 8.6 Solution (Wait for update)

Problem 8.7 Solution

Let's just follow the hint. We begin by calculating the mean $\boldsymbol{\mu}$.

$$E[x_1] = b_1$$

According to Eq (8.15), we can obtain:

$$E[x_2] = \sum_{j \in \mathrm{pa}_2} w_{2j}E[x_j] + b_2 = w_{21}b_1 + b_2$$

Then we can obtain:

$$E[x_3] = w_{32}E[x_2] + b_3 = w_{32}(w_{21}b_1 + b_2) + b_3 = w_{32}w_{21}b_1 + w_{32}b_2 + b_3$$

Therefore, we obtain Eq (8.17) just as required. Next, we deal with the covariance matrix, using Eq (8.16).

$$\mathrm{cov}[x_1, x_1] = v_1$$

Then we can obtain:

$$\mathrm{cov}[x_1, x_2] = \sum_{k=1} w_{2k}\,\mathrm{cov}[x_1, x_k] + I_{12}v_2 = w_{21}\,\mathrm{cov}[x_1, x_1] = w_{21}v_1$$

And also $\mathrm{cov}[x_2, x_1] = \mathrm{cov}[x_1, x_2] = w_{21}v_1$. Hence, we can obtain:

$$\mathrm{cov}[x_2, x_2] = \sum_{k=1} w_{2k}\,\mathrm{cov}[x_2, x_k] + I_{22}v_2 = w_{21}^2v_1 + v_2$$

Next, we can obtain:

$$\mathrm{cov}[x_1, x_3] = \sum_{k=2} w_{3k}\,\mathrm{cov}[x_1, x_k] + I_{13}v_3 = w_{32}w_{21}v_1$$

Then, we can obtain:

$$\mathrm{cov}[x_2, x_3] = \sum_{k=2} w_{3k}\,\mathrm{cov}[x_2, x_k] + I_{23}v_3 = w_{32}(v_2 + w_{21}^2v_1)$$

Finally, we can obtain:

$$\begin{aligned}
\mathrm{cov}[x_3, x_3] &= \sum_{k=2} w_{3k}\,\mathrm{cov}[x_3, x_k] + I_{33}v_3 \\
&= w_{32}\left[w_{32}(v_2 + w_{21}^2v_1)\right] + v_3
\end{aligned}$$

Where we have used the fact that $\mathrm{cov}[x_3, x_2] = \mathrm{cov}[x_2, x_3]$. By now, we have obtained Eq (8.18) just as required.
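The recursions used in Problem 8.7 are mechanical enough to run by machine. The sketch below implements Eq (8.15) and Eq (8.16) for a general linear-Gaussian graph whose nodes are numbered so that parents precede children, then applies it to the chain $x_1 \to x_2 \to x_3$. The numeric weights, biases and variances are invented for illustration, nodes are indexed from 0 in the code, and the function name is hypothetical.

```python
def linear_gaussian_moments(parents, w, b, v):
    """Means and covariances from the recursions (8.15) and (8.16).

    parents[j] lists the parents of node j; every parent must have a
    smaller index than its child. w[(j, k)] is the weight on the link
    from parent k to child j.
    """
    d = len(b)
    mean = [0.0] * d
    cov = [[0.0] * d for _ in range(d)]
    for j in range(d):
        mean[j] = sum(w[(j, k)] * mean[k] for k in parents[j]) + b[j]
        for i in range(j + 1):
            c = sum(w[(j, k)] * cov[i][k] for k in parents[j])
            if i == j:
                c += v[j]          # the I_ij v_j term of Eq (8.16)
            cov[i][j] = cov[j][i] = c
    return mean, cov

# Chain x1 -> x2 -> x3 (indices 0, 1, 2) with made-up parameters.
w21, w32 = 2.0, 3.0
b = [1.0, 1.0, 1.0]
v = [1.0, 2.0, 3.0]
parents = [[], [0], [1]]
w = {(1, 0): w21, (2, 1): w32}

mean, cov = linear_gaussian_moments(parents, w, b, v)

# Closed forms from Eq (8.17) and Eq (8.18) for comparison.
mean_closed = [b[0], b[1] + w21 * b[0], b[2] + w32 * b[1] + w32 * w21 * b[0]]
cov33_closed = v[2] + w32 ** 2 * (v[1] + w21 ** 2 * v[0])
print(mean, mean_closed)
print(cov[2][2], cov33_closed)
```

The recursion visits each pair $(i, j)$ with $i \leq j$ once, which mirrors the order in which the covariances are derived by hand above.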
Problem 8.8 Solution

According to the definition, we can write:

$$p(a, b, c|d) = p(a|d)\, p(b, c|d)$$

We marginalize both sides with respect to $c$, yielding:

$$p(a, b|d) = p(a|d)\, p(b|d)$$

Just as required.

Problem 8.9 Solution

This statement is easy to see but a little difficult to prove. We reproduce Fig 8.26 here as an illustration.

Figure 2: Markov blanket of a node $x_i$

The Markov blanket $\Phi$ of node $x_i$ is made up of three kinds of nodes: (i) the set $\Phi_1$ containing all the parents of node $x_i$ ($x_1$ and $x_2$ in Fig.2), (ii) the set $\Phi_2$ containing all the children of node $x_i$ ($x_5$ and $x_6$ in Fig.2), and (iii) the set $\Phi_3$ containing all the co-parents of node $x_i$ ($x_3$ and $x_4$ in Fig.2). According to the d-separation criterion, we need to show that all the paths from node $x_i$ to an arbitrary node $\hat{x} \notin \Phi = \{\Phi_1 \cup \Phi_2 \cup \Phi_3\}$ are blocked given that the Markov blanket $\Phi$ is observed.

It is obvious that $\hat{x}$ can only connect to the target node $x_i$ via two kinds of node: $\Phi_1$ and $\Phi_2$. First, suppose that $\hat{x}$ connects to $x_i$ via some node $x^\star \in \Phi_1$. The arrows must meet head-to-tail or tail-to-tail at node $x^\star$, because the link from a parent node $x^\star$ to $x_i$ has its tail connected to the parent node $x^\star$, and since $x^\star$ is in $\Phi_1 \subseteq \Phi$, we see that this path is blocked.

In the second case, suppose that $\hat{x}$ connects to $x_i$ via some node $x^\star \in \Phi_2$. We need to further divide this situation. If the path from $\hat{x}$ to $x_i$ also goes through a node $x^{\star\star}$ from $\Phi_3$ (e.g., in Fig.2, some node $\hat{x}$ connects to node $x_3$, and in this example $x^{\star\star} = x_3$, $x^\star = x_5$), it is clear that the arrows meet head-to-tail or tail-to-tail at the node $x^{\star\star} \in \Phi_3 \subseteq \Phi$, so this path is blocked.

In the final case, suppose that $\hat{x}$ connects to $x_i$ via some node $x^\star \in \Phi_2$ and the path doesn't go through any node from $\Phi_3$. An important observation is that the arrows cannot meet head-to-head at node $x^\star$ (otherwise, this path would go through a node from $\Phi_3$).
Thus, the arrows must meet either head-to-tail or tail-to-tail at node $x^\star \in \Phi_2 \subseteq \Phi$. Therefore, the path is also blocked.

Problem 8.10 Solution

By examining Fig.8.54, we can obtain:

$$p(a, b, c, d) = p(a)\, p(b)\, p(c|a, b)\, p(d|c)$$

Next we sum both sides with respect to $c$ and $d$, and obtain:

$$\begin{aligned}
p(a, b) &= \sum_c\sum_d p(a)\, p(b)\, p(c|a, b)\, p(d|c) \\
&= p(a)\, p(b)\sum_c\left[p(c|a, b)\sum_d p(d|c)\right] \\
&= p(a)\, p(b)\sum_c p(c|a, b) \times 1 \\
&= p(a)\, p(b) \times 1 \\
&= p(a)\, p(b)
\end{aligned}$$

To show that $a$ and $b$ are dependent conditioned on $d$, suppose on the contrary that:

$$p(a, b|d) = p(a|d)\, p(b|d)$$

We multiply both sides by $p(d)$, yielding:

$$p(a, b, d) = p(a|d)\, p(b|d)\, p(d) \qquad (*)$$

Recall that we have:

$$p(a, b, c, d) = p(a)\, p(b)\, p(c|a, b)\, p(d|c)$$

We sum both sides with respect to $c$, yielding:

$$p(a, b, d) = p(a)\, p(b)\sum_c p(c|a, b)\, p(d|c)$$

Combining with $(*)$, conditional independence would require:

$$\sum_c p(c|a, b)\, p(d|c) = \frac{p(a|d)\, p(b|d)\, p(d)}{p(a)\, p(b)} = \frac{p(d|a)\, p(d|b)}{p(d)}$$

The left hand side equals $p(d|a, b)$, an arbitrary function of $a$, $b$ and $d$ determined by the conditional distributions, whereas the right hand side is restricted to a product of a function of $(a, d)$ and a function of $(b, d)$. In general, this expression will not hold, and, thus, $a$ and $b$ are dependent conditioned on $d$.

Problem 8.11 Solution

This problem is quite straightforward, but it needs some patience. According to Bayes' Theorem, we have:

$$p(F = 0|D = 0) = \frac{p(D = 0|F = 0)\, p(F = 0)}{p(D = 0)} \qquad (*)$$

We will calculate each of the terms on the right hand side. Let's begin with the denominator $p(D = 0)$. According to the sum rule, we have:

$$\begin{aligned}
p(D = 0) &= p(D = 0, G = 0) + p(D = 0, G = 1) \\
&= p(D = 0|G = 0)\, p(G = 0) + p(D = 0|G = 1)\, p(G = 1) \\
&= 0.9 \times 0.315 + (1 - 0.9) \times (1 - 0.315) = 0.352
\end{aligned}$$

Where we have used Eq (8.30), Eq (8.105) and Eq (8.106). Note that the second term in the numerator, i.e., $p(F = 0)$, equals 0.1, which can be easily derived from the main text above Eq (8.30). We now only need to calculate $p(D = 0|F = 0)$.
Similarly, according to the sum rule, we have:

$$\begin{aligned}
p(D = 0|F = 0) &= \sum_{G = 0, 1} p(D = 0, G|F = 0) \\
&= \sum_{G = 0, 1} p(D = 0|G, F = 0)\, p(G|F = 0) \\
&= \sum_{G = 0, 1} p(D = 0|G)\, p(G|F = 0) \\
&= 0.9 \times 0.81 + (1 - 0.9) \times (1 - 0.81) = 0.748
\end{aligned}$$

Several clarifications must be made here. First, from the second line to the third line, we simply eliminate the dependence on $F = 0$ because we know that $D$ only depends on $G$ according to Eq (8.105) and Eq (8.106). Second, from the third line to the fourth line, we have used Eq (8.31), Eq (8.105) and Eq (8.106). Now, we substitute all of them back into $(*)$, yielding:

$$p(F = 0|D = 0) = \frac{p(D = 0|F = 0)\, p(F = 0)}{p(D = 0)} = \frac{0.748 \times 0.1}{0.352} = 0.2125$$

Next, we are required to calculate the probability conditioned on both $D = 0$ and $B = 0$. Similarly, we can write:

$$\begin{aligned}
p(F = 0|D = 0, B = 0) &= \frac{p(D = 0, B = 0, F = 0)}{p(D = 0, B = 0)} \\
&= \frac{\sum_G p(D = 0, B = 0, F = 0, G)}{\sum_G p(D = 0, B = 0, G)} \\
&= \frac{\sum_G p(B = 0, F = 0, G)\, p(D = 0|B = 0, F = 0, G)}{\sum_G p(B = 0, G)\, p(D = 0|B = 0, G)} \\
&= \frac{\sum_G p(B = 0, F = 0, G)\, p(D = 0|G)}{\sum_G p(B = 0, G)\, p(D = 0|G)} \qquad (**)
\end{aligned}$$

We need to calculate $p(B = 0, F = 0, G)$ and $p(B = 0, G)$, where $G = 0, 1$. We begin by calculating $p(B = 0, F = 0, G = 0)$:

$$\begin{aligned}
p(B = 0, F = 0, G = 0) &= p(G = 0|B = 0, F = 0) \times p(B = 0, F = 0) \\
&= p(G = 0|B = 0, F = 0) \times p(B = 0) \times p(F = 0) \\
&= (1 - 0.1) \times (1 - 0.9) \times (1 - 0.9) = 0.009
\end{aligned}$$

Similarly, we can obtain $p(B = 0, F = 0, G = 1) = 0.001$. Next we calculate $p(B = 0, G)$:

$$\begin{aligned}
p(B = 0, G = 0) &= \sum_{F = 0, 1} p(B = 0, G = 0, F) \\
&= \sum_{F = 0, 1} p(G = 0|B = 0, F) \times p(B = 0, F) \\
&= \sum_{F = 0, 1} p(G = 0|B = 0, F) \times p(B = 0) \times p(F) \\
&= (1 - 0.1) \times (1 - 0.9) \times (1 - 0.9) + (1 - 0.2) \times (1 - 0.9) \times 0.9 = 0.081
\end{aligned}$$

Similarly, we can obtain $p(B = 0, G = 1) = 0.019$. We substitute them back into $(**)$, yielding:

$$\begin{aligned}
p(F = 0|D = 0, B = 0) &= \frac{\sum_G p(B = 0, F = 0, G)\, p(D = 0|G)}{\sum_G p(B = 0, G)\, p(D = 0|G)} \\
&= \frac{0.009 \times 0.9 + 0.001 \times (1 - 0.9)}{0.081 \times 0.9 + 0.019 \times (1 - 0.9)} \\
&= 0.1096
\end{aligned}$$

Just as required. This result agrees with common sense: once the battery is known to be flat, it explains the empty reading and an empty tank becomes less likely.
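The two posterior probabilities above can be cross-checked by brute-force enumeration of the joint distribution $p(B)\,p(F)\,p(G|B,F)\,p(D|G)$. The following is a minimal sketch; the conditional probability values are the ones used in the calculation above, while the dictionary layout and the helper names `joint` and `prob` are hypothetical choices made for this illustration.

```python
from itertools import product

p_B1 = 0.9                      # p(B = 1)
p_F1 = 0.9                      # p(F = 1)
p_G1 = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.1}   # p(G = 1 | B, F)
p_D1 = {1: 0.9, 0: 0.1}         # p(D = 1 | G)

def bern(p_one, value):
    """Probability of a binary value given the probability of 1."""
    return p_one if value == 1 else 1.0 - p_one

def joint(b, f, g, d):
    return (bern(p_B1, b) * bern(p_F1, f)
            * bern(p_G1[(b, f)], g) * bern(p_D1[g], d))

def prob(**fixed):
    """Marginal probability of the assignments given as keyword arguments."""
    total = 0.0
    for b, f, g, d in product((0, 1), repeat=4):
        values = {"b": b, "f": f, "g": g, "d": d}
        if all(values[name] == val for name, val in fixed.items()):
            total += joint(b, f, g, d)
    return total

post_d = prob(f=0, d=0) / prob(d=0)                 # p(F = 0 | D = 0)
post_db = prob(f=0, d=0, b=0) / prob(d=0, b=0)      # p(F = 0 | D = 0, B = 0)
print(round(post_d, 4), round(post_db, 4))
```

Enumeration scales exponentially with the number of variables, so it is only meant as a check on small graphs like this one.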
Moreover, by analogy to Fig.8.54, the nodes $a$ and $b$ in Fig.8.54 represent $B$ and $F$ in our case. Node $c$ represents $G$, while node $d$ represents $D$. You can use the d-separation criterion to verify the conditional independence properties.

Problem 8.12 Solution

An intuitive solution is to construct a matrix $A$ of size $M \times M$. If there is a link from node $i$ to node $j$, the entry in the $i$-th row and $j$-th column of matrix $A$, i.e., $A_{i,j}$, equals 1. Otherwise, it equals 0. Since the graph is undirected, the matrix $A$ is symmetric. Moreover, the elements on the diagonal are 0 by definition. Every undirected graph can be represented by such a matrix $A$, and the mapping is one-to-one. In other words, we equivalently count the number of possible matrices $A$ satisfying the following criteria: (i) each entry is either 0 or 1, (ii) the matrix is symmetric, and (iii) all of the entries on the diagonal are already determined (i.e., they all equal 0).

Using the symmetry, we only need to count the free variables in the lower triangle of the matrix. In the first column, there are $(M - 1)$ free variables. In the second column, there are $(M - 2)$ free variables. Therefore, the total number of free variables is given by:

$$(M - 1) + (M - 2) + ... + 0 = \frac{M(M - 1)}{2}$$

Each of these free variables has two choices, i.e., 1 or 0. Therefore, the total number of such matrices is $2^{M(M-1)/2}$. In the case of $M = 3$, there are 8 possible undirected graphs:

Figure 3: the undirected graphs when $M = 3$

Problem 8.13 Solution

It is straightforward. Suppose that $x_k$ is the target variable, whose state may be either of $\{-1, +1\}$, while all other variables are fixed. According to Eq (8.42), we can obtain:

$$\begin{aligned}
E(\mathbf{x}, \mathbf{y}) = {} & h\sum_{i \neq k} x_i - \beta\sum_{i, j \neq k} x_ix_j - \eta\sum_{i \neq k} x_iy_i \\
& + hx_k - \beta\sum_m x_kx_m - \eta x_ky_k
\end{aligned}$$

Note that we write down the dependence of $E(\mathbf{x}, \mathbf{y})$ on $x_k$ explicitly, which is expressed via the second line.
Moreover, the $x_ix_j$ term in the first line doesn't include the pairs $\{x_i, x_j\}$ in which one of the two variables is $x_k$. These pairs are accounted for by the $x_kx_m$ term in the second line, where $x_m$ ranges over the neighbors of $x_k$. Noticing that the first line doesn't depend on $x_k$, we can obtain:

$$E(\mathbf{x}, \mathbf{y})|_{x_k = 1} - E(\mathbf{x}, \mathbf{y})|_{x_k = -1} = 2h - 2\beta\sum_m x_m - 2\eta y_k$$

Obviously, the difference depends only on quantities local to $x_k$: the bias $h$, the neighbors $x_m$, and the observed value $y_k$.

Problem 8.14 Solution

It is quite obvious. When $h = 0$, $\beta = 0$, the energy function reduces to

$$E(\mathbf{x}, \mathbf{y}) = -\eta\sum_i x_iy_i$$

If there exists some index $j$ which satisfies $x_j \neq y_j$, then, considering that $x_j, y_j \in \{-1, +1\}$, $x_jy_j$ equals $-1$. By changing the sign of $x_j$, we can always increase the value of $x_jy_j$ from $-1$ to 1, and, thus, decrease the energy function $E(\mathbf{x}, \mathbf{y})$. Therefore, given the observed binary pixels $y_i \in \{-1, +1\}$, where $i = 1, 2, ..., D$, the energy is minimized by setting each $x_i$ equal to $y_i$.

Problem 8.15 Solution

This problem can be solved by analogy to Eq (8.49) - Eq (8.54). We begin by noticing:

$$p(x_{n-1}, x_n) = \sum_{x_1} ... \sum_{x_{n-2}}\sum_{x_{n+1}} ... \sum_{x_N} p(\mathbf{x})$$

We also have:

$$p(\mathbf{x}) = \frac{1}{Z}\psi_{1,2}(x_1, x_2)\, \psi_{2,3}(x_2, x_3) ... \psi_{N-1,N}(x_{N-1}, x_N)$$

By analogy to Eq (8.52), we can obtain:

$$\begin{aligned}
p(x_{n-1}, x_n) = {} & \frac{1}{Z}\left[\sum_{x_{n-2}}\psi_{n-2,n-1}(x_{n-2}, x_{n-1}) ... \left[\sum_{x_2}\psi_{2,3}(x_2, x_3)\left[\sum_{x_1}\psi_{1,2}(x_1, x_2)\right]\right] ... \right] \\
& \times \psi_{n-1,n}(x_{n-1}, x_n) \\
& \times \left[\sum_{x_{n+1}}\psi_{n,n+1}(x_n, x_{n+1}) ... \left[\sum_{x_N}\psi_{N-1,N}(x_{N-1}, x_N)\right] ... \right] \\
= {} & \frac{1}{Z} \times \mu_\alpha(x_{n-1}) \times \psi_{n-1,n}(x_{n-1}, x_n) \times \mu_\beta(x_n)
\end{aligned}$$

Just as required.

Problem 8.16 Solution

We can simply obtain $p(x_N)$ using Eq (8.52) and Eq (8.54):

$$p(x_N) = \frac{1}{Z}\mu_\alpha(x_N) \qquad (*)$$

According to Bayes' Theorem, we have:

$$p(x_n|x_N) = \frac{p(x_n, x_N)}{p(x_N)}$$

Therefore, now we only need to derive an expression for $p(x_n, x_N)$, where $n = 1, 2, ..., N - 1$. We follow the same procedure as in the previous problem. Since we know that:

$$p(x_n, x_N) = \sum_{x_1} ... \sum_{x_{n-1}}\sum_{x_{n+1}} ... \sum_{x_{N-1}} p(\mathbf{x})$$

We can obtain:

$$\begin{aligned}
p(x_n, x_N) = {} & \frac{1}{Z}\left[\sum_{x_{n-1}}\psi_{n-1,n}(x_{n-1}, x_n) ... \left[\sum_{x_2}\psi_{2,3}(x_2, x_3)\left[\sum_{x_1}\psi_{1,2}(x_1, x_2)\right]\right] ... \right] \\
& \times \left[\sum_{x_{n+1}}\psi_{n,n+1}(x_n, x_{n+1}) ... \left[\sum_{x_{N-1}}\psi_{N-2,N-1}(x_{N-2}, x_{N-1})\, \psi_{N-1,N}(x_{N-1}, x_N)\right] ... \right]
\end{aligned}$$

Note that in the second line, the summand with respect to $x_{N-1}$ is the product of $\psi_{N-2,N-1}(x_{N-2}, x_{N-1})$ and $\psi_{N-1,N}(x_{N-1}, x_N)$. So here we can actually draw an undirected graph with $N - 1$ nodes, and adopt the proposed algorithm to solve for $p(x_n, x_N)$. If we use $x_n^\star$ to represent the new nodes, then the joint distribution can be written as:

$$p(\mathbf{x}^\star) = \frac{1}{Z^\star}\psi^\star_{1,2}(x_1^\star, x_2^\star)\, \psi^\star_{2,3}(x_2^\star, x_3^\star) ... \psi^\star_{N-2,N-1}(x_{N-2}^\star, x_{N-1}^\star)$$

Where $\psi^\star_{n,n+1}(x_n^\star, x_{n+1}^\star)$ is defined as:

$$\psi^\star_{n,n+1}(x_n^\star, x_{n+1}^\star) = \begin{cases} \psi_{n,n+1}(x_n, x_{n+1}), & n = 1, 2, ..., N - 3 \\ \psi_{N-2,N-1}(x_{N-2}, x_{N-1})\, \psi_{N-1,N}(x_{N-1}, x_N), & n = N - 2 \end{cases}$$

In other words, we have combined the original nodes $x_{N-1}$ and $x_N$. Moreover, we have the relationship:

$$p(x_n, x_N) = p(x_n^\star) = \frac{1}{Z^\star}\mu^\star_\alpha(x_n^\star)\, \mu^\star_\beta(x_n^\star), \qquad n = 1, 2, ..., N - 1$$

By applying the proposed algorithm to the new undirected graph, $p(x_n^\star)$ can be easily evaluated, and so can $p(x_n, x_N)$.

Problem 8.17 Solution

It is straightforward to see that every path connecting node $x_2$ and $x_5$ in Fig.8.38 must pass through node $x_3$. Therefore, all paths are blocked and the conditional independence property holds. For more details, you should read section 8.3.1.
According to the definition of conditional probability, we can obtain:

$$p(x_2|x_3, x_5) = \frac{p(x_2, x_3, x_5)}{p(x_3, x_5)}$$

Using the proposed algorithm in section 8.4.1, we can obtain:

$$\begin{aligned}
p(x_2|x_3, x_5) &= \frac{p(x_2, x_3, x_5)}{p(x_3, x_5)} = \frac{\sum_{x_1}\sum_{x_4} p(\mathbf{x})}{\sum_{x_1}\sum_{x_2}\sum_{x_4} p(\mathbf{x})} \\
&= \frac{\sum_{x_1}\sum_{x_4}\psi_{1,2}\psi_{2,3}\psi_{3,4}\psi_{4,5}}{\sum_{x_1}\sum_{x_2}\sum_{x_4}\psi_{1,2}\psi_{2,3}\psi_{3,4}\psi_{4,5}} \\
&= \frac{\left(\sum_{x_1}\psi_{1,2}\right) \cdot \psi_{2,3} \cdot \left(\sum_{x_4}\psi_{3,4}\psi_{4,5}\right)}{\sum_{x_2}\left[\left(\sum_{x_1}\psi_{1,2}\right) \cdot \psi_{2,3} \cdot \left(\sum_{x_4}\psi_{3,4}\psi_{4,5}\right)\right]} \\
&= \frac{\left(\sum_{x_1}\psi_{1,2}\right) \cdot \psi_{2,3}}{\sum_{x_2}\left[\left(\sum_{x_1}\psi_{1,2}\right) \cdot \psi_{2,3}\right]}
\end{aligned}$$

In the last step, the factor $\sum_{x_4}\psi_{3,4}\psi_{4,5}$ does not depend on $x_2$ and cancels. It is obvious that the right hand side doesn't depend on $x_5$.

Problem 8.18 Solution

First, the distribution represented by a directed tree can trivially be written as an equivalent distribution over an undirected tree by moralization. You can find more details in section 8.4.2.

Conversely, now we want to represent a distribution, which is given by an undirected tree, via a directed tree. For example, the distribution defined by the undirected tree in Fig.4 can be written as:

$$p(\mathbf{x}) = \frac{1}{Z}\psi_{1,3}(x_1, x_3)\, \psi_{2,3}(x_2, x_3)\, \psi_{3,4}(x_3, x_4)\, \psi_{4,5}(x_4, x_5)$$

We simply choose $x_4$ as the root, and the corresponding directed tree is well defined by working outwards. In this case, the distribution defined by the directed tree is:

$$p(\mathbf{x}) = p(x_4)\, p(x_5|x_4)\, p(x_3|x_4)\, p(x_1|x_3)\, p(x_2|x_3)$$

Thus it is not difficult to change an undirected tree into a directed one by setting:

$$p(x_4)\, p(x_5|x_4) \propto \psi_{4,5}, \quad p(x_3|x_4) \propto \psi_{3,4}, \quad p(x_2|x_3) \propto \psi_{2,3}, \quad p(x_1|x_3) \propto \psi_{1,3}$$

Figure 4: Example of changing an undirected tree to a directed one

The symbol $\propto$ hides a normalization term, which guarantees that each distribution integrates to 1. In summary, in the particular case of an undirected tree, there is only one path between any pair of nodes, and thus each maximal clique is a pair of linked nodes. This is because if we choose any three nodes $x_1, x_2, x_3$, according to the definition there cannot exist a loop among them.
Otherwise there would be two paths between $x_1$ and $x_3$: (i) $x_1 \to x_3$ and (ii) $x_1 \to x_2 \to x_3$. In the directed tree, each node (except the root) depends on only one node, i.e., its parent. Thus we can easily change an undirected tree into a directed one by matching each potential function with the corresponding conditional PDF, as shown in the example. Moreover, we can choose any node in the undirected tree to be the root and then work outwards to obtain a directed tree. Therefore, for an undirected tree with $n$ nodes, there are $n$ corresponding directed trees in total.

Problem 8.19-8.29 Solution (Waiting for update)

I am quite confused by the deduction in Eq (8.66). I do not understand the sum-product algorithm and the max-sum algorithm very well.

0.9 Mixture Models and EM

Problem 9.1 Solution

For each $r_{nk}$, when $n$ is fixed and $k = 1, 2, ..., K$, only one of them equals 1 and the others are all 0. Therefore, there are $K$ possible choices. When $N$ data points are given, there are $K^N$ possible assignments for $\{r_{nk};\ n = 1, 2, ..., N;\ k = 1, 2, ..., K\}$. For each assignment, the optimal $\{\boldsymbol{\mu}_k;\ k = 1, 2, ..., K\}$ are determined by Eq (9.4).

As discussed in the main text, by iteratively performing the E-step and the M-step, the distortion measure in Eq (9.1) never increases. Since an iteration that changes the assignment lowers the distortion, no assignment is visited twice. In the worst case we find the optimal assignment and $\{\boldsymbol{\mu}_k\}$ in the last iteration, so at most $K^N$ iterations are required. The algorithm is therefore guaranteed to converge, because the number of assignments is finite and the optimal $\{\boldsymbol{\mu}_k\}$ is determined once the assignment is given.

Problem 9.2 Solution

By analogy to Eq (9.1), we can write down:

$$J_N = J_{N-1} + \sum_{k=1}^{K} r_{Nk}\|\mathbf{x}_N - \boldsymbol{\mu}_k\|^2$$

In the E-step, we still assign the $N$-th data point $\mathbf{x}_N$ to the closest center, and suppose that this closest center is $\boldsymbol{\mu}_m$. Therefore, the expression above reduces to:

$$J_N = J_{N-1} + \|\mathbf{x}_N - \boldsymbol{\mu}_m\|^2$$

In the M-step, we set the derivative of $J_N$ with respect to $\boldsymbol{\mu}_k$ to 0, where $k = 1, 2, ..., K$.
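We can observe that for those $\mu_k$ with $k \neq m$, we have:
\[ \frac{\partial J_N}{\partial \mu_k} = \frac{\partial J_{N-1}}{\partial \mu_k} \]
In other words, we only update $\mu_m$ in the M-step by setting the derivative of $J_N$ equal to 0. Utilizing Eq (9.4), we can obtain:
\[
\begin{aligned}
\mu_m^{(N)} &= \frac{\sum_{n=1}^{N-1} r_{nm}\,\mathbf{x}_n + \mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm} + 1}
= \frac{\dfrac{\sum_{n=1}^{N-1} r_{nm}\,\mathbf{x}_n}{\sum_{n=1}^{N-1} r_{nm}} + \dfrac{\mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm}}}{1 + \dfrac{1}{\sum_{n=1}^{N-1} r_{nm}}} \\
&= \frac{\mu_m^{(N-1)} + \dfrac{\mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm}}}{1 + \dfrac{1}{\sum_{n=1}^{N-1} r_{nm}}}
= \mu_m^{(N-1)} + \frac{\dfrac{\mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm}} - \dfrac{\mu_m^{(N-1)}}{\sum_{n=1}^{N-1} r_{nm}}}{1 + \dfrac{1}{\sum_{n=1}^{N-1} r_{nm}}} \\
&= \mu_m^{(N-1)} + \frac{\mathbf{x}_N - \mu_m^{(N-1)}}{1 + \sum_{n=1}^{N-1} r_{nm}}
\end{aligned}
\]
So far we have obtained a sequential on-line update formula just as required.

Problem 9.3 Solution

We simply follow the hint.
\[ p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}|\mathbf{z}) = \sum_{\mathbf{z}} \prod_{k=1}^{K} \left[ \pi_k\,\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k) \right]^{z_k} \]
Note that we have used the 1-of-K coding scheme for $\mathbf{z} = [z_1, z_2, ..., z_K]^T$. To be more specific, only one of $z_1, z_2, ..., z_K$ will be 1 and all others will equal 0. Therefore, the summation over $\mathbf{z}$ actually consists of $K$ terms, and the $k$-th term corresponds to $z_k$ equal to 1 and the others 0. Moreover, for the $k$-th term, the product reduces to $\pi_k\,\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)$. Therefore, we can obtain:
\[ p(\mathbf{x}) = \sum_{\mathbf{z}} \prod_{k=1}^{K} \left[ \pi_k\,\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k) \right]^{z_k} = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k) \]
Just as required.

Problem 9.4 Solution

According to Bayes' Theorem, we can write:
\[ p(\theta|\mathbf{X}) \propto p(\mathbf{X}|\theta)\,p(\theta) \]
Taking the logarithm on both sides, we can write:
\[ \ln p(\theta|\mathbf{X}) = \ln p(\mathbf{X}|\theta) + \ln p(\theta) + \text{const} \]
Further utilizing Eq (9.29), we can obtain:
\[
\begin{aligned}
\ln p(\theta|\mathbf{X}) &= \ln\Big\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\theta) \Big\} + \ln p(\theta) + \text{const} \\
&= \ln\Big\{ \Big[ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\theta) \Big] \cdot p(\theta) \Big\} + \text{const} \\
&= \ln\Big\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\theta)\,p(\theta) \Big\} + \text{const}
\end{aligned}
\]
In other words, in this case, the only modification is that the term $p(\mathbf{X}, \mathbf{Z}|\theta)$ in Eq (9.29) is replaced by $p(\mathbf{X}, \mathbf{Z}|\theta)\,p(\theta)$. Therefore, in the E-step, we still need to calculate the posterior $p(\mathbf{Z}|\mathbf{X}, \theta^{old})$, and then in the M-step, we are required to maximize $Q'(\theta, \theta^{old})$.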
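In this case, by analogy to Eq (9.30), we can write down $Q'(\theta, \theta^{old})$:
\[
\begin{aligned}
Q'(\theta, \theta^{old}) &= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{old}) \ln\left[ p(\mathbf{X}, \mathbf{Z}|\theta)\,p(\theta) \right] \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{old}) \left[ \ln p(\mathbf{X}, \mathbf{Z}|\theta) + \ln p(\theta) \right] \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{old}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) + \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{old}) \ln p(\theta) \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{old}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) + \ln p(\theta) \cdot \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{old}) \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{old}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) + \ln p(\theta) \\
&= Q(\theta, \theta^{old}) + \ln p(\theta)
\end{aligned}
\]
Just as required.

Problem 9.5 Solution

Notice that the conditioning on $\mu$, $\Sigma$ and $\pi$ can be omitted here, and we only need to prove that $p(\mathbf{Z}|\mathbf{X})$ can be written as the product of $p(\mathbf{z}_n|\mathbf{x}_n)$. Correspondingly, the small dots representing $\mu$, $\Sigma$ and $\pi$ can also be omitted in Fig 9.6. Observing Fig 9.6 and based on the definition, we can write:
\[ p(\mathbf{X}, \mathbf{Z}) = p(\mathbf{x}_1|\mathbf{z}_1)\,p(\mathbf{z}_1) \cdots p(\mathbf{x}_N|\mathbf{z}_N)\,p(\mathbf{z}_N) = p(\mathbf{x}_1, \mathbf{z}_1) \cdots p(\mathbf{x}_N, \mathbf{z}_N) \]
Moreover, since there is no link from $\mathbf{z}_m$ to $\mathbf{z}_n$, from $\mathbf{x}_m$ to $\mathbf{x}_n$, or from $\mathbf{z}_m$ to $\mathbf{x}_n$ ($m \neq n$), we can obtain:
\[ p(\mathbf{Z}) = p(\mathbf{z}_1) \cdots p(\mathbf{z}_N), \qquad p(\mathbf{X}) = p(\mathbf{x}_1) \cdots p(\mathbf{x}_N) \]
These can also be verified by calculating the marginal distribution from $p(\mathbf{X}, \mathbf{Z})$, for example:
\[ p(\mathbf{Z}) = \sum_{\mathbf{X}} p(\mathbf{X}, \mathbf{Z}) = \sum_{\mathbf{x}_1, ..., \mathbf{x}_N} p(\mathbf{x}_1, \mathbf{z}_1) \cdots p(\mathbf{x}_N, \mathbf{z}_N) = p(\mathbf{z}_1) \cdots p(\mathbf{z}_N) \]
According to Bayes' Theorem, we have:
\[
\begin{aligned}
p(\mathbf{Z}|\mathbf{X}) &= \frac{p(\mathbf{X}|\mathbf{Z})\,p(\mathbf{Z})}{p(\mathbf{X})}
= \frac{\left[ \prod_{n=1}^{N} p(\mathbf{x}_n|\mathbf{z}_n) \right] \left[ \prod_{n=1}^{N} p(\mathbf{z}_n) \right]}{\prod_{n=1}^{N} p(\mathbf{x}_n)} \\
&= \prod_{n=1}^{N} \frac{p(\mathbf{x}_n|\mathbf{z}_n)\,p(\mathbf{z}_n)}{p(\mathbf{x}_n)}
= \prod_{n=1}^{N} p(\mathbf{z}_n|\mathbf{x}_n)
\end{aligned}
\]
Just as required. The essence behind the problem is that in the directed graph, there are only links from $\mathbf{z}_n$ to $\mathbf{x}_n$. The deeper reason is that (i) the mixture model is given by Fig 9.4, and (ii) we assume the data $\{\mathbf{x}_n\}$ is i.i.d., and thus there is no link from $\mathbf{x}_m$ to $\mathbf{x}_n$.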
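The factorization in Prob.9.5 can be checked numerically on a tiny example. The sketch below is only an illustration under assumed values: it uses a hypothetical two-component mixture over a discrete variable with three symbols (the arrays `pi`, `lik` and the data `X` are made up for this check and do not come from the book). It enumerates every joint assignment Z, normalizes the joint to get p(Z|X), and compares the result with the product of the per-point posteriors p(z_n|x_n).

```python
import itertools
import numpy as np

# Hypothetical toy mixture (illustrative values only).
pi = np.array([0.3, 0.7])                 # p(z = k)
lik = np.array([[0.5, 0.2, 0.3],          # p(x | z = 0)
                [0.1, 0.6, 0.3]])         # p(x | z = 1)
X = [2, 0, 1]                             # three observed symbols

def joint(Z):
    """p(X, Z) = prod_n p(z_n) p(x_n | z_n) for one assignment Z."""
    return np.prod([pi[z] * lik[z, x] for z, x in zip(Z, X)])

# Full posterior p(Z | X) by brute-force enumeration over all K^N assignments.
configs = list(itertools.product(range(len(pi)), repeat=len(X)))
joint_vals = np.array([joint(Z) for Z in configs])
posterior_full = joint_vals / joint_vals.sum()

# Per-point posteriors p(z_n | x_n), one row per data point.
per_point = np.array([pi * lik[:, x] / (pi * lik[:, x]).sum() for x in X])

# Product of the per-point posteriors for each assignment Z.
posterior_factored = np.array(
    [np.prod([per_point[n, z] for n, z in enumerate(Z)]) for Z in configs]
)

max_gap = np.max(np.abs(posterior_full - posterior_factored))
print("largest absolute difference:", max_gap)
```

The two arrays agree up to floating-point rounding, which is what the product form of p(Z|X) predicts; the brute-force enumeration is only feasible here because K^N is tiny.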
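Problem 9.6 Solution

By analogy to Eq (9.19), we calculate the derivative of Eq (9.14) with respect to $\Sigma$:
\[ \frac{\partial \ln p}{\partial \Sigma} = \frac{\partial}{\partial \Sigma} \Big\{ \sum_{n=1}^{N} \ln a_n \Big\} = \sum_{n=1}^{N} \frac{1}{a_n} \frac{\partial a_n}{\partial \Sigma} \qquad (*) \]
Where we have defined:
\[ a_n = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \]
Recall that in Prob.2.34, we have proved:
\[ \frac{\partial \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)}{\partial \Sigma} = -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1} \]
Where we have defined:
\[ \mathbf{S}_{nk} = (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^T \]
Therefore, we can obtain:
\[
\begin{aligned}
\frac{\partial a_n}{\partial \Sigma} &= \frac{\partial}{\partial \Sigma} \Big\{ \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \Big\}
= \sum_{k=1}^{K} \pi_k \frac{\partial}{\partial \Sigma} \Big\{ \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \Big\} \\
&= \sum_{k=1}^{K} \pi_k \frac{\partial}{\partial \Sigma} \Big\{ \exp\left[ \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \right] \Big\} \\
&= \sum_{k=1}^{K} \pi_k \cdot \exp\left[ \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \right] \cdot \frac{\partial}{\partial \Sigma} \left[ \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \right] \\
&= \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \cdot \Big( -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1} \Big)
\end{aligned}
\]
Substituting the equation above into $(*)$, we can obtain:
\[
\begin{aligned}
\frac{\partial \ln p}{\partial \Sigma} &= \sum_{n=1}^{N} \frac{1}{a_n} \frac{\partial a_n}{\partial \Sigma}
= \sum_{n=1}^{N} \frac{\sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \cdot \big( -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1} \big)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma)} \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \cdot \Big( -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1} \Big) \\
&= -\frac{1}{2} \Big\{ \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \Big\} \Sigma^{-1} + \frac{1}{2}\Sigma^{-1} \Big\{ \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\,\mathbf{S}_{nk} \Big\} \Sigma^{-1}
\end{aligned}
\]
If we set the derivative equal to 0, we can obtain:
\[ \Sigma = \frac{\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\,\mathbf{S}_{nk}}{\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})} \]

Problem 9.7 Solution

We begin by calculating the derivative of Eq (9.36) with respect to $\mu_k$:
\[
\begin{aligned}
\frac{\partial \ln p}{\partial \mu_k} &= \frac{\partial}{\partial \mu_k} \Big\{ \sum_{n=1}^{N} \sum_{j=1}^{K} z_{nj} \left[ \ln \pi_j + \ln \mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma_j) \right] \Big\} \\
&= \frac{\partial}{\partial \mu_k} \Big\{ \sum_{n=1}^{N} z_{nk} \left[ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k) \right] \Big\} \\
&= \sum_{n=1}^{N} \frac{\partial}{\partial \mu_k} \Big\{ z_{nk} \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k) \Big\}
= \sum_{\mathbf{x}_n \in C_k} \frac{\partial}{\partial \mu_k} \Big\{ \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k) \Big\}
\end{aligned}
\]
Where we have used $\mathbf{x}_n \in C_k$ to represent the data points $\mathbf{x}_n$ which are assigned to the $k$-th cluster. Therefore, $\mu_k$ is given by the mean of those $\mathbf{x}_n \in C_k$, just as in the case of a single Gaussian. It is exactly the same for the covariance. Next, we maximize Eq (9.36) with respect to $\pi_k$ by introducing a Lagrange multiplier:
\[ L = \ln p + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) \]
We calculate the derivative of $L$ with respect to $\pi_k$ and set it to 0:
\[ \frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{z_{nk}}{\pi_k} + \lambda = 0 \]
We multiply both sides by $\pi_k$ and sum over $k$ making use of the constraint Eq (9.9), yielding $\lambda = -N$.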
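Substituting it back into the expression, we can obtain:
\[ \pi_k = \frac{1}{N} \sum_{n=1}^{N} z_{nk} \]
Just as required.

Problem 9.8 Solution

Since $\gamma(z_{nk})$ is fixed, the only dependency of Eq (9.40) on $\mu_k$ occurs in the Gaussian, yielding:
\[
\begin{aligned}
\frac{\partial \mathbb{E}_{\mathbf{z}}[\ln p]}{\partial \mu_k} &= \frac{\partial}{\partial \mu_k} \Big\{ \sum_{n=1}^{N} \gamma(z_{nk}) \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k) \Big\} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \cdot \frac{\partial \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)}{\partial \mu_k} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \cdot \Sigma_k^{-1} (\mathbf{x}_n - \mu_k)
\end{aligned}
\]
Setting the derivative equal to 0, we obtain exactly Eq (9.16) (up to an overall sign, which does not matter once the expression is set to 0), and consequently Eq (9.17) just as required. Note that there is a typo in Eq (9.16): $\Sigma_k$ should be $\Sigma_k^{-1}$.

Problem 9.9 Solution

We first calculate the derivative of Eq (9.40) with respect to $\Sigma_k$:
\[
\begin{aligned}
\frac{\partial \mathbb{E}_{\mathbf{z}}}{\partial \Sigma_k} &= \frac{\partial}{\partial \Sigma_k} \Big\{ \sum_{n=1}^{N} \gamma(z_{nk}) \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k) \Big\}
= \sum_{n=1}^{N} \gamma(z_{nk}) \frac{\partial \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)}{\partial \Sigma_k} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \cdot \Big[ -\frac{1}{2}\Sigma_k^{-1} + \frac{1}{2}\Sigma_k^{-1}\mathbf{S}_{nk}\Sigma_k^{-1} \Big]
\end{aligned}
\]
As in Prob 9.6, we have defined:
\[ \mathbf{S}_{nk} = (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^T \]
Setting the derivative equal to 0 and rearranging it, we obtain:
\[ \Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{S}_{nk}}{\sum_{n=1}^{N} \gamma(z_{nk})} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{S}_{nk}}{N_k} \]
Where $N_k$ is given by Eq (9.18). So now we have obtained Eq (9.19) just as required. Next, to maximize Eq (9.40) with respect to $\pi_k$, we still need to introduce a Lagrange multiplier to enforce that the summation of $\pi_k$ over $k$ equals 1, as in Prob 9.7:
\[ L = \mathbb{E}_{\mathbf{z}} + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) \]
We calculate the derivative of $L$ with respect to $\pi_k$ and set it to 0:
\[ \frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0 \]
We multiply both sides by $\pi_k$ and sum over $k$ making use of the constraint Eq (9.9), yielding $\lambda = -N$ (you can see Eq (9.20)-Eq (9.22) for more details). Substituting it back into the expression, we can obtain:
\[ \pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk}) = \frac{N_k}{N} \]
Just as Eq (9.22).

Problem 9.10 Solution

According to the property of PDF, we know that:
\[ p(\mathbf{x}_b|\mathbf{x}_a) = \frac{p(\mathbf{x}_a, \mathbf{x}_b)}{p(\mathbf{x}_a)} = \frac{p(\mathbf{x})}{p(\mathbf{x}_a)} = \sum_{k=1}^{K} \frac{\pi_k}{p(\mathbf{x}_a)} \cdot p(\mathbf{x}|k) \]
Note that here $p(\mathbf{x}_a)$ can be viewed as a normalization constant used to guarantee that the integral of $p(\mathbf{x}_b|\mathbf{x}_a)$ equals 1.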
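Moreover, similarly, we can also obtain:
\[ p(\mathbf{x}_a|\mathbf{x}_b) = \sum_{k=1}^{K} \frac{\pi_k}{p(\mathbf{x}_b)} \cdot p(\mathbf{x}|k) \]

Problem 9.11 Solution

According to the problem description, the expectation, i.e., Eq (9.40), can now be written as:
\[ \mathbb{E}_{\mathbf{z}}[\ln p] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \Big\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \epsilon\mathbf{I}) \Big\} \]
In the M-step, we are required to maximize the expression above with respect to $\mu_k$ and $\pi_k$. In Prob.9.8, we have already proved that $\mu_k$ should be given by Eq (9.17):
\[ \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n \qquad (*) \]
Where $N_k$ is given by Eq (9.18). Moreover, in this case, by analogy to Eq (9.16), $\gamma(z_{nk})$ is slightly different:
\[ \gamma(z_{nk}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n|\mu_k, \epsilon\mathbf{I})}{\sum_j \pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \epsilon\mathbf{I})} \]
When $\epsilon \to 0$, we can obtain:
\[ \sum_j \pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \epsilon\mathbf{I}) \approx \pi_m\,\mathcal{N}(\mathbf{x}_n|\mu_m, \epsilon\mathbf{I}), \quad \text{where } m = \arg\min_j \|\mathbf{x}_n - \mu_j\|^2 \]
To be more clear, the summation is dominated by the largest of the terms $\pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \epsilon\mathbf{I})$, and this term is in turn determined by the exponent, i.e., $-\|\mathbf{x}_n - \mu_j\|^2$. Therefore, $\gamma(z_{nk})$ is given by exactly Eq (9.2), i.e., we have $\gamma(z_{nk}) = r_{nk}$. Combining with $(*)$, we can obtain exactly Eq (9.4). Next, according to Prob.9.9, $\pi_k$ is given by Eq (9.22):
\[ \pi_k = \frac{N_k}{N} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})}{N} = \frac{\sum_{n=1}^{N} r_{nk}}{N} \]
In other words, $\pi_k$ equals the fraction of the data points assigned to the $k$-th cluster.

Problem 9.12 Solution

First we calculate the mean of the mixture:
\[ \mathbb{E}[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x} = \int \mathbf{x} \sum_{k=1}^{K} \pi_k\,p(\mathbf{x}|k)\,d\mathbf{x} = \sum_{k=1}^{K} \pi_k \int \mathbf{x}\,p(\mathbf{x}|k)\,d\mathbf{x} = \sum_{k=1}^{K} \pi_k\,\mu_k \]
Then we deal with the covariance matrix. For an arbitrary random variable $\mathbf{x}$, according to Eq (2.63) we have:
\[ \text{cov}[\mathbf{x}] = \mathbb{E}[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^T] = \mathbb{E}[\mathbf{x}\mathbf{x}^T] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{x}]^T \]
Since $\mathbb{E}[\mathbf{x}]$ is already obtained, we only need to solve $\mathbb{E}[\mathbf{x}\mathbf{x}^T]$. First we focus only on the $k$-th component and rearrange the expression above, yielding:
\[ \mathbb{E}_k[\mathbf{x}\mathbf{x}^T] = \text{cov}_k[\mathbf{x}] + \mathbb{E}_k[\mathbf{x}]\,\mathbb{E}_k[\mathbf{x}]^T = \Sigma_k + \mu_k\mu_k^T \]
We further use Eq (2.62), yielding:
\[ \mathbb{E}[\mathbf{x}\mathbf{x}^T] = \int \mathbf{x}\mathbf{x}^T \sum_{k=1}^{K} \pi_k\,p(\mathbf{x}|k)\,d\mathbf{x} = \sum_{k=1}^{K} \pi_k \int \mathbf{x}\mathbf{x}^T p(\mathbf{x}|k)\,d\mathbf{x} = \sum_{k=1}^{K} \pi_k\,\mathbb{E}_k[\mathbf{x}\mathbf{x}^T] = \sum_{k=1}^{K} \pi_k (\mu_k\mu_k^T + \Sigma_k) \]
Therefore, we obtain Eq (9.50) just as required.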
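The mixture moments derived in Prob.9.12 can be verified exactly on a small Bernoulli mixture, where the component covariance is diag(mu_ki (1 - mu_ki)). The following sketch is only an illustration: the parameter values `pi` and `mu` are hypothetical, and the check works by enumerating all 2^D binary vectors rather than by sampling.

```python
import itertools
import numpy as np

# Hypothetical Bernoulli mixture with K = 2 components and D = 3 binary variables.
pi = np.array([0.4, 0.6])
mu = np.array([[0.2, 0.7, 0.5],
               [0.9, 0.1, 0.3]])   # mu[k, i] = p(x_i = 1 | component k)

def mixture_prob(x):
    """p(x) = sum_k pi_k * prod_i mu_ki^x_i (1 - mu_ki)^(1 - x_i)."""
    comp = np.prod(mu**x * (1 - mu)**(1 - x), axis=1)
    return float(pi @ comp)

# Exact moments by enumerating every binary vector.
states = np.array(list(itertools.product([0, 1], repeat=mu.shape[1])), dtype=float)
probs = np.array([mixture_prob(x) for x in states])
mean_enum = probs @ states
second_enum = (states * probs[:, None]).T @ states       # E[x x^T]
cov_enum = second_enum - np.outer(mean_enum, mean_enum)

# Closed-form moments: E[x] = sum_k pi_k mu_k and
# E[x x^T] = sum_k pi_k (Sigma_k + mu_k mu_k^T), Sigma_k = diag(mu_ki (1 - mu_ki)).
mean_formula = pi @ mu
second_formula = sum(
    pi[k] * (np.diag(mu[k] * (1 - mu[k])) + np.outer(mu[k], mu[k]))
    for k in range(len(pi))
)
cov_formula = second_formula - np.outer(mean_formula, mean_formula)

print(np.allclose(mean_enum, mean_formula), np.allclose(cov_enum, cov_formula))
```

Enumeration is used instead of Monte Carlo so that the comparison is exact up to rounding; for larger D a sampling-based check would be the practical alternative.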
Problem 9.13 Solution

First, let's make this problem more clear. Consider a mixture of Bernoulli distributions, whose complete-data log likelihood is given by Eq (9.54) and whose model parameters are $\pi_k$ and $\mu_k$. If we want to obtain those parameters, we can adopt the EM algorithm. In the E-step, we calculate $\gamma(z_{nk})$ as shown in Eq (9.56). In the M-step, we update $\pi_k$ and $\mu_k$ according to Eq (9.59) and Eq (9.60), where $N_k$ and $\bar{\mathbf{x}}_k$ are defined in Eq (9.57) and Eq (9.58). Now let's go back to this problem. The expectation of $\mathbf{x}$ is given by Eq (9.49):
\[ \mathbb{E}[\mathbf{x}] = \sum_{k=1}^{K} \pi_k^{(opt)} \mu_k^{(opt)} \]
Here $\pi_k^{(opt)}$ and $\mu_k^{(opt)}$ are the parameters obtained when EM has converged. Using Eq (9.58) and Eq (9.59), we can obtain:
\[
\begin{aligned}
\mathbb{E}[\mathbf{x}] &= \sum_{k=1}^{K} \pi_k^{(opt)} \mu_k^{(opt)}
= \sum_{k=1}^{K} \pi_k^{(opt)} \frac{1}{N_k^{(opt)}} \sum_{n=1}^{N} \gamma(z_{nk})^{(opt)}\,\mathbf{x}_n \\
&= \sum_{k=1}^{K} \frac{N_k^{(opt)}}{N} \frac{1}{N_k^{(opt)}} \sum_{n=1}^{N} \gamma(z_{nk})^{(opt)}\,\mathbf{x}_n
= \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})^{(opt)}\,\mathbf{x}_n \\
&= \sum_{n=1}^{N} \frac{\mathbf{x}_n}{N} \sum_{k=1}^{K} \gamma(z_{nk})^{(opt)}
= \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n = \bar{\mathbf{x}}
\end{aligned}
\]
If we set all $\mu_k$ equal to $\hat{\mu}$ in initialization, in the first E-step, we can obtain:
\[ \gamma(z_{nk})^{(1)} = \frac{\pi_k^{(0)}\,p(\mathbf{x}_n|\mu_k = \hat{\mu})}{\sum_{j=1}^{K} \pi_j^{(0)}\,p(\mathbf{x}_n|\mu_j = \hat{\mu})} = \frac{\pi_k^{(0)}}{\sum_{j=1}^{K} \pi_j^{(0)}} = \pi_k^{(0)} \]
Note that here $\hat{\mu}$ and $\pi_k^{(0)}$ are the initial values. In the subsequent M-step, according to Eq (9.57)-(9.60), we can obtain:
\[ \mu_k^{(1)} = \frac{1}{N_k^{(1)}} \sum_{n=1}^{N} \gamma(z_{nk})^{(1)}\,\mathbf{x}_n = \frac{\sum_{n=1}^{N} \gamma(z_{nk})^{(1)}\,\mathbf{x}_n}{\sum_{n=1}^{N} \gamma(z_{nk})^{(1)}} = \frac{\sum_{n=1}^{N} \pi_k^{(0)}\,\mathbf{x}_n}{\sum_{n=1}^{N} \pi_k^{(0)}} = \frac{\sum_{n=1}^{N} \mathbf{x}_n}{N} \]
And
\[ \pi_k^{(1)} = \frac{N_k^{(1)}}{N} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})^{(1)}}{N} = \frac{\sum_{n=1}^{N} \pi_k^{(0)}}{N} = \pi_k^{(0)} \]
In other words, in this case, after the first EM iteration, we find that the new $\mu_k^{(1)}$ are all identical and are all given by $\bar{\mathbf{x}}$. Moreover, the new $\pi_k^{(1)}$ are identical to their corresponding initial values $\pi_k^{(0)}$. Therefore, in the second EM iteration, we can similarly conclude that:
\[ \mu_k^{(2)} = \mu_k^{(1)} = \bar{\mathbf{x}}, \qquad \pi_k^{(2)} = \pi_k^{(1)} = \pi_k^{(0)} \]
In other words, the EM algorithm actually stops after the first iteration.

Problem 9.14 Solution

Let's follow the hint.
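\[ p(\mathbf{x}, \mathbf{z}|\mu, \pi) = p(\mathbf{x}|\mathbf{z}, \mu) \cdot p(\mathbf{z}|\pi) = \prod_{k=1}^{K} p(\mathbf{x}|\mu_k)^{z_k} \cdot \prod_{k=1}^{K} \pi_k^{z_k} = \prod_{k=1}^{K} \left[ \pi_k\,p(\mathbf{x}|\mu_k) \right]^{z_k} \]
Then we marginalize over $\mathbf{z}$, yielding:
\[ p(\mathbf{x}|\mu, \pi) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z}|\mu, \pi) = \sum_{\mathbf{z}} \prod_{k=1}^{K} \left[ \pi_k\,p(\mathbf{x}|\mu_k) \right]^{z_k} \]
The summation over $\mathbf{z}$ is made up of $K$ terms, and the $k$-th term corresponds to $z_k = 1$ with all other $z_j$, $j \neq k$, equal to 0. Therefore, the $k$-th term simply reduces to $\pi_k\,p(\mathbf{x}|\mu_k)$. Hence, performing the summation over $\mathbf{z}$ finally gives Eq (9.47) just as required. To be more clear, we summarize the aforementioned statement:
\[
\begin{aligned}
p(\mathbf{x}|\mu, \pi) &= \sum_{\mathbf{z}} \prod_{k=1}^{K} \left[ \pi_k\,p(\mathbf{x}|\mu_k) \right]^{z_k} \\
&= \prod_{k=1}^{K} \left[ \pi_k\,p(\mathbf{x}|\mu_k) \right]^{z_k} \Big|_{z_1 = 1} + ... + \prod_{k=1}^{K} \left[ \pi_k\,p(\mathbf{x}|\mu_k) \right]^{z_k} \Big|_{z_K = 1} \\
&= \pi_1\,p(\mathbf{x}|\mu_1) + ... + \pi_K\,p(\mathbf{x}|\mu_K) = \sum_{k=1}^{K} \pi_k\,p(\mathbf{x}|\mu_k)
\end{aligned}
\]

Problem 9.15 Solution

Noticing that $\pi_k$ doesn't depend on any $\mu_{ki}$, we can omit the first term inside the braces when calculating the derivative of Eq (9.55) with respect to $\mu_{ki}$:
\[
\begin{aligned}
\frac{\partial \mathbb{E}_{\mathbf{z}}[\ln p]}{\partial \mu_{ki}} &= \frac{\partial}{\partial \mu_{ki}} \Big\{ \sum_{n=1}^{N} \sum_{k'=1}^{K} \gamma(z_{nk'}) \sum_{i'=1}^{D} \left[ x_{ni'} \ln \mu_{k'i'} + (1 - x_{ni'}) \ln(1 - \mu_{k'i'}) \right] \Big\} \\
&= \sum_{n=1}^{N} \frac{\partial}{\partial \mu_{ki}} \Big\{ \gamma(z_{nk}) \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \right] \Big\} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \Big( \frac{x_{ni}}{\mu_{ki}} - \frac{1 - x_{ni}}{1 - \mu_{ki}} \Big)
= \sum_{n=1}^{N} \gamma(z_{nk}) \frac{x_{ni} - \mu_{ki}}{\mu_{ki}(1 - \mu_{ki})}
\end{aligned}
\]
Setting the derivative equal to 0, we can obtain:
\[ \mu_{ki} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,x_{ni}}{\sum_{n=1}^{N} \gamma(z_{nk})} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,x_{ni} \]
Where $N_k$ is defined as in Eq (9.57). If we group all the $\mu_{ki}$ as a column vector, i.e., $\mu_k = [\mu_{k1}, \mu_{k2}, ..., \mu_{kD}]^T$, we obtain Eq (9.59) just as required.

Problem 9.16 Solution

We follow the hint, beginning by introducing a Lagrange multiplier:
\[ L = \mathbb{E}_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)] + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) \]
We calculate the derivative of $L$ with respect to $\pi_k$ and then set it equal to 0:
\[ \frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0 \qquad (*) \]
Here $\mathbb{E}_{\mathbf{z}}[\ln p]$ is given by Eq (9.55).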
We first multiply both sides of the expression by $\pi_k$ and then sum over $k$, which gives:
\[ \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) + \sum_{k=1}^{K} \lambda \pi_k = 0 \]
Noticing that $\sum_{k=1}^{K} \pi_k$ equals 1, we can obtain:
\[ \lambda = -\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \]
Finally, substituting it back into $(*)$ and rearranging it, we can obtain:
\[ \pi_k = -\frac{\sum_{n=1}^{N} \gamma(z_{nk})}{\lambda} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})}{\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})} = \frac{N_k}{N} \]
Where $N_k$ is defined by Eq (9.57), and $N$ is the summation of $N_k$ over $k$, which also equals the number of data points.

Problem 9.17 Solution

The incomplete-data log likelihood is given by Eq (9.51), and $p(\mathbf{x}_n|\mu_k)$ lies in the interval $[0, 1]$, which can be easily verified from its definition, i.e., Eq (9.44). Therefore, we can obtain:
\[ \ln p(\mathbf{X}|\mu, \pi) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k\,p(\mathbf{x}_n|\mu_k) \Big\} \leq \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \times 1 \Big\} = \sum_{n=1}^{N} \ln 1 = 0 \]
Where we have used the fact that the logarithm is monotonically increasing, and that the summation of $\pi_k$ over $k$ equals 1. Moreover, if we want to achieve the equality, we need $p(\mathbf{x}_n|\mu_k)$ equal to 1 for all $n = 1, 2, ..., N$. However, this is hardly possible.

To illustrate this, suppose that $p(\mathbf{x}_n|\mu_k)$ equals 1 for all data points. Without loss of generality, consider two data points $\mathbf{x}_1 = [x_{11}, x_{12}, ..., x_{1D}]^T$ and $\mathbf{x}_2 = [x_{21}, x_{22}, ..., x_{2D}]^T$, whose $i$-th entries are different. We further assume $x_{1i} = 1$ and $x_{2i} = 0$, since $x_i$ is a binary variable. According to Eq (9.44), if we want $p(\mathbf{x}_1|\mu_k) = 1$, we must have $\mu_{ki} = 1$ (otherwise it must be less than 1). However, this leads to $p(\mathbf{x}_2|\mu_k) = 0$, since there is a term $1 - \mu_{ki} = 0$ in the product shown in Eq (9.44).

Therefore, the bound can only be attained when the data set is pathological, and only then will EM reach this singularity point. Note that in the main text, the author states that the condition should be pathological initialization. This is also true.
For instance, in the extreme case, when the data set is not pathological, if we initialize one $\pi_k$ equal to 1 and the others all 0, and some of the $\mu_{ki}$ to 1 and the others 0, we may also reach the singularity.

Problem 9.18 Solution

In Prob.9.4, we have proved that if we want to maximize the posterior by EM, the only modification is that in the M-step, we need to maximize $Q'(\theta, \theta^{old}) = Q(\theta, \theta^{old}) + \ln p(\theta)$. Here $Q(\theta, \theta^{old})$ has already been given by $\mathbb{E}_{\mathbf{z}}[\ln p]$, i.e., Eq (9.55). Therefore, we only need to derive $\ln p(\theta)$. Note that $\ln p(\theta)$ is made up of two parts: (i) the prior for $\mu_k$ and (ii) the prior for $\pi$. We begin by dealing with the first part. Here we assume that, for fixed $k$, the Beta prior is the same for every $\mu_{ki}$, i.e.:
\[ p(\mu_{ki}|a_k, b_k) = \frac{\Gamma(a_k + b_k)}{\Gamma(a_k)\Gamma(b_k)}\,\mu_{ki}^{a_k - 1}(1 - \mu_{ki})^{b_k - 1}, \quad i = 1, 2, ..., D \]
Therefore, the contribution of this Beta prior to $\ln p(\theta)$ should be given by:
\[ \sum_{k=1}^{K} \sum_{i=1}^{D} (a_k - 1) \ln \mu_{ki} + (b_k - 1) \ln(1 - \mu_{ki}) \]
One thing worth mentioning is that since we will maximize $Q'(\theta, \theta^{old})$ with respect to $\pi$ and $\mu_k$, we can omit the terms which do not depend on $\pi$, $\mu_k$, such as $\Gamma(a_k + b_k) / \Gamma(a_k)\Gamma(b_k)$. Then we deal with the second part. According to Eq (2.38), we can obtain:
\[ p(\pi|\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1} \]
Therefore, the contribution of the Dirichlet prior to $\ln p(\theta)$ should be given by:
\[ \sum_{k=1}^{K} (\alpha_k - 1) \ln \pi_k \]
Therefore, $Q'(\theta, \theta^{old})$ can now be written as:
\[ Q'(\theta, \theta^{old}) = \mathbb{E}_{\mathbf{z}}[\ln p] + \sum_{k=1}^{K} \sum_{i=1}^{D} \left[ (a_k - 1) \ln \mu_{ki} + (b_k - 1) \ln(1 - \mu_{ki}) \right] + \sum_{k=1}^{K} (\alpha_k - 1) \ln \pi_k \]
Similarly, we calculate the derivative of $Q'(\theta, \theta^{old})$ with respect to $\mu_{ki}$.
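This can be simplified by reusing the deduction in Prob.9.15 (here $a_k$, $b_k$ denote the Beta hyper-parameters shared by all $\mu_{ki}$ of component $k$):
\[
\begin{aligned}
\frac{\partial Q'}{\partial \mu_{ki}} &= \frac{\partial \mathbb{E}_{\mathbf{z}}[\ln p]}{\partial \mu_{ki}} + \frac{a_k - 1}{\mu_{ki}} - \frac{b_k - 1}{1 - \mu_{ki}} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \Big( \frac{x_{ni}}{\mu_{ki}} - \frac{1 - x_{ni}}{1 - \mu_{ki}} \Big) + \frac{a_k - 1}{\mu_{ki}} - \frac{b_k - 1}{1 - \mu_{ki}} \\
&= \frac{\sum_{n=1}^{N} x_{ni} \cdot \gamma(z_{nk}) + a_k - 1}{\mu_{ki}} - \frac{\sum_{n=1}^{N} (1 - x_{ni})\,\gamma(z_{nk}) + b_k - 1}{1 - \mu_{ki}} \\
&= \frac{N_k \bar{x}_{ki} + a_k - 1}{\mu_{ki}} - \frac{N_k - N_k \bar{x}_{ki} + b_k - 1}{1 - \mu_{ki}}
\end{aligned}
\]
Note that here $\bar{x}_{ki}$ is defined as the $i$-th entry of $\bar{\mathbf{x}}_k$ defined in Eq (9.58). To be more clear, we have used Eq (9.57) and Eq (9.58) in the last step:
\[ \sum_{n=1}^{N} x_{ni} \cdot \gamma(z_{nk}) = N_k \cdot \Big[ \frac{1}{N_k} \sum_{n=1}^{N} x_{ni} \cdot \gamma(z_{nk}) \Big] = N_k \cdot \bar{x}_{ki} \]
Setting the derivative equal to 0 and rearranging it, we can obtain:
\[ \mu_{ki} = \frac{N_k \bar{x}_{ki} + a_k - 1}{N_k + a_k - 1 + b_k - 1} \]
Next we maximize $Q'(\theta, \theta^{old})$ with respect to $\pi$. By analogy to Prob.9.16, we introduce a Lagrange multiplier:
\[ L \propto \mathbb{E}_{\mathbf{z}} + \sum_{k=1}^{K} (\alpha_k - 1) \ln \pi_k + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) \]
Note that the second term on the right hand side of $Q'$ in its definition has been omitted, since that term can be viewed as a constant with regard to $\pi$. We then calculate the derivative of $L$ with respect to $\pi_k$ by taking advantage of Prob.9.16:
\[ \frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \frac{\alpha_k - 1}{\pi_k} + \lambda = 0 \]
Similarly, we first multiply both sides of the expression by $\pi_k$ and then sum over $k$, which gives:
\[ \sum_{k=1}^{K} \sum_{n=1}^{N} \gamma(z_{nk}) + \sum_{k=1}^{K} (\alpha_k - 1) + \sum_{k=1}^{K} \lambda \pi_k = 0 \]
Noticing that $\sum_{k=1}^{K} \pi_k$ equals 1, we can obtain:
\[ \lambda = -\sum_{k=1}^{K} N_k - \sum_{k=1}^{K} (\alpha_k - 1) = -N - \alpha_0 + K \]
Here we have used Eq (2.39). Substituting it back into the derivative, we can obtain:
\[ \pi_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) + \alpha_k - 1}{-\lambda} = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K} \]
It is not difficult to show that if $N$ is large, the update formulas for $\pi$ and $\mu$ in this case (MAP) reduce to the results given in the main text (MLE).

Problem 9.19 Solution

We first introduce a latent variable $\mathbf{z} = [z_1, z_2, ..., z_K]^T$, only one of whose entries equals 1 while the others are all 0.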
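The conditional distribution of $\mathbf{x}$ is given by:
\[ p(\mathbf{x}|\mathbf{z}, \mu) = \prod_{k=1}^{K} p(\mathbf{x}|\mu_k)^{z_k} \]
The distribution of the latent variable is given by:
\[ p(\mathbf{z}|\pi) = \prod_{k=1}^{K} \pi_k^{z_k} \]
If we follow the same procedure as in Prob.9.14, we can show that Eq (9.84) holds. In other words, the introduction of the latent variable is valid. Therefore, according to Bayes' Theorem, we can obtain:
\[ p(\mathbf{X}, \mathbf{Z}|\mu, \pi) = \prod_{n=1}^{N} p(\mathbf{z}_n|\pi)\,p(\mathbf{x}_n|\mathbf{z}_n, \mu) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k\,p(\mathbf{x}_n|\mu_k) \right]^{z_{nk}} \]
We further use Eq (9.85), which gives:
\[
\begin{aligned}
\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi) &= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \Big[ \pi_k \prod_{i=1}^{D} \prod_{j=1}^{M} \mu_{kij}^{x_{nij}} \Big] \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \Big[ \ln \pi_k + \sum_{i=1}^{D} \sum_{j=1}^{M} x_{nij} \ln \mu_{kij} \Big]
\end{aligned}
\]
Similarly, in the E-step, the responsibilities are evaluated using Bayes' theorem, which gives:
\[ \gamma(z_{nk}) = \mathbb{E}[z_{nk}] = \frac{\pi_k\,p(\mathbf{x}_n|\mu_k)}{\sum_{j=1}^{K} \pi_j\,p(\mathbf{x}_n|\mu_j)} \]
Next, in the M-step, we are required to maximize $\mathbb{E}_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)]$ with respect to $\pi$ and $\mu_k$, where $\mathbb{E}_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)]$ is given by:
\[ \mathbb{E}_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \Big[ \ln \pi_k + \sum_{i=1}^{D} \sum_{j=1}^{M} x_{nij} \ln \mu_{kij} \Big] \]
Notice that there exist two constraints: (i) the summation of $\pi_k$ over $k$ equals 1, and (ii) the summation of $\mu_{kij}$ over $j$ equals 1 for any $k$ and $i$. We therefore need to introduce Lagrange multipliers:
\[ L = \mathbb{E}_{\mathbf{z}}[\ln p] + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) + \sum_{k=1}^{K} \sum_{i=1}^{D} \eta_{ki} \Big( \sum_{j=1}^{M} \mu_{kij} - 1 \Big) \]
First we maximize $L$ with respect to $\pi_k$. This is actually identical to the case in the main text. To be more clear, we calculate the derivative of $L$ with respect to $\pi_k$:
\[ \frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \lambda \]
As in Prob.9.16, we can obtain:
\[ \pi_k = \frac{N_k}{N} \]
Where $N_k$ is defined as:
\[ N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \]
$N$ is the summation of $N_k$ over $k$, and also equals the number of data points.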
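Then we calculate the derivative of $L$ with respect to $\mu_{kij}$:
\[ \frac{\partial L}{\partial \mu_{kij}} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})\,x_{nij}}{\mu_{kij}} + \eta_{ki} \]
We set it to 0 and multiply both sides by $\mu_{kij}$, which gives:
\[ \sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij} + \eta_{ki}\,\mu_{kij} = 0 \]
By analogy to deriving $\pi_k$, an intuitive idea is to sum the above expression over $j$, so that we can use the constraint $\sum_j \mu_{kij} = 1$:
\[ \eta_{ki} = -\sum_{j=1}^{M} \sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij} = -\sum_{n=1}^{N} \gamma(z_{nk}) \Big[ \sum_{j=1}^{M} x_{nij} \Big] = -\sum_{n=1}^{N} \gamma(z_{nk}) = -N_k \]
Where we have used the fact that $\sum_j x_{nij} = 1$. Substituting back into the derivative, we can obtain:
\[ \mu_{kij} = -\frac{\sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij}}{\eta_{ki}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij} \]

Problem 9.20 Solution

We first calculate the derivative of Eq (9.62) with respect to $\alpha$ and set it to 0:
\[ \frac{\partial \mathbb{E}[\ln p]}{\partial \alpha} = \frac{M}{2} \cdot \frac{2\pi}{\alpha} \cdot \frac{1}{2\pi} - \frac{\mathbb{E}[\mathbf{w}^T\mathbf{w}]}{2} = 0 \]
We rearrange the equation above, which gives:
\[ \alpha = \frac{M}{\mathbb{E}[\mathbf{w}^T\mathbf{w}]} \qquad (*) \]
Therefore, we now need to calculate the expectation $\mathbb{E}[\mathbf{w}^T\mathbf{w}]$. Notice that the posterior has already been given by Eq (3.49):
\[ p(\mathbf{w}|\mathbf{t}) = \mathcal{N}(\mathbf{m}_N, \mathbf{S}_N) \]
To calculate $\mathbb{E}[\mathbf{w}^T\mathbf{w}]$, here we write down a property of a Gaussian random variable: if $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, \Sigma)$, we have:
\[ \mathbb{E}[\mathbf{x}^T\mathbf{A}\mathbf{x}] = \text{Tr}[\mathbf{A}\Sigma] + \mathbf{m}^T\mathbf{A}\mathbf{m} \]
This property is given as Eq (378) in 'the Matrix Cookbook'. Utilizing this property, we can obtain:
\[ \mathbb{E}[\mathbf{w}^T\mathbf{w}] = \text{Tr}[\mathbf{S}_N] + \mathbf{m}_N^T\mathbf{m}_N \]
Substituting it back into $(*)$, we obtain what is required.

Problem 9.21 Solution

We calculate the derivative of Eq (9.62) with respect to $\beta$ and set it equal to 0:
\[ \frac{\partial \mathbb{E}[\ln p]}{\partial \beta} = \frac{N}{2} \cdot \frac{2\pi}{\beta} \cdot \frac{1}{2\pi} - \frac{1}{2} \sum_{n=1}^{N} \mathbb{E}[(t_n - \mathbf{w}^T\phi_n)^2] = 0 \]
Rearranging it, we obtain:
\[ \beta = \frac{N}{\sum_{n=1}^{N} \mathbb{E}[(t_n - \mathbf{w}^T\phi_n)^2]} \]
Therefore, we are required to calculate the expectation.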
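To be more clear, this expectation is with respect to the posterior defined by Eq (3.49):
\[ p(\mathbf{w}|\mathbf{t}) = \mathcal{N}(\mathbf{m}_N, \mathbf{S}_N) \]
We expand the expectation:
\[
\begin{aligned}
\mathbb{E}[(t_n - \mathbf{w}^T\phi_n)^2] &= \mathbb{E}[t_n^2 - 2 t_n \cdot \mathbf{w}^T\phi_n + \mathbf{w}^T\phi_n\phi_n^T\mathbf{w}] \\
&= \mathbb{E}[t_n^2] - \mathbb{E}[2 t_n \cdot \mathbf{w}^T\phi_n] + \mathbb{E}[\mathbf{w}^T(\phi_n\phi_n^T)\mathbf{w}] \\
&= t_n^2 - 2 t_n \cdot \mathbb{E}[\phi_n^T\mathbf{w}] + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N] + \mathbf{m}_N^T\phi_n\phi_n^T\mathbf{m}_N \\
&= t_n^2 - 2 t_n \phi_n^T \cdot \mathbb{E}[\mathbf{w}] + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N] + \mathbf{m}_N^T\phi_n\phi_n^T\mathbf{m}_N \\
&= t_n^2 - 2 t_n \phi_n^T\mathbf{m}_N + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N] + \mathbf{m}_N^T\phi_n\phi_n^T\mathbf{m}_N \\
&= (t_n - \mathbf{m}_N^T\phi_n)^2 + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N]
\end{aligned}
\]
Substituting it back into the derivative, we can obtain:
\[ \frac{1}{\beta} = \frac{1}{N} \sum_{n=1}^{N} \Big\{ (t_n - \mathbf{m}_N^T\phi_n)^2 + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N] \Big\} = \frac{1}{N} \Big\{ \|\mathbf{t} - \Phi\mathbf{m}_N\|^2 + \text{Tr}[\Phi^T\Phi\mathbf{S}_N] \Big\} \]
Note that in the last step, we have performed vectorization. Here the $j$-th row of $\Phi$ is given by $\phi_j^T$, identical to the definition given in Chapter 3.

Problem 9.22 Solution

First let's expand the complete-data log likelihood using Eq (7.79), Eq (7.80) and Eq (7.76).
\[
\begin{aligned}
\ln p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha) &= \ln p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) + \ln p(\mathbf{w}|\alpha) \\
&= \sum_{n=1}^{N} \ln p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) + \sum_{i=1}^{M} \ln \mathcal{N}(w_i|0, \alpha_i^{-1}) \\
&= \sum_{n=1}^{N} \ln \mathcal{N}(t_n|\mathbf{w}^T\phi_n, \beta^{-1}) + \sum_{i=1}^{M} \ln \mathcal{N}(w_i|0, \alpha_i^{-1}) \\
&= \frac{N}{2} \ln \frac{\beta}{2\pi} - \frac{\beta}{2} \sum_{n=1}^{N} (t_n - \mathbf{w}^T\phi_n)^2 + \frac{1}{2} \sum_{i=1}^{M} \ln \frac{\alpha_i}{2\pi} - \sum_{i=1}^{M} \frac{\alpha_i}{2} w_i^2
\end{aligned}
\]
Therefore, the expectation of the complete-data log likelihood with respect to the posterior of $\mathbf{w}$ equals:
\[ \mathbb{E}_{\mathbf{w}}[\ln p] = \frac{N}{2} \ln \frac{\beta}{2\pi} - \frac{\beta}{2} \sum_{n=1}^{N} \mathbb{E}_{\mathbf{w}}[(t_n - \mathbf{w}^T\phi_n)^2] + \frac{1}{2} \sum_{i=1}^{M} \ln \frac{\alpha_i}{2\pi} - \sum_{i=1}^{M} \frac{\alpha_i}{2} \mathbb{E}_{\mathbf{w}}[w_i^2] \]
We calculate the derivative of $\mathbb{E}_{\mathbf{w}}[\ln p]$ with respect to $\alpha_i$ and set it to 0:
\[ \frac{\partial \mathbb{E}_{\mathbf{w}}[\ln p]}{\partial \alpha_i} = \frac{1}{2} \cdot \frac{2\pi}{\alpha_i} \cdot \frac{1}{2\pi} - \frac{1}{2} \mathbb{E}_{\mathbf{w}}[w_i^2] = 0 \]
Rearranging it, we can obtain:
\[ \alpha_i = \frac{1}{\mathbb{E}_{\mathbf{w}}[w_i^2]} = \frac{1}{\mathbb{E}_{\mathbf{w}}[\mathbf{w}\mathbf{w}^T]_{(i,i)}} \]
Here the subscript $(i, i)$ represents the entry in the $i$-th row and $i$-th column of the matrix $\mathbb{E}_{\mathbf{w}}[\mathbf{w}\mathbf{w}^T]$. So now, we are required to calculate the expectation. To be more clear, this expectation is with respect to the posterior defined by Eq (7.81):
\[ p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(\mathbf{m}, \Sigma) \]
Here we use Eq (377) described in 'the Matrix Cookbook'.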
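We restate it here: if $\mathbf{w} \sim \mathcal{N}(\mathbf{m}, \Sigma)$, we have:
\[ \mathbb{E}[\mathbf{w}\mathbf{w}^T] = \Sigma + \mathbf{m}\mathbf{m}^T \]
According to this equation, we can obtain:
\[ \alpha_i = \frac{1}{\mathbb{E}_{\mathbf{w}}[\mathbf{w}\mathbf{w}^T]_{(i,i)}} = \frac{1}{(\Sigma + \mathbf{m}\mathbf{m}^T)_{(i,i)}} = \frac{1}{\Sigma_{ii} + m_i^2} \]
Now we calculate the derivative of $\mathbb{E}_{\mathbf{w}}[\ln p]$ with respect to $\beta$ and set it to 0:
\[ \frac{\partial \mathbb{E}_{\mathbf{w}}[\ln p]}{\partial \beta} = \frac{N}{2} \cdot \frac{2\pi}{\beta} \cdot \frac{1}{2\pi} - \frac{1}{2} \sum_{n=1}^{N} \mathbb{E}_{\mathbf{w}}[(t_n - \mathbf{w}^T\phi_n)^2] = 0 \]
Rearranging it, we obtain:
\[ \beta^{(new)} = \frac{N}{\sum_{n=1}^{N} \mathbb{E}_{\mathbf{w}}[(t_n - \mathbf{w}^T\phi_n)^2]} \]
Therefore, we are required to calculate the expectation. By analogy to the deduction in Prob.9.21, we can obtain:
\[ \frac{1}{\beta^{(new)}} = \frac{1}{N} \sum_{n=1}^{N} \Big\{ (t_n - \mathbf{m}^T\phi_n)^2 + \text{Tr}[\phi_n\phi_n^T\Sigma] \Big\} = \frac{1}{N} \Big\{ \|\mathbf{t} - \Phi\mathbf{m}\|^2 + \text{Tr}[\Phi^T\Phi\Sigma] \Big\} \]
To make it consistent with Eq (9.68), let's first prove a statement:
\[ (\beta^{-1}\mathbf{A} + \Phi^T\Phi)\,\Sigma = \beta^{-1}\mathbf{I} \]
This can be easily shown by substituting $\Sigma$, i.e., Eq (7.83), back into the expression:
\[ (\beta^{-1}\mathbf{A} + \Phi^T\Phi)\,\Sigma = (\beta^{-1}\mathbf{A} + \Phi^T\Phi)\,(\mathbf{A} + \beta\Phi^T\Phi)^{-1} = \beta^{-1}\mathbf{I} \]
Now we start from this statement and rearrange it, which gives:
\[ \Phi^T\Phi\Sigma = \beta^{-1}\mathbf{I} - \beta^{-1}\mathbf{A}\Sigma = \beta^{-1}(\mathbf{I} - \mathbf{A}\Sigma) \]
Substituting back into the expression for $\beta^{(new)}$:
\[
\begin{aligned}
\frac{1}{\beta^{(new)}} &= \frac{1}{N} \Big\{ \|\mathbf{t} - \Phi\mathbf{m}\|^2 + \text{Tr}[\Phi^T\Phi\Sigma] \Big\}
= \frac{1}{N} \Big\{ \|\mathbf{t} - \Phi\mathbf{m}\|^2 + \text{Tr}[\beta^{-1}(\mathbf{I} - \mathbf{A}\Sigma)] \Big\} \\
&= \frac{1}{N} \Big\{ \|\mathbf{t} - \Phi\mathbf{m}\|^2 + \beta^{-1}\,\text{Tr}[\mathbf{I} - \mathbf{A}\Sigma] \Big\}
= \frac{1}{N} \Big\{ \|\mathbf{t} - \Phi\mathbf{m}\|^2 + \beta^{-1} \sum_i (1 - \alpha_i\Sigma_{ii}) \Big\} \\
&= \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + \beta^{-1} \sum_i \gamma_i}{N}
\end{aligned}
\]
Here we have defined $\gamma_i = 1 - \alpha_i\Sigma_{ii}$ as in Eq (7.89). Note that there is a typo in Eq (9.68): $\mathbf{m}_N$ should be $\mathbf{m}$.

Problem 9.23 Solution

Some clarifications must be made here. Eq (7.87)-(7.88) only give the same stationary points, i.e., the same $\alpha^\star$ and $\beta^\star$, as those given by Eq (9.67)-(9.68). However, the hyper-parameters estimated at some specific iteration may not be the same for those two different methods. When convergence is reached, Eq (7.87) can be written as:
\[ \alpha^\star = \frac{1 - \alpha^\star\Sigma_{ii}}{m_i^2} \]
Rearranging it, we can obtain:
\[ \alpha^\star = \frac{1}{m_i^2 + \Sigma_{ii}} \]
This is identical to Eq (9.67). When convergence is reached, Eq (9.68) can be written as:
\[ (\beta^\star)^{-1} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + (\beta^\star)^{-1} \sum_i \gamma_i}{N} \]
Rearranging it, we can obtain:
\[ (\beta^\star)^{-1} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2}{N - \sum_i \gamma_i} \]
This is identical to Eq (7.88).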
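The matrix identity used in Prob.9.22 to pass from the trace form of 1/beta to the gamma form can be sanity-checked numerically. The sketch below is only an illustration under assumed inputs: the design matrix `Phi`, targets `t`, and the hyper-parameter values `alpha` and `beta` are random or made up, and the posterior mean is taken as m = beta * Sigma * Phi^T t with Sigma = (A + beta * Phi^T Phi)^{-1}, which is how Eq (7.82)-(7.83) are assumed to read here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 4
Phi = rng.normal(size=(N, M))            # hypothetical design matrix, rows are phi_n^T
t = rng.normal(size=N)                   # hypothetical targets
alpha = rng.uniform(0.5, 2.0, size=M)    # one precision per weight
beta = 1.7                               # noise precision
A = np.diag(alpha)

# Posterior covariance and mean of w (assumed forms of Eq (7.83) and Eq (7.82)).
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ t

# Identity: Phi^T Phi Sigma = beta^{-1} (I - A Sigma).
lhs = Phi.T @ Phi @ Sigma
rhs = (np.eye(M) - A @ Sigma) / beta

# Two equivalent expressions for 1 / beta_new.
gamma = 1.0 - alpha * np.diag(Sigma)     # gamma_i = 1 - alpha_i Sigma_ii
resid = np.sum((t - Phi @ m) ** 2)       # ||t - Phi m||^2
inv_beta_trace = (resid + np.trace(lhs)) / N
inv_beta_gamma = (resid + gamma.sum() / beta) / N

print(np.allclose(lhs, rhs), np.isclose(inv_beta_trace, inv_beta_gamma))
```

The check holds for any positive `alpha` and `beta`, since it relies only on Sigma being the inverse of A + beta * Phi^T Phi; it does not require the hyper-parameters to be at their converged values.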
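Problem 9.24 Solution

We substitute Eq (9.71) and Eq (9.72) into Eq (9.70):
\[
\begin{aligned}
L(q, \theta) + \text{KL}(q\|p) &= \sum_{\mathbf{Z}} q(\mathbf{Z}) \Big\{ \ln \frac{p(\mathbf{X}, \mathbf{Z}|\theta)}{q(\mathbf{Z})} - \ln \frac{p(\mathbf{Z}|\mathbf{X}, \theta)}{q(\mathbf{Z})} \Big\} \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \Big\{ \frac{p(\mathbf{X}, \mathbf{Z}|\theta)}{p(\mathbf{Z}|\mathbf{X}, \theta)} \Big\}
= \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X}|\theta) = \ln p(\mathbf{X}|\theta)
\end{aligned}
\]
Note that in the last step, we have used the fact that $\ln p(\mathbf{X}|\theta)$ doesn't depend on $\mathbf{Z}$, and that the summation of $q(\mathbf{Z})$ over $\mathbf{Z}$ equals 1 because $q(\mathbf{Z})$ is a PDF.

Problem 9.25 Solution

We calculate the derivative of Eq (9.71) with respect to $\theta$, given $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \theta^{(old)})$:
\[
\begin{aligned}
\frac{\partial L(q, \theta)}{\partial \theta} &= \frac{\partial}{\partial \theta} \Big\{ \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \ln \frac{p(\mathbf{X}, \mathbf{Z}|\theta)}{p(\mathbf{Z}|\mathbf{X}, \theta^{(old)})} \Big\} \\
&= \frac{\partial}{\partial \theta} \Big\{ \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) - \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \ln p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \Big\} \\
&= \frac{\partial}{\partial \theta} \Big\{ \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) \Big\}
= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \frac{\partial \ln p(\mathbf{X}, \mathbf{Z}|\theta)}{\partial \theta} \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \frac{1}{p(\mathbf{X}, \mathbf{Z}|\theta)} \frac{\partial p(\mathbf{X}, \mathbf{Z}|\theta)}{\partial \theta} \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \frac{1}{p(\mathbf{X}, \mathbf{Z}|\theta)} \frac{\partial \left[ p(\mathbf{X}|\theta) \cdot p(\mathbf{Z}|\mathbf{X}, \theta) \right]}{\partial \theta} \\
&= \sum_{\mathbf{Z}} \frac{p(\mathbf{Z}|\mathbf{X}, \theta^{(old)})}{p(\mathbf{X}, \mathbf{Z}|\theta)} \Big[ p(\mathbf{X}|\theta) \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta} + p(\mathbf{Z}|\mathbf{X}, \theta) \frac{\partial p(\mathbf{X}|\theta)}{\partial \theta} \Big]
\end{aligned}
\]
We evaluate this derivative at $\theta = \theta^{(old)}$:
\[
\begin{aligned}
\frac{\partial L(q, \theta)}{\partial \theta} \Big|_{\theta^{(old)}}
&= \sum_{\mathbf{Z}} \frac{p(\mathbf{Z}|\mathbf{X}, \theta^{(old)})}{p(\mathbf{X}, \mathbf{Z}|\theta^{(old)})} \Big[ p(\mathbf{X}|\theta^{(old)}) \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta} \Big|_{\theta^{(old)}} + p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \frac{\partial p(\mathbf{X}|\theta)}{\partial \theta} \Big|_{\theta^{(old)}} \Big] \\
&= \sum_{\mathbf{Z}} \frac{1}{p(\mathbf{X}|\theta^{(old)})} \Big[ p(\mathbf{X}|\theta^{(old)}) \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta} \Big|_{\theta^{(old)}} + p(\mathbf{Z}|\mathbf{X}, \theta^{(old)}) \frac{\partial p(\mathbf{X}|\theta)}{\partial \theta} \Big|_{\theta^{(old)}} \Big] \\
&= \sum_{\mathbf{Z}} \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta} \Big|_{\theta^{(old)}} + \sum_{\mathbf{Z}} \frac{p(\mathbf{Z}|\mathbf{X}, \theta^{(old)})}{p(\mathbf{X}|\theta^{(old)})} \cdot \frac{\partial p(\mathbf{X}|\theta)}{\partial \theta} \Big|_{\theta^{(old)}} \\
&= \sum_{\mathbf{Z}} \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta} \Big|_{\theta^{(old)}} + \frac{1}{p(\mathbf{X}|\theta^{(old)})} \cdot \frac{\partial p(\mathbf{X}|\theta)}{\partial \theta} \Big|_{\theta^{(old)}} \\
&= \Big\{ \frac{\partial}{\partial \theta} \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta) \Big\} \Big|_{\theta^{(old)}} + \frac{\partial \ln p(\mathbf{X}|\theta)}{\partial \theta} \Big|_{\theta^{(old)}} \\
&= \frac{\partial 1}{\partial \theta} \Big|_{\theta^{(old)}} + \frac{\partial \ln p(\mathbf{X}|\theta)}{\partial \theta} \Big|_{\theta^{(old)}}
= \frac{\partial \ln p(\mathbf{X}|\theta)}{\partial \theta} \Big|_{\theta^{(old)}}
\end{aligned}
\]
This problem is much easier to prove if we view it from the perspective of KL divergence.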
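Note that when $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \theta^{(old)})$, the KL divergence vanishes at $\theta = \theta^{(old)}$, and that in general the KL divergence is greater than or equal to zero. Therefore, $\theta^{(old)}$ is a minimum of the KL divergence, and we must have:
\[ \frac{\partial \text{KL}(q\|p)}{\partial \theta} \Big|_{\theta^{(old)}} = 0 \]
Otherwise, there would exist a point $\theta$ in the neighborhood of $\theta^{(old)}$ at which the KL divergence is less than 0. Then using Eq (9.70), the result is trivial to prove.

Problem 9.26 Solution

From Eq (9.18), we have:
\[ N_k^{old} = \sum_n \gamma^{old}(z_{nk}) \]
If now we just re-evaluate the responsibilities for one data point $\mathbf{x}_m$, we can obtain:
\[
\begin{aligned}
N_k^{new} &= \sum_{n \neq m} \gamma^{old}(z_{nk}) + \gamma^{new}(z_{mk})
= \sum_n \gamma^{old}(z_{nk}) + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \\
&= N_k^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})
\end{aligned}
\]
Similarly, according to Eq (9.17), we can obtain:
\[
\begin{aligned}
\mu_k^{new} &= \frac{1}{N_k^{new}} \sum_{n \neq m} \gamma^{old}(z_{nk})\,\mathbf{x}_n + \frac{\gamma^{new}(z_{mk})\,\mathbf{x}_m}{N_k^{new}} \\
&= \frac{1}{N_k^{new}} \sum_n \gamma^{old}(z_{nk})\,\mathbf{x}_n + \frac{\gamma^{new}(z_{mk})\,\mathbf{x}_m}{N_k^{new}} - \frac{\gamma^{old}(z_{mk})\,\mathbf{x}_m}{N_k^{new}} \\
&= \frac{N_k^{old}}{N_k^{new}} \cdot \frac{1}{N_k^{old}} \sum_n \gamma^{old}(z_{nk})\,\mathbf{x}_n + \left[ \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \right] \frac{\mathbf{x}_m}{N_k^{new}} \\
&= \frac{N_k^{old}}{N_k^{new}}\,\mu_k^{old} + \left[ \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \right] \frac{\mathbf{x}_m}{N_k^{new}} \\
&= \mu_k^{old} - \frac{N_k^{new} - N_k^{old}}{N_k^{new}}\,\mu_k^{old} + \left[ \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \right] \frac{\mathbf{x}_m}{N_k^{new}} \\
&= \mu_k^{old} - \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_k^{new}}\,\mu_k^{old} + \left[ \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \right] \frac{\mathbf{x}_m}{N_k^{new}} \\
&= \mu_k^{old} + \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_k^{new}} \cdot \left( \mathbf{x}_m - \mu_k^{old} \right)
\end{aligned}
\]
Just as required.

Problem 9.27 Solution

By analogy to the previous problem, we use Eq (9.24)-Eq (9.27), beginning by deriving an update formula for the mixing coefficients $\pi_k$:
\[ \pi_k^{new} = \frac{N_k^{new}}{N} = \frac{1}{N} \Big\{ N_k^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \Big\} = \pi_k^{old} + \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N} \]
Here we have used the conclusion (the update formula for $N_k^{new}$) of the previous problem. Next we deal with the covariance matrix $\Sigma$.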
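By analogy to the previous problem, we can obtain:
\[
\begin{aligned}
\Sigma_k^{new} &= \frac{1}{N_k^{new}} \sum_{n \neq m} \gamma^{old}(z_{nk})\,(\mathbf{x}_n - \mu_k^{new})(\mathbf{x}_n - \mu_k^{new})^T + \frac{1}{N_k^{new}}\,\gamma^{new}(z_{mk})\,(\mathbf{x}_m - \mu_k^{new})(\mathbf{x}_m - \mu_k^{new})^T \\
&\approx \frac{1}{N_k^{new}} \sum_{n \neq m} \gamma^{old}(z_{nk})\,(\mathbf{x}_n - \mu_k^{old})(\mathbf{x}_n - \mu_k^{old})^T + \frac{1}{N_k^{new}}\,\gamma^{new}(z_{mk})\,(\mathbf{x}_m - \mu_k^{new})(\mathbf{x}_m - \mu_k^{new})^T \\
&= \frac{1}{N_k^{new}} \sum_n \gamma^{old}(z_{nk})\,(\mathbf{x}_n - \mu_k^{old})(\mathbf{x}_n - \mu_k^{old})^T + \frac{1}{N_k^{new}}\,\gamma^{new}(z_{mk})\,(\mathbf{x}_m - \mu_k^{new})(\mathbf{x}_m - \mu_k^{new})^T \\
&\quad - \frac{1}{N_k^{new}}\,\gamma^{old}(z_{mk})\,(\mathbf{x}_m - \mu_k^{old})(\mathbf{x}_m - \mu_k^{old})^T \\
&= \frac{N_k^{old}}{N_k^{new}}\,\Sigma_k^{old} + \frac{\gamma^{new}(z_{mk})}{N_k^{new}}\,(\mathbf{x}_m - \mu_k^{new})(\mathbf{x}_m - \mu_k^{new})^T - \frac{\gamma^{old}(z_{mk})}{N_k^{new}}\,(\mathbf{x}_m - \mu_k^{old})(\mathbf{x}_m - \mu_k^{old})^T \\
&= \Big\{ 1 + \frac{N_k^{old} - N_k^{new}}{N_k^{new}} \Big\} \Sigma_k^{old} + \frac{\gamma^{new}(z_{mk})}{N_k^{new}}\,(\mathbf{x}_m - \mu_k^{new})(\mathbf{x}_m - \mu_k^{new})^T - \frac{\gamma^{old}(z_{mk})}{N_k^{new}}\,(\mathbf{x}_m - \mu_k^{old})(\mathbf{x}_m - \mu_k^{old})^T \\
&= \Big\{ 1 + \frac{\gamma^{old}(z_{mk}) - \gamma^{new}(z_{mk})}{N_k^{new}} \Big\} \Sigma_k^{old} + \frac{\gamma^{new}(z_{mk})}{N_k^{new}}\,(\mathbf{x}_m - \mu_k^{new})(\mathbf{x}_m - \mu_k^{new})^T - \frac{\gamma^{old}(z_{mk})}{N_k^{new}}\,(\mathbf{x}_m - \mu_k^{old})(\mathbf{x}_m - \mu_k^{old})^T \\
&= \Sigma_k^{old} + \frac{\gamma^{new}(z_{mk})}{N_k^{new}} \Big\{ (\mathbf{x}_m - \mu_k^{new})(\mathbf{x}_m - \mu_k^{new})^T - \Sigma_k^{old} \Big\} - \frac{\gamma^{old}(z_{mk})}{N_k^{new}} \Big\{ (\mathbf{x}_m - \mu_k^{old})(\mathbf{x}_m - \mu_k^{old})^T - \Sigma_k^{old} \Big\}
\end{aligned}
\]
One important thing worth mentioning is that in the second step, there is an approximately-equal sign. Note that in the previous problem, we have shown that if we only recompute the responsibilities for the data point $\mathbf{x}_m$, every center $\mu_k$ will also change from $\mu_k^{old}$ to $\mu_k^{new}$, and the update formula is given by Eq (9.78). However, for the convenience of computing, we have made an approximation here. Other approximation methods can also be applied. For instance, you can replace $\mu_k^{new}$ with $\mu_k^{old}$ wherever it occurs.

The complete solution should be given by substituting Eq (9.78) into the right side of the first equal sign and then rearranging it, in order to construct a relation between $\Sigma_k^{new}$ and $\Sigma_k^{old}$. However, this is too complicated.

0.10 Variational Inference

Problem 10.1 Solution

This problem is very similar to Prob.9.24.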
We substitute Eq (10.3) and Eq (10.4) into Eq (10.2):
\[
\begin{aligned}
L(q) + \text{KL}(q\|p) &= \int_{\mathbf{Z}} q(\mathbf{Z}) \Big\{ \ln \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})} - \ln \frac{p(\mathbf{Z}|\mathbf{X})}{q(\mathbf{Z})} \Big\}\,d\mathbf{Z} \\
&= \int_{\mathbf{Z}} q(\mathbf{Z}) \ln \Big\{ \frac{p(\mathbf{X}, \mathbf{Z})}{p(\mathbf{Z}|\mathbf{X})} \Big\}\,d\mathbf{Z}
= \int_{\mathbf{Z}} q(\mathbf{Z}) \ln p(\mathbf{X})\,d\mathbf{Z} = \ln p(\mathbf{X})
\end{aligned}
\]
Note that in the last step, we have used the fact that $\ln p(\mathbf{X})$ doesn't depend on $\mathbf{Z}$, and that the integral of $q(\mathbf{Z})$ over $\mathbf{Z}$ equals 1 because $q(\mathbf{Z})$ is a PDF.

Problem 10.2 Solution

To be more clear, we are required to solve:
\[
\begin{cases}
m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\,(m_2 - \mu_2) \\
m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\,(m_1 - \mu_1)
\end{cases}
\]
To obtain the equations above, we need to substitute $\mathbb{E}[z_i] = m_i$, where $i = 1, 2$, into Eq (10.13) and Eq (10.14). Here the unknown parameters are $m_1$ and $m_2$. It is trivial to notice that $m_i = \mu_i$ is a solution of the equations above.

Let's solve these equations from another perspective. Firstly, if the coupling term $\Lambda_{12} = \Lambda_{21}$ equals 0, we obtain $m_i = \mu_i$ directly from Eq (10.13)-(10.14). Otherwise, we substitute $m_1$, i.e., the first line, into the second line:
\[
\begin{aligned}
m_2 &= \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\,(m_1 - \mu_1) \\
&= \mu_2 - \Lambda_{22}^{-1}\Lambda_{21} \left[ \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\,(m_2 - \mu_2) - \mu_1 \right] \\
&= \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\,\mu_1 + \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}\,(m_2 - \mu_2) + \Lambda_{22}^{-1}\Lambda_{21}\,\mu_1 \\
&= (1 - \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12})\,\mu_2 + \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}\,m_2
\end{aligned}
\]
We rearrange the expression above, yielding:
\[ (1 - \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12})\,(m_2 - \mu_2) = 0 \]
The first factor on the left hand side equals 0 only when the distribution is singular, i.e., when the determinant of the precision matrix $\Lambda$ (i.e., $\Lambda_{11}\Lambda_{22} - \Lambda_{12}\Lambda_{21}$) is 0. Therefore, if the distribution is nonsingular, we must have $m_2 = \mu_2$. Substituting it back into the first line, we obtain $m_1 = \mu_1$.

Problem 10.3 Solution

Let's start from the definition of KL divergence given in Eq (10.16).
\[
\begin{aligned}
\mathrm{KL}(p\|q) &= -\int p(\mathbf{Z}) \left[\sum_{i=1}^{M} \ln q_i(\mathbf{Z}_i)\right] d\mathbf{Z} + \text{const} \\
&= -\int p(\mathbf{Z}) \left[\ln q_j(\mathbf{Z}_j) + \sum_{i\neq j} \ln q_i(\mathbf{Z}_i)\right] d\mathbf{Z} + \text{const} \\
&= -\int p(\mathbf{Z}) \ln q_j(\mathbf{Z}_j)\, d\mathbf{Z} + \text{const} \\
&= -\int \left[\int p(\mathbf{Z}) \prod_{i\neq j} d\mathbf{Z}_i\right] \ln q_j(\mathbf{Z}_j)\, d\mathbf{Z}_j + \text{const} \\
&= -\int p(\mathbf{Z}_j) \ln q_j(\mathbf{Z}_j)\, d\mathbf{Z}_j + \text{const}
\end{aligned}
\]

Note that in the third step, since all the factors $q_i(\mathbf{Z}_i)$ with $i \neq j$ are held fixed, they can be absorbed into the constant. In the last step, we have denoted the marginal distribution:

\[
p(\mathbf{Z}_j) = \int p(\mathbf{Z}) \prod_{i\neq j} d\mathbf{Z}_i
\]

We introduce a Lagrange multiplier to enforce that $q_j(\mathbf{Z}_j)$ integrates to 1:

\[
L = -\int p(\mathbf{Z}_j) \ln q_j(\mathbf{Z}_j)\, d\mathbf{Z}_j + \lambda\left(\int q_j(\mathbf{Z}_j)\, d\mathbf{Z}_j - 1\right)
\]

Using the functional derivative (for more details, you can refer to Appendix D or Prob.1.34), we calculate the functional derivative of $L$ with respect to $q_j(\mathbf{Z}_j)$ and set it to 0:

\[
-\frac{p(\mathbf{Z}_j)}{q_j(\mathbf{Z}_j)} + \lambda = 0
\]

Rearranging it, we can obtain:

\[
\lambda\, q_j(\mathbf{Z}_j) = p(\mathbf{Z}_j)
\]

Integrating both sides with respect to $\mathbf{Z}_j$, we see that $\lambda = 1$. Substituting it back into the derivative, we obtain the optimal $q_j(\mathbf{Z}_j)$:

\[
q_j^{\star}(\mathbf{Z}_j) = p(\mathbf{Z}_j)
\]

Strictly speaking, we should also enforce $q_j(\mathbf{Z}_j) \geq 0$ as a constraint. However, the closed-form solution obtained by enforcing only the normalization constraint is already nonnegative at all $\mathbf{Z}_j$, because $p(\mathbf{Z}_j)$ is a PDF. Therefore, there is no need to introduce this inequality constraint into the Lagrangian.

Problem 10.4 Solution

We begin by writing down the KL divergence.
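\[
\begin{aligned}
\mathrm{KL}(p\|q) &= -\int p(\mathbf{x}) \ln\left\{\frac{q(\mathbf{x})}{p(\mathbf{x})}\right\} d\mathbf{x} \\
&= -\int p(\mathbf{x}) \ln q(\mathbf{x})\, d\mathbf{x} + \text{const} \\
&= -\int p(\mathbf{x}) \left[-\frac{D}{2}\ln 2\pi - \frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] d\mathbf{x} + \text{const} \\
&= \int p(\mathbf{x}) \left[\frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right] d\mathbf{x} + \text{const} \\
&= \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\int p(\mathbf{x})\,(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\, d\mathbf{x} + \text{const} \\
&= \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\int p(\mathbf{x})\left[\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right] d\mathbf{x} + \text{const} \\
&= \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\int p(\mathbf{x})\,\mathrm{Tr}[\boldsymbol{\Sigma}^{-1}(\mathbf{x}\mathbf{x}^T)]\, d\mathbf{x} - \boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\mathbb{E}[\mathbf{x}] + \frac{1}{2}\boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \text{const} \\
&= \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\mathrm{Tr}[\boldsymbol{\Sigma}^{-1}\mathbb{E}[\mathbf{x}\mathbf{x}^T]] - \boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\mathbb{E}[\mathbf{x}] + \frac{1}{2}\boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \text{const}
\end{aligned}
\]

Here $D$ is the dimension of $\mathbf{x}$. We first calculate the derivative of $\mathrm{KL}(p\|q)$ with respect to $\boldsymbol{\mu}$ and set it to 0:

\[
\frac{\partial \mathrm{KL}}{\partial \boldsymbol{\mu}} = -\boldsymbol{\Sigma}^{-1}\mathbb{E}[\mathbf{x}] + \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} = 0
\]

Therefore, we obtain $\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}]$. When $\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}]$ is satisfied, the KL divergence reduces to:

\[
\mathrm{KL}(p\|q) = \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\mathrm{Tr}[\boldsymbol{\Sigma}^{-1}\mathbb{E}[\mathbf{x}\mathbf{x}^T]] - \frac{1}{2}\boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \text{const}
\]

Then we calculate the derivative of $\mathrm{KL}(p\|q)$ with respect to $\boldsymbol{\Sigma}$ and set it to 0:

\[
\frac{\partial \mathrm{KL}}{\partial \boldsymbol{\Sigma}} = \frac{1}{2}\boldsymbol{\Sigma}^{-1} - \frac{1}{2}\boldsymbol{\Sigma}^{-1}\mathbb{E}[\mathbf{x}\mathbf{x}^T]\boldsymbol{\Sigma}^{-1} + \frac{1}{2}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1} = 0
\]

Note that here we have used Eq (61) and Eq (124) in the 'Matrix Cookbook', the standard result $\partial \ln|\mathbf{X}| / \partial \mathbf{X} = \mathbf{X}^{-T}$, and the fact that $\boldsymbol{\Sigma}$ and $\mathbb{E}[\mathbf{x}\mathbf{x}^T]$ are both symmetric. We rewrite the two cited equations here for your reference:

\[
\frac{\partial\, \mathbf{a}^T\mathbf{X}^{-1}\mathbf{b}}{\partial \mathbf{X}} = -\mathbf{X}^{-T}\mathbf{a}\mathbf{b}^T\mathbf{X}^{-T}
\quad\text{and}\quad
\frac{\partial\, \mathrm{Tr}(\mathbf{A}\mathbf{X}^{-1}\mathbf{B})}{\partial \mathbf{X}} = -\mathbf{X}^{-T}\mathbf{A}^T\mathbf{B}^T\mathbf{X}^{-T}
\]

Rearranging the derivative, we can obtain:

\[
\boldsymbol{\Sigma} = \mathbb{E}[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T = \mathbb{E}[\mathbf{x}\mathbf{x}^T] - \mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{x}]^T = \mathrm{cov}[\mathbf{x}]
\]

Problem 10.5 Solution

We introduce a property of the Dirac delta function:

\[
\int \delta(\boldsymbol{\theta} - \boldsymbol{\theta}_0)\, f(\boldsymbol{\theta})\, d\boldsymbol{\theta} = f(\boldsymbol{\theta}_0)
\]

We first calculate the optimal $q_z(\mathbf{z})$ by fixing $q_\theta(\boldsymbol{\theta})$.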
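This is achieved by minimizing the KL divergence given in Eq (10.4), where $\mathbf{Z} = \{\mathbf{z}, \boldsymbol{\theta}\}$ and $q_\theta(\boldsymbol{\theta}) = \delta(\boldsymbol{\theta} - \boldsymbol{\theta}_0)$:

\[
\begin{aligned}
\mathrm{KL}(q\|p) &= -\int\!\!\int q(\mathbf{Z}) \ln\left\{\frac{p(\mathbf{Z}|\mathbf{X})}{q(\mathbf{Z})}\right\} d\mathbf{Z} \\
&= -\int\!\!\int q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta}) \ln\left\{\frac{p(\mathbf{z},\boldsymbol{\theta}|\mathbf{X})}{q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta})}\right\} d\mathbf{z}\, d\boldsymbol{\theta} \\
&= -\int\!\!\int q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta}) \ln\left\{\frac{p(\mathbf{z},\boldsymbol{\theta}|\mathbf{X})}{q_z(\mathbf{z})}\right\} d\mathbf{z}\, d\boldsymbol{\theta} + \int q_\theta(\boldsymbol{\theta}) \ln q_\theta(\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= -\int\!\!\int q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta}) \ln\left\{\frac{p(\mathbf{z},\boldsymbol{\theta}|\mathbf{X})}{q_z(\mathbf{z})}\right\} d\mathbf{z}\, d\boldsymbol{\theta} + \text{const} \\
&= -\int q_\theta(\boldsymbol{\theta}) \left\{\int q_z(\mathbf{z}) \ln\left\{\frac{p(\mathbf{z},\boldsymbol{\theta}|\mathbf{X})}{q_z(\mathbf{z})}\right\} d\mathbf{z}\right\} d\boldsymbol{\theta} + \text{const} \\
&= -\int q_z(\mathbf{z}) \ln\left\{\frac{p(\mathbf{z},\boldsymbol{\theta}_0|\mathbf{X})}{q_z(\mathbf{z})}\right\} d\mathbf{z} + \text{const} \\
&= -\int q_z(\mathbf{z}) \ln\left\{\frac{p(\mathbf{z}|\boldsymbol{\theta}_0,\mathbf{X})\, p(\boldsymbol{\theta}_0|\mathbf{X})}{q_z(\mathbf{z})}\right\} d\mathbf{z} + \text{const} \\
&= -\int q_z(\mathbf{z}) \ln\left\{\frac{p(\mathbf{z}|\boldsymbol{\theta}_0,\mathbf{X})}{q_z(\mathbf{z})}\right\} d\mathbf{z} + \text{const}
\end{aligned}
\]

Here 'const' denotes the terms independent of $q_z(\mathbf{z})$. As discussed at the end of this problem, 'const' is actually divergent, because it contains the negative entropy of a Dirac delta function:

\[
\int q_\theta(\boldsymbol{\theta}) \ln q_\theta(\boldsymbol{\theta})\, d\boldsymbol{\theta}
\]

Now it is clear that the KL divergence is minimized when $q_z(\mathbf{z})$ equals $p(\mathbf{z}|\boldsymbol{\theta}_0, \mathbf{X})$. This corresponds to the E-step. Next, we calculate the optimal $q_\theta(\boldsymbol{\theta})$, i.e., $\boldsymbol{\theta}_0$, by maximizing $\mathcal{L}(q)$ given in Eq (10.3) while fixing $q_z(\mathbf{z})$:

\[
\begin{aligned}
\mathcal{L}(q) &= \int\!\!\int q(\mathbf{Z}) \ln\left\{\frac{p(\mathbf{X},\mathbf{Z})}{q(\mathbf{Z})}\right\} d\mathbf{Z} \\
&= \int\!\!\int q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta}) \ln\left\{\frac{p(\mathbf{X},\mathbf{z},\boldsymbol{\theta})}{q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta})}\right\} d\mathbf{z}\, d\boldsymbol{\theta} \\
&= \int\!\!\int q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta}) \ln\left\{\frac{p(\mathbf{X},\mathbf{z},\boldsymbol{\theta})}{q_z(\mathbf{z})}\right\} d\mathbf{z}\, d\boldsymbol{\theta} - \int q_\theta(\boldsymbol{\theta}) \ln q_\theta(\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= \int\!\!\int q_z(\mathbf{z})\, q_\theta(\boldsymbol{\theta}) \ln p(\mathbf{X},\mathbf{z},\boldsymbol{\theta})\, d\mathbf{z}\, d\boldsymbol{\theta} - \int q_\theta(\boldsymbol{\theta}) \ln q_\theta(\boldsymbol{\theta})\, d\boldsymbol{\theta} + \text{const} \\
&= \int q_\theta(\boldsymbol{\theta})\, \mathbb{E}_{q_z}[\ln p(\mathbf{X},\mathbf{z},\boldsymbol{\theta})]\, d\boldsymbol{\theta} - \int q_\theta(\boldsymbol{\theta}) \ln q_\theta(\boldsymbol{\theta})\, d\boldsymbol{\theta} + \text{const} \\
&= \mathbb{E}_{q_z(\mathbf{z})}[\ln p(\mathbf{X},\mathbf{z},\boldsymbol{\theta}_0)] - \int q_\theta(\boldsymbol{\theta}) \ln q_\theta(\boldsymbol{\theta})\, d\boldsymbol{\theta} + \text{const}
\end{aligned}
\]

The second term is the entropy of a Dirac delta function, which is $-\infty$ and independent of the value of $\boldsymbol{\theta}_0$. Loosely speaking, we therefore only need to maximize the first term. This is exactly the M-step.

One important thing needs to be clarified here. You may object that no matter how we set $\boldsymbol{\theta}_0$, $\mathcal{L}(q)$ will always be $-\infty$. This is an intrinsic problem whenever we use a point estimate for $q_\theta(\boldsymbol{\theta})$. The same divergence already occurs when we derive the optimal $q_z(\mathbf{z})$ by minimizing the KL divergence in the first step.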
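Therefore, 'maximizing' and 'minimizing' are meant in a generalized sense in this problem, where we neglect the divergent entropy term.

Problem 10.6 Solution

Let's use the hint by first letting $\alpha \to 1$. For brevity we write $\epsilon = (1-\alpha)/2$, so that $\epsilon \to 0$, and we use $p^{\epsilon} = \exp(\epsilon \ln p) = 1 + \epsilon \ln p + O(\epsilon^2)$:

\[
\begin{aligned}
D_\alpha(p\|q) &= \frac{4}{1-\alpha^2}\left(1 - \int p^{(1+\alpha)/2}\, q^{(1-\alpha)/2}\, dx\right) \\
&= \frac{4}{1-\alpha^2}\left\{1 - \int \frac{p}{p^{(1-\alpha)/2}}\left[1 + \epsilon \ln q + O(\epsilon^2)\right] dx\right\} \\
&= \frac{4}{1-\alpha^2}\left\{1 - \int p \cdot \frac{1 + \epsilon \ln q + O(\epsilon^2)}{1 + \epsilon \ln p + O(\epsilon^2)}\, dx\right\} \\
&\approx \frac{4}{1-\alpha^2}\left\{1 - \int p \cdot \frac{1 + \epsilon \ln q}{1 + \epsilon \ln p}\, dx\right\} \\
&= -\frac{4}{1-\alpha^2}\left\{\int p \cdot \left[\frac{1 + \epsilon \ln q}{1 + \epsilon \ln p} - 1\right] dx\right\} \\
&= -\frac{4}{1-\alpha^2}\left\{\int p \cdot \frac{\epsilon \ln q - \epsilon \ln p}{1 + \epsilon \ln p}\, dx\right\} \\
&= -\frac{2}{1+\alpha}\left\{\int p \cdot \frac{\ln q - \ln p}{1 + \epsilon \ln p}\, dx\right\} \\
&\approx -\int p \cdot (\ln q - \ln p)\, dx = -\int p \cdot \ln\frac{q}{p}\, dx
\end{aligned}
\]

Here $p$ and $q$ are short for $p(x)$ and $q(x)$, and the fifth line uses $\int p\, dx = 1$. The case $\alpha \to -1$ is similar. One important thing worth mentioning is that if we directly approximate $p^{(1+\alpha)/2}$ by $p$ instead of $p / p^{(1-\alpha)/2}$ in the first step, we won't get the desired result.

Problem 10.7 Solution

Let's begin from Eq (10.25).

\[
\begin{aligned}
\ln q_\mu^{\star}(\mu) &= -\frac{\mathbb{E}[\tau]}{2}\left\{\lambda_0(\mu-\mu_0)^2 + \sum_{n=1}^{N}(x_n-\mu)^2\right\} + \text{const} \\
&= -\frac{\mathbb{E}[\tau]}{2}\left\{\lambda_0\mu^2 - 2\lambda_0\mu_0\mu + \lambda_0\mu_0^2 + N\mu^2 - 2\Big(\sum_{n=1}^{N}x_n\Big)\mu + \sum_{n=1}^{N}x_n^2\right\} + \text{const} \\
&= -\frac{\mathbb{E}[\tau]}{2}\left\{(\lambda_0+N)\mu^2 - 2\Big(\lambda_0\mu_0 + \sum_{n=1}^{N}x_n\Big)\mu + \Big(\lambda_0\mu_0^2 + \sum_{n=1}^{N}x_n^2\Big)\right\} + \text{const} \\
&= -\frac{\mathbb{E}[\tau](\lambda_0+N)}{2}\left\{\mu^2 - 2\,\frac{\lambda_0\mu_0 + \sum_{n=1}^{N}x_n}{\lambda_0+N}\,\mu + \frac{\lambda_0\mu_0^2 + \sum_{n=1}^{N}x_n^2}{\lambda_0+N}\right\} + \text{const}
\end{aligned}
\]

From this expression, we see that $q_\mu^{\star}(\mu)$ should be a Gaussian. Suppose that it has the form $q_\mu^{\star}(\mu) = \mathcal{N}(\mu|\mu_N, \lambda_N^{-1})$; then its logarithm can be written as:

\[
\ln q_\mu^{\star}(\mu) = \frac{1}{2}\ln\frac{\lambda_N}{2\pi} - \frac{\lambda_N}{2}(\mu-\mu_N)^2
\]

We match the terms related to $\mu$ (the quadratic term and the linear term), yielding:

\[
\lambda_N = \mathbb{E}[\tau]\cdot(\lambda_0+N), \quad\text{and}\quad \lambda_N\mu_N = \mathbb{E}[\tau]\cdot(\lambda_0+N)\cdot\frac{\lambda_0\mu_0 + \sum_{n=1}^{N}x_n}{\lambda_0+N}
\]

Therefore, we obtain:

\[
\mu_N = \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0+N}
\]

where $\bar{x}$ is the mean of $x_n$, i.e.,

\[
\bar{x} = \frac{1}{N}\sum_{n=1}^{N}x_n
\]

Then we deal with the other factor $q_\tau(\tau)$. Note that there is a typo in Eq (10.28): the coefficient of $\ln\tau$ should be $\frac{N+1}{2}$. Let's verify this by considering the terms that contribute $\ln\tau$.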
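The first term inside the expectation, i.e., $\ln p(\mathcal{D}|\mu,\tau)$, gives $\frac{N}{2}\ln\tau$, and the second term inside the expectation, i.e., $\ln p(\mu|\tau)$, gives $\frac{1}{2}\ln\tau$. Finally the last term $\ln p(\tau)$ gives $(a_0-1)\ln\tau$. Therefore, Eq (10.29), Eq (10.31) and Eq (10.33) change accordingly. The correct forms of these equations will be given in this and the following problems. Now suppose that $q_\tau(\tau)$ is a Gamma distribution, i.e., $q_\tau(\tau) = \mathrm{Gam}(\tau|a_N, b_N)$; we have:

\[
\ln q_\tau(\tau) = -\ln\Gamma(a_N) + a_N\ln b_N + (a_N-1)\ln\tau - b_N\tau
\]

Comparing it with Eq (10.28) and matching the coefficients of $\tau$ and $\ln\tau$, we obtain:

\[
a_N - 1 = a_0 - 1 + \frac{N+1}{2} \quad\Rightarrow\quad a_N = a_0 + \frac{N+1}{2}
\]

And similarly

\[
b_N = b_0 + \frac{1}{2}\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]
\]

Just as required.

Problem 10.8 Solution

According to Eq (B.27), for large $N$ we have:

\[
\begin{aligned}
\mathbb{E}[\tau] = \frac{a_N}{b_N} &= \frac{a_0 + (N+1)/2}{b_0 + \frac{1}{2}\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]} \\
&\approx \frac{N/2}{\frac{1}{2}\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]} \\
&= \frac{N}{\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]} \\
&= \left\{\frac{1}{N}\,\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]\right\}^{-1}
\end{aligned}
\]

According to Eq (B.28), we have:

\[
\mathrm{var}[\tau] = \frac{a_N}{b_N^2} = \frac{a_0 + (N+1)/2}{\left(b_0 + \frac{1}{2}\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]\right)^2} \approx \frac{N/2}{\frac{1}{4}\,\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]^2} \approx 0
\]

The last approximation holds because the numerator grows linearly in $N$ while the denominator grows quadratically. Just as required.

Problem 10.9 Solution

The underlying assumption of this problem is $a_0 = b_0 = \lambda_0 = 0$. According to Eq (10.26), Eq (10.27) and the definition of variance, we can obtain:

\[
\mathbb{E}[\mu^2] = \lambda_N^{-1} + \mathbb{E}[\mu]^2 = \frac{1}{(\lambda_0+N)\,\mathbb{E}[\tau]} + \left(\frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0+N}\right)^2 = \frac{1}{N\,\mathbb{E}[\tau]} + \bar{x}^2
\]

Note that there is a typo in Eq (10.29), as stated in the previous problem: the term $\frac{1}{2}$ is missing. With this correction,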
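$\mathbb{E}[\tau]^{-1}$ actually equals:

\[
\begin{aligned}
\frac{1}{\mathbb{E}[\tau]} = \frac{b_N}{a_N} &= \frac{b_0 + \frac{1}{2}\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]}{a_0 + (N+1)/2} \\
&= \frac{\frac{1}{2}\mathbb{E}_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]}{(N+1)/2} \\
&= \mathbb{E}_\mu\left[\frac{1}{N+1}\sum_{n=1}^{N}(x_n-\mu)^2\right] \\
&= \frac{N}{N+1}\,\mathbb{E}_\mu\left[\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)^2\right] \\
&= \frac{N}{N+1}\left\{\overline{x^2} - 2\bar{x}\,\mathbb{E}[\mu] + \mathbb{E}[\mu^2]\right\} \\
&= \frac{N}{N+1}\left\{\overline{x^2} - 2\bar{x}^2 + \frac{1}{N\,\mathbb{E}[\tau]} + \bar{x}^2\right\} \\
&= \frac{N}{N+1}\left\{\overline{x^2} - \bar{x}^2 + \frac{1}{N\,\mathbb{E}[\tau]}\right\} \\
&= \frac{N}{N+1}\left\{\overline{x^2} - \bar{x}^2\right\} + \frac{1}{(N+1)\,\mathbb{E}[\tau]}
\end{aligned}
\]

Here $\overline{x^2} = \frac{1}{N}\sum_n x_n^2$, and we have used $\mathbb{E}[\mu] = \bar{x}$, which follows from Eq (10.26) with $\lambda_0 = 0$. Rearranging it, we can obtain:

\[
\frac{1}{\mathbb{E}[\tau]} = \overline{x^2} - \bar{x}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n-\bar{x})^2
\]

Actually it is still a biased estimator.

Problem 10.10 Solution

We substitute $\mathcal{L}_m$, i.e., Eq (10.35), back into the right-hand side of Eq (10.34), yielding:

\[
\begin{aligned}
(\text{right}) &= \sum_m\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)\left\{\ln\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)} - \ln\frac{p(\mathbf{Z},m|\mathbf{X})}{q(\mathbf{Z}|m)\,q(m)}\right\} \\
&= \sum_m\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{p(\mathbf{Z},m|\mathbf{X})}\right\} \\
&= \sum_m\sum_{\mathbf{Z}} q(\mathbf{Z},m)\ln p(\mathbf{X}) \\
&= \ln p(\mathbf{X})
\end{aligned}
\]

Just as required.

Problem 10.11 Solution

We introduce the Lagrange multiplier:

\[
\begin{aligned}
L &= \sum_m\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)}\right\} - \lambda\left\{\sum_m q(m) - 1\right\} \\
&= \sum_m\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)\left\{\ln p(\mathbf{Z},\mathbf{X},m) - \ln q(\mathbf{Z}|m)\right\} - \sum_m\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)\ln q(m) - \lambda\left\{\sum_m q(m) - 1\right\} \\
&= \sum_m q(m)\cdot C - \sum_m\left\{\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\right\} q(m)\ln q(m) - \lambda\left\{\sum_m q(m) - 1\right\}
\end{aligned}
\]

where we have defined:

\[
C = \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\left\{\ln p(\mathbf{Z},\mathbf{X},m) - \ln q(\mathbf{Z}|m)\right\}
\]

According to the calculus of variations given in Appendix D, and also Prob.1.34, we can obtain the derivative of $L$ with respect to $q(m)$ and set it to 0:

\[
\begin{aligned}
\frac{\partial L}{\partial q(m)} &= C - \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\left[\ln q(m) + 1\right] - \lambda \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\left\{\ln p(\mathbf{Z},\mathbf{X},m) - \ln q(\mathbf{Z}|m)\right\} - \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln q(m) - 1 - \lambda \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)}\right\} - 1 - \lambda = 0
\end{aligned}
\]

We multiply both sides by $q(m)$ and then sum over $m$, yielding:

\[
\sum_m\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)}\right\} - (1+\lambda)\sum_m q(m) = 0
\]

Noticing that the first term is exactly $\mathcal{L}_m$ defined in Eq (10.35) and that the sum of $q(m)$ over $m$ equals 1, we obtain:

\[
\lambda = \mathcal{L}_m - 1
\]

We substitute $\lambda$ back into the derivative, yielding:

\[
\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)}\right\} - \mathcal{L}_m = 0 \qquad (*)
\]

One important thing must be clarified here: there is a typo in Eq (10.36). $\mathcal{L}_m$ in Eq (10.36) should be $\mathcal{L}''$, which depends on $m$ and is defined as:

\[
\mathcal{L}'' = \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X}|m)}{q(\mathbf{Z}|m)}\right\}
\]

Now with the definition of $\mathcal{L}''$, we expand $(*)$:

\[
\begin{aligned}
(*) &= \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)}\right\} - \mathcal{L}_m \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X}|m)\,p(m)}{q(\mathbf{Z}|m)\,q(m)}\right\} - \mathcal{L}_m \\
&= \mathcal{L}'' + \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\frac{p(m)}{q(m)} - \mathcal{L}_m \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\,q(m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X},m)}{q(\mathbf{Z}|m)\,q(m)}\right\} \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m q(m)\sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\left\{\frac{p(\mathbf{Z},\mathbf{X}|m)\,p(m)}{q(\mathbf{Z}|m)\,q(m)}\right\} \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m q(m)\left\{\mathcal{L}'' + \sum_{\mathbf{Z}} q(\mathbf{Z}|m)\ln\frac{p(m)}{q(m)}\right\} \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m q(m)\left\{\mathcal{L}'' + \ln\frac{p(m)}{q(m)}\right\} \\
&= \ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} - \sum_m q(m)\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} = 0
\end{aligned}
\]

The solution is given by:

\[
q(m) = \frac{1}{A}\cdot p(m)\exp(\mathcal{L}'')
\]

where $\frac{1}{A}$ is a normalization constant, used to guarantee that the sum of $q(m)$ over $m$ equals 1. More specifically, it is given by:

\[
A = \sum_m p(m)\exp(\mathcal{L}'')
\]

Therefore, $A$ does not depend on the value of $m$. You can verify the result for $q(m)$ by substituting it back into the last line of $(*)$, yielding:

\[
\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} - \sum_m q(m)\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} = \ln A - \sum_m q(m)\cdot\ln A = 0
\]

One last thing worth mentioning is that you can directly start from $\mathcal{L}_m$ given in Eq (10.35), without introducing a Lagrange multiplier, to obtain $q(m)$. In this way, we obtain:

\[
\mathcal{L}_m = \sum_m q(m)\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)}
\]

It is the negative of the KL divergence between $q(m)$ and $p(m)\exp(\mathcal{L}'')$. Note that $p(m)\exp(\mathcal{L}'')$ is not normalized, so we cannot set $q(m)$ equal to $p(m)\exp(\mathcal{L}'')$ to achieve the minimum KL divergence, i.e., 0, since $q(m)$ is a probability distribution and must sum to 1 over $m$. Therefore, we can guess that the optimal $q(m)$ is given by the normalized $p(m)\exp(\mathcal{L}'')$. In this way, the constraint that $q(m)$ sums to 1 over $m$ is implicitly guaranteed. The stricter proof using the Lagrange multiplier has been shown above.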
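The result of Prob.10.11 can be sanity-checked numerically. The following is only a sketch and not part of the original solution: the prior `p`, the values `L2` standing in for $\mathcal{L}''$ of each model, and the number of models `M = 4` are all hypothetical, drawn at random for illustration. It forms $q(m) \propto p(m)\exp(\mathcal{L}'')$ and compares the bound $\sum_m q(m)\{\mathcal{L}'' + \ln p(m) - \ln q(m)\}$ at this $q$ with the bound at randomly drawn distributions.

```python
import math
import random

def bound(q, p, L2):
    """Lower bound sum_m q(m) * (L''_m + ln p(m) - ln q(m)); zero-mass terms contribute 0."""
    return sum(qm * (lm + math.log(pm) - math.log(qm))
               for qm, pm, lm in zip(q, p, L2) if qm > 0.0)

def optimal_q(p, L2):
    """Normalised p(m) * exp(L''_m), returned together with the normaliser A."""
    w = [pm * math.exp(lm) for pm, lm in zip(p, L2)]
    A = sum(w)
    return [wm / A for wm in w], A

random.seed(0)
M = 4                                                # hypothetical number of models
raw = [random.random() + 0.1 for _ in range(M)]
p = [r / sum(raw) for r in raw]                      # hypothetical prior p(m)
L2 = [random.uniform(-3.0, 0.0) for _ in range(M)]   # hypothetical values of L''_m

q_star, A = optimal_q(p, L2)
best = bound(q_star, p, L2)

# Evaluate the bound at many other normalised distributions q(m).
others = []
for _ in range(1000):
    r = [random.random() + 1e-9 for _ in range(M)]
    others.append(bound([x / sum(r) for x in r], p, L2))

print(best, math.log(A), max(others))
```

Under these assumptions the bound at `q_star` should coincide with $\ln A$, and none of the random candidates should exceed it, which is what the KL argument above predicts.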
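Problem 10.12 Solution

The solution procedure has already been given in Eq (10.43) - (10.49), so here we explain it in more detail, starting from Eq (10.43):

\[
\begin{aligned}
\ln q^{\star}(\mathbf{Z}) &= \mathbb{E}_{\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}}[\ln p(\mathbf{X},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})] + \text{const} \\
&= \mathbb{E}_{\boldsymbol{\pi}}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] + \mathbb{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}[\ln p(\mathbf{X}|\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})] + \text{const} \\
&= \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\ln\pi_k\right] + \mathbb{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\left\{\frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{D}{2}\ln 2\pi - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right\}\right] + \text{const} \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\,\mathbb{E}_{\boldsymbol{\pi}}[\ln\pi_k] + \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\,\mathbb{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{D}{2}\ln 2\pi - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right] + \text{const} \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\ln\rho_{nk} + \text{const}
\end{aligned}
\]

where we have used Eq (10.37) and Eq (10.38), and $D$ is the dimension of $\mathbf{x}_n$. Here $\ln\rho_{nk}$ is defined as:

\[
\begin{aligned}
\ln\rho_{nk} &= \mathbb{E}_{\boldsymbol{\pi}}[\ln\pi_k] + \mathbb{E}_{\boldsymbol{\mu},\boldsymbol{\Lambda}}\left[\frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{D}{2}\ln 2\pi - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right] \\
&= \mathbb{E}_{\boldsymbol{\pi}}[\ln\pi_k] + \frac{1}{2}\mathbb{E}_{\boldsymbol{\Lambda}}[\ln|\boldsymbol{\Lambda}_k|] - \frac{D}{2}\ln 2\pi - \frac{1}{2}\mathbb{E}_{\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k}\left[(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right]
\end{aligned}
\]

Taking the exponential of both sides, we obtain:

\[
q^{\star}(\mathbf{Z}) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\rho_{nk}^{z_{nk}}
\]

Because $q^{\star}(\mathbf{Z})$ should be correctly normalized, we are required to find the normalization constant. In this problem, directly calculating the normalization constant by summing $q^{\star}(\mathbf{Z})$ over $\mathbf{Z}$ is nontrivial. Therefore, we will prove that Eq (10.49) is the correct normalization by mathematical induction. When $N = 1$, $q^{\star}(\mathbf{Z})$ reduces to $\prod_{k=1}^{K}\rho_{1k}^{z_{1k}}$, and it is easy to see that the normalization constant is given by:

\[
A = \sum_{\mathbf{z}_1}\prod_{k=1}^{K}\rho_{1k}^{z_{1k}} = \sum_{j=1}^{K}\rho_{1j}
\]

Here we have used the 1-of-K coding scheme for $\mathbf{z}_1 = [z_{11}, z_{12}, ..., z_{1K}]^T$, i.e., exactly one of $\{z_{11}, z_{12}, ..., z_{1K}\}$ is 1 and all others are 0. Therefore the summation over $\mathbf{z}_1$ is made up of $K$ terms, and the $j$-th term corresponds to $z_{1j} = 1$ with all other $z_{1i}$ equal to 0. In this case, we have obtained:

\[
q^{\star}(\mathbf{Z}) = \frac{1}{A}\prod_{k=1}^{K}\rho_{1k}^{z_{1k}} = \prod_{k=1}^{K}\left(\frac{\rho_{1k}}{\sum_{j=1}^{K}\rho_{1j}}\right)^{z_{1k}}
\]

It is exactly the same as Eq (10.48) and Eq (10.49). Suppose now we have proved that for $N - 1$, the normalized $q^{\star}(\mathbf{Z})$ is given by Eq (10.48) and Eq (10.49).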
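For $N$, we have:

\[
\begin{aligned}
\sum_{\mathbf{Z}} q^{\star}(\mathbf{Z}) &= \sum_{\mathbf{z}_1,...,\mathbf{z}_N}\prod_{n=1}^{N}\prod_{k=1}^{K} r_{nk}^{z_{nk}} \\
&= \sum_{\mathbf{z}_N}\left\{\sum_{\mathbf{z}_1,...,\mathbf{z}_{N-1}}\left[\prod_{k=1}^{K} r_{Nk}^{z_{Nk}}\right]\cdot\left[\prod_{n=1}^{N-1}\prod_{k=1}^{K} r_{nk}^{z_{nk}}\right]\right\} \\
&= \sum_{\mathbf{z}_N}\left\{\left[\prod_{k=1}^{K} r_{Nk}^{z_{Nk}}\right]\cdot\sum_{\mathbf{z}_1,...,\mathbf{z}_{N-1}}\prod_{n=1}^{N-1}\prod_{k=1}^{K} r_{nk}^{z_{nk}}\right\} \\
&= \sum_{\mathbf{z}_N}\left\{\left[\prod_{k=1}^{K} r_{Nk}^{z_{Nk}}\right]\cdot 1\right\} \\
&= \sum_{\mathbf{z}_N}\prod_{k=1}^{K} r_{Nk}^{z_{Nk}} = \sum_{k=1}^{K} r_{Nk} = 1
\end{aligned}
\]

The proof of the final step is exactly the same as that for $N = 1$. So now, with the assumption that Eq (10.48) and Eq (10.49) are right for $N - 1$, we have shown that they are also correct for $N$. The proof is complete.

Problem 10.13 Solution

Let's start from Eq (10.54). Here and below, $\propto$ between log-densities denotes equality up to additive terms that do not depend on the variables of interest.

\[
\begin{aligned}
\ln q^{\star}(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}) &\propto \ln p(\boldsymbol{\pi}) + \sum_{k=1}^{K}\ln p(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) + \mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] + \sum_{k=1}^{K}\sum_{n=1}^{N}\mathbb{E}[z_{nk}]\ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1}) \\
&= \ln C(\boldsymbol{\alpha}_0) + \sum_{k=1}^{K}(\alpha_0-1)\ln\pi_k + \sum_{k=1}^{K}\ln\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_0,(\beta_0\boldsymbol{\Lambda}_k)^{-1}) + \sum_{k=1}^{K}\ln\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_0,\nu_0) \\
&\quad + \sum_{n=1}^{N}\sum_{k=1}^{K}\ln\pi_k\,\mathbb{E}[z_{nk}] + \sum_{k=1}^{K}\sum_{n=1}^{N}\mathbb{E}[z_{nk}]\ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})
\end{aligned}
\]

It is easy to observe that the equation above decomposes into a sum of terms involving only $\boldsymbol{\pi}$ and terms involving only $\boldsymbol{\mu}$ and $\boldsymbol{\Lambda}$. In other words, $q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})$ factorizes into the product of $q(\boldsymbol{\pi})$ and $q(\boldsymbol{\mu},\boldsymbol{\Lambda})$. We first extract the terms that depend on $\boldsymbol{\pi}$.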
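\[
\begin{aligned}
\ln q^{\star}(\boldsymbol{\pi}) &\propto (\alpha_0-1)\sum_{k=1}^{K}\ln\pi_k + \sum_{k=1}^{K}\sum_{n=1}^{N}\ln\pi_k\,\mathbb{E}[z_{nk}] \\
&= (\alpha_0-1)\sum_{k=1}^{K}\ln\pi_k + \sum_{k=1}^{K}\sum_{n=1}^{N} r_{nk}\ln\pi_k \\
&= \sum_{k=1}^{K}\ln\pi_k\cdot\left[\alpha_0 - 1 + \sum_{n=1}^{N} r_{nk}\right]
\end{aligned}
\]

Comparing it to the standard form of a Dirichlet distribution, we conclude that $q^{\star}(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})$, where the $k$-th entry of $\boldsymbol{\alpha}$, i.e., $\alpha_k$, is given by:

\[
\alpha_k = \alpha_0 + \sum_{n=1}^{N} r_{nk} = \alpha_0 + N_k
\]

Next we gather all the terms dependent on $\boldsymbol{\mu} = \{\boldsymbol{\mu}_k\}$ and $\boldsymbol{\Lambda} = \{\boldsymbol{\Lambda}_k\}$:

\[
\begin{aligned}
\ln q^{\star}(\boldsymbol{\mu},\boldsymbol{\Lambda}) &\propto \sum_{k=1}^{K}\ln\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_0,(\beta_0\boldsymbol{\Lambda}_k)^{-1}) + \sum_{k=1}^{K}\ln\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_0,\nu_0) + \sum_{k=1}^{K}\sum_{n=1}^{N}\mathbb{E}[z_{nk}]\ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1}) \\
&\propto \sum_{k=1}^{K}\left\{\frac{1}{2}\ln|\beta_0\boldsymbol{\Lambda}_k| - \frac{1}{2}(\boldsymbol{\mu}_k-\mathbf{m}_0)^T\beta_0\boldsymbol{\Lambda}_k(\boldsymbol{\mu}_k-\mathbf{m}_0)\right\} \\
&\quad + \sum_{k=1}^{K}\left\{\frac{\nu_0-D-1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}(\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k)\right\} \\
&\quad + \sum_{k=1}^{K}\sum_{n=1}^{N} r_{nk}\left\{\frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right\}
\end{aligned}
\]

We know that the optimal $q^{\star}(\boldsymbol{\mu},\boldsymbol{\Lambda})$ can be written as:

\[
q^{\star}(\boldsymbol{\mu},\boldsymbol{\Lambda}) = \prod_{k=1}^{K} q^{\star}(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k)\,q^{\star}(\boldsymbol{\Lambda}_k) = \prod_{k=1}^{K}\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,\nu_k) \qquad (*)
\]

We first complete the square with respect to $\boldsymbol{\mu}_k$. The quadratic term is given by:

\[
-\frac{1}{2}\boldsymbol{\mu}_k^T(\beta_0\boldsymbol{\Lambda}_k)\boldsymbol{\mu}_k - \sum_{n=1}^{N}\frac{1}{2}r_{nk}\,\boldsymbol{\mu}_k^T\boldsymbol{\Lambda}_k\boldsymbol{\mu}_k = -\frac{1}{2}\boldsymbol{\mu}_k^T(\beta_0\boldsymbol{\Lambda}_k + N_k\boldsymbol{\Lambda}_k)\boldsymbol{\mu}_k
\]

Therefore, comparing with $(*)$, we obtain:

\[
\beta_k = \beta_0 + N_k
\]

Next, we write down the linear term with respect to $\boldsymbol{\mu}_k$:

\[
\begin{aligned}
\boldsymbol{\mu}_k^T(\beta_0\boldsymbol{\Lambda}_k\mathbf{m}_0) + \sum_{n=1}^{N} r_{nk}\cdot\boldsymbol{\mu}_k^T(\boldsymbol{\Lambda}_k\mathbf{x}_n) &= \boldsymbol{\mu}_k^T\Big(\beta_0\boldsymbol{\Lambda}_k\mathbf{m}_0 + \sum_{n=1}^{N} r_{nk}\boldsymbol{\Lambda}_k\mathbf{x}_n\Big) \\
&= \boldsymbol{\mu}_k^T\boldsymbol{\Lambda}_k\Big(\beta_0\mathbf{m}_0 + \sum_{n=1}^{N} r_{nk}\mathbf{x}_n\Big) \\
&= \boldsymbol{\mu}_k^T\boldsymbol{\Lambda}_k(\beta_0\mathbf{m}_0 + N_k\bar{\mathbf{x}}_k)
\end{aligned}
\]

where we have defined:

\[
\bar{\mathbf{x}}_k = \frac{1}{\sum_{n=1}^{N} r_{nk}}\sum_{n=1}^{N} r_{nk}\mathbf{x}_n = \frac{1}{N_k}\sum_{n=1}^{N} r_{nk}\mathbf{x}_n
\]

Comparing to the standard form, we obtain:

\[
\mathbf{m}_k = \frac{1}{\beta_k}(\beta_0\mathbf{m}_0 + N_k\bar{\mathbf{x}}_k)
\]

Now we have obtained $q^{\star}(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k) = \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})$. Using the relation:

\[
\ln q^{\star}(\boldsymbol{\Lambda}_k) = \ln q^{\star}(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) - \ln q^{\star}(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k)
\]

and focusing only on the terms dependent on $\boldsymbol{\Lambda}_k$, we obtain:

\[
\begin{aligned}
\ln q^{\star}(\boldsymbol{\Lambda}_k) &\propto \left\{\frac{1}{2}\ln|\beta_0\boldsymbol{\Lambda}_k| - \frac{1}{2}(\boldsymbol{\mu}_k-\mathbf{m}_0)^T\beta_0\boldsymbol{\Lambda}_k(\boldsymbol{\mu}_k-\mathbf{m}_0)\right\} + \left\{\frac{\nu_0-D-1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}(\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k)\right\} \\
&\quad + \sum_{n=1}^{N} r_{nk}\left\{\frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right\} - \left\{\frac{1}{2}\ln|\beta_k\boldsymbol{\Lambda}_k| - \frac{1}{2}(\boldsymbol{\mu}_k-\mathbf{m}_k)^T\beta_k\boldsymbol{\Lambda}_k(\boldsymbol{\mu}_k-\mathbf{m}_k)\right\} \\
&\propto \left\{\frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}\left[\beta_0(\boldsymbol{\mu}_k-\mathbf{m}_0)(\boldsymbol{\mu}_k-\mathbf{m}_0)^T\cdot\boldsymbol{\Lambda}_k\right]\right\} + \left\{\frac{\nu_0-D-1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}(\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k)\right\} \\
&\quad + \left\{\frac{N_k}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}\Big[\sum_{n=1}^{N} r_{nk}(\mathbf{x}_n-\boldsymbol{\mu}_k)(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\cdot\boldsymbol{\Lambda}_k\Big]\right\} - \left\{\frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}\left[\beta_k(\boldsymbol{\mu}_k-\mathbf{m}_k)(\boldsymbol{\mu}_k-\mathbf{m}_k)^T\cdot\boldsymbol{\Lambda}_k\right]\right\} \\
&= \frac{\nu_0-D-1+N_k}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}[\mathbf{T}\cdot\boldsymbol{\Lambda}_k]
\end{aligned}
\]

where we have defined:

\[
\mathbf{T} = \beta_0(\boldsymbol{\mu}_k-\mathbf{m}_0)(\boldsymbol{\mu}_k-\mathbf{m}_0)^T + \mathbf{W}_0^{-1} + \sum_{n=1}^{N} r_{nk}(\mathbf{x}_n-\boldsymbol{\mu}_k)(\mathbf{x}_n-\boldsymbol{\mu}_k)^T - \beta_k(\boldsymbol{\mu}_k-\mathbf{m}_k)(\boldsymbol{\mu}_k-\mathbf{m}_k)^T
\]

By matching the coefficient of $\ln|\boldsymbol{\Lambda}_k|$, we obtain:

\[
\nu_k = \nu_0 + N_k
\]

Next, by matching the coefficient in the trace, we see that:

\[
\mathbf{W}_k^{-1} = \mathbf{T}
\]

Let's further simplify $\mathbf{T}$, beginning by introducing a useful equation, which will be used here and later in Prob.10.16:

\[
\begin{aligned}
\sum_{n=1}^{N} r_{nk}\mathbf{x}_n\mathbf{x}_n^T &= \sum_{n=1}^{N} r_{nk}(\mathbf{x}_n-\bar{\mathbf{x}}_k+\bar{\mathbf{x}}_k)(\mathbf{x}_n-\bar{\mathbf{x}}_k+\bar{\mathbf{x}}_k)^T \\
&= \sum_{n=1}^{N} r_{nk}\left[(\mathbf{x}_n-\bar{\mathbf{x}}_k)(\mathbf{x}_n-\bar{\mathbf{x}}_k)^T + \bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T + (\mathbf{x}_n-\bar{\mathbf{x}}_k)\bar{\mathbf{x}}_k^T + \bar{\mathbf{x}}_k(\mathbf{x}_n-\bar{\mathbf{x}}_k)^T\right] \\
&= N_k\mathbf{S}_k + N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T + (N_k\bar{\mathbf{x}}_k - N_k\bar{\mathbf{x}}_k)\bar{\mathbf{x}}_k^T + \bar{\mathbf{x}}_k(N_k\bar{\mathbf{x}}_k - N_k\bar{\mathbf{x}}_k)^T \\
&= N_k\mathbf{S}_k + N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T
\end{aligned}
\]

where we have used Eq (10.51)-(10.53). Now we are ready to prove that $\mathbf{T}$ is exactly given by Eq (10.62). Let's first consider the terms that are quadratic in $\boldsymbol{\mu}_k$:

\[
(\text{quad}) = \beta_0\boldsymbol{\mu}_k\boldsymbol{\mu}_k^T + \sum_{n=1}^{N} r_{nk}\boldsymbol{\mu}_k\boldsymbol{\mu}_k^T - \beta_k\boldsymbol{\mu}_k\boldsymbol{\mu}_k^T = \Big(\beta_0 + \sum_{n=1}^{N} r_{nk} - \beta_k\Big)\boldsymbol{\mu}_k\boldsymbol{\mu}_k^T = 0
\]

where the summation is equal to $N_k$ and we have also used Eq (10.60).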
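The decomposition $\sum_n r_{nk}\mathbf{x}_n\mathbf{x}_n^T = N_k\mathbf{S}_k + N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T$ used above is easy to confirm numerically. The sketch below is an illustration only, not part of the original solution: the data `x`, the weights `r` (standing in for the responsibilities $r_{nk}$ of one fixed component $k$) and the sizes `N`, `D` are hypothetical random values.

```python
import random

def outer(a, b):
    """Outer product a b^T of two vectors given as lists."""
    return [[ai * bj for bj in b] for ai in a]

def add(A, B):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(c, A):
    """Scalar multiple c * A."""
    return [[c * x for x in row] for row in A]

random.seed(1)
N, D = 50, 2
x = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]  # hypothetical data
r = [random.random() for _ in range(N)]   # hypothetical responsibilities r_nk for one k

Nk = sum(r)                                                          # Eq (10.51)
xbar = [sum(rn * xn[d] for rn, xn in zip(r, x)) / Nk for d in range(D)]  # Eq (10.52)

lhs = [[0.0] * D for _ in range(D)]   # accumulates sum_n r_nk x_n x_n^T
Sk = [[0.0] * D for _ in range(D)]    # accumulates S_k, Eq (10.53)
for rn, xn in zip(r, x):
    lhs = add(lhs, scale(rn, outer(xn, xn)))
    diff = [a - b for a, b in zip(xn, xbar)]
    Sk = add(Sk, scale(rn / Nk, outer(diff, diff)))

rhs = add(scale(Nk, Sk), scale(Nk, outer(xbar, xbar)))
max_err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(D) for j in range(D))
print(max_err)
```

The two sides should agree up to floating-point rounding, because the cross terms vanish exactly when $\bar{\mathbf{x}}_k$ is the $r_{nk}$-weighted mean.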
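Next we focus on the linear term, writing $2\mathbf{a}\mathbf{b}^T$ as shorthand for the symmetric pair $\mathbf{a}\mathbf{b}^T + \mathbf{b}\mathbf{a}^T$:

\[
\begin{aligned}
(\text{linear}) &= -2\beta_0\mathbf{m}_0\boldsymbol{\mu}_k^T - \sum_{n=1}^{N} 2r_{nk}\mathbf{x}_n\boldsymbol{\mu}_k^T + 2\beta_k\mathbf{m}_k\boldsymbol{\mu}_k^T \\
&= 2\Big(-\beta_0\mathbf{m}_0 - \sum_{n=1}^{N} r_{nk}\mathbf{x}_n + \beta_k\mathbf{m}_k\Big)\boldsymbol{\mu}_k^T = 0
\end{aligned}
\]

where the last equality follows from Eq (10.61), i.e., $\beta_k\mathbf{m}_k = \beta_0\mathbf{m}_0 + N_k\bar{\mathbf{x}}_k$. Finally we deal with the constant term:

\[
\begin{aligned}
(\text{const}) &= \mathbf{W}_0^{-1} + \beta_0\mathbf{m}_0\mathbf{m}_0^T + \sum_{n=1}^{N} r_{nk}\mathbf{x}_n\mathbf{x}_n^T - \beta_k\mathbf{m}_k\mathbf{m}_k^T \\
&= \mathbf{W}_0^{-1} + \beta_0\mathbf{m}_0\mathbf{m}_0^T + N_k\mathbf{S}_k + N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - \beta_k\mathbf{m}_k\mathbf{m}_k^T \\
&= \mathbf{W}_0^{-1} + N_k\mathbf{S}_k + \beta_0\mathbf{m}_0\mathbf{m}_0^T + N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - \frac{1}{\beta_k}\beta_k^2\,\mathbf{m}_k\mathbf{m}_k^T \\
&= \mathbf{W}_0^{-1} + N_k\mathbf{S}_k + \beta_0\mathbf{m}_0\mathbf{m}_0^T + N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - \frac{1}{\beta_k}(\beta_0\mathbf{m}_0 + N_k\bar{\mathbf{x}}_k)(\beta_0\mathbf{m}_0 + N_k\bar{\mathbf{x}}_k)^T \\
&= \mathbf{W}_0^{-1} + N_k\mathbf{S}_k + \Big(\beta_0 - \frac{\beta_0^2}{\beta_k}\Big)\mathbf{m}_0\mathbf{m}_0^T + \Big(N_k - \frac{N_k^2}{\beta_k}\Big)\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - \frac{1}{\beta_k}\,2(\beta_0\mathbf{m}_0)(N_k\bar{\mathbf{x}}_k)^T \\
&= \mathbf{W}_0^{-1} + N_k\mathbf{S}_k + \frac{\beta_0 N_k}{\beta_k}\mathbf{m}_0\mathbf{m}_0^T + \frac{\beta_0 N_k}{\beta_k}\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - \frac{\beta_0 N_k}{\beta_k}\,2\,\mathbf{m}_0\bar{\mathbf{x}}_k^T \\
&= \mathbf{W}_0^{-1} + N_k\mathbf{S}_k + \frac{\beta_0 N_k}{\beta_k}(\mathbf{m}_0-\bar{\mathbf{x}}_k)(\mathbf{m}_0-\bar{\mathbf{x}}_k)^T
\end{aligned}
\]

Just as required.

Problem 10.14 Solution

Let's begin by definition.

\[
\begin{aligned}
\mathbb{E}_{\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k}[(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)] &= \int\!\!\int (\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\, q^{\star}(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k)\, d\boldsymbol{\mu}_k\, d\boldsymbol{\Lambda}_k \\
&= \int\left\{\int (\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\, q^{\star}(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k)\, d\boldsymbol{\mu}_k\right\} q^{\star}(\boldsymbol{\Lambda}_k)\, d\boldsymbol{\Lambda}_k \\
&= \int \mathbb{E}_{\boldsymbol{\mu}_k}[(\boldsymbol{\mu}_k-\mathbf{x}_n)^T\boldsymbol{\Lambda}_k(\boldsymbol{\mu}_k-\mathbf{x}_n)]\cdot q^{\star}(\boldsymbol{\Lambda}_k)\, d\boldsymbol{\Lambda}_k
\end{aligned}
\]

The inner expectation is with respect to $\boldsymbol{\mu}_k$, which follows a Gaussian distribution. We use Eq (380) in the 'Matrix Cookbook': if $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma})$, we have:

\[
\mathbb{E}[(\mathbf{x}-\mathbf{m}')^T\mathbf{A}(\mathbf{x}-\mathbf{m}')] = (\mathbf{m}-\mathbf{m}')^T\mathbf{A}(\mathbf{m}-\mathbf{m}') + \mathrm{Tr}(\mathbf{A}\boldsymbol{\Sigma})
\]

Therefore, here we obtain:

\[
\mathbb{E}_{\boldsymbol{\mu}_k}[(\boldsymbol{\mu}_k-\mathbf{x}_n)^T\boldsymbol{\Lambda}_k(\boldsymbol{\mu}_k-\mathbf{x}_n)] = (\mathbf{m}_k-\mathbf{x}_n)^T\boldsymbol{\Lambda}_k(\mathbf{m}_k-\mathbf{x}_n) + \mathrm{Tr}\left[\boldsymbol{\Lambda}_k\cdot(\beta_k\boldsymbol{\Lambda}_k)^{-1}\right]
\]

Substituting it back into the integral, we obtain:

\[
\begin{aligned}
\mathbb{E}_{\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k}[(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)] &= \int\left[(\mathbf{m}_k-\mathbf{x}_n)^T\boldsymbol{\Lambda}_k(\mathbf{m}_k-\mathbf{x}_n) + D\beta_k^{-1}\right]\cdot q^{\star}(\boldsymbol{\Lambda}_k)\, d\boldsymbol{\Lambda}_k \\
&= D\beta_k^{-1} + \mathbb{E}_{\boldsymbol{\Lambda}_k}\left[(\mathbf{m}_k-\mathbf{x}_n)^T\boldsymbol{\Lambda}_k(\mathbf{m}_k-\mathbf{x}_n)\right] \\
&= D\beta_k^{-1} + \mathbb{E}_{\boldsymbol{\Lambda}_k}\left\{\mathrm{Tr}[\boldsymbol{\Lambda}_k\cdot(\mathbf{m}_k-\mathbf{x}_n)(\mathbf{m}_k-\mathbf{x}_n)^T]\right\} \\
&= D\beta_k^{-1} + \mathrm{Tr}\left\{\mathbb{E}_{\boldsymbol{\Lambda}_k}[\boldsymbol{\Lambda}_k]\cdot(\mathbf{m}_k-\mathbf{x}_n)(\mathbf{m}_k-\mathbf{x}_n)^T\right\} \\
&= D\beta_k^{-1} + \mathrm{Tr}\left\{\nu_k\mathbf{W}_k\cdot(\mathbf{m}_k-\mathbf{x}_n)(\mathbf{m}_k-\mathbf{x}_n)^T\right\} \\
&= D\beta_k^{-1} + \nu_k(\mathbf{m}_k-\mathbf{x}_n)^T\mathbf{W}_k(\mathbf{m}_k-\mathbf{x}_n)
\end{aligned}
\]

Just as required.

Problem 10.15 Solution

There is a typo in Eq (10.69). The numerator should be $\alpha_0 + N_k$.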
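Let's substitute Eq (10.58) into (B.17):

\[
\mathbb{E}[\pi_k] = \frac{\alpha_k}{\sum_k\alpha_k} = \frac{\alpha_0 + N_k}{K\alpha_0 + \sum_k N_k} = \frac{\alpha_0 + N_k}{K\alpha_0 + N}
\]

Problem 10.16 Solution

According to Eq (10.38), we can obtain:

\[
\begin{aligned}
\mathbb{E}[\ln p(\mathbf{X}|\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})] &= \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{E}[z_{nk}\ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})] \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{E}[z_{nk}]\cdot\mathbb{E}\left[-\frac{D}{2}\ln 2\pi + \frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right] \\
&= \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{E}[z_{nk}]\cdot\left\{-D\ln 2\pi + \mathbb{E}[\ln|\boldsymbol{\Lambda}_k|] - \mathbb{E}[(\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)]\right\} \\
&= \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\cdot\left\{-D\ln 2\pi + \ln\widetilde{\Lambda}_k - D\beta_k^{-1} - \nu_k(\mathbf{x}_n-\mathbf{m}_k)^T\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k)\right\}
\end{aligned}
\]

where we have used Eq (10.50), Eq (10.64) and Eq (10.65). We first deal with the first three terms inside the bracket, i.e.,

\[
\begin{aligned}
\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\cdot\left\{-D\ln 2\pi + \ln\widetilde{\Lambda}_k - D\beta_k^{-1}\right\} &= \frac{1}{2}\sum_{k=1}^{K}\Big[\sum_{n=1}^{N} r_{nk}\Big]\cdot\left[-D\ln 2\pi + \ln\widetilde{\Lambda}_k - D\beta_k^{-1}\right] \\
&= \frac{1}{2}\sum_{k=1}^{K} N_k\cdot\left[-D\ln 2\pi + \ln\widetilde{\Lambda}_k - D\beta_k^{-1}\right]
\end{aligned}
\]

where we have used the definition of $N_k$. Next we deal with the last term inside the bracket, i.e.,

\[
\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\cdot\left\{-\nu_k(\mathbf{x}_n-\mathbf{m}_k)^T\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k)\right\} = -\frac{1}{2}\sum_{k=1}^{K}\mathrm{Tr}\Big[\sum_{n=1}^{N} r_{nk}\nu_k\cdot(\mathbf{x}_n-\mathbf{m}_k)(\mathbf{x}_n-\mathbf{m}_k)^T\cdot\mathbf{W}_k\Big]
\]

Since we have:

\[
\begin{aligned}
\sum_{n=1}^{N} r_{nk}\nu_k\cdot(\mathbf{x}_n-\mathbf{m}_k)(\mathbf{x}_n-\mathbf{m}_k)^T &= \nu_k\sum_{n=1}^{N} r_{nk}\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k+\mathbf{x}_n-\bar{\mathbf{x}}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k+\mathbf{x}_n-\bar{\mathbf{x}}_k)^T \\
&= \nu_k\sum_{n=1}^{N} r_{nk}\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T + \nu_k\sum_{n=1}^{N} r_{nk}\cdot(\mathbf{x}_n-\bar{\mathbf{x}}_k)(\mathbf{x}_n-\bar{\mathbf{x}}_k)^T \\
&\quad + \nu_k\sum_{n=1}^{N} r_{nk}\cdot\left[(\bar{\mathbf{x}}_k-\mathbf{m}_k)(\mathbf{x}_n-\bar{\mathbf{x}}_k)^T + (\mathbf{x}_n-\bar{\mathbf{x}}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T\right] \\
&= \nu_k N_k\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T + \nu_k N_k\mathbf{S}_k \\
&\quad + \nu_k\cdot\left[(\bar{\mathbf{x}}_k-\mathbf{m}_k)(N_k\bar{\mathbf{x}}_k - N_k\bar{\mathbf{x}}_k)^T + (N_k\bar{\mathbf{x}}_k - N_k\bar{\mathbf{x}}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T\right] \\
&= \nu_k N_k\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T + \nu_k N_k\mathbf{S}_k
\end{aligned}
\]

the last term can be reduced to:

\[
\begin{aligned}
\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\cdot\left\{-\nu_k(\mathbf{x}_n-\mathbf{m}_k)^T\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k)\right\} &= -\frac{1}{2}\sum_{k=1}^{K}\mathrm{Tr}[\nu_k N_k\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T\mathbf{W}_k] - \frac{1}{2}\sum_{k=1}^{K}\mathrm{Tr}[\nu_k N_k\mathbf{S}_k\mathbf{W}_k] \\
&= -\frac{1}{2}\sum_{k=1}^{K} N_k\nu_k\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T\mathbf{W}_k(\bar{\mathbf{x}}_k-\mathbf{m}_k) - \frac{1}{2}\sum_{k=1}^{K} N_k\nu_k\,\mathrm{Tr}[\mathbf{S}_k\mathbf{W}_k]
\end{aligned}
\]

If we combine the first three terms and the last term,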
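we just obtain Eq (10.71). Next we prove Eq (10.72). According to Eq (10.37), we have:

\[
\mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] = \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{E}[z_{nk}\ln\pi_k] = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\ln\widetilde{\pi}_k
\]

Just as required.

Problem 10.17 Solution

According to Eq (10.39), we have:

\[
\mathbb{E}[\ln p(\boldsymbol{\pi})] = \ln C(\boldsymbol{\alpha}_0) + (\alpha_0-1)\sum_{k=1}^{K}\mathbb{E}[\ln\pi_k] = \ln C(\boldsymbol{\alpha}_0) + (\alpha_0-1)\sum_{k=1}^{K}\ln\widetilde{\pi}_k
\]

According to Eq (10.40), we have:

\[
\begin{aligned}
\mathbb{E}[\ln p(\boldsymbol{\mu},\boldsymbol{\Lambda})] &= \sum_{k=1}^{K}\mathbb{E}[\ln\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_0,(\beta_0\boldsymbol{\Lambda}_k)^{-1})] + \sum_{k=1}^{K}\mathbb{E}[\ln\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_0,\nu_0)] \\
&= \sum_{k=1}^{K}\mathbb{E}\left\{-\frac{D}{2}\ln 2\pi + \frac{1}{2}\ln|\beta_0\boldsymbol{\Lambda}_k| - \frac{1}{2}(\boldsymbol{\mu}_k-\mathbf{m}_0)^T(\beta_0\boldsymbol{\Lambda}_k)(\boldsymbol{\mu}_k-\mathbf{m}_0)\right\} \\
&\quad + \sum_{k=1}^{K}\mathbb{E}\left\{\ln B(\mathbf{W}_0,\nu_0) + \frac{\nu_0-D-1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}[\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k]\right\} \\
&= \sum_{k=1}^{K}\mathbb{E}\left\{-\frac{D}{2}\ln 2\pi + \frac{D}{2}\ln\beta_0 + \frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}(\boldsymbol{\mu}_k-\mathbf{m}_0)^T(\beta_0\boldsymbol{\Lambda}_k)(\boldsymbol{\mu}_k-\mathbf{m}_0)\right\} \\
&\quad + \sum_{k=1}^{K}\mathbb{E}\left\{\ln B(\mathbf{W}_0,\nu_0) + \frac{\nu_0-D-1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}[\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k]\right\} \\
&= \frac{K\cdot D}{2}\ln\frac{\beta_0}{2\pi} + \frac{1}{2}\sum_{k=1}^{K}\ln\widetilde{\Lambda}_k - \frac{1}{2}\sum_{k=1}^{K}\mathbb{E}\left\{(\boldsymbol{\mu}_k-\mathbf{m}_0)^T(\beta_0\boldsymbol{\Lambda}_k)(\boldsymbol{\mu}_k-\mathbf{m}_0)\right\} \\
&\quad + K\cdot\ln B(\mathbf{W}_0,\nu_0) + \frac{\nu_0-D-1}{2}\sum_{k=1}^{K}\ln\widetilde{\Lambda}_k - \frac{1}{2}\sum_{k=1}^{K}\mathbb{E}\left\{\mathrm{Tr}[\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k]\right\}
\end{aligned}
\]

So now we need to calculate these two expectations. Using (B.80), we obtain:

\[
\sum_{k=1}^{K}\mathbb{E}\left\{\mathrm{Tr}[\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k]\right\} = \sum_{k=1}^{K}\mathrm{Tr}\left\{\mathbf{W}_0^{-1}\cdot\mathbb{E}[\boldsymbol{\Lambda}_k]\right\} = \sum_{k=1}^{K}\nu_k\cdot\mathrm{Tr}\left\{\mathbf{W}_0^{-1}\mathbf{W}_k\right\}
\]

To calculate the other expectation, we first write down two properties of the Gaussian distribution $q^{\star}(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k)$, i.e.,

\[
\mathbb{E}[\boldsymbol{\mu}_k] = \mathbf{m}_k, \qquad \mathbb{E}[\boldsymbol{\mu}_k\boldsymbol{\mu}_k^T] = \mathbf{m}_k\mathbf{m}_k^T + \beta_k^{-1}\boldsymbol{\Lambda}_k^{-1}
\]

Therefore, writing $2\mathbf{a}\mathbf{b}^T$ as shorthand for $\mathbf{a}\mathbf{b}^T + \mathbf{b}\mathbf{a}^T$, we obtain:

\[
\begin{aligned}
\sum_{k=1}^{K}\mathbb{E}\left\{(\boldsymbol{\mu}_k-\mathbf{m}_0)^T(\beta_0\boldsymbol{\Lambda}_k)(\boldsymbol{\mu}_k-\mathbf{m}_0)\right\} &= \beta_0\sum_{k=1}^{K}\mathbb{E}\left\{\mathrm{Tr}[\boldsymbol{\Lambda}_k\cdot(\boldsymbol{\mu}_k-\mathbf{m}_0)(\boldsymbol{\mu}_k-\mathbf{m}_0)^T]\right\} \\
&= \beta_0\sum_{k=1}^{K}\mathbb{E}_{\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k}\left\{\mathrm{Tr}\left[\boldsymbol{\Lambda}_k\cdot(\boldsymbol{\mu}_k\boldsymbol{\mu}_k^T - 2\boldsymbol{\mu}_k\mathbf{m}_0^T + \mathbf{m}_0\mathbf{m}_0^T)\right]\right\} \\
&= \beta_0\sum_{k=1}^{K}\mathbb{E}_{\boldsymbol{\Lambda}_k}\left\{\mathrm{Tr}\left[\boldsymbol{\Lambda}_k\cdot(\mathbf{m}_k\mathbf{m}_k^T + \beta_k^{-1}\boldsymbol{\Lambda}_k^{-1} - 2\mathbf{m}_k\mathbf{m}_0^T + \mathbf{m}_0\mathbf{m}_0^T)\right]\right\} \\
&= \beta_0\sum_{k=1}^{K}\mathbb{E}_{\boldsymbol{\Lambda}_k}\left\{\mathrm{Tr}\left[\beta_k^{-1}\mathbf{I} + \boldsymbol{\Lambda}_k\cdot(\mathbf{m}_k\mathbf{m}_k^T - 2\mathbf{m}_k\mathbf{m}_0^T + \mathbf{m}_0\mathbf{m}_0^T)\right]\right\} \\
&= \beta_0\sum_{k=1}^{K}\mathbb{E}_{\boldsymbol{\Lambda}_k}\left\{D\cdot\beta_k^{-1} + \mathrm{Tr}\left[\boldsymbol{\Lambda}_k\cdot(\mathbf{m}_k-\mathbf{m}_0)(\mathbf{m}_k-\mathbf{m}_0)^T\right]\right\} \\
&= \sum_{k=1}^{K}\frac{D\beta_0}{\beta_k} + \beta_0\sum_{k=1}^{K}\mathbb{E}_{\boldsymbol{\Lambda}_k}\left\{(\mathbf{m}_k-\mathbf{m}_0)^T\boldsymbol{\Lambda}_k(\mathbf{m}_k-\mathbf{m}_0)\right\} \\
&= \sum_{k=1}^{K}\frac{D\beta_0}{\beta_k} + \beta_0\sum_{k=1}^{K}(\mathbf{m}_k-\mathbf{m}_0)^T\cdot\mathbb{E}_{\boldsymbol{\Lambda}_k}[\boldsymbol{\Lambda}_k]\cdot(\mathbf{m}_k-\mathbf{m}_0) \\
&= \sum_{k=1}^{K}\frac{D\beta_0}{\beta_k} + \beta_0\sum_{k=1}^{K}(\mathbf{m}_k-\mathbf{m}_0)^T\cdot(\nu_k\mathbf{W}_k)\cdot(\mathbf{m}_k-\mathbf{m}_0)
\end{aligned}
\]

Substituting these two expectations back, we obtain Eq (10.74) just as required.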
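The second expectation can be spot-checked by simulation in the scalar case $D = 1$, where the Wishart distribution $\mathcal{W}(\lambda|w,\nu)$ reduces to a Gamma distribution with shape $\nu/2$ and scale $2w$. The sketch below is only an illustration under that assumption, not part of the original solution; all hyperparameter values are hypothetical, and the Monte Carlo estimate is only expected to agree with the closed form up to sampling noise.

```python
import math
import random

random.seed(0)

# Hypothetical scalar (D = 1) hyperparameters, chosen only for illustration.
beta0, m0 = 1.0, 0.0
betak, mk = 5.0, 1.0
wk, nuk = 0.5, 6.0

def sample_quadratic():
    """Draw (lam, mu) from the Gaussian-Wishart q and return beta0 * lam * (mu - m0)^2."""
    # For D = 1 the Wishart W(lam | wk, nuk) is a Gamma with shape nuk/2 and scale 2*wk.
    lam = random.gammavariate(nuk / 2.0, 2.0 * wk)
    # mu | lam ~ N(mk, (betak * lam)^-1)
    mu = random.gauss(mk, 1.0 / math.sqrt(betak * lam))
    return beta0 * lam * (mu - m0) ** 2

n_samples = 200_000
mc_estimate = sum(sample_quadratic() for _ in range(n_samples)) / n_samples

# Closed form derived above, specialised to D = 1 and a single component k.
closed_form = beta0 / betak + beta0 * nuk * wk * (mk - m0) ** 2

print(mc_estimate, closed_form)
```

With these values the closed form is $D\beta_0/\beta_k + \beta_0\nu_k w_k(m_k - m_0)^2 = 0.2 + 3 = 3.2$, and the simulated average should land close to it.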
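According to Eq (10.48), we have:

\[
\mathbb{E}[\ln q(\mathbf{Z})] = \sum_{n,k=1}^{N,K}\mathbb{E}[z_{nk}]\cdot\ln r_{nk} = \sum_{n,k=1}^{N,K} r_{nk}\cdot\ln r_{nk}
\]

According to Eq (10.57), we have:

\[
\mathbb{E}[\ln q(\boldsymbol{\pi})] = \ln C(\boldsymbol{\alpha}) + \sum_{k=1}^{K}(\alpha_k-1)\,\mathbb{E}[\ln\pi_k] = \ln C(\boldsymbol{\alpha}) + \sum_{k=1}^{K}(\alpha_k-1)\ln\widetilde{\pi}_k
\]

To derive Eq (10.77), we follow the same procedure as that for Eq (10.74):

\[
\begin{aligned}
\mathbb{E}[\ln q(\boldsymbol{\mu},\boldsymbol{\Lambda})] &= \sum_{k=1}^{K}\mathbb{E}[\ln\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})] + \sum_{k=1}^{K}\mathbb{E}[\ln\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,\nu_k)] \\
&= \sum_{k=1}^{K}\mathbb{E}\left\{-\frac{D}{2}\ln 2\pi + \frac{D}{2}\ln\beta_k + \frac{1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}(\boldsymbol{\mu}_k-\mathbf{m}_k)^T(\beta_k\boldsymbol{\Lambda}_k)(\boldsymbol{\mu}_k-\mathbf{m}_k)\right\} \\
&\quad + \sum_{k=1}^{K}\mathbb{E}\left\{\ln B(\mathbf{W}_k,\nu_k) + \frac{\nu_k-D-1}{2}\ln|\boldsymbol{\Lambda}_k| - \frac{1}{2}\mathrm{Tr}[\mathbf{W}_k^{-1}\boldsymbol{\Lambda}_k]\right\} \\
&= \sum_{k=1}^{K}\left\{\frac{D}{2}\ln\frac{\beta_k}{2\pi} + \frac{1}{2}\ln\widetilde{\Lambda}_k - \frac{1}{2}\mathbb{E}\left\{(\boldsymbol{\mu}_k-\mathbf{m}_k)^T(\beta_k\boldsymbol{\Lambda}_k)(\boldsymbol{\mu}_k-\mathbf{m}_k)\right\}\right\} \\
&\quad + \sum_{k=1}^{K}\left\{\ln B(\mathbf{W}_k,\nu_k) + \frac{\nu_k-D-1}{2}\ln\widetilde{\Lambda}_k - \frac{1}{2}\mathbb{E}\left\{\mathrm{Tr}[\mathbf{W}_k^{-1}\boldsymbol{\Lambda}_k]\right\}\right\} \\
&= \sum_{k=1}^{K}\left\{\frac{D}{2}\ln\frac{\beta_k}{2\pi} + \frac{1}{2}\ln\widetilde{\Lambda}_k - \frac{D}{2} + \ln B(\mathbf{W}_k,\nu_k) + \frac{\nu_k-D-1}{2}\ln\widetilde{\Lambda}_k - \frac{1}{2}\nu_k\,\mathrm{Tr}[\mathbf{W}_k^{-1}\mathbf{W}_k]\right\} \\
&= \sum_{k=1}^{K}\left\{\frac{1}{2}\ln\widetilde{\Lambda}_k + \frac{D}{2}\ln\frac{\beta_k}{2\pi} - \frac{D}{2} + \ln B(\mathbf{W}_k,\nu_k) + \frac{\nu_k-D-1}{2}\ln\widetilde{\Lambda}_k - \frac{\nu_k D}{2}\right\}
\end{aligned}
\]

By the expression for the Wishart entropy $\mathrm{H}[q(\boldsymbol{\Lambda}_k)]$ in (B.82), the last three terms inside the bracket equal $-\mathrm{H}[q(\boldsymbol{\Lambda}_k)]$, so the result is identical to Eq (10.77).

Problem 10.18 Solution

This problem is very complicated, so let's explain it in detail. In section 10.2.1, we obtained the update formulas for all the coefficients using the general framework of variational inference. For more details you can see Prob.10.12 and Prob.10.13. Moreover, in the previous problems, we have shown that $\mathcal{L}$ is given by Eq (10.70)-Eq (10.77), provided that $q$ has the assumed form, i.e., Eq (10.42), Eq (10.48), Eq (10.55), Eq (10.57) and Eq (10.59). Note that here we do not know the specific values of those coefficients, e.g., Eq (10.60)-Eq (10.63). In this problem, we will show that by maximizing $\mathcal{L}$ with respect to those coefficients, we obtain the same formulas as in section 10.2.1.

To summarize, the coefficients we need to estimate are $\{\beta_k, \mathbf{m}_k, \nu_k, \mathbf{W}_k, \alpha_k, r_{nk}\}$. We begin by considering $\beta_k$.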
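Note that only Eq (10.71), (10.74) and (10.77) contain $\beta_k$. We calculate the derivative of $\mathcal{L}$ with respect to $\beta_k$ and set it to zero:

\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\beta_k} &= \Big(\frac{1}{2}N_k D\beta_k^{-2}\Big) + \Big(\frac{1}{2}D\beta_0\beta_k^{-2}\Big) - \Big(\frac{D}{2}\cdot\frac{1}{\beta_k}\Big) \\
&= \frac{1}{2}\beta_k^{-2}\cdot(N_k D + D\beta_0 - D\beta_k) = 0
\end{aligned}
\]

The three brackets in the first line correspond to the contributions of Eq (10.71), (10.74) and (10.77), respectively. Rearranging it, we obtain Eq (10.60). Next we consider $\mathbf{m}_k$, which only occurs in the quadratic terms in Eq (10.71) and (10.74).

\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\mathbf{m}_k} &= \Big[\frac{1}{2}N_k\nu_k\cdot 2\mathbf{W}_k(\bar{\mathbf{x}}_k-\mathbf{m}_k)\Big] + \Big[-\frac{1}{2}\beta_0\nu_k\cdot 2\mathbf{W}_k(\mathbf{m}_k-\mathbf{m}_0)\Big] \\
&= \nu_k\mathbf{W}_k\left[N_k\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k) - \beta_0(\mathbf{m}_k-\mathbf{m}_0)\right] = 0
\end{aligned}
\]

Similarly, the two brackets in the first line correspond to the contributions of Eq (10.71) and (10.74). Rearranging it, we obtain Eq (10.61). Next, noticing that $\nu_k$ and $\mathbf{W}_k$ are always coupled in $\mathcal{L}$, e.g., $\nu_k$ multiplies the quadratic terms in Eq (10.71), we will deal with $\nu_k$ and $\mathbf{W}_k$ simultaneously. Let's first make this clearer by writing down the terms in $\mathcal{L}$ that depend on $\nu_k$ and $\mathbf{W}_k$:

\[
\begin{aligned}
(10.77) &\propto \sum_{k=1}^{K}\left\{\frac{1}{2}\ln\widetilde{\Lambda}_k - \mathrm{H}[q(\boldsymbol{\Lambda}_k)]\right\} \qquad (*) \\
(10.71) &\propto \frac{1}{2}\sum_{k=1}^{K} N_k\left\{\ln\widetilde{\Lambda}_k - \nu_k\cdot\mathrm{Tr}[(\mathbf{S}_k+\mathbf{A}_k)\mathbf{W}_k]\right\} \\
(10.74) &\propto \frac{1}{2}\sum_{k=1}^{K}\left\{\ln\widetilde{\Lambda}_k - \beta_0\nu_k\cdot\mathrm{Tr}[\mathbf{B}_k\mathbf{W}_k]\right\} + \frac{\nu_0-D-1}{2}\sum_{k=1}^{K}\ln\widetilde{\Lambda}_k - \frac{1}{2}\sum_{k=1}^{K}\nu_k\,\mathrm{Tr}[\mathbf{W}_0^{-1}\mathbf{W}_k] \\
&= \frac{\nu_0-D}{2}\sum_{k=1}^{K}\ln\widetilde{\Lambda}_k - \frac{1}{2}\sum_{k=1}^{K}\nu_k\,\mathrm{Tr}[(\beta_0\mathbf{B}_k+\mathbf{W}_0^{-1})\mathbf{W}_k]
\end{aligned}
\]

where $\ln\widetilde{\Lambda}_k$ is given by Eq (10.65) and $\mathbf{A}_k$ and $\mathbf{B}_k$ are given by:

\[
\mathbf{A}_k = (\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T, \qquad \mathbf{B}_k = (\mathbf{m}_k-\mathbf{m}_0)(\mathbf{m}_k-\mathbf{m}_0)^T
\]

Moreover, $\mathrm{H}[q(\boldsymbol{\Lambda}_k)]$ is given by (B.82):

\[
\mathrm{H}[q(\boldsymbol{\Lambda}_k)] = -\ln B(\mathbf{W}_k,\nu_k) - \frac{\nu_k-D-1}{2}\ln\widetilde{\Lambda}_k + \frac{\nu_k D}{2}
\]

where $\ln B(\mathbf{W}_k,\nu_k)$ can be calculated based on (B.79).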
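Note that here we only keep the terms dependent on $\nu_k$ and $\mathbf{W}_k$:

\[
\ln B(\mathbf{W}_k,\nu_k) \propto -\frac{\nu_k}{2}\ln|\mathbf{W}_k| - \frac{\nu_k D}{2}\ln 2 - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{\nu_k+1-i}{2}\Big)
\]

To further simplify the derivative, we now write down the terms in $\mathcal{L}$ which depend on $\nu_k$ and $\mathbf{W}_k$ for a given index $k$:

\[
\begin{aligned}
\mathcal{L} &\propto -\left\{\frac{1}{2}\ln\widetilde{\Lambda}_k - \mathrm{H}[q(\boldsymbol{\Lambda}_k)]\right\} + \frac{1}{2}N_k\left\{\ln\widetilde{\Lambda}_k - \nu_k\cdot\mathrm{Tr}[(\mathbf{S}_k+\mathbf{A}_k)\mathbf{W}_k]\right\} + \frac{\nu_0-D}{2}\ln\widetilde{\Lambda}_k - \frac{1}{2}\nu_k\,\mathrm{Tr}[(\beta_0\mathbf{B}_k+\mathbf{W}_0^{-1})\mathbf{W}_k] \\
&= \frac{1}{2}(-1+N_k+\nu_0-D)\ln\widetilde{\Lambda}_k + \mathrm{H}[q(\boldsymbol{\Lambda}_k)] - \frac{1}{2}\nu_k\cdot\mathrm{Tr}[(N_k\mathbf{S}_k+N_k\mathbf{A}_k+\beta_0\mathbf{B}_k+\mathbf{W}_0^{-1})\mathbf{W}_k] \\
&= \frac{1}{2}(-1+N_k+\nu_0-D)\ln\widetilde{\Lambda}_k - \frac{1}{2}\nu_k\cdot\mathrm{Tr}[(N_k\mathbf{S}_k+N_k\mathbf{A}_k+\beta_0\mathbf{B}_k+\mathbf{W}_0^{-1})\mathbf{W}_k] \\
&\quad - \ln B(\mathbf{W}_k,\nu_k) - \frac{\nu_k-D-1}{2}\ln\widetilde{\Lambda}_k + \frac{\nu_k D}{2} \\
&= \frac{1}{2}(N_k+\nu_0-\nu_k)\ln\widetilde{\Lambda}_k - \frac{1}{2}\nu_k\cdot\mathrm{Tr}[\mathbf{F}_k\mathbf{W}_k] + \frac{\nu_k D}{2} - \ln B(\mathbf{W}_k,\nu_k)
\end{aligned}
\]

where we have defined:

\[
\mathbf{F}_k = N_k\mathbf{S}_k + N_k\mathbf{A}_k + \beta_0\mathbf{B}_k + \mathbf{W}_0^{-1}
\]

Note that Eq (10.77) enters $\mathcal{L}$ with a minus sign, so the negative of $(*)$ has been used in the first line. We first calculate the derivative of $\mathcal{L}$ with respect to $\nu_k$ and set it to zero, where $\psi(\cdot)$ denotes the digamma function, i.e., the derivative of $\ln\Gamma(\cdot)$:

\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\nu_k} &= \frac{1}{2}(N_k+\nu_0-\nu_k)\frac{d\ln\widetilde{\Lambda}_k}{d\nu_k} - \frac{1}{2}\ln\widetilde{\Lambda}_k - \frac{1}{2}\mathrm{Tr}[\mathbf{F}_k\mathbf{W}_k] + \frac{D}{2} \\
&\quad + \frac{\ln|\mathbf{W}_k|}{2} + \frac{D\ln 2}{2} + \frac{1}{2}\sum_{i=1}^{D}\psi\Big(\frac{\nu_k+1-i}{2}\Big) \\
&= \frac{1}{2}\left[(N_k+\nu_0-\nu_k)\frac{d\ln\widetilde{\Lambda}_k}{d\nu_k} - \mathrm{Tr}[\mathbf{F}_k\mathbf{W}_k] + D\right] = 0
\end{aligned}
\]

where in the last step we have used the definition of $\ln\widetilde{\Lambda}_k$, i.e., Eq (10.65). Then we calculate the derivative of $\mathcal{L}$ with respect to $\mathbf{W}_k$ and set it to zero:

\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\mathbf{W}_k} &= \frac{1}{2}(N_k+\nu_0-\nu_k)\mathbf{W}_k^{-1} - \frac{\nu_k}{2}\mathbf{F}_k + \frac{\nu_k}{2}\mathbf{W}_k^{-1} \\
&= \frac{1}{2}(N_k+\nu_0-\nu_k)\mathbf{W}_k^{-1} - \frac{\nu_k}{2}(\mathbf{F}_k-\mathbf{W}_k^{-1}) = 0
\end{aligned}
\]

Inspecting these two derivatives, we find that if the following two conditions:

\[
N_k + \nu_0 - \nu_k = 0, \quad\text{and}\quad \mathbf{F}_k = \mathbf{W}_k^{-1}
\]

are satisfied, the derivatives of $\mathcal{L}$ with respect to $\nu_k$ and $\mathbf{W}_k$ are both zero (note that $\mathbf{F}_k = \mathbf{W}_k^{-1}$ gives $\mathrm{Tr}[\mathbf{F}_k\mathbf{W}_k] = D$). Rearranging the first condition, we obtain Eq (10.63). Next we prove that the second condition is exactly Eq (10.62), by simplifying $\mathbf{F}_k$.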
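\[
\begin{aligned}
\mathbf{F}_k &= N_k\mathbf{S}_k + N_k\mathbf{A}_k + \beta_0\mathbf{B}_k + \mathbf{W}_0^{-1} \\
&= \mathbf{W}_0^{-1} + N_k\mathbf{S}_k + N_k\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T + \beta_0\cdot(\mathbf{m}_k-\mathbf{m}_0)(\mathbf{m}_k-\mathbf{m}_0)^T
\end{aligned}
\]

Comparing this with Eq (10.62), we only need to prove:

\[
N_k\cdot(\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T + \beta_0\cdot(\mathbf{m}_k-\mathbf{m}_0)(\mathbf{m}_k-\mathbf{m}_0)^T = \frac{\beta_0 N_k}{\beta_0+N_k}(\bar{\mathbf{x}}_k-\mathbf{m}_0)(\bar{\mathbf{x}}_k-\mathbf{m}_0)^T
\]

Let's start from the left-hand side, writing $2\mathbf{a}\mathbf{b}^T$ as shorthand for $\mathbf{a}\mathbf{b}^T + \mathbf{b}\mathbf{a}^T$ and substituting Eq (10.60) and Eq (10.61) for $\mathbf{m}_k$:

\[
\begin{aligned}
(\text{left}) &= N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - 2N_k\bar{\mathbf{x}}_k\mathbf{m}_k^T + N_k\mathbf{m}_k\mathbf{m}_k^T + \beta_0\mathbf{m}_k\mathbf{m}_k^T - 2\beta_0\mathbf{m}_k\mathbf{m}_0^T + \beta_0\mathbf{m}_0\mathbf{m}_0^T \\
&= N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - 2N_k\bar{\mathbf{x}}_k\Big(\frac{\beta_0\mathbf{m}_0+N_k\bar{\mathbf{x}}_k}{\beta_0+N_k}\Big)^T + (N_k+\beta_0)\Big(\frac{\beta_0\mathbf{m}_0+N_k\bar{\mathbf{x}}_k}{\beta_0+N_k}\Big)\Big(\frac{\beta_0\mathbf{m}_0+N_k\bar{\mathbf{x}}_k}{\beta_0+N_k}\Big)^T \\
&\quad - 2\beta_0\Big(\frac{\beta_0\mathbf{m}_0+N_k\bar{\mathbf{x}}_k}{\beta_0+N_k}\Big)\mathbf{m}_0^T + \beta_0\mathbf{m}_0\mathbf{m}_0^T
\end{aligned}
\]

Then we collect terms in $\bar{\mathbf{x}}_k$, and we will see that the coefficients match those on the right-hand side. Here as an example, we calculate the coefficient of the quadratic term $\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T$:

\[
\begin{aligned}
(\text{quad}) &= N_k - 2N_k\frac{N_k}{\beta_0+N_k} + (\beta_0+N_k)\Big(\frac{N_k}{\beta_0+N_k}\Big)^2 \\
&= \frac{N_k(\beta_0+N_k) - 2N_k^2 + N_k^2}{\beta_0+N_k} \\
&= \frac{\beta_0 N_k}{\beta_0+N_k}
\end{aligned}
\]

The linear and the constant terms are handled similarly, and we omit the details here.

The update formulas for $\alpha_k$ and $r_{nk}$ still remain to be derived. Noticing that only Eq (10.72), (10.73) and (10.76) depend on $\alpha_k$, we now calculate the derivative of $\mathcal{L}$ with respect to $\alpha_k$, where $\psi(\cdot)$ is the digamma function and $\widehat{\alpha} = \sum_k\alpha_k$:

\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\alpha_k} &= \sum_{n=1}^{N} r_{nk}\frac{d\ln\widetilde{\pi}_k}{d\alpha_k} + (\alpha_0-1)\frac{d\ln\widetilde{\pi}_k}{d\alpha_k} - \left[(\alpha_k-1)\frac{d\ln\widetilde{\pi}_k}{d\alpha_k} + \ln\widetilde{\pi}_k + \frac{d\ln C(\boldsymbol{\alpha})}{d\alpha_k}\right] \\
&= (N_k+\alpha_0-\alpha_k)\frac{d\ln\widetilde{\pi}_k}{d\alpha_k} - \ln\widetilde{\pi}_k - \frac{d\ln C(\boldsymbol{\alpha})}{d\alpha_k} \\
&= (N_k+\alpha_0-\alpha_k)\left[\psi'(\alpha_k)-\psi'(\widehat{\alpha})\right] - \left[\psi(\alpha_k)-\psi(\widehat{\alpha})\right] - \frac{d\left[\ln\Gamma(\widehat{\alpha})-\ln\Gamma(\alpha_k)\right]}{d\alpha_k} \\
&= (N_k+\alpha_0-\alpha_k)\left[\psi'(\alpha_k)-\psi'(\widehat{\alpha})\right] - \left[\psi(\alpha_k)-\psi(\widehat{\alpha})\right] - \left[\psi(\widehat{\alpha})-\psi(\alpha_k)\right] \\
&= (N_k+\alpha_0-\alpha_k)\left[\psi'(\alpha_k)-\psi'(\widehat{\alpha})\right] = 0
\end{aligned}
\]

where we have used (B.25) and Eq (10.66). Therefore, we obtain Eq (10.58). Finally, we are required to derive an update formula for $r_{nk}$. Noting that $\bar{\mathbf{x}}_k$, $\mathbf{S}_k$ and $N_k$ also contain $r_{nk}$, we conclude that Eq (10.71), (10.72) and (10.75) depend on $r_{nk}$.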
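Using the definition of $N_k$, i.e., Eq (10.51), we can obtain:

$$L \propto \frac{1}{2}\sum_{k,n}r_{nk}\Big\{\ln\widetilde{\Lambda}_k - D\beta_k^{-1}\Big\} - \frac{1}{2}\sum_kN_kv_k\,\mathrm{Tr}[(\mathbf{S}_k+\mathbf{A}_k)\mathbf{W}_k] + \sum_{k,n}r_{nk}\ln\widetilde{\pi}_k - \sum_{k,n}r_{nk}\ln r_{nk}$$

Because $r_{nk}$ is subject to the constraint $\sum_kr_{nk}=1$, we cannot simply set the derivative to zero; we must introduce a Lagrange multiplier. Before doing so, let's simplify $\mathbf{S}_k+\mathbf{A}_k$ (cross terms $2\mathbf{a}\mathbf{b}^T$ again stand for $\mathbf{a}\mathbf{b}^T+\mathbf{b}\mathbf{a}^T$):

$$\begin{aligned}
\mathbf{S}_k+\mathbf{A}_k &= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}(\mathbf{x}_n-\bar{\mathbf{x}}_k)(\mathbf{x}_n-\bar{\mathbf{x}}_k)^T + (\bar{\mathbf{x}}_k-\mathbf{m}_k)(\bar{\mathbf{x}}_k-\mathbf{m}_k)^T \\
&= \frac{1}{N_k}\sum_{n=1}^{N}\Big[r_{nk}\mathbf{x}_n\mathbf{x}_n^T - 2r_{nk}\mathbf{x}_n\bar{\mathbf{x}}_k^T + r_{nk}\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T\Big] + \bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - 2\bar{\mathbf{x}}_k\mathbf{m}_k^T + \mathbf{m}_k\mathbf{m}_k^T \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}\mathbf{x}_n\mathbf{x}_n^T - \frac{2N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T}{N_k} + \frac{N_k\bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T}{N_k} + \bar{\mathbf{x}}_k\bar{\mathbf{x}}_k^T - 2\bar{\mathbf{x}}_k\mathbf{m}_k^T + \mathbf{m}_k\mathbf{m}_k^T \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}\mathbf{x}_n\mathbf{x}_n^T - 2\bar{\mathbf{x}}_k\mathbf{m}_k^T + \mathbf{m}_k\mathbf{m}_k^T \\
&= \frac{1}{N_k}\Big(\sum_{n=1}^{N}r_{nk}\mathbf{x}_n\mathbf{x}_n^T - 2N_k\bar{\mathbf{x}}_k\mathbf{m}_k^T + N_k\mathbf{m}_k\mathbf{m}_k^T\Big) \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}\big(\mathbf{x}_n\mathbf{x}_n^T - 2\mathbf{x}_n\mathbf{m}_k^T + \mathbf{m}_k\mathbf{m}_k^T\big) \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}(\mathbf{x}_n-\mathbf{m}_k)(\mathbf{x}_n-\mathbf{m}_k)^T
\end{aligned}$$

where we have used $\sum_nr_{nk}\mathbf{x}_n = N_k\bar{\mathbf{x}}_k$ and $\sum_nr_{nk}=N_k$. Therefore, we obtain:

$$\begin{aligned}
L &\propto \frac{1}{2}\sum_{k,n}r_{nk}\Big\{\ln\widetilde{\Lambda}_k-D\beta_k^{-1}\Big\} + \sum_{k,n}r_{nk}\ln\widetilde{\pi}_k - \sum_{k,n}r_{nk}\ln r_{nk} - \frac{1}{2}\sum_kN_kv_k\,\mathrm{Tr}[(\mathbf{S}_k+\mathbf{A}_k)\mathbf{W}_k] \\
&= \frac{1}{2}\sum_{k,n}r_{nk}\Big\{\ln\widetilde{\Lambda}_k-D\beta_k^{-1}\Big\} + \sum_{k,n}r_{nk}\ln\widetilde{\pi}_k - \sum_{k,n}r_{nk}\ln r_{nk} - \frac{1}{2}\sum_{k=1}^{K}\sum_{n=1}^{N}v_kr_{nk}(\mathbf{x}_n-\mathbf{m}_k)^T\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k)
\end{aligned}$$

Introducing Lagrange multipliers $\lambda_n$, we obtain:

$$(\text{Lagrange}) = \frac{1}{2}\sum_{k,n}r_{nk}\Big\{\ln\widetilde{\Lambda}_k-D\beta_k^{-1}\Big\} + \sum_{k,n}r_{nk}\ln\widetilde{\pi}_k - \sum_{k,n}r_{nk}\ln r_{nk} - \frac{1}{2}\sum_{k=1}^{K}\sum_{n=1}^{N}v_kr_{nk}(\mathbf{x}_n-\mathbf{m}_k)^T\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k) + \sum_{n=1}^{N}\lambda_n\Big(1-\sum_kr_{nk}\Big)$$

Calculating the derivative with respect to $r_{nk}$ and setting it to zero, we obtain:

$$\frac{\partial(\text{Lagrange})}{\partial r_{nk}} = \frac{1}{2}\Big\{\ln\widetilde{\Lambda}_k-D\beta_k^{-1}\Big\} + \ln\widetilde{\pi}_k - [\ln r_{nk}+1] - \frac{1}{2}v_k(\mathbf{x}_n-\mathbf{m}_k)^T\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k) - \lambda_n = 0$$

Moving $\ln r_{nk}$ to the right side and then exponentiating both sides, we obtain Eq (10.67), and the normalized $r_{nk}$ is given by Eq (10.49), (10.46), and (10.64)-(10.66).

Problem 10.19 Solution

Let's start from the definition, i.e., Eq (10.78).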
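$$\begin{aligned}
p(\widehat{\mathbf{x}}|\mathbf{X}) &= \sum_{\widehat{\mathbf{z}}}\iiint p(\widehat{\mathbf{x}}|\widehat{\mathbf{z}},\boldsymbol{\mu},\boldsymbol{\Lambda})\,p(\widehat{\mathbf{z}}|\boldsymbol{\pi})\,p(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}|\mathbf{X})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{\widehat{\mathbf{z}}}\iiint\prod_{k=1}^{K}\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})^{\widehat{z}_k}\cdot\prod_{k=1}^{K}\pi_k^{\widehat{z}_k}\cdot p(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}|\mathbf{X})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&\approx \sum_{\widehat{\mathbf{z}}}\iiint\prod_{k=1}^{K}\Big[\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\Big]^{\widehat{z}_k}\cdot q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{k=1}^{K}\iiint\Big[\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\Big]\cdot q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{k=1}^{K}\iiint\Big[\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\Big]\cdot q(\boldsymbol{\pi})\cdot\prod_{j=1}^{K}q(\boldsymbol{\mu}_j,\boldsymbol{\Lambda}_j)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda}
\end{aligned}$$

where we have used the fact that $\widehat{\mathbf{z}}$ uses a one-of-$K$ coding scheme. Recalling that $\boldsymbol{\mu}=\{\boldsymbol{\mu}_k\}$ and $\boldsymbol{\Lambda}=\{\boldsymbol{\Lambda}_k\}$, the term inside the summation can be simplified further: for every index $j\neq k$, the integral over $\boldsymbol{\mu}_j$ and $\boldsymbol{\Lambda}_j$ equals 1, i.e.,

$$\begin{aligned}
p(\widehat{\mathbf{x}}|\mathbf{X}) &= \sum_{k=1}^{K}\iiint\Big[\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\Big]\cdot q(\boldsymbol{\pi})\cdot\prod_{j=1}^{K}q(\boldsymbol{\mu}_j,\boldsymbol{\Lambda}_j)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{k=1}^{K}\iiint\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\cdot q(\boldsymbol{\pi})\cdot q(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}_k\,d\boldsymbol{\Lambda}_k \\
&= \sum_{k=1}^{K}\iiint\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\cdot\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})\cdot\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}_k\,d\boldsymbol{\Lambda}_k
\end{aligned}$$

In the expression above, only $\pi_k\cdot\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})$ contains $\boldsymbol{\pi}$, and we know that the expectation of $\pi_k$ with respect to $\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})$ is $\alpha_k/\widehat{\alpha}$. Therefore, we can obtain:

$$\begin{aligned}
p(\widehat{\mathbf{x}}|\mathbf{X}) &= \sum_{k=1}^{K}\frac{\alpha_k}{\widehat{\alpha}}\iint\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\mu}_k\,d\boldsymbol{\Lambda}_k \\
&= \sum_{k=1}^{K}\frac{\alpha_k}{\widehat{\alpha}}\int\Big[\int\mathcal{N}(\widehat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,d\boldsymbol{\mu}_k\Big]\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\Lambda}_k \\
&= \sum_{k=1}^{K}\frac{\alpha_k}{\widehat{\alpha}}\int\mathcal{N}\big(\widehat{\mathbf{x}}|\mathbf{m}_k,(1+\beta_k^{-1})\boldsymbol{\Lambda}_k^{-1}\big)\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\Lambda}_k
\end{aligned}$$

Notice that the Wishart distribution is a conjugate prior for the Gaussian distribution with known mean and unknown precision.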
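We conclude that the product $\mathcal{N}(\widehat{\mathbf{x}}|\mathbf{m}_k,(1+\beta_k^{-1})\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)$ is again an unnormalized Wishart distribution, which can be verified by focusing on the dependence on $\boldsymbol{\Lambda}_k$:

$$(\text{product}) \propto |\boldsymbol{\Lambda}_k|^{1/2+(v_k-D-1)/2}\cdot\exp\Big\{-\frac{1}{2}\mathrm{Tr}[\boldsymbol{\Lambda}_k\mathbf{W}_k^{-1}] - \frac{\mathrm{Tr}[\boldsymbol{\Lambda}_k\cdot(\widehat{\mathbf{x}}-\mathbf{m}_k)(\widehat{\mathbf{x}}-\mathbf{m}_k)^T]}{2(1+\beta_k^{-1})}\Big\} \propto \mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}',v')$$

where we have defined:

$$v' = v_k+1 \quad\text{and}\quad [\mathbf{W}']^{-1} = \frac{(\widehat{\mathbf{x}}-\mathbf{m}_k)(\widehat{\mathbf{x}}-\mathbf{m}_k)^T}{1+\beta_k^{-1}} + \mathbf{W}_k^{-1}$$

Using the normalization constant of the Wishart distribution, i.e., (B.79), the integral over $\boldsymbol{\Lambda}_k$ of the unnormalized Wishart is proportional to $1/B(\mathbf{W}',v')$, so the $k$-th term of

$$p(\widehat{\mathbf{x}}|\mathbf{X}) = \sum_{k=1}^{K}\frac{\alpha_k}{\widehat{\alpha}}\int\mathcal{N}\big(\widehat{\mathbf{x}}|\mathbf{m}_k,(1+\beta_k^{-1})\boldsymbol{\Lambda}_k^{-1}\big)\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\Lambda}_k$$

depends on $\widehat{\mathbf{x}}$ through

$$\frac{1}{B(\mathbf{W}',v')} \propto \Big|\frac{(\widehat{\mathbf{x}}-\mathbf{m}_k)(\widehat{\mathbf{x}}-\mathbf{m}_k)^T}{1+\beta_k^{-1}}+\mathbf{W}_k^{-1}\Big|^{-(v_k+1)/2} \propto \Big|\frac{1}{1+\beta_k^{-1}}\mathbf{W}_k(\widehat{\mathbf{x}}-\mathbf{m}_k)(\widehat{\mathbf{x}}-\mathbf{m}_k)^T+\mathbf{I}\Big|^{-(v_k+1)/2}$$

Here we have only considered the terms that depend on $\widehat{\mathbf{x}}$. Next, we use the identity:

$$|\mathbf{I}+\mathbf{a}\mathbf{b}^T| = 1+\mathbf{a}^T\mathbf{b}$$

With it, the $k$-th component simplifies to:

$$\Big|\frac{1}{1+\beta_k^{-1}}\mathbf{W}_k(\widehat{\mathbf{x}}-\mathbf{m}_k)(\widehat{\mathbf{x}}-\mathbf{m}_k)^T+\mathbf{I}\Big|^{-(v_k+1)/2} = \Big[1+\frac{1}{1+\beta_k^{-1}}(\widehat{\mathbf{x}}-\mathbf{m}_k)^T\mathbf{W}_k(\widehat{\mathbf{x}}-\mathbf{m}_k)\Big]^{-(v_k+1)/2}$$

By comparing it with (B.68), we notice that each component is a Student's t distribution, whose parameters are defined by Eq (10.81)-(10.82).

Problem 10.20 Solution

Let's begin by dealing with $q^\star(\boldsymbol{\Lambda}_k)$. When $N\to+\infty$, we know that $N_k$ also approaches $+\infty$ based on Eq (10.51). Therefore, $[\mathbf{W}_k]^{-1}\to N_k\mathbf{S}_k$ and $v_k\to N_k$. Using (B.80), we conclude that $\mathbb{E}[\boldsymbol{\Lambda}_k] = v_k\mathbf{W}_k\to\mathbf{S}_k^{-1}$. If we can now prove that the entropy $H[\boldsymbol{\Lambda}_k]$ is zero, we can conclude that the distribution collapses to a Dirac function, i.e., the distribution is sharply peaked around $\mathbf{S}_k^{-1}$, which is identical to the EM result for the Gaussian mixture given by Eq (9.25). Therefore, let's now start from $\ln B(\mathbf{W}_k,v_k)$, i.e., (B.79).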
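$$\begin{aligned}
\ln B(\mathbf{W}_k,v_k) &= -\frac{v_k}{2}\ln|\mathbf{W}_k| - \frac{v_kD}{2}\ln 2 - \frac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{v_k+1-i}{2}\Big) \\
&\to \frac{N_k}{2}\ln|N_k\mathbf{S}_k| - \frac{N_kD}{2}\ln 2 - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{N_k+1-i}{2}\Big) \\
&= \frac{N_k}{2}\big(D\ln N_k+\ln|\mathbf{S}_k|-D\ln 2\big) - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{N_k-1-i}{2}+1\Big) \\
&\approx \frac{N_k}{2}\Big(D\ln\frac{N_k}{2}+\ln|\mathbf{S}_k|\Big) - \sum_{i=1}^{D}\Big[\frac{1}{2}\ln 2\pi - \frac{N_k-1-i}{2} + \Big(\frac{N_k-1-i}{2}+\frac{1}{2}\Big)\ln\frac{N_k-1-i}{2}\Big] \\
&\approx \frac{N_k}{2}\Big(D\ln\frac{N_k}{2}+\ln|\mathbf{S}_k|\Big) - \sum_{i=1}^{D}\Big[-\frac{N_k}{2}+\frac{N_k}{2}\ln\frac{N_k}{2}\Big] \\
&= \frac{N_k}{2}\Big(D\ln\frac{N_k}{2}+\ln|\mathbf{S}_k|\Big) + \frac{N_kD}{2} - \frac{N_kD}{2}\ln\frac{N_k}{2} \\
&= \frac{N_k}{2}\big(D+\ln|\mathbf{S}_k|\big)
\end{aligned}$$

where we have dropped terms that do not grow with $N_k$ and used Eq (1.146) to approximate the logarithm of the Gamma function. Next we deal with $\mathbb{E}[\ln|\boldsymbol{\Lambda}_k|]$ based on (B.81), where $\psi$ is the digamma function:

$$\begin{aligned}
\mathbb{E}[\ln|\boldsymbol{\Lambda}_k|] &= \sum_{i=1}^{D}\psi\Big(\frac{v_k+1-i}{2}\Big) + D\ln 2 + \ln|\mathbf{W}_k| \\
&\to \sum_{i=1}^{D}\ln\Big(\frac{N_k+1-i}{2}\Big) + D\ln 2 - \ln|N_k\mathbf{S}_k| \\
&\approx \sum_{i=1}^{D}\ln\frac{N_k}{2} + D\ln 2 - D\ln N_k - \ln|\mathbf{S}_k| \\
&= D\ln\frac{N_k}{2} + D\ln 2 - D\ln N_k - \ln|\mathbf{S}_k| \\
&= -\ln|\mathbf{S}_k|
\end{aligned}$$

where we have used Eq (10.241) to approximate $\psi\big(\frac{v_k+1-i}{2}\big)$. Now we are ready to deal with the entropy $H[q(\boldsymbol{\Lambda}_k)]$:

$$\begin{aligned}
H[q(\boldsymbol{\Lambda}_k)] &= -\ln B(\mathbf{W}_k,v_k) - \frac{v_k-D-1}{2}\mathbb{E}[\ln|\boldsymbol{\Lambda}_k|] + \frac{v_kD}{2} \\
&\to -\frac{N_k}{2}\big(D+\ln|\mathbf{S}_k|\big) + \frac{N_k}{2}\ln|\mathbf{S}_k| + \frac{N_kD}{2} = 0
\end{aligned}$$

Therefore, we can conclude that the distribution $q^\star(\boldsymbol{\Lambda}_k)$ collapses to a Dirac function at $\mathbf{S}_k^{-1}$. In other words, when $N\to+\infty$, $\boldsymbol{\Lambda}_k$ can only take the single value $\mathbf{S}_k^{-1}$.

Next, we deal with $q^\star(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k)$. According to Eq (10.60), when $N\to+\infty$ we have $\beta_k\to N_k$, and thus $\mathbf{m}_k\to\bar{\mathbf{x}}_k$ based on Eq (10.61). Since $q^\star(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k) = \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})$ and the precision $\beta_k\boldsymbol{\Lambda}_k\to N_k\mathbf{S}_k^{-1}$ is large, we conclude that when $N\to\infty$, $\boldsymbol{\mu}_k$ also takes only the single value $\bar{\mathbf{x}}_k$, which is identical to the EM result for the Gaussian mixture, i.e., Eq (9.24).

Finally, we consider $q^\star(\boldsymbol{\pi})$ given by Eq (10.54). Since $\alpha_k\to N_k$ based on Eq (10.58), we see that $\mathbb{E}[\pi_k] = \alpha_k/\widehat{\alpha}\to N_k/N$ and

$$\mathrm{var}[\pi_k] = \frac{\alpha_k(\widehat{\alpha}-\alpha_k)}{\widehat{\alpha}^2(\widehat{\alpha}+1)} \le \frac{\widehat{\alpha}\cdot\widehat{\alpha}}{\widehat{\alpha}^3} = \frac{1}{\widehat{\alpha}} \to 0$$

We can therefore also conclude that $\pi_k$ takes only the single value $N_k/N$, which is identical to the EM result for the Gaussian mixture, i.e., Eq (9.26).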
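It is now straightforward to see from Eq (10.80) that the predictive distribution reduces to a mixture of Gaussians. Because $\boldsymbol{\pi}$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Lambda}_k$ all reduce to Dirac functions, the integration is easy to perform.

Problem 10.21 Solution

This can be verified directly. The total number of labelings equals the number of ways to assign $K$ labels to $K$ objects. For the first label we have $K$ choices, for the second label $K-1$ choices, and so on. Therefore, the total number is $K!$.

Problem 10.22 Solution

Let's explain this problem in detail. Suppose that we have a mixture of Gaussians $p(\mathbf{Z}|\mathbf{X})$ that we are required to approximate. It has $K$ components, and the modes are denoted $\{\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,...,\boldsymbol{\mu}_K\}$. We use variational inference, i.e., Eq (10.3), to minimize the KL divergence $\mathrm{KL}(q\|p)$, and obtain an approximate distribution $q_s(\mathbf{Z})$ and a corresponding lower bound $L(q_s)$. According to the problem description, this approximate distribution $q_s(\mathbf{Z})$ is a single-mode Gaussian located at one of the modes of $p(\mathbf{Z}|\mathbf{X})$, i.e., $q_s(\mathbf{Z}) = \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_s,\boldsymbol{\Sigma}_s)$, where $s\in\{1,2,...,K\}$. Now we replicate this $q_s$ a total of $K!$ times, and move each copy to the center of one mode. We can then write down the mixture made up of $K!$ Gaussian distributions:

$$q_m(\mathbf{Z}) = \frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)$$

where $C(m)$ represents the mode of the $m$-th component, $C(m)\in\{1,2,...,K\}$. What the problem wants us to prove is:

$$L(q_s)+\ln K! \approx L(q_m)$$

In other words, the lower bound obtained with $q_m$, i.e., $L(q_m)$, is larger by $\ln K!$ than the one obtained with $q_s$, i.e., $L(q_s)$. Based on Eq (10.3), we can equivalently work with the KL divergence. According to Eq (10.4), we can obtain:

$$\begin{aligned}
\mathrm{KL}(q_m\|p) &= -\int q_m(\mathbf{Z})\ln\frac{p(\mathbf{Z}|\mathbf{X})}{q_m(\mathbf{Z})}\,d\mathbf{Z} \\
&= -\int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_m(\mathbf{Z})\ln q_m(\mathbf{Z})\,d\mathbf{Z} \\
&= -\int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_m(\mathbf{Z})\ln\Big\{\frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \\
&= -\int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_m(\mathbf{Z})\ln\Big\{\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} + \ln\frac{1}{K!} \\
&= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_m(\mathbf{Z})\ln\Big\{\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \\
&= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \frac{1}{K!}\int\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\sum_{m'=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m')},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z}
\end{aligned}$$

In order to simplify the KL divergence further, we write down two useful equations. First, we use the "negligible overlap" property. More specifically, under the assumption that the overlap between components is negligible, we can obtain:

$$\int\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\sum_{m'=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m')},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \approx \int\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z}$$

The second equation is that for any two components $m_1$, $m_2$, we have:

$$\int q_s\ln q_s\,d\mathbf{Z} = \int\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_1)},\boldsymbol{\Sigma}_s)\ln\Big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_1)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} = \int\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_2)},\boldsymbol{\Sigma}_s)\ln\Big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_2)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z}$$

because the entropy of a Gaussian depends only on its covariance, not on its mean. Therefore, we can now obtain:

$$\begin{aligned}
\mathrm{KL}(q_m\|p) &= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \frac{1}{K!}\int\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\sum_{m'=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m')},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \\
&\approx -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \frac{1}{K!}\sum_{m=1}^{K!}\int\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \\
&= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \quad(\forall m) \\
&= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_s(\mathbf{Z})\ln q_s(\mathbf{Z})\,d\mathbf{Z} \\
&= -\ln K! - \int\Big\{\frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_s(\mathbf{Z})\ln q_s(\mathbf{Z})\,d\mathbf{Z} \\
&\approx -\ln K! - \int q_s(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_s(\mathbf{Z})\ln q_s(\mathbf{Z})\,d\mathbf{Z} \\
&= -\ln K! - \int q_s(\mathbf{Z})\ln\frac{p(\mathbf{Z}|\mathbf{X})}{q_s(\mathbf{Z})}\,d\mathbf{Z} = -\ln K! + \mathrm{KL}(q_s\|p)
\end{aligned}$$

To obtain the desired result we have relied on approximations, and you should keep in mind that they are rough.

Problem 10.23 Solution

Let's go back to Eq (10.70). If we now treat $\pi_k$ as a parameter without a prior distribution, $\pi_k$ will only occur in the second term of Eq (10.70), i.e., $\mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})]$. Therefore, we can obtain:

$$L \propto \mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\ln\pi_k$$

where we have used Eq (10.72); since $\pi_k$ is now a point estimate, the expectation $\mathbb{E}[\ln\pi_k]$ reduces to $\ln\pi_k$. Now we introduce a Lagrange multiplier.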
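$$\mathrm{Lag} = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\ln\pi_k + \lambda\cdot\Big(\sum_{k=1}^{K}\pi_k-1\Big)$$

Calculating the derivative of the expression above with respect to $\pi_k$ and setting it to zero, we obtain:

$$\frac{\sum_{n=1}^{N}r_{nk}}{\pi_k}+\lambda = \frac{N_k}{\pi_k}+\lambda = 0 \qquad (*)$$

Multiplying both sides by $\pi_k$ and then summing both sides over $k$, we obtain:

$$\sum_{k=1}^{K}N_k + \lambda\sum_{k=1}^{K}\pi_k = 0$$

Since the sum of $N_k$ over $k$ equals $N$ and the sum of $\pi_k$ over $k$ equals 1, rearranging the equation above yields:

$$\lambda = -N$$

Substituting it back into $(*)$, we obtain:

$$\pi_k = \frac{N_k}{N} = \frac{1}{N}\sum_{n=1}^{N}r_{nk}$$

just as required.

Problem 10.24 Solution

Recall that the singularity in maximum likelihood estimation of a Gaussian mixture arises when the determinant of the covariance matrix $\boldsymbol{\Sigma}_k$ approaches 0, so that the value of $\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$ approaches $+\infty$. For more details, you can read Section 9.2, especially page 434. In this problem, the intuition is that since we have introduced a prior distribution for $\boldsymbol{\Lambda}_k$, this singularity will not exist when adopting MAP. Let's verify this statement, beginning by writing down the posterior:

$$\begin{aligned}
p(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}|\mathbf{X}) &\propto p(\mathbf{X}|\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})\cdot p(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}) \\
&= p(\mathbf{X}|\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})\cdot p(\mathbf{Z}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})\cdot p(\boldsymbol{\pi}|\boldsymbol{\mu},\boldsymbol{\Lambda})\cdot p(\boldsymbol{\mu},\boldsymbol{\Lambda}) \\
&= p(\mathbf{X}|\mathbf{Z},\boldsymbol{\mu},\boldsymbol{\Lambda})\cdot p(\mathbf{Z}|\boldsymbol{\pi})\cdot p(\boldsymbol{\pi})\cdot p(\boldsymbol{\mu},\boldsymbol{\Lambda})
\end{aligned}$$

Note that in the first step we have used Bayes' theorem, in the second step we have used the fact that $p(a,b) = p(a|b)\cdot p(b)$, and in the last step we have dropped the extra dependencies based on the model definition, i.e., Eq (10.37)-(10.40). Now let's calculate the MAP solution for $\boldsymbol{\Lambda}_k$.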
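$$\begin{aligned}
\ln p(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}|\mathbf{X}) &\propto \frac{1}{2}\sum_{n=1}^{N}z_{nk}\Big\{\ln|\boldsymbol{\Lambda}_k| - (\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\Big\} + \frac{1}{2}\Big\{\ln|\boldsymbol{\Lambda}_k| - \beta_0(\boldsymbol{\mu}_k-\mathbf{m}_0)^T\boldsymbol{\Lambda}_k(\boldsymbol{\mu}_k-\mathbf{m}_0)\Big\} \\
&\quad + \frac{1}{2}\Big\{(v_0-D-1)\ln|\boldsymbol{\Lambda}_k| - \mathrm{Tr}[\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k]\Big\} + \text{const} \\
&= c\cdot\ln|\boldsymbol{\Lambda}_k| - \mathrm{Tr}[\mathbf{B}\boldsymbol{\Lambda}_k] + \text{const}
\end{aligned}$$

where const collects the terms independent of $\boldsymbol{\Lambda}_k$, and we have defined:

$$c = \frac{1}{2}\Big(v_0-D+\sum_{n=1}^{N}z_{nk}\Big)$$

and

$$\mathbf{B} = \frac{1}{2}\Big\{\sum_{n=1}^{N}z_{nk}(\mathbf{x}_n-\boldsymbol{\mu}_k)(\mathbf{x}_n-\boldsymbol{\mu}_k)^T + \beta_0(\boldsymbol{\mu}_k-\mathbf{m}_0)(\boldsymbol{\mu}_k-\mathbf{m}_0)^T + \mathbf{W}_0^{-1}\Big\}$$

Next we calculate the derivative of the log posterior with respect to $\boldsymbol{\Lambda}_k$ and set it to 0, yielding:

$$c\cdot\boldsymbol{\Lambda}_k^{-1} - \mathbf{B} = 0$$

Therefore, we obtain:

$$\boldsymbol{\Lambda}_k^{-1} = \frac{1}{c}\mathbf{B}$$

Note that in the MAP framework we need to solve for $z_{nk}$ first, and then substitute them into $c$ and $\mathbf{B}$ in the expression above. Nevertheless, from the expression above we can see that $\boldsymbol{\Lambda}_k^{-1}$ cannot have zero determinant: $\mathbf{B}$ contains the term $\frac{1}{2}\mathbf{W}_0^{-1}$, which is positive definite, and the other two terms are positive semidefinite.

Problem 10.25 Solution

We solve this problem qualitatively. As the number of mixture components grows, so does the number of variables that may be correlated, but these variables are treated as independent under the variational approximation when Eq (10.5) is used. Therefore, the proportion of probability mass under the true distribution $p(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}|\mathbf{X})$ that the variational approximation $q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})$ does not capture will grow. As a consequence, the second term in (10.2), the KL divergence between $q(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})$ and $p(\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}|\mathbf{X})$, will increase. To answer whether minimizing $\mathrm{KL}(q\|p)$ under factorization leads us to underestimate or overestimate the number of components, we only need to look at Fig. 10.3, which suggests that we will underestimate the number of components.

Problem 10.26 Solution

In this problem, we also need to consider the prior $p(\beta) = \mathrm{Gam}(\beta|c_0,d_0)$.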
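To be more specific, starting from the original joint distribution $p(\mathbf{t},\mathbf{w},\alpha)$, i.e., Eq (10.90), the joint distribution $p(\mathbf{t},\mathbf{w},\alpha,\beta)$ should now be written as:

$$p(\mathbf{t},\mathbf{w},\alpha,\beta) = p(\mathbf{t}|\mathbf{w},\beta)\,p(\mathbf{w}|\alpha)\,p(\alpha)\,p(\beta)$$

where the first term on the right hand side is given by Eq (10.87), the second by Eq (10.88), the third by Eq (10.89), and the last by $\mathrm{Gam}(\beta|c_0,d_0)$. Using the variational framework, we assume a factorized variational posterior:

$$q(\mathbf{w},\alpha,\beta) = q(\mathbf{w})\,q(\alpha)\,q(\beta)$$

Introducing a Gamma prior for $\beta$ does not affect $q(\alpha)$, because the expectation of $\ln p(\beta)$ can be absorbed into the 'const' term in Eq (10.92). In other words, we still obtain Eq (10.93)-(10.95). Now we deal with $q(\mathbf{w})$. By analogy to Eq (10.96)-(10.98), we can obtain:

$$\begin{aligned}
\ln q^\star(\mathbf{w}) &\propto \mathbb{E}_\beta[\ln p(\mathbf{t}|\mathbf{w},\beta)] + \mathbb{E}_\alpha[\ln p(\mathbf{w}|\alpha)] + \text{const} \\
&\propto -\frac{\mathbb{E}_\beta[\beta]}{2}\sum_{n=1}^{N}(t_n-\mathbf{w}^T\boldsymbol{\phi}_n)^2 - \frac{\mathbb{E}_\alpha[\alpha]}{2}\mathbf{w}^T\mathbf{w} + \text{const} \\
&= -\frac{1}{2}\mathbf{w}^T\Big\{\mathbb{E}_\beta[\beta]\cdot\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathbb{E}_\alpha[\alpha]\cdot\mathbf{I}\Big\}\mathbf{w} + \mathbb{E}_\beta[\beta]\,\mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{t} + \text{const}
\end{aligned}$$

Therefore, by analogy to Eq (10.99)-(10.101), we can conclude that $q^\star(\mathbf{w})$ is still Gaussian, i.e., $q^\star(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$, where we have defined:

$$\mathbf{m}_N = \mathbb{E}_\beta[\beta]\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} \quad\text{and}\quad \mathbf{S}_N = \Big\{\mathbb{E}_\beta[\beta]\cdot\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathbb{E}_\alpha[\alpha]\cdot\mathbf{I}\Big\}^{-1}$$

Next, we deal with $q(\beta)$.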
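According to the definition, we have:

$$\begin{aligned}
\ln q^\star(\beta) &\propto \mathbb{E}_{\mathbf{w}}[\ln p(\mathbf{t}|\mathbf{w},\beta)] + \ln p(\beta) + \text{const} \\
&\propto \frac{N}{2}\ln\beta - \frac{\beta}{2}\,\mathbb{E}\Big[\sum_{n=1}^{N}(t_n-\mathbf{w}^T\boldsymbol{\phi}_n)^2\Big] + (c_0-1)\ln\beta - d_0\beta \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \frac{\beta}{2}\,\mathbb{E}\big[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2\big] - d_0\beta \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\Big\{\frac{1}{2}\mathbb{E}\big[\mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - 2\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{t}^T\mathbf{t}\big] + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\Big\{\frac{1}{2}\mathrm{Tr}\big[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}\mathbf{w}^T]\big] - \mathbf{t}^T\boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}] + \frac{1}{2}\mathbf{t}^T\mathbf{t} + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\Big\{\frac{1}{2}\mathrm{Tr}\big[\boldsymbol{\Phi}^T\boldsymbol{\Phi}(\mathbf{m}_N\mathbf{m}_N^T+\mathbf{S}_N)\big] - \mathbf{t}^T\boldsymbol{\Phi}\mathbf{m}_N + \frac{1}{2}\mathbf{t}^T\mathbf{t} + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\Big\{\frac{1}{2}\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \frac{1}{2}\mathbf{m}_N^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{m}_N - \mathbf{t}^T\boldsymbol{\Phi}\mathbf{m}_N + \frac{1}{2}\mathbf{t}^T\mathbf{t} + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\cdot\frac{1}{2}\Big\{\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \|\boldsymbol{\Phi}\mathbf{m}_N-\mathbf{t}\|^2 + 2d_0\Big\}
\end{aligned}$$

Therefore, we obtain $q^\star(\beta) = \mathrm{Gam}(\beta|c_N,d_N)$, where we have defined:

$$c_N = \frac{N}{2}+c_0 \quad\text{and}\quad d_N = d_0 + \frac{1}{2}\Big\{\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \|\boldsymbol{\Phi}\mathbf{m}_N-\mathbf{t}\|^2\Big\}$$

Furthermore, from (B.27), the expectations appearing in $\mathbf{m}_N$ and $\mathbf{S}_N$ can be expressed in terms of $a_N$, $b_N$ and $c_N$, $d_N$:

$$\mathbb{E}[\alpha] = \frac{a_N}{b_N} \quad\text{and}\quad \mathbb{E}[\beta] = \frac{c_N}{d_N}$$

We have now obtained all the update formulas. Next, we calculate the lower bound. Compared with Eq (10.107), the first term on the right hand side must be modified, and two more terms must be added, i.e., $+\mathbb{E}[\ln p(\beta)]$ and $-\mathbb{E}[\ln q^\star(\beta)]$. Let's start with the two added terms, where $\psi$ denotes the digamma function:

$$\begin{aligned}
+\mathbb{E}[\ln p(\beta)] &= (c_0-1)\,\mathbb{E}[\ln\beta] - d_0\,\mathbb{E}[\beta] + c_0\ln d_0 - \ln\Gamma(c_0) \\
&= (c_0-1)\cdot\big(\psi(c_N)-\ln d_N\big) - d_0\frac{c_N}{d_N} + c_0\ln d_0 - \ln\Gamma(c_0)
\end{aligned}$$

where we have used (B.26) and (B.30). Similarly, since $-\mathbb{E}[\ln q^\star(\beta)]$ is the entropy of a Gamma distribution, we have:

$$-\mathbb{E}[\ln q^\star(\beta)] = H[\beta] = \ln\Gamma(c_N) - (c_N-1)\cdot\psi(c_N) - \ln d_N + c_N$$

where we have used (B.31).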
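The coupled update formulas for $q(\mathbf{w})$, $q(\alpha)$ and $q(\beta)$ obtained above can be iterated until they stop changing. The following is a minimal sketch, not a reference implementation, that specializes them to a single basis function ($M=1$) so that $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$, $\mathbf{S}_N$ and $\mathbf{m}_N$ are scalars and only the standard library is needed. The function name `vb_linear_regression_1d`, the hyperparameter values, the starting expectations and the synthetic data are all assumptions made for illustration; the updates for $a_N$ and $b_N$ are taken from Eq (10.94)-(10.95).

```python
import math

def vb_linear_regression_1d(phi, t, a0=1e-2, b0=1e-2, c0=1e-2, d0=1e-2, n_iter=100):
    """Iterate the factorized updates for a single-weight model t ~ w * phi.

    Hypothetical helper: with M = 1 every matrix in the update formulas is a scalar.
    """
    n, m_dim = len(t), 1
    sum_phi2 = sum(p * p for p in phi)              # Phi^T Phi
    sum_phit = sum(p * y for p, y in zip(phi, t))   # Phi^T t
    e_alpha, e_beta = 1.0, 1.0                      # arbitrary starting expectations
    a_n = a0 + m_dim / 2.0                          # Eq (10.94), does not change
    c_n = c0 + n / 2.0                              # c_N = N/2 + c_0, does not change
    for _ in range(n_iter):
        s_n = 1.0 / (e_beta * sum_phi2 + e_alpha)   # S_N
        m_n = e_beta * s_n * sum_phit               # m_N
        b_n = b0 + 0.5 * (m_n * m_n + s_n)          # Eq (10.95): b_0 + E[w^T w] / 2
        resid = sum((p * m_n - y) ** 2 for p, y in zip(phi, t))
        d_n = d0 + 0.5 * (sum_phi2 * s_n + resid)   # d_N
        e_alpha, e_beta = a_n / b_n, c_n / d_n      # E[alpha], E[beta]
    return m_n, s_n, e_alpha, e_beta

# Synthetic data (an assumption): slope 2 plus a small deterministic perturbation.
phi = [0.1 * i for i in range(1, 51)]
t = [2.0 * p + 0.1 * math.sin(7.0 * i) for i, p in enumerate(phi, start=1)]
m_n, s_n, e_alpha, e_beta = vb_linear_regression_1d(phi, t)
print(m_n, s_n, e_alpha, e_beta)
```

On this data the posterior mean of the weight should settle close to the generating slope, and $\mathbb{E}[\beta]$ should be of the order of the inverse mean squared perturbation. A fixed iteration count is used for simplicity; monitoring the lower bound would be the more principled stopping rule.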
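Finally, we deal with the modification of the first term on the right hand side of Eq (10.107):

$$\begin{aligned}
\mathbb{E}_{\beta,\mathbf{w}}[\ln p(\mathbf{t}|\mathbf{w},\beta)] &= \mathbb{E}_\beta\Big\{\frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\mathbb{E}_{\mathbf{w}}\big[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2\big]\Big\} \\
&= \frac{N}{2}\mathbb{E}_\beta[\ln\beta] - \frac{N}{2}\ln 2\pi - \frac{\mathbb{E}_\beta[\beta]}{2}\mathbb{E}_{\mathbf{w}}\big[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2\big] \\
&= \frac{N}{2}\big(\psi(c_N)-\ln d_N-\ln 2\pi\big) - \frac{c_N}{2d_N}\mathbb{E}_{\mathbf{w}}\big[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2\big] \\
&= \frac{N}{2}\big(\psi(c_N)-\ln d_N-\ln 2\pi\big) - \frac{c_N}{2d_N}\Big\{\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \|\boldsymbol{\Phi}\mathbf{m}_N-\mathbf{t}\|^2\Big\}
\end{aligned}$$

where $\psi$ denotes the digamma function. The last question concerns the predictive distribution. It is not difficult to see that the predictive distribution is still given by Eq (10.105) and Eq (10.106), with $1/\beta$ replaced by $1/\mathbb{E}[\beta]$.

Problem 10.27 Solution

Let's deal with the terms in Eq (10.107) one by one. Noticing Eq (10.87), we have:

$$\begin{aligned}
\mathbb{E}_{\mathbf{w}}[\ln p(\mathbf{t}|\mathbf{w})] &= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbb{E}\Big[\sum_{n=1}^{N}(t_n-\mathbf{w}^T\boldsymbol{\phi}_n)^2\Big] \\
&= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbb{E}\Big[\sum_{n=1}^{N}t_n^2 - 2\sum_{n=1}^{N}t_n\cdot\mathbf{w}^T\boldsymbol{\phi}_n + \sum_{n=1}^{N}\mathbf{w}^T\boldsymbol{\phi}_n\cdot\boldsymbol{\phi}_n^T\mathbf{w}\Big] \\
&= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbb{E}\Big[\mathbf{t}^T\mathbf{t} - 2\mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{t} + \mathbf{w}^T\cdot(\boldsymbol{\Phi}^T\boldsymbol{\Phi})\cdot\mathbf{w}\Big] \\
&= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbf{t}^T\mathbf{t} + \beta\,\mathbb{E}[\mathbf{w}^T]\cdot\boldsymbol{\Phi}^T\mathbf{t} - \frac{\beta}{2}\mathrm{Tr}\Big[\mathbb{E}[\mathbf{w}\mathbf{w}^T]\cdot(\boldsymbol{\Phi}^T\boldsymbol{\Phi})\Big]
\end{aligned}$$

where we have defined $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1,\boldsymbol{\phi}_2,...,\boldsymbol{\phi}_N]^T$, i.e., the $i$-th row of $\boldsymbol{\Phi}$ is $\boldsymbol{\phi}_i^T$. Then using Eq (10.99), (10.100) and (10.103), it is easy to obtain (10.108). Next, we deal with the second term by noticing Eq (10.88):

$$\mathbb{E}_{\mathbf{w},\alpha}\big[\ln p(\mathbf{w}|\alpha)\big] = -\frac{M}{2}\ln(2\pi) + \frac{M}{2}\mathbb{E}_\alpha[\ln\alpha] - \frac{\mathbb{E}_\alpha[\alpha]}{2}\cdot\mathbb{E}_{\mathbf{w}}[\mathbf{w}^T\mathbf{w}]$$

Then using Eq (10.93)-(10.95), (B.27), (B.30) and Eq (10.103), we obtain Eq (10.109) just as required. Then we deal with the third term in Eq (10.107) by noticing Eq (10.89):

$$\mathbb{E}_\alpha[\ln p(\alpha)] = a_0\ln b_0 + (a_0-1)\,\mathbb{E}[\ln\alpha] - b_0\,\mathbb{E}[\alpha] - \ln\Gamma(a_0)$$

Similarly, using Eq (10.93)-(10.95), (B.27) and (B.30), we obtain Eq (10.110). Notice that there is a typo in Eq (10.110): its last term should be $\ln\Gamma(a_0)$ instead of $\ln\Gamma(a_N)$. Finally we deal with the last two terms in Eq (10.107).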
We notice that these two terms are the entropies of a Gamma and a Gaussian distribution, so that using (B.31) and (B.41) we can obtain (with $\psi$ the digamma function):

$$-\mathbb{E}_\alpha[\ln q(\alpha)] = H[\alpha] = \ln\Gamma(a_N) - (a_N-1)\cdot\psi(a_N) - \ln b_N + a_N$$

and

$$-\mathbb{E}_{\mathbf{w}}[\ln q(\mathbf{w})] = H[\mathbf{w}] = \frac{1}{2}\ln|\mathbf{S}_N| + \frac{M}{2}\big(1+\ln(2\pi)\big)$$

Problem 10.28 Solution (waiting for update)

Problem 10.29 Solution

The second derivative of $f(x)$ is given by:

$$\frac{d^2}{dx^2}(\ln x) = \frac{d}{dx}\Big(\frac{1}{x}\Big) = -\frac{1}{x^2} < 0$$

Therefore, $f(x) = \ln x$ is concave for $0<x<\infty$. Based on the definition, i.e., Eq (10.133), we can obtain:

$$g(\lambda) = \min_x\{\lambda x-\ln x\}$$

We observe that:

$$\frac{d}{dx}(\lambda x-\ln x) = \lambda-\frac{1}{x}$$

In other words, when $\lambda\le 0$, $\lambda x-\ln x$ always decreases as $x$ increases, so no finite minimum exists. On the other hand, when $\lambda>0$, $\lambda x-\ln x$ achieves its minimum at $x = 1/\lambda$. Therefore, we conclude that:

$$g(\lambda) = \lambda\cdot\frac{1}{\lambda} - \ln\frac{1}{\lambda} = 1+\ln\lambda$$

Substituting $g(\lambda)$ back into Eq (10.132), we obtain:

$$f(x) = \min_\lambda\{\lambda x-1-\ln\lambda\}$$

We calculate the derivative:

$$\frac{d}{d\lambda}(\lambda x-1-\ln\lambda) = x-\frac{1}{\lambda}$$

Therefore, $\lambda x-1-\ln\lambda$ achieves its minimum with respect to $\lambda$ at $\lambda = 1/x$, which yields:

$$f(x) = \frac{1}{x}\cdot x - 1 - \ln\frac{1}{x} = \ln x$$

In other words, we have shown that Eq (10.132) indeed recovers $f(x) = \ln x$.

Problem 10.30 Solution

With $f(x) = \ln\sigma(x) = -\ln(1+e^{-x})$, we begin by calculating the first derivative:

$$\frac{df(x)}{dx} = -\frac{-e^{-x}}{1+e^{-x}} = \sigma(x)\cdot e^{-x}$$

Then we can obtain the second derivative:

$$\frac{d^2f(x)}{dx^2} = \frac{-e^{-x}(1+e^{-x}) - e^{-x}(-e^{-x})}{(1+e^{-x})^2} = -[\sigma(x)]^2\cdot e^{-x} < 0$$

Therefore, the log logistic function $f(x)$ is concave. Using this concavity, we can obtain:

$$f(x) \le f(\xi) + f'(\xi)\cdot(x-\xi)$$

which gives

$$\ln\sigma(x) \le \ln\sigma(\xi) + \sigma(\xi)\cdot e^{-\xi}\cdot(x-\xi) \qquad (*)$$

Comparing the expression above with Eq (10.136), we define $\lambda = \sigma(\xi)\cdot e^{-\xi}$. Then we can obtain:

$$\lambda = \sigma(\xi)\cdot e^{-\xi} = \frac{e^{-\xi}}{1+e^{-\xi}} = 1-\frac{1}{1+e^{-\xi}} = 1-\sigma(\xi)$$

In other words, we have obtained $\sigma(\xi) = 1-\lambda$. In order to simplify $(*)$, we need to express $\xi$ in terms of $\lambda$.
According to the definition of $\lambda$, taking logarithms gives:

$$\xi = \ln\sigma(\xi) - \ln\lambda = \ln(1-\lambda) - \ln\lambda$$

Now $(*)$ can be simplified as:

$$\begin{aligned}
\ln\sigma(x) &\le \ln(1-\lambda) + \lambda\cdot(x-\xi) \\
&= \lambda\cdot x + \ln(1-\lambda) - \lambda\cdot\xi \\
&= \lambda\cdot x + \ln(1-\lambda) - \lambda\cdot\big[\ln(1-\lambda)-\ln\lambda\big] \\
&= \lambda\cdot x + (1-\lambda)\ln(1-\lambda) + \lambda\ln\lambda \\
&= \lambda\cdot x - g(\lambda)
\end{aligned}$$

just as required.

Problem 10.31 Solution

We start by calculating the derivative of $f(x) = -\ln(e^{x/2}+e^{-x/2})$ with respect to $x$:

$$\frac{df(x)}{dx} = -\frac{\frac{1}{2}e^{x/2}-\frac{1}{2}e^{-x/2}}{e^{x/2}+e^{-x/2}} = -\frac{1}{2}\cdot\frac{e^{x/2}-e^{-x/2}}{e^{x/2}+e^{-x/2}}$$

Then the second derivative is:

$$\begin{aligned}
\frac{d^2f(x)}{dx^2} &= -\frac{1}{2}\cdot\frac{(\frac{1}{2}e^{x/2}+\frac{1}{2}e^{-x/2})(e^{x/2}+e^{-x/2}) - (e^{x/2}-e^{-x/2})(\frac{1}{2}e^{x/2}-\frac{1}{2}e^{-x/2})}{(e^{x/2}+e^{-x/2})^2} \\
&= -\frac{1}{4}\cdot\frac{4\cdot e^{x/2}\cdot e^{-x/2}}{(e^{x/2}+e^{-x/2})^2} = -\frac{1}{(e^{x/2}+e^{-x/2})^2} < 0
\end{aligned}$$

Therefore, $f(x)$ is concave with respect to $x$. We now denote $y = x^2$, and calculate:

$$\frac{df}{dy} = \frac{d}{dy}\Big\{-\ln\big(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}\big)\Big\} = -\frac{\frac{1}{4}y^{-1/2}e^{\sqrt{y}/2} - \frac{1}{4}y^{-1/2}e^{-\sqrt{y}/2}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}$$

Then the second derivative can be calculated as:

$$\begin{aligned}
\frac{d^2f}{dy^2} &= -\frac{1}{4}\frac{d}{dy}\Big\{y^{-1/2}\cdot\frac{e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}\Big\} \\
&= -\frac{1}{4}\Big\{-\frac{1}{2}y^{-3/2}\cdot\frac{e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}} + y^{-1/2}\cdot\frac{y^{-1/2}}{(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2})^2}\Big\} \\
&= -\frac{1}{4}\cdot\frac{y^{-1}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}\cdot\Big\{-\frac{1}{2}y^{-1/2}\cdot\big(e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}\big) + \frac{1}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}\Big\}
\end{aligned}$$

In order to show that the second derivative is no less than 0, we only need to prove:

$$-\frac{1}{2}y^{-1/2}\cdot\big(e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}\big) + \frac{1}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}} \le 0$$

which is equivalent to:

$$\big(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}\big)\big(e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}\big) \ge 2y^{1/2} \iff e^{\sqrt{y}}-e^{-\sqrt{y}} \ge 2\sqrt{y} \qquad (*)$$

We construct a function $g(t) = e^t-e^{-t}-2t$, where $t\ge 0$. Notice that:

$$g'(t) = e^t+e^{-t}-2 \ge 2\sqrt{e^t\cdot e^{-t}}-2 = 0$$

Therefore, $g(t)$ is monotonically increasing on $t\ge 0$, and its minimum is achieved at $t = 0$, where $g(0) = 0$. Hence:

$$e^t-e^{-t} \ge 2t, \quad \forall t\ge 0$$

This proves $(*)$, and thus $f(x)$ is convex with respect to $x^2$.
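The two facts just established, namely $e^t-e^{-t}\ge 2t$ for $t\ge 0$ and the convexity of $f$ as a function of $y = x^2$, can be spot-checked numerically. The sketch below is only a sanity check on a finite grid, not a proof; the helper names `g`, `f_of_y` and `second_difference`, the grids and the step size are assumptions chosen for illustration.

```python
import math

def g(t):
    """g(t) = e^t - e^{-t} - 2t, claimed above to be non-negative for t >= 0."""
    return math.exp(t) - math.exp(-t) - 2.0 * t

def f_of_y(y):
    """f as a function of y = x^2: -ln(e^{sqrt(y)/2} + e^{-sqrt(y)/2})."""
    r = math.sqrt(y) / 2.0
    return -math.log(math.exp(r) + math.exp(-r))

def second_difference(func, y, h):
    """Central second difference, approximately h^2 times the second derivative."""
    return func(y + h) - 2.0 * func(y) + func(y - h)

ts = [0.05 * k for k in range(0, 201)]         # t in [0, 10]
ys = [0.5 + 0.5 * k for k in range(0, 100)]    # y in [0.5, 50]

min_g = min(g(t) for t in ts)
min_curvature = min(second_difference(f_of_y, y, 0.1) for y in ys)
print(min_g, min_curvature)
```

A non-negative `min_g` and a positive `min_curvature` are what the derivation predicts on these grids. Finite differences are used here so that the check does not reuse the analytic second derivative it is meant to corroborate.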
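Using the convexity of $f$ with respect to $x^2$, i.e., the fact that a convex function lies above its tangent, we can obtain:

$$-\ln\big(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}\big) \ge -\ln\big(e^{\sqrt{\epsilon}/2}+e^{-\sqrt{\epsilon}/2}\big) - \frac{1}{4}\epsilon^{-1/2}\cdot\frac{e^{\sqrt{\epsilon}/2}-e^{-\sqrt{\epsilon}/2}}{e^{\sqrt{\epsilon}/2}+e^{-\sqrt{\epsilon}/2}}\cdot(y-\epsilon)$$

Substituting $y = x^2$ and $\epsilon = \xi^2$ back into the expression above, we obtain:

$$\begin{aligned}
-\ln\big(e^{x/2}+e^{-x/2}\big) &\ge -\ln\big(e^{\xi/2}+e^{-\xi/2}\big) - \frac{1}{4}\xi^{-1}\cdot\frac{e^{\xi/2}-e^{-\xi/2}}{e^{\xi/2}+e^{-\xi/2}}\cdot(x^2-\xi^2) \\
&= -\ln\big(e^{\xi/2}+e^{-\xi/2}\big) - \lambda(\xi)\cdot(x^2-\xi^2)
\end{aligned}$$

where we have used the definition of $\lambda(\xi)$, i.e., Eq (10.141). Notice that the expression above is identical to Eq (10.143), from which we can easily obtain Eq (10.144).

Figure 5: $f(x)$ with respect to $x$ and $x^2$

Problem 10.32 Solution

By carefully reading Section 10.6.1, we can obtain a Gaussian approximation to the posterior $p(\mathbf{w}|\mathbf{t})$, i.e., Eq (10.156)-(10.158), when the variational lower bound on the logistic sigmoid function, i.e., Eq (10.151), is used. What the problem asks for is a sequential update for $\mathbf{m}_N$ and $\mathbf{S}_N^{-1}$. Such an update exists because each data point $\{\boldsymbol{\phi}_n,t_n\}$ corresponds to exactly one variational parameter $\xi_n$ and the likelihood is given by the product of the $p(t_n|\mathbf{w})$. Let's begin by deriving a sequential update formula for $\mathbf{S}_N^{-1}$:

$$\begin{aligned}
\mathbf{S}_{N+1}^{-1} &= \mathbf{S}_0^{-1} + 2\sum_{n=1}^{N+1}\lambda(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T \\
&= 2\lambda(\xi_{N+1})\boldsymbol{\phi}_{N+1}\boldsymbol{\phi}_{N+1}^T + \mathbf{S}_0^{-1} + 2\sum_{n=1}^{N}\lambda(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T \\
&= 2\lambda(\xi_{N+1})\boldsymbol{\phi}_{N+1}\boldsymbol{\phi}_{N+1}^T + \mathbf{S}_N^{-1}
\end{aligned}$$

Next we derive a sequential update formula for $\mathbf{m}_N$:

$$\begin{aligned}
\mathbf{m}_{N+1} &= \mathbf{S}_{N+1}\Big[\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N+1}(t_n-1/2)\boldsymbol{\phi}_n\Big] \\
&= \mathbf{S}_{N+1}\Big[\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}(t_n-1/2)\boldsymbol{\phi}_n + (t_{N+1}-1/2)\boldsymbol{\phi}_{N+1}\Big] \\
&= \mathbf{S}_{N+1}\Big[\mathbf{S}_N^{-1}\mathbf{S}_N\Big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}(t_n-1/2)\boldsymbol{\phi}_n\Big) + (t_{N+1}-1/2)\boldsymbol{\phi}_{N+1}\Big] \\
&= \mathbf{S}_{N+1}\Big[\mathbf{S}_N^{-1}\mathbf{m}_N + (t_{N+1}-1/2)\boldsymbol{\phi}_{N+1}\Big]
\end{aligned}$$

In conclusion, when a new data point $(\boldsymbol{\phi}_{N+1},t_{N+1})$ arrives, we first update $\mathbf{S}_{N+1}^{-1}$, and then update $\mathbf{m}_{N+1}$ based on the formula obtained above.

Problem 10.33 Solution

To prove Eq (10.163), we only need to prove Eq (10.162), from which Eq (10.163) can be easily derived according to the text below Eq (10.162).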
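Therefore, in what follows, we prove that setting the derivative of $Q(\boldsymbol{\xi},\boldsymbol{\xi}^{\text{old}})$ with respect to $\xi_n$ to zero gives Eq (10.162). We start by noticing Eq (4.88), i.e.,

$$\frac{d\sigma(\xi)}{d\xi} = \sigma(\xi)\cdot(1-\sigma(\xi))$$

Using the definition of $\lambda(\xi)$, Eq (10.150), which gives $2\xi_n\lambda(\xi_n) = \sigma(\xi_n)-\frac{1}{2}$, we can now obtain:

$$\begin{aligned}
\frac{dQ(\boldsymbol{\xi},\boldsymbol{\xi}^{\text{old}})}{d\xi_n} &= \frac{\sigma'(\xi_n)}{\sigma(\xi_n)} - \frac{1}{2} - \lambda'(\xi_n)\cdot\big(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n-\xi_n^2\big) - \lambda(\xi_n)\cdot(-2\xi_n) \\
&= 1-\sigma(\xi_n) - \frac{1}{2} + 2\xi_n\cdot\lambda(\xi_n) - \lambda'(\xi_n)\cdot\big(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n-\xi_n^2\big) \\
&= \frac{1}{2}-\sigma(\xi_n)+\sigma(\xi_n)-\frac{1}{2} - \lambda'(\xi_n)\cdot\big(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n-\xi_n^2\big) \\
&= -\lambda'(\xi_n)\cdot\big(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n-\xi_n^2\big)
\end{aligned}$$

Setting the derivative equal to zero, we obtain Eq (10.162), from which Eq (10.163) follows.

Problem 10.34 Solution

First, we should clarify one thing: there are typos in Eq (10.164). These errors are not difficult to spot if we notice that for $q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$, the term $\frac{1}{2}\ln|\mathbf{S}_N|$ should always have the same sign as $\frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N$. This is our intuition; however, it is not the case in Eq (10.164). Based on Eq (10.159), Eq (10.153) and the Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0)$, we can analytically obtain the correct lower bound $L(\boldsymbol{\xi})$ (this will also be proved rigorously in the next problem):

$$\begin{aligned}
L(\boldsymbol{\xi}) &= \frac{1}{2}\ln|\mathbf{S}_N| - \frac{1}{2}\ln|\mathbf{S}_0| + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N - \frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n)-\frac{1}{2}\xi_n+\lambda(\xi_n)\xi_n^2\Big\} \\
&= \frac{1}{2}\ln|\mathbf{S}_N| + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n)-\frac{1}{2}\xi_n+\lambda(\xi_n)\xi_n^2\Big\} + \text{const}
\end{aligned}$$

where const denotes the terms unrelated to $\xi_n$, because $\mathbf{m}_0$ and $\mathbf{S}_0$ do not depend on $\xi_n$.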
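Moreover, noticing that $\mathbf{S}_N^{-1}\mathbf{m}_N$ also does not depend on $\xi_n$ according to Eq (10.157), it is convenient to define the variable $\mathbf{z}_N = \mathbf{S}_N^{-1}\mathbf{m}_N$, and we can easily verify:

$$\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N = [\mathbf{S}_N\mathbf{S}_N^{-1}\mathbf{m}_N]^T\mathbf{S}_N^{-1}[\mathbf{S}_N\mathbf{S}_N^{-1}\mathbf{m}_N] = [\mathbf{S}_N\mathbf{z}_N]^T\mathbf{S}_N^{-1}[\mathbf{S}_N\mathbf{z}_N] = \mathbf{z}_N^T\mathbf{S}_N\mathbf{z}_N$$

Now, we can obtain:

$$\begin{aligned}
\frac{\partial L(\boldsymbol{\xi})}{\partial\xi_n} &= \frac{d}{d\xi_n}\Big\{\frac{1}{2}\ln|\mathbf{S}_N| + \frac{1}{2}\mathbf{z}_N^T\mathbf{S}_N\mathbf{z}_N + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n)-\frac{1}{2}\xi_n+\lambda(\xi_n)\xi_n^2\Big\}\Big\} \\
&= \frac{1}{2}\mathrm{Tr}\Big[\mathbf{S}_N^{-1}\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big] + \frac{1}{2}\mathrm{Tr}\Big[\mathbf{z}_N\mathbf{z}_N^T\cdot\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big] + \lambda'(\xi_n)\xi_n^2
\end{aligned}$$

where we have used Eq (3.117) for the first term, and for the second term we have used:

$$\frac{d}{d\xi_n}\Big\{\frac{1}{2}\mathbf{z}_N^T\mathbf{S}_N\mathbf{z}_N\Big\} = \frac{1}{2}\frac{d}{d\xi_n}\Big\{\mathrm{Tr}\big[\mathbf{z}_N\mathbf{z}_N^T\cdot\mathbf{S}_N\big]\Big\} = \frac{1}{2}\mathrm{Tr}\Big[\mathbf{z}_N\mathbf{z}_N^T\cdot\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big]$$

For the last term, we can follow the same procedure as in the previous problem, so the remaining task is to calculate $\partial\mathbf{S}_N/\partial\xi_n$. Based on Eq (10.158) and (C.21), we can obtain:

$$\frac{\partial\mathbf{S}_N}{\partial\xi_n} = -\mathbf{S}_N\frac{\partial\mathbf{S}_N^{-1}}{\partial\xi_n}\mathbf{S}_N = -\mathbf{S}_N\cdot[2\lambda'(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T]\cdot\mathbf{S}_N$$

Substituting it back into the derivative, we can obtain:

$$\begin{aligned}
\frac{\partial L(\boldsymbol{\xi})}{\partial\xi_n} &= \frac{1}{2}\mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big] + \lambda'(\xi_n)\xi_n^2 \\
&= -\frac{1}{2}\mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\mathbf{S}_N\cdot[2\lambda'(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T]\cdot\mathbf{S}_N\Big] + \lambda'(\xi_n)\xi_n^2 \\
&= -\lambda'(\xi_n)\cdot\Big\{\mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\cdot\mathbf{S}_N\cdot\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\cdot\mathbf{S}_N\Big] - \xi_n^2\Big\} = 0
\end{aligned}$$

Therefore, we can obtain:

$$\begin{aligned}
\xi_n^2 &= \mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\cdot\mathbf{S}_N\cdot\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\cdot\mathbf{S}_N\Big] \\
&= (\mathbf{S}_N\cdot\boldsymbol{\phi}_n)^T\cdot(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\cdot(\mathbf{S}_N\cdot\boldsymbol{\phi}_n) \\
&= \boldsymbol{\phi}_n^T\cdot(\mathbf{S}_N+\mathbf{S}_N\mathbf{z}_N\mathbf{z}_N^T\mathbf{S}_N)\cdot\boldsymbol{\phi}_n \\
&= \boldsymbol{\phi}_n^T\cdot(\mathbf{S}_N+\mathbf{m}_N\mathbf{m}_N^T)\cdot\boldsymbol{\phi}_n
\end{aligned}$$

where we have used the definition of $\mathbf{z}_N$, i.e., $\mathbf{z}_N = \mathbf{S}_N^{-1}\mathbf{m}_N$, and have repeatedly used the symmetry of $\mathbf{S}_N$.

Problem 10.35 Solution

There is a typo in Eq (10.164); for more details you can refer to the previous problem.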
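Let's calculate $L(\boldsymbol{\xi})$ based on Eq (10.159), Eq (10.153) and the Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0)$, where $W$ denotes the dimensionality of $\mathbf{w}$:

$$\begin{aligned}
h(\mathbf{w},\boldsymbol{\xi})\,p(\mathbf{w}) &= \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0)\cdot\prod_{n=1}^{N}\sigma(\xi_n)\exp\Big\{\mathbf{w}^T\boldsymbol{\phi}_nt_n - (\mathbf{w}^T\boldsymbol{\phi}_n+\xi_n)/2 - \lambda(\xi_n)\big([\mathbf{w}^T\boldsymbol{\phi}_n]^2-\xi_n^2\big)\Big\} \\
&= \Big\{(2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\Big\}\cdot\exp\Big\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\Big\} \\
&\quad\cdot\prod_{n=1}^{N}\exp\Big\{\mathbf{w}^T\boldsymbol{\phi}_nt_n - (\mathbf{w}^T\boldsymbol{\phi}_n+\xi_n)/2 - \lambda(\xi_n)\big([\mathbf{w}^T\boldsymbol{\phi}_n]^2-\xi_n^2\big)\Big\} \\
&= (2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\cdot\exp\Big\{-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2\Big\} \\
&\quad\cdot\exp\Big\{-\frac{1}{2}\mathbf{w}^T\Big(\mathbf{S}_0^{-1}+2\sum_{n=1}^{N}\lambda(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\Big)\mathbf{w} + \mathbf{w}^T\Big(\mathbf{S}_0^{-1}\mathbf{m}_0+\sum_{n=1}^{N}\boldsymbol{\phi}_n\big(t_n-\tfrac{1}{2}\big)\Big)\Big\}
\end{aligned}$$

Noticing Eq (10.157)-(10.158), we can obtain:

$$\begin{aligned}
h(\mathbf{w},\boldsymbol{\xi})\,p(\mathbf{w}) &= (2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\cdot\exp\Big\{-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2\Big\} \\
&\quad\cdot\exp\Big\{-\frac{1}{2}\mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{w} + \mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big\} \\
&= (2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\cdot\exp\Big\{-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big\} \\
&\quad\cdot\exp\Big\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\Big\}
\end{aligned}$$

Therefore, using the normalization constant of the Gaussian distribution, we can obtain:

$$\begin{aligned}
\int h(\mathbf{w},\boldsymbol{\xi})\,p(\mathbf{w})\,d\mathbf{w} &= (2\pi)^{W/2}\cdot|\mathbf{S}_N|^{1/2}\cdot(2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n) \\
&\quad\cdot\exp\Big\{-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big\} \\
&= \Big(\frac{|\mathbf{S}_N|}{|\mathbf{S}_0|}\Big)^{1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\cdot\exp\Big\{-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big\}
\end{aligned}$$

Therefore, $L(\boldsymbol{\xi})$ can be written as:

$$L(\boldsymbol{\xi}) = \ln\int h(\mathbf{w},\boldsymbol{\xi})\,p(\mathbf{w})\,d\mathbf{w} = \frac{1}{2}\ln\frac{|\mathbf{S}_N|}{|\mathbf{S}_0|} - \frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n)-\frac{1}{2}\xi_n+\lambda(\xi_n)\xi_n^2\Big\}$$

Problem 10.36 Solution

Let's clarify this problem. Suppose that at the beginning the joint distribution comprises a product of $j-1$ factors, i.e.,

$$p_{j-1}(D,\boldsymbol{\theta}) = \prod_{i=1}^{j-1}f_i(\boldsymbol{\theta})$$

and that the joint distribution now comprises a product of $j$ factors:

$$p_j(D,\boldsymbol{\theta}) = \prod_{i=1}^{j}f_i(\boldsymbol{\theta}) = p_{j-1}(D,\boldsymbol{\theta})\cdot f_j(\boldsymbol{\theta})$$

Then we are asked to prove Eq (10.242).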
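This situation corresponds to having $j-1$ data points at the beginning and then obtaining one more data point. For more details you can read the text below Eq (10.188). Based on the definition, we can write down:

$$\begin{aligned}
p_j(D) &= \int p_j(D,\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= \int p_{j-1}(D,\boldsymbol{\theta})\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= \int p_{j-1}(D)\cdot p_{j-1}(\boldsymbol{\theta}|D)\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= p_{j-1}(D)\cdot\int p_{j-1}(\boldsymbol{\theta}|D)\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&\approx p_{j-1}(D)\cdot\int q_{j-1}(\boldsymbol{\theta})\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= p_{j-1}(D)\cdot Z_j
\end{aligned}$$

where we have used, in order, the product rule $p_{j-1}(D,\boldsymbol{\theta}) = p_{j-1}(D)\,p_{j-1}(\boldsymbol{\theta}|D)$, the fact that $q_{j-1}(\boldsymbol{\theta})$ is an approximation to the posterior $p_{j-1}(\boldsymbol{\theta}|D)$, and Eq (10.197). To further prove Eq (10.243), we only need to apply the expression just proved recursively.

Problem 10.37 Solution

Let's start from the definition. $q(\boldsymbol{\theta})$ is initialized as

$$q_{\text{init}}(\boldsymbol{\theta}) = \widetilde{f}_0(\boldsymbol{\theta})\prod_{i\neq 0}\widetilde{f}_i(\boldsymbol{\theta}) = f_0(\boldsymbol{\theta})\prod_{i\neq 0}\widetilde{f}_i(\boldsymbol{\theta})$$

where we have used $\widetilde{f}_0(\boldsymbol{\theta}) = f_0(\boldsymbol{\theta})$ according to the problem description. Then we can obtain:

$$q^{\backslash 0}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\widetilde{f}_0(\boldsymbol{\theta})} = \prod_{i\neq 0}\widetilde{f}_i(\boldsymbol{\theta})$$

Next, we obtain $q^{\text{new}}(\boldsymbol{\theta})$ by matching its moments against $q^{\backslash 0}(\boldsymbol{\theta})f_0(\boldsymbol{\theta})$, which exactly equals:

$$q^{\backslash 0}(\boldsymbol{\theta})f_0(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\widetilde{f}_0(\boldsymbol{\theta})}\cdot f_0(\boldsymbol{\theta}) = \prod_{i\neq 0}\widetilde{f}_i(\boldsymbol{\theta})\cdot f_0(\boldsymbol{\theta}) = q_{\text{init}}(\boldsymbol{\theta})$$

In other words, in order to obtain $q^{\text{new}}(\boldsymbol{\theta})$ we need to match its moments against $q_{\text{init}}(\boldsymbol{\theta})$, and since $q^{\text{new}}$ and $q_{\text{init}}$ both belong to the same exponential family, they will be identical if they have the same moments.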
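Moreover, based on Eq (10.206), we have:

$$Z_0 = \int q^{\backslash 0}(\boldsymbol{\theta})f_0(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \int q_{\text{init}}(\boldsymbol{\theta})\,d\boldsymbol{\theta} = 1$$

Therefore, based on Eq (10.207), we have:

$$\widetilde{f}_0(\boldsymbol{\theta}) = Z_0\frac{q^{\text{new}}(\boldsymbol{\theta})}{q^{\backslash 0}(\boldsymbol{\theta})} = 1\cdot\frac{q_{\text{init}}(\boldsymbol{\theta})}{q^{\backslash 0}(\boldsymbol{\theta})} = f_0(\boldsymbol{\theta})$$

Problem 10.38 Solution

Based on Eq (10.205), (10.212) and (10.213), we can obtain:

$$\begin{aligned}
q^{\backslash n}(\boldsymbol{\theta}) &= \frac{q(\boldsymbol{\theta})}{\widetilde{f}_n(\boldsymbol{\theta})} = \frac{\mathcal{N}(\boldsymbol{\theta}|\mathbf{m},v\mathbf{I})}{s_n\,\mathcal{N}(\boldsymbol{\theta}|\mathbf{m}_n,v_n\mathbf{I})} \\
&\propto \frac{\exp\big\{-\frac{1}{2}(\boldsymbol{\theta}-\mathbf{m})^T(v\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m})\big\}}{\exp\big\{-\frac{1}{2}(\boldsymbol{\theta}-\mathbf{m}_n)^T(v_n\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m}_n)\big\}} \\
&= \exp\Big\{-\frac{1}{2}(\boldsymbol{\theta}-\mathbf{m})^T(v\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m}) + \frac{1}{2}(\boldsymbol{\theta}-\mathbf{m}_n)^T(v_n\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m}_n)\Big\} \\
&= \exp\Big\{-\frac{1}{2}\big(\boldsymbol{\theta}^T\mathbf{A}\boldsymbol{\theta} + \boldsymbol{\theta}^T\cdot\mathbf{B} + C\big)\Big\}
\end{aligned}$$

where we have expanded the quadratic forms in $\boldsymbol{\theta}$ in the last step, and we have defined:

$$\mathbf{A} = (v\mathbf{I})^{-1}-(v_n\mathbf{I})^{-1} \quad\text{and}\quad \mathbf{B} = 2\cdot\big[-(v\mathbf{I})^{-1}\cdot\mathbf{m} + (v_n\mathbf{I})^{-1}\cdot\mathbf{m}_n\big]$$

Note that in order to match this to a Gaussian we do not actually need $C$, so we omit it here. Now we match this against a Gaussian with mean $\mathbf{m}^{\backslash n}$ and covariance $\boldsymbol{\Sigma}^{\backslash n}$. Considering the quadratic term first, we obtain:

$$[\boldsymbol{\Sigma}^{\backslash n}]^{-1} = (v\mathbf{I})^{-1}-(v_n\mathbf{I})^{-1} = (v^{-1}-v_n^{-1})\mathbf{I} = [v^{\backslash n}]^{-1}\cdot\mathbf{I}$$

which is identical to Eq (10.215). By matching the linear term, we also obtain:

$$-2\cdot[\boldsymbol{\Sigma}^{\backslash n}]^{-1}\cdot\mathbf{m}^{\backslash n} = \mathbf{B} = 2\cdot\big[-(v\mathbf{I})^{-1}\cdot\mathbf{m} + (v_n\mathbf{I})^{-1}\cdot\mathbf{m}_n\big]$$

Rearranging it, and using $v^{-1} = [v^{\backslash n}]^{-1}+v_n^{-1}$, we obtain:

$$\begin{aligned}
\mathbf{m}^{\backslash n} &= -[\boldsymbol{\Sigma}^{\backslash n}]\cdot\big[-(v\mathbf{I})^{-1}\cdot\mathbf{m} + (v_n\mathbf{I})^{-1}\cdot\mathbf{m}_n\big] \\
&= -v^{\backslash n}\cdot\big[-v^{-1}\cdot\mathbf{m} + v_n^{-1}\cdot\mathbf{m}_n\big] \\
&= v^{\backslash n}\cdot v^{-1}\cdot\mathbf{m} - \frac{v^{\backslash n}}{v_n}\cdot\mathbf{m}_n \\
&= v^{\backslash n}\big([v^{\backslash n}]^{-1}+v_n^{-1}\big)\cdot\mathbf{m} - \frac{v^{\backslash n}}{v_n}\cdot\mathbf{m}_n \\
&= \mathbf{m} + \frac{v^{\backslash n}}{v_n}\cdot(\mathbf{m}-\mathbf{m}_n)
\end{aligned}$$

which is identical to Eq (10.214).

One point is worth clarifying: for two arbitrary Gaussian random variables, their quotient is not Gaussian. You can find more details by looking up "ratio distribution" on Wikipedia. For example, the ratio of two independent zero-mean Gaussian random variables follows a Cauchy distribution. Likewise, the product of two Gaussian random variables is in general not a Gaussian random variable. However, the product of two Gaussian PDFs, e.g., $p(\mathbf{x})$ and $p(\mathbf{y})$, can be a Gaussian PDF, because when $\mathbf{x}$ and $\mathbf{y}$ are independent, $p(\mathbf{x},\mathbf{y}) = p(\mathbf{x})\,p(\mathbf{y})$ is a Gaussian PDF.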
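The distinction above, dividing densities rather than random variables, can be checked numerically in one dimension: if the cavity parameters are given by Eq (10.214)-(10.215), then $\ln\mathcal{N}(\theta|m,v) - \ln\mathcal{N}(\theta|m_n,v_n) - \ln\mathcal{N}(\theta|m^{\backslash n},v^{\backslash n})$ must be the same constant for every $\theta$. The sketch below assumes scalar $\theta$ and $v_n>v$ (so that the cavity variance is positive); the helper names `log_gauss` and `cavity` and the numerical values are illustrative assumptions.

```python
import math

def log_gauss(theta, mean, var):
    """Log density of a univariate Gaussian N(theta | mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (theta - mean) ** 2 / (2.0 * var)

def cavity(m, v, m_n, v_n):
    """Cavity mean and variance from Eq (10.214)-(10.215), scalar case.

    Assumes v_n > v so that the resulting variance is positive.
    """
    v_cav = 1.0 / (1.0 / v - 1.0 / v_n)
    m_cav = m + (v_cav / v_n) * (m - m_n)
    return m_cav, v_cav

m, v, m_n, v_n = 1.0, 0.5, -0.3, 2.0   # illustrative values only
m_cav, v_cav = cavity(m, v, m_n, v_n)

# If the ratio of densities is proportional to the cavity Gaussian,
# this offset does not depend on theta.
offsets = [
    log_gauss(th, m, v) - log_gauss(th, m_n, v_n) - log_gauss(th, m_cav, v_cav)
    for th in (-3.0, -1.0, 0.0, 0.7, 2.5)
]
spread = max(offsets) - min(offsets)
print(m_cav, v_cav, spread)
```

A spread at the level of floating-point rounding indicates that the ratio of the two densities is an unnormalized Gaussian with the stated cavity parameters; the common offset itself is the logarithm of the normalization factor that was dropped as $C$.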
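In the EP framework, according to Eq (10.204), we have already assumed that $q(\theta)$, i.e., Eq (10.212), is given by the product of the factors $\tilde{f}_j(\theta)$, i.e., Eq (10.213). Therefore, dividing $q(\theta)$ by one factor still leaves the product of the remaining Gaussian PDFs, which is again a Gaussian.

Finally, based on Eq (10.206) and (10.209), we can obtain:

$$
\begin{aligned}
Z_n &= \int q^{\backslash n}(\theta)\, p(x_n|\theta)\, d\theta \\
&= \int N(\theta|m^{\backslash n}, v^{\backslash n} I) \cdot \big\{(1-w)\, N(x_n|\theta, I) + w\, N(x_n|0, \alpha I)\big\}\, d\theta \\
&= (1-w) \int N(\theta|m^{\backslash n}, v^{\backslash n} I)\, N(x_n|\theta, I)\, d\theta + w \int N(\theta|m^{\backslash n}, v^{\backslash n} I) \cdot N(x_n|0, \alpha I)\, d\theta \\
&= (1-w)\, N(x_n|m^{\backslash n}, (v^{\backslash n}+1) I) + w\, N(x_n|0, \alpha I)
\end{aligned}
$$

where we have used Eq (2.115).

Problem 10.39 Solution

This problem is really complicated, but a hint has already been given in Eq (10.244) and (10.255). Notice that Eq (10.244) contains the term $\nabla_{m^{\backslash n}} \ln Z_n$, and we know that $\nabla_{m^{\backslash n}} \ln Z_n = (\nabla_{m^{\backslash n}} Z_n) / Z_n$ by the Chain Rule. Since we know the exact form of $Z_n$, which has been derived in the previous problem, we can start from $\nabla_{m^{\backslash n}} \ln Z_n$ to obtain Eq (10.244). Before starting, we write down a basic formula: for a Gaussian random variable $x \sim N(x|\mu, \Sigma)$, we have:

$$
\nabla_{\mu} N(x|\mu, \Sigma) = N(x|\mu, \Sigma) \cdot \Sigma^{-1} (x - \mu)
$$

Now we can obtain:

$$
\begin{aligned}
\nabla_{m^{\backslash n}} \ln Z_n &= \frac{1}{Z_n} \cdot \nabla_{m^{\backslash n}} Z_n \\
&= \frac{1}{Z_n} \cdot \nabla_{m^{\backslash n}} \int q^{\backslash n}(\theta)\, p(x_n|\theta)\, d\theta \\
&= \frac{1}{Z_n} \cdot \int \big\{\nabla_{m^{\backslash n}} q^{\backslash n}(\theta)\big\} \cdot p(x_n|\theta)\, d\theta \\
&= \frac{1}{Z_n} \cdot \int \frac{1}{v^{\backslash n}} (\theta - m^{\backslash n}) \cdot q^{\backslash n}(\theta) \cdot p(x_n|\theta)\, d\theta \\
&= \frac{1}{Z_n} \cdot \frac{1}{v^{\backslash n}} \cdot \Big\{\int \theta \cdot q^{\backslash n}(\theta) \cdot p(x_n|\theta)\, d\theta - m^{\backslash n} \cdot \int q^{\backslash n}(\theta) \cdot p(x_n|\theta)\, d\theta\Big\} \\
&= \frac{1}{v^{\backslash n}} \cdot \big\{E[\theta] - m^{\backslash n}\big\}
\end{aligned}
$$

Here we have used $q^{\backslash n}(\theta) = N(\theta|m^{\backslash n}, v^{\backslash n} I)$ and $q^{\backslash n}(\theta) \cdot p(x_n|\theta) = Z_n \cdot q^{new}(\theta)$, which holds in the moment-matching sense: the expectations are taken under the normalized product $q^{\backslash n}(\theta)\, p(x_n|\theta) / Z_n$, whose first two moments $q^{new}$ reproduces. Rearranging the equation above, we obtain Eq (10.244).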
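Then we use Eq (10.216), yielding:

$$
\begin{aligned}
E[\theta] &= m^{\backslash n} + v^{\backslash n} \cdot \nabla_{m^{\backslash n}} \ln Z_n \\
&= m^{\backslash n} + v^{\backslash n} \cdot \frac{1}{Z_n} (1-w)\, N(x_n|m^{\backslash n}, (v^{\backslash n}+1) I) \cdot \frac{1}{v^{\backslash n}+1} (x_n - m^{\backslash n}) \\
&= m^{\backslash n} + v^{\backslash n} \cdot \rho_n \cdot \frac{1}{v^{\backslash n}+1} (x_n - m^{\backslash n})
\end{aligned}
$$

where we have defined:

$$
\begin{aligned}
\rho_n &= \frac{1}{Z_n} (1-w)\, N(x_n|m^{\backslash n}, (v^{\backslash n}+1) I) \\
&= \frac{1}{Z_n} (1-w) \cdot \frac{Z_n - w\, N(x_n|0, \alpha I)}{1-w} \\
&= 1 - \frac{w}{Z_n} N(x_n|0, \alpha I)
\end{aligned}
$$

Therefore, we have proved that the mean $m$ is given by Eq (10.217). Next we prove Eq (10.218). Similarly, we can write down:

$$
\begin{aligned}
\nabla_{v^{\backslash n}} \ln Z_n &= \frac{1}{Z_n} \cdot \nabla_{v^{\backslash n}} Z_n \\
&= \frac{1}{Z_n} \cdot \nabla_{v^{\backslash n}} \int q^{\backslash n}(\theta)\, p(x_n|\theta)\, d\theta \\
&= \frac{1}{Z_n} \cdot \int \big\{\nabla_{v^{\backslash n}} q^{\backslash n}(\theta)\big\}\, p(x_n|\theta)\, d\theta \\
&= \frac{1}{Z_n} \cdot \int \Big\{\frac{1}{2 (v^{\backslash n})^2} \|m^{\backslash n} - \theta\|^2 - \frac{D}{2 v^{\backslash n}}\Big\}\, q^{\backslash n}(\theta) \cdot p(x_n|\theta)\, d\theta \\
&= \int q^{new}(\theta) \cdot \Big\{\frac{1}{2 (v^{\backslash n})^2} (m^{\backslash n} - \theta)^T (m^{\backslash n} - \theta) - \frac{D}{2 v^{\backslash n}}\Big\}\, d\theta \\
&= \frac{1}{2 (v^{\backslash n})^2} \Big\{E[\theta^T \theta] - 2 E[\theta]^T m^{\backslash n} + \|m^{\backslash n}\|^2\Big\} - \frac{D}{2 v^{\backslash n}}
\end{aligned}
$$

Rearranging it, we can obtain:

$$
E[\theta^T \theta] = 2 (v^{\backslash n})^2 \cdot \nabla_{v^{\backslash n}} \ln Z_n + 2 E[\theta]^T m^{\backslash n} - \|m^{\backslash n}\|^2 + D \cdot v^{\backslash n}
$$

There is a typo in Eq (10.255): the term $D \cdot v^{\backslash n}$ is missing there. The underlying reason is that when calculating $\nabla_{v^{\backslash n}} q^{\backslash n}(\theta)$, there are two terms in $q^{\backslash n}(\theta)$ that depend on $v^{\backslash n}$: one is inside the exponential, and the other is the factor $1 / |v^{\backslash n} I|^{1/2}$ outside the exponential. Now, we again use Eq (10.216), yielding:

$$
\begin{aligned}
\nabla_{v^{\backslash n}} \ln Z_n &= \frac{1}{Z_n} (1-w)\, N(x_n|m^{\backslash n}, (v^{\backslash n}+1) I) \cdot \Big[\frac{1}{2 (v^{\backslash n}+1)^2} \|x_n - m^{\backslash n}\|^2 - \frac{D}{2 (v^{\backslash n}+1)}\Big] \\
&= \rho_n \cdot \Big[\frac{1}{2 (v^{\backslash n}+1)^2} \|x_n - m^{\backslash n}\|^2 - \frac{D}{2 (v^{\backslash n}+1)}\Big]
\end{aligned}
$$

Finally, using the definition of the covariance, we obtain:

$$
vI = E[\theta \theta^T] - E[\theta]\, E[\theta^T]
$$

Therefore, taking the trace, we obtain:

$$
\begin{aligned}
v &= \frac{1}{D} \cdot \big\{E[\theta^T \theta] - E[\theta^T]\, E[\theta]\big\} = \frac{1}{D} \cdot \big\{E[\theta^T \theta] - \|E[\theta]\|^2\big\} \\
&= \frac{1}{D} \cdot \big\{2 (v^{\backslash n})^2 \cdot \nabla_{v^{\backslash n}} \ln Z_n + 2 E[\theta]^T m^{\backslash n} - \|m^{\backslash n}\|^2 + D \cdot v^{\backslash n} - \|E[\theta]\|^2\big\} \\
&= \frac{1}{D} \cdot \big\{2 (v^{\backslash n})^2 \cdot \nabla_{v^{\backslash n}} \ln Z_n - \|E[\theta] - m^{\backslash n}\|^2 + D \cdot v^{\backslash n}\big\} \\
&= \frac{1}{D} \cdot \Big\{2 (v^{\backslash n})^2 \cdot \nabla_{v^{\backslash n}} \ln Z_n - \Big\|v^{\backslash n} \cdot \rho_n \cdot \frac{1}{v^{\backslash n}+1} (x_n - m^{\backslash n})\Big\|^2 + D \cdot v^{\backslash n}\Big\}
\end{aligned}
$$

If we substitute $\nabla_{v^{\backslash n}} \ln Z_n$ into the expression above, we obtain

$$
v = v^{\backslash n} - \rho_n \frac{(v^{\backslash n})^2}{v^{\backslash n}+1} + \rho_n (1 - \rho_n) \frac{(v^{\backslash n})^2\, \|x_n - m^{\backslash n}\|^2}{D\, (v^{\backslash n}+1)^2}
$$

which is Eq (10.218), as required.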
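The closed-form results of Problems 10.38 and 10.39 can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: it works in one dimension ($D = 1$), the function names and the parameter values are hypothetical, and the closed-form variance is the expression obtained by substituting $\nabla_{v^{\backslash n}} \ln Z_n$ into the last line of the derivation above. It compares $Z_n$, $E[\theta]$ and $v$ against a brute-force Riemann sum over $q^{\backslash n}(\theta)\, p(x_n|\theta)$.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def ep_moment_update(x_n, m_cav, v_cav, w, alpha):
    """Closed-form Z_n, new mean and new variance for D = 1.

    m_cav and v_cav play the role of the cavity mean and variance,
    w is the clutter weight and alpha the clutter variance.
    """
    clutter = w * normal_pdf(x_n, 0.0, alpha)
    z_n = (1.0 - w) * normal_pdf(x_n, m_cav, v_cav + 1.0) + clutter
    rho = 1.0 - clutter / z_n
    m_new = m_cav + rho * v_cav / (v_cav + 1.0) * (x_n - m_cav)
    v_new = (v_cav
             - rho * v_cav ** 2 / (v_cav + 1.0)
             + rho * (1.0 - rho) * v_cav ** 2 * (x_n - m_cav) ** 2 / (v_cav + 1.0) ** 2)
    return z_n, m_new, v_new


def numeric_moments(x_n, m_cav, v_cav, w, alpha, half_width=12.0, steps=20001):
    """Riemann-sum estimate of Z_n and of the mean and variance of the tilted density."""
    sd = math.sqrt(v_cav)
    lo = m_cav - half_width * sd
    h = 2.0 * half_width * sd / (steps - 1)
    s0 = s1 = s2 = 0.0
    for k in range(steps):
        theta = lo + k * h
        likelihood = (1.0 - w) * normal_pdf(x_n, theta, 1.0) + w * normal_pdf(x_n, 0.0, alpha)
        value = normal_pdf(theta, m_cav, v_cav) * likelihood
        s0 += value
        s1 += theta * value
        s2 += theta * theta * value
    mean = s1 / s0
    return s0 * h, mean, s2 / s0 - mean ** 2


if __name__ == "__main__":
    # Hypothetical test values, chosen only for illustration.
    print(ep_moment_update(2.0, 0.5, 3.0, 0.3, 10.0))
    print(numeric_moments(2.0, 0.5, 3.0, 0.3, 10.0))
```

If the two printed triples agree to several decimal places, the algebra above is at least consistent in the scalar case; the check says nothing about $D > 1$.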
0.11 Sampling Methods

Problem 11.1 Solution

Based on the definition, we can write down:

$$
E[\hat{f}] = E\Big[\frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})\Big] = \frac{1}{L} \sum_{l=1}^{L} E[f(z^{(l)})] = \frac{1}{L} \cdot L \cdot E[f] = E[f]
$$

where we have used the linearity of expectation to exchange the expectation and the summation, and $E[f(z^{(l)})] = E[f]$ because all the $z^{(l)}$ are drawn from $p(z)$. Next, we deal with the variance:

$$
\begin{aligned}
var[\hat{f}] &= E[(\hat{f} - E[\hat{f}])^2] = E[\hat{f}^2] - E[\hat{f}]^2 = E[\hat{f}^2] - E[f]^2 \\
&= E\Big[\Big(\frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})\Big)^2\Big] - E[f]^2 \\
&= \frac{1}{L^2} E\Big[\Big(\sum_{l=1}^{L} f(z^{(l)})\Big)^2\Big] - E[f]^2 \\
&= \frac{1}{L^2} E\Big[\sum_{l=1}^{L} f^2(z^{(l)}) + \sum_{i,j=1,\, i \neq j}^{L} f(z^{(i)}) f(z^{(j)})\Big] - E[f]^2 \\
&= \frac{1}{L^2} E\Big[\sum_{l=1}^{L} f^2(z^{(l)})\Big] + \frac{L^2 - L}{L^2} E[f]^2 - E[f]^2 \\
&= \frac{1}{L^2} \sum_{l=1}^{L} E[f^2(z^{(l)})] - \frac{1}{L} E[f]^2 \\
&= \frac{1}{L^2} \cdot L \cdot E[f^2] - \frac{1}{L} E[f]^2 \\
&= \frac{1}{L} E[f^2] - \frac{1}{L} E[f]^2 = \frac{1}{L} E[(f - E[f])^2]
\end{aligned}
$$

Here the independence of the $z^{(l)}$ is used for the cross terms, where $E[f(z^{(i)}) f(z^{(j)})] = E[f]^2$ for $i \neq j$. This is just as required.

Problem 11.2 Solution

What this problem wants us to prove is that if we use $y = h^{-1}(z)$ to transform the value of $z$ to $y$, where $z$ is uniformly distributed over $[0, 1]$ and $h(\cdot)$ is defined by Eq (11.6), then $y$ follows the desired distribution $p(y)$. Let's prove it starting from Eq (11.1):

$$
p^{\star}(y) = p(z) \cdot \Big|\frac{dz}{dy}\Big| = 1 \cdot h'(y) = \frac{d}{dy} \int_{-\infty}^{y} p(\hat{y})\, d\hat{y} = p(y)
$$

Just as required.

Problem 11.3 Solution

We use what we have obtained in the previous problem:

$$
h(y) = \int_{-\infty}^{y} p(\hat{y})\, d\hat{y} = \int_{-\infty}^{y} \frac{1}{\pi} \frac{1}{1 + \hat{y}^2}\, d\hat{y} = \frac{1}{\pi} \tan^{-1}(y) + \frac{1}{2}
$$

Therefore, since $z = h(y) = \frac{1}{\pi} \tan^{-1}(y) + \frac{1}{2}$, we can obtain the transformation from $z$ to $y$: $y = \tan\big(\pi (z - \frac{1}{2})\big)$.

Problem 11.4 Solution

First, I believe there is a typo in Eq (11.10) and (11.11). Both $\ln z_1$ and $\ln z_2$ should be $\ln(z_1^2 + z_2^2)$. In the following, we will solve the problem under this assumption. We only need to calculate the Jacobian determinant $|\partial(z_1, z_2) / \partial(y_1, y_2)|$. Under the corrected form, both $y_1$ and $y_2$ depend on $z_1$ and $z_2$ through $r^2 = z_1^2 + z_2^2$, so we do not compute the partial derivatives $\partial z_i / \partial y_j$ one by one.
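Instead of computing the entries $\partial z_i / \partial y_j$ of the Jacobian matrix directly, we pass through polar coordinates, which is always convenient for a problem associated with a circle:

$$
z_1 = r \cos\theta, \quad z_2 = r \sin\theta
$$

It is easy to obtain:

$$
\frac{\partial(z_1, z_2)}{\partial(r, \theta)} = \begin{bmatrix} \partial z_1 / \partial r & \partial z_1 / \partial \theta \\ \partial z_2 / \partial r & \partial z_2 / \partial \theta \end{bmatrix} = \begin{bmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{bmatrix}
$$

Therefore, we can obtain:

$$
\Big|\frac{\partial(z_1, z_2)}{\partial(r, \theta)}\Big| = r (\cos^2\theta + \sin^2\theta) = r
$$

Then we substitute $r$ and $\theta$ into Eq (11.10), yielding:

$$
y_1 = r \cos\theta \Big(\frac{-2 \ln r^2}{r^2}\Big)^{1/2} = \cos\theta\, (-2 \ln r^2)^{1/2} \qquad (*)
$$

Similarly, we also have:

$$
y_2 = \sin\theta\, (-2 \ln r^2)^{1/2} \qquad (**)
$$

It is easy to obtain:

$$
\frac{\partial(y_1, y_2)}{\partial(r, \theta)} = \begin{bmatrix} \partial y_1 / \partial r & \partial y_1 / \partial \theta \\ \partial y_2 / \partial r & \partial y_2 / \partial \theta \end{bmatrix} = \begin{bmatrix} -2 \cos\theta\, (-2 \ln r^2)^{-1/2} \cdot r^{-1} & -\sin\theta\, (-2 \ln r^2)^{1/2} \\ -2 \sin\theta\, (-2 \ln r^2)^{-1/2} \cdot r^{-1} & \cos\theta\, (-2 \ln r^2)^{1/2} \end{bmatrix}
$$

Therefore, we can obtain:

$$
\Big|\frac{\partial(y_1, y_2)}{\partial(r, \theta)}\Big| = -2 r^{-1} (\cos^2\theta + \sin^2\theta) = -2 r^{-1}
$$

Next, we use the chain property of the Jacobian determinant:

$$
\begin{aligned}
\Big|\frac{\partial(z_1, z_2)}{\partial(y_1, y_2)}\Big| &= \Big|\frac{\partial(z_1, z_2)}{\partial(r, \theta)} \cdot \frac{\partial(r, \theta)}{\partial(y_1, y_2)}\Big| \\
&= \Big|\frac{\partial(z_1, z_2)}{\partial(r, \theta)}\Big| \cdot \Big|\frac{\partial(r, \theta)}{\partial(y_1, y_2)}\Big| \\
&= \Big|\frac{\partial(z_1, z_2)}{\partial(r, \theta)}\Big| \cdot \Big|\frac{\partial(y_1, y_2)}{\partial(r, \theta)}\Big|^{-1} \\
&= r \cdot (-2 r^{-1})^{-1} = -\frac{r^2}{2}
\end{aligned}
$$

By squaring both sides of $(*)$ and $(**)$ and adding them together, we can obtain:

$$
y_1^2 + y_2^2 = -2 \ln r^2 \quad \Rightarrow \quad r^2 = \exp\Big\{\frac{y_1^2 + y_2^2}{-2}\Big\}
$$

Finally, we can obtain:

$$
p(y_1, y_2) = p(z_1, z_2) \Big|\frac{\partial(z_1, z_2)}{\partial(y_1, y_2)}\Big| = \frac{1}{\pi} \cdot \Big|-\frac{r^2}{2}\Big| = \frac{1}{2\pi} r^2 = \frac{1}{2\pi} \exp\Big\{\frac{y_1^2 + y_2^2}{-2}\Big\}
$$

Just as required.

Problem 11.5 Solution

Since $y$ is a linear transformation of the Gaussian random variable $z$, it is still a Gaussian random variable. We only need to match its moments (mean and covariance). We know that $z \sim N(0, I)$, $\Sigma = L L^T$, and $y = \mu + L z$.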
Now, using $E[z] = 0$, we obtain:

$$
E[y] = E[\mu + L z] = \mu + L \cdot E[z] = \mu
$$

Moreover, using $cov[z] = E[z z^T] - E[z] E[z^T] = E[z z^T] = I$, we can obtain:

$$
\begin{aligned}
cov[y] &= E[y y^T] - E[y] E[y^T] \\
&= E[(\mu + L z) \cdot (\mu + L z)^T] - \mu \mu^T \\
&= E[\mu \mu^T + \mu (L z)^T + (L z) \mu^T + (L z)(L z)^T] - \mu \mu^T \\
&= \mu \cdot E[z^T] \cdot L^T + L \cdot E[z] \cdot \mu^T + E[L z z^T L^T] \\
&= L \cdot E[z z^T] \cdot L^T = L \cdot I \cdot L^T = \Sigma
\end{aligned}
$$

Just as required.

Problem 11.6 Solution

This problem is all about the definition. According to the description of rejection sampling, for a specific value $z_0$ (drawn from $q(z)$), we generate a random variable $u_0$ that is uniformly distributed in the interval $[0, k q(z_0)]$, and if the generated value of $u_0$ is less than $\tilde{p}(z_0)$, we accept $z_0$. Therefore, we obtain:

$$
P[accept | z_0] = \frac{\tilde{p}(z_0)}{k q(z_0)}
$$

Since $z_0$ is drawn from $q(z)$, we can obtain the total acceptance rate by integration:

$$
P[accept] = \int P[accept | z_0] \cdot q(z_0)\, dz_0 = \int \frac{\tilde{p}(z_0)}{k}\, dz_0
$$

It is identical to Eq (11.14). We substitute Eq (11.13) into the expression above, yielding:

$$
P[accept] = \frac{Z_p}{k}
$$

We define a very small vector $\epsilon$, and we can obtain:

$$
\begin{aligned}
P[x_0 \in (x, x + \epsilon)] &= P[z_0 \in (x, x + \epsilon) | accept] \\
&= \frac{P[accept, z_0 \in (x, x + \epsilon)]}{P[accept]} \\
&= \frac{\int_{(x, x+\epsilon)} q(z_0)\, P[accept | z_0]\, dz_0}{Z_p / k} \\
&= \int_{(x, x+\epsilon)} \frac{k}{Z_p} \cdot q(z_0) \cdot P[accept | z_0]\, dz_0 \\
&= \int_{(x, x+\epsilon)} \frac{k}{Z_p} \cdot q(z_0) \cdot \frac{\tilde{p}(z_0)}{k q(z_0)}\, dz_0 \\
&= \int_{(x, x+\epsilon)} \frac{1}{Z_p} \cdot \tilde{p}(z_0)\, dz_0 \\
&= \int_{(x, x+\epsilon)} p(z_0)\, dz_0
\end{aligned}
$$

Just as required. Several clarifications must be made here: (1) we have used $P[A]$ to represent the probability that event $A$ occurs, and $p(z)$ or $q(z)$ to represent a Probability Density Function (PDF). (2) Please be careful with $P[x_0 \in (x, x + \epsilon)] = P[z_0 \in (x, x + \epsilon) | accept]$, which is the key point of this problem.

Problem 11.7 Solution

Notice that the symbols used in the main text are different from those in the problem description. In the following, we will use those in the main text. Namely, $y$ is uniformly distributed on the interval $[0, 1]$, and $z = b \tan y + c$. We aim to prove Eq (11.16).
Since we know that:

$$
q(z) = p(y) \cdot \Big|\frac{dy}{dz}\Big|
$$

and that:

$$
y = \arctan\frac{z - c}{b} \quad \Rightarrow \quad \frac{dy}{dz} = \frac{1}{b} \cdot \frac{1}{1 + [(z - c)/b]^2}
$$

Substituting it back, we obtain:

$$
q(z) = 1 \cdot \frac{1}{b} \cdot \frac{1}{1 + [(z - c)/b]^2}
$$

In my view, Eq (11.16) is an expression for the comparison function $k q(z)$, not the proposal function $q(z)$. If we wish to use Eq (11.16) to express the proposal function, the numerator in Eq (11.16) should be $1/b$ instead of $k$, because the proposal function $q(z)$ is a PDF and should integrate to 1. However, in rejection sampling, the comparison function is what we actually care about.

Problem 11.8 Solution

There is a typo in Eq (11.17), which is not difficult to spot if we carefully examine Fig. 11.6. The correct form should be:

$$
q_i(z) = k_i \lambda_i \exp\{-\lambda_i (z - z_i)\}, \quad \tilde{z}_{i-1,i} < z \le \tilde{z}_{i,i+1}, \quad \text{where } i = 1, 2, ..., N
$$

Here we use $\tilde{z}_{i,i+1}$ to represent the intersection point of the $i$-th and $(i+1)$-th envelope, $q_i(z)$ to represent the comparison function of the $i$-th envelope, and $N$ is the total number of envelopes. Notice that $\tilde{z}_{0,1}$ and $\tilde{z}_{N,N+1}$ could be $-\infty$ and $\infty$ respectively.

First, from Fig. 11.6, we see that $q(z_i) = \tilde{p}(z_i)$. Substituting the expression above into this equation yields:

$$
k_i \lambda_i = \tilde{p}(z_i) \qquad (*)
$$

One important thing should be made clear: we can only evaluate $\tilde{p}(z)$ at a specific point $z$, but not the normalized PDF $p(z)$. This is the assumption of rejection sampling. For more details, please refer to Section 11.1.2.
Since $q_i(z)$ and $q_{i+1}(z)$ should have the same value at $\tilde{z}_{i,i+1}$, we obtain:

$$
k_i \lambda_i \exp\{-\lambda_i (\tilde{z}_{i,i+1} - z_i)\} = k_{i+1} \lambda_{i+1} \exp\{-\lambda_{i+1} (\tilde{z}_{i,i+1} - z_{i+1})\}
$$

After some rearrangement, we obtain:

$$
\tilde{z}_{i,i+1} = \frac{1}{\lambda_i - \lambda_{i+1}} \Big\{\ln \frac{k_i \lambda_i}{k_{i+1} \lambda_{i+1}} + \lambda_i z_i - \lambda_{i+1} z_{i+1}\Big\} \qquad (**)
$$

Before moving on, we should make some clarifications: adaptive rejection sampling begins with several grid points, e.g., $z_1, z_2, ..., z_N$, and then we evaluate the derivative of $\ln \tilde{p}(z)$ at those points, whose negative values give $\lambda_1, \lambda_2, ..., \lambda_N$. Then we can easily obtain $k_i$ based on $(*)$, and next the intersection points $\tilde{z}_{i,i+1}$ based on $(**)$.

Problem 11.9 Solution

In this problem, we use the same notation as in the previous one. First, we need to know the probability of sampling from each segment. Since Eq (11.17) is not correctly normalized, we first calculate its normalization constant $Z_q$:

$$
\begin{aligned}
Z_q &= \int_{\tilde{z}_{0,1}}^{\tilde{z}_{N,N+1}} q(z)\, dz = \sum_{i=1}^{N} \int_{\tilde{z}_{i-1,i}}^{\tilde{z}_{i,i+1}} q_i(z)\, dz \\
&= \sum_{i=1}^{N} \int_{\tilde{z}_{i-1,i}}^{\tilde{z}_{i,i+1}} k_i \lambda_i \exp\{-\lambda_i (z - z_i)\}\, dz \\
&= \sum_{i=1}^{N} -k_i \exp\{-\lambda_i (z - z_i)\} \Big|_{\tilde{z}_{i-1,i}}^{\tilde{z}_{i,i+1}} \\
&= \sum_{i=1}^{N} -k_i \Big[\exp\{-\lambda_i (\tilde{z}_{i,i+1} - z_i)\} - \exp\{-\lambda_i (\tilde{z}_{i-1,i} - z_i)\}\Big] = \sum_{i=1}^{N} \hat{k}_i
\end{aligned}
$$

where we have defined:

$$
\hat{k}_i = -k_i \Big[\exp\{-\lambda_i (\tilde{z}_{i,i+1} - z_i)\} - \exp\{-\lambda_i (\tilde{z}_{i-1,i} - z_i)\}\Big] \qquad (*)
$$

From this derivation, we know that the probability of sampling from the $i$-th segment is given by $\hat{k}_i / Z_q$, where $Z_q = \sum_{i=1}^{N} \hat{k}_i$. Therefore, we define an auxiliary random variable $\eta$, which is uniform on the interval $[0, 1]$, and then define:

$$
i = j \quad \text{if } \eta \in \Big[\frac{1}{Z_q} \sum_{m=0}^{j-1} \hat{k}_m,\ \frac{1}{Z_q} \sum_{m=0}^{j} \hat{k}_m\Big], \quad j = 1, 2, ..., N \qquad (**)
$$

where we have defined $\hat{k}_0 = 0$ for convenience. So far, we have decided which segment $i$ to sample from. Next, we should sample from the $i$-th exponential distribution using the technique in Section 11.1.1.
According to Eq (11.6), we can write down:

$$
\begin{aligned}
h_i(z) &= \int_{\tilde{z}_{i-1,i}}^{z} \frac{q_i(\hat{z})}{\hat{k}_i}\, d\hat{z} \\
&= \frac{1}{\hat{k}_i} \cdot \int_{\tilde{z}_{i-1,i}}^{z} k_i \lambda_i \exp\{-\lambda_i (\hat{z} - z_i)\}\, d\hat{z} \\
&= \frac{-k_i}{\hat{k}_i} \cdot \exp\{-\lambda_i (\hat{z} - z_i)\} \Big|_{\tilde{z}_{i-1,i}}^{z} \\
&= \frac{-k_i}{\hat{k}_i} \cdot \Big[\exp\{-\lambda_i (z - z_i)\} - \exp\{-\lambda_i (\tilde{z}_{i-1,i} - z_i)\}\Big] \\
&= \frac{k_i}{\hat{k}_i} \cdot \exp(\lambda_i z_i) \Big[\exp\{-\lambda_i \tilde{z}_{i-1,i}\} - \exp\{-\lambda_i z\}\Big]
\end{aligned}
$$

Notice that $q_i(z)$ is not correctly normalized, and $q_i(z) / \hat{k}_i$ is the correctly normalized form. Setting $\xi = h_i(z)$ and solving for $z$, we can obtain:

$$
h_i^{-1}(\xi) = \frac{1}{-\lambda_i} \cdot \ln\Big[\exp\{-\lambda_i \tilde{z}_{i-1,i}\} - \frac{\xi\, \hat{k}_i}{k_i \cdot \exp(\lambda_i z_i)}\Big]
$$

In conclusion, we first generate a random variable $\eta$, which is uniform on the interval $[0, 1]$, and obtain the value $i$ according to $(**)$. Then we generate a random variable $\xi$, which is also uniform on the interval $[0, 1]$, and transform it to $z$ using $z = h_i^{-1}(\xi)$. Notice that $\lambda_i$, $\tilde{z}_{i,i+1}$ and $k_i$ can be obtained once the grid points $z_1, z_2, ..., z_N$ are given. For more details, please refer to the previous problem. After these variables are obtained, $\hat{k}_i$ can also be determined using $(*)$, and thus $h_i^{-1}(\xi)$ can be determined.

Problem 11.10 Solution

Based on the definition and Eq (11.34)-(11.36), we can write down:

$$
\begin{aligned}
E_{\tau}[(z^{(\tau)})^2] &= 0.5 \cdot E_{\tau-1}[(z^{(\tau-1)})^2] + 0.25 \cdot E_{\tau-1}[(z^{(\tau-1)} + 1)^2] + 0.25 \cdot E_{\tau-1}[(z^{(\tau-1)} - 1)^2] \\
&= E_{\tau-1}[(z^{(\tau-1)})^2] + 0.5
\end{aligned}
$$

If the initial state is $z^{(0)} = 0$ (there is a typo in the line below Eq (11.36)), we can obtain $E_{\tau}[(z^{(\tau)})^2] = \tau / 2$, just as required.

Problem 11.11 Solution

This problem requires you to know the definition of detailed balance, i.e., Eq (11.40):

$$
p^{\star}(z)\, T(z, z') = p^{\star}(z')\, T(z', z)
$$

Note that here $z$ and $z'$ are the sampled values of $[z_1, z_2, ..., z_M]^T$ in two consecutive Gibbs sampling steps.
Without loss of generality, we assume that we are now updating $z_j^{\tau}$ to $z_j^{\tau+1}$ in step $\tau$:

$$
\begin{aligned}
p^{\star}(z)\, T(z, z') &= p(z_1^{\tau}, z_2^{\tau}, ..., z_M^{\tau}) \cdot p(z_j^{\tau+1} | z_{\backslash j}^{\tau}) \\
&= p(z_j^{\tau} | z_{\backslash j}^{\tau}) \cdot p(z_{\backslash j}^{\tau}) \cdot p(z_j^{\tau+1} | z_{\backslash j}^{\tau}) \\
&= p(z_j^{\tau} | z_{\backslash j}^{\tau+1}) \cdot p(z_{\backslash j}^{\tau+1}) \cdot p(z_j^{\tau+1} | z_{\backslash j}^{\tau+1}) \\
&= p(z_j^{\tau} | z_{\backslash j}^{\tau+1}) \cdot p(z_1^{\tau+1}, z_2^{\tau+1}, ..., z_M^{\tau+1}) \\
&= T(z', z) \cdot p^{\star}(z')
\end{aligned}
$$

To be more specific, we write down the first line based on Gibbs sampling, where $z_{\backslash j}^{\tau}$ denotes all the entries of the vector $z^{\tau}$ except $z_j^{\tau}$. In the second line, we use the product rule, i.e., $p(a, b) = p(a | b)\, p(b)$, for the first term. In the third line, we use the fact that $z_{\backslash j}^{\tau} = z_{\backslash j}^{\tau+1}$. Then, in the fourth line, we use the product rule in reverse for the last two terms, and finally obtain what has been asked.

Problem 11.12 Solution

Obviously, Gibbs sampling is not ergodic for this specific distribution. The quick reason is that the projections of the two shaded regions overlap neither on the $z_1$ axis nor on the $z_2$ axis. For instance, denote the lower-left shaded region as region 1. If the initial sample falls into this region, then no matter how many steps are carried out, all the generated samples will stay in region 1. The same holds for the upper-right region.

Problem 11.13 Solution

Let's begin from the definition.
$$
\begin{aligned}
p(\mu | x, \tau, \mu_0, s_0) &\propto p(x | \mu, \tau, \mu_0, s_0) \cdot p(\mu | \tau, \mu_0, s_0) \\
&= p(x | \mu, \tau) \cdot p(\mu | \mu_0, s_0) \\
&= N(x | \mu, \tau^{-1}) \cdot N(\mu | \mu_0, s_0)
\end{aligned}
$$

where in the first line we have used Bayes' Theorem:

$$
p(\mu | x, c) \propto p(x | \mu, c) \cdot p(\mu | c)
$$

Now, using Eq (2.113)-Eq (2.117), we can obtain $p(\mu | x, \tau, \mu_0, s_0) = N(\mu | \mu^{\star}, s^{\star})$, where we have defined:

$$
[s^{\star}]^{-1} = s_0^{-1} + \tau, \quad \mu^{\star} = s^{\star} \cdot (\tau \cdot x + s_0^{-1} \mu_0)
$$

It is similar for $p(\tau | x, \mu, a, b)$:

$$
\begin{aligned}
p(\tau | x, \mu, a, b) &\propto p(x | \tau, \mu, a, b) \cdot p(\tau | \mu, a, b) \\
&= p(x | \mu, \tau) \cdot p(\tau | a, b) \\
&= N(x | \mu, \tau^{-1}) \cdot Gam(\tau | a, b)
\end{aligned}
$$

Based on Section 2.3.6, especially Eq (2.150)-(2.151), we can obtain $p(\tau | x, \mu, a, b) = Gam(\tau | a^{\star}, b^{\star})$, where we have defined:

$$
a^{\star} = a + 0.5, \quad b^{\star} = b + 0.5 \cdot (x - \mu)^2
$$

Problem 11.14 Solution

Based on the definition, we can write down:

$$
\begin{aligned}
E[z_i'] &= E[\mu_i + \alpha (z_i - \mu_i) + \sigma_i (1 - \alpha^2)^{1/2} \nu] \\
&= \mu_i + E[\alpha (z_i - \mu_i)] + E[\sigma_i (1 - \alpha^2)^{1/2} \nu] \\
&= \mu_i + \alpha \cdot E[z_i - \mu_i] + [\sigma_i (1 - \alpha^2)^{1/2}] \cdot E[\nu] \\
&= \mu_i
\end{aligned}
$$

where we have used the fact that the mean of $z_i$ is $\mu_i$, i.e., $E[z_i] = \mu_i$, and that the mean of $\nu$ is 0, i.e., $E[\nu] = 0$. Then we deal with the variance:

$$
\begin{aligned}
var[z_i'] &= E[(z_i' - \mu_i)^2] \\
&= E[(\alpha (z_i - \mu_i) + \sigma_i (1 - \alpha^2)^{1/2} \nu)^2] \\
&= E[\alpha^2 (z_i - \mu_i)^2] + E[\sigma_i^2 (1 - \alpha^2) \nu^2] + E[2 \alpha (z_i - \mu_i) \cdot \sigma_i (1 - \alpha^2)^{1/2} \nu] \\
&= \alpha^2 \cdot E[(z_i - \mu_i)^2] + \sigma_i^2 (1 - \alpha^2) \cdot E[\nu^2] + 2 \alpha \cdot \sigma_i (1 - \alpha^2)^{1/2} \cdot E[(z_i - \mu_i) \nu] \\
&= \alpha^2 \cdot var[z_i] + \sigma_i^2 (1 - \alpha^2) \cdot (var[\nu] + E[\nu]^2) + 2 \alpha \cdot \sigma_i (1 - \alpha^2)^{1/2} \cdot E[z_i - \mu_i] \cdot E[\nu] \\
&= \alpha^2 \cdot \sigma_i^2 + \sigma_i^2 (1 - \alpha^2) \cdot 1 + 0 \\
&= \sigma_i^2
\end{aligned}
$$

where we have used the fact that $z_i$ and $\nu$ are independent, and thus $E[(z_i - \mu_i) \nu] = E[z_i - \mu_i] \cdot E[\nu] = 0$.

Problem 11.15 Solution

Using Eq (11.57), we can write down:

$$
\frac{\partial H}{\partial r_i} = \frac{\partial K}{\partial r_i} = r_i
$$

Comparing this with Eq (11.53), we obtain Eq (11.58). Similarly, still using Eq (11.57), we can obtain:

$$
\frac{\partial H}{\partial z_i} = \frac{\partial E}{\partial z_i}
$$

Comparing this with Eq (11.55), we obtain Eq (11.59).
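The mean and variance derived in Problem 11.14 can be checked with a small Monte Carlo experiment. The following is a minimal sketch, not part of the original solution: the function names, the seed and the parameter values are hypothetical, and $z_i$ is assumed to be drawn from its conditional $N(\mu_i, \sigma_i^2)$ before the over-relaxed update is applied with an independent $\nu \sim N(0, 1)$.

```python
import math
import random


def over_relaxed_update(z, mu, sigma, alpha, rng):
    """One over-relaxation step: z' = mu + alpha (z - mu) + sigma sqrt(1 - alpha^2) nu."""
    nu = rng.gauss(0.0, 1.0)  # nu has zero mean and unit variance, independent of z
    return mu + alpha * (z - mu) + sigma * math.sqrt(1.0 - alpha ** 2) * nu


def empirical_moments(mu, sigma, alpha, n_samples=200_000, seed=0):
    """Estimate E[z'] and E[(z' - mu)^2] when z is drawn from N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu, sigma)
        z_new = over_relaxed_update(z, mu, sigma, alpha, rng)
        total += z_new
        total_sq += (z_new - mu) ** 2
    return total / n_samples, total_sq / n_samples


if __name__ == "__main__":
    # Hypothetical values: mu = 1, sigma = 2, alpha = -0.8 (requires -1 < alpha < 1).
    mean, var = empirical_moments(1.0, 2.0, -0.8)
    print(mean, var)  # should be close to mu and sigma**2 respectively
```

The estimates fluctuate with the seed and the sample size, so they should be compared with $\mu_i$ and $\sigma_i^2$ only up to Monte Carlo error.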
Problem 11.16 Solution

According to Bayes' Theorem and Eq (11.54), (11.63), we have:

$$
p(r | z) = \frac{p(z, r)}{p(z)} = \frac{1 / Z_H \cdot \exp(-H(z, r))}{1 / Z_p \cdot \exp(-E(z))} = \frac{Z_p}{Z_H} \cdot \exp(-K(r))
$$

where we have used Eq (11.57). Moreover, by Eq (11.56), we conclude that $p(r | z)$ is a Gaussian distribution.

Problem 11.17 Solution

There are typos in Eq (11.68) and (11.69). The signs in the exponential in the second argument of the min function are not right. To be more specific, Eq (11.68) should be:

$$
\frac{1}{Z_H} \exp(-H(R))\, \delta V\, \frac{1}{2} \min\{1, \exp(H(R) - H(R'))\} \qquad (*)
$$

and Eq (11.69) is given by:

$$
\frac{1}{Z_H} \exp(-H(R'))\, \delta V\, \frac{1}{2} \min\{1, \exp(H(R') - H(R))\} \qquad (**)
$$

When $H(R) = H(R')$, they are clearly equal. When $H(R) > H(R')$, $(*)$ reduces to:

$$
\frac{1}{Z_H} \exp(-H(R))\, \delta V\, \frac{1}{2}
$$

because the min function gives 1, and in this case $(**)$ gives:

$$
\frac{1}{Z_H} \exp(-H(R'))\, \delta V\, \frac{1}{2} \exp(H(R') - H(R)) = \frac{1}{Z_H} \exp(-H(R))\, \delta V\, \frac{1}{2}
$$

Therefore, they are identical, and the case $H(R) < H(R')$ is similar.

0.12 Continuous Latent Variables

Problem 12.1 Solution

By analogy to Eq (12.2), we can conclude that the projected data with respect to a vector $u_{M+1}$ should have variance given by $u_{M+1}^T S u_{M+1}$. Moreover, there are two constraints for $u_{M+1}$: (1) it should be correctly normalized, i.e., $u_{M+1}^T u_{M+1} = 1$, and (2) it should be orthogonal to all the previous $M$ chosen eigenvectors $\{u_1, u_2, ..., u_M\}$. We aim to maximize the variance with respect to $u_{M+1}$ subject to these two constraints.
This can be done by introducing Lagrange multipliers:

$$
L = u_{M+1}^T S u_{M+1} + \lambda (1 - u_{M+1}^T u_{M+1}) + \sum_{m=1}^{M} \eta_m u_m^T u_{M+1} \qquad (*)
$$

Therefore, we can calculate its derivative with respect to $u_{M+1}$ and set it to zero:

$$
\frac{\partial L}{\partial u_{M+1}} = 2 S u_{M+1} - 2 \lambda u_{M+1} + \sum_{m=1}^{M} \eta_m u_m = 0
$$

We left multiply by $u_m^T$, yielding:

$$
\begin{aligned}
2 u_m^T S u_{M+1} - 2 \lambda u_m^T u_{M+1} + u_m^T \sum_{m'=1}^{M} \eta_{m'} u_{m'} &= 2 u_m^T S u_{M+1} - 0 + \eta_m \\
&= 2 u_{M+1}^T S u_m + \eta_m \\
&= 2 u_{M+1}^T \lambda_m u_m + \eta_m \\
&= \eta_m
\end{aligned}
$$

where we have used the orthogonality property, and in the second line we have transposed the first term and used the fact that $S$ is symmetric. So now we obtain $\eta_m = 0$. This directly reduces $(*)$ to the form shown in Eq (12.4), and consequently we need to choose the eigenvector of $S$ with the largest eigenvalue among those not yet chosen.

Problem 12.2 Solution (Waiting for update)

Problem 12.3 Solution

According to Eq (12.30), we can obtain:

$$
u_i^T u_i = \frac{1}{N \lambda_i} v_i^T X X^T v_i
$$

We left multiply both sides of Eq (12.28) by $v_i^T$, yielding:

$$
\frac{1}{N} v_i^T X X^T v_i = \lambda_i v_i^T v_i = \lambda_i \|v_i\|^2 = \lambda_i
$$

Here we have used the fact that $v_i$ has unit length. Substituting it back into $u_i^T u_i$, we can obtain:

$$
u_i^T u_i = 1
$$

Just as required.

Problem 12.4 Solution

We know $p(z) = N(z | m, \Sigma)$ and $p(x | z) = N(x | W z + \mu, \sigma^2 I)$. According to Eq (2.113)-(2.115), we have:

$$
p(x) = N(x | W m + \mu, \sigma^2 I + W \Sigma W^T) = N(x | \hat{\mu}, \sigma^2 I + \hat{W} \hat{W}^T)
$$

where we have defined:

$$
\hat{\mu} = W m + \mu \quad \text{and} \quad \hat{W} = W \Sigma^{1/2}
$$

Therefore, in the general case, the final form of $p(x)$ can still be written as Eq (12.35).

Problem 12.5 Solution
