Full Solution Manual for
“Probabilistic Machine Learning: An Introduction”
Kevin Murphy
November 21, 2021
1
1
Solutions
2
Part I
Foundations
3
2
Solutions
2.1
Conditional independence
PRIVATE
1. Bayes’ rule gives
P (H|E1 , E2 ) =
P (E1 , E2 |H)P (H)
P (E1 , E2 )
(1)
Thus the information in (ii) is sufficient. In fact, we don’t need P (E1 , E2 ) because it is equal to the
normalization constant (to enforce the sum to one constraint). (i) and (iii) are insufficient.
2. Now the equation simplifies to
P (H|E1 , E2 ) =
P (E1 |H)P (E2 |H)P (H)
P (E1 , E2 )
(2)
so (i) and (ii) are obviously sufficient. (iii) is also sufficient, because we can compute P (E1 , E2 ) using
normalization.
2.2
Pairwise independence does not imply mutual independence
We provide two counter examples.
Let X1 and X2 be independent binary random variables, and X3 = X1 ⊕ X2 , where ⊕ is the XOR
operator. We have p(X3 |X1 , X2 ) 6= p(X3 ), since X3 can be deterministically calculated from X1 and X2 . So
the variables {X1 , X2 , X3 } are not mutually independent. However, we also have p(X3 |X1 ) = p(X3 ), since
without X2 , no information can be provided to X3 . So X1 ⊥ X3 and similarly X2 ⊥ X3 . Hence {X1 , X2 , X3 }
are pairwise independent.
Here is a different example. Let there be four balls in a bag, numbered 1 to 4. Suppose we draw one at
random. Define 3 events as follows:
• X1 : ball 1 or 2 is drawn.
• X2 : ball 2 or 3 is drawn.
• X3 : ball 1 or 3 is drawn.
We have p(X1 ) = p(X2 ) = p(X3 ) = 0.5. Also, p(X1 , X2 ) = p(X2 , X3 ) = p(X1 , X3 ) = 0.25. Hence
p(X1 , X2 ) = p(X1 )p(X2 ), and similarly for the other pairs. Hence the events are pairwise independent.
However, p(X1 , X2 , X3 ) = 0 6= 1/8 = p(X1 )p(X2 )p(X3 ).
2.3
Conditional independence iff joint factorizes
PRIVATE
Independency ⇒ Factorization. Let g(x, z) = p(x|z) and h(y, z) = p(y|z). If X ⊥ Y |Z then
p(x, y|z) = p(x|z)p(y|z) = g(x, z)h(y, z)
4
(3)
Factorization ⇒ Independency. If p(x, y|z) = g(x, z)h(y, z) then
X
X
X
X
1=
p(x, y|z) =
g(x, z)h(y, z) =
g(x, z)
h(y, z)
x,y
p(x|z) =
X
x,y
p(x, y|z) =
X
y
p(y|z) =
x
g(x, z)h(y, z) = g(x, z)
y
X
p(x, y|z) =
X
x
X
X
h(y, z)
(5)
g(x, z)
(6)
y
g(x, z)h(y, z) = h(y, z)
x
p(x|z)p(y|z) = g(x, z)h(y, z)
X
x
g(x, z)
X
x
(7)
g(y, z)
y
(8)
= g(x, z)h(y, z) = p(x, y|z)
2.4
(4)
y
Convolution of two Gaussians is a Gaussian
We follow the derivation of [Jay03, p221]. Define
φ(x − µ|σ) = N (x|µ, σ 2 ) =
1 x−µ
φ(
)
σ
σ
(9)
where φ(z) is the pdf of the standard normal. We have
(10)
p(y) = N (x1 |µ1 , σ12 ) ⊗ N (x2 |µ2 , σ22 )
Z
= dx1 φ(x1 − µ1 |σ1 )φ(y − (x1 − µ2 )|σ2 )
(11)
Now the product inside the integral can be written as follows
"
(
2 2 #)
1
x1 − µ1
1
y − x1 − µ2
φ(x1 − µ1 |σ1 )φ(y − (x1 − µ2 )|σ2 ) =
exp −
+
2πσ1 σ2
2
σ1
σ2
(12)
We can bring out the dependency on x1 by rearranging the quadratic form inside the exponent as follows
x − µ1
σ1
2
+
y − (x1 − µ2 )
σ2
2
= (w1 + w2 )(x1 − x̂) +
w1 w2
(y − (µ1 − µ2 ))2
w1 + w2
(13)
where we have defined the precision or weighting terms w1 = 1/σ12 , w2 = 1/σ22 , and the term
x̂ =
Note that
w1 µ1 + w2 y − w2 µ2
w1 + w2
(14)
1
w1 w2
σ2 σ2
1
= 2 2 21 2 2 = 2
w1 + w2
σ1 σ2 σ1 + σ2
σ1 + σ22
(15)
Hence
p(y) =
1
2πσ1 σ2
Z
1
1 w1 w2
exp − (w1 + w2 )(x1 − x̂)2 −
(y − (µ1 − µ2 ))2 dx1
2
2 w1 + w2
(16)
The integral over x1 becomes one over the normalization constant for the Gaussian:
Z
1
1
exp[− (w1 + w2 )(x1 − x̂)2 ] dx1 = (2π) 2
2
5
σ22 σ22
σ12 + σ22
21
(17)
Hence
21
1
σ22 σ22
(y − µ1 − µ2 )2 ]
exp[−
σ12 + σ22
2(σ12 + σ22 )
− 1
1
(y − µ1 − µ2 )2 ]
σ12 + σ22 2 exp[−
2
2(σ1 + σ22 )
1
1
p(y) = (2π)−1 (σ12 σ22 )− 2 (2π) 2
1
= (2π)− 2
(19)
(20)
= N (y|µ1 + µ2 , σ12 + σ22 )
2.5
(18)
Expected value of the minimum of two rv’s
PRIVATE
Let Z = min(X, Y ). We have P (Z > z) = P (X > z, Y > z) = P (X > z)p(Y > z) = (1 − z)2 = 1 + z 2 − 2z.
Hence the cdf of the minimum is P (Z ≤ z) = 2z − z 2 and the pdf is p(z) = 2 − 2z. Hence the expected value
is
Z
1
(21)
z(2 − 2z) = 1/3
E [Z] =
0
2.6
Variance of a sum
We have
(22)
V [X + Y ] = E[(X + Y )2 ] − (E[X] + E[Y ])2
2
2
2
2
= E[X + Y + 2XY ] − (E[X] + E[Y ] + 2E[X]E[Y ])
(23)
= E[X ] − E[X] + E[Y ] − E[Y ] + 2E[XY ] − 2E[X]E[Y ]
(24)
= V [X] + V [Y ] + 2Cov [X, Y ]
(25)
2
2
2
2
If X and Y are independent, then Cov [X, Y ] = 0, so V [X + Y ] = V [X] + V [Y ].
2.7
Deriving the inverse gamma density
PRIVATE
We have
dx
|
dy
(26)
dx
1
= − 2 = −x2
dy
y
(27)
ba a−1 −xb
x
e
Γ(a)
(28)
py (y) = px (x)|
where
So
py (y) = x2
ba a+1 −xb
x e
Γ(a)
ba −(a+1) −b/y
=
y
e
= IG(y|a, b)
Γ(a)
=
6
(29)
(30)
2.8
Mean, mode, variance for the beta distribution
For the mode we can use simple calculus, as follows. Let f (x) = Beta(x|a, b). We have
df
dx
= (a − 1)xa−2 (1 − x)b−1 − (b − 1)xa−1 (1 − x)b−2
0=
= (a − 1)(1 − x) − (b − 1)x
a−1
x=
a+b−2
(31)
(32)
(33)
(34)
For the mean we have
Z 1
E [θ|D] =
θp(θ|D)dθ
Z
Γ(a + b)
=
θ(a+1)−1 (1 − θ)b−1 dθ
Γ(a)Γ(b)
Γ(a + b) Γ(a + 1)Γ(b)
a
=
=
Γ(a)Γ(b) Γ(a + 1 + b)
a+b
(35)
0
where we used the definition of the Gamma function and the fact that Γ(x + 1) = xΓ(x).
We can find the variance in the same way, by first showing that
Z
Γ(a + b) 1 Γ(a + 2 + b) (a+2)−1
θ
(1 − θ)b−1 dθ
E θ2 =
Γ(a)Γ(b) 0 Γ(a + 2)Γ(b)
a
a+1
Γ(a + b) Γ(a + 2 + b)
=
=
Γ(a)Γ(b) Γ(a + 2)Γ(b)
a+ba+1+b
2
2
Now we use V [θ] = E θ − E [θ] and E [θ] = a/(a + b) to get the variance.
2.9
(36)
(37)
(38)
(39)
Bayes rule for medical diagnosis
PRIVATE
Let T = 1 represent a positive test outcome, T = 0 represent a negative test outcome, D = 1 mean you have
the disease, and D = 0 mean you don’t have the disease. We are told
P (T = 1|D = 1) = 0.99
(40)
P (T = 0|D = 0) = 0.99
(41)
P (D = 1) = 0.0001
(42)
We are asked to compute P (D = 1|T = 1), which we can do using Bayes’ rule:
P (T = 1|D = 1)P (D = 1)
P (T = 1|D = 1)P (D = 1) + P (T = 1|D = 0)P (D = 0)
0.99 × 0.0001
=
0.99 × 0.0001 + 0.01 × 0.9999
= 0.009804
P (D = 1|T = 1) =
(43)
(44)
(45)
So although you are much more likely to have the disease (given that you have tested positive) than a random
member of the population, you are still unlikely to have it.
7
2.10
Legal reasoning
Let E be the evidence (the observed blood type), and I be the event that the defendant is innocent, and
G = ¬I be the event that the defendant is guilty.
1. The prosecutor is confusing p(E|I) with p(I|E). We are told that p(E|I) = 0.01 but the relevant
quantity is p(I|E). By Bayes rule, this is
p(I|E) =
p(E|I)p(I)
0.01p(I)
=
p(E|I)p(I) + p(E|G)p(G)
0.01p(I) + (1 − p(I))
(46)
since p(E|G) = 1 and p(G) = 1 − p(I). So we cannot determine p(I|E) without knowing the prior
probability p(I). So p(E|I) = p(I|E) only if p(G) = p(I) = 0.5, which is hardly a presumption of
innocence.
To understand this more intuitively, consider the following isomorphic problem (from http://en.
wikipedia.org/wiki/Prosecutor’s_fallacy):
A big bowl is filled with a large but unknown number of balls. Some of the balls are made of
wood, and some of them are made of plastic. Of the wooden balls, 100 are white; out of the
plastic balls, 99 are red and only 1 are white. A ball is pulled out at random, and observed
to be white.
Without knowledge of the relative proportions of wooden and plastic balls, we cannot tell how likely it
is that the ball is wooden. If the number of plastic balls is far larger than the number of wooden balls,
for instance, then a white ball pulled from the bowl at random is far more likely to be a white plastic
ball than a white wooden ball — even though white plastic balls are a minority of the whole set of
plastic balls.
2. The defender is quoting p(G|E) while ignoring p(G). The prior odds are
1
p(G)
=
p(I)
799, 999
(47)
p(G|E)
1
=
p(I|E)
7999
(48)
The posterior odds are
So the evidence has increased the odds of guilt by a factor of 1000. This is clearly relevant, although
perhaps still not enough to find the suspect guilty.
2.11
Probabilities are sensitive to the form of the question that was used to
generate the answer
PRIVATE
1. The event space is shown below, where X is one child and Y the other.
X Y Prob.
G G 1/4
G B 1/4
B G 1/4
B B 1/4
8
Let Ng be the number of girls and Nb the number of boys. We have the constraint (side information)
that Nb + Ng = 2 and 0 ≤ Nb , Ng ≤ 2. We are told Nb ≥ 1 and are asked to compute the probability of
the event Ng = 1 (i.e., one child is a girl). By Bayes rule we have
p(Nb ≥ 1|Ng = 1)p(Ng = 1)
p(Nb ≥ 1)
1 × 1/2
=
= 2/3
3/4
p(Ng = 1|Nb ≥ 1) =
(49)
(50)
2. Let Y be the identity of the observed child and X be the identity of the other child. We want
p(X = g|Y = b). By Bayes rule we have
p(Y = b|X = g)p(X = g)
p(Y = b)
(1/2) × (1/2)
= 1/2
=
1/2
p(X = g|y = b) =
(51)
(52)
Tom Minka [Min98] has written the following about these results:
This seems like a paradox because it seems that in both cases we could condition on the fact that
"at least one child is a boy." But that is not correct; you must condition on the event actually
observed, not its logical implications. In the first case, the event was "He said yes to my question."
In the second case, the event was "One child appeared in front of me." The generating distribution
is different for the two events. Probabilities reflect the number of possible ways an event can
happen, like the number of roads to a town. Logical implications are further down the road and
may be reached in more ways, through different towns. The different number of ways changes the
probability.
2.12
Normalization constant for a 1D Gaussian
Following the first hint we have
Z 2π Z ∞
r2
r exp − 2 drdθ
2σ
0
0
Z 2π Z ∞
r2
=
dθ
r exp − 2 dr
2σ
0
0
Z2 =
(54)
(55)
= (2π)I
where I is the inner integral
(53)
Z ∞
I=
0
r2
r exp − 2
2σ
(56)
Following the second hint we have
I = −σ
2
Z
−
h
r −r2 /2σ2
e
dr
σ2
i
∞
2
2
= −σ 2 e−r /2σ
= −σ 2 [0 − 1] = σ
0
2
(57)
(58)
(59)
Hence
Z 2 = 2πσ 2
p
Z = σ (2π)
9
(60)
(61)
3
Solutions
3.1
Uncorrelated does not imply independent
PRIVATE
We have
(62)
E[X] = 0
(63)
2
V [X] = (1 − −1) /12 = 1/3
2
(64)
2
E[Y ] = E[X ] = V [X] + (E[X]) = 1/3 + 0
Z 0
Z 1
Z
1
−(
x3 dx) + (
x3 dx)
E[XY ] = p(x)x3 dx =
2
−1
0
1 −1 4 0
1 41
1
=
[x ]−1 + [x ]0 = [−1 + 1] = 0
2 4
4
8
(65)
(66)
where we have split the integral into two pieces, since x3 changes sign in the interval [−1, 1]. Hence
ρ=
3.2
Cov [X, Y ] E[XY ] − E[X]E[Y ]
0
=
=0
σX σY
σX σY
σX σY
(67)
Correlation coefficient is between -1 and +1
We have
X
Y
+
σX
σY
X
Y
X Y
=V
+V
+ 2Cov
,
σX
σY
σX σY
X Y
V [X] V [Y ]
+
+
2Cov
,
=
2
σX
σY2
σX σY
(68)
0≤V
(69)
(70)
(71)
= 1 + 1 + 2ρ = 2(1 + ρ)
Hence ρ ≥ −1. Similarly,
0≤V
X
Y
−
σX
σY
(72)
= 2(1 − ρ)
so ρ ≤ 1.
3.3
Correlation coefficient for linearly related variables is ±1
PRIVATE
Let E [X] = µ and V [X] = σ 2 , and assume a > 0. Then we have
(73)
E [Y ] = E [aX + b] = aE [X] + b = aµ + b
2
2 2
V [Y ] = V [aX + b] = a V [X] = a σ
E [XY ] = E [X(aX + b)] = E aX 2 + bX = aE X 2 + bE [X] = a(µ2 + σ 2 ) + bµ
2
2
Cov [X, Y ] = E [XY ] − E [X] E [Y ] = a(µ + σ ) + bµ − µ(aµ + b) = aσ
2
Cov [X, Y ]
aσ
ρ= p
=√ √
=1
2
V [X] V [Y ]
σ a2 σ 2
√
If Y = aX + b, where a < 0, we find ρ = −1, since a/ a2 = −a/|a| = −1.
10
2
(74)
(75)
(76)
(77)
3.4
Linear combinations of random variables
1. Let y = Ax. Then
Cov [y] = E (y − E [y])(y − E [y])T
= E (Ax − Am)(Ax − Am)T
= AE (x − m)(x − m)T AT
= AΣA
(78)
(79)
(80)
(81)
T
P
P
2. C = AB has entries cij = k aik bkj . The diagonalPelements of C (when i = j) are given by k aik bki .
So the sum of the diagonal elements is tr(AB) = ik aik bki which is symmetric in A and B.
3. We have
E xT Ax = E tr(xT Ax) = E tr(AxxT )
= tr(AE xxT ) = tr(A(Σ + mmT ))
T
= tr(AΣ) + m Am
3.5
(82)
(83)
(84)
Gaussian vs jointly Gaussian
1. For the mean, we have
E [Y ] = E [W X] = E [X] E [X] = 0
(85)
V [Y ] = E [V [Y |W ]] + V [E [Y |W ]]
(86)
For the variance, we have
= E [W [V [X]]W ] + V [W E [X]]
= E W2 + 0 = 1
(87)
(88)
To show it’s Gaussian, we note that Y is a linear combination of Gaussian rv’s.
2. To show that Cov [X, Y ] = 0, we use the rule of iterated expectation. First we have
E [XY ] = E [E [XY |W ]]
X
=
p(w)E [XY |w]
(89)
(90)
w∈{−1,1}
= −1 · 0.5 · E [X · −X] + 1 · 0.5 · E [X · X]
(91)
== 0
(92)
E [Y ] = E [E [Y |W ]]
X
=
p(w)E [Y |w]
(93)
Then we have
(94)
w∈{−1,1}
= 0.5 · E [−X] + 0.5 · E [X]
(95)
=0
(96)
Hence
Cov [X, Y ] = E [XY ] − E [X] E [Y ] = 0
So X and Y are uncorrelated even though they are dependent.
11
(97)
3.6
Normalization constant for a multidimensional Gaussian
Let Σ = UΛUT , so
Σ
−1
=U
−T
−1
Λ
U
−1
−1
= UΛ
U=
p
X
1
i=1
λi
ui uTi
(98)
(x − µ)
(99)
Hence
T
(x − µ) Σ
−1
T
(x − µ) = (x − µ)
p
X
1
i=1
=
p
X
1
λ
i=1 i
(x − µ)
T
λi
!
ui uTi
ui uTi (x − µ) =
p
X
y2
i
λ
i=1 i
(100)
where yi , uTi (x − µ). The y variables define a new coordinate system that is shifted (by µ) and rotated
(by U) with respect to the original x coordinates: y = U(x − µ). Hence x = UT y + µ.
The Jacobian of this transformation, from y to x, is a matrix with elements
Jij =
so J = UT and |J| = 1.
So
Z
∂xi
= Uji
∂yj
1
exp(− (x − µ)T Σ−1 (x − µ))dx =
2
=
(101)
Z Y
exp(−
i
1 X yi2
)dyi |J|
2 i λi
Yp
2πλi = |2πΣ|
(102)
(103)
i
3.7
Sensor fusion with known variances in 1d
Define the sufficient statistics as
y1 =
n1
n2
1 X
1 X
(1)
(2)
yi , y 2 =
y ,
n1 i=1
n2 i=1 i
(104)
Define the prior as
(105)
µµ = 0, Σµ = ∞
Define the likelihood as
v1
1
0
A=
, µy =
, Σy = n 1
1
0
0
0
v2
n2
Now we just apply the equations. The posterior precision is a sum of the precisions of each sensor:
nv1
0
n1
n2
1
−1
T −1
1
Σµ|y = A Σy A = 1 1
=
+
0 nv22
1
v1
v2
The posterior mean is a weighted sum of the observed values from each sensor:
nv1
0
n1 y 1
n2 y 2
y1
−1
−1
1
µµ|y = Σµ|y 1 1
= Σµ|y
+
0 nv22
y2
v1
v2
12
(106)
(107)
(108)
3.8
Show that the Student distribution can be written as a Gaussian scale
mixture
Z ∞
p(w|µ, a, b) =
(109)
N (w|µ, α−1 )Ga(α|a, b)dα
0
Z ∞
α
ba e−bα αa−1 α 21
exp − (w − µ)2 dα
Γ(a)
2π
2
0
1 Z
∞
1
ba
1
1 2
=
αa− 2 exp −α b + (w − µ)2 dα
Γ(a) 2π
2
0
=
(110)
(111)
Let us define
= b + (w − µ)2 /2 and z = α∆. Then, using dz = ∆dα, and the definition of the Gamma
R∞ ∆
x−1 −u
function, 0 u
e du = Γ(x), the integral becomes
Z ∞
0
1
1
z a+ 12 −1 −z −1
e ∆ dz = ∆−a− 2 Γ(a + )
∆
2
(112)
so we have
Γ(a + 1/2) a
b
p(w|µ, a, b) =
Γ(a)
1
2π
12
1
∆−a− 2
(113)
Let us define a = ν/2 and b = ν/(2λ). Then we have
1 −(ν+1)/2
(w − µ)2
Γ((ν + 1)/2) ν ν/2 1 2 ν
+
Γ(ν/2)
2λ
2π
2λ
2
−(ν+1)/2
12 Γ((ν + 1)/2) ν ν/2 1
ν
λ
=
1 + (w − µ)2
Γ(ν/2)
2λ
2π
2λ
ν
−(ν+1)/2
12 Γ((ν + 1)/2) λ
λ
=
1 + (w − µ)2
Γ(ν/2)
νπ
ν
p(w|µ, a, b) =
(114)
(115)
(116)
(117)
= Tν (w|µ, λ−1 )
13
4
Solutions
4.1
MLE for the univariate Gaussian
PRIVATE
Starting with the mean, we have
2 X
∂`
=
(xi − µ) = 0
∂µ
2σ 2 i
(118)
N
µ̂ =
1 X
xi = x
N i=1
(119)
We can derive σ̂ 2 as follows.
X
∂`
1
n
= σ −4
(xi − µ̂)2 − 2 = 0
2
∂σ
2
2σ
i
(120)
N
1 X
(xi − µ̂)2
N i=1
"
#
X
1 X 2 X 2
1 X 2
=
xi +
µ̂ − 2
xi µ̂ = [
xi + N µ̂2 − 2N µ̂2 ]
N
N
i
i
i
i
!
N
1 X 2
x − (x)2
=
N i=1 i
σ̂ 2 =
4.2
(121)
(122)
(123)
MAP estimation for 1D Gaussians
PRIVATE
1. Since the mean and mode of a Gaussian are the same, we can just write down the MAP estimate using
the standard equations for the posterior mean of a Gaussian, which is
ns2
σ2
m
(124)
E[µ|D] =
x+
ns2 + σ 2
ns2 + σ 2
However, we can also derive the MAP estimate explicitly, as follows. The prior distribution over the
mean is
(µ − m)2
p(µ) = (2πs2 )−1/2 exp[−
]
(125)
2s2
Since the samples xi are taken to be independent, we have:
1
p(D|θ) = e− 2σ2
Pn
2
i=1 (xi −µ)
(126)
And combining them:
n
1
(µ − m)2
n
1 X
2
log p(µ)p(D|µ) = − log(2πs2 ) −
−
log(2πσ
)
−
(xi − µ)2
2
2s2
2
2σ 2 i=1
14
(127)
Taking the derivative with respect to µ :
n
1 X
∂ log(p(µ)p(D|µ))
2(µ − m)(1)
−
0
−
=0−
(2)(xi − µ)(−1)
∂µ
2s2
2σ 2 i=1
(128)
n
µ−m
1 X
=− 2 + 2
(xi − µ)
s
σ i=1
(129)
n
m
1 X
nµ
µ
xi − 2
=− 2 + 2 + 2
s
s
σ i=1
σ
(130)
Setting this derivative to zero in order to find the maximum:
0=−
n
µ̂M AP
m
1 X
nµ̂M AP
+
+
xi −
s2
s2
σ 2 i=1
σ2
(131)
n
µ̂M AP (
n
1
1 X
m
+ 2) = 2
xi + 2
2
σ
s
σ i=1
s
Pn
Pn
1
m
s2 i=1 xi + mσ 2
i=1 xi + s2
σ2
µ̂M AP =
=
n
1
ns2 + σ 2
σ 2 + s2
P
n
ns2
σ2
i=1 xi
=
+
m
ns2 + σ 2
n
ns2 + σ 2
(132)
(133)
(134)
2
2
2. Notice that as n increases, while s, m and σ remain constant, we have ns2σ+σ2 → 0 while nsns
2 +σ 2 → 1,
P
x
yielding µ̂M AP → ni i = µ̂M LE . That is, when there are many samples, the prior knowledge becomes
less and less relevant, and all MAP estimators (for any prior) converge to the maximum likelihood
estimator µ̂M LE . This is not surprising: as we have more data, it outweighs our prior speculations.
2
3. As the prior variance s2 increases, even for a small number of samples n, we have ns2σ+σ2 → 0 while
ns2
ns2 +σ 2 → 1, again yielding µ̂M AP →
P
i xi
= µ̂M LE . That is, when we have an uninformative, almost
uniform, prior, we are left only with our data to base out estimation on.
n
2
4. On the other hand, if the prior variance is very small, s2 → 0, then we have ns2σ+σ2 → 1 while
ns2
ns2 +σ 2
→ 0, yielding µ̂M AP → m. That is, if our prior is very concentrated, we essentially already
know the answer, and can ignore the data.
4.3
Gaussian posterior credible interval
We want an interval that satisfies
p(` ≤ µn ≤ u|D) ≥ 0.95
(135)
` = µn + Φ−1 (0.025)σn = µn − 1.96σn
(136)
where
u = µn + Φ
−1
(0.9755)σn = µn + 1.96σn
(137)
where Φ is the cumulative distribution function for the standard normal N (0, 1) distribution, and Φ−1 (0.025) is
the value below which 2.5% of the probability mass lies (in Matlab, norminv(0.025)=-1.96). and Φ−1 (0.975)
is the value below which 97.5% of the probability mass lies (in Matlab, norminv(0.975)=1.96). We want to
find n such that
u−`=1
(138)
15
Hence we solve
2(1.96)σn = 1
σn2 =
1
4(1.96)2
(139)
(140)
where
σ 2 σ02
nσ02 + σ 2
(141)
nσ02 + σ 2 = (σ 2 σ02 )4(1.96)2
(142)
σn2 =
Hence
2
σ (σ02 4(1.96)2 − 1)
n=
σ02
2
=
4(9 × (1.96) − 1)
= 61.0212
9
(143)
(144)
Hence we need at least n ≥ 62 samples.
4.4
BIC for Gaussians
PRIVATE
1. Using the fact that Σ̂M L = S, we have
N −1 N
tr Σ̂ S −
log(|Σ̂|)
2
2
N
N
= − tr(ID ) −
log(|Σ̂|)
2
2
N
D + log |Σ̂|
=−
2
log p(D|θ̂M L ) = −
(145)
(146)
(147)
There are D free parameters for the mean and D(D + 1)/2 for the covariance, so the BIC score is
BIC = −
d
N
D + log |Σ̂| − log N
2
2
(148)
where d = D + D(D + 1)/2.
2. Let Sdiag = diag(diag(S)) = Σ̂diag . Then
−1
tr(Σ̂−1
diag S) = tr(Σ̂diag Sdiag ) = tr(ID ) = D
because the off-diagonal elements of the matrix product get ignored. Hence
BIC = −
d
N
D + log |Σ̂| − log N
2
2
where d = D + D parameters (for the mean and diagonal).
16
(149)
4.5
BIC for a 2d discrete distribution
1. The joint distribution is p(x, y|θ) = p(x|θ1 )p(y|x, θ2 ):
x=0
x=1
y=0
(1 − θ1 )θ2
θ1 (1 − θ2 )
y=1
(1 − θ1 )(1 − θ2 )
θ1 θ2
2. The log likelihood is
log p(D|θ) =
X
log p(xi |θ1 ) +
X
i
(150)
log p(yi |xi , θ2 )
i
Hence we can optimize each term separately. For θ1 , we have
P
I(xi = 1)
N (x = 1)
4
=
= = 0.5714
θ̂1 = i
n
N
7
For θ2 , we have
P
θ̂2 =
i I(xi = yi )
n
=
(151)
N (x = y)
4
=
N
7
(152)
The likelihood is
3
4
3
4
p(D|θ̂, M2 ) = ( )N (x=1) ( )N (x=0) ( )N (x=y) ( )N (x6=y)
7
7
7
7
4 4 3 3 4 4 3 3
=( ) ( ) ( ) ( )
7 7 7 7
4 8 3 6
= ( ) ( ) ≈ 7.04 × 10−5
7 7
(153)
(154)
(155)
3. The table of joint counts is
y=0 y=1
x=0 2
1
x=1 2
2
We can think of this as a multinomial distribution with 4 states. Normalizing the counts gives the MLE:
y=0 y=1
x = 0 2/7
1/7
2/7
x = 1 2/7
The likelihood is
2 1 2 2
N (x=0,y=0) N (x=0,y=1) N (x=1,y=0) N (x=1,y=1)
θ01
θ10
θ11
= ( )2 ( )1 ( )2 ( )2
p(D|θ̂, M4 ) = θ00
7
7
7
2 1
= ( )6 ( )1 ≈ 7.77 × 10−5
7 7
7
(156)
(157)
Thus is higher than the previous likelihood, because the model has more parameters.
4. For M4 , when we omit case 7, we will have θ̂01 = 0, so p(x7 , y7 |m4 , θ̂) = 0, so L(m4 ) = −∞. However,
L(m2 ) will be finite, since all counts remain non zero when we leave out a single case. Hence CV will
prefer M2 , since M4 is overfitting.
5. The BIC score is
BIC(m) = log p(D|θ̂, m) −
dof(m)
log n
2
(158)
where n = 7. For M2 , we have dof = 2, so
4
3
2
BIC(m2 ) = 8 log( ) + 6 log( ) − log 7 = −11.5066
7
7
2
17
(159)
For M4 , we have dof = 3 because of the sum-to-one constraint, so
2
1
3
BIC(m4 ) = 6 log( ) + 1 log( ) − log 7 = −12.3814
7
7
2
(160)
So BIC also prefers m2 .
4.6
A mixture of conjugate priors is conjugate
PRIVATE
We now show that, if the prior is a mixture, so is the posterior:
p(D|θ)p(θ)
p(D)
P
p(D|θ) k p(Z = k)p(θ|Z = k)
P
=R
p(D|θ0 ) k0 p(Z = k 0 )p(θ0 |Z = k 0 )dθ0
P
k)p(D, θ|Z = k)
k p(Z = R
=P
0
p(D, θ0 |Z = k 0 )dθ0
0 p(Z = k )
Pk
p(Z = k)p(θ|D, Z = k)p(D|Z = k)
= k P
0
0
k0 p(Z = k )p(D|Z = k )
X
p(Z = k)p(D|Z = k)
P
=
p(θ|D, Z = k)
0
0
k0 p(Z = k )p(D|Z = k )
k
X
=
p(Z = k|D)p(θ|D, Z = k)
(161)
p(θ|D) =
(162)
(163)
(164)
(165)
(166)
k
where p(Z = k) = πk are the prior mixing weights, and p(Z = k|D) are the posterior mixing weights given by
p(Z = k)p(D|Z = k)
0
0
k0 p(Z = k )p(D|Z = k )
(167)
p(Z = k|D) = P
4.7
2
is biased
ML estimator σmle
Because the variance of any random variable R is given by var(R) = E[R2 ]−(E[R])2 , the expected value of the
square of a Gaussian random variable Xi with mean µ and variance σ 2 is E[Xi2 ] = var(Xi )+(E[Xi ])2 = σ 2 +µ2 .
n
1X
(Xi −
EX1 ,...,Xn ∼N (µ,σ) [σ (X1 , . . . , Xn )] = E[
n i=1
Pn
2
=
1X
nE[(Xi −
n i=1
1X
=
nE[(Xi −
n i=1
j=1 Xj 2
) ]
n
Pn
j=1 Xj 2
) ]
n
Pn
j=1 Xj
n
Pn
)(Xi −
j=1 Xj
n
)]
n
n
n
1X
2 X
1 XX
2
=
nE[Xi − Xi
Xj + 2
Xj Xk ]
n i=1
n j=1
n j=1
k=1
n
n
n
n
n
2 XX
1 XXX
1X
nE[Xi2 ] − 2
E[Xi Xj ] + 3
[Xj Xk ]
=
n i=1
n i=1 j=1
n i=1 j=1
k=1
18
Pn Pn
Pn Pn
Consider the two summations i=1 j=1 E[Xi Xj ] and j=1 k=1 [Xj Xk ]. Of the n2 terms in each of
these summations, n of them satisfy i = j or j = k, so these terms are of the form E[Xi2 ]. By linearity
of expectation, these terms contribute nE[Xi2 ] to the sum. The remaining n2 − n terms are of the form
E[Xi Xj ] or E[Xj Xk ] for i =
6 j or j =
6 k. Because the Xi are independent samples, it follows from linearity of
expectation that these terms contribute (n2 − n)E[Xi ]E[Xj ] to the summation.
n X
n
X
i=1 j=1
E[Xi Xj ] =
n X
n
X
E[Xj Xk ]
j=1 k=1
= nE[Xi2 ] + (n2 − n)E[Xi ][Xj ]
= n(σ 2 + µ2 ) + (n2 − n)µµ = nσ 2 + nµ2 + n2 µ2 − nµ2
= nσ 2 + n2 µ2
EX1 ,...,Xn ∼N (µ,σ) [σ 2 (X1 , . . . , Xn )] =
n
2
1 X
1X
= 1n (σ 2 + µ2 ) − 2 (nσ 2 + n2 µ2 ) + 3
(nσ 2 + n2 µ2 )
n i
n
n i=1
1
σ2
1
(nσ 2 + nµ2 ) − 2
− 2µ2 + 3 (n2 σ 2 + N 3 µ2 )
n
n
n
σ2
σ2
− 2µ2 +
+ µ2
= σ 2 + µ2 − 2
n
n
σ2
n−1 2
= σ2 −
=
σ
n
n
=
Since the expected value of σˆ2 (X1 , . . . , Xn ) is not equal to the actual variance σ 2 , σ 2 is not an unbiased
estimator. In fact, the maximum likelihood estimator tends to underestimate the variance. This is not
surprising: consider the case of only a single sample: we will never detect any variance. If there are multiple
samples, we will detect variance, but since our estimate for the mean will tend to be shifted from the true
mean in the direction of our samples, we will tend to underestimate the variance.
4.8
Estimation of σ 2 when µ is known
PRIVATE
2
We can re-do the derivation of σ̂M
LE , but instead of taking the maximum over µ , we just use the known
µ. We get:
n
1X
σ̂ 2 =
(xi − µ)2
n i=1
In calculating the expected value of the estimator, note that we get exactly the definition of a variance of a
19
random variable:
n
EX1 ,...,Xn ∼N (µ,σ) [σ̂ 2 (X1 , X2 , . . . , Xn ) =
1X
EXi ∼N (µ,σ) [(Xi − µ)2 ]
n i=1
n
=
1X
V ar[N (µ, σ)]
n i=1
=
1X 2
σ
n i=1
n
= σ2
So the estimator is unbiased.
4.9
Variance and MSE of the unbiased estimator for Gaussian variance
PRIVATE
We have
N −1 2
σ
unb = 2(N − 1)
σ2
(N − 1)2 2 V σunb = 2(N − 1)
σ4
2 2σ 4
V σunb
=
N −1
V
Hence the variance of the ML estimator is
2
2 N −1 2
N −1
2σ 4
2(N − 1) 4
V σ̂M L = V
σ̂N −1 =
=
σ
N
N
N −1
N2
(168)
(169)
(170)
(171)
Finally, we have
2 2
2
MSE(σmle
) = bias(σmle
)2 + V σmle
2
(172)
−σ 2 2(N − 1) 4
) +
σ
N
N2
2N − 1 4
σ
=
N2
(173)
2 2
2
MSE(σunb
) = bias(σunb
)2 + V σunb
(175)
2σ 4
=0+
N −1
(176)
=(
(174)
and
20
5
Solutions
5.1
Reject option in classifiers
1. We have to choose between rejecting, with risk λr , and choosing the most probable class, jmax =
arg maxj p(Y = j|x), which has risk
X
λs
p(Y = j|x) = λs (1 − p(Y = jmax |x))
(177)
j6=jmax
Hence we should pick jmax if
λr ≥ λs (1 − p(Y = jmax |x))
λr
≥ (1 − p(Y = jmax |x))
λs
λr
p(Y = jmax |x) ≥ 1 −
λs
(178)
(179)
(180)
otherwise we should reject.
For completeness, we should prove that when we decide to choose a class (and not reject), we always
pick the most probable one. If we choose a non-maximal category k 6= jmax , the risk is
X
λs
p(Y = j|x) = λs (1 − p(Y = k|x)) ≥ λs (1 − p(Y = jmax |x))
(181)
j6=k
which is always bigger than picking jmax .
2. If λr /λs = 0, there is no cost to rejecting, so we always reject. As λr /λs → 1, the cost of rejecting
increases. We find p(Y = jmax |x) ≥ 1 − λλrs is always satisfied, so we always accept the most probable
class.
5.2
Newsvendor problem
PRIVATE
Z ∞
Z Q
(P − C)Qf (D)dD +
Eπ(Q) =
Q
Z Q
(P − C)Df (D) −
0
C(Q − D)f (D)dD
(182)
0
Z Q
= (P − C)Q(1 − F (Q)) + P
Z Q
Df (D) − CQ
0
f (D)dD
(183)
0
Z Q
= (P − C)Q(1 − F (Q)) + P
Df (D) − CQF (Q)
(184)
0
Hence
d
Eπ(Q) = (P − C)(1 − F (Q)) − (P − C)Qf (Q) + P Qf (Q) − CF (Q) − CQf (Q)
dQ
= (P − C) − P F (Q)
(185)
(186)
(187)
=0
So
F (Q) =
P −C
P
21
(188)
5.3
Bayes factors and ROC curves
PRIVATE
p(H1 |D) is a monotonically increasing function of B, and therefore cannot affect the shape of the ROC
curve. This makes sense intuitively, since, in the two class case, it should not matter whether we threshold the
ratio p(H1 |D)/p(H0 |D), or the posterior, p(H1 |D), since they contain the same information, just measured
on different scales.
5.4
Posterior median is optimal estimate under L1 loss
To prove this, we expand the posterior expected loss as follows:
Z
Z
ρ(a|x) = Eθ|x |θ − a| =
(θ − a)p(θ|x)dθ +
(a − θ)p(θ|x)dθ
θ≥a
θ≤a
Z ∞
Z a
=
(θ − a)p(θ|x)dθ +
(a − θ)p(θ|x)dθ
(189)
(190)
−∞
a
Now recall the rule to differentiate under the integral sign:
d
da
Z B(a)
Z B(a)
φ(a, θ)dθ =
A(a)
φ0 (a, θ)dθ + φ(a, B(a))B 0 (a) + φ(a, A(a))A0 (a)
(191)
A(a)
d
where φ0 (a, θ) = da
φ(a, θ). Applying this to the first integral in Equation 190, with A(a) = a, B(a) = ∞,
φ(a, θ) = (θ − a)p(θ|x)dy, we have
Z ∞
Z ∞
(θ − a)p(θ|x)dθ =
−p(θ|x)dθ + 0 + 0
(192)
a
a
Analogously, one can show
Z a
Z a
(a − θ)p(θ|x)dθ =
−∞
p(θ|x)dθ
(193)
−∞
Hence
Z ∞
Z a
p(θ|x)dθ
(194)
= −P (θ ≥ a|x) + P (θ ≤ a|x) = 0
(195)
0
ρ (a|x) = −
p(θ|x)dθ +
−∞
a
So the value of a that makes ρ0 (a|x) = 0 satisfies
P (θ ≥ a|x) = P (θ ≤ a|x)
Hence the optimal a is the posterior median.
22
(196)
6
Solutions
6.1
Expressing mutual information in terms of entropies
PRIVATE
I(X, Y ) =
X
p(x, y) log
p(x, y)
p(x)p(y)
(197)
p(x, y) log
p(x|y)
p(x)
(198)
x,y
=
X
x,y
=−
X
X
p(x, y) log p(x) +
x,y
(199)
p(x, y) log p(x|y)
x,y
!
=−
X
p(x) log p(x) −
−
X
x
x,y
X
X
(200)
p(x, y) log p(x|y)
!
=−
p(x) log p(x) −
−
x
p(y)
X
y
p(x|y) log p(x|y)
(201)
x
(202)
= H(X) − H(X|Y )
We can show I(X, Y ) = H(Y ) − H(Y |X) by symmetry.
For the second. Using H(Y |X) = H(X, Y ) − H(X) we have
I(X; Y ) = H(X) + H(Y ) − H(X, Y )
(203)
Hence adding and subtracting H(Y ) we have
H(X, Y ) = H(X, Y ) + H(X, Y ) − H(X, Y )
= (H(X, Y ) − H(Y )) + H(X, Y ) + (H(Y ) − H(X, Y ))
(204)
(205)
Now adding and subtracting H(X) we have
H(X, Y ) = (H(X, Y ) − H(Y )) + (H(X, Y ) − H(X)) + (H(Y ) + H(X) − H(X, Y ))
(206)
Simplifying yields
H(X, Y ) = H(X|Y ) + H(Y |X) + I(X, Y )
23
(207)
6.2
Relationship between D(p||q) and χ2 statistic
We have
D(p||q) =
X
p(x)
q(x)
p(x) log
x
(208)
∆(x)
=
(Q(x) + ∆(x)) log 1 +
q(x)
x
X
∆(x) ∆(x)2
−
+ ···
=
(Q(x) + ∆(x))
q(x)
2q(x)
x
X
=
X
∆(x) +
x
=0+
X ∆(x)2
x
since
X
2q(x)
∆(x) =
x
6.3
∆(x)2
∆(x)2
−
+ ···
q(x)
2q(x)
+ ···
X
p(x) − q(x) = 0
(209)
(210)
(211)
(212)
(213)
x
Fun with entropies
PRIVATE
Here is p(x, y) and the marginals p(x) and p(y) written in the margins (hence the name).
x
p(y)
1
2
3
4
1 1/8
1/16 1/32 1/32 1/4
y
2 1/16 1/8
1/32 1/32 1/4
3 1/16 1/16 1/16 1/16 1/4
4 1/4
0
0
0
1/4
p(x)
1/2
1/4
1/8
1/8
Here is p(x|y) and H(X|y)
x
H(X|y)
1
2
3
4
1 1/2 1/4 1/8 1/8 7/4
y 2 1/4 1/2 1/8 1/8 7/4
3 1/4 1/4 1/4 1/4 2
4 1
0
0
0
0
1. H(X, Y ) = 27/8 = 3.375
2. H(X) = 7/4 = 1.75, H(Y ) = 2.0.
3. H(X|y) = 1.75, 1.75, 2.0, 0.0. Note that H(X|Y = 4) = 0 < H(X), so observing Y = 4 decreases
our entropy, but H(X|Y = 3) = 2.0 > H(X), so observing Y = 3 increases our entropy. Also,
H(X|Y = 1) = H(X|Y = 2) = H(X), so observing Y = 1 or Y = 2 does not change our entropy (but
does change our posterior distribution!).
4. H(X|Y ) = 11/8 = 1.375 < H(X), so on average, we are less uncertain after seeing Y .
5. I(X, Y ) = H(X) − HX|Y ) = 3/8 = 0.375
24
6.4
Forwards vs reverse KL divergence
1. We have
KL (pkq) =
X
(214)
p(x, y)[log p(x, y) − log q(x) − log q(y)]
xy
=
X
p(x, y) log p(x, y) −
X
xy
p(x) log q(x) −
x
X
p(y) log q(y)
(215)
y
We can
P optimize wrt q(x) and q(y) separately. Imposing a Lagrange multiplier to enforce the constraint
that x q(x) = 1 we have the Lagrangian
L(q, λ) =
X
p(x) log q(x) + λ(1 −
x
X
q(x))
(216)
x
Taking derivatives wrt q(x) (thinking of the function as a finite length vector, for simplicity), we have
∂L
p(x)
=
−λ=0
∂q(x)
q(x)
p(x)
q(x) =
λ
(217)
(218)
Summing both sides over x we get λ = 1 and hence
(219)
q(x) = p(x)
Analogously, q(y) = p(y).
2. We require q(x, y) = 0 whenever p(x, y) = 0, otherwise log q(x, y)/p(x, y) = ∞. Since q(x, y) =
qx (x)qy (y), it must be that qx (x) = qy (y) whenever x = y, and hence qx = qy are the same distribution.
There are only 3 possible distributions that put 0s in the right places and yet sum to 1. The first is:
x
1 2 3 4 q(y)
1
0 0 0 0 0
0 0 0 0 0
y 2
3
0 0 1 0 1
4
0 0 0 0 0
q(x) 0 0 1 0
The second one is
y
1
2
3
4
q(x)
1
0
0
0
0
0
x
2
0
0
0
0
0
3
0
0
0
0
0
4
0
0
0
1
1
q(y)
0
0
0
1
1
For both of these, we have KL (qkp) = 1 × log 1/4
= log 4. Furthermore, any slight perturbation of
these probabilities away from the designated values will cause the KL to blow up, meaning these are
local minima.
The final local optimum is
25
y
1
2
3
4
q(x)
1
1/4
1/4
0
0
1/2
x
2
1/4
1/4
0
0
1/2
3
0
0
0
0
0
4
0
0
0
0
0
q(y)
1/2
1/2
0
0
This has KL (qkp) = 4( 14 log 1/4
1/8 ) = log 2, so this is actually the global optimum.
To see that there are no other solutions, one can do a case analysis, and see that any other distribution
will not put 0s in the right places. For example, consider this:
x
1
2 3
4 q(y)
1
1/4 0 1/4 0 1/2
y 2
0
0 0
0 0
1/4 0 1/4 0 1/2
3
4
0
0 0
0 0
q(x) 1/2 0 1/2 0
Obviously if we set q(x, y) = p(x)p(y) = 1/16, we get KL (qkp) = ∞.
26
7
Solutions
7.1
Orthogonal matrices
1. Let c = cos(α) and s = sin(α). Using the fact that c2 + s2 = 1, we have
2
1
c s 0
c −s 0
c + s2 + 0 −cs + sc + 0 0
RT R = −s c 0 s c 0 = −sc + sc + 0 c2 + s2 + 0 0 = 0
0 0 1
0 0 1
0
0
1
0
0
1
0
0
0
1
(220)
2. The z-axis v = (0, 0, 1) is not affected by a rotation around z. We can easily check that (0, 0, 1) is an
eigenvector with eigenvalue 1 as follows:
c s 0
0
0
−s c 0 0 = 0
(221)
0 0 1
1
1
Of course, (0, 0, −1) is also a valid solution. We can check this using the symbolic math toolbox in
Matlab:
syms c s x y z
S=solve(’c*x-s*y=x’,’s*x+c*y=y’,’x^2+y^2+z^2=1’)
>> S.x
ans =
0
0
>> S.y
ans =
0
0
>> S.z
ans =
1
-1
If we ignore the unit norm constraint, we find that (0, 0, z) is a solution for any z ∈ R. We can see this
as follows:
c s 0
x
x
−s c 0 y = y
(222)
0 0 1
z
z
becomes
cx − sy = x
(223)
sx + cy = y
(224)
z=z
(225)
Solving gives
x(c − 1)
s
cx(x − 1)
sx + cy = sx +
s
x(c − 1)
y=
s
and hence x = y = 0. We can check this in Matlab:
y=
27
(226)
(227)
(228)
S=solve(’c*x-s*y=x’,’s*x+c*y=y’)
S =
x: [1x1 sym]
y: [1x1 sym]
>> S.x
0
>> S.y
0
7.2
Eigenvectors by hand
PRIVATE
In matlab, we find
>> [V,D]=eig([2 0; 0 3])
V =
1
0
0
1
D =
2
0
0
3
so λ1 = 2, λ2 = 3, v1 = (1, 0)T , v2 = (0, 1)T . Let us check this by hand. We want to solve
Avi = λi vi
(A − λi I)vi = 0
(229)
(230)
The characteristic equation is
|A − λI| = 0
0
=0
3 − λ2
(231)
(2 − λ1 )(3 − λ2 ) = 0
(233)
2 − λ1
0
so we see λ1 = 2,λ2 = 3. Solving for the eigenvectors
2 0
vi1
v
= λi i1
0 3
vi2
vi2
(232)
(234)
yields
2v11 = 2(v11 + v12 )
(235)
3v22 = 3(v21 + v22 )
(236)
so v11 = 1, v12 = 0, v21 = 0, v22 = 1.
28
8
Solutions
8.1
Subderivative of the hinge loss function
PRIVATE
At x = 0 and x = 2 the function is differentiable and ∂f (0) = {−1} and ∂f (2) = {0}. At x = 1 there is a
kink, and ∂f (1) = [−1, 0].
8.2
EM for the Student distribution
At first blush, it might not be apparent why EM can be used, since there is no missing data. The key idea
is to introduce an “artificial” hidden or auxiliary variable in order to simplify the algorithm. In particular,
we will exploit the fact that a Student distribution can be written as a Gaussian scale mixture [AM74;
Wes87] as follows:
Z
ν ν
T (xn |µ, Σ, ν) = N (xn |µ, Σ/zn )Ga(zn | , )dzn
(237)
2 2
This can be thought of as an “infinite” mixture of Gaussians, each one with a slightly different covariance
matrix.
Treating the zn as missing data, we can write the complete data log likelihood as
`c (θ) =
N
X
[log N (xn |µ, Σ/zn ) + log Ga(zn |ν/2, ν/2)]
(238)
n=1
N X
1
zn
ν
ν
ν
D
− log(2π) − log |Σ| − δn + log − log Γ( )
2
2
2
2
2
2
n=1
ν
D
+ (log zn − zn ) + ( − 1) log zn
2
2
=
(239)
(240)
where we have defined the Mahalanobis distance to be
δn = (xn − µ)T Σ−1 (xn − µ)
(241)
We can partition this into two terms, one involving µ and Σ, and the other involving ν. We have, dropping
irrelevant constants,
(242)
`c (θ) = LN (µ, Σ) + LG (ν)
N
X
1
1
LN (µ, Σ) , − N log |Σ| −
zn δn
2
2 n=1
N
1
1 X
LG (ν) , −N log Γ(ν/2) + N ν log(ν/2) + ν
(log zn − zn )
2
2 n=1
(243)
(244)
Let us first derive the algorithm with ν assumed known, for simplicity. In this case, we can ignore the LG
term, so we only need to figure out how to compute E [zn ] wrt the old parameters.
One can show that
p(zn |xn , θ) = Ga(zn |
ν + D ν + δn
,
)
2
2
(245)
Now if zn ∼ Ga(a, b), then E [zn ] = a/b. Hence the E step at iteration t is
h
i
ν (t) + D
(t)
z (t)
=
n , E zn |xn , θ
(t)
ν (t) + δn
29
(246)
The M step is obtained by maximizing E [LN (µ, Σ)] to yield
(t+1)
µ̂
P
(t)
n
= P
z n xn
(247)
(t)
n zn
1 X (t)
z (xn − µ̂(t+1) )(xn − µ̂(t+1) )T
N n n
!
#
"
N
X
1 X (t)
z xn xTn −
z (t)
µ̂(t+1) (µ̂(t+1) )T
=
n
N n n
n=1
Σ̂(t+1) =
(248)
(249)
These results are quite intuitive: the quantity z n is the precision of measurement n, so if it is small,
the corresponding data point is down-weighted when estimating the mean and covariance. This is how the
Student achieves robustness to outliers.
To compute the MLE for the degrees of freedom, we first need to compute the expectation of LG (ν),
which involves zn and log zn . Now if zn ∼ Ga(a, b), then one can show that
h
i
(t)
`n , E log zn |θ (t) = ψ(a) − log b
(250)
d
where ψ(x) = dx
log Γ(x) is the digamma function. Hence, from Equation (245), we have
(t)
(t)
ν (t) + D
ν (t) + δn
) − log(
)
2
2
ν (t) + D
ν (t) + D
= log(z (t)
) − log(
)
n ) + Ψ(
2
2
`n = Ψ(
(251)
(252)
Substituting into Equation (244), we have
E [LG (ν)] = −N log Γ(ν/2) +
Nν
ν X (t)
log(ν/2) +
(` − z (t)
n )
2
2 n n
(253)
The gradient of this expression is equal to
N
N
N
1 X (t)
d
E [LG (ν)] = − Ψ(ν/2) +
log(ν/2) +
+
(` − z (t)
n )
dν
2
2
2
2 n n
(254)
This has a unique solution in the interval (0, +∞] which can be found using a 1d constrained optimizer.
Performing a gradient-based optimization in the M step, rather than a closed-form update, is an example
of what is known as the generalized EM algorithm.
30
Part II
Linear models
31
9
Solutions
9.1
Derivation of Fisher’s linear discriminant
We have
f = wT SB w
(255)
0
(256)
f = 2SB w
(257)
T
g = w SW w
(258)
0
g = 2SW w
T
T
dJ(w)
(2sB w)(w SW w) − (w SB w)(2SW w)
=
=0
dw
(wT SW w)T (wT SW w)
(259)
(sB w) (wT SW w) = (wT SB w)(SW w)
| {z } | {z }
(260)
aSB w = bSW w
b
SB w = SW w
a
(261)
Hence
a
b
32
(262)
10
Solutions
10.1
Gradient and Hessian of log-likelihood for multinomial logistic regression
P
1. Let us drop the i subscript for simplicity. Let S = k eηk be the denominator of the softmax.
∂ ηk −1
∂µk
∂
=
e
S] · −S −2
S + e ηk · [
(263)
∂ηj
∂ηj
∂ηj
= (δij eηk )S −1 − eηk eηj S −2
e ηj
e ηk
δij −
=
S
S
(264)
= µk (δkj − µj )
(266)
(265)
2. We have
X X yik
X X ∂` ∂µik ∂ηij
=
µik (δjk − µij )xi
∂µik ∂ηij ∂wj
µik
i
i
k
k
XX
X
XX
=
yik (δjk − µij )xi =
yij xi −
(
yik )µij xi
∇wj ` =
i
=
i
k
X
i
(267)
(268)
k
(269)
(yij − µij )xi
i
3. We consider a single term xi in the log likelihood; we can sum over i at the end. Using the Jacobian
expression from above, we have
∇wc0 (∇wc `)T = ∇wc0 ((yic − µic )xTi )
10.2
(270)
= −(∇wc0 µic )xTi
(271)
= −(µic (δc,c0 − µi,c0 )xi )xTi
(272)
Regularizing separate terms in 2d logistic regression
PRIVATE
1. Unregularized solution: see Figure 1(a). There is a unique ML decision boundary which makes 0 errors,
but there are several possible decision boundaries which all makes 0 errors.
2. Heavily regularizing w0 sets w0 = 0, so the line must go through the origin. There are several possible
lines with different slopes, but all will make 1 error. see Figure 1(b).
3. Regularizing w1 makes the line horizontal since x1 is ignored. This incurs 2 errors. see Figure 1(c).
4. Regularizing w2 makes the line vertical, since x2 is ignored. This incurs 0 errors. see Figure 1(d).
10.3
Logistic regression vs LDA/QDA
PRIVATE
1. GaussI ≤ LinLog. Both have logistic (sigmoid) posteriors p(y|x, w) = σ(ywT x), but LinLog is the
logistic model which is trained to maximize p(y|x, w). (GaussI may have high joint p(y, x), but this
does not necessarily mean p(y|x) is high; LinLog can achieve the maximum of p(y|x), so will necessarily
do at least as well as GaussI.)
33
(a)
(b)
(c)
(d)
Figure 1: (a) Unregularized. (b) Regularizing w0 . (c) Regularizing w1 . (d) Regularizing w2 .
2. GaussX ≤ QuadLog. Both have logistic posteriors with quadratic features, but QuadLog is the model
of this class maximizing the average log probabilities.
3. LinLog ≤ QuadLog. Logistic regression models with linear features are a subclass of logistic regression
models with quadratic functions. The maximum from the superclass is at least as high as the maximum
from the subclass.
4. GaussI ≤ QuadLog. Follows from above inequalities.
5. Although one might expect that higher log likelihood results in better classification performance, in
general, having higher average log p(y|x) does not necessarily translate to higher or lower classification
error. For example, consider linearly separable data. We have L(linLog) > L(GaussI), since maximum
likelihood logistic regression will set the weights to infinity, to maximize the probability of the correct
labels (hence p(yi |xi , ŵ) = 1 for all i). However, we have R(linLog) = R(gaussI), since the data is
linearly separable. (The GaussI model may or may not set σ very small, resulting in possibly very large
class conditional pdfs; however, the posterior over y is a discrete pmf, and can never exceed 1.)
As another example, suppose the true label is always 1 (as opposed to 0), but model M always predicts
p(y = 1|x, M ) = 0.49. It will always misclassify, but it is at least close to the decision boundary.
By contrast, there might be another model M 0 that predicts p(y = 1|x, M 0 ) = 1 on even-numbered
inputs, and p(y = 1|x, M 0 ) = 0 on odd-numbered inputs. Clearly R(M 0 ) = 0.5 < R(M ) = 1, but
L(M 0 ) = −∞ < L(M ) = log(0.49).
34
11
Solutions
11.1
Multi-output linear regression
PRIVATE
We have
Ŵ = (ΦT Φ)−1 ΦT Y
1 1 1 0 0
T
Φ =
0 0 0 1 1
3 0
ΦT Φ =
0 3
−1 −1
−1 −2
−2 −1
Y=
1
1
1
2
2
1
−4 −4
ΦT Y =
4
4
(273)
0
1
(274)
(275)
(276)
(277)
(278)
(ΦT Φ)−1 = (1/3)I2
Ŵ = (ΦT Φ)−1 ΦT Y =
11.2
−4/3
4/3
−4/3
4/3
(279)
Centering and ridge regression
Suppose X is centered, so x = 0. Then
(280)
J(w, w0 ) = (y − Xw − w0 1)T (y − Xw − w0 1) + λwT w
T
T
T
T
T
T
T
T
= y y + w (X X)w − 2y (Xw) + λw w + −2w0 1 y + 2w0 1 Xw + w0 1 1w0
(281)
Consider the terms in brackets:
w0 1T y = w0 ny
X
w0 1T Xw = w0
xTi w = nxT w = 0
(282)
(283)
i
(284)
w0 1T 1w0 = nw02
Optimizing wrt w0 we find
∂
J(w, w0 ) = −2ny + 2nw0 = 0
∂w0
ŵ0 = y
(285)
(286)
Optimizing wrt w we find
∂
J(w, ŵ0 ) = [2XT Xw − 2XT y] + 2λw = 0
∂w
w = (XT X + λI)−1 XT y
35
(287)
(288)
11.3
Partial derivative of the RSS
PRIVATE
1. We have
RSS(w) =
X
(ri − wk xik )(ri − wk xik )
(289)
i
=
X
[ri2 − wk2 x2ik − 2wk xik ri ]
(290)
i
So
X
∂
RSS(w) =
[−2wk x2ik − 2xik ri ]
∂wk
i
(291)
Let us check this matches the standard expression
(292)
∇w = 2XT Xw − 2XT y
Recall that [XT X]jk =
i xij xik . For simplicity, suppose there are two features. Then
P
X
X
X
X
X
[∇w ]1 = 2[
x2i1 ,
xi1 xi2 ]T w − 2[
xi1 yi ] = 2w1
x2i1 + 2
xi1 (w2 xi2 − yi ) = a1 w1 − c1 (293)
i
i
i
i
i
2. Setting the derivative to zero, we have
X
X
wk (
x2ik ) =
(xik ri )
i
so
wk =
11.4
xT:,k r
xT:,k x:,k
(295)
Reducing elastic net to lasso
We have
and
11.5
(294)
i
J1 (w) = y T y + (Xw)T (Xw) − 2y T (Xw) + λ2 wT w + λ1 |w|1
(296)
J2 (w) = y T y + c2 (Xw)T (Xw) − 2c2 y T (Xw) + λ2 c2 wT w + cλ1 |w|1 = J1 (cw)
(297)
Shrinkage in linear regression
PRIVATE
1. From the book, we have that ŵkOLS = xTk y = ck /2. Hence the solid line (1) is OLS and has slope 1/2.
ck
Also, ŵkridge = ŵkOLS /(1 + λ2 ) = 2(1+λ
. Hence the dotted line (2) is ridge and has slope 1/4. Finally,
2)
lasso must be the dashed line (3), since it sets coefficients to zero if −λ1 ≤ ck ≤ λ1 .
2. λ1 = 1
3. λ2 = 1
36
11.6
EM for mixture of linear regression experts
In the E step, we compute the conditional responsibilities
rnk = p(zn = k|xn , yn ) =
p(yn |xn , zn = k)p(zn = k|xn )
p(yn |xn )
(298)
In the M step, we update the parameters of the gating funtion by maximizing the weighted likelihood
XX
`(θg ) =
rnk log p(zn = k|xn , θg )
(299)
n
k
and the parameters of the k’th expert by maximizing the weighted likelihood
X
`(θk ) =
rnk log p(yn |xn , zn = k, θk )
(300)
n
If the gating function and experts are linear models, these M steps correspond to convex subproblems that
can be solved efficiently.
For example, consider a mixture of linear regression experts using logistic regression gating functions. In
the M step, we need to maximize Q(θ, θ old ) wrt wk , σk2 and V. For the regression parameters for model k,
the objective has the form
N
X
1
(301)
Q(θk , θ old ) =
rnk − 2 (yn − wTk xn )
σk
n=1
We recognize this as a weighted least squares problem, which makes intuitive sense: if rnk is small, then data
point n will be downweighted when estimating model k’s parameters. From the weighted least square section,
we can immediately write down the MLE as
wk = (XT Rk X)−1 XT Rk y
(302)
where Rk = diag(r:,k ). The MLE for the variance is given by
σk2 =
PN
T
2
n=1 rnk (yn − wk xn )
PN
n=1 rnk
(303)
We replace the estimate of the unconditional mixing weights π with the estimate of the gating parameters,
V. The objective has the form
XX
`(V) =
rnk log πn,k
(304)
n
k
We recognize this as equivalent to the log-likelihood for multinomial logistic regression, except we replace the
“hard” 1-of-C encoding yi with the “soft” 1-of-K encoding ri . Thus we can estimate V by fitting a logistic
regression model to soft target labels.
37
12
Solutions
38
Part III
Deep neural networks
39
13
Solutions
13.1
Backpropagation for a 1 layer MLP
To compute δ 1 , let L = CrossEntropyWithLogits(y, a). Then
∂L
= (p − y)T
∂a
(305)
∂L ∂a ∂h
∂a ∂h
∂L
=
= δ1
∂z
∂a ∂h ∂z
∂h ∂z
∂h
= δ1 U
since a = Uh + b2
∂z
= δ 1 U ◦ ReLU0 (z) since h = ReLU(z)
(306)
= δ 1 U ◦ H(h)
(309)
δ1 =
where p = S(a).
To compute δ 2 , we have
δ2 =
(307)
(308)
Now we compute the gradients wrt the parameters:
∂L ∂a
∂L
=
= δ 1 hT
∂U
∂a ∂U
∂L
∂L ∂a
=
= δ1
∂b2
∂a ∂b2
∂L ∂z
∂L
=
= δ 2 xT
∂W
∂z ∂W
∂L
∂L ∂z
=
= δ2
∂b1
∂z ∂b1
∂L ∂z
∂L
=
= δ2 W
∂x
∂z ∂x
40
(310)
(311)
(312)
(313)
(314)
14
Solutions
41
15
Solutions
42
Part IV
Nonparametric models
43
16
Solutions
44
17
Solutions
17.1
Fitting an SVM classifier by hand
PRIVATE
1. The perpendicular to the decision boundary is a line through φ(x1 ) to φ(x2 ), and hence is parallel to
φ(x2 ) − φ(x1 ) = (1, 2, 2) − (1, 0, 0) = (0, 2, 2). Of course, any scalar multiple of this is acceptable, e.g.,
(0, 1, 1).
2. There are 2 support vectors, namely the two data points. The decision boundary will be half way
between them. This midpoint is m = (1,2,2)+(1,0,0)
= (1, 1, 1). The distance of each of the training
2
points to this midpoint is
p
√
||φ(x1 ) − m|| = ||φ(x2 ) − m|| = ||(0, 1, 1)|| = 02 + 12 + 12 = 2
(315)
√
Hence the margin is 2.
p
√
3. We have w = (0, 1/2, 1/2), which is parallel to (0, 2, 2) and has ||w|| = (1/2)2 + (1/2)2 = 1/ 2 as
required.
4. Using the equations we have
(316)
−1([1, 0, 0]T w + w0 ) = 1
(317)
T
+1([1, 2, 2] w + w0 ) = 1
Hence w0 = −1.
5. We have
T
T
√
√
2
w0 + w φ(x) = −1 + (0, 1/2, 1/2) (1, 2x, x ) = −1 +
45
1
2
x + x2
2
2
(318)
18
Solutions
46
Part V
Beyond supervised learning
47
19
Solutions
19.1
Information gain equations
To see this, let us define pn , p(θ|D) as the current belief state, and pn+1 , p(θ|D, x, y) as the belief state
after observing (x, y). Then
Z
pn+1
KL (pn+1 kpn ) = pn+1 log
dθ
(319)
pn
Z
Z
= pn+1 log pn+1 dθ − pn+1 log pn dθ
(320)
Z
= − H pn+1 − pn+1 log pn dθ
(321)
Next we need to take expectations wrt p(y|x, D). We will use the following identity:
Z
X
p(y|x, D) pn+1 log pn dθ
(322)
y
=
X
Z
p(y|x, D)
p(θ|D, x, y) log p(θ|D)dθ
(323)
y
=
Z X
p(y, θ|x, D) log p(θ|D)dθ
(324)
y
Z
=
(325)
p(θ|D) log p(θ|D)dθ
Using this, we have
Z
0
U (x) = Ep(y|x,D) [− H pn+1 ] −
pn log pn dθ
= Ep(y|x,D) [− H p(θ|x, y, D) + H p(θ|D)] = U (x)
48
(326)
(327)
20
Solutions
20.1
EM for FA
In the E step, we can compute the posterior for zn as follows:
(328)
p(zn |xn , θ) = N (zn |mn , Σn )
T
−1
T
−1
Σn , (I + W Ψ
mn , Σn (W Ψ
−1
(329)
(xn − µ))
(330)
W)
We now discuss the M step. We initially assume µ = 0. Using the trace trick we have
i
X X h T −1
E (x̃i − Wzi )T Ψ−1 (x̃i − Wzi ) =
x̃i Ψ x̃i + E ziT WT Ψ−1 Wzi − 2x̃Ti Ψ−1 WE [zi ]
i
(331)
i
=
Xh
tr(Ψ−1 x̃i x̃Ti ) + tr(Ψ−1 WE zi ziT WT )
(332)
i
−tr(2Ψ−1 WE [zi ] x̃Ti )
i
(333)
, tr(Ψ−1 G(W))
(334)
Hence the expected complete data log likelihood is given by
Q=
1
N
log |Ψ−1 | − tr(Ψ−1 G(W))
2
2
(335)
∂
∂
Using the chain rule and the facts that ∂W
tr(WT A) = A ∂W
WT AW = (A + AT )W we have
1
∇W Q(W) = − Ψ−1 ∇W G(W) = 0
2
X X
∇W G(W) = 2W
E zi ziT − 2(
E [zi ] x̃Ti )T
i
Wmle =
(336)
(337)
i
"
X
#"
#−1
X T
x̃i E [zi ]
E zi zi
i
i
T
(338)
Using the facts that ∇X log |X| = X−T and ∇X tr(XA) = AT we have
1
N
Ψ − G(Wmle ) = 0
2
2
1
Ψ = diag(G(Wmle ))
N
∇Ψ−1 Q =
(339)
(340)
We can simplify this as follows, by plugging in the MLE (this simplification no longer holds if we use MAP
estimation). First note that
X X
T
E zi ziT Wmle
=
E [zi ] x̃Ti
(341)
i
i
so
1 X
x̃i x̃Ti + Wmle E [zi ] x̃Ti − 2Wmle E [zi ] x̃Ti
N i
!
X
1 X
T
T
x̃i x̃i − Wmle
=
E [zi ] x̃i
N
i
i
Ψ=
49
(342)
(343)
To estimate µ and W at the same time, we can define W̃ = (W, µ) and z̃ = (z, 1), Also, define
bn , E [z̃|xn ] = [mn ; 1]
h
i E zzT |x E [z|x ]
n
n
T
Ωn , E z̃z̃ |xn =
T
E [z|xn ]
1
(344)
(345)
Then the M step is as follows:
"
#"
#−1
X
X
ˆ
T
W̃ =
x n bn
Ωn
n
(346)
n
(
)
X
1
ˆ
T
xn − W̃b
Ψ̂ = diag
n xn
N
n
(347)
It is interesting to apply the above equations to the PPCA case in the limit where σ 2 → 0. This provides
an alternative way to fit PCA models, as shown by [Row97]. Let Z̃ be a L × N matrix storing the posterior
means (low-dimensional representations) along its columns. Similarly, let X̃ = XT be an D × N matrix
storing the original data along its columns. From the PPCA posterior, when σ 2 = 0, we have
(348)
Z̃ = (WT W)−1 WT X̃
This constitutes the E step. Notice that this is just an orthogonal projection of the data.
From Equation (346), the M step is given by
Ŵ =
"
X
T
xn E [zn ]
#"
X
n
#−1
T
E [zn ] E [zn ]
(349)
n
where we exploited the fact that Σ = Cov [zn |xn , θ] = 0I when σ 2 = 0. It P
is worth comparing
this expression
P
to the MLE for multi-output linear regression, which has the form W = ( n yn xTn )( n xn xTn )−1 . Thus we
see that the M step is like linear regression where we replace the observed inputs by the expected values of
the latent variables.
In summary, here is the entire algorithm:
• E step: Z̃ = (WT W)−1 WT X̃
T
T
• M step: W = X̃Z̃ (Z̃Z̃ )−1
[TB99] showed that the only stable fixed point of the EM algorithm is the globally optimal solution.
That is, the EM algorithm converges to a solution where W spans the same linear subspace as that defined
by the first L eigenvectors. However, if we want W to be orthogonal, and to contain the eigenvectors in
descending order of eigenvalue, we have to orthogonalize the resulting matrix (which can be done quite
cheaply). Alternatively, we can modify EM to give the principal basis directly [AO03].
20.2
EM for mixFA
PRIVATE
We state the results below without proof (the derivation can be found in [GH96]).
In the E step, we compute the posterior responsibility of cluster c for data point n using
rnc , p(sn = c|xn , θ) ∝ πc N (xn |µc , Wc WTc + Ψ)
50
(350)
The conditional posterior for zn is given by
p(zn |xn , sn = c, θ) = N (zn |mnc , Σnc )
−1
Σnc , (I + WTc Ψ−1
c Wc )
mnc , Σnc (WTc Ψ−1
c (xn − µc ))
(351)
(352)
(353)
For the M step, we define W̃c = (Wc , µc ), and z̃ = (z, 1). Also, define
bnc , E [z̃|xn , cn = c] = [mnc ; 1]
h
i E zzT |x , c = c E [z|x , c = c]
n n
n n
T
Ωnc , E z̃z̃ |xn , cn = c =
T
E [z|xn , cn = c]
1
(354)
(355)
Then the M step is as follows:
"
#"
#−1
X
X
ˆ
T
W̃c =
rnc xn bc
rnc Ωnc
n
(
)
X
1
ˆ
T
rnc xn − W̃c bnc xn
Ψ̂ = diag
N
nc
1 X
rnc
π̂c =
N n
20.3
(356)
n
(357)
(358)
Deriving the second principal component
1. Dropping terms that no not involve z2 we have
J=
n
n
1 X
1 X
2 T
2
−2zi2 v2T (xi − zi1 v1 ) + zi2
v2 v2 =
−2zi2 v2T xi + zi2
n i=1
n i=1
(359)
since v2T v2 = 1 and v1T v2 = 0. Hence
so
2. We have
∂J
= −2v2T xi + 2zi2 = 0
∂zi2
(360)
zi2 = v2T xi
(361)
∂ J˜
= −2Cv2 + 2λ2 v2 + λ12 v1 = 0
∂v2
(362)
0 = −2v1T Cv2 + 2λ2 v1T v2 + λ12 v1T v1
(363)
Premultiplying by v1T yields
Now v1T Cv2 = v1T (λ1 v2 ) = 0, and v1T v2 = 0, and v1T v1 = 1, so λ12 = 0. Hence
0 = −2Cv2 + 2λ2 v2
Cv2 = λ2 v2
(364)
(365)
So v2 is an eigenvector of C. Since we want to maximize the variance, we want to pick the eigenvector
with the largest eigenvalue, but the first one is already taken. Hence v2 is the evector with the second
largest evalue.
51
20.4
Deriving the residual error for PCA
PRIVATE
1. From the previous exercise we have
z
z
* i1
* i2 2
2
(xi − zi1 v1 − zi2 v2 )T (xi − zi1 v1 − zi2 v2 ) = xTi xi − 2zi1
xTi
v1 − 2zi2
xTi
v2 + zi1
+ zi2
2
2
= xTi xi − zi1
− zi2
= xTi xi − v1T xTi xTi v1 − v2T xTi xTi v2
(366)
(367)
(368)
This generalizes to K > 2 in the obvious way.
2. Hence
n
K
X
1 X T
xi xi −
vjT xi xTi vj
JK =
n i=1
j=1
=
(369)
n
n
X
1X
1X T
xi xi −
vjT (
xi xTi )vj
n i=1
n
j
i=1
(370)
n
X
1X T
xi xi −
=
vjT Cvj
n i=1
j
(371)
n
K
X
1X T
xi xi −
λj
n i=1
j=1
(372)
=
3. If K = d there is no truncation and Jd = 0. Hence
0=
n
K
d
d
X
X
X
1X T
xi xi −
λj −
λj = JK −
λj
n i=1
j=1
j=K+1
(373)
j=K+1
Hence the residual error arising from only using K < d terms is
JK =
d
X
λj
(374)
j=K+1
20.5
PCA via successive deflation
1. We have
1
(I − v1 v1T )XT X(I − v1 v1T )
n
1 T
=
(X X − v1 v1T XT X)(I − v1 v1T )
n
1 T
=
X X − v1 (v1T XT X) − (XT Xv1 )v1T + v1 (v1T XT Xv1 )v1T
n
1 T
=
X X − v1 (nλ1 v1T ) − (nλ1 v1 )v1T + v1 (v1T nλ1 v1 )v1T
n
1 T
=
X X − nλ1 v1 v1T − nλ1 v1 v1T + nλ1 v1 v1T
n
1
= XT X − λ1 v1 v1T
n
C̃ =
52
(375)
(376)
(377)
(378)
(379)
(380)
2. Since X̃ lives in the d − 1 subspace orthogonal to v1 , the vector u must be orthogonal to v1 . Hence
uT v1 = 0 and uT u = 1, so u = v2 .
3. We have
function [V, lambda] = simplePCA(C, K, f)
d = length(C);
V = zeros(d,K);
for j=1:K
[lambda(j), V(:,j) = f(C);
C = C - lambda(j)*V(:,j)*V(:,j)’; % deflation
end
20.6
PPCA variance terms
1
Define A = (ΛK − σ 2 I) 2 .
1. We have
v T Cv = v T (UARRT AT UT + σ 2 I)v = v T σ 2 Iv = σ 2
(381)
2. We have
v T Cv = uTi (UARRT AT UT + σ 2 I)ui
= uTi U(Λ − σ 2 I)UT ui + σ 2
= eTi (Λ − σ 2 I)ei + σ 2 = λi − σ 2 + σ 2
20.7
(382)
(383)
(384)
Posterior inference in PPCA
PRIVATE
Using Bayes rule for Gaussians with the substitutions Λ = I, L−1 = σ 2 I, A = W, b = µ, µ = 0, we have
−1
1 2
T 1
−1
T
Cov [z|y] = (I + W 2 w) =
(σ I + W W)
= σ 2 M−1
(385)
σ
σ2
and
E[z|y] = Cov [z|y] WT σ −2 (x − µ) = M−1 WT (x − µ)
20.8
(386)
Imputation in a FA model
PRIVATE
Letting C = WWT + Ψ, we have
−1
p(xh |xv , θ) = N (xh |µh + Chv C−1
vv (xv − µv ), Chh − Chv Cvv Cvh )
20.9
(387)
Efficiently evaluating the PPCA density
Since C is not full rank, we can use matrix inversion lemma to invert it efficiently:
C−1 =
1 I − W WT W + σ 2 I WT
2
σ
53
(388)
Plugging in the MLE we find
1
W = UK (ΛK − σ 2 I) 2
1 T
C−1 = 2 I − UK (ΛK − σ 2 I)Λ−1
K UK
σ
1 = 2 I − UK JUTK
σ
J = diag(1 − σ 2 /λj )
(389)
(390)
(391)
(392)
Similarly it can be shown that
log |WWT + σ 2 I| = (d − K) log σ 2 +
K
X
i=1
54
log λi
(393)
21
Solutions
55
22
Solutions
56
23
Solutions
57
0
You can add this document to your study collection(s)
Sign in Available only to authorized usersYou can add this document to your saved list
Sign in Available only to authorized users(For complaints, use another form )