Philipp Lintl
12152498
lintl.philipp@gmail.com
Homework Assignment 1
2018-11-12
Machine Learning 1, 18/19

1 Basic Linear Algebra and Derivatives

1.1

Given
$$A = \begin{pmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \\ 4 & 5 & 6 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix} \qquad \text{and} \qquad b = \begin{pmatrix} 1 \\ 2 \\ 5 \end{pmatrix}.$$
(a) Compute AB:
$$AB = \begin{pmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \\ 4 & 5 & 6 \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 2+4 & 2+3 & 3+4 \\ 3+5 & 3+4 & 4+5 \\ 4+6 & 4+5 & 5+6 \end{pmatrix} = \begin{pmatrix} 6 & 5 & 7 \\ 8 & 7 & 9 \\ 10 & 9 & 11 \end{pmatrix}$$
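As a quick numerical sanity check (a minimal NumPy sketch, not part of the original hand calculation):

    import numpy as np

    A = np.array([[2, 3, 4],
                  [3, 4, 5],
                  [4, 5, 6]])
    B = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1]])

    # The matrix product should reproduce the hand-computed result.
    print(A @ B)
    # [[ 6  5  7]
    #  [ 8  7  9]
    #  [10  9 11]]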
(b) Are A and B invertible? How do you test for it? If so, calculate the inverses. Provide details only for your analysis of A.
A matrix A is invertible if its row (respectively column) vectors are linearly independent, i.e. none of them can be expressed as a linear combination of the others. A quick and easy test for invertibility is the determinant of the matrix: det(X) = 0 implies that X is not invertible. As asked in the task description, only A requires detailed analysis. Therefore, the determinant is calculated first:
$$\det(A) = 2 \cdot 4 \cdot 6 + 3 \cdot 5 \cdot 4 + 4 \cdot 3 \cdot 5 - (4 \cdot 4 \cdot 4 + 5 \cdot 5 \cdot 2 + 6 \cdot 3 \cdot 3) = 48 + 60 + 60 - 64 - 50 - 54 = 0$$
The calculation is obtained via a trick (the rule of Sarrus) which works specifically for 3 × 3 matrices (Cramer's rule would also have worked): the matrix A is extended by its first two columns on the right, yielding an extended 3 × 5 matrix with rows denoted by i = 1, ..., 3, columns by j = 1, ..., 5 and entries A_ij:
$$\begin{pmatrix} 2 & 3 & 4 & 2 & 3 \\ 3 & 4 & 5 & 3 & 4 \\ 4 & 5 & 6 & 4 & 5 \end{pmatrix}$$
The first part consists of the products $A_{11} A_{22} A_{33}$, $A_{12} A_{23} A_{34}$ and $A_{13} A_{24} A_{35}$, which are summed. Then the three products $A_{31} A_{22} A_{13}$, $A_{32} A_{23} A_{14}$ and $A_{33} A_{24} A_{15}$ are subtracted.
As seen, det(A) = 0, so A is in fact not invertible.
B on the other hand is invertible, as det(B) = 2. $B^{-1}$ is now obtained with Gauss-Jordan elimination. Shortly described, the matrix to be inverted is written on the left and the identity matrix I on the right. Then B is transformed via row operations such that the identity matrix I is obtained on the left; the inverse $B^{-1}$ is then given on the right, provided all transformations are applied to both sides. For B this emerges as:
$$\left(\begin{array}{ccc|ccc} 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 & 1 \end{array}\right) \xrightarrow{III - I} \left(\begin{array}{ccc|ccc} 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & -1 & 1 & -1 & 0 & 1 \end{array}\right) \xrightarrow{III + II} \left(\begin{array}{ccc|ccc} 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 2 & -1 & 1 & 1 \end{array}\right)$$

$$\xrightarrow{\frac{1}{2} III \text{ and } II - III} \left(\begin{array}{ccc|ccc} 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & \frac{1}{2} & \frac{1}{2} & -\frac{1}{2} \\ 0 & 0 & 1 & -\frac{1}{2} & \frac{1}{2} & \frac{1}{2} \end{array}\right) \xrightarrow{I - II} \left(\begin{array}{ccc|ccc} 1 & 0 & 0 & \frac{1}{2} & -\frac{1}{2} & \frac{1}{2} \\ 0 & 1 & 0 & \frac{1}{2} & \frac{1}{2} & -\frac{1}{2} \\ 0 & 0 & 1 & -\frac{1}{2} & \frac{1}{2} & \frac{1}{2} \end{array}\right)$$

Thus,
$$B^{-1} = \frac{1}{2} \begin{pmatrix} 1 & -1 & 1 \\ 1 & 1 & -1 \\ -1 & 1 & 1 \end{pmatrix}$$
To test the result, $BB^{-1} = I$ must hold:
$$BB^{-1} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix} \cdot \frac{1}{2} \begin{pmatrix} 1 & -1 & 1 \\ 1 & 1 & -1 \\ -1 & 1 & 1 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = I$$
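The determinants and the inverse can be verified the same way; a minimal NumPy sketch (np.linalg.inv would raise an error for the singular A):

    import numpy as np

    A = np.array([[2, 3, 4], [3, 4, 5], [4, 5, 6]])
    B = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])

    print(np.linalg.det(A))  # 0 (up to float error): A is singular
    print(np.linalg.det(B))  # 2.0: B is invertible

    B_inv = np.linalg.inv(B)
    print(2 * B_inv)  # [[ 1. -1.  1.], [ 1.  1. -1.], [-1.  1.  1.]]
    print(np.allclose(B @ B_inv, np.eye(3)))  # True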
(c) Is AB invertible? Why (not)?
No, it is not invertible, as the product rule for determinants shows:
$$\det(AB) = \underbrace{\det(A)}_{=0} \cdot \det(B) = 0$$
So det(AB) is zero, which means that AB is not invertible.
(d) Compare the solution set for the systems Bx = b and Ax = b. What can we say about the second system due to the invertibility property you determined previously?
For Bx = b, the solution is obtained once again by Gaussian elimination:
$$\left(\begin{array}{ccc|c} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 2 \\ 1 & 0 & 1 & 5 \end{array}\right) \xrightarrow{III - I} \left(\begin{array}{ccc|c} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 2 \\ 0 & -1 & 1 & 4 \end{array}\right) \xrightarrow{III + II \text{ and } \frac{1}{2} III} \left(\begin{array}{ccc|c} 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 1 & 3 \end{array}\right)$$
Back-substitution gives $x_3 = 3$, then $x_2 + x_3 = 2 \rightarrow x_2 = -1$ and $x_1 + x_2 = 1 \rightarrow x_1 = 2$, which yields $x = (2, -1, 3)^T$. Once again this is tested by
$$Bx = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 \\ -1 \\ 3 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 5 \end{pmatrix} = b,$$
so Bx = b holds and the solution is unique.
For the second system, we do not even need to start calculating: the non-invertibility of A implies that its column/row vectors are linearly dependent, so at least one row can be eliminated to a row of zeros. The system Ax = b therefore has no unique solution: depending on b, either no solution or infinitely many solutions exist.
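A minimal NumPy sketch illustrating the contrast between the two systems (with b = (1, 2, 5)^T as above):

    import numpy as np

    A = np.array([[2, 3, 4], [3, 4, 5], [4, 5, 6]])
    B = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
    b = np.array([1, 2, 5])

    # B is invertible, so Bx = b has the unique solution x = (2, -1, 3).
    print(np.linalg.solve(B, b))  # [ 2. -1.  3.]

    # A is singular, so solve() refuses; Ax = b has no unique solution.
    try:
        np.linalg.solve(A, b)
    except np.linalg.LinAlgError as err:
        print("A is singular:", err)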
1.2
Find the derivative of the following functions with respect to x.
(a) $f_1(x) = \frac{2}{x^2} + x^{-7} + x^3$
$$\frac{\partial}{\partial x} f_1(x) = -\frac{4}{x^3} - 7x^{-8} + 3x^2$$
(b) $f_2(x) = x e^{-\sqrt[5]{x}}$
$$\frac{\partial}{\partial x} f_2(x) = x e^{-\sqrt[5]{x}} \cdot \left(-\frac{1}{5} x^{-4/5}\right) + e^{-\sqrt[5]{x}} = e^{-\sqrt[5]{x}} \left(-\frac{1}{5} x^{1/5} + 1\right) = e^{-\sqrt[5]{x}} \left(-\frac{1}{5} \sqrt[5]{x} + 1\right)$$
(c) $f_3(x) = \frac{1}{x} + \ln(x^2)$
$$\frac{\partial}{\partial x} f_3(x) = -\frac{1}{x^2} + \frac{1}{x^2} \cdot 2x = \frac{2x - 1}{x^2}$$
(d) $\sigma(x) = \frac{1}{1 + e^{-x}}$
$$\frac{\partial}{\partial x} \sigma(x) = -\left(1 + e^{-x}\right)^{-2} \cdot e^{-x} \cdot (-1) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2}$$
(e) $f_4(x) = \max\{0, x\}$
$$\frac{\partial}{\partial x} f_4(x) = \begin{cases} 0, & x \le 0 \\ 1, & x > 0 \end{cases}$$
(Strictly speaking, the derivative does not exist at x = 0; assigning it the value 0 there is a common convention.)
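All these derivatives can be double-checked symbolically; a minimal SymPy sketch:

    import sympy as sp

    x = sp.symbols('x', positive=True)  # positive=True keeps the fifth root real

    f1 = 2/x**2 + x**-7 + x**3
    f2 = x * sp.exp(-x**sp.Rational(1, 5))
    f3 = 1/x + sp.log(x**2)
    sigma = 1 / (1 + sp.exp(-x))

    for f in (f1, f2, f3, sigma):
        print(sp.simplify(sp.diff(f, x)))

    # max(0, x) is left out: it is not differentiable at 0; away from 0
    # its derivative is the step function given in (e).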
What is the shape of the following gradients?
(g) $\frac{\partial f(x)}{\partial x}$ with $f : \mathbb{R} \to \mathbb{R}$, $x \in \mathbb{R}$
As f(x) is a one-dimensional, real function, this gradient has the shape of a 1 × 1 vector, i.e. it is a scalar: the derivative of f(x) with respect to x.
(h) $\frac{df(x)}{dx}$ with $f : \mathbb{R}^n \to \mathbb{R}$, $x \in \mathbb{R}^n$
Now x is an n-dimensional vector which is mapped to a scalar by f, so f can be written as $f(x_1, \ldots, x_n)$. As specified, the gradient is defined as a row vector, which affects the shape in the following way:
$$\frac{\partial f(x)}{\partial x} = \left[ \frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n} \right]$$
Thus, the gradient for this function is an n-dimensional row vector consisting of the partial derivatives with respect to $x_1, \ldots, x_n$ in that order.
(i) $\frac{\partial f(x)}{\partial x}$ with $f : \mathbb{R}^n \to \mathbb{R}^m$, $x \in \mathbb{R}^n$
Now x is an n-dimensional vector which is mapped to an m-dimensional vector, meaning that m functions are involved in the mapping: $f(x_1, \ldots, x_n) = (f_1(x_1, \ldots, x_n), \ldots, f_m(x_1, \ldots, x_n))$. The resulting gradient has the following shape:
$$\frac{\partial f(x)}{\partial x} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{pmatrix}$$
Thus, the gradient for this function is an m × n matrix which has in its rows the partial derivatives of each $f_i$, $i \in \{1, \ldots, m\}$, with respect to all $x_j$, $j \in \{1, \ldots, n\}$.
Find the gradient of the following functions. Make their shapes explicit.
(j) $\frac{\partial f(x)}{\partial x}$ with $f(x) = 2\exp\left(x_2 - \ln(x_1^{-1}) - \sin(x_3 x_1^2)\right)$, $x \in \mathbb{R}^3$
$$\frac{\partial f(x)}{\partial x} = \left[ \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \frac{\partial f(x)}{\partial x_3} \right] \in \mathbb{R}^{1 \times 3}$$
with
$$\frac{\partial f(x)}{\partial x_1} = 2\exp\left(x_2 - \ln(x_1^{-1}) - \sin(x_3 x_1^2)\right) \cdot \left( \frac{1}{x_1} - 2 x_3 x_1 \cos(x_3 x_1^2) \right)$$
$$\frac{\partial f(x)}{\partial x_2} = 2\exp\left(x_2 - \ln(x_1^{-1}) - \sin(x_3 x_1^2)\right) \cdot 1$$
$$\frac{\partial f(x)}{\partial x_3} = 2\exp\left(x_2 - \ln(x_1^{-1}) - \sin(x_3 x_1^2)\right) \cdot \left( -x_1^2 \cos(x_3 x_1^2) \right)$$
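A short symbolic check of these three partial derivatives (a SymPy sketch, assuming the reading f(x) = 2 exp(x_2 − ln(x_1^{-1}) − sin(x_3 x_1^2))):

    import sympy as sp

    x1, x2, x3 = sp.symbols('x1 x2 x3', positive=True)
    f = 2 * sp.exp(x2 - sp.log(1/x1) - sp.sin(x3 * x1**2))

    # Each partial derivative should equal f(x) times the factor derived above.
    for v, factor in ((x1, 1/x1 - 2*x3*x1*sp.cos(x3*x1**2)),
                      (x2, 1),
                      (x3, -x1**2 * sp.cos(x3*x1**2))):
        print(sp.simplify(sp.diff(f, v) - f * factor))  # 0, 0, 0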
(k) $\nabla_y h$ with $h(y) = (g \circ f)(y)$, where $g(x) = x_1^3 + \exp(x_2)$ and $x := f(y) = [y \sin(y), y \cos(y)]^T$. First show your understanding of the application of the chain rule in this example before "plugging in" the actual derivatives.
In other words $x_1 = y \sin(y)$, $x_2 = y \cos(y)$, which leads to the formal derivation of the sought gradient:
$$\nabla_y h = \nabla_y (g \circ f) = \frac{\partial g}{\partial x} \cdot \frac{\partial x}{\partial y},$$
where
$$\frac{\partial g}{\partial x} = \left[ \frac{\partial g}{\partial x_1}, \frac{\partial g}{\partial x_2} \right] \quad \text{and} \quad \frac{\partial x}{\partial y} = \begin{pmatrix} \frac{\partial x_1}{\partial y} \\ \frac{\partial x_2}{\partial y} \end{pmatrix}.$$
Then looking at the respective terms:
$$\frac{\partial g}{\partial x_1} = 3x_1^2 \cdot \underbrace{\frac{\partial x_1}{\partial x_1}}_{=1} = 3x_1^2; \qquad \frac{\partial g}{\partial x_2} = \exp(x_2) \cdot \underbrace{\frac{\partial x_2}{\partial x_2}}_{=1} = \exp(x_2)$$
$$\frac{\partial x_1}{\partial y} = \sin(y) + y \cos(y); \qquad \frac{\partial x_2}{\partial y} = \cos(y) - y \sin(y)$$
Plugged into the general forms above:
$$\frac{\partial g}{\partial x} = \left[ 3x_1^2, \exp(x_2) \right] \quad \text{and} \quad \frac{\partial x}{\partial y} = \begin{pmatrix} \sin(y) + y \cos(y) \\ \cos(y) - y \sin(y) \end{pmatrix}$$
Now, the entire gradient can be assembled:
$$\nabla_y h = \frac{\partial g}{\partial x} \cdot \frac{\partial x}{\partial y} = \left[ 3x_1^2, \exp(x_2) \right] \begin{pmatrix} \sin(y) + y \cos(y) \\ \cos(y) - y \sin(y) \end{pmatrix} = 3x_1^2 (\sin(y) + y \cos(y)) + \exp(x_2) (\cos(y) - y \sin(y))$$
$$= 3(y \sin(y))^2 (\sin(y) + y \cos(y)) + \exp(y \cos(y)) (\cos(y) - y \sin(y))$$
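The chain-rule result can be confirmed by differentiating h(y) directly; a minimal SymPy sketch:

    import sympy as sp

    y = sp.symbols('y')
    x1 = y * sp.sin(y)        # x = f(y)
    x2 = y * sp.cos(y)
    h = x1**3 + sp.exp(x2)    # h(y) = g(f(y))

    direct = sp.diff(h, y)
    chain = (3 * x1**2 * (sp.sin(y) + y * sp.cos(y))
             + sp.exp(x2) * (sp.cos(y) - y * sp.sin(y)))
    print(sp.simplify(direct - chain))  # 0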
(l) We now assume that $x := f(y, z) = [y \sin(y) + z, y \cos(y) + z^2]^T$. Provide $\nabla_{y,z} h$. Hint: To determine the correct shape of $\nabla_{y,z} h$, view the input pair y and z as a vector $[y, z]^T$.
In other words $x_1 = y \sin(y) + z$, $x_2 = y \cos(y) + z^2$, which leads to the formal derivation of the sought gradient:
$$\nabla_{y,z} h = \nabla_{y,z} (g \circ f) = \frac{\partial g}{\partial x} \cdot \frac{\partial x}{\partial (y,z)},$$
where
$$\frac{\partial g}{\partial x} = \left[ \frac{\partial g}{\partial x_1}, \frac{\partial g}{\partial x_2} \right] \quad \text{and} \quad \frac{\partial x}{\partial (y,z)} = \begin{pmatrix} \frac{\partial x_1}{\partial y} & \frac{\partial x_1}{\partial z} \\ \frac{\partial x_2}{\partial y} & \frac{\partial x_2}{\partial z} \end{pmatrix}.$$
Then looking at the respective terms from above and the changed ones:
$$\frac{\partial g}{\partial x_1} = 3x_1^2; \qquad \frac{\partial g}{\partial x_2} = \exp(x_2)$$
$$\frac{\partial x_1}{\partial y} = \sin(y) + y \cos(y); \qquad \frac{\partial x_1}{\partial z} = 1$$
$$\frac{\partial x_2}{\partial y} = \cos(y) - y \sin(y); \qquad \frac{\partial x_2}{\partial z} = 2z,$$
which assembled leads to the following final gradient:
$$\nabla_{y,z} h = \frac{\partial g}{\partial x} \cdot \frac{\partial x}{\partial (y,z)} = \left[ 3x_1^2, \exp(x_2) \right] \begin{pmatrix} \sin(y) + y \cos(y) & 1 \\ \cos(y) - y \sin(y) & 2z \end{pmatrix}$$
$$= \left[ 3x_1^2 (\sin(y) + y \cos(y)) + \exp(x_2) (\cos(y) - y \sin(y)), \;\; 3x_1^2 + \exp(x_2) \cdot 2z \right]$$
$$= \left[ 3(y \sin(y) + z)^2 (\sin(y) + y \cos(y)) + \exp(y \cos(y) + z^2) (\cos(y) - y \sin(y)), \;\; 3(y \sin(y) + z)^2 + \exp(y \cos(y) + z^2) \cdot 2z \right]$$
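The same check for the two-argument case (SymPy sketch):

    import sympy as sp

    y, z = sp.symbols('y z')
    x1 = y * sp.sin(y) + z
    x2 = y * sp.cos(y) + z**2
    h = x1**3 + sp.exp(x2)

    # Gradient of h w.r.t. (y, z) computed directly ...
    direct = sp.Matrix([[sp.diff(h, y), sp.diff(h, z)]])
    # ... and via the assembled chain-rule expression from above.
    chain = sp.Matrix([[3 * x1**2 * (sp.sin(y) + y * sp.cos(y))
                        + sp.exp(x2) * (sp.cos(y) - y * sp.sin(y)),
                        3 * x1**2 + sp.exp(x2) * 2 * z]])
    print(sp.simplify(direct - chain))  # Matrix([[0, 0]])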
1.3
The following questions are good practice in manipulating vectors and matrices, and they are very important for solving for posterior distributions. Compute the following gradients, assuming $\Sigma^{-1}$ is symmetric, positive semi-definite and invertible.
(a) $\nabla_\mu\, x^T \Sigma^{-1} \mu$
Some transformations, writing $\sigma^2_{i,j} := \Sigma^{-1}_{i,j}$:
$$x^T \Sigma^{-1} \mu = (x_1, \ldots, x_n) \cdot \Sigma^{-1} \cdot \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} = \left( (x_1 \sigma^2_{1,1} + \cdots + x_n \sigma^2_{n,1}), \ldots, (x_1 \sigma^2_{1,n} + \cdots + x_n \sigma^2_{n,n}) \right) \cdot \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} = \sum_i \sum_j x_i \sigma^2_{ij} \mu_j,$$
with an individual partial derivative of this looking like the following:
$$\frac{\partial\, x^T \Sigma^{-1} \mu}{\partial \mu_k} = \sum_i x_i \sigma^2_{ik}$$
If those partial derivatives are taken with respect to each dimension, the shape of this gradient arises as $\mathbb{R}^{1 \times n}$, and in matrix form:
$$\nabla_\mu\, x^T \Sigma^{-1} \mu = \frac{\partial\, x^T \Sigma^{-1} \mu}{\partial \mu} = \left[ \sum_i x_i \sigma^2_{i1}, \ldots, \sum_i x_i \sigma^2_{in} \right] = x^T \Sigma^{-1}$$
(b) $\nabla_\mu\, \mu^T \Sigma^{-1} \mu$
Again with $\sigma^2_{ij} := \Sigma^{-1}_{ij}$:
$$\mu^T \Sigma^{-1} \mu = (\mu_1, \ldots, \mu_n) \cdot \Sigma^{-1} \cdot \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} = \left( (\mu_1 \sigma^2_{1,1} + \cdots + \mu_n \sigma^2_{n,1}), \ldots, (\mu_1 \sigma^2_{1,n} + \cdots + \mu_n \sigma^2_{n,n}) \right) \cdot \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix}$$
$$= (\mu_1^2 \sigma^2_{1,1} + \cdots + \mu_1 \mu_n \sigma^2_{n,1}) + \cdots + (\mu_n \mu_1 \sigma^2_{1,n} + \cdots + \mu_n^2 \sigma^2_{n,n})$$
As $\Sigma^{-1}$ is symmetric, $\sigma^2_{ij} = \sigma^2_{ji}$ holds. With that information, the expression can, similarly to before, be represented as:
$$\mu^T \Sigma^{-1} \mu = \sum_i \sum_j \mu_i \sigma^2_{ij} \mu_j$$
Again, one partial derivative looks like:
$$\frac{\partial\, \mu^T \Sigma^{-1} \mu}{\partial \mu_k} = \sum_i \mu_i \sigma^2_{ik} + \sum_j \sigma^2_{kj} \mu_j \stackrel{\text{symmetry}}{=} 2 \sum_j \mu_j \sigma^2_{jk}$$
Then putting all of the partial derivatives together:
$$\nabla_\mu\, \mu^T \Sigma^{-1} \mu = \left[ 2 \sum_j \mu_j \sigma^2_{j1}, \ldots, 2 \sum_j \mu_j \sigma^2_{jn} \right] = 2 \mu^T \Sigma^{-1}$$
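Both closed forms can be checked numerically with central finite differences; a sketch with an arbitrary random symmetric positive-definite matrix standing in for $\Sigma^{-1}$:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    M = rng.standard_normal((n, n))
    Sinv = M @ M.T + n * np.eye(n)   # symmetric positive-definite "Sigma^{-1}"
    x = rng.standard_normal(n)
    mu = rng.standard_normal(n)

    def num_grad(f, v, eps=1e-6):
        # Central finite differences of a scalar function f at v.
        g = np.zeros_like(v)
        for i in range(len(v)):
            e = np.zeros_like(v)
            e[i] = eps
            g[i] = (f(v + e) - f(v - e)) / (2 * eps)
        return g

    # (a): gradient of x^T Sigma^{-1} mu w.r.t. mu is x^T Sigma^{-1}
    print(np.allclose(num_grad(lambda m: x @ Sinv @ m, mu), x @ Sinv))       # True
    # (b): gradient of mu^T Sigma^{-1} mu w.r.t. mu is 2 mu^T Sigma^{-1}
    print(np.allclose(num_grad(lambda m: m @ Sinv @ m, mu), 2 * mu @ Sinv))  # True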
(c) $\nabla_W f$ where $f = Wx$; assume $W \in \mathbb{R}^{2 \times 3}$ and $x \in \mathbb{R}^3$. Follow example 5.11 of the book Mathematics for Machine Learning to solve this.
$$\nabla_W f = \begin{pmatrix} \frac{\partial f_1}{\partial W} \\ \frac{\partial f_2}{\partial W} \end{pmatrix} \in \mathbb{R}^{2 \times 3 \times 2},$$
with the following characteristics:
• $f_i = \sum_{j=1}^3 W_{ij} x_j$, which for i = 1 is for instance: $f_1 = W_{11} x_1 + W_{12} x_2 + W_{13} x_3$
• $\frac{\partial f_i}{\partial W_{ij}} = x_j$, and more generally $\frac{\partial f_i}{\partial W_{iq}} = x_q$
• $\frac{\partial f_i}{\partial W} \in \mathbb{R}^{1 \times 3 \times 2}$
• $\frac{\partial f_i}{\partial W_{i,:}} = x^T \in \mathbb{R}^{1 \times 3 \times 1}$
• $\frac{\partial f_i}{\partial W_{j \neq i,:}} = 0^T \in \mathbb{R}^{1 \times 3 \times 1}$
If we now look at the partial terms:
$$\frac{\partial f_1}{\partial W} = \begin{pmatrix} x_1 & 0 \\ x_2 & 0 \\ x_3 & 0 \end{pmatrix} \quad \text{and} \quad \frac{\partial f_2}{\partial W} = \begin{pmatrix} 0 & x_1 \\ 0 & x_2 \\ 0 & x_3 \end{pmatrix}$$
So in total, the sought gradient looks like:
$$\nabla_W f = \begin{pmatrix} \frac{\partial f_1}{\partial W} \\ \frac{\partial f_2}{\partial W} \end{pmatrix} = \begin{pmatrix} x_1 & 0 \\ x_2 & 0 \\ x_3 & 0 \\ 0 & x_1 \\ 0 & x_2 \\ 0 & x_3 \end{pmatrix}$$
and thus is of shape $\mathbb{R}^{2 \times 3 \times 2}$.
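A small NumPy illustration of this structure (hypothetical numbers; each perturbation probes one entry of the gradient):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    W = np.arange(6.0).reshape(2, 3)
    f = W @ x                    # f_i = sum_j W_ij x_j

    # Perturbing W_ij changes only f_i, by exactly x_j; the other output
    # stays constant, which is what the zero blocks above express.
    eps = 1e-6
    for i in range(2):
        for j in range(3):
            Wp = W.copy()
            Wp[i, j] += eps
            df = (Wp @ x - f) / eps
            assert np.isclose(df[i], x[j]) and np.isclose(df[1 - i], 0.0)
    print("df_i/dW_ij = x_j confirmed")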
(d) $\nabla_W f$, where $f = (\mu - y(W, x))^T \Sigma^{-1} (\mu - y(W, x))$ and $y(W, x) = W^T x$. Assume $W \in \mathbb{R}^{M \times K}$. Follow these steps:
1. Compute $\frac{\partial f}{\partial W_{i,j}}$. Hint: use the chain rule by computing $\sum_k \frac{\partial f}{\partial y_k} \frac{\partial y_k}{\partial W_{i,j}}$. Hint: you might find the notation $\Sigma^{-1}_{:,k}$ for the kth column (equivalently the kth row, by symmetry) of $\Sigma^{-1}$ helpful.
With
$$\frac{\partial f(y)}{\partial W} = \frac{\partial f}{\partial y} \cdot \frac{\partial y}{\partial W}$$
in mind, f is first multiplied out:
$$f = \mu^T \Sigma^{-1} \mu - y^T \Sigma^{-1} \mu - \mu^T \Sigma^{-1} y + y^T \Sigma^{-1} y = \mu^T \Sigma^{-1} \mu - 2 \mu^T \Sigma^{-1} y + y^T \Sigma^{-1} y$$
Thus the derivative $\frac{\partial f}{\partial y}$ emerges as:
$$\frac{\partial f}{\partial y} = -2 \mu^T \Sigma^{-1} + 2 y^T \Sigma^{-1} \in \mathbb{R}^{1 \times K}$$
Then also
$$\frac{\partial y}{\partial W} = \begin{pmatrix} \frac{\partial y_1}{\partial W} \\ \vdots \\ \frac{\partial y_K}{\partial W} \end{pmatrix} \in \mathbb{R}^{K \times (K \times M)}$$
Similar to before, the following hold:
• $y_i = \sum_{j=1}^M W_{ji} x_j$ and $\frac{\partial y_i}{\partial W_{ji}} = x_j$ (⋆)
• $\frac{\partial y_i}{\partial W_{:,i}} = x^T \in \mathbb{R}^{1 \times M}$
• $\frac{\partial y_i}{\partial W_{:,j \neq i}} = 0^T \in \mathbb{R}^{1 \times M}$
• Putting them together for a single $y_i$ yields: $\frac{\partial y_i}{\partial W} = [0, \ldots, x^T, \ldots, 0] \in \mathbb{R}^{K \times M}$.
All of those are used when assembling $\frac{\partial f}{\partial W_{ij}}$. First, $\frac{\partial f}{\partial y_k}$ is needed:
$$\frac{\partial f}{\partial y_k} = -2 \mu^T \Sigma^{-1}_{:,k} + 2 y^T \Sigma^{-1}_{:,k} = 2 (y - \mu)^T \Sigma^{-1}_{:,k}$$
So,
$$\frac{\partial f}{\partial W_{ij}} = \sum_k \frac{\partial f}{\partial y_k} \frac{\partial y_k}{\partial W_{ij}}$$
Now, the partial results from above (⋆) come into play:
$$\sum_k \frac{\partial f}{\partial y_k} \frac{\partial y_k}{\partial W_{ij}} = 2 (y - \mu)^T \Sigma^{-1}_{:,j} \cdot x_i \in \mathbb{R}^{1 \times 1}$$
2. Combine your insights to answer the original question.
As a final result, $\nabla_W f$ emerges as:
$$\nabla_W f = \begin{pmatrix} 2(y - \mu)^T \Sigma^{-1}_{:,1} \cdot x_1 & \cdots & 2(y - \mu)^T \Sigma^{-1}_{:,K} \cdot x_1 \\ \vdots & \ddots & \vdots \\ 2(y - \mu)^T \Sigma^{-1}_{:,1} \cdot x_M & \cdots & 2(y - \mu)^T \Sigma^{-1}_{:,K} \cdot x_M \end{pmatrix},$$
which has the same shape as the matrix $W \in \mathbb{R}^{M \times K}$.
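This closed form is the outer product $2x(y-\mu)^T\Sigma^{-1}$, which makes it easy to check against finite differences; a sketch with arbitrary sizes M = 3, K = 2 and random data:

    import numpy as np

    rng = np.random.default_rng(1)
    M, K = 3, 2
    A = rng.standard_normal((K, K))
    Sinv = A @ A.T + K * np.eye(K)   # symmetric positive-definite "Sigma^{-1}"
    x = rng.standard_normal(M)
    mu = rng.standard_normal(K)
    W = rng.standard_normal((M, K))

    def f(W):
        d = mu - W.T @ x             # mu - y(W, x) with y = W^T x
        return d @ Sinv @ d

    # Closed form from above: grad[i, j] = 2 (y - mu)^T Sigma^{-1}_{:,j} x_i
    y = W.T @ x
    grad = 2 * np.outer(x, Sinv @ (y - mu))

    # Entry-wise central finite differences.
    eps = 1e-6
    num = np.zeros_like(W)
    for i in range(M):
        for j in range(K):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            num[i, j] = (f(Wp) - f(Wm)) / (2 * eps)
    print(np.allclose(num, grad))  # True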
2 Probability Theory
For these questions you will practice manipulating probabilities and probability density functions using the sum and product rules.
2.1
Suppose you meet Bart. Bart is a quiet, introverted man in his mid-forties. He is very shy and withdrawn, invariably helpful but with little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure and a passion for detail.
(a) What do you think: Is Bart a farmer or is he a librarian?
Well, he could be either. I personally see him more as a librarian, but that is just a matter of stereotype, I guess.
(b) Suppose we formulate our belief such that we think that Bart is a librarian, given that he is an introvert,
with 0.8 probability. If he were an extrovert, that probability is 0.1. Since the Netherlands is a social country,
you assume that the probability of meeting an extrovert is 0.7. Define the random variables and values they
can take on, both with symbols and numerically.
• random variable X ∈ {L, F}, representing whether Bart is a librarian (L) or a farmer (F)
• random variable Y ∈ {I, E}, indicating whether the person you meet is an introvert (I) or an extrovert (E)
• P(X = L | Y = I) = 0.8, the probability that Bart is a librarian given he is an introvert
• P(X = L | Y = E) = 0.1, the probability that Bart is a librarian given he is an extrovert
• P(Y = E) = 0.7, the probability that the person you meet is an extrovert
• P(Y = I) = 0.3, the probability that the person you meet is an introvert
(c) What is the probability that Bart is a librarian?
So P(X = L) is sought. Therefore we use the sum rule (law of total probability):
$$P(X = L) = P(X = L \mid Y = I) \cdot P(Y = I) + P(X = L \mid Y = E) \cdot P(Y = E) = 0.8 \cdot 0.3 + 0.1 \cdot 0.7 = 0.31$$
(d) Suppose you looked up the actual statistics and found out that there are 1000 times more farmers than librarians in the Netherlands. Additionally, Bart's friend tells you that if Bart were a librarian, the probability of him being an extrovert would be low, and equal to 0.1. Given this information, how should we update our belief that Bart is a librarian, if he is an introvert? How does this influence the answer to the previous question: What is the probability that Bart is a librarian?
So according to Bart's friend: P(Y = E | X = L) = 0.1 (and hence P(Y = I | X = L) = 0.9), and there are 1000 times more farmers than librarians. The second piece of information means the following:
$$P(X = F) = 1000 \cdot P(X = L)$$
Or in other words, $P(X = L) = \frac{1}{1001}$, as $P(X = F) = \frac{1000}{1001}$. Now, according to Bayes' rule:
$$P(X = L \mid Y = I) = \frac{P(Y = I \mid X = L) \cdot P(X = L)}{P(Y = I)} = \frac{0.9 \cdot \frac{1}{1001}}{0.3} = \frac{3}{1001}$$
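Both answers are easy to reproduce in a few lines of Python (a minimal sketch):

    # (c): sum rule with the probabilities defined in (b)
    p_L_given_I, p_L_given_E = 0.8, 0.1
    p_I, p_E = 0.3, 0.7
    print(p_L_given_I * p_I + p_L_given_E * p_E)  # 0.31 (up to float rounding)

    # (d): Bayes' rule with the updated prior P(X = L) = 1/1001
    p_L = 1 / 1001
    p_I_given_L = 0.9
    print(p_I_given_L * p_L / p_I)  # 0.002997... = 3/1001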
2.2
For this question you will compute the expression for the posterior parameter distribution for a simple data problem. Assume we observe N univariate data points $\{x_1, x_2, \ldots, x_N\}$. Further, we assume that they are generated by observing the outcome of a coin flip. Therefore the random variable representing the outcome of the coin flip is described by a Bernoulli distribution. The coin takes on the values 0 and 1, where the probability of throwing a 1 is equal to $\rho \in [0, 1]$. A Bernoulli random variable x has the following probability density function:
$$p(x) = \rho^x (1 - \rho)^{(1 - x)}.$$
(a) Write down the expression for the likelihood of the observed data points.
Likelihood of the observed data points:
$$p(x \mid \rho) \stackrel{\text{iid}}{=} \prod_{i=1}^{N} p(x_i \mid \rho) = \prod_{i=1}^{N} \rho^{x_i} (1 - \rho)^{(1 - x_i)}$$
(b) Assume you observed the following coin throws: [1, 1]. Which value of ρ most likely generated this data set?
$x_1 = 1$, $x_2 = 1$, $x = \{1, 1\}$. Writing out the likelihood of both samples:
$$p(x \mid \rho) = p(x_1 \mid \rho) \cdot p(x_2 \mid \rho) = \rho^1 (1 - \rho)^{(1-1)} \cdot \rho^1 (1 - \rho)^{(1-1)} = \rho \cdot \rho = \rho^2$$
If we now look for the ρ leading to the largest likelihood, it is ρ = 1, as then ρ² is maximal on [0, 1]:
$$\hat{\rho} = \operatorname*{argmax}_{\rho \in [0,1]} \rho^2 \;\rightarrow\; \hat{\rho} = 1$$
We now assume some prior p(ρ) over our model’s parameter ρ.
(c) Write down the general expression for the posterior over the parameter ρ, assuming we observe the dataset D. Indicate which terms correspond to prior, likelihood, evidence and posterior. (Do not bother to fill in the (Bernoulli) form of the likelihood.)
$$p(\rho \mid D) = \frac{p(D \mid \rho) \cdot p(\rho)}{p(D)},$$
where
• prior: $p(\rho)$
• likelihood: $p(D \mid \rho)$
• evidence: $p(D)$
• posterior: $p(\rho \mid D)$
(d) Assume our prior p(ρ) reflects our belief that the coin is fair. Under this prior, how will our posterior belief in the value of ρ change, assuming we observed the throws [1, 1]?
A fair coin corresponds to ρ = 0.5. As not noted otherwise, the uniform distribution U[0, 1] is chosen as the prior distribution over ρ. With that information, $p(\rho = 0.5) = \frac{1}{1 - 0} = 1$ holds (density of the uniform distribution). As determined in (b), $p(x \mid \rho) = \rho^2$. The denominator in the general form of the posterior is then:
$$p(D) = p(D = \{1, 1\}) = \int_0^1 p(\rho) \cdot p(D \mid \rho) \, d\rho = \int_0^1 1 \cdot \rho^2 \, d\rho = \left[ \frac{\rho^3}{3} \right]_0^1 = \frac{1}{3}$$
$$p(\rho \mid D) = \frac{p(D \mid \rho) \cdot p(\rho)}{p(D)} \;\Rightarrow\; p(\rho = 0.5 \mid D) = \frac{0.5^2 \cdot 1}{\frac{1}{3}} = \frac{3}{4}$$
So the posterior density at ρ = 0.5 drops from the prior value 1 to 3/4: after observing [1, 1], our belief that the coin is fair decreases. (The full posterior is $p(\rho \mid D) = 3\rho^2$, which shifts mass towards ρ = 1.)
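The evidence integral and the posterior value can be reproduced symbolically; a minimal SymPy sketch:

    import sympy as sp

    rho = sp.symbols('rho')
    likelihood = rho**2       # p(D | rho) for D = [1, 1], from (b)
    prior = sp.Integer(1)     # uniform prior density on [0, 1]

    evidence = sp.integrate(prior * likelihood, (rho, 0, 1))
    posterior = prior * likelihood / evidence
    print(evidence)                                 # 1/3
    print(posterior)                                # 3*rho**2
    print(posterior.subs(rho, sp.Rational(1, 2)))   # 3/4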