Homework 2 Written Assignment
10-605/10-805: Machine Learning with Large Datasets
Due Tuesday, September 29th at 1:30:00 PM Eastern Time
Submit your solutions via Gradescope, with your solution to each subproblem on a separate page,
i.e., following the template below. Note that Homework 2 consists of two parts: this written assignment,
and a programming assignment. The written part is worth 30% of your total HW2 grade (the programming part makes up the remaining 70%).
Name : Varun Rawal
Andrew ID : vrawal
1   Nyström Method (30 points)

Nyström method. Define the following block representation of a kernel matrix:
\[
K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}
\quad \text{and} \quad
C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}.
\]
The Nyström method uses $W \in \mathbb{R}^{l \times l}$, $C \in \mathbb{R}^{m \times l}$ and $K \in \mathbb{R}^{m \times m}$ to generate the approximation $\tilde{K} = C W^\dagger C^\top \approx K$.
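To make the construction concrete, here is a minimal NumPy sketch (not part of the assignment; the Gram-matrix construction and all dimensions are arbitrary choices for illustration) that slices $W$ and $C$ out of a kernel matrix and forms $\tilde{K} = C W^\dagger C^\top$, using a pseudo-inverse since $W$ may be rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(0)
m, l = 200, 20

# Build an SPSD kernel (Gram) matrix K = X^T X from random data.
X = rng.standard_normal((50, m))
K = X.T @ X

# W is the leading l-by-l block of K; C is the first l columns of K.
C = K[:, :l]                  # C in R^{m x l}
W = K[:l, :l]                 # W in R^{l x l}

# Nystrom approximation: K_tilde = C W^+ C^T.
K_tilde = C @ np.linalg.pinv(W) @ C.T

print("relative Frobenius error:",
      np.linalg.norm(K - K_tilde) / np.linalg.norm(K))
```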
(a) [5 points] Show that W is symmetric positive semi-definite (SPSD) and that
\[
\| K - \tilde{K} \|_F = \| K_{22} - K_{21} W^\dagger K_{21}^\top \|_F,
\]
where $\|\cdot\|_F$ is the Frobenius norm.
Solution.
We have been given that
\[
K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}.
\]
Since $K$ is a kernel (Gram) matrix, $K$ is SPSD; in particular, $K$ is symmetric, so $K^\top = K$.
Hence,
\[
K^\top = \begin{bmatrix} W^\top & K_{21}^\top \\ K_{21} & K_{22}^\top \end{bmatrix}
= K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}
\implies W = W^\top,
\]
so $W$ is symmetric. Also, since $K$ is a PSD matrix, we can prove that $W$ is also PSD as follows. For any PSD block matrix
\[
M = \begin{bmatrix} A & B \\ C & D \end{bmatrix},
\]
we have $z^* M z \ge 0$ for all $z$, and in particular for $z = [v, 0]^\top$:
\[
\begin{bmatrix} v^* & 0 \end{bmatrix}
\begin{bmatrix} A & B \\ C & D \end{bmatrix}
\begin{bmatrix} v \\ 0 \end{bmatrix}
= v^* A v \ge 0.
\]
A similar argument with $z = [0, v]^\top$ applies to $D$, so both diagonal blocks $A$ and $D$ of a PSD matrix must themselves be PSD.
Hence, $W$ is symmetric and PSD, i.e., $W$ is SPSD.
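As a quick numerical sanity check of this argument (an illustrative NumPy sketch with arbitrary sizes, not part of the solution): the leading principal submatrix of a random Gram matrix is symmetric with a nonnegative spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 30))
K = A.T @ A                  # SPSD by construction
W = K[:10, :10]              # leading principal submatrix (the "A" block above)

print(np.allclose(W, W.T))                    # True: symmetric
print(np.linalg.eigvalsh(W).min() >= -1e-10)  # True: nonnegative spectrum (up to round-off)
```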
Also, since the approximation is generated as $\tilde{K} = C W^\dagger C^\top$ with
\[
C = \begin{bmatrix} W \\ K_{21} \end{bmatrix},
\]
we have
\[
\tilde{K} = \begin{bmatrix} W \\ K_{21} \end{bmatrix} W^\dagger \begin{bmatrix} W & K_{21}^\top \end{bmatrix}
= \begin{bmatrix} W W^\dagger W & W W^\dagger K_{21}^\top \\ K_{21} W^\dagger W & K_{21} W^\dagger K_{21}^\top \end{bmatrix}
= \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{21} W^\dagger K_{21}^\top \end{bmatrix}.
\]
The last step uses $W W^\dagger W = W$ and the fact that, because $K$ is PSD, the columns of $K_{21}^\top$ lie in the range of $W$, so the orthogonal projection $W W^\dagger = W^\dagger W$ leaves them unchanged.
Here, $K_{22}$ is approximated by $K_{21} W^\dagger K_{21}^\top$, and the residual $K_{22} - K_{21} W^\dagger K_{21}^\top$ is the (generalized) Schur complement of $W$ in $K$.
Now, subtracting $\tilde{K}$ from $K$, we get:
\[
K - \tilde{K} = \begin{bmatrix} 0 & 0 \\ 0 & K_{22} - K_{21} W^\dagger K_{21}^\top \end{bmatrix}.
\]
Taking Frobenius norms on both sides,
\[
\| K - \tilde{K} \|_F
= \left\| \begin{bmatrix} 0 & 0 \\ 0 & K_{22} - K_{21} W^\dagger K_{21}^\top \end{bmatrix} \right\|_F
= \| K_{22} - K_{21} W^\dagger K_{21}^\top \|_F.
\]
Hence, proved.
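The identity can also be checked numerically; the following sketch (arbitrary dimensions, Gram-matrix construction assumed only for illustration) compares the two sides:

```python
import numpy as np

rng = np.random.default_rng(2)
m, l = 100, 15
X = rng.standard_normal((40, m))
K = X.T @ X

W, K21, K22 = K[:l, :l], K[l:, :l], K[l:, l:]
C = K[:, :l]
K_tilde = C @ np.linalg.pinv(W) @ C.T

lhs = np.linalg.norm(K - K_tilde)                             # ||K - K_tilde||_F
rhs = np.linalg.norm(K22 - K21 @ np.linalg.pinv(W) @ K21.T)   # Schur-complement norm
print(np.isclose(lhs, rhs))   # True
```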
(b) [10 points] Let $K = X^\top X$ for some $X \in \mathbb{R}^{N \times m}$, and let $X' \in \mathbb{R}^{N \times l}$ be the first $l$ columns of $X$. Show that $\tilde{K} = X^\top P_{U_{X'}} X$, where $P_{U_{X'}}$ is the orthogonal projection onto the span of the left singular vectors of $X'$.
Solution.
Since $K = X^\top X$ for some $X \in \mathbb{R}^{N \times m}$, we may define a zero-one sampling matrix $S \in \mathbb{R}^{m \times l}$ that selects the first $l$ columns of $K$, i.e., $C = KS$. Each column of $S$ has exactly one non-zero entry.
Further, $W = S^\top K S = (XS)^\top (XS) = X'^\top X'$, where $X' = XS \in \mathbb{R}^{N \times l}$ contains the $l$ sampled columns of $X$, and $X' = U_{X'} \Sigma_{X'} V_{X'}^\top$ is the (thin) SVD of $X'$.
Now, since $\tilde{K} = C W^\dagger C^\top$, we can express $C$ and $W$ in terms of $X$ and $S$ as follows:
\begin{align*}
\tilde{K} &= C W^\dagger C^\top \\
&= KS \, [S^\top K S]^\dagger \, S^\top K^\top \\
&= X^\top X' (X'^\top X')^\dagger X'^\top X \\
&= X^\top U_{X'} U_{X'}^\top X,
\end{align*}
where the last step substitutes the thin SVD of $X'$ into $X'(X'^\top X')^\dagger X'^\top$, giving $U_{X'} U_{X'}^\top$.
Now, $U_{X'} U_{X'}^\top$ is the orthogonal projection onto the span of the left singular vectors of $X'$, and thus can be written as $P_{U_{X'}} = U_{X'} U_{X'}^\top$.
Hence, we have:
\[
\tilde{K} = X^\top P_{U_{X'}} X.
\]
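The equivalence of the Nyström form $C W^\dagger C^\top$ and the projection form $X^\top P_{U_{X'}} X$ can likewise be verified numerically (an illustrative sketch; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, l = 30, 80, 10
X = rng.standard_normal((N, m))
K = X.T @ X

Xp = X[:, :l]                                    # X' = first l columns of X
U, _, _ = np.linalg.svd(Xp, full_matrices=False)
P = U @ U.T                                      # projection onto span of left singular vectors

C, W = K[:, :l], K[:l, :l]
print(np.allclose(C @ np.linalg.pinv(W) @ C.T,   # Nystrom form C W^+ C^T
                  X.T @ P @ X))                  # projection form X^T P X  -> True
```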
(c) [5 points] Is $\tilde{K}$ symmetric positive semi-definite (SPSD)?
Solution.
Since $\tilde{K}$ can be expressed as
\[
\tilde{K} = X^\top U_{X'} U_{X'}^\top X = [U_{X'}^\top X]^\top [U_{X'}^\top X] = Y^\top Y
\]
for $Y = U_{X'}^\top X$, $\tilde{K}$ is indeed a symmetric positive semi-definite (SPSD) matrix.
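A short numerical check of both claims (illustrative only; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((25, 60))
U, _, _ = np.linalg.svd(X[:, :8], full_matrices=False)

Y = U.T @ X                   # K_tilde = Y^T Y with Y = U^T X
K_tilde = Y.T @ Y

print(np.allclose(K_tilde, K_tilde.T))              # True: symmetric
print(np.linalg.eigvalsh(K_tilde).min() >= -1e-10)  # True: PSD up to round-off
```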
(d) [5 points] If $\mathrm{rank}(K) = \mathrm{rank}(W) = r \ll m$, show that $\tilde{K} = K$. Note: this statement holds whenever $\mathrm{rank}(K) = \mathrm{rank}(W)$, but is of interest mainly in the low-rank setting.
Solution.
Since $K = X^\top X$, $\mathrm{rank}(K) = \mathrm{rank}(X) = r$.
Similarly, $W = X'^\top X'$ implies $\mathrm{rank}(X') = r$.
Thus the columns of $X'$ span the column space of $X$, and $U_{X'}$ is an orthonormal basis for $\mathrm{range}(X)$; i.e.,
\[
\left( I_N - U_{X'} U_{X'}^\top \right) X = 0.
\]
Since $l \ge r$, using the expression from the previous part, we have:
\[
\| K - \tilde{K} \|_F = \left\| X^\top \left( I_N - U_{X'} U_{X'}^\top \right) X \right\|_F = 0.
\]
Since the Frobenius norm is zero only when every entry of the matrix is zero,
\[
\tilde{K} = K.
\]
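The exact-recovery claim can be checked on a synthetic low-rank example (a sketch; manufacturing $\mathrm{rank}(X) = r$ via a rank-$r$ factorization is an arbitrary choice for the demo):

```python
import numpy as np

rng = np.random.default_rng(5)
N, m, r, l = 40, 100, 5, 12            # rank r <= l << m
X = rng.standard_normal((N, r)) @ rng.standard_normal((r, m))  # rank(X) = r
K = X.T @ X                            # rank(K) = r

W, C = K[:l, :l], K[:, :l]             # rank(W) = r (almost surely)
K_tilde = C @ np.linalg.pinv(W) @ C.T

print(np.allclose(K, K_tilde))         # True: exact recovery
```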
(e) [5 points] If m = 20M and K is a dense matrix, how much space is required to store K if each entry
is stored as a double? How much space is required by the Nyström method if l = 10K?
Solution.
Assuming that each array element of Double data-type consumes 8 bytes worth of storage, the dense
matrix K occupies O(m2 ) space.
Hence, (20M )2 ∗ 8 bytes ≈ 2910 TeraBytes
the matrix W for the Nyström method would occupy O(l2 ) space.
Hence, (10K ∗ 10K) ∗ 8 bytes ≈ 763 MegaBytes
However, we would also need C for the purpose of reconstruction which occupies O(ml) space, which
is (20M ∗ 10K) ∗ 8 bytes ≈ 1490 GigaBytes
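For reference, the same arithmetic in a few lines of Python (the binary-unit convention, TiB/MiB/GiB, matches the $\approx$ figures above):

```python
# Storage arithmetic for part (e): 8-byte doubles, binary units.
m, l, B = 20_000_000, 10_000, 8

print(f"K: {m * m * B / 2**40:,.0f} TiB")   # ~2,910 TiB
print(f"W: {l * l * B / 2**20:,.0f} MiB")   # ~763 MiB
print(f"C: {m * l * B / 2**30:,.0f} GiB")   # ~1,490 GiB
```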