Homework 2 Written Assignment
10-605/10-805: Machine Learning with Large Datasets
Due Tuesday, September 29th at 1:30:00 PM Eastern Time

Submit your solutions via Gradescope, with your solution to each subproblem on a separate page, i.e., following the template below. Note that Homework 2 consists of two parts: this written assignment and a programming assignment. The written part is worth 30% of your total HW2 grade (the programming part makes up the remaining 70%).

Name: Varun Rawal
Andrew ID: vrawal

1 Nyström Method (30 points)

Nyström method. Define the following block representation of a kernel matrix:
\[
K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}
\quad \text{and} \quad
C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}.
\]
The Nyström method uses $W \in \mathbb{R}^{l \times l}$, $C \in \mathbb{R}^{m \times l}$, and $K \in \mathbb{R}^{m \times m}$ to generate the approximation $\widetilde{K} = C W^\dagger C^\top \approx K$.

(a) [5 points] Show that $W$ is symmetric positive semi-definite (SPSD) and that $\|K - \widetilde{K}\|_F = \|K_{22} - K_{21} W^\dagger K_{21}^\top\|_F$, where $\|\cdot\|_F$ is the Frobenius norm.

Solution. We are given that
\[
K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}.
\]
Since $K$ is a kernel (Gram) matrix, $K$ is SPSD; in particular, $K$ is symmetric, so $K^\top = K$. Hence
\[
K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}
= K^\top
= \begin{bmatrix} W^\top & K_{21}^\top \\ K_{21} & K_{22}^\top \end{bmatrix}
\implies W = W^\top,
\]
so $W$ is symmetric.

Also, since $K$ is PSD, $W$ is PSD as well, which can be shown as follows. For any block matrix
\[
M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}
\]
satisfying $z^* M z \ge 0$ for all complex $z$, take in particular $z = [v, 0]^\top$:
\[
\begin{bmatrix} v^* & 0 \end{bmatrix}
\begin{bmatrix} A & B \\ C & D \end{bmatrix}
\begin{bmatrix} v \\ 0 \end{bmatrix}
= v^* A v \ge 0.
\]
A similar argument with $z = [0, v]^\top$ applies to $D$, so both diagonal blocks $A$ and $D$ of a PSD matrix must themselves be positive semi-definite. Hence $W$ is symmetric and PSD, i.e., $W$ is SPSD.

For the second claim, since the approximation is generated as $\widetilde{K} = C W^\dagger C^\top$ and $C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}$, we have
\[
\widetilde{K}
= \begin{bmatrix} W \\ K_{21} \end{bmatrix} W^\dagger \begin{bmatrix} W & K_{21}^\top \end{bmatrix}
= \begin{bmatrix} W W^\dagger W & W W^\dagger K_{21}^\top \\ K_{21} W^\dagger W & K_{21} W^\dagger K_{21}^\top \end{bmatrix}
= \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{21} W^\dagger K_{21}^\top \end{bmatrix}.
\]
Here we used $W W^\dagger W = W$, together with the fact that for a PSD matrix the column space of the off-diagonal block $K_{21}^\top$ is contained in the column space of the diagonal block $W$, so $W W^\dagger K_{21}^\top = K_{21}^\top$ (and, symmetrically, $K_{21} W^\dagger W = K_{21}$). Thus $K_{22}$ is approximated by $K_{21} W^\dagger K_{21}^\top$, and the error in that block is the generalized Schur complement of $W$ in $K$.

Subtracting $\widetilde{K}$ from $K$, we get
\[
K - \widetilde{K} = \begin{bmatrix} 0 & 0 \\ 0 & K_{22} - K_{21} W^\dagger K_{21}^\top \end{bmatrix}.
\]
Taking Frobenius norms on both sides,
\[
\|K - \widetilde{K}\|_F
= \left\| \begin{bmatrix} 0 & 0 \\ 0 & K_{22} - K_{21} W^\dagger K_{21}^\top \end{bmatrix} \right\|_F
= \|K_{22} - K_{21} W^\dagger K_{21}^\top\|_F.
\]
Hence, proved.

(b) [10 points] Let $K = X^\top X$ for some $X \in \mathbb{R}^{N \times m}$, and let $X' \in \mathbb{R}^{N \times l}$ be the first $l$ columns of $X$. Show that $\widetilde{K} = X^\top P_{U_{X'}} X$, where $P_{U_{X'}}$ is the orthogonal projection onto the span of the left singular vectors of $X'$.

Solution. Since $K = X^\top X$ for some $X \in \mathbb{R}^{N \times m}$, we may define a zero-one sampling matrix $S \in \mathbb{R}^{m \times l}$ that selects $l$ columns from $K$, i.e., $C = KS$. Each column of $S$ has exactly one non-zero entry. Further,
\[
W = S^\top K S = (XS)^\top (XS) = X'^\top X',
\]
where $X' = XS \in \mathbb{R}^{N \times l}$ contains the $l$ sampled columns of $X$, and $X' = U_{X'} \Sigma_{X'} V_{X'}^\top$ is the (thin) SVD of $X'$.

Now, since $\widetilde{K} = C W^\dagger C^\top$, we can express $C$ and $W$ in terms of $X$ and $S$ as follows:
\[
\widetilde{K}
= KS \, [S^\top K S]^\dagger \, S^\top K^\top
= X^\top X' (X'^\top X')^\dagger X'^\top X
= X^\top U_{X'} U_{X'}^\top X,
\]
where the last step substitutes the thin SVD: $X' (X'^\top X')^\dagger X'^\top = U_{X'} \Sigma_{X'} V_{X'}^\top (V_{X'} \Sigma_{X'}^2 V_{X'}^\top)^\dagger V_{X'} \Sigma_{X'} U_{X'}^\top = U_{X'} U_{X'}^\top$.

Now, $U_{X'} U_{X'}^\top$ is the orthogonal projection onto the span of the left singular vectors of $X'$, and thus can be expressed as $P_{U_{X'}} = U_{X'} U_{X'}^\top$. Hence we have
\[
\widetilde{K} = X^\top P_{U_{X'}} X.
\]

(c) [5 points] Is $\widetilde{K}$ symmetric positive semi-definite (SPSD)?

Solution. Since $\widetilde{K}$ can be expressed as
\[
\widetilde{K} = X^\top U_{X'} U_{X'}^\top X = [U_{X'}^\top X]^\top [U_{X'}^\top X] = Y^\top Y
\quad \text{for } Y = U_{X'}^\top X,
\]
$\widetilde{K}$ is itself a Gram matrix, and hence is indeed symmetric positive semi-definite (SPSD).
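As an optional sanity check (not part of the required proofs), the identities in parts (a)-(c) can be verified numerically. The following is a minimal NumPy sketch; the dimensions $N$, $m$, $l$ and the random data $X$ are arbitrary choices for illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, m, l = 50, 30, 10

# Build an SPSD kernel matrix K = X^T X.
X = rng.standard_normal((N, m))
K = X.T @ X

# Block structure: W is the top-left l x l block, C holds the first l columns.
W = K[:l, :l]
C = K[:, :l]

# Nystrom approximation: K_tilde = C W^+ C^T.
K_tilde = C @ np.linalg.pinv(W) @ C.T

# (a) The error equals the Schur-complement block in Frobenius norm.
K21, K22 = K[l:, :l], K[l:, l:]
schur = K22 - K21 @ np.linalg.pinv(W) @ K21.T
assert np.isclose(np.linalg.norm(K - K_tilde), np.linalg.norm(schur))

# (b) K_tilde = X^T P X, with P the projector onto the left singular
#     vectors of X' (the first l columns of X).
U, _, _ = np.linalg.svd(X[:, :l], full_matrices=False)
assert np.allclose(K_tilde, X.T @ (U @ U.T) @ X)

# (c) K_tilde is SPSD: symmetric with (numerically) nonnegative eigenvalues.
assert np.allclose(K_tilde, K_tilde.T)
assert np.linalg.eigvalsh(K_tilde).min() > -1e-6
\end{verbatim}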
(d) [5 points] If $\mathrm{rank}(K) = \mathrm{rank}(W) = r \ll m$, show that $\widetilde{K} = K$. Note: this statement holds whenever $\mathrm{rank}(K) = \mathrm{rank}(W)$, but is of interest mainly in the low-rank setting.

Solution. Since $K = X^\top X$, $\mathrm{rank}(K) = \mathrm{rank}(X) = r$. Similarly, $W = X'^\top X'$ implies $\mathrm{rank}(X') = r$. Since $\mathrm{col}(X') \subseteq \mathrm{col}(X)$ and both subspaces have dimension $r$, the columns of $X'$ span the column space of $X$, and $U_{X'}$ is an orthonormal basis for $\mathrm{col}(X)$, i.e.,
\[
(I_N - U_{X'} U_{X'}^\top)\, X = 0.
\]
Since $l \ge r$, from the equation in the previous solution we have
\[
\|K - \widetilde{K}\|_F
= \| X^\top (I_N - U_{X'} U_{X'}^\top) X \|_F
= 0.
\]
Since the Frobenius norm is zero only when every entry of the matrix is zero, we conclude
\[
\widetilde{K} = K.
\]

(e) [5 points] If $m = 20\mathrm{M}$ and $K$ is a dense matrix, how much space is required to store $K$ if each entry is stored as a double? How much space is required by the Nyström method if $l = 10\mathrm{K}$?

Solution. Assuming each double-precision entry consumes 8 bytes of storage, the dense matrix $K$ occupies $O(m^2)$ space:
\[
(20\mathrm{M})^2 \times 8 \text{ bytes} = (2 \times 10^7)^2 \times 8 \text{ bytes} = 3.2 \times 10^{15} \text{ bytes} \approx 2910 \text{ TB}.
\]
The matrix $W$ for the Nyström method occupies $O(l^2)$ space:
\[
(10\mathrm{K})^2 \times 8 \text{ bytes} = 8 \times 10^8 \text{ bytes} \approx 763 \text{ MB}.
\]
However, we also need $C$ for the purpose of reconstruction, which occupies $O(ml)$ space:
\[
(20\mathrm{M} \times 10\mathrm{K}) \times 8 \text{ bytes} = 1.6 \times 10^{12} \text{ bytes} \approx 1490 \text{ GB}.
\]
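These figures can be reproduced with a quick back-of-the-envelope computation. A minimal sketch, assuming $20\mathrm{M} = 2 \times 10^7$, $10\mathrm{K} = 10^4$, 8-byte doubles, and binary prefixes ($1\ \mathrm{TB} = 2^{40}$ bytes, $1\ \mathrm{GB} = 2^{30}$ bytes, $1\ \mathrm{MB} = 2^{20}$ bytes):

\begin{verbatim}
m, l, bytes_per_double = 20_000_000, 10_000, 8

k_bytes = m * m * bytes_per_double  # dense m x m kernel matrix K
w_bytes = l * l * bytes_per_double  # l x l block W
c_bytes = m * l * bytes_per_double  # m x l block C

print(f"K: {k_bytes / 2**40:,.0f} TB")  # ~2,910 TB
print(f"W: {w_bytes / 2**20:,.0f} MB")  # ~763 MB
print(f"C: {c_bytes / 2**30:,.0f} GB")  # ~1,490 GB
\end{verbatim}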