Introduction to Machine Learning
CSE474/574: Singular Value Decomposition
Varun Chandola <chandola@buffalo.edu>

Contents

1 Singular Value Decomposition
  1.1 Economy Sized SVD
  1.2 Connection between Eigenvectors and Singular Vectors
  1.3 PCA Using SVD
  1.4 Low Rank Approximations Using SVD
  1.5 The Matrix Approximation Lemma
  1.6 Equivalence Between PCA and SVD
  1.7 SVD Applications
2 Probabilistic PCA
  2.1 EM for PCA

1 Singular Value Decomposition

• Any $N \times D$ matrix $X$ can be factored as
  $$\underbrace{X}_{N \times D} \;=\; \underbrace{U}_{N \times N}\,\underbrace{S}_{N \times D}\,\underbrace{V^\top}_{D \times D}$$
• $U$ is an $N \times N$ matrix and all columns of $U$ are orthonormal, i.e., $U^\top U = I_N$.
• $V$ is a $D \times D$ matrix whose rows and columns are orthonormal (i.e., $V^\top V = I_D$ and $V V^\top = I_D$).
• $S$ is an $N \times D$ matrix containing the $r = \min(N, D)$ singular values $\sigma_i \ge 0$ on the main diagonal and 0s in the rest of the matrix.
• The columns of $U$ are the left singular vectors and the columns of $V$ are the right singular vectors.

[Figure: the full SVD of $X$ (upper panel) and its truncated SVD approximation of rank $L$ (lower panel).]

1.1 Economy Sized SVD

• Assume that $N > D$. Then
  $$\underbrace{X}_{N \times D} \;=\; \underbrace{\tilde{U}}_{N \times D}\,\underbrace{\tilde{S}}_{D \times D}\,\underbrace{\tilde{V}^\top}_{D \times D}$$
• Here $\tilde{U}$ keeps only the first $D$ columns of $U$ (its columns are still orthonormal, $\tilde{U}^\top \tilde{U} = I_D$) and $\tilde{S}$ is the square diagonal matrix of singular values.

1.2 Connection between Eigenvectors and Singular Vectors

• Let $X = USV^\top$. For any matrix $X$ ($N \times D$),
  $$X^\top X = (USV^\top)^\top (USV^\top) = V S^\top U^\top U S V^\top = V (S^\top S) V^\top = V D V^\top,$$
  where $D = S^\top S$ is a diagonal matrix containing the squares of the singular values. (Remember that both $U$ and $V$ are orthonormal matrices, so $U^\top U = I_N$.)
• Hence, $(X^\top X)V = VD$, which means that the columns of $V$ are the eigenvectors of $X^\top X$ and $D$ contains the eigenvalues.
• Similarly, one can show that the columns of $U$ are the eigenvectors of $X X^\top$ and $D$ contains the eigenvalues.

1.3 PCA Using SVD

• Assuming that $X$ is centered (zero mean), the principal components are equal to the right singular vectors of $X$.

1.4 Low Rank Approximations Using SVD

Singular Value Decomposition – Recap
• What is the (column) rank of a matrix?
  – The maximum number of linearly independent columns in the matrix.
• For $X = USV^\top$ (SVD), what is the rank of $\hat{X}^{(1)} = \sigma_1 U_{:1} V_{:1}^\top$?
  – The rank is 1, because each column of $\hat{X}^{(1)}$ is a scaled version of the vector $U_{:1}$.
• How much storage is needed for a rank 1 matrix?
  – Only $O(N + D)$ values (an $N$-vector, a $D$-vector, and one scalar); when $N \ge D$ this is $O(N)$.

• More generally, choose only the first $L$ singular values:
  $$\underbrace{X}_{N \times D} \;\approx\; \underbrace{\tilde{U}}_{N \times L}\,\underbrace{\tilde{S}}_{L \times L}\,\underbrace{\tilde{V}^\top}_{L \times D}$$
• Only $NL + LD + L$ values are needed to represent the $N \times D$ matrix.
• Also known as the rank-$L$ approximation of the matrix $X$. (Why? Because the rank of the approximating matrix is $L$.)
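The short NumPy sketch below is not part of the original notes; it is a minimal illustration on randomly generated data of the full and economy-sized SVD shapes (Sections 1 and 1.1) and of the eigenvector connection in Section 1.2. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 5                              # N > D, as assumed for the economy-sized SVD
X = rng.standard_normal((N, D))

# Full SVD: X = U S V^T with U (N x N), S (N x D), V^T (D x D).
U, s, Vt = np.linalg.svd(X, full_matrices=True)   # s holds the singular values
S = np.zeros((N, D))
S[:D, :D] = np.diag(s)
assert np.allclose(X, U @ S @ Vt)

# Economy-sized SVD: X = U~ S~ V~^T with U~ (N x D), S~ (D x D), V~^T (D x D).
U_e, s_e, Vt_e = np.linalg.svd(X, full_matrices=False)
assert U_e.shape == (N, D) and Vt_e.shape == (D, D)
assert np.allclose(X, U_e @ np.diag(s_e) @ Vt_e)

# Connection to eigenvectors: X^T X = V (S^T S) V^T = V D V^T, so the columns
# of V are eigenvectors of X^T X with eigenvalues sigma_i^2.
V = Vt.T
Dmat = np.diag(s ** 2)
assert np.allclose(X.T @ X @ V, V @ Dmat)

# Similarly, the columns of U (economy-sized) are eigenvectors of X X^T.
assert np.allclose(X @ X.T @ U_e, U_e @ np.diag(s_e ** 2))
```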
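A second small sketch, again on synthetic data and not from the original notes, illustrates Sections 1.3 and 1.4: PCA via the right singular vectors of the centered data, and the rank-$L$ truncation with its $NL + LD + L$ storage cost. The Frobenius error printed at the end previews the approximation result discussed in Section 1.5.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, L = 200, 10, 3
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))   # correlated synthetic data

# PCA via SVD: centre the data; the principal components are the right
# singular vectors, i.e. the columns of V.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt.T                       # D x D, columns = principal components
Z = Xc @ W[:, :L]              # data projected onto the first L components

# Rank-L approximation: keep only the first L singular values / vectors.
X_L = U[:, :L] @ np.diag(s[:L]) @ Vt[:L, :]
storage = N * L + L * D + L    # values needed to store the rank-L factors
print("values stored:", storage, "instead of", N * D)
print("rank of X_L:", np.linalg.matrix_rank(X_L))               # = L
print("Frobenius error:", np.linalg.norm(Xc - X_L, "fro"),
      "= sqrt of the discarded sigma_i^2:", np.sqrt(np.sum(s[L:] ** 2)))
```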
1.5 The Matrix Approximation Lemma

• Among all possible rank-$L$ approximations of a matrix $X$, the truncated SVD gives the best approximation, in the sense of minimizing the Frobenius norm $\|X - X_L\|_F$.
• Also known as the Eckart–Young–Mirsky theorem.

Importance of the Matrix Approximation Lemma
• There are many ways to "approximate" a matrix with a lower rank matrix.
• A low rank approximation allows us to store the matrix using much less than the $N \times D$ values of the original (only $O((N + D)L)$ values).
• The SVD gives the best possible such approximation.

1.6 Equivalence Between PCA and SVD

• For data $X$ (assuming it to be centered), the principal components are the eigenvectors of $X^\top X$.
• Or, the principal components are the columns of $V$: $W = V$.
• Or $\hat{W} = \hat{V}$, where $\hat{W}$ are the first $L$ principal components and $\hat{V}$ are the first $L$ right singular vectors.
• For PCA, the data in the latent space is $\hat{Z} = X\hat{W} \approx \hat{U}\hat{S}\hat{V}^\top \hat{V} = \hat{U}\hat{S}$.
• The optimal reconstruction for PCA is $\hat{X} = \hat{Z}\hat{W}^\top = \hat{U}\hat{S}\hat{V}^\top$.
• That is, the reconstruction minimizing the error $\|X - \hat{X}\|_2^2$ is the same as the truncated SVD approximation!

1.7 SVD Applications

• A faster way to do PCA (truncated SVD, sparse SVD).
• Other applications as well:
  – Image compression
  – Recommender systems (there are better methods)
  – Topic modeling (Latent Semantic Indexing)

2 Probabilistic PCA

• Recall the Factor Analysis model:
  $$p(x_i \mid z_i, \theta) = \mathcal{N}(W z_i + \mu, \Psi)$$
• For PPCA, $\Psi = \sigma^2 I$.
• The covariance for each observation $x$ is given by $W W^\top + \sigma^2 I$.
• If we maximize the log-likelihood of a data set $X$, the MLE for $W$ is
  $$\hat{W} = V(\Lambda - \sigma^2 I)^{1/2}$$
  (up to an arbitrary rotation), where
  – $V$ contains the first $L$ eigenvectors of the sample covariance $S = \frac{1}{N} X^\top X$, and
  – $\Lambda$ is the diagonal matrix of the corresponding first $L$ eigenvalues.

2.1 EM for PCA

• The PPCA formulation allows for EM-based learning of the parameters.
• $Z$ is a matrix containing the $N$ latent random variables.

Benefits of EM
• EM can be faster.
• It can be implemented in an online fashion.
• It can handle missing data.
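The closed-form PPCA solution above can be checked numerically. The sketch below is a minimal illustration on synthetic data, not part of the original notes; in particular, estimating $\sigma^2$ as the average of the discarded eigenvalues is the standard Tipping–Bishop result and is an assumption not stated in the notes above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, L = 500, 6, 2
X = rng.standard_normal((N, L)) @ rng.standard_normal((L, D)) \
    + 0.1 * rng.standard_normal((N, D))
Xc = X - X.mean(axis=0)

# Sample covariance S = (1/N) X^T X (X centred), eigenvalues in decreasing order.
S = Xc.T @ Xc / N
lam, V = np.linalg.eigh(S)           # eigh returns ascending order
lam, V = lam[::-1], V[:, ::-1]       # reorder to descending

# MLE of the noise variance: average of the discarded eigenvalues
# (standard PPCA result, assumed here).
sigma2 = lam[L:].mean()

# MLE of W (up to a rotation): W = V_L (Lambda_L - sigma^2 I)^{1/2}.
W = V[:, :L] @ np.diag(np.sqrt(lam[:L] - sigma2))

# The model covariance W W^T + sigma^2 I matches S in the top-L eigendirections.
C = W @ W.T + sigma2 * np.eye(D)
print(np.round(np.linalg.eigvalsh(C)[::-1], 4))   # top-L values equal lam[:L]
print(np.round(lam, 4))
```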
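Finally, a minimal sketch of EM for PPCA. The E-step and M-step updates below follow the standard Tipping–Bishop formulation, which is not derived in these notes, and all function and variable names (e.g. `ppca_em`) are illustrative assumptions rather than the course's implementation.

```python
import numpy as np

def ppca_em(X, L, n_iter=100, sigma2=1.0, seed=0):
    """A minimal EM loop for probabilistic PCA (Tipping & Bishop updates)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                                      # EM only needs the centred data
    W = rng.standard_normal((D, L))                  # random initialisation
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables Z (one row per data point).
        M = W.T @ W + sigma2 * np.eye(L)             # L x L
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                           # rows are E[z_n]
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez       # sum_n E[z_n z_n^T]
        # M-step: update W, then sigma^2.
        W = (Xc.T @ Ez) @ np.linalg.inv(sumEzz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W))
                  + np.trace(sumEzz @ W.T @ W)) / (N * D)
    return W, sigma2, mu

# Example: recover a 2-dimensional latent subspace from noisy 5-dimensional data.
rng = np.random.default_rng(3)
Z_true = rng.standard_normal((1000, 2))
W_true = rng.standard_normal((5, 2))
X = Z_true @ W_true.T + 0.1 * rng.standard_normal((1000, 5))
W_hat, s2_hat, mu_hat = ppca_em(X, L=2)
print("estimated noise variance:", s2_hat)           # should be close to 0.1**2
```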