Introduction to Machine Learning
CSE474/574: Singular Value Decomposition
Varun Chandola <chandola@buffalo.edu>
Contents
1 Singular Value Decomposition
  1.1 Economy Sized SVD
  1.2 Connection between Eigenvectors and Singular Vectors
  1.3 PCA Using SVD
  1.4 Low Rank Approximations Using SVD
  1.5 The Matrix Approximation Lemma
  1.6 Equivalence Between PCA and SVD
  1.7 SVD Applications

2 Probabilistic PCA
  2.1 EM for PCA
1 Singular Value Decomposition

• For any matrix X (N × D), the singular value decomposition is

      X = U S V^T

  where X is N × D, U is N × N, S is N × D, and V is D × D.

U is an N × N matrix and all columns of U are orthonormal, i.e., U^T U = I_N. V is a D × D matrix whose rows and columns are orthonormal (i.e., V^T V = I_D and V V^T = I_D). S is an N × D matrix containing the r = min(N, D) singular values σ_i ≥ 0 on the main diagonal and 0s in the rest of the matrix. The columns of U are the left singular vectors and the columns of V are the right singular vectors.

[Figure omitted: the lower panel showed the truncated SVD approximation of rank L.]

1.1 Economy Sized SVD

• Assume that N > D. Then

      X = Ũ S̃ Ṽ^T

  where X is N × D, Ũ is N × D, S̃ is D × D, and Ṽ is D × D.

1.2 Connection between Eigenvectors and Singular Vectors

• Let X = U S V^T.
• For any matrix X (N × D),

      X^T X = V S^T U^T U S V^T = V (S^T S) V^T = V D V^T

• where D = S^2 is a diagonal matrix containing the squares of the singular values.
• Hence,

      (X^T X) V = V D

• This means that the columns of V are the eigenvectors of X^T X and D contains the eigenvalues. Remember that both U and V are orthonormal matrices.
• Similarly, one can show that the columns of U are the eigenvectors of X X^T and D contains the eigenvalues.
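
The decompositions and the eigenvector connection above are easy to verify numerically. The following is a minimal NumPy sketch (not part of the original notes; the random test matrix and variable names are illustrative only):

```python
# Minimal sketch: full SVD, economy-sized SVD, and the eigenvector connection.
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 3                               # N > D, as assumed above
X = rng.standard_normal((N, D))           # illustrative test matrix

# Full SVD: U is N x N, Vt is D x D, s holds the min(N, D) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vt.shape)         # (8, 8) (3,) (3, 3)

# Economy-sized SVD: the U factor is only N x D.
U_e, s_e, Vt_e = np.linalg.svd(X, full_matrices=False)
print(U_e.shape, s_e.shape, Vt_e.shape)   # (8, 3) (3,) (3, 3)

# Connection to eigenvectors: X^T X = V (S^T S) V^T, so the eigenvalues of
# X^T X are the squared singular values.
evals = np.linalg.eigvalsh(X.T @ X)       # ascending order
assert np.allclose(np.sort(s ** 2), evals)
```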
1.3 PCA Using SVD

• Assuming that X is centered (zero mean), the principal components are equal to the right singular vectors of X.
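
As an illustration of this fact, the following sketch (not from the notes; the data and names are illustrative) computes the principal components of a centered data matrix both from the eigenvectors of X^T X and from the right singular vectors of X, and checks that they agree up to sign:

```python
# Sketch: principal components of centered data via eigendecomposition vs. SVD.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
Xc = X - X.mean(axis=0)                      # center the data

# PCA via eigendecomposition of X^T X.
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
evecs = evecs[:, np.argsort(evals)[::-1]]    # sort by decreasing eigenvalue

# PCA via SVD: rows of Vt are the right singular vectors (decreasing order).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# The two sets of components agree up to the sign of each vector.
for k in range(5):
    assert np.allclose(np.abs(Vt[k]), np.abs(evecs[:, k]))
```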
1.4 Low Rank Approximations Using SVD

Singular Value Decomposition - Recap

• What is the (column) rank of a matrix?
  – The maximum number of linearly independent columns in the matrix.
• For X = U S V^T (SVD):
  – What is the rank of X̂(1) = U_{:1} σ_1 V_{:1}^T?
  – The rank is 1, because each column of X̂(1) is a scaled version of the vector U_{:1}.
• How much storage is needed for a rank 1 matrix?
  – O(N): one N-vector, one D-vector, and one scalar, i.e., N + D + 1 values, which is O(N) since N > D.

Low Rank Approximations Using SVD

• Choose only the first L singular values:

      X ≈ Ũ S̃ Ṽ^T

  where X is N × D, Ũ is N × L, S̃ is L × L, and Ṽ^T is L × D.

• Only N L + L D + L values are needed to represent the N × D matrix.
• This is also known as the rank L approximation of the matrix X. (Why? Because the rank of the approximating matrix is L.)

1.5 The Matrix Approximation Lemma

• Among all possible rank L approximations of a matrix X, the truncated SVD gives the best approximation
  – in the sense of minimizing the Frobenius norm ‖X − X_L‖_F.
• Also known as the Eckart-Young-Mirsky theorem.

Importance of the Matrix Approximation Lemma

• There are many ways to "approximate" a matrix with a lower rank approximation.
• A low rank approximation allows us to store the matrix using much less than the N × D values of the full matrix (only O(N × L) values).
• The SVD gives the best possible approximation, i.e., it minimizes ‖X − X̂‖_F over all rank L matrices X̂.
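
A short NumPy sketch of the rank-L truncation and its Frobenius error is given below (illustrative only; the test matrix is arbitrary). The final check uses the known expression for the error of the best rank-L approximation, namely the square root of the sum of the discarded squared singular values:

```python
# Sketch: rank-L approximation of X by keeping only the first L singular values.
import numpy as np

rng = np.random.default_rng(2)
N, D, L = 50, 30, 5
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_L = U[:, :L] @ np.diag(s[:L]) @ Vt[:L, :]       # truncated SVD of rank L

print(np.linalg.matrix_rank(X_L))                 # L
print(N * L + L + L * D, "values instead of", N * D)

# Eckart-Young-Mirsky: the Frobenius error of the best rank-L approximation
# equals the square root of the sum of the discarded squared singular values.
err = np.linalg.norm(X - X_L, 'fro')
assert np.allclose(err, np.sqrt(np.sum(s[L:] ** 2)))
```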
1.6 Equivalence Between PCA and SVD

• For data X (assuming it to be centered), the principal components are the eigenvectors of X^T X.
• Or, the principal components are the columns of V:

      W = V

• Or, keeping only the first L components,

      Ŵ = V̂

  where Ŵ are the first L principal components and V̂ are the first L right singular vectors.
• For PCA, the data in the latent space is (writing X by its truncated SVD Û Ŝ V̂^T):

      Ẑ = X Ŵ = Û Ŝ V̂^T V̂ = Û Ŝ

• The optimal reconstruction for PCA is

      X̂ = Ẑ Ŵ^T = Û Ŝ V̂^T

• The optimal reconstruction is the same as the truncated SVD approximation!

1.7 SVD Applications

• A faster way to do PCA (truncated SVD, sparse SVD).
• Other applications as well:
  – Image compression
  – Recommender systems
    ∗ There are better methods
  – Topic modeling (Latent Semantic Indexing)
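
The equivalence can be checked numerically. The sketch below (illustrative, not from the notes) projects centered data onto the first L right singular vectors and verifies that Ẑ = X V̂ = Û Ŝ and that the PCA reconstruction Ẑ V̂^T coincides with the truncated SVD:

```python
# Sketch of the PCA/SVD equivalence: latent projection and reconstruction.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 10))
X = X - X.mean(axis=0)                        # PCA assumes centered data
L = 3

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_hat, S_hat, V_hat = U[:, :L], np.diag(s[:L]), Vt[:L, :].T

Z_hat = X @ V_hat                             # data in the latent space
assert np.allclose(Z_hat, U_hat @ S_hat)      # Z_hat = U_hat S_hat

X_hat = Z_hat @ V_hat.T                       # PCA reconstruction
assert np.allclose(X_hat, U_hat @ S_hat @ V_hat.T)   # = truncated SVD
```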
2 Probabilistic PCA

• Recall the Factor Analysis model:

      p(x_i | z_i, θ) = N(W z_i + µ, Ψ)

• For PPCA, Ψ = σ^2 I.
• The covariance for each observation x is given by:

      W W^T + σ^2 I

• If we maximize the log-likelihood of a data set X, the MLE for W is:

      Ŵ = V (Λ − σ^2 I)^{1/2}

• V - the first L eigenvectors of S = (1/N) X^T X
• Λ - the diagonal matrix of the first L eigenvalues
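
A minimal sketch of this closed-form estimate is shown below (illustrative; the function name is made up, and σ^2 is estimated here as the average of the discarded eigenvalues, which is the usual PPCA maximum-likelihood estimate):

```python
# Sketch: closed-form PPCA MLE, W_hat = V_L (Lambda_L - sigma^2 I)^(1/2).
import numpy as np

def ppca_mle(X, L):
    """Return the ML estimate of W for probabilistic PCA (up to rotation)."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    S = (Xc.T @ Xc) / N                       # sample covariance
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]           # sort eigenpairs, largest first
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[L:].mean()                 # MLE of the noise variance
    Lam_L, V_L = np.diag(evals[:L]), evecs[:, :L]
    W_hat = V_L @ np.sqrt(Lam_L - sigma2 * np.eye(L))
    return W_hat, sigma2

W_hat, sigma2 = ppca_mle(np.random.default_rng(4).standard_normal((200, 8)), L=2)
print(W_hat.shape, sigma2)                    # (8, 2) and a scalar
```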
2.1 EM for PCA

• The PPCA formulation allows for EM-based learning of the parameters.
• Z is a matrix containing the N latent random variables.

Benefits of EM

• EM can be faster.
• It can be implemented in an online fashion.
• It can handle missing data.
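
The notes do not spell out the updates, so the sketch below uses the standard σ^2 → 0 limit of EM for PPCA ("EM for PCA"), alternating a least-squares E-step for the latent coordinates and an M-step for the loading matrix; treat the exact equations as an assumption rather than the course's prescribed algorithm:

```python
# Sketch: EM for PCA (sigma^2 -> 0 limit of EM for PPCA).
import numpy as np

def em_pca(X, L, n_iter=100):
    """Estimate an L-dimensional principal subspace of X by alternating updates."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    rng = np.random.default_rng(0)
    W = rng.standard_normal((D, L))                  # random initial loadings
    for _ in range(n_iter):
        Z = Xc @ W @ np.linalg.inv(W.T @ W)          # E-step: latent coordinates
        W = Xc.T @ Z @ np.linalg.inv(Z.T @ Z)        # M-step: update loadings
    return W                                         # spans the principal subspace

W = em_pca(np.random.default_rng(5).standard_normal((300, 6)), L=2)
print(W.shape)                                       # (6, 2)
```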