An Introduction To Matrix Decomposition
Lei Zhang, Lead Researcher
Microsoft Research Asia
2012-04-17

Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

What Is Matrix Decomposition?
• We wish to decompose a matrix A by writing it as a product of two or more matrices:
  An×m = Bn×kCk×m
• Suppose A, B, C are column matrices:
  – An×m = (a1, a2, …, am): each ai is an n-dimensional data sample
  – Bn×k = (b1, b2, …, bk): each bj is an n-dimensional basis vector; the space B consists of k bases
  – Ck×m = (c1, c2, …, cm): each ci holds the k-dimensional coordinates of ai projected onto the space B

Why Do We Need Matrix Decomposition?
• Given one data sample: a1 = Bn×kc1, i.e.
  (a11, a12, …, a1n)ᵀ = (b1, b2, …, bk)(c11, c12, …, c1k)ᵀ
• Another data sample: a2 = Bn×kc2
• More data samples: am = Bn×kcm
• Together (m data samples):
  (a1, a2, …, am) = Bn×k(c1, c2, …, cm), i.e. An×m = Bn×kCk×m

Why Do We Need Matrix Decomposition?
• (a1, a2, …, am) = Bn×k(c1, c2, …, cm), i.e. An×m = Bn×kCk×m
• We wish to find a set of new bases B to represent the data samples A; A becomes C in the new space.
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples.
• In PCA: B is the eigenvectors
• In SVD: B is the left (column) singular vectors
• In LDA: B is the discriminant directions
• In NMF: B is the local features

PRINCIPAL COMPONENT ANALYSIS

Definition – Eigenvalue & Eigenvector
• Given an m×m matrix C, if for some scalar λ and nonzero vector w
  Cw = λw,
  then λ is called an eigenvalue and w an eigenvector of C.

Definition – Principal Component Analysis
• Principal Component Analysis (PCA), also known as the Karhunen–Loève transform (KL transform)
• Let A be an n×m data matrix in which the rows represent data samples
• Each row is a data vector; each column represents a variable
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean
• Covariance matrix C (m×m): C = AᵀA

Principal Component Analysis
• C can be decomposed as C = UΛUᵀ
• Λ is a diagonal matrix diag(λ1, λ2, …, λm); each λi is an eigenvalue
• U is an orthogonal matrix; each column is an eigenvector: UᵀU = I, U⁻¹ = Uᵀ

Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance
• The projection of the data along w is Aw
• Variance: σw² = (Aw)ᵀ(Aw) = wᵀAᵀAw = wᵀCw, where C = AᵀA is the covariance matrix of the data (A is centered!)
• Task: maximize the variance subject to the constraint wᵀw = 1

Optimization Problem
• Maximize J(w) = wᵀCw − λ(wᵀw − 1), where λ is the Lagrange multiplier
• Differentiating with respect to w yields 2Cw − 2λw = 0
• Eigenvalue equation: Cw = λw, where C = AᵀA
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found

Property: Data Decomposition
• PCA can be treated as data decomposition:
  a = UUᵀa
    = (u1, u2, …, un)(u1, u2, …, un)ᵀa
    = (u1, u2, …, un)(⟨u1, a⟩, ⟨u2, a⟩, …, ⟨un, a⟩)ᵀ
    = (u1, u2, …, un)(b1, b2, …, bn)ᵀ
    = Σi bi·ui

Face Recognition – Eigenface
• Turk, M. A.; Pentland, A. P. Face Recognition Using Eigenfaces. CVPR 1991. (Citations: 2654)
• The eigenface approach:
  – images are points in a vector space
  – use PCA to reduce the dimensionality
  – face space
  – compare projections onto the face space to recognize faces
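To make the PCA steps above concrete, here is a minimal numpy sketch (not from the original slides): it centers a small synthetic data matrix whose rows are samples, forms C = AᵀA, eigendecomposes it, and projects the data onto the leading eigenvectors. The synthetic data and the choice k = 2 are illustrative assumptions.

```python
import numpy as np

# Toy data matrix A: n samples (rows) x m variables (columns), as on the PCA slides.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))

# Center each column so it has zero mean.
A_centered = A - A.mean(axis=0)

# Covariance-style matrix C = A^T A (m x m).
C = A_centered.T @ A_centered

# Eigendecomposition C = U Lambda U^T; eigh returns eigenvalues in ascending order,
# so reorder by decreasing variance.
eigvals, U = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Coordinates of the samples in the new basis (projection onto the principal directions).
scores = A_centered @ U

# Keeping only the first k columns of U gives a rank-k reconstruction of the data.
k = 2
A_k = scores[:, :k] @ U[:, :k].T
```

In practice one would usually compute the same quantities from an SVD of the centered data rather than forming C explicitly, but the eigendecomposition above mirrors the derivation on these slides.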
PageRank – Power Iteration
• Define the link matrix Q of the web graph (one row and one column per page):
  – column j has nonzero elements in the positions corresponding to the outlinks of page j (Nj in total)
  – row i has nonzero elements in the positions corresponding to the inlinks Ii of page i

Column-Stochastic & Irreducible
• Column-stochastic: the entries of Q are non-negative and every column sums to one,
  Qij = 1/Nj if page j links to page i, and 0 otherwise, so that Σi Qij = 1
• Irreducible: Q is modified (in practice by adding a small uniform "random jump" term) so that every page can be reached from every other page, which guarantees a unique dominant eigenvector

Iterative PageRank Calculation
• For k = 1, 2, …: r(k) = Qr(k−1)
• Equivalently, the limit r satisfies Qr = r, i.e. r is an eigenvector of Q with eigenvalue λ = 1 (Q is a Markov chain transition matrix)
• Why can we use the power iteration to find the first eigenvector?

Convergence of the Power Iteration
• Expand the initial approximation r0 in terms of the eigenvectors:
  r0 = c1v1 + c2v2 + … + cnvn
• Then Qᵏr0 = c1λ1ᵏv1 + c2λ2ᵏv2 + … + cnλnᵏvn
• Since λ1 = 1 > |λ2| ≥ … ≥ |λn|, the contributions of v2, …, vn die out and the iterates converge to the direction of the first eigenvector v1

SINGULAR VALUE DECOMPOSITION

SVD – Definition
• Any m×n matrix A, with m ≥ n, can be factorized as
  A = UΣVᵀ,
  where U (m×n) and V (n×n) have orthonormal columns and Σ = diag(σ1, σ2, …, σn) with σ1 ≥ σ2 ≥ … ≥ σn ≥ 0

Singular Values and Singular Vectors
• The diagonal elements σj of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of the SVD: A = Σj σj uj vjᵀ

Matrix Approximation
• Theorem: let Uk = (u1 u2 … uk), Vk = (v1 v2 … vk) and Σk = diag(σ1, σ2, …, σk), and define Ak = UkΣkVkᵀ
• Then ‖A − Ak‖2 = min{‖A − B‖2 : rank(B) ≤ k} = σk+1
• That is, the best rank-k approximation of the matrix A is Ak

SVD and PCA
• We can write AᵀA = VΣ²Vᵀ
• Remember that in PCA we treat A as a row matrix (rows are samples)
• V holds the eigenvectors for A:
  – each column of V is an eigenvector of the row matrix A (i.e. of AᵀA)
  – we use V to approximate a row of A
• Equivalently, we can write AAᵀ = UΣ²Uᵀ
• U holds the eigenvectors for Aᵀ:
  – each column of U is an eigenvector of the column matrix A (i.e. of AAᵀ)
  – we use U to approximate a column of A

Example – LSI
• Build a term-by-document matrix A
• Compute the SVD of A: A = UΣVᵀ
• Approximate A by Ak = UkDk, where Dk = ΣkVkᵀ:
  – Uk: the orthogonal basis that we use to approximate all the documents
  – Dk: column j holds the coordinates of document j in the new basis
  – Dk is the projection of A onto the subspace spanned by Uk

SVD and PCA
• For symmetric A, the SVD is closely related to PCA
• PCA: A = UΛUᵀ
  – U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΛVᵀ
  – U holds the left (column) eigenvectors
  – V holds the right (row) eigenvectors
  – Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference in what A is in PCA and SVD:
  – SVD: A is directly the data, e.g. a term-by-document matrix
  – PCA: A is a covariance matrix, A = XᵀX, where each row of X is a sample

Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing:
   – indexing: collecting terms
   – use a stop list: eliminate "meaningless" words
   – stemming
2. Construction of the term-by-document matrix; sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.

Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using the SVD

Similarity Measures
• Term to term:
  AAᵀ = UΣ²Uᵀ = (UΣ)(UΣ)ᵀ
  UΣ are the coordinates of A (rows) projected onto the space V
• Document to document:
  AᵀA = VΣ²Vᵀ = (VΣ)(VΣ)ᵀ
  VΣ are the coordinates of A (columns) projected onto the space U

Similarity Measures
• Term to document:
  A = UΣVᵀ = (UΣ^½)(VΣ^½)ᵀ
  UΣ^½ are the coordinates of A (rows) projected onto the space V
  VΣ^½ are the coordinates of A (columns) projected onto the space U
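As a small illustration of the rank-k approximation and the similarity measures above, here is a numpy sketch (not part of the original slides); the term-by-document matrix and the choice k = 2 are made-up assumptions.

```python
import numpy as np

# Made-up term-by-document matrix A: rows = terms, columns = documents.
A = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
    [1., 0., 2., 0.],
])

# Thin SVD: A = U diag(s) Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation A_k = U_k Sigma_k V_k^T (Eckart-Young).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# LSI document coordinates D_k = Sigma_k V_k^T: column j holds document j in the new basis.
D_k = np.diag(s[:k]) @ Vt[:k, :]

# Document-to-document similarity via cosines of the reduced coordinates,
# mirroring A^T A = (V Sigma)(V Sigma)^T.
norms = np.linalg.norm(D_k, axis=0)
doc_sim = (D_k.T @ D_k) / np.outer(norms, norms)
print(np.round(doc_sim, 2))
```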
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
• (Diagram: hubs pointing to authorities.)

Power Iteration
• Each page i has both a hub score hi and an authority score ai
• HITS successively refines these scores by computing
  hi ← Σ{j : i→j} aj and ai ← Σ{j : j→i} hj
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise
• Now, in matrix form (with normalization after each step):
  h(k) = La(k−1), a(k) = Lᵀh(k−1)

HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix LᵀL
• h will be the dominant eigenvector of the hub matrix LLᵀ
• In fact, h and a are the first left and right singular vectors of L!
• We are in fact running an SVD on the adjacency matrix

HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to the local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable, because of its random jump step

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition
• Given a non-negative matrix Vn×m, find non-negative matrix factors Wn×k and Hk×m such that
  Vn×m ≈ Wn×kHk×m
• V: column matrix, each column is a data sample (n-dimensional)
• W: each of the k columns represents one basis
• H: coordinates of V projected onto the space W
  vj ≈ Wn×khj

Motivation
• Non-negativity is natural in many applications…
• Probability is also non-negative
• An additive model captures local structure

Multiplicative Update Algorithm
• Cost function: Euclidean distance ‖V − WH‖²
• Multiplicative update (a numpy sketch of these updates is given at the end of this deck):
  Haμ ← Haμ (WᵀV)aμ / (WᵀWH)aμ
  Wia ← Wia (VHᵀ)ia / (WHHᵀ)ia

Multiplicative Update Algorithm
• Cost function: divergence
  D(A‖B) = Σij (Aij log(Aij/Bij) − Aij + Bij)
  – reduces to the Kullback–Leibler divergence when Σij Aij = Σij Bij = 1
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update:
  Haμ ← Haμ (Σi Wia Viμ/(WH)iμ) / (Σi Wia)
  Wia ← Wia (Σμ Haμ Viμ/(WH)iμ) / (Σμ Haμ)
• PLSA is NMF with the KL divergence

NMF vs PCA
• n = 2429 faces, m = 19×19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: a parts-based representation
• PCA: a holistic representation

Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended)
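As a closing illustration of the multiplicative update algorithm from the NMF section above, here is a minimal numpy sketch of the Euclidean update rules. The random data, the rank k = 5, the iteration count, and the small eps added for numerical safety are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def nmf_euclidean(V, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates (Lee & Seung style) for ||V - W H||_F^2 with V, W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps   # non-negative random initialization
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative data: columns are samples, as in the Definition slide.
V = np.abs(np.random.default_rng(1).normal(size=(20, 30)))
W, H = nmf_euclidean(V, k=5)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

Because each update multiplies the current factor by a ratio of non-negative terms, W and H can never become negative, which is the point of the multiplicative form.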