An Introduction To Matrix Decomposition
Lei Zhang, Lead Researcher
Microsoft Research Asia
2012-04-17

Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.

What Is Matrix Decomposition?
• We wish to decompose a matrix A by writing it as a product of two or more matrices:
  An×m = Bn×kCk×m
• Suppose A, B, C are column matrices:
  – An×m = (a1, a2, …, am): each ai is an n-dimensional data sample
  – Bn×k = (b1, b2, …, bk): each bj is an n-dimensional basis vector; the space B consists of k bases
  – Ck×m = (c1, c2, …, cm): each ci holds the k-dimensional coordinates of ai projected onto the space B

Why Do We Need Matrix Decomposition?
• Given one data sample: a1 = Bn×kc1, i.e.
  (a11, a12, …, a1n)ᵀ = (b1, b2, …, bk)(c11, c12, …, c1k)ᵀ
• Another data sample: a2 = Bn×kc2
• More data samples: am = Bn×kcm
• Together (m data samples):
  (a1, a2, …, am) = Bn×k(c1, c2, …, cm), i.e. An×m = Bn×kCk×m

Why Do We Need Matrix Decomposition?
• (a1, a2, …, am) = Bn×k(c1, c2, …, cm), i.e. An×m = Bn×kCk×m
• We wish to find a set of new bases B to represent the data samples A; A becomes C in the new space.
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples.
• In PCA: B is the eigenvectors
• In SVD: B is the left (column) singular vectors
• In LDA: B is the discriminant directions
• In NMF: B is the local features

PRINCIPAL COMPONENT ANALYSIS

Definition – Eigenvalue & Eigenvector
• Given an m×m matrix C, if for some scalar λ and nonzero vector w
  Cw = λw,
  then λ is called an eigenvalue and w an eigenvector of C.

Definition – Principal Component Analysis
• Principal Component Analysis (PCA), also known as the Karhunen–Loève transform (KL transform)
• Let A be an n×m data matrix in which the rows represent data samples
• Each row is a data vector; each column represents a variable
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean
• Covariance matrix C (m×m): C = AᵀA

Principal Component Analysis
• C can be decomposed as C = UΛUᵀ
• Λ is a diagonal matrix diag(λ1, λ2, …, λm); each λi is an eigenvalue
• U is an orthogonal matrix; each column is an eigenvector: UᵀU = I, U⁻¹ = Uᵀ

Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance
• The projection of the data along w is Aw
• Variance: σw² = (Aw)ᵀ(Aw) = wᵀAᵀAw = wᵀCw, where C = AᵀA is the covariance matrix of the data (A is centered!)
• Task: maximize the variance subject to the constraint wᵀw = 1

Optimization Problem
• Maximize J(w) = wᵀCw − λ(wᵀw − 1), where λ is the Lagrange multiplier
• Differentiating with respect to w yields 2Cw − 2λw = 0
• Eigenvalue equation: Cw = λw, where C = AᵀA
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found

Property: Data Decomposition
• PCA can be treated as data decomposition:
  a = UUᵀa
    = (u1, u2, …, un)(u1, u2, …, un)ᵀa
    = (u1, u2, …, un)(⟨u1, a⟩, ⟨u2, a⟩, …, ⟨un, a⟩)ᵀ
    = (u1, u2, …, un)(b1, b2, …, bn)ᵀ
    = Σi bi·ui

Face Recognition – Eigenface
• Turk, M. A.; Pentland, A. P. Face Recognition Using Eigenfaces. CVPR 1991. (Citations: 2654)
• The eigenface approach:
  – images are points in a vector space
  – use PCA to reduce the dimensionality
  – face space
  – compare projections onto the face space to recognize faces
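To make the PCA steps above concrete, here is a minimal numpy sketch (not from the original slides): it centers a small synthetic data matrix whose rows are samples, forms C = AᵀA, eigendecomposes it, and projects the data onto the leading eigenvectors. The synthetic data and the choice k = 2 are illustrative assumptions.

```python
import numpy as np

# Toy data matrix A: n samples (rows) x m variables (columns), as on the PCA slides.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))

# Center each column so it has zero mean.
A_centered = A - A.mean(axis=0)

# Covariance-style matrix C = A^T A (m x m).
C = A_centered.T @ A_centered

# Eigendecomposition C = U Lambda U^T; eigh returns eigenvalues in ascending order,
# so reorder by decreasing variance.
eigvals, U = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Coordinates of the samples in the new basis (projection onto the principal directions).
scores = A_centered @ U

# Keeping only the first k columns of U gives a rank-k reconstruction of the data.
k = 2
A_k = scores[:, :k] @ U[:, :k].T
```

In practice one would usually compute the same quantities from an SVD of the centered data rather than forming C explicitly, but the eigendecomposition above mirrors the derivation on these slides.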
PageRank – Power Iteration
• Define the link matrix Q of the web graph (one row and one column per page):
  – column j has nonzero elements in the positions corresponding to the outlinks of page j (Nj in total)
  – row i has nonzero elements in the positions corresponding to the inlinks Ii of page i

Column-Stochastic & Irreducible
• Column-stochastic: the entries of Q are non-negative and every column sums to one,
  Qij = 1/Nj if page j links to page i, and 0 otherwise, so that Σi Qij = 1
• Irreducible: Q is modified (in practice by adding a small uniform "random jump" term) so that every page can be reached from every other page, which guarantees a unique dominant eigenvector

Iterative PageRank Calculation
• For k = 1, 2, …: r(k) = Qr(k−1)
• Equivalently, the limit r satisfies Qr = r, i.e. r is an eigenvector of Q with eigenvalue λ = 1 (Q is a Markov chain transition matrix)
• Why can we use the power iteration to find the first eigenvector?

Convergence of the Power Iteration
• Expand the initial approximation r0 in terms of the eigenvectors:
  r0 = c1v1 + c2v2 + … + cnvn
• Then Qᵏr0 = c1λ1ᵏv1 + c2λ2ᵏv2 + … + cnλnᵏvn
• Since λ1 = 1 > |λ2| ≥ … ≥ |λn|, the contributions of v2, …, vn die out and the iterates converge to the direction of the first eigenvector v1

SINGULAR VALUE DECOMPOSITION

SVD – Definition
• Any m×n matrix A, with m ≥ n, can be factorized as
  A = UΣVᵀ,
  where U (m×n) and V (n×n) have orthonormal columns and Σ = diag(σ1, σ2, …, σn) with σ1 ≥ σ2 ≥ … ≥ σn ≥ 0

Singular Values and Singular Vectors
• The diagonal elements σj of Σ are the singular values of the matrix A
• The columns of U and V are the left singular vectors and right singular vectors, respectively
• Equivalent form of the SVD: A = Σj σj uj vjᵀ

Matrix Approximation
• Theorem: let Uk = (u1 u2 … uk), Vk = (v1 v2 … vk) and Σk = diag(σ1, σ2, …, σk), and define Ak = UkΣkVkᵀ
• Then ‖A − Ak‖2 = min{‖A − B‖2 : rank(B) ≤ k} = σk+1
• That is, the best rank-k approximation of the matrix A is Ak

SVD and PCA
• We can write AᵀA = VΣ²Vᵀ
• Remember that in PCA we treat A as a row matrix (rows are samples)
• V holds the eigenvectors for A:
  – each column of V is an eigenvector of the row matrix A (i.e. of AᵀA)
  – we use V to approximate a row of A
• Equivalently, we can write AAᵀ = UΣ²Uᵀ
• U holds the eigenvectors for Aᵀ:
  – each column of U is an eigenvector of the column matrix A (i.e. of AAᵀ)
  – we use U to approximate a column of A

Example – LSI
• Build a term-by-document matrix A
• Compute the SVD of A: A = UΣVᵀ
• Approximate A by Ak = UkDk, where Dk = ΣkVkᵀ:
  – Uk: the orthogonal basis that we use to approximate all the documents
  – Dk: column j holds the coordinates of document j in the new basis
  – Dk is the projection of A onto the subspace spanned by Uk

SVD and PCA
• For symmetric A, the SVD is closely related to PCA
• PCA: A = UΛUᵀ
  – U and Λ are the eigenvectors and eigenvalues
• SVD: A = UΛVᵀ
  – U holds the left (column) eigenvectors
  – V holds the right (row) eigenvectors
  – Λ holds the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors
• Note the difference in what A is in PCA and SVD:
  – SVD: A is directly the data, e.g. a term-by-document matrix
  – PCA: A is a covariance matrix, A = XᵀX, where each row of X is a sample

Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing:
   – indexing: collecting terms
   – use a stop list: eliminate "meaningless" words
   – stemming
2. Construction of the term-by-document matrix; sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.

Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data
• E.g. car and automobile occur in similar documents, as do cows and sheep
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using the SVD

Similarity Measures
• Term to term:
  AAᵀ = UΣ²Uᵀ = (UΣ)(UΣ)ᵀ
  UΣ are the coordinates of A (rows) projected onto the space V
• Document to document:
  AᵀA = VΣ²Vᵀ = (VΣ)(VΣ)ᵀ
  VΣ are the coordinates of A (columns) projected onto the space U

Similarity Measures
• Term to document:
  A = UΣVᵀ = (UΣ^½)(VΣ^½)ᵀ
  UΣ^½ are the coordinates of A (rows) projected onto the space V
  VΣ^½ are the coordinates of A (columns) projected onto the space U
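As a small illustration of the rank-k approximation and the similarity measures above, here is a numpy sketch (not part of the original slides); the term-by-document matrix and the choice k = 2 are made-up assumptions.

```python
import numpy as np

# Made-up term-by-document matrix A: rows = terms, columns = documents.
A = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
    [1., 0., 2., 0.],
])

# Thin SVD: A = U diag(s) Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation A_k = U_k Sigma_k V_k^T (Eckart-Young).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# LSI document coordinates D_k = Sigma_k V_k^T: column j holds document j in the new basis.
D_k = np.diag(s[:k]) @ Vt[:k, :]

# Document-to-document similarity via cosines of the reduced coordinates,
# mirroring A^T A = (V Sigma)(V Sigma)^T.
norms = np.linalg.norm(D_k, axis=0)
doc_sim = (D_k.T @ D_k) / np.outer(norms, norms)
print(np.round(doc_sim, 2))
```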
HITS (Hyperlink Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information
  – hubs are comprehensive lists of links to authorities
• A page is a good authority if many hubs point to it
• A page is a good hub if it points to many authorities
• Good authorities are pointed to by good hubs, and good hubs point to good authorities
• (Diagram: hubs pointing to authorities.)

Power Iteration
• Each page i has both a hub score hi and an authority score ai
• HITS successively refines these scores by computing
  hi ← Σ{j : i→j} aj and ai ← Σ{j : j→i} hj
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise
• Now, in matrix form (with normalization after each step):
  h(k) = La(k−1), a(k) = Lᵀh(k−1)

HITS and SVD
• L: rows are outlinks, columns are inlinks
• a will be the dominant eigenvector of the authority matrix LᵀL
• h will be the dominant eigenvector of the hub matrix LLᵀ
• In fact, h and a are the first left and right singular vectors of L!
• We are in fact running an SVD on the adjacency matrix

HITS vs PageRank
• PageRank may be computed once; HITS is computed per query
• HITS takes the query into account; PageRank doesn't
• PageRank has no concept of hubs
• HITS is sensitive to the local topology: insertion or deletion of a small number of nodes may change the scores a lot
• PageRank is more stable, because of its random jump step

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition
• Given a non-negative matrix Vn×m, find non-negative matrix factors Wn×k and Hk×m such that
  Vn×m ≈ Wn×kHk×m
• V: column matrix, each column is a data sample (n-dimensional)
• W: each of the k columns represents one basis
• H: coordinates of V projected onto the space W
  vj ≈ Wn×khj

Motivation
• Non-negativity is natural in many applications…
• Probability is also non-negative
• An additive model captures local structure

Multiplicative Update Algorithm
• Cost function: Euclidean distance ‖V − WH‖²
• Multiplicative update (a numpy sketch of these updates is given at the end of this deck):
  Haμ ← Haμ (WᵀV)aμ / (WᵀWH)aμ
  Wia ← Wia (VHᵀ)ia / (WHHᵀ)ia

Multiplicative Update Algorithm
• Cost function: divergence
  D(A‖B) = Σij (Aij log(Aij/Bij) − Aij + Bij)
  – reduces to the Kullback–Leibler divergence when Σij Aij = Σij Bij = 1
  – A and B can then be regarded as normalized probability distributions
• Multiplicative update:
  Haμ ← Haμ (Σi Wia Viμ/(WH)iμ) / (Σi Wia)
  Wia ← Wia (Σμ Haμ Viμ/(WH)iμ) / (Σμ Haμ)
• PLSA is NMF with the KL divergence

NMF vs PCA
• n = 2429 faces, m = 19×19 pixels
• Positive values are illustrated with black pixels and negative values with red pixels
• NMF: a parts-based representation
• PCA: a holistic representation

Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

Major Reference
• Saara Hyvönen. Linear Algebra Methods for Data Mining. Spring 2007, University of Helsinki. (Highly recommended)
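As a closing illustration of the multiplicative update algorithm from the NMF section above, here is a minimal numpy sketch of the Euclidean update rules. The random data, the rank k = 5, the iteration count, and the small eps added for numerical safety are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def nmf_euclidean(V, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates (Lee & Seung style) for ||V - W H||_F^2 with V, W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps   # non-negative random initialization
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative data: columns are samples, as in the Definition slide.
V = np.abs(np.random.default_rng(1).normal(size=(20, 30)))
W, H = nmf_euclidean(V, k=5)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

Because each update multiplies the current factor by a ratio of non-negative terms, W and H can never become negative, which is the point of the multiplicative form.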