Randomized Algorithms for Low-Rank Matrix Decomposition
Ben Sapp, WPE II
May 6, 2011

Low-Rank Decomposition: The Goal
Given an m×n matrix A, find a rank-k factorization

    A ≈ B C,    where B is m×k and C is k×n, with k << min(m, n).

Advantages
• Requires mk + nk numbers to represent the matrix, instead of mn. I.e., compression.
• Fewer numbers = less storage space and faster matrix multiplication.
• In many applications, exposes the structure of the data.

Exposing structure: an example
Let's say your matrix is full of points in n-dimensional space which have some underlying linear structure. This data can be almost exactly described in 2 dimensions, i.e., by a rank-2 matrix: one factor is a description of the 2 axes in the ambient space, and the other holds the coefficients (embedding) of the points in that 2-dimensional subspace.

Formulation: The Fixed-Rank Problem

    minimize_X  ||A − X||_F    subject to  rank(X) ≤ k

• Non-convex constraint set.
• But a global optimum exists.
• And there is a known solution…

Classical Solution: The Fixed-Rank Problem
Singular Value Decomposition of A:  A = U Σ V^T.
Optimal solution: the truncated SVD  A_k = U_k Σ_k V_k^T  (proof to come later).

Truncated SVD Properties
• Properties (via power method):
  – O(mnk) running time: ok, but…
  – O(k) passes through the data
  – Expects cheap random access
  – Iterative
• Issues:
  – Datasets are huge! (Netflix dataset: 2 GB, FERET Face DB: 5 GB, Wiki English: 14 GB, Watson KB: 1 TB)
  – Data access is expensive
  – Architecture is parallel and decentralized
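For reference, the classical baseline is a single call to MATLAB's iterative sparse SVD. A minimal sketch, assuming A is already in memory and k is the target rank (variable names are illustrative):

    k = 25;                       % target rank
    [U, S, V] = svds(A, k);       % top-k singular triplets (iterative, O(k) passes)
    Ak  = U*S*V';                 % best rank-k approximation
    err = norm(A - Ak, 'fro');    % optimal Frobenius error

This is exactly the svds(A, 25) baseline used in the eigenfaces experiments at the end of the talk; the point of everything that follows is to avoid paying its cost on huge matrices.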
Low-Rank Algorithm Desiderata
• 1-2 passes over the matrix as a pre-processing / examination step.
• Remainder of the work sub-O(mn), and it should depend on the desired rank k rather than the ambient dimensions m×n.
• Trade off accuracy with computation time.
• Decentralized, simple, and numerically stable.

Randomization to the Rescue!
Randomized meta-algorithm:
1. Given A (m×n), randomly obtain Y (m×s or s×s) in 1 pass.
2. Compute an exact SVD of Y in O(ms^2) or O(s^3).
3. Use Y's decomposition to approximate A's decomposition.
This template meets every desideratum above: 1-2 passes in pre-processing, remaining work sub-O(mn) and dependent on the underlying rank rather than the ambient dimensions, a tunable accuracy-vs-computation tradeoff, and a decentralized, simple, numerically stable procedure.

Outline
• Introduction
• Linear algebra preliminaries and intuition
• The algorithms
  – SampleCols
  – AdaSample
  – RandProj
• A comparison

Singular Value Decomposition
Any real rectangular matrix A can be factored into the form  A = U Σ V^T.
Tall & skinny (m > n): U is m×n, Σ is n×n, V is n×n.
Short & fat (m < n): U is m×m, Σ is m×m, V is n×m.

SVD Properties
U and V have mutually orthonormal columns; these columns are called the left and right singular vectors:
    U = [u_1 u_2 …]  (left singular vectors),    V = [v_1 v_2 …]  (right singular vectors).
Σ contains the singular values on the diagonal, in order
    σ_1 ≥ σ_2 ≥ … ≥ σ_min(m,n) ≥ 0,
which correspond to the left and right singular vectors:
    A v_i = σ_i u_i,    A^T u_i = σ_i v_i.
For a vector x in R^n, the geometric interpretation of A x = U Σ V^T x:
1. Rotate x by the rotation matrix V^T.
2. Scale the result along the coordinate axes by Σ.
3. Either discard n−m dimensions (if m < n), or zero-pad with m−n extra dimensions (if m > n), to map R^n to R^m.
4. Rotate the result by U.

Fundamental Subspaces
Define range(A) as the set of all vectors which A maps to:
    range(A) = { y : y = A x for some x }.
Synonymous: range(A) is the linear subspace spanned by the columns of A.
If A has rank k, then {u_1, …, u_k} is an orthonormal basis for range(A).

Frobenius norm
• Equivalent definitions:
    ||X||_F^2 = Σ_{i,j} X_{ij}^2 = tr(X^T X) = Σ_i σ_i^2;    in MATLAB, sum(X(:).^2) computes ||X||_F^2.
• Is unitarily invariant:
    ||U X V||_F^2 = tr(V^T X^T U^T U X V) = tr(X^T X V V^T) = tr(X^T X) = ||X||_F^2,
  since tr(XY) = tr(YX).

The Optimal Low-Rank Solution
Theorem (Eckart-Young, 1936). Let A = U Σ V^T. Then ||A − X||_F subject to rank(X) ≤ k is minimized by
    A_k = U_k Σ_k V_k^T = Σ_{i=1}^k σ_i u_i v_i^T,
with optimal value
    ||A − A_k||_F^2 = Σ_{i>k} σ_i^2.

Proof (sketch). First, the Frobenius norm is unitarily invariant:
    ||A − A'||_F = ||U^T (A − A') V||_F = ||Σ − U^T A' V||_F.
Since Σ is diagonal, U^T A' V should be diagonal too; to make the right term diagonal, we can construct A' of the form A' = U D V^T with D diagonal. At this point, we have
    ||A − A'||_F^2 = Σ_i (σ_i − d_i)^2.
Since rank(A') is at most k, at most k of the d_i can be non-zero, so the best choice is d_i = σ_i for the k largest singular values and d_i = 0 otherwise. Conclude
    ||A − A'||_F^2 ≥ Σ_{i>k} σ_i^2.
In summary: A_k = Σ_{i=1}^k σ_i u_i v_i^T achieves the optimum.

Orthogonal Projections
• Projection of a vector x onto a unit vector q:  (q^T x) q.
• Projection of a matrix A onto an orthonormal basis Q:  Q Q^T A.

Randomized Meta-algorithm
Input: matrix A (m×n), target rank k, number of samples s.
Output: Ã_k, which approximately solves  min_X ||A − X||_F  s.t. rank(X) ≤ k.
1. Form a lower-dimensional Y by sampling s rows and/or columns from A, or by applying s random projections. Y is m×s or s×s.
2. Compute the top k left singular vectors of Y to form the best rank-k basis Q for range(Y).
3. Project A onto the subspace spanned by Q:  Ã_k = Q Q^T A.

The Main Idea
Meta-algorithm: (1) form Y from A randomly; (2) get the optimal rank-k basis Q for the span of Y via SVD; (3) project A onto Q.
Bounds: how far is Q Q^T A from the optimal A_k?
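In MATLAB the whole template is three lines. A minimal sketch, where formY is a stand-in for whichever randomized step a particular algorithm uses (not a real function):

    Y  = formY(A, s);            % pass 1: sampled columns/rows or random projections
    [Q, ~, ~] = svds(Y, k);      % top-k left singular vectors of the small matrix Y
    Ak = Q * (Q' * A);           % pass 2: project A onto span(Q)

The three algorithms below differ only in how they form Y, and therefore in how close range(Y) comes to the top-k left singular subspace of A.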
Comparing the algorithms

    Method              Running time   # Passes   Error w.h.p.
    SampleCols          ?              ?          ?
    SampleRowsCols      ?              ?          ?
    AdaSample           ?              ?          ?
    RandProj            ?              ?          ?
    Exact partial SVD   O(mnk)         O(k)       optimal

Sampling Rows & Columns
• Simple idea: too much data to deal with? Subsample!
• But sample proportional to squared magnitude: columns with large norm are more informative, columns with small norm are not so useful.

First pass: SampleCols [Frieze, Kannan & Vempala, 1998]
Input: matrix A (m×n), target rank k, number of samples s.
Output: Ã_k, which approximately solves  min_X ||A − X||_F  s.t. rank(X) ≤ k.
1. Sample s columns from A, A(:,i_1), …, A(:,i_s), proportional to their squared magnitude:
       p_i = ||A(:,i)||^2 / ||A||_F^2.
2. Form  Y = [ A(:,i_1)/√(s p_{i_1})  …  A(:,i_s)/√(s p_{i_s}) ]  (m×s).
3. Compute Q = [q_1 … q_k], the top k left singular vectors of Y.
4. Project A onto the subspace spanned by Q:  Ã_k = Q Q^T A.

Running time: SampleCols
Computing the column norms and sampling: O(mn). Forming Y: O(ms). SVD of Y: O(ms^2).
Total: O(mn + ms^2).
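A minimal MATLAB sketch of these steps (sample_from_weights is the same weighted-sampling helper used in the experiments code at the end of the talk; k and s are as above):

    p   = sum(A.^2, 1);  p = p / sum(p);                % squared-magnitude column probabilities
    idx = sample_from_weights(p, s);                    % draw s column indices
    Y   = bsxfun(@times, A(:,idx), 1./sqrt(s*p(idx)));  % rescaled m-by-s sample
    [Q, ~, ~] = svds(Y, k);                             % top-k left singular vectors of Y
    Ak  = Q * (Q' * A);                                 % rank-k approximation of A

The rescaling by 1/√(s p_i) is what makes Y Y^T an unbiased estimate of A A^T, which is the key fact in the analysis that follows.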
Analysis: SampleCols
• How different is A from Y on average?
• Let's start by analyzing a randomized matrix-vector multiplication algorithm.

Exact:  A b = Σ_i A(:,i) b_i  (a sum of n columns).
Randomized, one sample: pick index i with probability p_i and let the random variable X take the value A(:,i) b_i / p_i. Then
    E[X] = Σ_i p_i · A(:,i) b_i / p_i = A b.
How do we easily bound this variance? We have
    E||X||^2 = Σ_i ||A(:,i)||^2 b_i^2 / p_i,
and setting  p_i = ||A(:,i)||^2 / ||A||_F^2  gives
    E||X − A b||^2 ≤ E||X||^2 = ||A||_F^2 ||b||^2.
With s samples (averaging s independent copies of X), the variance gets s times better:
    E|| (1/s) Σ_{t=1}^s X_t − A b ||^2 ≤ ||A||_F^2 ||b||^2 / s.

Let's extend the idea to matrix-matrix multiplication, randomly choosing column i from A and the corresponding row i from a matrix B. Define the random variable
    Z = A(:,i) B(i,:) / p_i,    with i chosen with probability p_i.
Then E[Z] = A B, and with p_i ∝ ||A(:,i)||^2 and s averaged samples,
    E|| (1/s) Σ_{t=1}^s Z_t − A B ||_F^2 ≤ ||A||_F^2 ||B||_F^2 / s.

Now set B = A^T. If Y is the rescaled column sample formed by SampleCols, then Y Y^T is exactly this s-sample estimator of A A^T, and plugging in we have the following lemma.

Lemma (Distortion from sampling). When Y is a sampled version of A as in SampleCols, then
    E||A A^T − Y Y^T||_F^2 ≤ ||A||_F^4 / s,    and so    E||A A^T − Y Y^T||_F ≤ ||A||_F^2 / √s.

We need one more lemma, which quantifies the distortion incurred when projecting one matrix onto another's range.

Lemma (Distortion from projection). Given any two matrices A and B, let Q be a top-k basis of range(B), i.e., the top k left singular vectors of B. Then
    ||A − Q Q^T A||_F^2 ≤ ||A − A_k||_F^2 + 2√k · ||A A^T − B B^T||_F.
In words: projecting A onto the best k-basis of B, instead of onto A's own best k-basis, costs at most the additive term on the right.

Taking the expectation of the second lemma with B = Y, over the sampled columns, and applying the first lemma, we obtain a bound for SampleCols.

Theorem (SampleCols average Frobenius error). SampleCols finds a rank-k matrix Ã_k such that
    E||A − Ã_k||_F^2 ≤ ||A − A_k||_F^2 + 2√(k/s) · ||A||_F^2.

Updating the comparison table: SampleCols runs in O(mn + ms^2) time with 2 passes and the additive error above.

One step further: SampleRowsCols [Frieze, Kannan & Vempala, 2004]
Input: matrix A (m×n), target rank k, number of samples s.
Output: Ã_k, which approximately solves  min_X ||A − X||_F  s.t. rank(X) ≤ k.
1. Sample s columns from A, A(:,i_1), …, A(:,i_s), proportional to their squared magnitude; scale and form Y (m×s).
2. Sample s rows from Y, Y(i_1,:), …, Y(i_s,:), proportional to their squared magnitude; scale and form W (s×s).
3. Compute V = [v_1 … v_s], the top s right singular vectors of W.
4. Compute Q = [q_1 … q_k], where q_i = Y v_i / ||Y v_i||.
5. Project A onto the subspace spanned by Q:  Ã_k = Q Q^T A.

Running time: SampleRowsCols
Column sampling and forming Y: O(mn). Row sampling and forming W: O(ms). SVD of W: O(s^3). Forming Q = Y V and normalizing: O(ms^2).
Total running time: O(mn + s^3).

Analysis: SampleRowsCols
The column-sampling and projection steps are exactly those of SampleCols, and computing a basis for the rows of W is exact. The new sources of error are the row-sampling step and the conversion from a row basis of W to a column basis of Y.

Lemma (Good left projections from good right projections). Let W, Y, and q_i = Y v_i / ||Y v_i|| be as in SampleRowsCols. Then the column basis Q built from the top right singular vectors of W is nearly as good for Y as Y's own top left singular vectors, up to an additive error of the same √(k/s)·||A||_F^2 order as the sampling distortion.

It turns out this additive error dominates the errors incurred from the other steps of the algorithm, so we can bound the algorithm as follows.

Theorem (SampleRowsCols average Frobenius error). SampleRowsCols finds a rank-k matrix Ã_k such that
    E||A − Ã_k||_F^2 ≤ ||A − A_k||_F^2 + O(√(k/s)) · ||A||_F^2.

Updating the comparison table: SampleRowsCols runs in O(mn + s^3) time with 2 passes and an additive error of the same order as SampleCols.

SampleCols is easy to break
• The error is additive, and we have no control over ||A||_F^2.
• Consider a data matrix whose columns are mostly concentrated in one tight cluster, plus a few important outlying points. This data has a near-perfect rank-2 decomposition, but SampleCols will almost surely miss the outliers!

Improvement: AdaSample
• Sample some rows and form a basis.
• The next round of sampling should be proportional to the residual part of A not captured by the current sample.

AdaSample [Deshpande, Rademacher, Vempala & Wang, 2006]
Input: matrix A (m×n), target rank k, number of samples s.
Output: Ã_k, which approximately solves  min_X ||A − X||_F  s.t. rank(X) ≤ k.
1. Start with an empty sample set S = { } and residual E := A.
2. For t = 1 to T:
   a. Pick a subset S_t of s rows of A, with row i chosen with probability proportional to ||E(i,:)||^2.
   b. Update S := S ∪ S_t.
   c. Update E := A − π_{span(S)}(A).
3. Return Ã_k = π_{span(S),k}(A), the best rank-k approximation of A whose rows lie in span(S).

Analysis: AdaSample
Let's look at one iteration of AdaSample.
Lemma (AdaSample, one round). Let A be m×n and L ⊆ R^n a linear subspace, with residual E = A − π_L(A). Let S be a collection of s rows of A sampled with probability proportional to ||E(i,:)||^2. Then
    E_S [ ||A − π_{L + span(S), k}(A)||_F^2 ] ≤ ||A − A_k||_F^2 + (k/s) ||E||_F^2,
where π_{L,k}(A) is the optimal rank-k projection of A onto the subspace L. The proof is similar to the derivation of the bound for SampleCols.

Applying this lemma T times:
Theorem (AdaSample average Frobenius error). After T > 1 iterations with s = k/ε rows per round, AdaSample obtains a sample S such that
    E [ ||A − π_{span(S),k}(A)||_F^2 ] ≤ (1/(1−ε)) ||A − A_k||_F^2 + ε^T ||A||_F^2.
The uncontrolled additive term now decays exponentially with the number of rounds.

Running time: AdaSample
At iteration t, we need to:
1. Extend the orthonormal basis for S to an orthonormal basis for S ∪ S_t. Orthogonalizing s new vectors against the st existing orthonormal n×1 vectors takes O(n t s^2).
2. Project A onto the new portion of the basis spanned by S_t, which takes O(mns).
So the iterative process takes O(mnsT + n s^2 T^2). Finally, we need to compute singular vectors to obtain our final rank-k basis, taking O(m s^2 T^2).
Total running time: O(mnsT + s^2 T^2 (m+n)).

Updating the comparison table: AdaSample runs in O(mnsT + s^2 T^2 (m+n)) time with 2T passes and the error bound above.

Recap so far
• The previous algorithms attempted to capture range(A) by (intelligently) sampling rows and/or columns.
• What if, instead, we probed A with a variety of vectors to get a feel for range(A)?

RandProj: geometric intuition
[Figure: random test vectors are pushed through A; their images A·ω land in range(A), and a handful of them already span its high-energy directions.]

RandProj [Halko, Martinsson & Tropp, 2010]
Input: matrix A (m×n), target rank k, number of samples s.
Output: Ã_k, which approximately solves  min_X ||A − X||_F  s.t. rank(X) ≤ k.
1. Draw a random test matrix Ω (n×s) with i.i.d. Gaussian entries.
2. Form Y = A Ω.
3. Compute an orthonormal basis Q for the range of Y via SVD.
4. Return Ã_k = Q Q^T A.

Running time: RandProj
Forming Y = A Ω: O(mns). Computing the basis Q: O(ms^2).
Total running time: O(mns + ms^2).

Analysis: RandProj
Lemma (Random projection error bound). Partition the SVD of A so that Σ_1 holds the top k singular values (with right singular vectors V_1) and Σ_2 holds the rest (with right singular vectors V_2). Let Ω_1 = V_1^T Ω and Ω_2 = V_2^T Ω. Then
    ||A − Q Q^T A||_F^2 ≤ ||Σ_2||_F^2 + ||Σ_2 Ω_2 Ω_1^†||_F^2.
The first term is the optimal error; the second is an extra cost proportional to the tail singular values: "wasted sampling."

Taking expectations with respect to the Gaussian test matrix Ω:
Theorem (RandProj average Frobenius error). RandProj finds a rank-k matrix Ã_k such that
    E||A − Ã_k||_F ≤ (1 + k/(s − k − 1))^(1/2) · ( Σ_{j>k} σ_j^2 )^(1/2),
i.e., within a small factor of the optimal error whenever s exceeds k by a few samples.

Updating the comparison table: RandProj runs in O(mns + ms^2) time with 2 passes and the near-optimal error bound above.

RandProj refinement
• Can combine RandProj with power iterations: sample the range of B = (A A^T)^q A instead of A, whose singular values are σ_j^(2q+1).
  – This drives the noise from the tail n−k singular values down exponentially fast in q.
  – But we pay in running time and number of passes: O((q+1)mns + ms^2) and 2q passes through the data.
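The randProj code in the Experiments section below implements exactly this inner loop. As a sketch of the refinement in isolation (s, q assumed set; the intermediate re-orthonormalization is a common stabilization step, an addition here and not from the slides):

    n = size(A, 2);
    Omega = randn(n, s);
    Y = A * Omega;
    for i = 1:q
        [Y, ~] = qr(Y, 0);      % optional: re-orthonormalize to limit round-off wash-out
        Y = A * (A' * Y);       % one power iteration: two more passes over A
    end
    [Q, ~] = qr(Y, 0);
    Ak = Q * (Q' * A);          % rank-s approximation; truncate to k via an SVD of Q'*A

Each power iteration multiplies the relative weight of the tail singular values by roughly (σ_j/σ_k)^2, which is why the "wasted sampling" term decays exponentially in q.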
Comparing the algorithms (final)

    Method              Running time               # Passes   Error w.h.p.  (opt = ||A − A_k||_F)
    SampleCols          O(mn + ms^2)               2          opt^2 + 2√(k/s)·||A||_F^2
    SampleRowsCols      O(mn + s^3)                2          opt^2 + O(√(k/s))·||A||_F^2
    AdaSample           O(mnsT + s^2 T^2 (m+n))    2T         (1/(1−ε))·opt^2 + ε^T·||A||_F^2
    RandProj            O(mns + ms^2)              2          ≤ (1 + k/(s−k−1))^(1/2) · opt
    RandProj + power    O((q+1)mns + ms^2)         2q         tail term shrinks exponentially in q
    Exact partial SVD   O(mnk)                     O(k)       optimal

Which one is best?
Fix k and s, and assume the bounds are tight. [Schematic plot: error vs. time.] SampleRowsCols (O(mn + s^3)) and SampleCols (O(mn + ms^2)) are the cheapest but keep the additive ||A||_F^2 error; RandProj (O((q+1)mns + ms^2)) and AdaSample (O(mnsT + s^2 T^2 (m+n))) cost more, and approach the optimal error as the number of power iterations q or rounds of adaptive sampling T grows.

Caveats
• Bounds are not tight.
• The relative scaling of the time-vs-error axes depends on m, n, s, k, and the spectrum of A.
• Which is best in practice?

Experiments
MATLAB implementations used in the experiments:

function [U,S,V] = sampleColsRows(A,s)
%subsample columns of A, then rows of the result
Y = subsample(A,s);
W = subsample(Y',s)';
%SVD of W'*W
[Uw,S,Vw] = svd(W'*W);
S = sqrt(S);
%convert from the row basis of W to a column basis for Y, and orthonormalize
U = Y*Vw;
[U,R] = qr(U,0);
V = (U'*A)';
V = bsxfun(@times,V,1./sqrt(sum(V.^2)));

function Xsub = subsample(X,s)
%sample s columns proportional to their magnitude, then rescale
p_col = sqrt(sum(X.^2));
p_col = p_col / sum(p_col);
colidx = sample_from_weights(p_col,s);   % helper: draws s indices from the given weights (not shown)
Xsub = X(:,colidx);
Xsub = bsxfun(@times,Xsub,...
    1./sqrt(p_col(colidx)))/sqrt(s);

function [U,S,V] = adaSample(A,s,T)
S = []; E = A;
for t=1:T
    %sample columns proportional to the residual E
    p = sqrt(sum(E.^2,1));
    p = p / sum(p);
    idx = sample_from_weights(p,s);
    St = A(:,idx);
    S = [S St];
    %orthogonalize & project
    [Q,R] = qr(S,0);
    proj_A_on_Qt = Q*(Q'*A);
    E = A - proj_A_on_Qt;
end
%wrap up (S is reused for the singular values on output)
B = Q'*A;
[U,S,V] = svds(B,s*T);
U = Q*U;

function [U,S,V] = randProj(A,s,q)
Omega = randn(size(A,2),s);
Y = A*Omega;
for i=1:q
    %power iteration: Y = (A*A')^q * A*Omega
    Y = A'*Y;
    Y = A*Y;
end
[Q,R] = qr(Y,0);
B = Q'*A;
[U0,S,V] = svds(B,s);
U = Q*U0;

Eigenfaces from Labeled Faces in the Wild
• 13,233 images, each 96×96 pixels, collected online in 2007.
• A is a 13233×9216 matrix, 975.6 MB in double precision.

Eigenfaces
A (examples × pixels) factors into per-image "loadings" times a k-dimensional face basis. We want the top k = 25 eigenfaces.

[Plots: approximation error vs. time (seconds, then log scale) for SampleRowsCols, RandProj, and AdaSample. The exact baseline, MATLAB's svds(A,25), takes ~4.5 minutes; RandProj with power iterations (q = 1, 2, 3) reaches comparable error in as little as 4.6 seconds.]

Summary
• Exact eigenfaces in 4+ minutes.
• Approximate eigenfaces in 4 seconds (75 random projections + 1 power iteration).

Conclusion
• Classical truncated SVD: ill-suited for large datasets.
• Randomized algorithms:
  – allow an error-vs-computation tradeoff
  – require only a few passes through the data
  – are simple and robust

Thanks!

Appendix
LFW details: for X = 13233×9216 (96×96 images) and k = 25, svds takes ~8 GB of memory, uses up to all 8 cores, and runs in 263 seconds. The MATLAB code is the same as listed in the Experiments section above.
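For completeness, a hypothetical driver showing how the functions above might be called to reproduce the comparison; apart from k = 25 and s = 75 (from the summary slide), the error computation and variable names are illustrative and not from the slides:

    % assumes the LFW matrix A (13233 x 9216) is loaded and the functions above are on the path
    k = 25;  s = 75;
    [Ue, Se, Ve] = svds(A, k);                       % exact baseline (~4.5 minutes)
    opt = norm(A - Ue*Se*Ve', 'fro');
    [U, S, V] = randProj(A, s, 1);                   % 75 random projections + 1 power iteration
    approx = norm(A - U(:,1:k)*S(1:k,1:k)*V(:,1:k)', 'fro');
    fprintf('relative excess error: %.3f\n', approx/opt - 1);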