Matrix Completion
IT530 Lecture Notes

Matrix Completion in Practice: Scenario 1
• Consider a survey of M people where each is asked Q questions.
• It may not be possible to ask each person all Q questions.
• Consider a matrix of size M by Q (each row is the set of questions asked to any given person).
• This matrix is only partially filled (many missing entries).
• Is it possible to infer the full matrix given just the recorded entries?

Matrix Completion in Practice: Scenario 2
• Some online shopping and streaming sites such as Amazon, Flipkart, Ebay and Netflix have recommender systems.
• These websites collect product ratings from users (especially Netflix).
• Based on these ratings, the websites try to recommend other products/movies that the user will like with high probability.
• Consider a matrix with the number of rows equal to the number of users, and the number of columns equal to the number of movies/products.
• This matrix will be HIGHLY incomplete (no user has the patience to rate too many movies!) – maybe only 5% of the entries will be filled in.
• Can the recommender system infer user preferences from just the defined entries?

Matrix Completion in Practice: Scenario 2
• Read about the Netflix Prize, a competition to design a better recommender system: http://en.wikipedia.org/wiki/Netflix_Prize

Matrix Completion in Practice: Scenario 3
• Consider an image or a video with several pixel values missing.
• This is not uncommon in range imagery or remote sensing applications!
• Consider a matrix each of whose columns is a (vectorized) patch of M pixels. Let the number of columns be K.
• This M by K matrix will have many missing entries.
• Is it possible to infer the complete matrix given just the defined pixel values?
• If the answer were yes, note the implications for image compression!

Matrix Completion in Practice: Scenario 4
• Consider a long video sequence of F frames.
• Suppose I mark out M salient (interesting) points {P_i}, 1 ≤ i ≤ M, in the first frame.
• And try to track those points in all subsequent frames.
• Consider a matrix F of size M x 2F, where row j contains the X and Y coordinates of the points on the motion trajectory of the initial point P_j.
• Unfortunately, many salient points may not be trackable due to occlusion or errors in the tracking algorithm.
• So F is highly incomplete.
• Is it possible to infer the true matrix from only the available measurements?

A property of these matrices
• Scenario 1: Many people will tend to give very similar or identical answers to many survey questions.
• Scenario 2: Many people will have similar preferences for movies (only a few factors affect user choices).
• Scenario 3: Non-local self-similarity!
• This makes the matrices in all these scenarios low in rank!

A property of these matrices
• Scenario 4: The true matrix underlying F has been PROVED to be of low rank – in fact, rank 3 under orthographic projection (ref: Tomasi and Kanade, "Shape and Motion from Image Streams under Orthography: a Factorization Method", IJCV 1992), and of rank up to 9 under a few other, more complex camera models (ref: Irani, "Multi-frame correspondence estimation using subspace constraints", IJCV 2002).
• In the rank-3 case, the transpose of F can be expressed as a product of two matrices: a rotation matrix of size 2F x 3 and a shape matrix of size 3 x M.
• F is useful for many computer vision problems such as structure from motion, motion segmentation and multi-frame point correspondences.

(Many) low-rank matrices are cool!
• The answer to the four questions/scenarios is a big NO in the general case.
• But it is a big YES if we assume that the underlying matrix has low rank (which, as we have seen, is indeed the case in all four scenarios) and obeys a few more constraints. Ref: Candes and Recht, "Exact Matrix Completion via Convex Optimization", 2008.

Theorem 1 (Informal Statement)
• Consider an unknown matrix F of size n1 by n2 having rank r < min(n1, n2).
• Suppose we observe only a fraction of the entries of F in the form of a matrix G, where G(i,j) = F(i,j) for all (i,j) belonging to some uniformly randomly sampled set Ω, and G(i,j) is undefined elsewhere.
• If (1) F has row and column spaces that are "sufficiently incoherent" with the canonical basis (i.e. the identity matrix), (2) r is "sufficiently small", and (3) Ω is "sufficiently large", then we can accurately recover F from G by solving the following rank minimization problem:

    F* = argmin over F̂ of rank(F̂), subject to F̂(i,j) = G(i,j) for all (i,j) ∈ Ω

Cool theorem, but …
• The aforementioned optimization problem is NP-hard (in fact, it is known to have double exponential complexity!).

Theorem 2 (Informal Statement)
• Consider an unknown matrix F of size n1 by n2 having rank r < min(n1, n2).
• Suppose we observe only a fraction of the entries of F in the form of a matrix G, where G(i,j) = F(i,j) for all (i,j) belonging to some uniformly randomly sampled set Ω, and G(i,j) is undefined elsewhere.
• If (1) F has row and column spaces that are "sufficiently incoherent" with the canonical basis (i.e. the identity matrix), (2) r is "sufficiently small", and (3) Ω is "sufficiently large", then we can accurately recover F from G by solving the following "trace-norm" minimization problem:

    F* = argmin over F̂ of ||F̂||_*, subject to F̂(i,j) = G(i,j) for all (i,j) ∈ Ω

What is the trace-norm of a matrix?
• The trace-norm of a matrix is the sum of its singular values.
• It is also called the nuclear norm (written ||·||_*).
• It is a softened version of the rank of a matrix, just as the L1-norm of a vector is a softened version of its L0-norm.
• Minimization of the trace-norm (even under the given constraints) is a convex optimization problem and can be solved efficiently (no local-minima issues).
• This is similar to L1-norm optimization (in compressive sensing) being efficiently solvable.
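The trace-norm just defined is easy to compute numerically. The following sketch (assuming NumPy is available; the matrix here is purely illustrative) checks that the nuclear norm is the sum of the singular values, and that for a low-rank matrix only a few of those singular values are nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a rank-2 matrix of size 8 x 6 as a product of thin factors.
A = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))

s = np.linalg.svd(A, compute_uv=False)  # singular values, in descending order
trace_norm = s.sum()                    # trace-norm = sum of singular values

# Rank counts the nonzero singular values; the trace-norm "softens" this
# count, just as the L1-norm softens the L0-norm of a vector.
rank = np.linalg.matrix_rank(A)

print(rank)  # 2: only two singular values are numerically nonzero
```

Note that all singular values beyond the second are zero up to floating-point error, so the rank (a non-convex count) and the trace-norm (a convex sum) both "see" the low-rank structure.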
More about trace-norm minimization
• The efficient trace-norm minimization procedure provably gives the EXACT SAME result as the NP-hard rank minimization problem (under the same constraints and the same conditions on the unknown matrix F and the sampling set Ω).
• This is analogous to L1-norm optimization yielding the same result as L0-norm optimization (under the same set of constraints and conditions).
• Henceforth we will concentrate only on Theorem 2 (and beyond).

The devil is in the details
• Beware: not all low-rank matrices can be recovered from partial measurements!
• Example: consider a matrix containing zeroes everywhere except a single entry in the top-right corner.
• This matrix is low rank (rank 1), but it cannot be recovered from knowledge of only a fraction of its entries (unless the sampled set happens to contain that corner)!
• Many other such examples exist.
• In reality, Theorems 1 and 2 work for low-rank matrices whose singular vectors are sufficiently spread out, i.e. sufficiently incoherent with the canonical basis (i.e. with the identity matrix).

Coherence of a basis
• The coherence of a subspace U of R^n of dimension r with respect to the canonical basis {e_i} is defined as:

    μ(U) = (n/r) max over 1 ≤ i ≤ n of ||P_U e_i||²

where P_U denotes orthogonal projection onto U.

Formal definition of key assumptions
• Consider an underlying matrix M of size n1 by n2. Let the SVD of M be given as follows:

    M = Σ_{k=1}^{r} σ_k u_k v_k^T

• We make the following assumptions about M:
1. (A0) There exists μ0 > 0 such that max(μ(U), μ(V)) ≤ μ0, where U and V are the column and row spaces of M.
2. (A1) The maximum absolute entry of the n1 by n2 matrix Σ_{k=1}^{r} u_k v_k^T is upper bounded by μ1 √(r/(n1 n2)), for some μ1 > 0.

What do these assumptions mean (in English)?
• (A0) means that the singular vectors of the matrix are sufficiently incoherent with the canonical basis.
• (A1) means that the singular vectors of the matrix are not spiky (e.g. canonical basis vectors are spiky signals – the spike has magnitude 1 and the rest of the signal is 0; a vector of n elements with all values equal to 1/√n is not spiky).
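The coherence definition above can be computed directly: if the columns of a matrix U form an orthonormal basis of the subspace, then ||P_U e_i||² is just the squared norm of the i-th row of U. A small sketch (NumPy assumed; the two subspaces are the extreme examples mentioned above) contrasting a spiky subspace with a flat one:

```python
import numpy as np

def coherence(U):
    """mu(U) = (n/r) * max_i ||P_U e_i||^2, where the columns of U form an
    orthonormal basis of the subspace. ||P_U e_i||^2 equals the squared
    Euclidean norm of the i-th row of U."""
    n, r = U.shape
    row_norms_sq = np.sum(U**2, axis=1)  # ||P_U e_i||^2 for each i
    return (n / r) * row_norms_sq.max()

n = 16
# Spiky subspace: spanned by a canonical basis vector -> maximal coherence n.
spiky = np.zeros((n, 1)); spiky[0, 0] = 1.0
# Flat subspace: spanned by the constant vector 1/sqrt(n) -> minimal coherence 1.
flat = np.full((n, 1), 1.0 / np.sqrt(n))

print(coherence(spiky))  # 16.0  (worst case: mu = n)
print(coherence(flat))   # 1.0   (best case: mu = 1)
```

For any r-dimensional subspace, μ(U) always lies between 1 and n/r, which is why the canonical basis is the worst case and "spread-out" singular vectors are the best case for matrix completion.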
Theorem 2 (Formal Statement)
• Suppose M of size n1 by n2 satisfies (A0) and (A1), let n = max(n1, n2), and suppose we observe m entries of M at locations sampled uniformly at random. If, for some β > 2,

    m ≥ C max(μ1², μ0^{1/2} μ1, μ0 n^{1/4}) n r (β log n),

then the trace-norm minimizer (in the informal statement of Theorem 2) is unique and equal to M with probability at least 1 − c n^{−β} (C and c are numerical constants).

Comments on Theorem 2
• Theorem 2 states that more entries of M must be known (their number is denoted by m) for accurate reconstruction if (1) M has larger rank r, (2) μ0 in (A0) is larger, or (3) μ1 in (A1) is larger.
• Example: if μ0 = O(1) and the rank r is small, the reconstruction is accurate with high probability provided m ≥ C n^{1.2} r log n.

Comments on Theorem 2
• It turns out that if the singular vectors of the matrix M have bounded entries, then condition (A1) almost always holds with μ1 = O(log n).

Matrix Completion under noise
• Consider an unknown matrix F of size n1 by n2 having rank r < min(n1, n2).
• Suppose we observe only a fraction of the entries of F in the form of a matrix G, where G(i,j) = F(i,j) + Z(i,j) for all (i,j) belonging to some set Ω, and G(i,j) is undefined elsewhere.
• Here Z refers to a white noise process which obeys the constraint:

    Σ_{(i,j) ∈ Ω} Z(i,j)² ≤ δ²

Matrix Completion under noise
• In such cases, the unknown matrix F can be recovered by solving the following minimization problem (a semidefinite program):

    F* = argmin over F̂ of ||F̂||_*, subject to Σ_{(i,j) ∈ Ω} (F̂(i,j) − G(i,j))² ≤ δ²

Theorem 3 (informal statement)
• The reconstruction result from the previous procedure is accurate, with an error bound given by:

    ||F* − F||_F ≤ 4 √( (2 + p) min(n1, n2) / p ) δ + 2δ,

where m = |Ω| and p = m / (n1 n2) is the fraction of known entries.

A Minimization Algorithm
• Consider the minimization problem:

    F* = argmin over F̂ of ||F̂||_*, subject to Σ_{(i,j) ∈ Ω} (F̂(i,j) − G(i,j))² ≤ δ²

• There are many techniques to solve this problem (http://perception.csl.illinois.edu/matrixrank/sample_code.html).
• Out of these, we will study one method called "singular value thresholding". Ref: Cai et al, "A Singular Value Thresholding Algorithm for Matrix Completion", SIAM Journal on Optimization, 2010.
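To get a feel for the Theorem 3 error bound as stated above, the following sketch evaluates its right-hand side (the function name and parameter choices are illustrative, not from the notes) and confirms that the bound tightens as the fraction of observed entries grows:

```python
import math

def theorem3_error_bound(n1, n2, m, delta):
    """Right-hand side of the Theorem 3 bound as stated in these notes:
    ||F* - F||_F <= 4 * sqrt((2 + p) * min(n1, n2) / p) * delta + 2 * delta,
    where p = m / (n1 * n2) is the fraction of observed entries."""
    p = m / (n1 * n2)
    return 4.0 * math.sqrt((2.0 + p) * min(n1, n2) / p) * delta + 2.0 * delta

# The bound degrades as fewer entries are observed (smaller p)...
loose = theorem3_error_bound(500, 500, m=25_000, delta=0.1)   # p = 0.1
tight = theorem3_error_bound(500, 500, m=250_000, delta=0.1)  # p = 1.0
print(loose > tight)  # True: more observations give a smaller error bound
```

Note also that the bound scales linearly in the noise level δ, and vanishes when δ = 0, consistent with the exact-recovery result of Theorem 2.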
Singular Value Thresholding (SVT)

    SVT(G, τ, {δ_k}):
        Y^(0) = 0 ∈ R^{n1 x n2}; k = 1
        while convergence criterion not met:
            F^(k) = soft_threshold(Y^(k−1); τ)
            Y^(k) = Y^(k−1) + δ_k P_Ω(G − F^(k))
            k = k + 1
        return F* = F^(k)

    soft_threshold(Y ∈ R^{n1 x n2}; τ):
        [U, S, V] = svd(Y)
        for k = 1 : min(n1, n2): S(k,k) = max(0, S(k,k) − τ)
        return U S V^T

• Here P_Ω is the masking operator: (P_Ω(A))(i,j) = A(i,j) if (i,j) ∈ Ω, and 0 otherwise.
• The soft-thresholding procedure obeys the following property (which we state without proof):

    soft_threshold(Y; τ) = argmin over X of { (1/2) ||X − Y||_F² + τ ||X||_* }

Properties of SVT (stated without proof)
• The sequence {F^(k)} converges to the true solution of the main problem provided the step sizes {δ_k} all lie between 0 and 2, and the value of τ is large.

Results
• The SVT algorithm works very efficiently and is easily implementable in MATLAB.
• The authors report reconstruction of a 30,000 by 30,000 matrix in just 17 minutes on a 1.86 GHz dual-core desktop with 3 GB RAM, with MATLAB's multithreading option enabled.

Results (Data without noise)
Results (Noisy Data)

Results on real data
• The dataset consists of a matrix M of geodesic distances between 312 cities in the USA/Canada.
• This matrix is approximately low-rank (in fact, the relative Frobenius error between M and its rank-3 approximation is 0.1159).
• 70% of the entries of this matrix (chosen uniformly at random) were blanked out.

Results on real data
• The underlying matrix was estimated using SVT.
• In just a few seconds and a few iterations, SVT produces an estimate as accurate as the best rank-3 approximation of M.

Results on real data

Applications to Video Denoising
• Consider a video I(x,y,t) corrupted by Gaussian and impulse noise.
• Impulse noise usually has high magnitude and is spatially sparse.
• Impulse noise can be "removed" by local median filtering, but this also attenuates edges and corners.
• Instead, median filtering is used as an intermediate step for collecting together K patches that are "similar" to a reference patch.
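The SVT iteration summarized above is short enough to sketch in full. The code below is a minimal NumPy version (the step size δ = 1.9 respects the 0 < δ < 2 convergence condition from the notes, and the choice τ = 5·max(n1, n2) is a common heuristic, not part of these notes); it recovers a small synthetic rank-2 matrix from half of its entries:

```python
import numpy as np

def svt_complete(G, mask, tau, delta=1.9, max_iters=3000, tol=1e-4):
    """Matrix completion by Singular Value Thresholding (Cai et al., 2010).
    G: observed matrix (values where mask is False are ignored),
    mask: boolean array, True where an entry of G is known (the set Omega)."""
    Y = np.zeros_like(G)
    F = Y
    for _ in range(max_iters):
        # Shrinkage step: soft-threshold the singular values of Y.
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        F = (U * np.maximum(s - tau, 0.0)) @ Vt
        # Step on the dual variable Y, restricted to the sampled set Omega.
        residual = mask * (G - F)
        if np.linalg.norm(residual) <= tol * np.linalg.norm(mask * G):
            break
        Y = Y + delta * residual
    return F

# Synthetic experiment: a 30 x 30 rank-2 matrix, ~60% of entries observed.
rng = np.random.default_rng(7)
M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
mask = rng.random((30, 30)) < 0.6
F_hat = svt_complete(mask * M, mask, tau=5 * 30)

rel_err = np.linalg.norm(F_hat - M) / np.linalg.norm(M)
print(rel_err)  # small relative error despite ~40% missing entries
```

Each iteration costs one SVD, which is why SVT scales to the large matrices reported above: the iterates F^(k) are low-rank, so truncated/partial SVDs can be substituted for the full one in serious implementations.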
Applications to Video Denoising
• The similar patches are assembled in the form of a matrix Q of size n x K (where n = number of pixels in each patch).
• If the noise were only Gaussian, we could perform an SVD of Q, attenuate the appropriate singular values, and then reconstruct the patches. In fact, this is what non-local PCA does (with minor differences).

Applications to Video Denoising
• But the entries of Q are also corrupted by impulse noise, and this adversely affects the SVD computation.
• Hence we can regard the pixels affected by impulse noise as "incomplete entries" (how will you identify them?) and pose this as a matrix completion under noise problem.

Applications to Video Denoising
• The formal problem statement is:

    P* = argmin over P of ||P||_*, subject to Σ_{(i,j) ∈ Ω} (P(i,j) − Q(i,j))² ≤ |Ω| σ²

where
    P = matrix of true/clean patches,
    Q = matrix of noisy patches,
    Ω = set of indices of pixels in Q having well-defined values (i.e. values not corrupted by impulse noise),
    σ = standard deviation of the Gaussian noise.

Applications to Video Denoising
• We know this can be solved using singular value thresholding (though other algorithms also exist).
• This procedure is repeated throughout the image or the video.

Result: effect of median filter
Denoising results
Denoising results

Matrix Completion in Practice: Scenario 5
• Consider N points of dimension d each, denoted as {x_i}, 1 ≤ i ≤ N.
• Consider a matrix D of size N x N of squared distances:

    D(i,j) = ||x_i − x_j||² = x_i^T x_i + x_j^T x_j − 2 x_i^T x_j

• D can be written in the form D = 1 z^T + z 1^T − 2 X X^T, where z is a vector of length N with z_i = x_i^T x_i, X is the matrix of size N by d whose i-th row is x_i^T, and 1 is a vector of length N containing all ones.

Matrix Completion in Practice: Scenario 5
• Hence the rank of D is at most 2 + rank(X X^T) ≤ 2 + d.
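The rank bound in Scenario 5 can be checked numerically: the decomposition D = 1 z^T + z 1^T − 2 X X^T is a sum of matrices of ranks 1, 1 and (at most) d, so rank(D) ≤ 2 + d no matter how large N is. A small sketch (NumPy assumed; the point cloud is random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 2
X = rng.standard_normal((N, d))  # N points in R^d, one per row

# Squared Euclidean distance matrix: D(i,j) = ||x_i - x_j||^2.
z = np.sum(X**2, axis=1)         # z_i = x_i^T x_i
ones = np.ones(N)
D = np.outer(ones, z) + np.outer(z, ones) - 2.0 * (X @ X.T)

# D = 1 z^T + z 1^T - 2 X X^T is a sum of matrices of ranks 1, 1 and d,
# so rank(D) <= 2 + d regardless of N.
print(np.linalg.matrix_rank(D))  # 4  (= 2 + d for generic points)
```

This is why partially observed distance matrices (e.g. the 312-city geodesic-distance experiment earlier) are natural candidates for matrix completion: a 50 x 50 matrix here carries only rank-4 worth of information.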