Sparse Representation and Compressed Sensing: Theory and Algorithms
Yi Ma (University of Illinois at Urbana-Champaign and Microsoft Research Asia), Allen Yang (University of California, Berkeley), John Wright (University of Illinois at Urbana-Champaign)
CVPR Tutorial, June 20, 2009

MOTIVATION – Applications to a variety of vision problems
• Face Recognition: Wright et al. PAMI '09, Huang CVPR '08, Wagner CVPR '09, …
• Image Enhancement and Superresolution: Elad TIP '06, Huang CVPR '08, …
• Image Classification: Mairal CVPR '08, Rodriguez '07, many others …
• Multiple Motion Segmentation: Rao CVPR '08, Elhamifar CVPR '09, …
• … and many others, including this conference
When and why can we expect such good performance? A closer look at the theory …

SPARSE REPRESENTATION – Model problem
Underdetermined system of linear equations y = Ax, with observation y ∈ R^m, matrix A ∈ R^{m×n} with m ≪ n, and unknown x ∈ R^n.
Two interpretations:
• Compressed sensing: A as a sensing matrix
• Sparse representation: A as an overcomplete dictionary
Many more unknowns than observations → no unique solution.
• Classical answer: the minimum ℓ2-norm solution
• Emerging applications: instead desire sparse solutions

SPARSE SOLUTIONS – Uniqueness
Look for the sparsest solution:
    min ‖x‖_0   subject to   y = Ax,
where ‖·‖_0 counts the number of nonzero elements.
Is the sparsest solution unique?
spark(A) – size of the smallest set of linearly dependent columns of A.
If y = A_1 x_1 = A_2 x_2 for two representations on different supports, then A_1 x_1 − A_2 x_2 = 0: the union of the two supports indexes a linearly dependent set of columns.
Proposition [Gorodnitsky & Rao '97]: If y = A x_0 with ‖x_0‖_0 < spark(A)/2, then x_0 is the unique solution to min ‖x‖_0 subject to y = Ax.

SPARSE SOLUTIONS – So How Do We Compute It?
Looking for the sparsest solution:
    (P0)   min ‖x‖_0   subject to   y = Ax.
Bad news: (P0) is NP-hard in the worst case, and hard to approximate within certain constants [Amaldi & Kann '95].
Maybe we can still solve important cases?
• Greedy algorithms: Matching Pursuit, Orthogonal Matching Pursuit [Mallat & Zhang '93], CoSaMP [Needell & Tropp '08]
• Convex programming [Chen, Donoho & Saunders '94]

SPARSE SOLUTIONS – The Heuristic
(P0) min ‖x‖_0 subject to y = Ax is intractable. Convex relaxation:
    (P1)   min ‖x‖_1   subject to   y = Ax,
a linear program, solvable in polynomial time.
Why ℓ1? It is the convex envelope of ℓ0 over the unit cube, and it has a rich applied history – geosciences, sparse coding in vision, statistics.
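As a concrete illustration of the relaxation, (P1) can be posed as a linear program by splitting x into its positive and negative parts. The following is a minimal sketch using NumPy and SciPy (not part of the tutorial materials); the function name l1_min and the toy problem sizes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(A, y):
    """Solve (P1): min ||x||_1 subject to A x = y,
    via the standard LP reformulation x = u - v with u, v >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)                 # objective: sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([A, -A])          # equality constraint: A (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    z = res.x
    return z[:n] - z[n:]

# Toy example: a 5-sparse vector in R^100 recovered from 30 Gaussian measurements.
rng = np.random.default_rng(0)
m, n, k = 30, 100, 5
A = rng.standard_normal((m, n))
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_hat = l1_min(A, A @ x0)
print("recovered:", np.allclose(x_hat, x0, atol=1e-6))
```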
EQUIVALENCE – A stronger motivation
In many cases, the solutions to (P0) and (P1) are exactly the same:
Theorem [Candes & Tao '04, Donoho '04]: For Gaussian A, with overwhelming probability, whenever ‖x_0‖_0 < ρ* m,
    x_0 = argmin ‖x‖_1   subject to   Ax = A x_0.
"ℓ1-minimization recovers any sufficiently sparse solution."

GUARANTEES – "Well-Spread" A
Mutual coherence μ(A): the largest inner product between distinct (normalized) columns of A. Low mutual coherence means the columns are well-spread in the space.
Theorem [Elad & Donoho '03, Gribonval & Nielsen '03]: ℓ1-minimization uniquely recovers any x_0 with ‖x_0‖_0 < (1 + 1/μ(A)) / 2.
Strong point: a checkable condition. Weakness: low coherence can only guarantee recovery up to O(√m) nonzeros.

GUARANTEES – Beyond Coherence
Low coherence says: "any submatrix consisting of two columns of A is well-conditioned." Can we get stronger bounds by looking at larger submatrices?
Restricted isometry constants: δ_k is the smallest δ such that for all k-sparse x,
    (1 − δ) ‖x‖_2^2 ≤ ‖Ax‖_2^2 ≤ (1 + δ) ‖x‖_2^2.
Low RIC: "column submatrices of A are uniformly well-conditioned."
Theorem [Candes & Tao '04, Candes '07]: If the restricted isometry constant δ_{2k} is sufficiently small, then ℓ1-minimization recovers any k-sparse x_0.
For random A, this guarantees recovery up to linear sparsity: ‖x_0‖_0 < ρ* m.

GUARANTEES – Sharp Conditions?
Necessary and sufficient condition: x_0 solves (P1) iff a face condition holds on the polytope P spanned by the columns of A and their negatives.

GUARANTEES – Geometric Interpretation
• [Donoho '06]: ℓ1-minimization uniquely recovers x_0 with a given support and sign pattern iff the corresponding signed columns span a simplicial face of P.
• [Donoho & Tanner '08]: uniform recovery of all k-sparse x_0 holds iff P is centrally k-neighborly.
This geometric understanding gives sharp thresholds for sparse recovery with Gaussian A [Donoho & Tanner '08]:
[Figure: phase diagram with the aspect ratio of A on the horizontal axis and the sparsity on the vertical axis; below the weak threshold recovery succeeds almost always and above it fails almost always, while below the strong threshold recovery always succeeds.]
Explicit formulas for the weak and strong thresholds are available in the wide-matrix limit [Donoho & Tanner '08].

GUARANTEES – Noisy Measurements
What if there is noise in the observation? y = Ax + z, with z Gaussian or of bounded 2-norm.
Natural approach: relax the constraint,
    min ‖x‖_1   subject to   ‖y − Ax‖_2^2 ≤ ε^2.
Studied in several literatures: statistics (the LASSO) and signal processing (basis pursuit denoising, BPDN).
Theorem [Donoho, Elad & Temlyakov '06]: Recovery is stable:
    ‖x̂ − x_0‖_2^2 ≤ 4 ‖z‖_2^2 / (1 − μ(A)(4‖x_0‖_0 − 1)).
Theorem [Candes, Romberg & Tao '06]: Recovery is stable – for A satisfying an appropriate restricted isometry condition (of order 4S),
    ‖x̂ − x_0‖_2 ≤ C_1 ‖z‖_2 + C_2 ‖x_0 − x_{0,S}‖_1 / √S,
where x_{0,S} is the best S-term approximation of x_0.
See also [Donoho '06], [Wainwright '06], [Meinshausen & Yu '06], [Zhao & Yu '06], …
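The relaxed, noise-aware program above is a second-order cone problem rather than a plain LP. Below is a minimal sketch, assuming the CVXPY modeling package is available (the tutorial does not prescribe a solver); the helper name l1_denoise, the tolerance eps, and the problem sizes are illustrative.

```python
import numpy as np
import cvxpy as cp

def l1_denoise(A, y, eps):
    """Basis pursuit denoising: min ||x||_1 subject to ||y - A x||_2 <= eps."""
    n = A.shape[1]
    x = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.norm(x, 1)),
                      [cp.norm(y - A @ x, 2) <= eps])
    prob.solve()
    return x.value

# Toy example: noisy measurements of a sparse vector.
rng = np.random.default_rng(1)
m, n, k = 50, 200, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
z = 0.01 * rng.standard_normal(m)
x_hat = l1_denoise(A, A @ x0 + z, eps=np.linalg.norm(z))
print("recovery error:", np.linalg.norm(x_hat - x0))
```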
CONNECTIONS – Sketching and Expanders
Similar sparse recovery problems have been explored in the data-streaming community: a short sketch y = Ax ∈ R^m is maintained for a long data stream x ∈ R^n, with A ∈ R^{m×n}, m ≪ n.
Combinatorial algorithms → fast encoding/decoding, at the expense of a suboptimal number of measurements.
Based on ideas from group testing and expander graphs. [Gilbert et al. '06], [Indyk '08], [Xu & Hassibi '08]

CONNECTIONS – High-dimensional geometry
Sparse recovery guarantees can also be derived via probabilistic constructions from high-dimensional geometry:
• The Johnson-Lindenstrauss lemma: given n points x_1, …, x_n ⊂ R^m, a random projection P into C log(n)/ε^2 dimensions preserves pairwise distances:
    (1 − ε) ‖x_i − x_j‖_2 ≤ ‖P x_i − P x_j‖_2 ≤ (1 + ε) ‖x_i − x_j‖_2.
• Dvoretsky's almost-spherical section theorem: there exist subspaces Γ ⊂ R^m of dimension as high as c·m on which the ℓ1 and ℓ2 norms are comparable:
    ∀ x ∈ Γ,   C √m ‖x‖_2 ≤ ‖x‖_1 ≤ √m ‖x‖_2.

THE STORY SO FAR – Sparse recovery guarantees
Sparse solutions can often be recovered by linear programming. Performance guarantees for arbitrary matrices with "uniformly well-spread columns":
• (in)coherence
• restricted isometry
Sharp conditions via polytope geometry; very well-understood performance for random matrices.
What about matrices arising in vision?

PRIOR WORK – Face Recognition as Sparse Representation
Linear subspace model for images of the same face under varying illumination: the training images of subject i form a matrix A_i. If the test image y is also of subject i, then y ≈ A_i x_i for some coefficient vector x_i. Any test image can be represented with respect to the entire training set as
    y = A x + e,
where A = [A_1, …, A_K] is the combined training dictionary, x the coefficients, and e models corruption or occlusion.
This is an underdetermined system of linear equations in the unknowns (x, e). The solution is not unique … but it should be sparse:
• x ideally supported only on images of the same subject;
• e expected to be sparse: occlusion affects only a subset of the pixels.
Seek the sparsest solution via the convex relaxation
    min ‖x‖_1 + ‖e‖_1   subject to   y = Ax + e.
Wright, Yang, Ganesh, Sastry, and Ma, "Robust Face Recognition via Sparse Representation," PAMI 2009.

GUARANTEES – What About Vision Problems?
Behavior under varying levels of random pixel corruption:
[Figure: recognition rates of 99.3%, 90.7%, and 37.5% under increasing levels of corruption.]
Can existing theory explain this phenomenon?

PRIOR WORK – Error Correction by ℓ1-minimization
Candes and Tao [IT '05]:
• Apply a parity-check matrix B with BA = 0, yielding By = B(Ax + e) = Be, an underdetermined system in the sparse error e only.
• Set ê = argmin ‖e‖_1 subject to Be = By.
• Recover x from the clean system y − ê = Ax.
Succeeds whenever e is recoverable by ℓ1 in the reduced system.
This work: instead solve
    min ‖x‖_1 + ‖e‖_1   subject to   y = Ax + e,
which can be applied even when A is wide (no parity check exists). Succeeds whenever (x, e) is recoverable by ℓ1 in the expanded system [A I].
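A minimal sketch of the expanded-system approach, assuming the l1_min routine from the earlier LP sketch is in scope; the helper name robust_l1 is illustrative.

```python
import numpy as np

def robust_l1(A, y):
    """Recover (x, e) from y = A x + e by solving the expanded system
    min ||x||_1 + ||e||_1 subject to [A  I][x; e] = y,
    i.e., ordinary l1-minimization over the concatenated matrix [A I]."""
    m, n = A.shape
    B = np.hstack([A, np.eye(m)])
    w = l1_min(B, y)          # l1_min: the LP sketch defined earlier
    return w[:n], w[n:]
```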
GUARANTEES – What About Vision Problems?
• A is highly coherent: the face images are highly correlated, spanning only a small volume of the ambient space.
• x is very sparse: on the order of the number of images per subject, and often nonnegative (illumination cone models).
• e should be as dense as possible: robustness to the highest possible fraction of corruption.
Results so far suggest ℓ1-minimization should not succeed in this regime.

SIMULATION – Dense Error Correction?
As the dimension m → ∞, an even more striking phenomenon emerges:
[Figure: fraction of successful recoveries vs. error fraction for increasing m; the correctable fraction of errors approaches 1.]
Conjecture: if the matrices A are sufficiently coherent, then for any error fraction ρ < 1, as m → ∞, solving min ‖x‖_1 + ‖e‖_1 subject to y = Ax + e corrects almost any error e with ‖e‖_0 ≤ ρ m.

DATA MODEL – Cross-and-Bouquet
Our model for A should capture the fact that its columns a_i are tightly clustered around a common mean μ (the "bouquet"): a_i = μ + ν_i with the norms of the deviations ν_i well-controlled, while μ itself is mostly incoherent with the standard (error) basis (the "cross"). We call this the Cross-and-Bouquet (CAB) model.

ASYMPTOTIC SETTING – Weak Proportional Growth
• Observation dimension m → ∞.
• Problem size grows proportionally: n/m → δ.
• Error support grows proportionally: ‖e_0‖_0 / m → ρ.
• Support size of x_0 is sublinear in m.
Sublinear growth of ‖x_0‖_0 is necessary to correct arbitrary fractions of errors: we need at least ‖x_0‖_0 "clean" equations, and the fraction of clean equations, 1 − ρ, vanishes as ρ → 1.
Empirical observation: if ‖x_0‖_0 grows linearly in m, there is a sharp phase transition in the correctable error fraction.

NOTATION – Correct Recovery of Solutions
Whether (x_0, e_0) is recovered depends only on its signs and support. Call (x_0, e_0) ℓ1-recoverable if it solves
    min ‖x‖_1 + ‖e‖_1   subject to   Ax + e = A x_0 + e_0
and the minimizer is unique.

MAIN RESULT – Correction of Arbitrary Error Fractions
With the notation above: in the weak-proportional-growth setting, for matrices drawn from the cross-and-bouquet model, ℓ1-minimization on the expanded system "recovers any sparse signal from almost any error with density less than 1."

SIMULATION – Arbitrary Errors in WPG
[Figure: fraction of correct recoveries for increasing m.]

SIMULATION – Phase Transition in Proportional Growth
What if ‖x_0‖_0 grows linearly with m? An asymptotically sharp phase transition appears, similar to that observed by Donoho and Tanner for homogeneous Gaussian matrices.

SIMULATION – Comparison to Alternative Approaches
• "L1 – [A I]": ℓ1-minimization on the expanded system.
• "L1 – comp": the parity-check reduction of Candes + Tao '05.
• "ROMP": regularized orthogonal matching pursuit, Needell + Vershynin '08.

SIMULATION – Error Correction with Real Faces
For real face images, weak proportional growth corresponds to the setting where the total image resolution grows proportionally to the size of the database.
[Figure: fraction of correct recoveries; above, corrupted images at the 50%-probability-of-correct-recovery level; below, their reconstructions.]

SUMMARY – Sparse Representation in Theory and Practice
So far:
• Face recognition as a motivating example
• Sparse recovery guarantees for generic systems
• New theory and new phenomena from face data
After the break:
• Algorithms for sparse recovery
• Many more applications in vision and sensor networks
• Matrix extensions: missing data imputation and robust PCA
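As a pointer back to the simulations above, the following Monte Carlo harness sketches the cross-and-bouquet error-correction experiment in miniature. It assumes the robust_l1 and l1_min routines from the earlier sketches are in scope; the bouquet tightness nu, the corruption magnitude, and the problem sizes are illustrative choices, not the settings used in the tutorial's experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def cab_trial(m, n, k, rho, nu=0.1):
    """One trial: draw a cross-and-bouquet style A, a k-sparse nonnegative x0,
    corrupt a fraction rho of the entries of y with gross errors, and check
    whether the expanded-system l1 program recovers x0."""
    mu = np.ones(m) / np.sqrt(m)                          # common "bouquet" mean
    A = mu[:, None] + nu * rng.standard_normal((m, n)) / np.sqrt(m)
    A /= np.linalg.norm(A, axis=0)                        # unit-norm columns
    x0 = np.zeros(n)
    x0[rng.choice(n, k, replace=False)] = rng.random(k) + 0.5
    e0 = np.zeros(m)
    supp = rng.choice(m, int(rho * m), replace=False)
    e0[supp] = 10.0 * rng.standard_normal(supp.size)      # gross corruption
    x_hat, _ = robust_l1(A, A @ x0 + e0)                  # see the earlier sketch
    return np.linalg.norm(x_hat - x0) < 1e-4

# Empirical recovery rate at a fixed corruption fraction.
rate = np.mean([cab_trial(m=200, n=40, k=3, rho=0.5) for _ in range(10)])
print(f"fraction of correct recoveries at 50% corruption: {rate:.2f}")
```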