Over-complete Representations for Signals/Images
IT 530, Lecture Notes

Introduction: Complete and over-complete bases
• Signals are often represented as a linear combination of basis functions (e.g. a Fourier or wavelet representation).
• The basis functions always have the same dimensionality as the (discrete) signals they represent.
• The number of basis vectors is traditionally the same as the dimensionality of the signals they represent.
• These bases may be orthonormal (Fourier, wavelet, PCA) or non-orthonormal (non-orthonormalized ICA).

Introduction: Complete and over-complete bases
• A more general representation for signals uses so-called "over-complete bases", where the number of basis functions is MORE than the dimensionality of the signals.
• Complete basis: x = As, with x ∈ R^n, A ∈ R^(n×n), s ∈ R^n.
• Over-complete basis: x = As, with x ∈ R^n, A ∈ R^(n×m), s ∈ R^m, m > n.

Introduction: Construction of over-complete bases
• Over-complete bases can be created by taking the union of multiple complete bases.
• Example 1: a signal with n values can be represented using the union of an n×n Fourier basis and an n×n Haar wavelet basis, yielding an n×2n basis matrix.
• Example 2: a signal with n values can be represented by adding sinusoids of additional frequencies to an existing Fourier basis matrix with n vectors.

Introduction: uniqueness?
• With complete bases, the representation of the signal is always unique.
• Example: signals are uniquely defined by their wavelet or Fourier transforms, and the eigen-coefficients of any signal (given a PCA basis) are uniquely defined.
• This uniqueness is LOST with an over-complete basis.

Introduction: compactness!
• Advantage: over-complete bases afford much greater compactness in signal representation.
• Example: consider two types of audio signals - whistles and claps. Signals of either type can be represented in a complete Fourier or wavelet basis (a power-law of compressibility will apply).
• BUT: imagine two complete bases learned respectively for whistles and claps - B1 and B2.

Introduction: compactness!
• Suppose B1 and B2 are such that a whistle (resp. clap) signal will likely have a compact representation in the whistle (resp. clap) basis and not in the other one.
• A whistle+clap signal will NOT have a compact representation in either basis - B1 or B2!
• But the whistle+clap signal WILL have a compact representation in the over-complete basis B = [B1 B2].

More problems
• Since a signal can have many representations in an over-complete basis, which one do we pick?
• Pick the sparsest one, i.e. the one with the least number of non-zero elements, which either perfectly reconstructs the signal or reconstructs it up to some error.
• Finding the sparsest representation of a signal in an over-complete basis is an NP-hard problem (no known algorithm solves it exactly in less than exponential time).

More problems
• In other words, the following problem is NP-hard:
  x = As,  x ∈ R^n,  A ∈ R^(n×m),  s ∈ R^m  (m > n)
  s* = argmin_s ||s||_0  subject to  x = As

Solving (?) those problems
• The NP-hard problem has several approximation methods - basis pursuit (BP), matching pursuit (MP), orthogonal matching pursuit (OMP) and many others.
• None of them will give you the sparsest solution - but they will (under different conditions) yield a solution that is sparse enough. (A brute-force illustration of the combinatorial search they avoid is sketched below.)
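To make the combinatorial nature of the L0 problem concrete, here is a minimal NumPy sketch (not part of the original notes; the dictionary, signal and tolerance are made up) that finds the sparsest exact representation on a tiny over-complete dictionary by exhaustively trying supports of increasing size - exactly the kind of search that is infeasible at realistic dimensions.

```python
import itertools
import numpy as np

def sparsest_representation(A, x, tol=1e-8):
    """Brute-force search for the sparsest s with A s = x (exponential in m)."""
    n, m = A.shape
    for k in range(1, m + 1):                          # supports of increasing size
        for support in itertools.combinations(range(m), k):
            sub = A[:, support]
            s_sub = np.linalg.lstsq(sub, x, rcond=None)[0]
            if np.linalg.norm(sub @ s_sub - x) < tol:  # exact reconstruction found
                s = np.zeros(m)
                s[list(support)] = s_sub
                return s                               # first hit has minimal L0 norm
    return None

# Toy over-complete dictionary: union of the identity and a random orthonormal basis
rng = np.random.default_rng(0)
n = 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = np.hstack([np.eye(n), Q])                          # n x 2n
x = 2.0 * A[:, 3] - 1.5 * A[:, 10]                     # a genuinely 2-sparse signal
print(np.nonzero(sparsest_representation(A, x))[0])    # expected: [ 3 10 ]
```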
Bayesian approach
• Consider x = As, with x ∈ R^n, A ∈ R^(n×m), s ∈ R^m (m > n).
• Assume a suitable prior P(s) on s.
• Assume the signal model x = As + η with Gaussian noise η, so that
  -log p(x | A, s) ∝ ||x - As||^2 / (2σ^2)
• The MAP estimate is
  s* = argmax_s p(x | A, s) p(s)

Bayesian approach
• For zero noise and a Gaussian prior on s, the solution for s is obtained by solving:
  min ||s||_2  subject to  x = As
  whose solution is given by the pseudo-inverse: s* = A^+ x = A^T (A A^T)^(-1) x (for full-rank A).
• For zero noise and a Laplacian prior on s, the solution for s is obtained by solving:
  min ||s||_1  subject to  x = As
  This is Basis Pursuit: there is no closed-form solution, so we use linear programming.

Linear programming: canonical form
• Linear objective (energy) function: min_w c^T w.
• Linear equality and inequality constraints: Aw = b, w ≥ 0.

Linear programming for Basis Pursuit
  min ||s||_1  subject to  x = As   (no closed-form solution; use linear programming)
• Write s = u - v, where u is the vector of the same size as s with the negative elements set to 0 and the positive elements the same as in s, and v is the vector of the same size as s with the non-negative elements set to 0 and the negative elements multiplied by -1.
• The problem becomes:
  min ||[u; v]||_1  subject to  [A, -A][u; v] = x  and  u ≥ 0, v ≥ 0
• Linear programming problems can be solved in polynomial time! There are various algorithms such as the simplex method (worst-case exponential), interior-point methods and the ellipsoid algorithm (both polynomial).

L1 norm and L0 norm
• There is a special relationship between the following two problems (which we will study in compressive sensing later on), both with x ∈ R^n, A ∈ R^(n×m), s ∈ R^m (m > n):
  s* = argmin_s ||s||_0  subject to  x = As
  s* = argmin_s ||s||_1  subject to  x = As
• The L1 norm is a "softer" version of the L0 norm. Other Lp norms with 0 < p < 1 are possible and impose a stronger form of sparsity, but they lead to non-convex problems. Hence L1 is preferred.
  ||x||_p = ( Σ_{i=1}^n |x_i|^p )^(1/p)

Matching Pursuit
• One of the simplest approximation algorithms to obtain the coefficients s of a signal y in an over-complete basis A.
• Developed by Mallat and Zhang in 1993 (ref: S. G. Mallat and Z. Zhang, "Matching Pursuits with Time-Frequency Dictionaries", IEEE Transactions on Signal Processing, December 1993).
• Based on successively choosing the vector in A which has maximal inner product with a so-called residual vector (initialized to y in the beginning).

Matching Pursuit: pseudo-code
  r^(0) = y,  s = 0,  i = 0
  while ||r^(i)||^2 > ε
  {
      j = argmax_l | r^(i)T a_l | / ||a_l||      ("j" or "l" is an index for dictionary columns)
      s_j = s_j + r^(i)T a_j                     (coefficients accumulate; atoms assumed unit-norm)
      r^(i+1) = r^(i) - (r^(i)T a_j) a_j
      i = i + 1
  }
  OUTPUT: {s_j}

Properties of matching pursuit
• The reconstruction error ||r^(i)||^2 is always guaranteed to decrease, and the decrease is at an exponential rate.
• At any iteration, the following relationship holds (with unit-norm atoms, where s_j denotes the coefficient computed at iteration j):
  ||y||^2 = Σ_{j=0}^{i-1} s_j^2 + ||r^(i)||^2

Orthogonal Matching Pursuit (OMP)
• A more sophisticated algorithm as compared to matching pursuit (MP).
• The signal is approximated by successive projection onto those dictionary columns (i.e. columns of A) that are associated with a current "support set".
• The support set is also successively updated.

OMP: pseudo-code
  r^(0) = y,  s = 0,  T^(0) = ∅,  i = 0
  while ||r^(i)||^2 > ε
  {
      (1) j = argmax_l | r^(i)T a_l | / ||a_l||     (best-matching atom)
      (2) T^(i+1) = T^(i) ∪ {j}                     (update the support set)
      (3) s_{T^(i+1)} = argmin_s || y - A_{T^(i+1)} s ||^2 = A_{T^(i+1)}^+ y
          (A_{T^(i+1)} is the sub-matrix containing only those columns which lie in the support set;
           several coefficients are re-computed in each iteration)
      (4) r^(i+1) = y - A s;  i = i + 1
  }
  OUTPUT: {s_j, a_j}

OMP versus MP
• Unlike MP, OMP never re-selects any element.
• Unlike MP, in OMP the residual at an iteration is always orthogonal to all currently selected elements.
• OMP is costlier per iteration (due to the pseudo-inverse) but generally more accurate than MP.
• Unlike MP, OMP converges in at most K iterations for a dictionary with K elements.
• OMP always gives the optimal approximation w.r.t. the selected subset of the dictionary (note: this does not mean that the selected subset itself is optimal).
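A compact NumPy sketch of the OMP loop above (an illustration, not a reference implementation: the function name, stopping tolerance and the toy dictionary are mine, and the dictionary columns are assumed to be normalized to unit norm).

```python
import numpy as np

def omp(A, y, tol=1e-6, max_atoms=None):
    """Orthogonal Matching Pursuit: greedy atom selection + least-squares re-fit.
    Assumes the columns of A have unit norm."""
    n, m = A.shape
    max_atoms = max_atoms or m
    r = y.copy()                                     # residual, initialized to the signal
    support, s = [], np.zeros(m)
    while np.linalg.norm(r) > tol and len(support) < max_atoms:
        j = int(np.argmax(np.abs(A.T @ r)))          # atom most correlated with the residual
        if j in support:                             # safeguard; OMP should never re-select
            break
        support.append(j)
        A_T = A[:, support]
        s_T = np.linalg.lstsq(A_T, y, rcond=None)[0] # re-fit ALL coefficients on the support
        r = y - A_T @ s_T                            # residual is orthogonal to span(A_T)
    if support:
        s[support] = s_T
    return s

# Tiny usage example with a random unit-norm dictionary (made-up data)
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
A /= np.linalg.norm(A, axis=0)
s_true = np.zeros(50)
s_true[[4, 17, 31]] = [1.0, -2.0, 0.5]
y = A @ s_true
print(np.nonzero(omp(A, y))[0])                      # typically recovers [ 4 17 31 ]
```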
OMP and MP for noisy signals
• It is trivial to extend OMP and MP to noisy signals.
• The stopping criterion becomes a small (not zero) residual magnitude.

BP under noise
• Observation model: y = x + z, with x = As and z Gaussian noise (z ~ N(0, σ^2 I)).
• The estimate is
  s* = argmin_s (1/2)||y - As||^2 + λ||s||_1
• This is a quadratic programming problem that is structurally similar to a linear program:
  min_{w,p} c^T w + (1/2)||p||^2  subject to  [A, -A]w + p = b,  w ≥ 0
  where c = λ[1; 1], s = u - v, w = [u; v], b = y.

Learning the bases
• So far we assumed that the basis (i.e. A) was fixed, and optimized for the sparse representation.
• Now we need to learn A as well!
• We have learned about PCA and ICA, but they do not always give the best representation!
  (Figure: basis vectors obtained by PCA, by ICA, and by an over-complete model.)

Learning the bases: analogy with K-means
• In K-means, we start with a set of K cluster centers and assign each point in the dataset to the nearest cluster center.
• The cluster centers are re-computed by taking the mean of all points assigned to a cluster.
• The assignment and cluster-center computation steps are iterated until a convergence criterion is met.

Learning the bases: analogy with K-means
• K-means is a special sparse coding problem where each point is represented by strictly one of K dictionary elements.
• Our dictionary (or basis) learning problem is more complex: we are trying to express each point as a linear combination of a subset of dictionary elements (i.e. a sparse linear combination of dictionary elements).

Learning the Bases!
• Find the model (i.e. over-complete basis) A for which the likelihood of the data is maximized:
  P(x | A) = ∫ P(x | A, s) P(s) ds
• The above integral is not available in closed form for most priors on s (e.g. Laplacian - intractable).
• Approximation (Method 1): assume that the volume of the pdf is concentrated around its mode (w.r.t. s).
Ref: Olshausen and Field, "Natural image statistics and efficient coding".

Learning the bases: Method 1
• Mode approximation:
  ∫ P(s) P(x_k | A, s) ds ≈ P(s_k*) P(x_k | A, s_k*)
• The dictionary estimate is
  A* = argmax_A Π_{k=1}^N P(s_k*) P(x_k | A, s_k*)
     = argmin_A Σ_{k=1}^N ( λ||s_k*||_1 + ||x_k - A s_k*||^2 )
• Gradient descent on A (with step size η):
  A^(t+1) = A^(t) - η Σ_{k=1}^N (A^(t) s_k* - x_k) s_k*^T

Two-step iterative procedure
• Fix the basis A and obtain the sparse coefficients for each signal using MP, OMP or BP (some papers - like the one by Olshausen and Field - use gradient descent for this step!).
• Now fix the coefficients and update the basis vectors (using various techniques, one of which was described on the previous slide).
• Normalize each basis vector to unit norm.
• Repeat the previous two steps until some error criterion is met. (A code sketch of this alternation is given after the toy experiments below.)

Toy Experiment 1
(Figure: result of basis learning - a dictionary with 144 elements - with sparsity constraints on the codes. Training performed on 12 x 12 patches extracted from natural images.)
Ref: Olshausen and Field, "Natural image statistics and efficient coding".

Toy Experiment 2
• Data points are generated as a (Laplacian / super-Laplacian) random linear combination of some arbitrarily chosen basis vectors.
• In the no-noise situation, the aim is to extract the basis vectors and the coefficients of the linear combination.
• The fitting results show that the true and estimated directions agree quite well.
Ref: Lewicki and Sejnowski, "Learning overcomplete representations".
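A minimal sketch of the two-step alternation described above, reusing the omp() function from the earlier sketch for the sparse coding step and a plain gradient step with column re-normalization for the dictionary step. The step size, iteration count and initialization are illustrative choices, not the values used by Olshausen and Field.

```python
import numpy as np

def learn_dictionary(X, m, n_iters=50, step=0.01, code_tol=1e-3, seed=0):
    """Alternating dictionary learning (Method 1 flavour).
    X: n x N matrix whose columns are the training signals; m: number of atoms.
    Requires the omp() function defined in the earlier OMP sketch."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    A = rng.standard_normal((n, m))
    A /= np.linalg.norm(A, axis=0)                   # unit-norm atoms
    for _ in range(n_iters):
        # Step 1: sparse codes with the dictionary held fixed
        S = np.column_stack([omp(A, X[:, k], tol=code_tol) for k in range(N)])
        # Step 2: gradient step on A for sum_k ||x_k - A s_k||^2
        A -= step * (A @ S - X) @ S.T
        # Step 3: rescale every basis vector to unit norm
        A /= np.linalg.norm(A, axis=0) + 1e-12
    return A, S
```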
Learning the Bases: Method of Optimal Directions (MOD) - Method 2
• Given a fixed dictionary A, assume the sparse codes for every signal are computed using OMP, MP, etc.
• The overall error is then
  E(A) = Σ_{k=1}^N ||y_k - A s_k||^2 = ||Y - AS||^2
• We want to find the dictionary A that minimizes this error.
Ref: Engan et al., "Method of optimal directions for frame design".

Learning the Bases: Method of Optimal Directions (MOD) - Method 2
• Take the derivative of E(A) w.r.t. A and set it to 0:
  (Y - AS) S^T = 0   =>   A^(t+1) = Y S^(t)T ( S^(t) S^(t)T )^(-1)
• Following the update of A, each column of A is independently rescaled to unit norm.
• The updates of A and S alternate with each other until some convergence criterion is reached.
• This method is more efficient than the one by Olshausen and Field.

Learning the Bases: Method 3 - Union of Orthonormal Bases
• As before, we represent the signals as X = AS and solve
  (A, S) = argmin_{A,S} ||X - AS||^2 + λ||S||_1
• A is an over-complete dictionary, but let us assume that it is a union of orthonormal bases:
  A = [A_1 | A_2 | ... | A_M],   with A_i^T A_i = I for all i, 1 ≤ i ≤ M.

Learning the Bases: Method 3 - Union of Orthonormal Bases
• The coefficient matrix S can now be written in M blocks, each corresponding to a single orthonormal basis, stacked so that X = Σ_i A_i S_i:
  S = [S_1; S_2; ... ; S_M]
• Assuming the bases stored in A are fixed, the coefficients in S can be estimated using block coordinate relaxation (BCR), described on the following slide.

Learning the Bases: Method 3 - Union of Orthonormal Bases
  for t = 1 : T
  {
      λ_t = λ_0 (1 - t/T)
      Update S using BCR with parameter λ_t
  }

  BCR(λ_t, S)
  {
      for m = 1 : M
      {
          X_m = X - Σ_{j≠m} A_j S_j
          S_m = argmin_{S*} ||X_m - A_m S*||^2 + λ_t ||S*||_1
      }
  }
• There is a quick way of performing this optimization given an orthonormal basis - SOFT THRESHOLDING of A_m^T X_m (which could be replaced by hard thresholding if you had a stronger sparseness prior than an L1 norm).

Learning the Bases: Method 3 - Union of Orthonormal Bases
• Given the coefficients, we now want to update the dictionary, which is done as follows:
  for m = 1 : M
  {
      X_m = X - Σ_{j≠m} A_j S_j
      Compute the SVD:  S_m X_m^T = U D V^T
      A_m = V U^T
  }
• Why are we doing this? It is the so-called orthogonal Procrustes problem - a well-known application of the SVD. The specific problem we are solving is
  min_{A_m} ||X_m - A_m S_m||^2  s.t.  A_m A_m^T = I
  Note that it cannot be solved using a pseudo-inverse, as that will not impose the orthonormality constraint if there is noise in the data or if the coefficients are perturbed or thresholded.

Learning the Bases: Method 3 - the orthogonal Procrustes problem
  A* = argmin_A ||X - AS||^2  s.t.  A^T A = I
     = argmin_A trace( (X - AS)^T (X - AS) )
     = argmin_A trace( X^T X - 2 X^T A S + S^T S )        (using A^T A = I)
  which is equivalent to
  max_A trace( X^T A S ) = max_A trace( A S X^T )          (since trace(FG) = trace(GF))
  Let Q = S X^T, and let its SVD be Q = U D V^T. Then
  max_A trace( A U D V^T ) = max_A trace( V^T A U D ) = max_A trace( Z(A) D ),  where Z(A) = V^T A U.
  Since Z(A)^T Z(A) = I, we have Σ_i z_ii d_ii ≤ Σ_i d_ii.
  The maximum is achieved for Z(A) = I, i.e. V^T A U = I, i.e. A = V U^T.

Learning the Bases: Method 3 - Union of Orthonormal Bases
• Keeping all bases in A fixed, update the coefficients in S using a known sparse coding technique.
• Keeping the coefficients in S fixed, update the bases in A using the aforementioned SVD-based method.
• Repeat the above two steps until a convergence criterion is reached.
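For concreteness, here is a minimal NumPy sketch of the two dictionary updates just discussed: the unconstrained MOD least-squares update, and the orthonormality-constrained Procrustes update used in Method 3. The function names and the small ridge term are my own choices.

```python
import numpy as np

def mod_update(Y, S, eps=1e-12):
    """MOD dictionary update: A = Y S^T (S S^T)^{-1}, then rescale columns to unit norm.
    A tiny ridge term is added for numerical safety (an implementation choice)."""
    A = Y @ S.T @ np.linalg.inv(S @ S.T + eps * np.eye(S.shape[0]))
    return A / (np.linalg.norm(A, axis=0, keepdims=True) + eps)

def procrustes_update(X_m, S_m):
    """Orthonormal Procrustes update for one block of a union-of-orthonormal-bases
    dictionary: with S_m X_m^T = U D V^T (SVD), the optimal block is A_m = V U^T."""
    U, _, Vt = np.linalg.svd(S_m @ X_m.T)
    return Vt.T @ U.T
```

In Method 3, procrustes_update would be applied block by block to the residual X_m = X - Σ_{j≠m} A_j S_j, whereas mod_update fits the whole dictionary in one least-squares step.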
Learning the Bases: Method 4 - K-SVD
• Recall: we want to learn a dictionary and sparse codes on that dictionary, given some data points:
  min_{A,S} ||Y - AS||^2  subject to  ∀i, ||s_i||_0 ≤ T_0
• Starting with a fixed dictionary, sparse coding proceeds as usual - OMP, BP, MP, etc. The stopping criterion could be based on the reconstruction error or on the L0 norm of the sparse codes.
• The dictionary is updated one column at a time.

K-SVD: dictionary update
• Write the error in terms of rank-1 contributions (s_T^j denotes ROW j of S, NOT a column):
  ||Y - AS||^2 = || Y - Σ_{j=1}^K a_j s_T^j ||^2
               = || ( Y - Σ_{j≠k} a_j s_T^j ) - a_k s_T^k ||^2
               = || E_k - a_k s_T^k ||^2
  where E_k = Y - Σ_{j≠k} a_j s_T^j does NOT depend on the k-th dictionary column.
• How do we find a_k given this expression? We have decomposed the original error matrix Y - AS into a sum of rank-1 matrices, of which only the last term depends on a_k. So we seek the best rank-1 approximation of E_k, which is obtained from the SVD of E_k by taking the singular vectors corresponding to the largest singular value:
  E_k = U Δ V^T,   a_k = u_1,   s_T^k = Δ(1,1) v_1^T
• Problem! The codes may no longer be sparse - the SVD has no built-in sparsity constraint. So we proceed as follows:
  ω_k = { i | s_T^k(i) ≠ 0 }   (the data points that actually USE the k-th dictionary atom)
  Ω_k ∈ {0,1}^(N × |ω_k|),  with  Ω_k(ω_k(i), i) = 1   (N = number of data points)
  E_k^R = E_k Ω_k   (restricts E_k to only those columns of Y, i.e. only those data points, that use atom k, yielding a smaller matrix of size n × |ω_k|)
• Minimize || E_k^R - a_k s_R^k ||^2 via the SVD:
  E_k^R = U Δ V^T,   a_k = u_1,   s_R^k = Δ(1,1) v_1^T
• This update affects the sparse codes of only those data points that used the k-th dictionary element. (A code sketch of this single-atom update is given at the end of this K-SVD discussion.)

K-SVD and K-means
• Limit the sparsity factor to T_0 = 1.
• Force all the sparse codes to be either 0 or 1.
• Then you get the K-means algorithm!

Implementation issues
• K-SVD is a popular and effective method, but some implementation issues haunt it (life is never easy).
• K-SVD is susceptible to local minima and to over-fitting if K is too large (just as you can get meaningless clusters if the number of clusters is poorly chosen, or meaningless density estimates if the number of histogram bins is too large).
• K-SVD convergence is not fully guaranteed. The dictionary updates with fixed sparse codes ensure that the error decreases; however, updating the sparse codes with a fixed dictionary may not decrease the error, since this depends on the behaviour of the approximate sparse coding algorithms.
• You can speed up the algorithm by removing extremely "unpopular" dictionary elements, or by removing duplicate (or near-duplicate) columns of the dictionary.
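A minimal NumPy sketch of the single-atom K-SVD update described above (the function name and in-place conventions are mine; a full K-SVD implementation would loop this over all K atoms and alternate it with OMP-based sparse coding).

```python
import numpy as np

def ksvd_atom_update(Y, A, S, k):
    """K-SVD update of dictionary column a_k and the k-th ROW of S (in place).
    Only the signals whose current codes use atom k are affected."""
    omega = np.nonzero(S[k, :])[0]                 # data points that use atom k
    if omega.size == 0:
        return A, S                                # unused atom: leave untouched here
    # Error without atom k's contribution, restricted to the relevant columns
    E_k = Y[:, omega] - A @ S[:, omega] + np.outer(A[:, k], S[k, omega])
    U, d, Vt = np.linalg.svd(E_k, full_matrices=False)
    A[:, k] = U[:, 0]                              # best rank-1 fit: first left singular vector
    S[k, omega] = d[0] * Vt[0, :]                  # matching coefficients (row k of S)
    return A, S
```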
(Many) Applications of K-SVD
• Image denoising
• Image inpainting
• Image deblurring
• Blind compressive sensing
• Classification
• Compression
• And you can work on many more.

Application: Image Compression (Training Phase)
• Training set for dictionary learning: 11000 patches of size 8 x 8, taken from a face image database. Dictionary size K = 441 atoms (elements).
• OMP is used in the sparse coding step during training; the stopping criterion is a fixed number of coefficients, T_0 = 10.
• Over-complete Haar and DCT dictionaries (of size 64 x 441) and an orthonormal DCT basis of size 64 x 64 (as in JPEG) are also used for comparison.

Application: Image Compression (Testing Phase)
• A lossy image compression algorithm is evaluated using a rate-distortion curve: the X axis is the average number of bits used to store the signal, and the Y axis is the associated error or PSNR. Normally, the acceptable error is fixed and the number of bits is calculated.
• The test image is divided into non-overlapping patches of size 8 x 8.
• Each patch is projected onto the trained dictionary and its sparse code is obtained using OMP, given a fixed error e.

Application: Image Compression (Testing Phase)
• The encoded image contains the following:
  1. The sparse codes for each patch and the (dictionary) index of each coefficient.
  2. The number of coefficients used to represent each patch (different patches will need different numbers of coefficients).

Application: Image Compression (Testing Phase)
• The average number of bits per pixel (RPP) is calculated as:
  RPP = ( a · #patches + #coeffs · (b + Q) ) / #pixels
  where
  a = number of bits required to store the number of coefficients for each patch,
  #coeffs = sum total of the number of coefficients representing each patch,
  b = number of bits required to store the dictionary index for each coefficient (entropy/Huffman coding can be used here),
  Q = number of bits required to code each coefficient value (quantization level).
• Even if the error e for the OMP was fixed, we need to compute the total MSE between the true and the compressed image. This is due to the effects of quantization while storing the sparse coefficient values for each patch.

Application: Image Denoising
• K-SVD for denoising seeks to minimize the following objective function:
  (X, D, {α_ij}) = argmin_{X, D, {α_ij}}  λ||X - Y||^2 + Σ_{i,j} μ_ij ||α_ij||_0 + Σ_{i,j} ||D α_ij - R_ij X||^2
  Y = noisy image
  X = underlying clean image (to be estimated)
  α_ij = sparse dictionary coefficients for the patch at location (i,j)
  D = dictionary
  R_ij = matrix that extracts the patch x_ij from image X, i.e. x_ij = R_ij X

Application: Image Denoising
• Note: the dictionary may be learned a priori from a corpus of image patches. The patches from the noisy image can then be denoised by mere sparse coding.
• The more preferable method is to train the dictionary directly on the noisy image in tandem with the sparse coding step (as in the previous slide).
• This avoids having to depend on a training set and allows the dictionary to be tuned to the underlying image structure (as opposed to the structure of some other images).

K-SVD Algorithm for Denoising (dictionary learned on the noisy image)
• Set X = Y, D = over-complete DCT.
• Until some "convergence" criterion is satisfied, repeat the following:
  1. Obtain the sparse code for every patch (typically using OMP):
     ∀ i,j:  min_{α_ij} ||α_ij||_0  s.t.  ||R_ij X - D α_ij||^2 ≤ C^2   (C chosen according to the noise level)
  2. Perform the dictionary learning update typical of K-SVD.
• Estimate the final image X by averaging the reconstructed overlapping patches, OR estimate X given D and {α_ij}:
  X = argmin_X λ||X - Y||^2 + Σ_{ij} ||D α_ij - R_ij X||^2
    = ( λI + Σ_{ij} R_ij^T R_ij )^(-1) ( λY + Σ_{ij} R_ij^T D α_ij )

K-SVD Algorithm for Denoising (dictionary learned on the noisy image)
• The closed form
  X = ( λI + Σ_{ij} R_ij^T R_ij )^(-1) ( λY + Σ_{ij} R_ij^T D α_ij ),   with λ set to 30/σ,
  is a mathematically rigorous way of showing that X is reconstructed by averaging the overlapping denoised patches together with the noisy image itself.
• Baseline for comparison: the method by Portilla et al., "Image denoising using scale mixtures of Gaussians in the wavelet domain", IEEE TIP 2003.
• Code: http://www.cs.technion.ac.il/~elad/Various/KSVD_Matlab_ToolBox.zip
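Because each R_ij^T R_ij simply accumulates a count at every pixel the patch covers, the closed form above reduces to a per-pixel weighted average of the noisy image and the overlapping patch reconstructions. A minimal sketch, assuming the denoised patches D α_ij have already been computed and reshaped; the patch container keyed by top-left corner, the patch size and λ are illustrative choices, not part of the original notes.

```python
import numpy as np

def average_patches(Y, denoised_patches, patch_size=8, lam=0.5):
    """Implements X = (lam*I + sum R^T R)^{-1} (lam*Y + sum R^T D alpha)
    for overlapping patches: a per-pixel weighted average."""
    p = patch_size
    numer = lam * Y.astype(float)            # lam * Y term
    denom = lam * np.ones_like(numer)        # lam * I term (diagonal)
    for (i, j), patch in denoised_patches.items():   # patch = (D @ alpha_ij) reshaped to p x p
        numer[i:i + p, j:j + p] += patch             # R_ij^T (D alpha_ij)
        denom[i:i + p, j:j + p] += 1.0               # R_ij^T R_ij adds 1 per covered pixel
    return numer / denom
```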
Application of K-SVD: Filling in Missing Pixels
• An over-complete dictionary is trained a priori on a set of face images.
• A test face image (not part of the training set) is synthetically degraded by masking out 50 to 70 percent of the pixels.
• Patches from the degraded image are sparse coded by projection onto the trained dictionary using OMP.
• OMP is modified so that only the non-degraded pixels are considered in any error computation (the dictionary elements, restricted to those pixels, are therefore appropriately re-scaled to unit norm):
  standard OMP:  min ||y_ij - D s_ij||^2  such that  ||s_ij||_0 ≤ T
  masked OMP:    min ||(y_ij - D s_ij) ⊙ mask_ij||^2  such that  ||s_ij||_0 ≤ T

Application of K-SVD: Filling in Missing Pixels
• The patches are then reconstructed as
  y_ij = D s_ij
  NOTE: the full dictionary elements are used here, without masking!

Dictionary learning for classification
• Dictionary learning is useful for signal reconstruction (we have seen many techniques).
• We now explore applications to classification: for example, texture classification or object recognition.
• Application in texture classification:
Ref: Skretting and Husoy, "Texture classification using sparse frame-based representations", EURASIP Journal on Signal Processing, 2006.

Texture classification/segmentation: dictionary learning
• Training phase: patches from representative texture images of a given class C_i are arranged in a matrix X_i of size N x L.
• We represent X_i in the following form:
  X_i = F_i W_i,   X_i ∈ R^(N×L),  F_i ∈ R^(N×K),  W_i ∈ R^(K×L)
• Here F_i is a dictionary and W_i is a set of sparse coefficients.
• Starting with an initial F_i, we estimate W_i with a typical sparse coding step (with sparsity constraints).

Texture classification: dictionary learning
• Then, given the new W_i, the dictionary F_i is updated as:
  F_i = X_i W_i^T ( W_i W_i^T )^(-1)
• The aforementioned two steps are repeated until some convergence criterion is met.

Texture classification: test phase
• Blocks from a test image are projected onto each of the dictionaries, and the residual errors are computed after finding the coefficients with a sparse coding method (assuming a fixed number of non-zero elements in the sparse code).
• Each block is assigned to the class C_i whose dictionary F_i gives the least reconstruction error.
• Unfortunately, this method by itself produces very high errors!

Texture classification: test phase
• Significant improvement is obtained if the error images are smoothed prior to classification. The smoothing can be done with a Gaussian filter.
• If the Gaussian σ is small, classification errors remain high in many places.
• If the Gaussian σ is large, the classification errors will be low within a segment, though high along the borders between two or more segments.
• A large Gaussian imposes the belief that nearby pixels are from the same class (a belief that gets violated at the borders).

Texture classification: observations
• It is observed that class-specific dictionary learning outperforms classification with universal bases like wavelets or the DCT.
• Classification performance is affected by parameters such as the patch size, the number of dictionary atoms, and the Gaussian σ.

Discriminative Dictionary Learning
• The earlier scheme enforces good reconstruction of patches from a given texture class if a good dictionary is available.
• But it is not discriminative enough - it does not enforce poorer reconstruction under the "wrong" class!
• This issue is dealt with in the paper: Mairal et al., "Discriminative Learned Dictionaries for Local Image Analysis", CVPR 2008.

Discriminative Dictionary Learning
• The main discriminative function is
  C_i^λ(y_1, y_2, ..., y_N) = log( Σ_{j=1}^N e^(-λ (y_j - y_i)) )
• This is called a "softmax" discriminative function.
• It is close to zero when y_i is distinctly the smallest among {y_1, y_2, ..., y_N}: in that case the summation is dominated by the term e^(-λ(y_i - y_i)) = 1, whereas the other terms are close to zero (this is especially true for large values of λ).
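A tiny NumPy sketch of the softmax discriminative cost above; the function name and the example error values are mine.

```python
import numpy as np

def softmax_cost(errors, i, lam=10.0):
    """C_i^lambda(y_1..y_N) = log( sum_j exp(-lambda * (y_j - y_i)) ).
    Close to 0 when errors[i] is distinctly the smallest; large and positive otherwise."""
    y = np.asarray(errors, dtype=float)
    return np.log(np.sum(np.exp(-lam * (y - y[i]))))

errors = [0.9, 0.1, 1.2]           # reconstruction errors under three class dictionaries
print(softmax_cost(errors, 1))     # ~ 0: class 1 clearly gives the smallest error
print(softmax_cost(errors, 0))     # ~ 8: class 0 is not the best-fitting class
```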
Discriminative Dictionary Learning
• The objective function for our task is built from the sparse coding and reconstruction-error operators (x = data point, D = dictionary, α = sparse code, L = sparsity level):
  α*(x, D) = argmin_α ||x - Dα||^2  subject to  ||α||_0 ≤ L
  R*(x, D) = ||x - D α*(x, D)||^2
• The N class dictionaries are learned by solving
  min_{{D_j}, j=1..N}  Σ_{i=1}^N Σ_{l ∈ S_i} [ C_i^λ( {R*(x_l, D_j)}_{j=1}^N ) + μ R*(x_l, D_i) ]
  where S_i is the set of training samples of class i.
  High μ = purely reconstructive; low μ = purely discriminative.

Discriminative Dictionary Learning: Overall Algorithm
• The sparse coding step proceeds exactly as before.
• The dictionary vectors are updated by optimizing the aforementioned objective function:
  min_{{D_j}, j=1..N}  Σ_{i=1}^N Σ_{l ∈ S_i} [ C_i^λ( {R*(x_l, D_j)}_{j=1}^N ) + μ R*(x_l, D_i) ]

Discriminative Dictionary Learning: MOD-like update
• Setting the derivative of the energy with respect to D_j to zero yields a condition of the form
  Σ_l w_l ( x_l - D_j α_l^j ) α_l^(j)T = 0
  where α_l^j is the sparse code of x_l on D_j, and the weights w_l come from the derivatives of the softmax cost C_i^λ with respect to the reconstruction errors R*(x_l, D_j) (plus the μ term for l ∈ S_j).
• This update is very much like the MOD algorithm for (reconstructive) dictionary learning, but with a weighted error matrix.
• In the K-SVD-style variant, each atom is instead updated by computing the SVD of the corresponding weighted error matrix and taking the singular vectors associated with the largest singular value (just as in standard K-SVD - go back and compare).

Discriminative Dictionary Learning: Parameter Choice
• λ is initially chosen to be small. Its value is increased gradually (more discriminative) across the iterations that update the dictionary.
• μ is initially large (more reconstructive). Its value is gradually decreased (less reconstructive) across the iterations that update the dictionary.

Application 1: Texture Segmentation/Classification
• Experiments were performed on a mosaic of textures from the well-known Brodatz texture database.
• Experiments used 12 x 12 patches. The sparsity level was set to L = 4 and the dictionary size was K = 128.
• During training, 30 iterations of discriminative learning were performed.

Application 1: Texture Segmentation/Classification
• During testing, 12 x 12 patches from the test image are classified by first computing the reconstruction errors on the respective dictionaries corresponding to each class, followed by smoothing of the obtained error images.
• Each patch is assigned to the class yielding the least of the smoothed error values. (A code sketch of this classification rule appears at the end of these notes.)
• Comparison baseline: the method of Skretting and Husoy (cited earlier in these slides).

Application 2: Learning Discriminative Patches
• Consider a distinct object (e.g. a car, a bicycle, etc.) against a background.
• Consider a rectangular bounding box drawn around the object.
• Let small patches from within the bounding box form a set S1.
• Let the corresponding set from the background be denoted S2.

Application 2: Learning Discriminative Patches
• These patches are, on their own, bad features for distinguishing the foreground object from the background: many patches from the background and the foreground look similar, and the bounding box does not coincide with the object boundary.
• However, SOME patches of the foreground object do carry discriminative information! (Figure: example of a discriminative patch.)

Application 2: Learning Discriminative Patches
• Two discriminative dictionaries were learned - D1 for S1, and D2 for S2 - with K = 128 and L = 4.
• After the first few iterations, in each subsequent dictionary-update iteration only the 90% best-classified patches were retained (as per the "C" function for discriminative learning).
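The test-phase rule used in both applications (compute per-class reconstruction-error images, smooth them, assign each location to the class with the smallest smoothed error) fits in a few lines. A sketch under the assumption that the error images have already been computed by sparse coding every patch on every class dictionary; the smoothing uses SciPy's gaussian_filter, and the value of σ is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def classify_by_smoothed_residual(error_images, sigma=5.0):
    """Pixel-wise classification from per-class reconstruction-error images.
    error_images: array of shape (n_classes, H, W); entry c holds the patch
    reconstruction error under class c's dictionary at each location.
    Each error image is smoothed with a Gaussian (nearby pixels are assumed to
    share a class), and each pixel gets the class with the least smoothed error."""
    smoothed = np.stack([gaussian_filter(e, sigma=sigma) for e in error_images])
    return np.argmin(smoothed, axis=0)     # H x W label map
```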