An Introduction to Sparse Coding, Sparse Sensing, and Optimization
Speaker: Wei-Lun Chao
Date: Nov. 23, 2011
DISP Lab, Graduate Institute of Communication Engineering, National Taiwan University

Outline
• Introduction
• The fundamentals of optimization
• The idea of sparsity: coding vs. sensing
• The solution
• The importance of the dictionary
• Applications

Introduction

Introduction
• What is sparsity? (projection bases, reconstruction bases)
• Usage: compression, analysis, representation, fast / sparse sensing

Introduction
• Why do we use the Fourier transform and its modifications for image and acoustic compression?
  - Differentiability (theoretical)
  - Intrinsic sparsity (data-dependent)
  - Human perception (human-centric)
• Are there better bases for compression or representation? Wavelets. How about data-dependent bases? How about learning?

Introduction
• Optimization
  - Frequently faced in algorithm design
  - Used to implement your creative ideas
• Issue
  - What kinds of mathematical forms, and which corresponding optimization algorithms, guarantee convergence to local or global optima?

The Fundamentals of Optimization

A Warming-up Question
• How do you solve the following problems?
(1) [figure: a curve with several local minima and one global minimum]
(2) $\min_w f(w) = (w-5)^2$
  (a) Plot the function.
  (b) Take the derivative and check where it equals 0.

An Advanced Question
• How about the following problems?
(3) $\min_w \sum_{n=1}^{N} f_n(w)$ — (a) Plot? (b) Set the derivative to 0?
(4) $\min_w f(w) = (w-5)^2$ s.t. $w \le 3$ — derivative? [figure: the parabola with the points 3 and 5 marked]
(5) $\min_w \sum_{n=1}^{N} f_n(w)$ s.t. $w_i \le b_i$ — how to solve this?

Illustration
• 2-D case: $\min_{w_1, w_2} f(w_1, w_2)$ s.t. $g(w_1, w_2) = b$
[figure: contour lines $f(w_1, w_2) = 1, \dots, 6$ in the $(w_1, w_2)$ plane, together with the constraint curve $g(w_1, w_2) = b$]

How to Solve?
• Thanks to:
  - Lagrange multipliers
  - Linear programming, quadratic programming, and recently, convex optimization
• Standard form:
$\min_w f_0(w)$
s.t. $h_i(w) = b_i,\ i = 1, \dots, m$
s.t. $g_i(w) \le c_i,\ i = 1, \dots, n$

Fallacy
• A quadratic programming problem with constraints:
$\min_x \|Ax - b\|^2$, with $A = [a_1\ a_2\ \cdots\ a_N]$ and $x = [x_1, \dots, x_N]^T$
($x$: the importance of each food; $b$: personal nutrient need; each column $a_i$: the nutrient content of one food)
(1) Take the derivative, $x = (A^T A)^{-1} A^T b$, then keep only the $x_i \ge 0$ — wrong
(2) Quadratic programming, $\arg\min_x \|Ax - b\|^2$ s.t. $x_i \ge 0$ — correct
(3) Sparse coding — correct

The Idea of Sparsity

What is Sparsity?
• Think about a problem: $\min_x \|Ax - b\|^2$, with $A = [a_1\ a_2\ \cdots\ a_N] \in R^{d \times N}$ and $b \in R^d$; assume full rank and $N > d$.
• Many $x$ achieve $\min_x \|Ax - b\|^2 = 0$. Which one do you want?
• Choose the $x$ with the fewest nonzero components:
$\arg\min_x \|x\|_0$ s.t. $\|Ax - b\|^2 = 0$

Why Sparsity?
• The more concise, the better.
• In some domains there naturally exists a sparse latent vector that controls the data we observe (e.g., MRI, music):
$b = [a_1\ a_2\ \cdots\ a_d]\,x\ (+\ \text{noise})$, where $x$ has only a few nonzero entries (say $x_i$ and $x_j$).
• A k-sparse domain means that each $b$ can be constructed from an $x$ vector with at most k nonzero elements.
• In some domains, samples from the same class have the sparse property.
• The domain can be learned.

Sparse Sensing vs. Sparse Coding
• Assume that we have $A \in R^{d \times N}$, $N > d$.
• Sparse coding: an observation $b \in R^d$ comes in, with $b = Ax$ and $x$ sparse:
$x^* = \arg\min_x \|x\|_0$ s.t. $\|Ax - b\|^2 = 0$
• Sparse sensing: we only measure $y = Wb \in R^p$, with $W \in R^{p \times d}$ and $p < d$; then $y = Wb = WAx = Qx$ with $x$ sparse:
$x^{**} = \arg\min_x \|x\|_0$ s.t. $\|Qx - y\|^2 = 0$
• Note: p is based on the sparsity of the data (on k).

Sparse Sensing
[figure: $b \in R^d$ with $b = Ax$, $x$ sparse, is compressed to $y = Wb \in R^p$, $W \in R^{p \times d}$, $p < d$; since $y = Wb = WAx = Qx$ with $x$ sparse, $x^{**}$ can be recovered from $y$]
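To make the coding/sensing setup above concrete, here is a minimal numpy sketch (all sizes, names, and the random dictionary are illustrative assumptions, not from the slides). It builds an over-complete $A$ and a k-sparse $x$, then recovers $x$ both from $b$ (coding) and from the compressed measurement $y = Wb$ (sensing) by exhaustive search over supports — which, as the next section notes, is the only general l0 strategy, so it is feasible only for tiny problems.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

d, N, k = 8, 16, 2                 # signal dim, dictionary size, sparsity (toy sizes)
A = rng.standard_normal((d, N))    # over-complete dictionary, N > d

x_true = np.zeros(N)               # k-sparse latent vector
x_true[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
b = A @ x_true                     # observation b = A x

def l0_recover(A, b, k_max, tol=1e-9):
    """Brute-force l0 'solver': try every support of size 1..k_max,
    smallest first, and return the first one that fits b exactly."""
    N = A.shape[1]
    for size in range(1, k_max + 1):
        for S in itertools.combinations(range(N), size):
            cols = A[:, list(S)]
            z, *_ = np.linalg.lstsq(cols, b, rcond=None)
            if np.linalg.norm(cols @ z - b) < tol:
                x = np.zeros(N)
                x[list(S)] = z
                return x
    return None

# Sparse coding: recover x directly from b.
x_hat = l0_recover(A, b, k)

# Sparse sensing: observe only y = W b (p < d), recover x from Q = W A.
p = 6
W = rng.standard_normal((p, d))
x_hat2 = l0_recover(W @ A, W @ b, k)

print(np.allclose(x_hat, x_true), np.allclose(x_hat2, x_true))
```

Note how the sensing branch never touches $b$ itself: p = 6 measurements suffice here because recovery depends on the sparsity k, not on the ambient dimension d.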
Sparse Sensing vs. Sparse Coding
• Sparse sensing (compressed sensing): acquiring $b$ directly costs much time or money, so acquire $y$ first and then recover $b$ from it.
• Sparse coding (sparse representation): believe that the sparse property exists in the data; otherwise, sparse representation means nothing.
  - $x$ is used as the feature of $b$.
  - $x$ can be used to efficiently store $b$ and to reconstruct $b$.

The Solution

How to Get the Sparse Solution?
• There is no algorithm other than exhaustive search to solve:
$x^* = \arg\min_x \|x\|_0$ s.t. $\|Ax - b\|^2 = 0$
• But in some situations (e.g., special forms of $A$), the solution of l1 minimization approaches that of l0 minimization:
$x^{***} = \arg\min_x \|x\|_1 = \sum_{n=1}^{N} |x(n)|$ s.t. $\|Ax - b\|^2 = 0$, with $x^{***} = x^*$

Why l1?
• Question 1: Why can l1 result in a sparse solution?
$\arg\min_x \|x\|_1$ s.t. $\|Ax - b\|^2 = 0$ ↔ $\arg\min_x \|Ax - b\|^2$ s.t. $\|x\|_1 \le c$
[figure: in the $(w_1, w_2)$ plane, the line $Ax = b$ touches the l1 ball $\|x\|_1 \le c$ at a corner (a sparse point), whereas the l2 ball $\|x\|_2 \le c$ touches it at a generic, non-sparse point]

Why l1?
• Question 2: Why does the sparse solution achieved by l1 minimization approach the one of l0 minimization?
  - This is a matter of mathematics.
  - In any case, sparse representation based on l1 minimization has been widely used for pattern recognition.
  - In addition, if one doesn't care about using the sparse solution as a representation (feature), it seems fine if the two solutions are not the same: both reconstruct the data, $b = Ax^{***}$ and $b = Ax^*$.

Noise
• Sometimes the data is observed with noise: $\tilde{b} = b + \text{noise}$, where $b = [a_1\ a_2\ \cdots\ a_d]\,x^*$ with $x^*$ sparse. Does l0 (l1) minimization on $\tilde{b}$ still recover $x^*$?
• The answer seems to be negative. Let $y^* = \arg\min_y \|y\|_1$ s.t. $\|Ay - \text{noise}\|^2 = 0$; then $y^* + x^* \ne x^*$, and $y^*$ is usually not sparse.
• Moreover, $\tilde{x} = \arg\min_x \|x\|_1$ s.t. $\|Ax - \tilde{b}\|^2 = 0$ is possibly not sparse, and is neither equal to $y^* + x^*$ nor to $x^*$.

Noise
• Several ways to overcome this:
  - $\arg\min_x \|x\|_1$ s.t. $\|Ax - b\|^2 = 0$ → $\arg\min_x \|x\|_1$ s.t. $\|Ax - b\|^2 \le c$
  - $\arg\min_x \|x\|_1$ s.t. $\|Ax - b\|_1 \le c$
  - $\arg\min_x \|x\|_1$ s.t. $\|Ax - b\|^2 = 0$ → $\arg\min_z \|z\|_1$ s.t. $\|[A\ |\ I]\,z - b\|^2 = 0$, where $z = [x;\ t]$ and $t$ absorbs the noise
• What is the difference between $\|Ax - b\|^2 \le c$ and $\|Ax - b\|_1 \le c$?

Equivalent forms
• You may also see several forms of the problem:
$\arg\min_x \|x\|_1$ s.t. $\|Ax - b\|_1 \le c$ ↔ $\arg\min_x \|x\|_1 + \lambda\|Ax - b\|_1$ ↔ $\arg\min_x \|Ax - b\|_1$ s.t. $\|x\|_1 \le d$
• These equivalent forms are derived from Lagrange multipliers.
• There have been several publications on how to solve the l1 minimization problem.

The Importance of the Dictionary

Dictionary generation
• In the preceding sections, we generally assumed that the (over-complete) basis $A$ exists and is known.
• In practice, however, we usually need to build it:
  - Wavelet + Fourier + Haar + ......
  - Learning based on data
• How to learn? Given a training set $\{b^{(i)} \in R^d\}_{i=1}^{N}$, form $B = [b^{(1)}\ b^{(2)}\ \cdots\ b^{(N)}]$ and solve
$(A^*, X^*) = \arg\min_{A, X} \|B - AX\|_F^2 + \lambda\|X\|_1$, where $X = [x^{(1)}\ x^{(2)}\ \cdots\ x^{(N)}]$
• This may result in over-fitting.

Applications

Back to the problem we have
• The quadratic programming problem with constraints from before:
$\min_x \|Ax - b\|^2$ ($x$: the importance of each food; $b$: personal nutrient need; columns of $A$: the nutrient content of each food)
(1) Take the derivative, $x = (A^T A)^{-1} A^T b$, then keep only the $x_i \ge 0$ — wrong
(2) Quadratic programming, $\arg\min_x \|Ax - b\|^2$ s.t. $x_i \ge 0$ — correct
(3) Sparse coding — correct

Face Recognition (1)
[figure]

Face Recognition (2)
[figure]

An important issue
• When using sparse representation for feature extraction, you may wonder: even if the sparsity property exists in the data, does the sparse feature really lead to better results? Does it carry any semantic meaning?
• Successful areas:
  - Face recognition
  - Digit recognition
  - Object recognition (with careful design): e.g., K-means → sparse representation
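Before the patch-based applications below, here is a sketch of how the Lagrangian form above is commonly minimized in practice. It implements ISTA (iterative soft-thresholding) for the l2-residual variant $\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$; choosing ISTA, the problem sizes, and $\lambda$ are illustrative assumptions, not something the slides prescribe.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    """Minimize 0.5*||Ax - b||_2^2 + lam*||x||_1 by alternating a
    gradient step on the smooth data term with soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

# Toy usage: recover a sparse code from a noisy observation.
rng = np.random.default_rng(1)
d, N = 20, 50
A = rng.standard_normal((d, N))
x_true = np.zeros(N)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
b = A @ x_true + 0.01 * rng.standard_normal(d)   # noisy observation
x_hat = ista(A, b, lam=0.05)
```

The soft-threshold step is what produces sparsity: it sets every coefficient whose gradient-updated value is smaller than $\lambda / L$ exactly to zero, rather than merely shrinking it.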
De-noising
• Learn a patch dictionary.
• For each patch, compute the sparse representation, then use it to reconstruct the patch:
$x^* = \arg\min_x \|x\|_1 + \lambda\|Ax - \tilde{b}\|_1$, then $\hat{b} = Ax^*$

Detection based on reconstruction
• Learn a patch dictionary for a specific object.
• For each patch in the image, compute the sparse representation and use it to reconstruct the patch:
$x^* = \arg\min_x \|x\|_1 + \lambda\|Ax - b\|_1$, then $\hat{b} = Ax^*$, and check $\|b - \hat{b}\|^2$
• Identify the patches with small error as the detected object.
• The dictionary here may not need to be over-complete.
• Other cases: foreground-background detection, pedestrian detection, ......

Conclusion

What you should know
• What is the form of standard optimization?
• What is sparsity?
• What are sparse coding and sparse sensing?
• What kinds of optimization methods solve them?
• Try to use it!!

Thank you for listening
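As a closing sketch tying the two patch-based applications together — hedged: it reuses the illustrative ista() defined earlier, swaps the slides' l1 residual for the l2 residual that ISTA handles, and leaves the dictionary-learning step out entirely.

```python
import numpy as np

def reconstruction_error(A, b, lam=0.1):
    """Sparse-code a patch b against dictionary A, reconstruct it,
    and return the reconstruction error ||b - A x*||_2."""
    x_star = ista(A, b, lam)           # ista() from the earlier sketch
    return np.linalg.norm(b - A @ x_star)

def detect(A, patches, threshold):
    """Detection by reconstruction: patches of the object the dictionary
    was learned from reconstruct with small error; background patches
    do not. Returns the indices of patches flagged as the object."""
    return [i for i, b in enumerate(patches)
            if reconstruction_error(A, b) < threshold]
```

For de-noising, one would instead keep the reconstruction $A x^*$ itself as the cleaned patch; the sparse code cannot represent the dense noise component, which is exactly why the noise is suppressed.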