Sparse Coding and Its Extensions for Visual Recognition
Kai Yu, Media Analytics Department, NEC Labs America, Cupertino, CA

Visual Recognition is HOT in Computer Vision
• Caltech 101, PASCAL VOC, 80 Million Tiny Images, ImageNet

The pipeline of machine visual perception
• Low-level sensing → preprocessing → feature extraction → feature selection → inference (prediction, recognition)
• Most machine-learning effort goes into the inference step
• Feature extraction and selection are:
  – Most critical for accuracy
  – Account for most of the computation
  – Most time-consuming in the development cycle
  – Often hand-crafted in practice

Computer vision features
• SIFT, HoG, Spin image, RIFT, GLOH
Slide credit: Andrew Ng

Learning everything from data
• Apply machine learning to the whole pipeline: low-level sensing, preprocessing, feature extraction, feature selection, and inference

BoW + SPM Kernel
• Bag-of-visual-words representation (BoW) based on vector quantization (VQ)
• Spatial pyramid matching (SPM) kernel
• Combining multiple features, this method had been the state of the art on Caltech-101, PASCAL, 15 Scene Categories, …
Figure credit: Fei-Fei Li, Svetlana Lazebnik

Winning method in PASCAL VOC before 2009
• Multiple feature sampling methods → multiple visual descriptors → VQ coding, histogram, SPM → nonlinear SVM

Convolutional Neural Networks
• Conv. filtering → pooling → conv. filtering → pooling
• The architectures of some successful methods are not so different from CNNs

BoW + SPM: the same architecture
• Local gradients → pooling (e.g., SIFT, HOG) → VQ coding → average pooling (obtain histogram) → nonlinear SVM
• Observations:
  – Nonlinear SVM is not scalable
  – VQ coding may be too coarse
  – Average pooling is not optimal
  – Why not learn the whole thing?

Develop better methods
• Better coding → better pooling → scalable linear classifier

Sparse Coding
• Sparse coding (Olshausen & Field, 1996): originally developed to explain early visual processing in the brain (edge detection)
• Training: given a set of random patches x, learn a dictionary of bases [φ1, φ2, …]
• Coding: for a data vector x, solve LASSO to find the sparse coefficient vector a

Sparse Coding Example
• Bases (φ1, …, φ64) learned from natural images look like "edges"
• Test example: x ≈ 0.8 * φ36 + 0.3 * φ42 + 0.5 * φ63
• [a1, …, a64] = [0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0] (feature representation)
• Compact & easily interpretable
Slide credit: Andrew Ng

Self-taught Learning [Raina, Lee, Battle, Packer & Ng, ICML 07]
• Labeled training images: motorcycles vs. not motorcycles
• Features (bases) learned from unlabeled images
• Testing: what is this?
Slide credit: Andrew Ng

Classification result on Caltech 101 (9K images, 101 classes)
• 64%: SIFT VQ + nonlinear SVM
• 50%: pixel-level sparse coding + linear SVM

Sparse Coding on SIFT [Yang, Yu, Gong & Huang, CVPR 09]
• Local gradients → pooling (e.g., SIFT, HOG) → sparse coding → max pooling → scalable linear classifier
• Caltech-101: 64% for SIFT VQ + nonlinear SVM vs. 73% for SIFT sparse coding + linear SVM

What have we learned?
• Local gradients → pooling (e.g., SIFT, HOG) → sparse coding → max pooling → scalable linear classifier
1. Sparse coding is useful (why?)
2. A hierarchical architecture is needed
(A minimal code sketch of this coding-and-pooling pipeline follows below.)
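To make the pipeline above concrete (local descriptors → sparse coding → max pooling → linear classifier), here is a minimal sketch. It is not the implementation from the CVPR 09 paper: the spatial pyramid is omitted, dictionary learning and LASSO coding are delegated to scikit-learn's MiniBatchDictionaryLearning as a stand-in, and the descriptor arrays (X_train_descs, y_train), the sample size, and the helper names learn_dictionary / encode_image are illustrative assumptions.

```python
# A minimal sketch of the "sparse coding + max pooling + linear classifier"
# pipeline, assuming SIFT-like local descriptors have already been extracted.
# Dictionary learning and LASSO coding use scikit-learn as stand-ins.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

def learn_dictionary(descriptors, n_bases=1024, sparsity=1.0):
    """Learn an over-complete dictionary [phi_1, ..., phi_K] from a random
    sample of local descriptors (one descriptor per row)."""
    learner = MiniBatchDictionaryLearning(
        n_components=n_bases,
        alpha=sparsity,                    # L1 penalty used during learning
        transform_algorithm="lasso_lars",  # LASSO is also used at coding time
        transform_alpha=sparsity,
    )
    return learner.fit(descriptors)

def encode_image(learner, image_descriptors):
    """Code each descriptor against the dictionary (sparse LASSO codes),
    then max-pool over the image to get a single feature vector."""
    codes = learner.transform(image_descriptors)  # shape: (n_desc, n_bases)
    return np.abs(codes).max(axis=0)              # max pooling over the image

# Usage sketch (X_train_descs is a list of per-image descriptor arrays and
# y_train the labels; both are assumed to exist):
# learner  = learn_dictionary(np.vstack(X_train_descs)[:100000])
# features = np.array([encode_image(learner, d) for d in X_train_descs])
# clf      = LinearSVC(C=1.0).fit(features, y_train)
```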
MNIST Experiments
• Error rates: 4.54%, 3.75%, 2.64%
• When SC achieves the best classification accuracy, the learned bases look like digits – each basis has a clear local class association

Distribution of coefficients (SIFT, Caltech101)
• Neighboring bases tend to get nonzero coefficients

Interpretation 1: discover subspaces
• Each basis is a "direction"
• Sparsity: each datum is a linear combination of only several bases
• Related to topic models

Interpretation 2: geometry of the data manifold
• Each basis is an "anchor point"
• Sparsity is induced by locality: each datum is a linear combination of neighboring anchors

A Function Approximation View to Coding
• Setting: f(x) is a nonlinear feature extraction function on image patches x
• Coding: a nonlinear mapping x → a; typically a is high-dimensional & sparse
• Nonlinear learning: f(x) = <w, a>
• A coding scheme is good if it helps learning f(x)

A Function Approximation View to Coding – the general formulation
• Function approximation error ≤ an unsupervised learning objective

Local Coordinate Coding (LCC) [Yu, Zhang & Gong, NIPS 09; Wang, Yang, Yu, Lv & Huang, CVPR 10]
• Dictionary learning: k-means (or hierarchical k-means)
• Coding for x, to obtain its sparse representation a:
  – Step 1 – ensure locality: find the K nearest bases
  – Step 2 – ensure low coding error

Super-Vector Coding (SVC) [Zhou, Yu, Zhang & Huang, ECCV 10]
• Dictionary learning: k-means (or hierarchical k-means)
• Coding for x, to obtain its sparse representation a:
  – Step 1 – find the nearest basis of x, obtain its VQ coding, e.g., [0, 0, 1, 0, …]
  – Step 2 – form the super-vector coding, e.g., [0, 0, 1, 0, …, 0, 0, (x − m3), 0, …] (zero-order part + local tangent part)
(A code sketch of the LCC and SVC coding steps is given just before the conclusion.)

Function Approximation based on LCC [Yu, Zhang & Gong, NIPS 09]
• Locally linear approximation built from data points and bases

Function Approximation based on SVC [Zhou, Yu, Zhang & Huang, ECCV 10]
• Piecewise local linear (first-order) approximation built from data points, cluster centers, and local tangents

PASCAL VOC Challenge 2009
• No. 1 for 18 of 20 categories (ours vs. the best of the other teams)
• We used only the HOG feature on gray images

ImageNet Challenge 2010 (1.4 million images, 1000 classes; classification accuracy measured as top-5 hit rate)
• ~40%: VQ + intersection kernel
• 64%–73%: various coding methods + linear SVM

Hierarchical sparse coding [Yu, Lin & Lafferty, CVPR 11]
• Learning from unlabeled data
• Conv. filtering → pooling → conv. filtering → pooling
• A two-layer sparse coding formulation:
    min_{W, α} L(W, α) + γ ‖α‖_1,  subject to α ≥ 0
    L(W, α) = (1/2n) ‖X − B W‖_F² + (λ/n) Σ_{i=1..n} w_iᵀ Ω(α) w_i
  with Σ(α) = Σ_{k=1..q} α_k φ_k, Ω(α) = Σ(α)⁻¹, and S(W) = (1/n) Σ_{i=1..n} w_i w_iᵀ,
  so the second term can be written as λ tr(Ω(α) S(W));
  here W = (w_1 w_2 ··· w_n) ∈ R^{p×n} and α ∈ R^q.

MNIST results – classification
• HSC vs. CNN: HSC provides even better performance than CNN
• More amazingly, HSC learns its features in an unsupervised manner!

MNIST results – effect of hierarchical learning
• Comparing the Fisher scores of HSC and SC
• Discriminative power is significantly improved by HSC, although HSC is unsupervised coding

MNIST results – learned codebook
• One dimension in the second layer: invariance to translation, rotation, and deformation

Caltech101 results – classification
• The learned descriptor performs slightly better than SIFT + SC
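Before the conclusion, here is a small sketch of the two coding steps described on the LCC and super-vector coding slides, assuming the dictionary `centers` comes from k-means as in those slides. The LCC function keeps only the "K nearest bases + least squares" idea and omits the locality-regularized objective of the NIPS 09 paper; the parameters K and s and the function names are illustrative, not from the papers.

```python
# Sketches of two coding schemes built on a k-means dictionary:
# local coordinate coding (simplified) and super-vector coding.
import numpy as np

def lcc_code(x, centers, K=5):
    """Step 1 (locality): pick the K nearest bases.
    Step 2 (low coding error): least-squares fit of x on those bases."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:K]                     # indices of K nearest bases
    coeffs, *_ = np.linalg.lstsq(centers[nearest].T, x, rcond=None)
    a = np.zeros(len(centers))
    a[nearest] = coeffs                              # sparse: nonzero only on neighbors
    return a

def super_vector_code(x, centers, s=1.0):
    """Step 1: VQ - find the nearest center m_k.
    Step 2: super vector - VQ indicator block followed by the local tangent
    (x - m_k) placed in the k-th block, zeros elsewhere."""
    k = np.argmin(((centers - x) ** 2).sum(axis=1))
    n, d = centers.shape
    vq = np.zeros(n)
    vq[k] = s                                        # zero-order (VQ) part
    tangent = np.zeros(n * d)
    tangent[k * d:(k + 1) * d] = x - centers[k]      # local tangent part
    return np.concatenate([vq, tangent])

# Usage sketch: centers = k-means cluster centers of sampled descriptors,
# e.g., sklearn.cluster.KMeans(n_clusters=...).fit(descs).cluster_centers_
```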
Conclusion and Future Work
• A "function approximation" view to derive novel sparse coding methods
• Locality is one way to achieve sparsity, and it is really useful; but we need a deeper understanding of feature learning methods
• Interesting directions:
  – Hierarchical coding
  – Deep learning (many papers now!)
  – Faster methods for sparse coding (e.g., from LeCun's group)
  – Learning features from a richer structure of data, e.g., video (learning invariance to out-of-plane rotation)

References
• Learning Image Representations from Pixel Level via Hierarchical Sparse Coding, Kai Yu, Yuanqing Lin, John Lafferty. CVPR 2011.
• Large-scale Image Classification: Fast Feature Extraction and SVM Training, Yuanqing Lin, Fengjun Lv, Liangliang Cao, Shenghuo Zhu, Ming Yang, Timothee Cour, Thomas Huang, Kai Yu. CVPR 2011.
• ECCV 2010 Tutorial, Kai Yu, Andrew Ng (with links to some source codes).
• Deep Coding Networks, Yuanqing Lin, Tong Zhang, Shenghuo Zhu, Kai Yu. NIPS 2010.
• Image Classification using Super-Vector Coding of Local Image Descriptors, Xi Zhou, Kai Yu, Tong Zhang, Thomas Huang. ECCV 2010.
• Efficient Highly Over-Complete Sparse Coding using a Mixture Model, Jianchao Yang, Kai Yu, Thomas Huang. ECCV 2010.
• Improved Local Coordinate Coding using Local Tangents, Kai Yu, Tong Zhang. ICML 2010.
• Supervised Translation-Invariant Sparse Coding, Jianchao Yang, Kai Yu, Thomas Huang. CVPR 2010.
• Learning Locality-Constrained Linear Coding for Image Classification, Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang. CVPR 2010.
• Nonlinear Learning using Local Coordinate Coding, Kai Yu, Tong Zhang, Yihong Gong. NIPS 2009.
• Linear Spatial Pyramid Matching using Sparse Coding for Image Classification, Jianchao Yang, Kai Yu, Yihong Gong, Thomas Huang. CVPR 2009.