BEYOND SPATIAL PYRAMIDS: RECEPTIVE FIELD LEARNING FOR POOLED IMAGE FEATURES

Yangqing Jia (1), Chang Huang (2), Trevor Darrell (1)
(1) UC Berkeley EECS & ICSI  (2) NEC Labs America
{jiayq,trevor}@berkeley.edu  chuang@sv.nec-labs.com

1. CONTRIBUTIONS

The key contributions of our work are:
• An analysis of spatial receptive field (RF) designs for pooled features.
• Evidence that spatial pyramids may be suboptimal for feature generation.
• An algorithm that jointly learns adaptive RFs and the classifier, with an efficient implementation using over-completeness and structured sparsity.

2. THE PIPELINE

State-of-the-art classification algorithms follow a two-layer pipeline: the coding layer learns activations from local image patches, and the pooling layer aggregates activations over multiple spatial regions. Linear classifiers are learned from the pooled features.

[Figure: pipeline diagram. An input image I is coded into activation maps A_1, ..., A_K; pooling with parameters (c_1, R_1), ..., (c_M, R_M) produces the feature vector x; a classifier f(x) predicts the label, e.g. "BEAR".]

3. NEUROSCIENCE INSPIRATION

[Figure: analogy with the visual pathway [Hubel & Wiesel 1962]: the LGN performs whitening, simple cells in V1 perform (sparse) coding, and complex cells in V1 perform spatial pooling.]

4. SPATIAL POOLING REVISITED

• Much work has been done on the coding step, while the spatial pooling methods are often hand-crafted.
• Sample performance on CIFAR-10 with different receptive field designs (with a dictionary of size 200):

[Figure: bar chart comparing 2x2 average pooling, 4x4 max pooling, SPM, random RF selection, and our algorithm; accuracies are 70.24, 72.24, 74.83, 75.41, and 76.57, with our algorithm best.]

• Note the suboptimality of SPM: random selection from an overcomplete set of spatially pooled features consistently outperforms SPM.
• We propose to learn the spatial receptive fields as well as the codes and the classifier.

5. NOTATIONS

• I: the image input.
• A_1, ..., A_K: code activations as matrices, with A^k_{ij} the activation of code k at position (i, j).
• op(·): a pooling operator, such as max(·).
• R_i: the RF of the i-th pooled feature.
• f(x; θ): the classifier based on the pooled features x.
• A pooled feature x_i is defined by choosing a code indexed by c_i and a spatial RF R_i: x_i = op(A^{c_i}_{R_i}). The vector of pooled features x is then determined by the set of parameters C = {c_1, ..., c_M} and R = {R_1, ..., R_M}.

6. THE LEARNING PROBLEM

• Given a set of training data {(I_n, y_n)}_{n=1}^N, we jointly learn the classifier and the pooled features (assuming that coding is done in an unsupervised way):

  min_{C,R,θ} (1/N) Σ_{n=1}^N l(f(x_n; θ), y_n) + λ Reg(θ),

  where x_{n,i} = op(A^{c_i}_{n,R_i}), i.e. the activations of code c_i of image I_n pooled over region R_i.
• Advantage: the pooled features are tailored to the classification task (which also reduces redundancy).
• Disadvantage: the problem may be intractable, since there is an exponential number of possible receptive fields.
• Solution: a reasonably overcomplete set of receptive field candidates, plus sparsity constraints to control the number of final features.

7. OVERCOMPLETE RF

• We propose to use overcomplete receptive field candidates based on regular grids:

[Figure: RF candidates over a regular grid: (a) Base, (b) SPM, (c) Ours.]

• Structured sparsity regularization is adopted to select only a subset of features for classification:

  min_{W,b} (1/N) Σ_{n=1}^N l(W x_n + b, y_n) + (λ_1/2) ||W||²_Fro + λ_2 ||W||_{1,∞},

  where ||W||_{1,∞} = Σ_{i=1}^M max_{j∈{1,...,L}} |W_{ij}|.
• Benefit of overcompleteness in spatial pooling + feature selection: higher performance with smaller codebooks and lower feature dimensions.

8. GREEDY FEATURE SELECTION

• Directly performing the optimization is still time- and memory-consuming.
• Following [Perkins JMLR03], we adopt an incremental, greedy approach ("grafting") that selects features based on their scores (see the sketch below):

  score(x_i) = ||∂L(W, b)/∂W_{i,·}||²_Fro

• After each increment, the model is retrained only with respect to the active subset S_A of selected features to ensure fast retraining:

  (W^{(t+1)}_{S_A,·}, b^{(t+1)}) = argmin_{W_{S_A,·}, b} L(W, b)
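Below is a minimal NumPy sketch of the two computational pieces above: building the overcomplete pooled features x_i = op(A^{c_i}_{R_i}) from regular-grid RF candidates, and grafting-style greedy selection with the gradient-norm score. It assumes a 4x4 base grid, max pooling, and a softmax classifier trained by plain gradient descent; the function names, toy data, and hyperparameters are illustrative rather than taken from the authors' implementation, and the ||W||_{1,∞} term is approximated here simply by capping the number of selected features.

import numpy as np

def overcomplete_rf_candidates(grid=4):
    # All axis-aligned rectangles over a grid x grid base partition;
    # these are the overcomplete RF candidates (100 for a 4x4 grid).
    return [(t, b, l, r)
            for t in range(grid) for b in range(t + 1, grid + 1)
            for l in range(grid) for r in range(l + 1, grid + 1)]

def pool_features(A, rfs):
    # x_i = op(A^{c_i}_{R_i}) with op = max: max-pool every code's
    # activation map over every candidate RF.
    # A: (K, grid, grid) activations already reduced to the base grid.
    return np.concatenate(
        [A[:, t:b, l:r].max(axis=(1, 2)) for (t, b, l, r) in rfs])

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def graft(X, Y, n_select=50, lam1=1e-4, lr=0.5, inner=200):
    # Greedy grafting [Perkins JMLR03]: repeatedly add the feature whose
    # gradient row has the largest squared norm (the score above), then
    # retrain only over the active subset S_A of selected features.
    N, M = X.shape
    W, b = np.zeros((M, Y.shape[1])), np.zeros(Y.shape[1])
    active = []
    for _ in range(n_select):
        P = softmax(X @ W + b)
        G = X.T @ (P - Y) / N + lam1 * W        # full gradient w.r.t. W
        scores = (G ** 2).sum(axis=1)           # score(x_i) = ||G_i||^2
        scores[active] = -np.inf                # skip already-selected rows
        active.append(int(np.argmax(scores)))
        Xa = X[:, active]                       # retrain active rows only
        for _ in range(inner):
            P = softmax(Xa @ W[active] + b)
            W[active] -= lr * (Xa.T @ (P - Y) / N + lam1 * W[active])
            b -= lr * (P - Y).mean(axis=0)
    return W, b, active

# Toy usage: N=64 images, K=8 codes with random activations on a 4x4 grid.
rng = np.random.default_rng(0)
rfs = overcomplete_rf_candidates(4)
X = np.stack([pool_features(rng.random((8, 4, 4)), rfs) for _ in range(64)])
Y = np.eye(4)[rng.integers(0, 4, size=64)]      # one-hot labels, 4 classes
W, b, active = graft(X, Y, n_select=20)

In the real setting the candidate pool has tens of thousands of pooled features, so computing the scores once per increment and re-optimizing only the |S_A| selected rows is what keeps the procedure tractable.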
9. RESULTS

• Performance comparison on CIFAR-10 with state-of-the-art approaches:

  Method                 Pooled Dim   Accuracy (%)
  ours, d=1600           6,400        80.17
  ours, d=4000           16,000       82.04
  ours, d=6000           24,000       83.11
  Coates 2010, d=1600    6,400        77.9
  Coates 2010, d=4000    16,000       79.6
  Coates 2011, d=6000    48,000       81.5
  Krizhevsky TR'10       N/A          78.9
  Yu ICML'10             N/A          74.5
  Ciresan Arxiv'11       N/A          80.49
  Coates NIPS'11         N/A          82.0

• Results on MNIST, and the 1-vs-1 saliency maps obtained from our algorithm:

  Method            err%
  Coates ICML'11    1.02
  Our Method        0.64
  Lauer PR'07       0.83
  Labusch TNN'08    0.59
  Ranzato CVPR'07   0.62
  Jarrett ICCV'09   0.53

[Figure: 1-vs-1 saliency maps on MNIST digit pairs.]

10. REFERENCES

• A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. ICML, 2011.
• S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. JMLR, 3:1333-1356, 2003.
• D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. of Physiology, 160(1):106-154, 1962.