Learning Mid-Level Features For Recognition

Y-Lan Boureau, Francis Bach, Yann LeCun and Jean Ponce
Published in CVPR 2010
Presented by Bo Chen, August 20, 2010
Outline
• 1. Classification System
• 2. Brief introduction of each step
• 3. Systematic evaluation of unsupervised mid-level features
• 4. Learning discriminative dictionaries
• 5. Average and max pooling
• 6. Conclusions
System Flow Chart
Patches: dense sliding patches, subsampled to cover the image's details
SIFT: robust low-level features, invariant to some imaging conditions
Coding: generate mid-level features, e.g. via sparse coding, vector quantization, or a deep network
Pooling: max pooling or average pooling
SPM: spatial pyramid model
Classifier: linear or nonlinear SVM
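A minimal end-to-end sketch of how these stages might fit together (not the authors' code): it uses raw flattened patches as a stand-in for SIFT, hard vector quantization against a given dictionary, and max pooling over the cells of a two-level spatial pyramid; all function names and parameters are illustrative.

import numpy as np

def dense_descriptors(image, patch=16, step=8):
    """Slide a patch x patch window with the given step and flatten each patch
    into a raw descriptor (a stand-in for SIFT in this sketch)."""
    H, W = image.shape
    descs, centers = [], []
    for y in range(0, H - patch + 1, step):
        for x in range(0, W - patch + 1, step):
            descs.append(image[y:y+patch, x:x+patch].ravel())
            centers.append((y + patch // 2, x + patch // 2))
    return np.array(descs), np.array(centers)

def hard_vq(descs, dictionary):
    """Hard vector quantization: one-hot code of the nearest dictionary atom."""
    d2 = ((descs[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    codes = np.zeros((len(descs), len(dictionary)))
    codes[np.arange(len(descs)), d2.argmin(1)] = 1.0
    return codes

def spm_max_pool(codes, centers, shape, levels=(1, 2)):
    """Max-pool codes inside each cell of a spatial pyramid and concatenate."""
    H, W = shape
    feats = []
    for g in levels:                        # g x g grid of cells at each level
        for i in range(g):
            for j in range(g):
                in_cell = ((centers[:, 0] * g // H == i) &
                           (centers[:, 1] * g // W == j))
                cell = codes[in_cell]
                feats.append(cell.max(0) if len(cell) else np.zeros(codes.shape[1]))
    return np.concatenate(feats)

# Toy usage: one random "image" and a random dictionary of K atoms.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
D = rng.random((32, 16 * 16))               # K = 32 atoms over flattened patches
descs, centers = dense_descriptors(img)
feature = spm_max_pool(hard_vq(descs, D), centers, img.shape)
# `feature` would then be fed to a linear (or nonlinear) SVM.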
Scale Invariant Feature Transform (D. Lowe, IJCV, 2004)
Motivations: image matching; invariance to scale, rotation, illumination, and viewpoint
Figures from David Lee’s ppt
Calculate SIFT Descriptors
Divide a 16x16 patch into a 4x4 grid of subregions, with an 8-bin orientation histogram in each subregion, which gives a 4x4x8 = 128-dimensional descriptor (low-level feature; a rough sketch follows below).
Figures from Jason Clemons’s ppt
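A rough numpy sketch of the descriptor layout above, assuming a precomputed 16x16 gray patch; it builds the 4x4 grid of 8-bin gradient-orientation histograms but omits the Gaussian weighting, trilinear interpolation, and rotation normalization of the full SIFT descriptor.

import numpy as np

def sift_like_descriptor(patch):
    """4x4 grid of 8-bin gradient-orientation histograms over a 16x16 patch -> 128-D."""
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)            # orientation in [0, 2*pi)
    bins = np.minimum((ori / (2 * np.pi) * 8).astype(int), 7)
    desc = np.zeros((4, 4, 8))
    for r in range(16):
        for c in range(16):
            desc[r // 4, c // 4, bins[r, c]] += mag[r, c]  # vote by gradient magnitude
    desc = desc.ravel()
    return desc / (np.linalg.norm(desc) + 1e-8)            # L2-normalize

print(sift_like_descriptor(np.random.rand(16, 16)).shape)  # (128,)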
Notations
Question: How can we represent each region?
Figure from S. Lazebnik et al., CVPR 2006
Coding and Pooling
• Vector quantization (bag of features), or
• Sparse coding
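A small illustrative sketch contrasting the two coding schemes for one descriptor x and dictionary D (columns are atoms): hard vector quantization activates the single nearest atom, while sparse coding solves min_a 0.5*||x - D a||^2 + lambda*||a||_1, here with plain ISTA iterations; lambda and the iteration count are arbitrary choices for illustration.

import numpy as np

def hard_vq_code(x, D):
    """One-hot code: activate the single nearest dictionary atom (columns of D)."""
    a = np.zeros(D.shape[1])
    a[np.argmin(((D - x[:, None]) ** 2).sum(0))] = 1.0
    return a

def sparse_code(x, D, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2             # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)                  # gradient of the smooth term
        z = a - g / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((128, 256))            # 128-D descriptors, 256 atoms
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
x = rng.standard_normal(128)
print(hard_vq_code(x, D).sum())                # exactly one active atom
print((sparse_code(x, D) != 0).sum())          # a few active atoms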
Systematic evaluation of unsupervised mid-level features
Macrofeatures and Denser SIFT Sampling
Parameterizations (illustrated in the sketch below):
1. SIFT sampling density
2. macrofeature side length
3. subsampling parameter
Results:
Caltech-101: 75.7% (4, 2, 4)
Scenes: 84.3% (8, 2, 1)
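A rough sketch of how macrofeatures might be built under this parameterization, assuming SIFT descriptors already computed on a dense grid: descriptors in a side x side neighborhood are concatenated (and would then be encoded jointly), and macrofeatures are kept every `subsample` grid steps. The function name and array layout are illustrative, not the authors' implementation.

import numpy as np

def build_macrofeatures(grid_descs, side=2, subsample=1):
    """grid_descs: (rows, cols, 128) dense SIFT grid (grid spacing = sampling density).
    Concatenate each side x side block of neighboring descriptors into one
    macrofeature of dimension side*side*128, sampled every `subsample` grid steps."""
    rows, cols, d = grid_descs.shape
    macros = []
    for r in range(0, rows - side + 1, subsample):
        for c in range(0, cols - side + 1, subsample):
            macros.append(grid_descs[r:r+side, c:c+side, :].ravel())
    return np.array(macros)

# Toy usage: a 20x20 grid of 128-D descriptors, 2x2 macrofeatures, subsample 1.
grid = np.random.rand(20, 20, 128)
print(build_macrofeatures(grid, side=2, subsample=1).shape)   # (361, 512)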
Results
Discriminative Dictionaries
Algorithm: stochastic gradient descent
Cons: high computational complexity
Solutions (a simplified SGD sketch follows below):
1. Approximate the pooled feature z(n) by pooling over a random sample of ten locations of the image.
2. Update only a random subset of coordinates at each iteration.
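A much-simplified sketch of a stochastic gradient loop that uses the two approximations above. To stay short and self-contained it substitutes differentiable soft-assignment coding, average pooling, a logistic loss, and finite-difference gradients on the selected dictionary coordinates; the paper's actual objective (sparse coding inside the loss) and its gradients differ, so this only illustrates tricks 1 and 2.

import numpy as np

rng = np.random.default_rng(0)

def pooled_feature(D, X):
    """Soft-assignment codes of descriptors X, average-pooled into one vector."""
    d2 = ((X[:, None, :] - D[None, :, :]) ** 2).sum(-1)
    codes = np.exp(-d2)
    codes /= codes.sum(1, keepdims=True)
    return codes.mean(0)

def logistic_loss(D, w, X, y):                 # y in {-1, +1}
    return float(np.log1p(np.exp(-y * (w @ pooled_feature(D, X)))))

def sgd_step(D, w, descs, y, lr=0.1, n_loc=10, n_coords=20, eps=1e-4):
    # Trick 1: approximate the pooled feature from a random sample of locations.
    X = descs[rng.choice(len(descs), size=min(n_loc, len(descs)), replace=False)]
    # Analytic gradient for the linear classifier w.
    z = pooled_feature(D, X)
    w -= lr * (-y * z / (1.0 + np.exp(y * (w @ z))))
    # Trick 2: update only a random subset of dictionary coordinates
    # (finite differences here, purely for brevity).
    flat = D.ravel()                           # view into D
    for i in rng.choice(flat.size, size=n_coords, replace=False):
        old = flat[i]
        flat[i] = old + eps
        lp = logistic_loss(D, w, X, y)
        flat[i] = old - eps
        lm = logistic_loss(D, w, X, y)
        flat[i] = old - lr * (lp - lm) / (2 * eps)
    return D, w

# Toy usage: a few "images", each a bag of 50 random 8-D descriptors with a label.
data = [(rng.standard_normal((50, 8)), rng.choice([-1, 1])) for _ in range(6)]
D, w = rng.standard_normal((16, 8)), np.zeros(16)
for epoch in range(3):
    for descs, y in data:
        D, w = sgd_step(D, w, descs, y)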
Scenes dataset
Average and Max Pooling
• Why pooling?
Pooling is used to achieve invariance to image transformations, more compact representations, and better robustness to noise and clutter. The goal is to preserve important information while discarding irrelevant detail; the crux of the matter is deciding what falls into which category.
• Max-pooling vs. Average-pooling
The authors show that using max pooling on hard vector-quantized features in a spatial pyramid brings the performance of linear classification to the level obtained by Lazebnik et al. (2006) with an intersection kernel, even though the resulting feature is binary (see the pooling sketch below).
• Our feeling
Pooling helps keep the learned codes sparse, which mirrors human visual processing. Pooling seems especially necessary for convolutional deep networks, since neighboring contents are correlated.
These conclusions draw in part on Y-Lan Boureau et al., ICML 2010.
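As referenced above, a tiny numpy illustration of the difference on hard-quantized (one-hot) codes: averaging yields a normalized histogram of codeword counts, while taking the maximum yields a binary codeword-presence vector.

import numpy as np

rng = np.random.default_rng(0)
K, N = 8, 20                                   # codebook size, number of patches
codes = np.eye(K)[rng.integers(0, K, size=N)]  # N one-hot codes from hard VQ

avg_pooled = codes.mean(axis=0)                # normalized codeword histogram
max_pooled = codes.max(axis=0)                 # binary presence/absence vector

print(avg_pooled)   # fractions summing to 1
print(max_pooled)   # only 0s and 1s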
Theoretical Comparison of Average and Max Pooling
Experimental methodology: Binary classification (positive and negative)
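The slide's derivation is not reproduced here; the comparison rests on modeling each binary code component over a pooling region as i.i.d. Bernoulli variables (as in Boureau et al., ICML 2010). Under that assumption, for P pooled locations with activation probability alpha:
\[
h_{\mathrm{avg}} = \frac{1}{P}\sum_{i=1}^{P} v_i, \qquad
\mathbb{E}[h_{\mathrm{avg}}] = \alpha, \qquad
\mathrm{Var}(h_{\mathrm{avg}}) = \frac{\alpha(1-\alpha)}{P},
\]
\[
h_{\max} = \max_i v_i \sim \mathrm{Bernoulli}\bigl(1-(1-\alpha)^P\bigr), \qquad
\mathbb{E}[h_{\max}] = 1-(1-\alpha)^P .
\]
Class separability then depends on how far apart the class-conditional means of the pooled feature are, relative to its variance.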
Conclusions
• 1. Gives a comprehensive and systematic comparison across each step of mid-level feature extraction, covering several coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (average or maximum), and obtains state-of-the-art or better performance on several recognition benchmarks.
• 2. Proposes a supervised dictionary learning method for sparse coding.
• 3. Gives theoretical and empirical insight into the remarkable performance of max pooling.