Learning Convolutional Feature Hierarchies for Visual Recognition

Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau,
Karol Gregor, Michael Mathieu, Yann LeCun
NIPS 2010
Presented by Bo Chen
Outline
• 1. Drawbacks in the Traditional Convolutional
Methods
• 2. The Proposed Algorithm and Some Details
• 3. Experimental Results
• 4. Conclusions
Convolutional Sparse Coding
Drawbacks of patch-based sparse coding:
1. The representations of whole images are highly redundant
because the training and the inference are performed at the
patch level.
2. The inference for a whole image is computationally expensive.
Solutions
• 1. Introducing Convolution Operator
• 2. Introducing Nonlinear Encoder Module
Learning Convolutional Dictionaries
• 1. The Boundary Effects Due to Convolutions
Apply a mask to the derivatives of the reconstruction error, where the mask
is a term-by-term multiplier that either zeros out or gradually scales down
the boundaries.
• 2. Computationally Efficient Derivative
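As a rough illustration of the masking step, the NumPy sketch below works in 1-D with a single filter. It assumes a "valid" convolution for the reconstruction, a squared-error energy, and a linear ramp as the mask. The function names (reconstruction_grad, border_mask, masked_grad) and the choice to multiply the derivative directly by the mask are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reconstruction_grad(x, d, z):
    """Gradient of 0.5 * ||x - (d * z)||^2 with respect to z (1-D, one filter)."""
    residual = x - np.convolve(z, d, mode="valid")
    # Back-project the residual through the flipped filter.
    return -np.convolve(residual, d[::-1], mode="full")

def border_mask(length, ramp):
    """Term-by-term multiplier: 1 in the interior, scaled down to 0 at both ends."""
    mask = np.ones(length)
    if ramp > 0:
        taper = np.arange(ramp) / ramp
        mask[:ramp] = taper
        mask[-ramp:] = taper[::-1]
    return mask

def masked_grad(x, d, z):
    """Reconstruction-error derivative with the boundary entries scaled down."""
    grad = reconstruction_grad(x, d, z)
    return border_mask(grad.size, d.size - 1) * grad

# Usage: a length-20 signal, a length-5 filter, and a length-24 feature map.
rng = np.random.default_rng(0)
x = rng.standard_normal(20)
d = rng.standard_normal(5)
z = rng.standard_normal(24)
step = z - 0.01 * masked_grad(x, d, z)  # one gradient step on the code
```

A hard mask (zeros in the border, ones elsewhere) would be the other option mentioned on the slide; the ramp is used here only because it changes the border entries gradually.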
Learning an Efficient Encoder
1. A New Smooth Shrinkage Operator:
2. To speed up convergence, use the stochastic diagonal Levenberg-Marquardt
method to compute a positive diagonal approximation to the Hessian.
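The formula for the shrinkage operator did not survive on this slide. As a hedged sketch, the code below assumes the form sh(s) = sign(s) · [(1/β) · log(exp(βb) + exp(β|s|) − 1) − b], which is zero at s = 0 and approaches ordinary soft thresholding with threshold b for large |s|. The parameter names beta and b and the helper soft_threshold are illustrative assumptions.

```python
import math

def smooth_shrink(s, beta, b):
    """Smooth shrinkage (assumed form); beta sets sharpness, b the threshold."""
    a = beta * abs(s)
    c = beta * b
    # Evaluate log(exp(c) + exp(a) - 1) without overflow for large arguments.
    m = max(a, c)
    log_term = m + math.log(math.exp(c - m) + math.exp(a - m) - math.exp(-m))
    return math.copysign(log_term / beta - b, s)

def soft_threshold(s, b):
    """Standard (non-smooth) soft thresholding, for comparison."""
    return math.copysign(max(abs(s) - b, 0.0), s)

# Usage: compare the two operators near and far from the threshold.
for s in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(s, smooth_shrink(s, beta=10.0, b=1.0), soft_threshold(s, b=1.0))
```

Unlike soft thresholding, this function has a nonzero derivative everywhere, so small inputs still pass a gradient back to the encoder parameters during training.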
Patch-Based vs. Convolutional Sparse Modeling
The convolution operator enables the system to model local structures that
appear anywhere in the signal. The convolutional dictionary does not waste
resources modeling similar filter structure at multiple locations. Instead, it
models more orientations, frequencies, and different structures including
center-surround filters, double center-surround filters, and corner structures
at various angles.
Multi-Stage Architecture
The convolutional encoder can be used to replace patch-based
sparse coding modules used in multistage object recognition
architectures. Building on the previous findings, for each stage,
the encoder is followed by absolute value rectification,
contrast normalization and average subsampling.
Absolute Value Rectification: a simple pointwise absolute value function
applied on the output of the encoder.
Contrast Normalization: reduces the dependencies between components
(feature maps). When used between layers, the mean and standard deviation
are calculated across all feature maps over a 9 × 9 neighborhood in the spatial
dimensions.
Average Pooling: a spatial pooling operation that is applied on each feature
map independently.
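The three post-encoder steps can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: feature maps are stored as an array of shape (K, H, W), the normalization window is uniform with reflect padding, and a small eps floor guards the division. The function names and the eps parameter are not from the slides.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def abs_rectify(maps):
    """Pointwise absolute value of the encoder output."""
    return np.abs(maps)

def local_mean(maps, size):
    """Mean over all K maps and a size x size spatial neighborhood; returns (H, W)."""
    pad = size // 2
    padded = np.pad(maps, ((0, 0), (pad, pad), (pad, pad)), mode="reflect")
    windows = sliding_window_view(padded, (size, size), axis=(1, 2))
    return windows.mean(axis=(0, 3, 4))

def contrast_normalize(maps, size=9, eps=1e-6):
    """Subtract the local mean, then divide by the local standard deviation."""
    centered = maps - local_mean(maps, size)
    std = np.sqrt(local_mean(centered ** 2, size))
    return centered / np.maximum(std, eps)

def average_pool(maps, size, stride):
    """Average pooling applied to each feature map independently."""
    windows = sliding_window_view(maps, (size, size), axis=(1, 2))
    return windows[:, ::stride, ::stride].mean(axis=(3, 4))

def stage(maps, norm_size=9, pool_size=10, pool_stride=5):
    """Rectification, contrast normalization, then average subsampling."""
    return average_pool(contrast_normalize(abs_rectify(maps), norm_size),
                        pool_size, pool_stride)

# Usage: 4 feature maps of size 20 x 20 reduce to 4 maps of size 3 x 3.
rng = np.random.default_rng(0)
features = stage(rng.standard_normal((4, 20, 20)))
```

The mean and standard deviation are shared across all K maps at each pixel, which is what couples the maps during normalization; pooling, by contrast, never mixes maps.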
Experiment 1: Object Recognition Using the Caltech 101 Dataset
Preprocess:
1. 30/30 training/testing; 2. Resize: 151x143; 3. Local Contrast Normalization
Unsupervised Training: Berkeley segmentation dataset
Architecture:
First Layer: 64 filters of size 9 × 9; Pooling: 10 × 10 area with a 5-pixel stride.
Second Layer: 256 filters of size 9 × 9, where each dictionary element is
constrained to connect to 16 dictionary elements from the first layer;
Pooling: 6 × 6 area with stride 4.
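To see what these settings imply for the feature-map sizes, the arithmetic can be worked through in a few lines. This sketch assumes "valid" convolutions (no padding), pooling without padding, and that rectification and normalization preserve spatial size; none of these are stated on the slide, and the helper names are hypothetical.

```python
def valid_conv(size, kernel):
    """Output length of a 'valid' convolution along one dimension."""
    return size - kernel + 1

def pool(size, window, stride):
    """Output length of unpadded pooling along one dimension."""
    return (size - window) // stride + 1

def stage_size(hw, kernel, window, stride):
    """Spatial size after one convolution + pooling stage."""
    return tuple(pool(valid_conv(n, kernel), window, stride) for n in hw)

layer1 = stage_size((151, 143), kernel=9, window=10, stride=5)
layer2 = stage_size(layer1, kernel=9, window=6, stride=4)
num_features = 256 * layer2[0] * layer2[1]

print(layer1)        # (27, 26)
print(layer2)        # (4, 4)
print(num_features)  # 4096
```

Under these assumptions the first stage yields 64 maps of size 27 × 26 and the second yields 256 maps of size 4 × 4, so the classifier receives a 4096-dimensional feature vector.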
Recognition Accuracy
One Layer
Two Layers
Ours: 65.8% (0.6)
Pedestrian Detection (1)
Original dataset: positive = 2416; negative = 1218
Augmented: positive = 11370 (1000); negative = 9001 (1000)
Layer 1: 32 filters of size 7 × 7; Layer 2: 64 filters of size 7 × 7; Pooling: 2 × 2
Pedestrian Detection (2)
Conclusions
• 1. Convolutional training of feature extractors reduces the
redundancy among filters compared with those obtained
from patch-based models.
• 2. Introduced two different convolutional encoder functions
for performing efficient feature extraction, which is crucial
for using sparse coding in real-world applications.
• 3. The proposed sparse modeling system has been
applied through a successful multi-stage architecture to
object recognition and pedestrian detection problems and
performed comparably to similar systems.