What is the Best Multi-Stage Architecture for Object Recognition?
Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato and Yann LeCun
Presented by Lingbo Li
ECE, Duke University
Dec. 13th, 2010
Outline
• Introduction
• Model Architecture
• Training Protocol
• Experiments
   – Caltech 101 Dataset
   – NORB Dataset
   – MNIST Dataset
• Conclusions
Introduction (I)
Feature extraction stages:
 – A filter bank
 – A non-linear operation
 – A pooling operation
Recognition architectures:
• Single stage of features + supervised classifier: SIFT,
HoG, etc.
• Two or more successive stages of feature extractors +
supervised classifier: convolutional networks
Introduction (II)
• Q1: How do the non-linearities that follow the filter banks
influence the recognition accuracy?
• Q2: Is there any advantage to using an architecture with
two successive stages of feature extraction, rather than
with a single stage?
• Q3: Does learning the filter banks in an unsupervised or
supervised manner improve the performance over hard-wired filters or even random filters?
Model Architecture (I)
• Filter bank layer (F_CSG): convolution with a filter bank, a tanh non-linearity, and trainable gain coefficients.
   Input: a 3D array x of 2D feature maps; x_i is the i-th feature map
   Output: a 3D array y of 2D feature maps; y_j is the j-th feature map
   Filter k_ij connects input map x_i to output map y_j:
      y_j = g_j tanh( sum_i k_ij * x_i )
   A filter bank layer with 64 filters of size 9x9 is written 64F_CSG^9x9.
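As an illustration, here is a minimal PyTorch sketch of such a filter bank layer, assuming the F_CSG form above (convolution, tanh, per-map gain); the sizes and the random initialization are arbitrary examples, not the paper's settings.

import torch
import torch.nn.functional as F

def fcsg(x, filters, gains):
    # x: input feature maps (1, n1, h, w); filters: (m1, n1, 9, 9); gains: g_j per output map (m1,)
    y = F.conv2d(x, filters)                 # each output map sums the filtered input maps
    y = torch.tanh(y)                        # pointwise tanh non-linearity
    return gains.view(1, -1, 1, 1) * y       # multiply output map j by its gain g_j

# example: a bank of 64 filters of size 9x9 applied to a single gray-scale image
x = torch.randn(1, 1, 143, 143)
maps = fcsg(x, torch.randn(64, 1, 9, 9) * 0.01, torch.ones(64))
print(maps.shape)                            # torch.Size([1, 64, 135, 135])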
Model Architecture (II)
• Rectification layer (R_abs): pointwise absolute value of the input, y = |x|
• Local contrast normalization layer (N), applied with a Gaussian weighting window:
   – Subtractive normalization operation: subtract a weighted average of the neighboring values (across space and feature maps)
   – Divisive normalization operation: divide by the weighted standard deviation of the same neighborhood
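A sketch of the N layer under this reading of the two operations (subtract a Gaussian-weighted local mean, then divide by the local weighted standard deviation floored at a constant c); the window size, sigma and c used here are assumptions.

import torch
import torch.nn.functional as F

def gaussian_window(size=9, sigma=2.0):
    # normalized Gaussian weighting window (sums to 1)
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    w = torch.outer(g, g)
    return w / w.sum()

def contrast_normalize(x, c=1.0, size=9):
    # x: feature maps of shape (1, n, h, w)
    n = x.shape[1]
    w = gaussian_window(size).expand(1, n, size, size) / n    # weights sum to 1 over all maps
    pad = size // 2
    mean = F.conv2d(x, w, padding=pad)                        # weighted local mean
    v = x - mean                                              # subtractive normalization
    std = F.conv2d(v ** 2, w, padding=pad).sqrt()             # weighted local standard deviation
    return v / std.clamp(min=c)                               # divisive normalization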
Model Architecture (III)
• Average pooling and subsampling layer (P_A): average the values in each neighborhood, then down-sample.
   An average pooling layer with 4x4 down-sampling is written P_A^4x4.
• Max-pooling layer (P_M): take the maximum value in each neighborhood, then down-sample.
   A max-pooling layer with 4x4 down-sampling is written P_M^4x4.
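Both pooling modules map directly onto standard pooling primitives; a sketch with 4x4 down-sampling (the weighted window of P_A is simplified to a plain average, and the pooling window and stride are taken equal here):

import torch.nn.functional as F

def p_a(x):
    # P_A: average over 4x4 neighborhoods, down-sampling by 4 in each direction
    return F.avg_pool2d(x, kernel_size=4, stride=4)

def p_m(x):
    # P_M: maximum over 4x4 neighborhoods, down-sampling by 4 in each direction
    return F.max_pool2d(x, kernel_size=4, stride=4)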
Model Architecture (IV)
Combining Modules into a Hierarchy
• F_CSG - P_A: filter bank and average pooling (the traditional convolutional network stage)
• F_CSG - R_abs - P_A: filter bank, abs rectification, average pooling
• F_CSG - R_abs - N - P_A: filter bank, abs rectification, local contrast normalization, average pooling
• F_CSG - P_M: filter bank and max-pooling
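A sketch of how these modules might be chained into one stage with torch.nn building blocks; this is only an approximation: the gain coefficients of F_CSG are omitted, and nn.LocalResponseNorm is used as a rough stand-in for the local contrast normalization layer N.

import torch
import torch.nn as nn

class AbsRectification(nn.Module):
    # R_abs: pointwise absolute value
    def forward(self, x):
        return torch.abs(x)

def make_stage(in_maps, out_maps, filter_size=9, pool=4, rectify=True, normalize=True):
    # F_CSG - R_abs - N - P_A by default; drop rectify/normalize for the other combinations
    layers = [nn.Conv2d(in_maps, out_maps, filter_size), nn.Tanh()]
    if rectify:
        layers.append(AbsRectification())
    if normalize:
        layers.append(nn.LocalResponseNorm(size=5))   # stand-in for the N layer
    layers.append(nn.AvgPool2d(pool, stride=pool))
    return nn.Sequential(*layers)

stage1 = make_stage(1, 64)    # e.g. 64 filters of size 9x9 followed by 4x4 average pooling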
Training Protocol (I)
Optimal sparse coding: reconstruct an input x as a linear combination of a few columns of a dictionary D.
Under the sparsity condition, this can be written as the optimization problem
   z* = argmin_z || x - D z ||^2 + λ || z ||_1
Given training samples x^1, ..., x^P, learning proceeds by alternating two steps:
1) Minimize the loss function above with respect to the dictionary D (with the codes fixed);
2) Find the optimal code z* for each sample by running a rather expensive optimization algorithm.
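For concreteness, a NumPy sketch of step 2 for one sample, using a basic iterative shrinkage (ISTA) loop and assuming the objective 0.5 ||x - Dz||^2 + λ ||z||_1; the step count and λ are arbitrary.

import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=200):
    # z* = argmin_z 0.5 * ||x - D z||^2 + lam * ||z||_1, solved by ISTA
    L = np.linalg.norm(D, 2) ** 2                  # Lipschitz constant of the smooth term
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - D.T @ (D @ z - x) / L              # gradient step on the reconstruction error
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding (l1 prox)
    return z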
Training Protocol (II)
Predictive Sparse Decomposition (PSD)
PSD trains a feed-forward regressor F(x; P_f) = G tanh(W x + b) to approximate the sparse solution z* for all training samples, where P_f = {G, W, b} are the trainable parameters.
Learning proceeds by minimizing the loss function
   L = || x - D z ||^2 + λ || z ||_1 + α || z - F(x; P_f) ||^2
where the minimization is carried out over z, D and P_f.
Thus, D (the dictionary) and W (the filters) are simultaneously optimized.
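A sketch of this loss in PyTorch, with the regressor written as F(x) = G·tanh(Wx + b) and the diagonal gain G treated as a vector; the shapes, λ and α are illustrative.

import torch

def psd_loss(x, z, D, W, b, G, lam=0.1, alpha=1.0):
    # L = ||x - D z||^2 + lam ||z||_1 + alpha ||z - F(x)||^2
    recon = torch.sum((x - D @ z) ** 2)              # reconstruction with the dictionary D
    sparsity = lam * torch.sum(torch.abs(z))         # l1 penalty keeps the code z sparse
    pred = G * torch.tanh(W @ x + b)                 # feed-forward predictor F(x; G, W, b)
    return recon + sparsity + alpha * torch.sum((z - pred) ** 2)

Training would alternate minimizing this loss over the code z with gradient steps on D, W, b and G, which is why the dictionary and the filters end up being optimized together.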
Training Protocol (III)
A single letter (e.g. R, U): an architecture with a single stage of feature extraction followed by a classifier;
A double letter (e.g. RR, UU): an architecture with two stages of feature extraction followed by a classifier.

R: Filters are set to random values and kept fixed.
Classifiers are trained in supervised mode.

U: Filters are trained with the unsupervised PSD algorithm and kept fixed.
Classifiers are trained in supervised mode.

R+: Filters are initialized with random values. The entire system (feature stages +
classifier) is trained in supervised mode with gradient descent.

U+: Filters are initialized with the PSD unsupervised learning algorithm. The entire
system (feature stages + classifier) is trained in supervised mode by gradient descent.
Experiments (I) – Caltech 101
• Data pre-processing (a sketch of steps 2 and 4 follows at the end of this slide):
1) Convert to gray-scale and resize so that the longer edge is 151 pixels;
2) Subtract the image mean and divide by the image standard deviation;
3) Apply subtractive/divisive normalization (N layer with c = 1);
4) Zero-pad the shorter side to 143 pixels.
• Recognition rates are averaged over 5 random draws of the
training set (30 images per class).
• Hyper-parameters are selected to maximize the performance
on a validation set of 5 samples per class taken out of the
training set.
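A NumPy sketch of pre-processing steps 2) and 4) above; gray-scale conversion, resizing and the N layer are left out, and the symmetric placement of the zero-padding is an assumption.

import numpy as np

def global_normalize(gray):
    # step 2: subtract the image mean and divide by the image standard deviation
    gray = gray.astype(np.float32)
    return (gray - gray.mean()) / (gray.std() + 1e-8)

def pad_shorter_side(gray, size=143):
    # step 4: zero-pad the shorter side up to `size` pixels
    h, w = gray.shape
    ph, pw = max(size - h, 0), max(size - w, 0)
    return np.pad(gray, ((ph // 2, ph - ph // 2), (pw // 2, pw - pw // 2)))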
Experiments (I) – Caltech 101
• Using a single stage of feature extraction:
   input → stage 1 → 64 feature maps of 26x26 → multinomial logistic regression or PMK-SVM
• Using two stages of feature extraction:
   input → stage 1 → 64 feature maps of 26x26 → stage 2 → 256 feature maps of 4x4 → multinomial logistic regression or PMK-SVM
Experiments (I) – Caltech 101
• Random filters and no filter learning whatsoever, combined with
abs rectification and normalization, can achieve decent performance;
• Supervised fine tuning improves the performance;
• Two-stage systems are better than their single-stage
counterparts;
• With rectification and normalization, unsupervised training
does not improve the performance;
• abs rectification is a crucial component for good
performance;
• A single-stage system with PMK-SVM reaches about the same
performance as a two-stage system with logistic regression.
Experiments (II) – NORB Dataset
• The NORB dataset has 5 object categories;
• 24300 training samples and 24300 test samples (4860 per class); each
image is gray-scale with 96x96 pixels;
• Only some of the training protocols are considered;
1) Random filters do not perform as well as learned filters
when more labeled samples are available.
2) The use of abs rectification and normalization makes a big difference.
Experiments (II) – NORB Dataset
Use gradient descent to find the optimal input patterns (the inputs that maximally activate the units) in a given architecture.
In the figure:
• (1-a) random stage-1 filters;
• (1-b) corresponding optimal inputs;
• (2-a) PSD filters;
• (2-b) corresponding optimal input patterns;
• (3) subset of stage-2 filters after PSD and supervised refinement on Caltech 101.
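A sketch of that procedure for a single stage-1 unit, using plain gradient ascent on the input of a randomly initialized filter bank; the architecture, learning rate, iteration count and clamping constraint are arbitrary choices.

import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Conv2d(1, 64, 9), nn.Tanh())   # randomly initialized stage-1 filters
x = torch.zeros(1, 1, 32, 32, requires_grad=True)        # the input pattern being optimized
opt = torch.optim.SGD([x], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    response = stage1(x)[0, 0].mean()     # activation of output map 0
    (-response).backward()                # gradient ascent on the activation
    opt.step()
    with torch.no_grad():
        x.clamp_(-1.0, 1.0)               # keep the optimal input bounded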
Experiments (III) – MNIST Dataset
• 60,000 gray-scale 28x28 pixel images for training and 10,000
images for testing;
• 2-stage feature extraction (see the sketch below):
   Stage 1: 34x34 input image → convolution with 50 7x7 filters → 50 feature maps of 28x28
            → max-pooling over 2x2 windows → 50 feature maps of 14x14
   Stage 2: convolution with 1024 5x5 filters → 64 feature maps of 10x10
            → max-pooling over 2x2 windows → 64 feature maps of 5x5
   Classifier: 10-way multinomial logistic regression
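A PyTorch sketch of this two-stage network with the sizes from the diagram above; full connectivity is used in the second-stage convolution (the 1024 kernels in the paper come from a sparse connection table), and the non-linearities between the modules are omitted for brevity.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 50, kernel_size=7),    # 34x34 input -> 50 feature maps of 28x28
    nn.MaxPool2d(2),                    # -> 50 feature maps of 14x14
    nn.Conv2d(50, 64, kernel_size=5),   # -> 64 feature maps of 10x10
    nn.MaxPool2d(2),                    # -> 64 feature maps of 5x5
    nn.Flatten(),
    nn.Linear(64 * 5 * 5, 10),          # 10-way multinomial logistic regression
)

x = torch.randn(1, 1, 34, 34)           # one 34x34 input image, as in the diagram
print(model(x).shape)                   # torch.Size([1, 10])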
Experiments (III) – MNIST Dataset
• Parameters are trained with PSD: the only hyperparameter is tuned with a validation set of
10,000 training samples.
• The classifier is randomly initialized;
• The whole system is tuned in supervised mode.
• A test error rate of 0.53% was obtained.
Conclusions (I)
• Q1: How do the non-linearities that follow the filter banks
influence the recognition accuracy?
1) A rectifying non-linearity is the single most important
factor.
2) A local normalization layer can also improve the
performance.
• Q2: Is there any advantage to using an architecture with two
successive stages of feature extraction, rather than with a
single stage?
1) Two stages are better than one.
2) The performance of a two-stage system is similar to that of
the best single-stage systems based on SIFT and PMK-SVM.
Conclusions (II)
• Q3: Does learning the filter banks in an unsupervised or
supervised manner improve the performance over hard-wired
filters or even random filters?
1) Random filters yield good performance only when the
training set is small.
2) The optimal input patterns for a randomly initialized stage are
similar to the optimal inputs for a stage that uses learned filters.
3) Global supervised learning of the filters yields good
recognition rates when the proper non-linearities are used.
4) Unsupervised pre-training followed by supervised refinement
yields the best overall accuracy.