What is the Best Multi-Stage Architecture for Object Recognition?
Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun
Presented by Lingbo Li, ECE, Duke University
Dec. 13th, 2010

Outline
• Introduction
• Model Architecture
• Training Protocol
• Experiments
    • Caltech 101 Dataset
    • NORB Dataset
    • MNIST Dataset
• Conclusions

Introduction (I)
Feature extraction stages:
    • A filter bank
    • A non-linear operation
    • A pooling operation
Recognition architectures:
• Single stage of features + supervised classifier: SIFT, HoG, etc.
• Two or more successive stages of feature extractors + supervised classifier: convolutional networks

Introduction (II)
• Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?
• Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than a single stage?
• Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?

Model Architecture (I)
• Filter bank layer: maps the input to a set of output feature maps, indexed by j.
• Example: a filter bank layer with 64 filters of size 9x9.

Model Architecture (II)
• Rectification layer (abs)
• Local normalization layer (N):
    • Subtractive normalization operation
    • Divisive normalization operation

Model Architecture (III)
• An average pooling layer with 4x4 down-sampling
• A max-pooling layer with 4x4 down-sampling

Model Architecture (IV)
Combining Modules into a Hierarchy

Training Protocol (I)
Optimal sparse coding: under a sparsity condition, this problem can be written as an optimization problem.
Given training samples, learning proceeds:
1) Minimize the loss function;
2) Find the sparse solution by running a rather expensive optimization algorithm.

Training Protocol (II)
Predictive Sparse Decomposition (PSD)
• PSD trains a regressor to approximate the sparse solution for all training samples.
• Learning proceeds by minimizing a loss function in which the dictionary and the filters are optimized simultaneously.

Training Protocol (III)
A single letter denotes an architecture with a single stage of feature extraction followed by a classifier; a double letter denotes an architecture with two stages of feature extraction followed by a classifier.
• Filters are set to random values and kept fixed. Classifiers are trained in supervised mode.
• Filters are trained using the unsupervised PSD algorithm and kept fixed. Classifiers are trained in supervised mode.
• Filters are initialized with random values. The entire system (feature stages + classifiers) is trained in supervised mode with gradient descent.
• Filters are initialized with the PSD unsupervised learning algorithm. The entire system (feature stages + classifiers) is trained in supervised mode with gradient descent.

Experiments (I) – Caltech 101
• Data pre-processing:
    1) Convert to gray-scale and resize to 151x151 pixels;
    2) Subtract the image mean and divide by the image standard deviation;
    3) Apply subtractive/divisive normalization (N layer with c=1);
    4) Zero-pad the shorter side to 143 pixels.
• Recognition rates are averaged over 5 drawings of the training set (30 images per class).
• Hyper-parameters are selected to maximize the performance on a validation set of 5 samples per class taken out of the training sets.
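The global standardization and zero-padding steps of the Caltech 101 pre-processing above can be sketched as follows. This is a minimal NumPy illustration, not the presenters' code: the function names, the small epsilon guard against flat images, and the choice to centre the image on the padded canvas are all assumptions, and the resizing and local normalization steps are left out.

```python
import numpy as np


def standardize(img, eps=1e-8):
    """Subtract the image mean and divide by the image standard deviation.

    eps is an added assumption: it only prevents division by zero
    when the image is perfectly flat.
    """
    img = np.asarray(img, dtype=np.float64)
    return (img - img.mean()) / max(img.std(), eps)


def zero_pad_to_square(img, size=143):
    """Place a 2-D image on a size x size canvas of zeros.

    Centring the image is an assumption; the slides only say that
    the shorter side is zero-padded to 143 pixels.
    """
    h, w = img.shape
    if h > size or w > size:
        raise ValueError("image is larger than the target canvas")
    out = np.zeros((size, size), dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    out[top:top + h, left:left + w] = img
    return out


# Hypothetical example: a 143x100 gray-scale image with random content.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 255.0, size=(143, 100))
prepared = zero_pad_to_square(standardize(image))
```

Standardizing before padding keeps the padded border at exactly the image mean (zero), so the padding does not introduce an artificial contrast offset.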
Experiments (I) – Caltech 101
• Using a Single Stage of Feature Extraction:
    • The stage outputs 64 26x26 feature maps;
    • Classifier: multinomial logistic regression or PMK-SVM.
• Using Two Stages of Feature Extraction:
    • The first stage outputs 64 26x26 feature maps;
    • The second stage outputs 256 4x4 feature maps;
    • Classifier: multinomial logistic regression or PMK-SVM.

Experiments (I) – Caltech 101
• Random filters with no filter learning whatsoever can achieve decent performance;
• Supervised fine-tuning improves the performance;
• Two-stage systems are better than their single-stage counterparts;
• With rectification and normalization, unsupervised training does not improve the performance;
• abs rectification is a crucial component for good performance;
• A single-stage system with PMK-SVM reaches the same performance as a two-stage system with logistic regression.

Experiments (II) – NORB Dataset
• The NORB dataset has 5 object categories;
• 24300 training samples and 24300 test samples (4860 per class); each image is gray-scale with 96x96 pixels;
• Only selected training protocols are considered.
1) Random filters do not perform as well as learned filters when more labeled samples are available.
2) The use of abs and normalization makes a big difference.

Experiments (II) – NORB Dataset
Gradient descent is used to find the optimal input patterns for a given architecture. Panels of the figure:
• (1-a) random stage-1 filters;
• (1-b) corresponding optimal inputs;
• (2-a) PSD filters;
• (2-b) corresponding optimal input patterns;
• (3) subset of stage-2 filters after PSD and supervised refinement on Caltech 101.
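The idea of searching for an optimal input pattern by gradient steps can be illustrated on a deliberately small case. The sketch below is an assumption-laden toy, not the procedure behind the figure: it uses a single 9x9 filter followed by abs rectification (no normalization or pooling), constrains the input to unit norm, and the names `optimal_input` and `abs_filter_grad` are hypothetical. For this toy case the answer is known in advance (the filter itself, up to sign and scale), which makes the result easy to check.

```python
import numpy as np


def optimal_input(response_grad, shape, steps=100, lr=0.5, seed=0):
    """Projected gradient ascent on a unit-norm input pattern.

    response_grad(x) must return the gradient of the unit's response
    with respect to the input x. Ascent on the response is the same
    as gradient descent on its negative.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        x = x + lr * response_grad(x)   # step towards a larger response
        x /= np.linalg.norm(x)          # project back onto the unit sphere
    return x


def abs_filter_grad(k):
    """Gradient of |<k, x>| with respect to x (valid away from <k, x> = 0)."""
    def grad(x):
        return np.sign(np.sum(k * x)) * k
    return grad


# Stand-in for one random 9x9 stage-1 filter (illustrative values only).
k = np.random.default_rng(1).standard_normal((9, 9))
x_opt = optimal_input(abs_filter_grad(k), k.shape)

# Cosine similarity (up to sign) between the filter and the recovered input.
alignment = abs(np.sum(k * x_opt)) / np.linalg.norm(k)
```

In a full stage with normalization and pooling the gradient would come from back-propagation through all layers, and the optimal input would no longer be a scaled copy of a single filter.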
Experiments (III) – MNIST Dataset
• 60,000 gray-scale 28x28 pixel images for training and 10,000 images for testing;
• Two stages of feature extraction:
    • First stage: 34x34 input image -> convolution with 50 7x7 filters -> 50 28x28 feature maps -> max-pooling over 2x2 windows -> 50 14x14 feature maps;
    • Second stage: convolution with 1024 5x5 filters -> 64 10x10 feature maps -> max-pooling over 2x2 windows -> 64 5x5 feature maps;
    • Classifier: 10-way multinomial logistic regression.

Experiments (III) – MNIST Dataset
• Parameters are trained with PSD; the only hyperparameter is tuned with a validation set of 10,000 training samples.
• The classifier is randomly initialized;
• The whole system is tuned in supervised mode.
• A test error rate of 0.53% was obtained.

Conclusions (I)
• Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?
    1) A rectifying non-linearity is the single most important factor.
    2) A local normalization layer can also improve the performance.
• Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than a single stage?
    1) Two stages are better than one.
    2) The performance of the two-stage system is similar to that of the best single-stage systems based on SIFT and PMK-SVM.

Conclusions (II)
• Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?
    1) Random filters yield good performance only when the training set is small.
    2) The optimal input patterns for a randomly initialized stage are similar to the optimal inputs for a stage that uses learned filters.
    3) Global supervised learning of the filters yields a good recognition rate when combined with the proper non-linearities.
    4) Unsupervised pre-training followed by supervised refinement yields the best overall accuracy.
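As a sanity check on the MNIST architecture described in the experiments, the quoted feature-map sizes can be reproduced with simple arithmetic. The sketch assumes "valid" convolutions (no padding, stride 1) and non-overlapping pooling windows; neither is stated explicitly in the slides, but both are consistent with the sizes given there. The function names are hypothetical.

```python
def conv_valid(size, kernel):
    """Output width of a 'valid' convolution (no padding, stride 1)."""
    return size - kernel + 1


def pool(size, window):
    """Output width of non-overlapping pooling; window must divide size."""
    if size % window != 0:
        raise ValueError("pooling window does not tile the feature map")
    return size // window


def mnist_feature_sizes(input_size=34):
    """Feature-map widths after each step of the two-stage MNIST system."""
    s1_conv = conv_valid(input_size, 7)   # 50 maps from 7x7 filters
    s1_pool = pool(s1_conv, 2)            # 2x2 max-pooling
    s2_conv = conv_valid(s1_pool, 5)      # 64 maps from 5x5 filters
    s2_pool = pool(s2_conv, 2)            # 2x2 max-pooling
    return s1_conv, s1_pool, s2_conv, s2_pool


sizes = mnist_feature_sizes()
print(sizes)                                 # (28, 14, 10, 5)
classifier_inputs = 64 * sizes[-1] ** 2      # 64 maps of 5x5 values
print(classifier_inputs)                     # 1600
```

Under these assumptions the 10-way logistic regression receives 1600 inputs per image, and the 34x34 input corresponds to a 28x28 MNIST digit with a 3-pixel border on each side.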
