An Analysis of Single-Layer Networks in Unsupervised Feature Learning [1]
Yani Chen
10/14/2014

Outline
Introduction
Framework for feature learning
Unsupervised feature learning algorithms
Effect of some parameters
Experiments and analysis of the results

Introduction
1. Much prior work has focused on increasingly complex unsupervised feature learning algorithms.
2. Simple factors, such as the number of hidden nodes, may be more important for achieving high performance than the choice of learning algorithm or the depth of the model.
3. Even a single-layer network can produce very good learned features.

Unsupervised feature learning framework
1. Extract random patches from unlabeled training images (images are used as the running example).
2. Apply a pre-processing stage to the patches.
3. Learn a feature mapping with an unsupervised feature learning algorithm.
4. Extract features from equally spaced sub-patches covering each input image.
5. Pool the features together to reduce the number of feature values.
6. Train a linear classifier to predict the labels from the feature vectors.

Unsupervised learning algorithms
1. Sparse autoencoder
2. Sparse restricted Boltzmann machine (RBM)
3. K-means clustering
4. Gaussian mixture model (GMM) clustering

Sparse autoencoder
Objective function (to minimize):
  J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \| h_{W,b}(x^{(i)}) - y^{(i)} \|^2 + \frac{\lambda}{2} \sum_{l} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( W_{ji}^{(l)} \big)^2 + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)
Feature mapping function:
  f(x) = g(Wx + b), where g(z) = \frac{1}{1 + \exp(-z)}

Sparse restricted Boltzmann machine
The energy function of an RBM is
  E(x, h) = -b^\top x - a^\top h - h^\top W x
The same type of sparsity penalty as in the sparse autoencoder can be added.
Sparse RBMs can be trained using a contrastive divergence approximation.
Feature mapping function [7]:
  f(x) = g(Wx + b), where g(z) = \frac{1}{1 + \exp(-z)}

K-means clustering
Objective function for learning K centroids:
  \arg\min_{S} \sum_{i=1}^{K} \sum_{x \in S_i} \| x - c^{(i)} \|_2^2
where S = \{S_1, S_2, \ldots, S_K\} are the cluster sets and c^{(i)} is the centroid of S_i.
Feature mapping functions:
1. Hard assignment:
  f_k(x) = 1 if k = \arg\min_j \| c^{(j)} - x \|_2^2, and 0 otherwise.
2. Soft assignment ("triangle"):
  f_k(x) = \max(0, \mu(z) - z_k), where z_k = \| x - c^{(k)} \|_2 and \mu(z) is the mean of the elements of z.

GMM clustering
A Gaussian mixture model is a probabilistic model that assumes all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

Gaussian mixture models (GMM)
Overall likelihood of the model:
  J = \prod_{n} P(x_n), with P(x_n) = \sum_{k=1}^{K} f_k(x_n \mid c^{(k)}, \Sigma_k) P(k)
  f_k(x \mid c^{(k)}, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\big( -\tfrac{1}{2} (x - c^{(k)})^\top \Sigma_k^{-1} (x - c^{(k)}) \big)
where d is the dimension of x, k = 1, \ldots, K indexes the Gaussians, c^{(k)} is the mean of the k-th Gaussian, and \Sigma_k is its covariance matrix. These component densities are used to compute the posterior membership probabilities of x.

EM algorithm
The EM (expectation-maximization) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models.
E-step (assign points to clusters):
  P(k \mid x_n) = \frac{f_k(x_n \mid c^{(k)}, \Sigma_k) \, P(k)}{P(x_n)}
M-step (estimate model parameters):
  c^{(k)} = \frac{\sum_n P(k \mid x_n) \, x_n}{\sum_n P(k \mid x_n)}, \quad
  \Sigma_k = \frac{\sum_n P(k \mid x_n) (x_n - c^{(k)})(x_n - c^{(k)})^\top}{\sum_n P(k \mid x_n)}, \quad
  P(k) = \frac{1}{N} \sum_n P(k \mid x_n)

Gaussian mixtures
Feature mapping function:
  f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\big( -\tfrac{1}{2} (x - c^{(k)})^\top \Sigma_k^{-1} (x - c^{(k)}) \big)
with d, c^{(k)}, and \Sigma_k defined as above.

Feature extraction and classification
Convolutional feature extraction followed by (sum) pooling; classification with a linear (L2) SVM. A minimal code sketch of these steps follows below.
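To make steps 4-6 of the framework concrete, here is a minimal Python sketch of dense feature extraction over equally spaced sub-patches, sum pooling over the four image quadrants (the scheme used in the paper), and a linear SVM. This is an illustration, not the authors' implementation: the function names (`pooled_features`, `train_classifier`) and the generic `encode` callable, which stands for any learned feature mapping f(x) such as the K-means triangle activation, are my own, and scikit-learn's LinearSVC stands in for the paper's L2 SVM.

```python
# Minimal sketch of steps 4-6: dense feature extraction over equally spaced
# sub-patches, sum pooling over the four image quadrants, and a linear (L2) SVM.
# `encode` is a placeholder for any learned feature mapping f(x); all names here
# are illustrative, not taken from the authors' code.
import numpy as np
from sklearn.svm import LinearSVC


def pooled_features(image, encode, patch_size=6, stride=1):
    """Encode every patch of an H x W x C image, then sum-pool per quadrant (4K values)."""
    h, w, _ = image.shape
    ys = range(0, h - patch_size + 1, stride)
    xs = range(0, w - patch_size + 1, stride)
    patches = np.stack([image[y:y + patch_size, x:x + patch_size].ravel()
                        for y in ys for x in xs])           # (n_patches, patch_dim)
    codes = encode(patches)                                 # (n_patches, K)
    grid = codes.reshape(len(ys), len(xs), -1)              # feature map over positions
    my, mx = len(ys) // 2, len(xs) // 2
    quads = [grid[:my, :mx], grid[:my, mx:], grid[my:, :mx], grid[my:, mx:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])


def train_classifier(images, labels, encode):
    """Pool features for every image and fit a linear SVM (squared-hinge / L2 loss)."""
    X = np.stack([pooled_features(img, encode) for img in images])
    return LinearSVC(C=1.0).fit(X, labels)
```

For a 32x32 image with a 6-pixel receptive field and stride 1 this gives a 27x27 grid of K-dimensional codes, which is split into four quadrants and summed, so each image becomes a 4K-dimensional feature vector.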
Data
1. CIFAR-10 (used to tune the parameters)
2. NORB
3. Downsampled STL-10 (96x96 -> 32x32)

CIFAR-10 dataset [3]
The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

NORB dataset [4]
This dataset is intended for experiments in 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: animals, human figures, airplanes, trucks, and cars. There are 24,300 training image pairs (96x96) and 24,300 test image pairs.

STL-10 dataset [5]
The STL-10 dataset consists of 96x96 color images in 10 classes (airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck), with 500 training images and 800 test images per class, plus 100,000 unlabeled images. For these experiments the images are downsampled to 32x32.

Factors studied
1. With or without whitening
2. Number of features
3. Stride (spacing between patches)
4. Receptive field size

Effect of whitening
Whitening makes the features less correlated with each other and gives them all the same variance.
For the sparse autoencoder and sparse RBM: with only 100 features there is a significant benefit from whitening; as the number of features grows, the advantage disappears.
For the clustering algorithms: whitening is a required preprocessing step, because these algorithms cannot handle the correlations in the data.

Effect of the number of features
Numbers of features used: 100, 200, 400, 800, 1600.
All algorithms generally achieved higher performance by learning more features.

Effect of stride
The stride is the spacing between patches at which feature values are extracted.
Performance drops as the stride increases.

Effect of receptive field size
The receptive field size is the patch size.
Overall, a 6-pixel receptive field worked best.

Classification results
Table 1: Test recognition accuracy on CIFAR-10
(stride = 1, receptive field = 6 pixels, with whitening, large number of features)
  Raw pixels                            37.3%
  3-way factored RBM (3 layers)         65.3%
  Mean-covariance RBM (3 layers)        71.0%
  Improved Local Coordinate Coding      74.5%
  Conv. Deep Belief Net (2 layers)      78.9%
  Sparse autoencoder                    73.4%
  Sparse RBM                            72.4%
  K-means (hard)                        68.6%
  K-means (triangle, 1600 features)     77.9%
  K-means (triangle, 4000 features)     79.6%

Table 2: Test recognition accuracy (and error) on NORB (normalized-uniform)
(stride = 1, receptive field = 6 pixels, with whitening, large number of features)
  Conv. Neural Network                  93.4% (6.6%)
  Deep Boltzmann Machine                92.8% (7.2%)
  Deep Belief Network                   95.0% (5.0%)
  Best result of [6]                    94.4% (5.6%)
  Deep neural network                   97.13% (2.87%)
  Sparse autoencoder                    96.9% (3.1%)
  Sparse RBM                            96.2% (3.8%)
  K-means (hard)                        96.9% (3.1%)
  K-means (triangle, 1600 features)     97.0% (3.0%)
  K-means (triangle, 4000 features)     97.21% (2.79%)

Table 3: Test recognition accuracy on STL-10
  Raw pixels                            31.8% (±0.62%)
  K-means (triangle, 1600 features)     51.5% (±1.73%)
The proposed method is strongest when a large labeled training set is available.

Conclusion
1. The best performance comes from K-means clustering with the triangle activation: it is easy and fast, with no hyperparameters to tune (a minimal sketch of this recipe follows below).
2. A single-layer network can achieve very good results.
3. Use more features and dense feature extraction.
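Since the conclusion singles out the whitening + K-means (triangle) recipe, here is a brief sketch of those two pieces. This is an illustration under stated assumptions, not the authors' code: `zca_whiten`, `learn_centroids`, and `triangle_encode` are names I chose, scikit-learn's MiniBatchKMeans stands in for the paper's K-means training, and the normalization constants are arbitrary.

```python
# Minimal sketch of the recipe highlighted in the conclusion: whiten the patches,
# learn K centroids with K-means, and encode with the triangle activation
# f_k(x) = max(0, mean_j(z_j) - z_k), with z_j = ||x - c^(j)||_2.
# Function names and constants are illustrative, not the authors' implementation.
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def zca_whiten(patches, eps=0.1):
    """Per-patch mean/contrast normalization followed by ZCA whitening."""
    X = patches - patches.mean(axis=1, keepdims=True)
    X = X / (X.std(axis=1, keepdims=True) + 1e-8)
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T        # ZCA transform
    return (X - mean) @ W, mean, W


def learn_centroids(white_patches, k=1600, seed=0):
    """Learn the dictionary of K centroids on whitened patches."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
    return km.fit(white_patches).cluster_centers_


def triangle_encode(white_patches, centroids):
    """Triangle activation: distance to each centroid, thresholded at the mean distance."""
    # Broadcasts all pairwise distances at once for clarity; a real implementation
    # would process the patches in batches to limit memory use.
    z = np.linalg.norm(white_patches[:, None, :] - centroids[None, :, :], axis=2)
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)
```

Bound to a fixed dictionary (for example with functools.partial(triangle_encode, centroids=C)), this mapping can serve as the `encode` argument of the pooling sketch shown earlier; note that the same normalization and whitening transform (mean, W) must then be applied to each extracted patch before encoding.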
References
[1] Coates, Adam, Andrew Y. Ng, and Honglak Lee. "An analysis of single-layer networks in unsupervised feature learning." International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[2] http://ace.cs.ohio.edu/~razvan/courses/dl6900/index.html
[3] Krizhevsky, Alex. "Learning Multiple Layers of Features from Tiny Images." Master's thesis, Dept. of Computer Science, University of Toronto, 2009.
[4] LeCun, Yann, Fu Jie Huang, and Leon Bottou. "Learning methods for generic object recognition with invariance to pose and lighting." Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, IEEE, 2004.
[5] http://cs.stanford.edu/~acoates/stl10
[6] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?" 2009 IEEE 12th International Conference on Computer Vision (ICCV), IEEE, 2009.
[7] Goh, Hanlin, Nicolas Thome, and Matthieu Cord. "Biasing restricted Boltzmann machines to manipulate latent selectivity and sparsity." NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

Thank you!