ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, NIPS 2012
Eunsoo Oh (오은수)

ILSVRC
● ImageNet Large Scale Visual Recognition Challenge
● An image classification challenge with 1,000 categories (1.2 million images)
● ILSVRC-2012 winner: a deep convolutional neural network
reference : http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf

Why Deep Learning?
● "Shallow" vs. "deep" architectures
● Deep architectures learn a feature hierarchy all the way from pixels to classifier
reference : http://web.engr.illinois.edu/~slazebni/spring14/lec24_cnn.pdf

Background
● A neuron: inputs x1, x2, …, xd (raw pixels), weights w1, w2, …, wd, output f(w·x + b), where f is a nonlinearity (e.g. a sigmoid)
reference : http://en.wikipedia.org/wiki/Sigmoid_function#mediaviewer/File:Gjl-t(x).svg

Background
● Multi-Layer Neural Networks (input layer, hidden layers, output layer)
● Nonlinear classifier
● Learning can be done by gradient descent: the Back-Propagation algorithm

Background
● Convolutional Neural Networks
● A variation of multi-layer neural networks
● Kernel (convolution matrix)
reference : http://en.wikipedia.org/wiki/Kernel_(image_processing)

Background
● Convolutional filter: the same kernel is slid over the input to produce a feature map
reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx

Proposed Method
● Deep Convolutional Neural Network
● 5 convolutional and 3 fully connected layers
● 650,000 neurons, 60 million parameters
● Techniques for boosting performance:
● ReLU nonlinearity
● Training on multiple GPUs
● Overlapping max pooling
● Data augmentation
● Dropout

Rectified Linear Units (ReLU)
● f(x) = max(0, x), a non-saturating nonlinearity that trains several times faster than equivalent tanh units
reference : http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Training on Multiple GPUs
● The network is spread across two GPUs (GTX 580, 3 GB of memory each)
● Current GPUs are particularly well-suited to cross-GPU parallelization
● Very efficient GPU implementation of the CNN

Pooling
● Spatial pooling over non-overlapping or overlapping regions
● Sum or max pooling
reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx

Data Augmentation
● Extract random 224x224 patches from the 256x256 training images
● Also use their horizontal flips
(A crop-and-flip sketch follows the architecture overview below.)

Dropout
● Independently set each hidden unit activity to zero with probability 0.5
● Used in the two globally-connected hidden layers at the net's output
reference : http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Overall Architecture
● Trained with stochastic gradient descent on two NVIDIA GPUs for about a week (5-6 days)
● 650,000 neurons, 60 million parameters, 630 million connections
● The last layer contains 1,000 neurons, which produce a distribution over the 1,000 class labels
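To make the proposed layer stack concrete, here is a minimal PyTorch sketch of a 5-conv + 3-FC network of this kind. It is an illustration written for this summary, not the authors' released code: the filter counts and kernel sizes follow the paper, but the two-GPU split is omitted, the conv1 padding is an implementation detail chosen so that a 224x224 input works, and the class name is made up.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Single-GPU sketch of the 5-conv + 3-FC network (two-GPU split omitted)."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            # conv1: 96 filters of 11x11, stride 4 (small padding so a 224x224
            # input yields a 55x55 map; the exact padding is an implementation detail)
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),           # overlapping max pooling
            # conv2: 256 filters of 5x5
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # conv3-conv5: 3x3 filters, no pooling in between
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),        # drop hidden activities of the first FC layer
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),        # drop hidden activities of the second FC layer
            nn.Linear(4096, num_classes),   # 1,000-way output layer (logits)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)          # (N, 256, 6, 6) for a 224x224 input
        x = torch.flatten(x, 1)
        return self.classifier(x)     # softmax over these gives the class distribution


if __name__ == "__main__":
    model = AlexNetSketch()
    scores = model(torch.randn(2, 3, 224, 224))
    print(scores.shape)               # torch.Size([2, 1000])
```

With a 224x224 RGB input, the convolutional part produces a 256x6x6 map, so the first fully-connected layer has 256*6*6 = 9,216 inputs; applying softmax to the final 1,000 scores yields the distribution over class labels mentioned above.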
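The crop-and-flip augmentation described above can be sketched in a few lines of NumPy. This is a hedged illustration: the H x W x C array layout, function name, and random generator are assumptions, not the authors' pipeline.

```python
import numpy as np

def random_crop_and_flip(image: np.ndarray, crop: int = 224, rng=None) -> np.ndarray:
    """Take a random crop x crop patch from an H x W x C image and flip it
    horizontally with probability 0.5 (the crop/flip scheme described above)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape                      # e.g. a 256 x 256 x 3 training image
    top = rng.integers(0, h - crop + 1)        # random top-left corner
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:                     # horizontal flip half the time
        patch = patch[:, ::-1, :]
    return patch

# Example: one augmented view of a dummy 256x256 RGB image
view = random_crop_and_flip(np.zeros((256, 256, 3), dtype=np.uint8))
print(view.shape)   # (224, 224, 3)
```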
Results
● ILSVRC-2010 test set: error-rate comparison of the ILSVRC-2010 winner, the previous best published result, and the proposed method
reference : http://image-net.org/challenges/LSVRC/2012/ilsvrc2012.pdf

Results
● ILSVRC-2012 results
● Runner-up: top-5 error rate 26.172%
● Proposed method: top-5 error rate 16.422%

Qualitative Evaluations

ILSVRC-2013 Classification
reference : http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf

ILSVRC-2014 Classification
● 22-layer and 19-layer networks

Conclusion
● A large, deep convolutional neural network for large-scale image classification was proposed
● 5 convolutional layers, 3 fully-connected layers
● 650,000 neurons, 60 million parameters
● Several techniques for boosting performance
● Several techniques for reducing overfitting
● The proposed method won ILSVRC-2012
● It achieved a winning top-5 error rate of 15.3%, compared to 26.2% for the second-best entry

Q&A

Quiz
● 1. The proposed method used hand-designed features, so there was no need to learn features and feature hierarchies. (True / False)
● 2. Which technique was not used in this paper?
① Dropout
② Rectified Linear Units nonlinearity
③ Training on multiple GPUs
④ Local contrast normalization

Appendix: Feature Visualization
● 96 learned low-level (1st-layer) filters

Appendix: Visualizing CNN
reference : M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.

Appendix: Local Response Normalization
● a^i_(x,y) : the activity of a neuron computed by applying kernel i at position (x, y)
● The response-normalized activity is given by
  b^i_(x,y) = a^i_(x,y) / ( k + α Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_(x,y))^2 )^β
● N : the total number of kernels in the layer
● n : hyper-parameter, n = 5
● k : hyper-parameter, k = 2
● α : hyper-parameter, α = 10^(-4)
● β : hyper-parameter, β = 0.75
● This aids generalization even though ReLUs do not require it
● This reduces the top-5 error rate by 1.2%

Appendix: Another Data Augmentation
● Alter the intensities of the RGB channels in training images
● Perform PCA on the set of RGB pixel values
● To each training image, add multiples of the found principal components
● To each RGB image pixel I_(x,y) = [I^R_(x,y), I^G_(x,y), I^B_(x,y)]^T, add the quantity
  [p1, p2, p3] [α1·λ1, α2·λ2, α3·λ3]^T
● p_i, λ_i : the i-th eigenvector and eigenvalue of the 3x3 covariance matrix of RGB pixel values
● α_i : a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1
● This reduces the top-1 error rate by over 1%
(A small sketch of this color jitter appears after the Details of Learning slide below.)

Appendix: Details of Learning
● Use stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005
● The update rule for weight w was
  v_(i+1) = 0.9·v_i − 0.0005·ε·w_i − ε·⟨∂L/∂w⟩_(D_i)
  w_(i+1) = w_i + v_(i+1)
● i : the iteration index
● v : the momentum variable
● ε : the learning rate, initialized at 0.01 and reduced three times prior to termination
● ⟨∂L/∂w⟩_(D_i) : the average over the i-th batch D_i of the derivative of the objective with respect to w
● Train for 90 cycles through the training set of 1.2 million images
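As a companion to the "Another Data Augmentation" slide above, here is a hedged NumPy sketch of the PCA-based color jitter: compute the eigenvectors and eigenvalues of the 3x3 RGB covariance over the training pixels, then add a random multiple of each principal component to every pixel. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def pca_color_stats(images: np.ndarray):
    """images: (N, H, W, 3) float array. Returns eigenvalues and eigenvectors
    of the 3x3 covariance matrix of RGB pixel values over the whole set."""
    pixels = images.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)        # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigvecs[:, i] is the i-th eigenvector p_i
    return eigvals, eigvecs

def pca_color_jitter(image: np.ndarray, eigvals, eigvecs, std: float = 0.1, rng=None):
    """Add [p1, p2, p3][α1·λ1, α2·λ2, α3·λ3]^T to every pixel, with α_i ~ N(0, std)."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = rng.normal(0.0, std, size=3)     # drawn once per training image
    shift = eigvecs @ (alphas * eigvals)      # 3-vector added to each RGB pixel
    return image + shift                      # broadcasts over H x W

# Example with random data standing in for the training set
imgs = np.random.rand(8, 32, 32, 3)
vals, vecs = pca_color_stats(imgs)
jittered = pca_color_jitter(imgs[0], vals, vecs)
```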
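And the update rule from the "Details of Learning" slide, written out as a single NumPy step. The function name and defaults are illustrative; the learning-rate schedule and per-layer details are not modeled.

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update of the rule on the slide:
        v <- momentum * v - weight_decay * lr * w - lr * grad
        w <- w + v
    where grad is the gradient of the objective averaged over the current batch D_i."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v

# Example: one step on a dummy weight vector
w = np.zeros(4)
v = np.zeros_like(w)
grad = np.array([0.1, -0.2, 0.0, 0.3])   # stand-in for a batch-averaged gradient
w, v = sgd_step(w, v, grad)
print(w)
```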