CS688/WST665
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Presenter: ByungIn Yoo

Contents
● Introduction
● Motivation
● Previous work
● Main Idea
● Details
● Experiments
● Conclusion

Introduction
● Web-scale image retrieval
● Classify images or videos
● Detect and localize objects
● Estimate semantic and geometrical attributes
● Why is this challenging?
  ● Viewpoint
  ● Illumination
  ● Occlusion
  ● Scale
  ● Deformation
  ● Background clutter

Motivation
● Current CNNs require a fixed input image size (e.g., 224 x 224).
[Figure: the image is cropped (content loss) or warped (distortion) to 224x224 before entering the convolutional neural network (CNN). The repeated version of this slide adds a "Spatial Pyramid Pooling" label to the diagram.]
● Recognition accuracy is degraded!

Previous work (1/2)
● Spatial Pyramid Matching - very successful in traditional computer vision
Grauman et al., "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features", ICCV 2005.
Lazebnik et al., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", CVPR 2006.

Previous work (2/2)
● Zeiler-Fergus architecture (2013, 1st place): 8 layers. Accuracy still low, and fixed image size.
● GoogLeNet (2014, 1st place): 22 layers. Model too complex, and fixed image size.
[Figure legend: convolution, pooling, softmax, other]
M. D. Zeiler et al., "Visualizing and Understanding Convolutional Neural Networks", arXiv:1311.2901, 2013.
C. Szegedy et al., "Going Deeper with Convolutions", arXiv:1409.4842, 2014.

Main Idea (1/2)
● Add a Spatial Pyramid Pooling layer!
[Figure: previous nets vs. SPP-net]

Main Idea (2/2)
● Generate a fixed-length representation regardless of image size/scale.
● Simple (still 8 layers) and powerful model!
● Variable input size/scale
● Multi-size training, multi-scale testing, full-image view
● Multi-level pooling
● Robust to deformation
● Operates on feature maps
● Pooling in regions

Details – Convolutional Layers and Feature Maps
● Inherently, the convolutional layers can accept images of arbitrary size.
● Feature maps encode not only the strength of the responses, but also their spatial positions.

Details – The Spatial Pyramid Pooling Layer
● SPP-net adds a new layer that performs spatial pyramid pooling.
● 256 x (4x4 + 2x2 + 1x1) = 5376-dimensional vector
[Figure: network stack Conv1 → Conv2 → Conv3 → Conv4 → Conv5 (256 filters) → SPP → FC6 → FC7 → SoftMax]

Details – Training with the Spatial Pyramid Pooling
● Single-size training
● Simply modify the configuration file of the CNN framework.
● Feature map: 13x13
[Figure: network stack Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → SPP → FC6 → FC7 → SoftMax]

Details – Training with the Spatial Pyramid Pooling
● Multi-size training
● Multiple networks sharing all weights
● Each network handles a single input size (e.g., 224x224, 180x180).
● Improves scale invariance
[Figure label: resize]

Details – Fast CNN-based Object Detection
● Features are computed from the entire image only once.
● Accuracy similar to R-CNN, but much faster (24x to 64x)
[Figure: 2000 convolutions (R-CNN) vs. 1 convolution (SPP-net)]

Experiments (1/4)
● ILSVRC image classification task
● 1000 object classes (1,431,167 images)

Experiments (2/4)
● ILSVRC image classification task (rank #3)
● SPP improves all CNN architectures.
[Tables: top-5 test accuracy; top-5 validation accuracy]

Experiments (3/4)
● ILSVRC image detection task
● 200 fully annotated object classes across 121,931 images
● Allows evaluation of generic object detection in cluttered scenes at scale
[Figure legend: detected region, ground truth; true and false detections]

Experiments (4/4)
● ILSVRC image detection task (rank #2)
● More practical than R-CNN

Conclusion
● SPP is a flexible solution for handling different scales, sizes, and aspect ratios.
● Spatial pyramid pooling improves accuracy.
● Multi-size training improves accuracy.
● Full-image representation improves accuracy.
● Classification: SPP improves all the CNN architectures compared in the experiments.
● Detection: more practical and much faster than R-CNN, with similar accuracy.
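
The 5376-dimensional vector quoted on the Spatial Pyramid Pooling Layer slide can be checked with a small sketch. The code below is a minimal illustration, not the authors' implementation: the function names `spatial_pyramid_pool` and `random_feature_map` are hypothetical, and the floor/ceil rule for bin boundaries is an assumption chosen so that every bin is non-empty. It max-pools a 256-channel feature map over 4x4, 2x2, and 1x1 grids and shows that a 13x13 and a 10x10 map give vectors of the same length.

```python
import math
import random


def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a [channel][row][col] nested list into one flat, fixed-length list."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    pooled = []
    for n in levels:  # each level is an n x n grid of bins
        for i in range(n):
            # Assumed bin rule: floor for the start, ceil for the end (bins may overlap).
            r0, r1 = (i * height) // n, math.ceil((i + 1) * height / n)
            for j in range(n):
                c0, c1 = (j * width) // n, math.ceil((j + 1) * width / n)
                for channel in feature_map:
                    # One value per channel per bin: the maximum response in the bin.
                    pooled.append(
                        max(channel[r][c] for r in range(r0, r1) for c in range(c0, c1))
                    )
    return pooled


def random_feature_map(channels, height, width, seed=0):
    """Hypothetical stand-in for a conv5 output; values are random, not real features."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


for size in (13, 10):
    vector = spatial_pyramid_pool(random_feature_map(256, size, size))
    print(f"{size}x{size} feature map -> {len(vector)} values")
# 13x13 feature map -> 5376 values
# 10x10 feature map -> 5376 values
```

The output length depends only on the number of channels and the pyramid levels, 256 x (16 + 4 + 1) = 5376, which is why the fully connected layers after the SPP layer can keep a fixed input size. In practice a framework's built-in adaptive pooling layer would likely replace the nested loops.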