Sparse 3D convolutional neural networks

Ben Graham
Department of Statistics and Centre of Complexity Science, University of Warwick
https://github.com/btgraham/SparseConvNet

Spatial Sparsity

◦ Convolutional neural networks (CNNs) can be used to understand data with spatial structure.
◦ CNNs are most often used for dense 2D data such as photos.
◦ The world is 3D! Lots of spatial data is 2D or 3D.
◦ Adding time to make space-time can result in 3- or 4-dimensional datasets.
◦ 3D CNNs have been used to analyze movement in 2+1 dimensional space-time [2] and to help drones find a safe place to land [3].
◦ Three-dimensional convolutional deep belief networks have been used to recognize objects in 2.5D depth maps [4].
◦ A CNN consists of layers: input, hidden, hidden, ..., hidden, output. Each layer is a grid of spatial locations, with a feature vector at each location.
◦ For sparse data, using a sparse CNN can improve efficiency. This is just like with matrices: sparse matrices can be multiplied more quickly if you store them in a sparse format.
◦ A spatially sparse CNN only stores/calculates the feature vectors at spatial locations that can see non-zero input. We call these active spatial locations.
◦ The feature vectors themselves are dense, so GPUs can be used efficiently.

Small filters and triangular/tetrahedral lattices

• To preserve sparsity, it is best to use small convolutional filters.
• In d dimensions, a 2 × 2 × · · · × 2 square/cubic convolutional filter covers 2^d sites.
• A size-two triangular/tetrahedral filter, in contrast, contains only d + 1 sites.
• The higher the dimension, the greater the potential benefit of allowing non-square graphs.
• Convolutional networks can be defined in an obvious way on triangular/tetrahedral/hypertetrahedral graphs. SparseConvNet supports these graphs.

3D object recognition

◦ The SHREC2015 Non-rigid 3D Shape Retrieval dataset contains 1200 exemplars, split evenly between 50 classes.
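The active-location bookkeeping from the Spatial Sparsity section can be made concrete with a minimal Python sketch. This is a toy 2D version with hypothetical helper names, not SparseConvNet's actual CUDA implementation: inactive sites are implicitly a shared "ground state" vector, and feature vectors are computed only at sites whose receptive field sees non-zero input.

```python
import numpy as np

def sparse_conv2d(active, weights, bias):
    """Sparse 'valid' 2x2 convolution (toy sketch, not SparseConvNet).

    `active` maps (x, y) -> feature vector; all other sites are
    implicitly zero.  `weights` has shape (2, 2, n_in, n_out).
    Output sites not returned all share the value `bias`.
    """
    out = {}
    # An output site is active if any input site in its 2x2
    # receptive field is active.
    for (x, y) in active:
        for dx in (0, 1):
            for dy in (0, 1):
                out[(x - dx, y - dy)] = None
    # Compute feature vectors only at the active output sites.
    for (x, y) in out:
        acc = bias.astype(float).copy()
        for dx in (0, 1):
            for dy in (0, 1):
                v = active.get((x + dx, y + dy))
                if v is not None:          # skip implicit zeros
                    acc = acc + weights[dx, dy].T @ v
        out[(x, y)] = acc
    return out

# A lone active pixel activates a 2x2 block of output sites.
rng = np.random.default_rng(0)
w = rng.normal(size=(2, 2, 3, 4))   # 2x2 filter, 3 -> 4 features
b = np.zeros(4)
out = sparse_conv2d({(5, 5): np.ones(3)}, w, b)
print(len(out))   # 4 active output sites
```

As in the poster's circle illustration, each convolution enlarges the active set slightly, so the per-layer cost stays proportional to the number of active sites rather than the full grid size.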
◦ Spatially sparse CNNs were introduced in [1] to process handwritten Chinese characters. Pen strokes are 1D objects embedded in a 2D space, and are therefore sparse.

To show how the set of active spatial locations changes as you travel through a CNN, we have drawn a pixellated circle and then applied alternating convolutional and pooling operations. The amount of sparsity decreases the deeper you go into the network. However, the early layers are the most computationally expensive, and there sparsity is preserved to a large extent.

Applications of 3 dimensional CNNs

◦ Objects are stored as 3D surface models composed of triangles.
◦ They are rendered into cubic and tetrahedral lattices.
◦ Items are randomly rotated in 3D to stop the task from being too easy.
◦ We used 6-fold cross-validation to measure the ability of sparse 3D CNNs to learn shape.

Convolutional filter shapes for different lattices: (i) a 4 × 4 square grid with a 2 × 2 convolutional filter; (ii) a triangular grid of size 4, with a triangular filter of size 2; (iii) a 3 × 3 × 3 cubic grid, with a 2 × 2 × 2 filter; (iv) a tetrahedral grid of size 3, with a filter of size 2.

Items from a 3D object dataset, embedded into a 40 × 40 × 40 cubic grid. Top row: four items from the snake class. Bottom row: an ant, an elephant, a robot and a tortoise.

3D CNNs and beyond

◦ Sparsity is a useful optimization in 2D, and it is potentially even more useful in 3D.
◦ The curse of dimensionality: an N × N × N cubic grid contains many more points (N^3) than an N × N square grid (N^2).

Action recognition

◦ Videos live in 2 + 1 dimensional space-time.
◦ Detecting actions (walking, jumping, ...) is thus a 3D classification problem.
◦ The pixel-wise difference between two successive video frames can be turned into a sparse 2D object by thresholding.
◦ Stacking these differences produces a sparse 3D representation of a video.
◦ 88.0% accuracy on the CVAP-RHA dataset of 6 actions.
◦ 67.8% accuracy on the UCF101 dataset of 101 actions.
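The frame-differencing construction for action recognition can be sketched in NumPy. The shapes and threshold here are illustrative, not the exact pipeline used for the results above:

```python
import numpy as np

def video_to_sparse(frames, threshold=0.1):
    """Turn a (T, H, W) grayscale video into a sparse set of active
    space-time sites by thresholding successive frame differences."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    # Stacking the sparse 2D difference images along the time axis
    # gives a sparse 3D representation of the video.
    t, y, x = np.nonzero(diffs > threshold)
    return set(zip(t.tolist(), y.tolist(), x.tolist()))

# A mostly static video: only one pixel changes between frames.
frames = np.zeros((3, 8, 8))
frames[1, 2, 2] = 1.0            # pixel flickers on in frame 1...
frames[2, 2, 2] = 0.0            # ...and off again in frame 2
active = video_to_sparse(frames)
print(sorted(active))   # [(0, 2, 2), (1, 2, 2)]
```

Static backgrounds vanish entirely under this transform, which is what makes the resulting 3D input sparse enough for a spatially sparse CNN to exploit.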
Left: a frame from the Recognizing Human Actions video dataset. Right: the thresholded difference.

◦ We have implemented CUDA code for sparse 2-, 3- and 4-dimensional data: SparseConvNet.

To show how sparsity operates in 3D, consider a trefoil knot drawn in the cubic lattice (left). Applying a 2 × 2 × 2 convolution increases the number of active/non-zero sites (middle). Applying a 2 × 2 × 2 pooling operation reduces the scale, which tends to decrease the number of active sites (right).

[Figure: accuracy versus computational complexity for a variety of different rendering sizes and lattice types (Tetrahedral MP3/2, Cubic MP3/2, Cubic FMP); x-axis: computational cost, millions of operations; y-axis: cross-validation error, %. Lines show the results for 1-, 2- and 3-fold testing with a given CNN.]

References

[1] Ben Graham. Spatially-sparse convolutional neural networks. 2014.
[2] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):221–231, January 2013.
[3] D. Maturana and S. Scherer. 3D convolutional neural networks for landing zone detection from LiDAR. In ICRA, 2015.
[4] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: a deep representation for volumetric shapes. In CVPR, June 2015.
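The behaviour shown in the trefoil-knot illustration — convolution grows the active set, strided pooling shrinks it — can be checked with a short set-based sketch. This is a simplification that tracks only site coordinates, whereas SparseConvNet also carries a feature vector per active site:

```python
import itertools

def conv_active(active):
    """2x2x2 'valid' convolution: an output site is active if any of
    the 8 input sites in its receptive field is active."""
    return {tuple(a - d for a, d in zip(site, delta))
            for site in active
            for delta in itertools.product((0, 1), repeat=3)}

def pool_active(active):
    """2x2x2 pooling with stride 2: halves the scale, merging
    neighbouring active sites into one coarse site."""
    return {tuple(a // 2 for a in site) for site in active}

# A diagonal line of active sites in a cubic grid (a stand-in for
# the trefoil knot: a thin 1D curve embedded in 3D).
line = {(i, i, i) for i in range(10)}
after_conv = conv_active(line)
after_pool = pool_active(after_conv)
print(len(line), len(after_conv), len(after_pool))
```

The counts grow under convolution and drop back under pooling, matching the left/middle/right panels of the illustration.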