A General Distributed Deep Learning Platform
SINGA Team
September 4, 2015 @ VLDB BOSS

Outline
• Part one
  – Background
  – SINGA
    • Overview
    • Programming Model
    • System Architecture
  – Experimental Study
  – Research Challenges
  – Conclusion
• Part two
  – Basic user guide
  – Advanced user guide

Background: Applications and Models
• Applications: image/video classification, acoustic modeling, natural language processing, machine translation, image caption generation, drug detection
  (The source website of each image can be found by clicking on it.)
• Models, grouped by connection type:
  – Directed acyclic connections: Convolutional Neural Network (CNN), Multi-Layer Perceptron (MLP), Auto-Encoder (AE)
  – Recurrent connections: Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM)
  – Undirected connections: Restricted Boltzmann Machine (RBM), Deep Boltzmann Machine (DBM), Deep Belief Network (DBN)

Background: Parameter Training
• Objective function
  – Loss functions of deep learning models are (usually) non-linear and non-convex, so there is no closed-form solution.
• Mini-batch Stochastic Gradient Descent (SGD)
  – Compute gradients
  – Update parameters
  – Training is time-consuming (e.g., about 1 week), which motivates distributed training.

Background: Distributed Training Frameworks
• Synchronous frameworks
  – Sandblaster of Google Brain [1]
  – AllReduce in Baidu’s DeepImage [2]
• Asynchronous frameworks
  – Distributed Hogwild in Caffe [4]
  – Downpour of Google Brain [1]
  (Diagrams of the worker/server topology of each framework are omitted.)

SINGA Design Goals
• General
  – Different categories of models (e.g., CNN, RNN, RBM, ...)
  – Different training frameworks (e.g., Hogwild, AllReduce, ...)
• Scalable
  – Scale to large models and training datasets, e.g., 1 billion parameters and 10M images
  – Model partition and data partition
• Easy to use
  – Simple programming model
  – Built-in models, Python binding, web interface
  – Little awareness of the underlying distributed platform is required
• Abstractions for extensibility, scalability, and usability

SINGA Overview
• Main components: ClusterTopology, NeuralNet, TrainOneBatch, Updater
• Users (or programmers) submit a job configuration for these 4 components.
• SINGA trains models over a worker-server architecture.
  – Workers compute parameter gradients
  – Servers perform parameter updates
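To make the division of work concrete, below is a minimal, self-contained Python sketch of the worker/server split described above. It is not SINGA code: the class and function names (NeuralNet, Updater, train_one_batch) are illustrative stand-ins for the configuration components named on this slide, and the toy linear model replaces a real neural network.

```python
# Conceptual sketch (not SINGA's API): a NeuralNet holds parameters,
# TrainOneBatch computes gradients on a worker, an Updater applies them
# on a server. All names here are illustrative.
import numpy as np

class NeuralNet:
    """Toy linear model standing in for a layered neural net."""
    def __init__(self, dim):
        self.w = np.zeros(dim)              # model parameters

    def gradient(self, x, y):
        # Gradient of a squared loss over one mini-batch (x: batch x dim).
        pred = x @ self.w
        return x.T @ (pred - y) / len(y)

class Updater:
    """Server-side parameter update rule (plain SGD here)."""
    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, param, grad):
        param -= self.lr * grad             # in-place update of the shared parameters

def train_one_batch(net, batch):
    """Worker-side step: compute gradients for one mini-batch."""
    x, y = batch
    return net.gradient(x, y)

# Driver loop: in SINGA, workers and servers are separate processes laid out
# by the ClusterTopology; here they are just two roles in one process.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
net, updater = NeuralNet(dim=2), Updater(lr=0.1)
for step in range(200):
    x = rng.normal(size=(32, 2))
    y = x @ true_w
    grad = train_one_batch(net, (x, y))     # worker role
    updater.update(net.w, grad)             # server role
print("learned parameters:", net.w)         # approaches [2, -1]
```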
SINGA Programming Model
(Figures illustrating the programming model are omitted.)
• Note: BP, BPTT, and CD have different computation flows, so each is implemented as a different TrainOneBatch.

Distributed Training
• Worker group
  – Loads a subset of the training data and computes gradients for a model replica
  – Workers within a group run synchronously
  – Different worker groups run asynchronously
• Server group
  – Maintains one ParamShard
  – Handles requests of multiple worker groups for parameter updates
  – Synchronizes with neighboring groups

Synchronous Frameworks
• Configuration for AllReduce (Baidu’s DeepImage)
  – 1 worker group
  – 1 server group
  – Co-locate workers and servers
• Configuration for Sandblaster
  – Separate worker and server groups
  – 1 worker group
  – 1 server group

Asynchronous Frameworks
• Configuration for Downpour
  – Separate worker and server groups
  – >= 1 worker groups
  – 1 server group
• Configuration for distributed Hogwild
  – 1 server group per process
  – >= 1 worker groups per process
  – Co-locate workers and servers
  – Share the ParamShard

Experiment: Training Performance (Synchronous)
• CIFAR10 dataset: 50K training and 10K test images
• Single-node setting: a 24-core server with 500 GB memory
  – “SINGA-dist”: OpenBLAS multi-threading disabled, with Sandblaster topology
• Multi-node setting: a 32-node cluster, each node with a quad-core Intel Xeon 3.1 GHz CPU and 8 GB memory, with AllReduce topology
(Performance figures are omitted.)

Experiment: Training Performance (Asynchronous)
• 60,000 training iterations
• Compared: Caffe on a single node, SINGA on a single node, and SINGA on a cluster of 32 nodes
(Performance figures are omitted.)

Challenge: Scalability
• Problem
  – How to reduce the wall time to reach a certain accuracy by increasing the cluster size?
• Key factors
  – Efficiency of one training iteration, i.e., time per iteration
  – Convergence rate, i.e., number of training iterations
• Solutions
  – Efficiency
    • Distributed training over one worker group
    • Hardware, e.g., GPU
  – Convergence rate
    • Numeric optimization
    • Multiple worker groups
  – Overhead of distributed training
    • Communication can easily become the bottleneck
    • Parameter compression [8,9], elastic averaging SGD [5], etc. (see the sketch below)
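As a rough illustration of the parameter-compression idea cited above, the sketch below quantizes a gradient to one bit per entry plus a single scale and carries the quantization error forward to the next step, in the spirit of 1-bit SGD [9]. It is a simplified stand-in, not code from SINGA or from [8,9], and the helper names are hypothetical.

```python
# Simplified sketch of sign-based gradient compression with error feedback,
# in the spirit of 1-bit SGD [9]. Illustrative only; names are hypothetical.
import numpy as np

def compress(grad, residual):
    """Quantize (grad + residual) to signs plus one shared scale."""
    g = grad + residual                    # add back error from the last step
    scale = np.mean(np.abs(g))             # one float shared by all entries
    bits = np.sign(g)                      # +1 / -1 per entry
    new_residual = g - scale * bits        # quantization error kept locally
    return bits, scale, new_residual

def decompress(bits, scale):
    return scale * bits

# Toy usage: a worker compresses its gradient before sending it to a server.
rng = np.random.default_rng(0)
residual = np.zeros(4)
for step in range(3):
    grad = rng.normal(size=4)              # pretend mini-batch gradient
    bits, scale, residual = compress(grad, residual)
    sent = decompress(bits, scale)         # what the server would receive
    print(step, np.round(grad, 3), "->", np.round(sent, 3))
```

Because the residual is fed back into the next step, little gradient information is lost over time, while the bytes sent per parameter drop sharply, easing the communication bottleneck noted above.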
Conclusion: Current Status

Feature Plan

References
[1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, pages 1232–1240, 2012.
[2] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. DeepImage: Scaling up image recognition. CoRR, abs/1501.02876, 2015.
[3] B. Recht, C. Re, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[5] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. CoRR, abs/1412.6651, 2014.
[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
[7] D. Jiang, G. Chen, B. C. Ooi, K. Tan, and S. Wu. epiC: An extensible and scalable system for processing big data. PVLDB, 7(7):541–552, 2014.
[8] M. Courbariaux, Y. Bengio, and J.-P. David. Low precision storage for deep learning. arXiv:1412.7024, 2014.
[9] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. 2014.
[10] C. Zhang and C. Re. DimmWitted: A study of main-memory statistical analytics. PVLDB, 7(12):1283–1294, 2014.
[11] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. 2015.