A General Distributed Deep Learning Platform
SINGA Team
September 4, 2015 @VLDB BOSS
Outline
• Part one
– Background
– SINGA
• Overview
• Programming Model
• System Architecture
– Experimental Study
– Research Challenges
– Conclusion
• Part two
– Basic user guide
– Advanced user guide
Background: Applications and Models
Applications: image/video classification (e.g., with CNNs), acoustic modeling, natural language processing, machine translation, image caption generation, and drug detection.
Models, grouped by their connection structure:
• Directed acyclic connections (feed-forward)
  – Convolutional Neural Network (CNN)
  – Multi-Layer Perceptron (MLP)
  – Auto-Encoder (AE)
• Recurrent connections
  – Recurrent Neural Network (RNN)
  – Long Short-Term Memory (LSTM)
• Undirected connections
  – Restricted Boltzmann Machine (RBM)
  – Deep Boltzmann Machine (DBM)
  – Deep Belief Network (DBN)
Background: Parameter Training
• Objective function
  – Loss functions of deep learning models are (usually) non-linear and non-convex → no closed-form solution.
• Mini-batch Stochastic Gradient Descent (SGD)
  – Compute gradients over a mini-batch
  – Update parameters
  – Training large models to convergence can take a long time (on the order of 1 week), which motivates distributed training; one SGD step is sketched below.
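For reference, a standard mini-batch SGD step (not specific to SINGA) can be written as follows, where η is the learning rate and B_t the mini-batch at step t:

    % Standard mini-batch SGD update
    \theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B_t|} \sum_{(x,y) \in B_t} \nabla_{\theta}\, \ell\big(f(x; \theta_t),\, y\big)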
Background: Distributed Training Frameworks
• Synchronous frameworks
  – Sandblaster of Google Brain [1]: one worker group driven by a coordinator, with parameters held on servers
  – AllReduce in Baidu's DeepImage [2]: workers aggregate gradients among themselves, with no separate server
• Asynchronous frameworks
  – Downpour of Google Brain [1]: workers push gradients to and pull parameters from servers asynchronously
  – Distributed Hogwild in Caffe [4]: workers update shared parameters without locking
A toy sketch contrasting the two styles is given below.
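To make the two styles concrete, here is a minimal, self-contained NumPy sketch (illustrative only, not code from any of the systems above); grad() is a placeholder for back-propagation on one mini-batch:

    # Toy contrast of synchronous vs. asynchronous parameter updates (illustrative only).
    import numpy as np

    def grad(params, batch):
        # Placeholder for back-propagation on one mini-batch.
        return 0.01 * params + batch.mean()

    lr = 0.1
    worker_batches = [np.random.randn(8) for _ in range(4)]   # one mini-batch per worker

    # Synchronous (AllReduce / Sandblaster style): all workers' gradients are
    # aggregated before a single update is applied to the shared parameters.
    params_sync = np.zeros(4)
    grads = [grad(params_sync, b) for b in worker_batches]
    params_sync -= lr * np.mean(grads, axis=0)

    # Asynchronous (Downpour / Hogwild style): each worker applies its own update
    # as soon as it is ready, without waiting for the others (simulated sequentially here).
    params_async = np.zeros(4)
    for b in worker_batches:
        params_async -= lr * grad(params_async, b)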
SINGA Design Goals
• General
  – Different categories of models (e.g., CNN, RNN, RBM, …)
  – Different training frameworks (e.g., Hogwild, AllReduce, …)
• Scalable
  – Scale to a large model and training dataset
    • e.g., 1 billion parameters and 10M images
  – Model partition and data partition (see the sketch below)
• Easy to use
  – Simple programming model
  – Built-in models, Python binding, web interface
  – Without much awareness of the underlying distributed platform
• Abstraction for extensibility, scalability and usability
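As a toy illustration of the difference between data partition and model partition, the following self-contained NumPy sketch splits either the mini-batch or the weight matrix of a single fully-connected layer across two hypothetical workers (illustrative only, not SINGA's partitioning code):

    # Toy illustration of data vs. model partitioning for one fully-connected layer.
    import numpy as np

    X = np.random.randn(8, 6)      # a mini-batch of 8 examples, 6 features
    W = np.random.randn(6, 4)      # layer weights

    # Data partition: split the mini-batch across two workers; each holds a full copy of W.
    X0, X1 = np.split(X, 2, axis=0)
    Y_data = np.vstack([X0 @ W, X1 @ W])

    # Model partition: split W (here by output columns) across two workers;
    # each worker sees the full mini-batch but only part of the model.
    W0, W1 = np.split(W, 2, axis=1)
    Y_model = np.hstack([X @ W0, X @ W1])

    # Both partitionings reproduce the unpartitioned result.
    assert np.allclose(Y_data, X @ W) and np.allclose(Y_model, X @ W)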
SINGA Overview
Main Components
• ClusterTopology
• NeuralNet
• TrainOneBatch
• Updater
• Users (or programmers) submit a job configuration for these four components (a hedged sketch of such a configuration is given below).
• SINGA trains models over the worker-server architecture:
  – Workers compute parameter gradients
  – Servers perform parameter updates
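For intuition, a job configuration bundles the four components above. The sketch below is a plain Python dict with hypothetical keys; SINGA's actual job configuration is protobuf-based and its real field names differ:

    # Hypothetical sketch of what a job configuration bundles; not SINGA's real format.
    job = {
        "cluster_topology": {"nworker_groups": 1, "nserver_groups": 1},
        "neuralnet": ["data", "convolution", "pooling", "softmax_loss"],   # layers
        "train_one_batch": "BP",            # or "BPTT" / "CD", depending on the model
        "updater": {"type": "SGD", "learning_rate": 0.01},
    }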
SINGA Programming Model
Note: BP (back-propagation), BPTT (back-propagation through time) and CD (contrastive divergence) have different computation flows, so each is implemented as a different TrainOneBatch; a hedged sketch is given below.
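To illustrate why these algorithms need separate TrainOneBatch implementations, the following self-contained NumPy sketch contrasts a BP-style step (forward, backward, update) for a logistic-regression layer with a CD-1 step for a tiny RBM. It is illustrative only and does not use SINGA's actual TrainOneBatch API:

    # Minimal illustration of two per-batch training flows. NOT SINGA code.
    import numpy as np

    rng = np.random.default_rng(0)
    lr = 0.1

    # BP flow: forward pass, backward pass, update (logistic regression).
    def train_one_batch_bp(W, x, y):
        p = 1.0 / (1.0 + np.exp(-x @ W))          # forward
        grad = x.T @ (p - y) / len(x)             # backward (cross-entropy gradient)
        return W - lr * grad                      # update

    # CD flow: Gibbs sampling instead of forward/backward (tiny RBM, CD-1;
    # probabilities are used instead of sampled states, for brevity).
    def train_one_batch_cd(W, v0):
        h0 = 1.0 / (1.0 + np.exp(-v0 @ W))        # positive phase
        v1 = 1.0 / (1.0 + np.exp(-h0 @ W.T))      # reconstruct visible units
        h1 = 1.0 / (1.0 + np.exp(-v1 @ W))        # negative phase
        grad = (v0.T @ h0 - v1.T @ h1) / len(v0)
        return W + lr * grad                      # update

    x = rng.standard_normal((8, 4)); y = rng.integers(0, 2, (8, 1)).astype(float)
    W_bp = train_one_batch_bp(rng.standard_normal((4, 1)), x, y)
    W_cd = train_one_batch_cd(rng.standard_normal((4, 3)), (x > 0).astype(float))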
Distributed Training
• Worker group
  – Loads a subset of the training data and computes gradients for its model replica
  – Workers within a group run synchronously
  – Different worker groups run asynchronously
• Server group
  – Maintains one ParamShard
  – Handles parameter-update requests from multiple worker groups
  – Synchronizes with neighboring server groups
Synchronous Frameworks
• Configuration for AllReduce (Baidu’s DeepImage)
– 1 worker group
– 1 server group
– Co-locate worker and server
Synchronous Frameworks
• Configuration for Sandblaster
– Separate worker and server groups
– 1 server group
– 1 worker group
Asynchronous Frameworks
• Configuration for Downpour
– Separate worker and server groups
– >=1 worker groups
– 1 server group
Asynchronous Frameworks
• Configuration of distributed Hogwild
  – 1 server group per process
  – >=1 worker groups per process
  – Co-locate worker and server
  – Share ParamShard
(The sketch below summarizes these four topologies.)
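For a side-by-side view, the four configurations above boil down to a few topology choices. The field names in this Python sketch only approximate SINGA's cluster configuration and should be read as assumptions:

    # Approximate summary of the four training-framework topologies (illustrative names).
    topologies = {
        "AllReduce (DeepImage)": {"nworker_groups": 1, "nserver_groups": 1,
                                  "separate_worker_server": False},
        "Sandblaster":           {"nworker_groups": 1, "nserver_groups": 1,
                                  "separate_worker_server": True},
        "Downpour":              {"nworker_groups": ">=1", "nserver_groups": 1,
                                  "separate_worker_server": True},
        "Distributed Hogwild":   {"nworker_groups": ">=1 per process",
                                  "nserver_groups": "1 per process",
                                  "separate_worker_server": False,
                                  "shared_param_shard": True},
    }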
Experiment: Training Performance (Synchronous)
CIFAR10 dataset: 50K training and 10K test images.
• Single-node setting
  – A 24-core server with 500GB memory
  – "SINGA-dist": OpenBLAS multi-threading disabled, with Sandblaster topology
• Multi-node setting
  – A 32-node cluster; each node has a quad-core Intel Xeon 3.1 GHz CPU and 8GB memory
  – With AllReduce topology
Experiment: Training Performance (Asynchronous)
• Training for 60,000 iterations.
• Compared settings: Caffe on a single node, SINGA on a single node, and SINGA on a cluster of 32 nodes.
Challenge: Scalability
• Problem
  – How to reduce the wall-clock time needed to reach a given accuracy by increasing the cluster size?
• Key factors
  – Efficiency of one training iteration, i.e., time per iteration
  – Convergence rate, i.e., number of training iterations
• Solutions
  – Efficiency
    • Distributed training over one worker group
    • Hardware, e.g., GPUs
  – Convergence rate
    • Numeric optimization
    • Multiple worker groups
  – Overhead of distributed training
    • Communication can easily become the bottleneck
    • Parameter compression [8,9], elastic averaging SGD [5], etc.; see the sketch below
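As one example of cutting communication overhead, sign-based gradient compression with error feedback (in the spirit of 1-bit SGD [9]) sends roughly one bit per gradient entry plus a scale, and keeps the quantization error locally. The following is a minimal NumPy sketch of the idea, not SINGA code and simplified relative to [9]:

    # Minimal sketch of sign-based gradient compression with error feedback.
    import numpy as np

    def compress(grad, residual):
        corrected = grad + residual              # add back previous quantization error
        scale = np.mean(np.abs(corrected))       # one scalar per tensor
        quantized = scale * np.sign(corrected)   # ~1 bit per entry plus the scale
        residual = corrected - quantized         # error kept locally for the next step
        return quantized, residual

    grad = np.random.randn(1000)
    residual = np.zeros_like(grad)
    quantized, residual = compress(grad, residual)
    # `quantized` is what a worker would send to the server instead of `grad`.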
Conclusion: Current status
Feature Plan
References
[1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, pages 1232–1240, 2012.
[2] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep Image: Scaling up image recognition. CoRR, abs/1501.02876, 2015.
[3] B. Recht, C. Re, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[5] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. CoRR, abs/1412.6651, 2014.
[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
[7] D. Jiang, G. Chen, B. C. Ooi, K. Tan, and S. Wu. epiC: an extensible and scalable system for processing big data. PVLDB, 7(7):541–552, 2014.
[8] M. Courbariaux, Y. Bengio, and J.-P. David. Low precision storage for deep learning. arXiv preprint arXiv:1412.7024, 2014.
[9] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. 2014.
[10] C. Zhang and C. Re. DimmWitted: A study of main-memory statistical analytics. PVLDB, 7(12):1283–1294, 2014.
[11] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. 2015.