PETUUM
A New Platform for Distributed Machine Learning on Big Data
Eric P. Xing, Qirong Ho
Wei Dai, Jin Kyu Kim
Jinliang Wei, Seunghak Lee
Xun Zheng, Pengtao Xie
Abhimanu Kumar, Yaoliang Yu
What they think…
• Machine Learning is becoming a primary mechanism for extracting
information from data.
• ML methods need to scale beyond a single machine.
• Flickr, Instagram and Facebook are anecdotally known to possess tens
of billions of images.
• It is highly inefficient to use such big data sequentially in a batch or
stochastic fashion in a typical iterative ML algorithm.
• Despite rapid development of many new ML models and algorithms
aiming at scalable application, adoption of these technologies
remains generally unseen in the wider data mining, NLP, vision, and
other application communities.
• Difficult migration from an academic implementation (small desktop
PCs, small lab clusters) to a big, less predictable platform (cloud or a
corporate cluster) prevents ML models and algorithms from being
widely applied.
Why build a new Framework…?
• Find a systematic way to efficiently apply a wide spectrum
of advanced ML programs to industrial scale problems,
using Big Models (up to 100s of billions of parameters) on
Big Data (up to terabytes or petabytes).
• Modern parallelization strategies employ fine-grained
operations and scheduling beyond the classic bulk-synchronous processing paradigm.
• Or even specialized graph-based execution that relies on
graph representations of ML programs.
• But it remains difficult to find a universal platform
applicable to a wide range of ML programs at scale.
Problems with other platforms
• Hadoop: the simplicity of its MapReduce abstraction makes it difficult to
exploit ML properties such as error tolerance, and its performance on
many ML programs has been surpassed by alternatives.
• Spark: does not offer fine-grained scheduling of computation
and communication, which is needed for fast and correct execution of advanced ML
algorithms.
• GraphLab and Pregel efficiently partition graph-based models; but
ML programs such as topic modeling and regression either do not
admit obvious graph representations, or a graph representation
may not be the most efficient choice.
• In summary, existing systems each manifest a unique tradeoff among
efficiency, correctness, programmability, and generality.
What they say…
Petuum in a nutshell…
• A distributed machine learning framework.
• Aims to provide a generic algorithmic and systems interface
to large scale machine learning.
• Takes care of difficult systems "plumbing work" and
algorithmic acceleration.
• Simplifies the distributed implementation of ML programs, allowing practitioners to focus on perfecting models and on Big Data
analytics.
• It runs efficiently at scale on research clusters and cloud
compute like Amazon EC2 and Google GCE.
Their Philosophy…
• Most ML programs are defined by an explicit objective function over
data (e.g., likelihood).
• The goal is to attain the optimality of this function, in the space
defined by the model parameters and other intermediate variables.
• Operational objectives such as fault tolerance and strong consistency
are not strictly necessary for reaching this goal.
• An ML program’s true goal is fast, efficient convergence to an optimal
solution.
• Petuum is built on an ML-centric optimization-theoretic principle, as
opposed to various operational objectives.
So how they built it…
• Formalized ML algorithms as iterative-convergent programs, e.g.:
  • stochastic gradient descent
  • MCMC for determining point estimates in latent variable models
  • coordinate descent and variational methods for graphical models
  • proximal optimization for structured sparsity problems, and others
• Identified the properties shared across all of these algorithms.
• The key lies in recognizing a clear dichotomy between DATA and MODEL.
• This inspired a bimodal approach to parallelism: data-parallel and
model-parallel distribution and execution of a big ML program over a
cluster of machines.
Data parallel and Model parallel approach
• This approach exploits the unique statistical nature of ML
algorithms, mainly three properties:
• Error tolerance – iterative-convergent algorithms are robust
against limited errors in intermediate calculations.
• Dynamic structural dependency – changing correlation strengths
between model parameters are critical to efficient parallelization.
• Non-uniform convergence – the number of steps required for a
parameter to converge can be highly skewed across parameters.
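
For instance, the non-uniform convergence property can be exploited by spending more computation on parameters that are still changing rapidly. A minimal Python/NumPy sketch of that idea (illustrative only; this is not Petuum's actual prioritizer, and pick_params_to_update is a hypothetical helper):

import numpy as np

def pick_params_to_update(A_curr, A_prev, k):
    # Rank parameters by how much they moved in the last iteration, and
    # spend the next round of computation on the k fastest-changing ones.
    recent_change = np.abs(A_curr - A_prev)
    return np.argsort(recent_change)[-k:]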
Parallelization Strategies
Principle formulation for Data & Model Parallelism
• Iterative-convergent ML algorithm: given data 𝒟 and model ℒ
(i.e., a fitness function such as likelihood), a typical ML problem can
be grounded as executing the following update equation iteratively,
until the model state (i.e., parameters and/or latent variables) A
reaches some stopping criterion:

$$A^{(t)} = F\left(A^{(t-1)},\ \Delta_{\mathcal{L}}\left(A^{(t-1)}, \mathcal{D}\right)\right)$$

where the superscript (t) denotes the iteration.
• The update function $\Delta_{\mathcal{L}}()$ (which improves the loss) performs
computation on data 𝒟 and model state A, and
• outputs intermediate results to be aggregated by F().
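
As a concrete (if simplified) illustration, here is a minimal Python/NumPy sketch of the iterative-convergent template; delta_L, the least-squares update, and the convergence test are illustrative assumptions, not Petuum code:

import numpy as np

def delta_L(A, D, lr=0.1):
    # Hypothetical Delta_L(): one gradient step of a least-squares loss on (X, y).
    X, y = D
    return -lr * X.T @ (X @ A - y) / len(y)

def fit(D, dim, max_iter=100, tol=1e-6):
    A = np.zeros(dim)                          # model state A^(0)
    for t in range(max_iter):
        update = delta_L(A, D)                 # Delta_L(A^(t-1), D)
        A_new = A + update                     # F() is simple addition here
        if np.linalg.norm(A_new - A) < tol:    # stopping criterion
            return A_new
        A = A_new
    return A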
Data Parallelism: in which data is divided across machines
Data 𝒟 is partitioned and assigned to computational workers.
Assumption: the function ∆() can be applied to each data subset
independently, yielding the update equation:

$$A^{(t)} = F\left(A^{(t-1)},\ \sum_{p=1}^{P} \Delta\left(A^{(t-1)}, \mathcal{D}_p\right)\right)$$

The ∆() outputs are aggregated via summation.
This additivity is crucial because CPUs can produce updates much faster than they
can be transmitted over the network.
Each parallel worker contributes “equally”.
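
A minimal sketch of this data-parallel scheme in Python, where threads stand in for cluster workers and delta is any update function with the signature above (an illustrative assumption, not the Petuum runtime):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def data_parallel_step(A, shards, delta):
    # Each worker p computes Delta(A^(t-1), D_p) on its own data shard.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        updates = list(pool.map(lambda D_p: delta(A, D_p), shards))
    # Additive aggregation: sum_p Delta(A^(t-1), D_p); F() is addition here.
    return A + np.sum(updates, axis=0)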
Model Parallelism: in which the ML model is divided across machines
Model A is partitioned and assigned to workers.
Unlike data parallelism, each update function ∆() also takes a
scheduling function $S_p^{(t-1)}()$, which restricts ∆() to operate on a
subset of the model parameters A:

$$A^{(t)} = F\left(A^{(t-1)},\ \left\{\Delta\left(A^{(t-1)},\ S_p^{(t-1)}\left(A^{(t-1)}\right)\right)\right\}_{p=1}^{P}\right)$$

Unlike data parallelism, the model parameters $A_j$ are not
independent of each other.
Hence, the definition of model parallelism includes a global scheduling
mechanism that selects carefully-chosen parameters for parallel
updating.
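
A minimal sketch of one model-parallel step with a round-robin scheduler (the scheduler and the delta signature are illustrative assumptions, not Petuum's actual scheduling mechanism):

import numpy as np

def schedule(A, p, P):
    # S_p^(t-1): give worker p a disjoint block of parameter indices.
    # A real scheduler would favor weakly-correlated, not-yet-converged params.
    return np.arange(len(A))[p::P]

def model_parallel_step(A, D, delta, P):
    A_new = A.copy()
    for p in range(P):                     # conceptually runs on P workers
        S_p = schedule(A, p, P)            # scheduling function restricts Delta()
        A_new[S_p] += delta(A, D, S_p)     # Delta(A^(t-1), S_p(A^(t-1)))
    return A_new                           # F() merges the partial updates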
Petuum System Design
Petuum Programming Interface
Petuum Program Structure
Performance
Petuum improves the performance of ML applications by making every
update or iteration more effective, without compromising on update
speed. More effective updates mean a faster ML completion time.
Petuum topic model (LDA)
Settings: 4.5GB dataset (8.2m docs, 737m tokens, 141k vocab, 1000
topics), 50 machines (800 cores), Petuum v1.0 vs YahooLDA, program
completion = reached -5.8e9 log-likelihood
Petuum sparse logistic regression
Settings: 29GB dataset (10m features, 50k samples), 8 machines (512
cores), Petuum v0.93 vs Shotgun Lasso, program completion = reached
0.5 loss function
Petuum multi-class logistic regression
Settings: 20GB dataset (253k samples, 21k features, 5.4b nonzeros, 1000
classes), 4 machines (256 cores), Petuum v1.0 vs Synchronous
parameter server, program completion = reached 0.0168 loss function
Some speed highlights
• Logistic regression: learn a 10m-dimensional model from 30GB of
sparse data in 20 minutes, on 8 machines with 16 cores each.
• LDA topic model: learn 1k topics on 8m documents (140k unique
words) in 17 minutes, on 25 machines with 16 cores each.
• Matrix Factorization (collaborative filtering): train on a 480k-by-20k
matrix with rank 40 in 2 minutes, on 25 machines with 16 cores each.
• Convolutional Neural Network built on Caffe: train AlexNet (60m
parameters) in under 24 hours, on 8 machines with a Tesla K20 GPU
each.
• MedLDA supervised topic model: learn 1k topics on 1.1m documents
(20 labels) in 85 minutes, on 20 machines with 12 cores each.
• Multiclass Logistic Regression: train on the MNIST dataset (19GB, 8m
samples, 784 features) in 6 minutes, on 8 machines with 16 cores
each.
What it doesn’t do
• Petuum is primarily about allowing ML practitioners to implement and
experiment with new data-/model-parallel ML algorithms on small-to-medium clusters.
• It lacks features that are necessary for clusters with ≥1000 machines,
such as automatic recovery from machine failure.
• Experiments focused on clusters with 10-100 machines.
Thoughts
• Highly efficient and fast for their target users.
• Good library with 10+ algorithms.
• Lacks features needed for ≥1000 machines, yet the experiments come
nowhere near even 500 machines; the largest uses only 50.
• Petuum is specifically designed for algorithms such as optimization
algorithms and sampling algorithms.
• So, not a silver bullet for all Big Data problems.
Questions?
Thank you
Nitin Saroha