PETUUM: A New Platform for Distributed Machine Learning on Big Data
Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu

What they think…
• Machine learning is becoming a primary mechanism for extracting information from data.
• ML methods need to scale beyond a single machine.
• Flickr, Instagram and Facebook are anecdotally known to possess tens of billions of images.
• It is highly inefficient to feed such big data sequentially, in a batch or stochastic fashion, to a typical iterative ML algorithm.
• Despite rapid development of many new ML models and algorithms aimed at scalable application, adoption of these technologies remains generally unseen in the wider data mining, NLP, vision, and other application communities.
• The difficult migration from an academic implementation (small desktop PCs, small lab clusters) to a big, less predictable platform (a cloud or a corporate cluster) prevents ML models and algorithms from being widely applied.

Why build a new framework…?
• Find a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial-scale problems, using Big Models (up to hundreds of billions of parameters) on Big Data (up to terabytes or petabytes).
• Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm.
• Some even use specialized graph-based execution that relies on graph representations of ML programs.
• But it remains difficult to find a universal platform applicable to a wide range of ML programs at scale.

Problems with other platforms
• Hadoop: the simplicity of its MapReduce abstraction makes it difficult to exploit ML properties such as error tolerance, and its performance on many ML programs has been surpassed by alternatives.
• Spark: does not offer fine-grained scheduling of computation and communication, which is needed for fast and correct execution of advanced ML algorithms.
• GraphLab and Pregel: efficiently partition graph-based models, but ML programs such as topic modeling and regression either do not admit obvious graph representations, or a graph representation may not be the most efficient choice.
• In summary, existing systems each manifest a unique tradeoff among efficiency, correctness, programmability, and generality.

What they say…

Petuum in a nutshell…
• A distributed machine learning framework.
• Aims to provide a generic algorithmic and systems interface to large-scale machine learning.
• Takes care of the difficult systems "plumbing work" and algorithmic acceleration.
• Simplifies the distributed implementation of ML programs, allowing users to focus on perfecting the model and on Big Data analytics.
• Runs efficiently at scale on research clusters and on cloud compute such as Amazon EC2 and Google GCE.

Their philosophy…
• Most ML programs are defined by an explicit objective function over data (e.g., a likelihood).
• The goal is to attain optimality of this function in the space defined by the model parameters and other intermediate variables.
• Operational objectives such as fault tolerance and strong consistency are not absolutely necessary; an ML program's true goal is fast, efficient convergence to an optimal solution.
• Petuum is therefore built on an ML-centric, optimization-theoretic principle, rather than on such operational objectives.
So how they built it…
• Formalized ML algorithms as iterative-convergent programs, e.g.:
  • stochastic gradient descent
  • MCMC for determining point estimates in latent variable models
  • coordinate descent, variational methods for graphical models
  • proximal optimization for structured-sparsity problems, and others
• Identified the properties shared across all of these algorithms.
• The key lies in recognizing a clear dichotomy between DATA and MODEL.
• This inspired a bimodal approach to parallelism: data-parallel and model-parallel distribution and execution of a big ML program over a cluster of machines.

Data-parallel and model-parallel approach
• This approach exploits the unique statistical nature of ML algorithms, mainly three properties:
  • Error tolerance – iterative-convergent algorithms are robust against limited errors in intermediate calculations.
  • Dynamic structural dependency – the changing correlation strengths between model parameters are critical to efficient parallelization.
  • Non-uniform convergence – the number of steps required for a parameter to converge can be highly skewed across parameters.

Parallelization Strategies

Principled formulation of data and model parallelism
• Iterative-convergent ML algorithm: given data $\mathcal{D}$ and a model objective $\mathcal{L}$ (i.e., a fitness function such as a likelihood), a typical ML problem can be grounded as executing the following update equation iteratively, until the model state (i.e., parameters and/or latent variables) $A$ reaches some stopping criterion:
  $A^{(t)} = F\big(A^{(t-1)}, \Delta_{\mathcal{L}}(A^{(t-1)}, \mathcal{D})\big)$, where the superscript $(t)$ denotes the iteration.
• The update function $\Delta_{\mathcal{L}}()$ (which improves the loss) performs computation on the data $\mathcal{D}$ and the model state $A$, and outputs intermediate results to be aggregated by $F()$.

Data parallelism, in which the data is divided across machines
• The data $\mathcal{D}$ is partitioned and assigned to computational workers.
• Assumption: the function $\Delta()$ can be applied to each data partition independently, yielding
  $A^{(t)} = F\big(A^{(t-1)}, \sum_{p=1}^{P} \Delta(A^{(t-1)}, \mathcal{D}_p)\big)$.
• The $\Delta()$ outputs are aggregated via summation; this additivity is crucial because CPUs can produce updates much faster than they can be transmitted over the network.
• Each parallel worker contributes "equally". (A minimal code sketch of this update pattern appears at the end of this section.)

Model parallelism, in which the ML model is divided across machines
• The model $A$ is partitioned and assigned to workers.
• Unlike data parallelism, each update function $\Delta()$ also takes a scheduling function $S_p^{(t-1)}()$, which restricts $\Delta()$ to operate on a subset of the model parameters $A$:
  $A^{(t)} = F\big(A^{(t-1)}, \{\Delta\big(A^{(t-1)}, S_p^{(t-1)}(A^{(t-1)})\big)\}_{p=1}^{P}\big)$.
• Unlike data, the model parameters $A_j$ are not independent of one another. Hence, the definition of model parallelism includes a global scheduling mechanism that selects carefully-chosen parameters for parallel updating. (A second sketch at the end of this section illustrates this pattern.)

Petuum System Design

Petuum Programming Interface

Petuum Program Structure

Performance
• Petuum improves the performance of ML applications by making every update or iteration more effective, without compromising update speed. More effective updates mean faster ML completion time.
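To make the data-parallel equation above concrete, here is a minimal, hypothetical sketch in Python/NumPy (not Petuum's actual C++ API): each simulated worker applies $\Delta()$ to its own data shard, and the outputs are aggregated by summation. The names `delta`, `F`, and `data_parallel_iteration`, the least-squares objective, and the step size are illustrative assumptions, not part of Petuum.

```python
# Hypothetical sketch (not Petuum's API): data-parallel updates on a
# least-squares model, following  A(t) = F(A(t-1), sum_p Delta(A(t-1), D_p)).
import numpy as np

def delta(A, X_p, y_p):
    """Delta(): per-shard update -- here, the gradient of the squared loss
    computed independently on one data partition D_p."""
    return X_p.T @ (X_p @ A - y_p) / len(y_p)

def F(A, summed_updates, step_size=0.1):
    """F(): aggregation -- apply the summed worker updates to the model state."""
    return A - step_size * summed_updates

def data_parallel_iteration(A, shards):
    """One iteration: every 'worker' runs Delta() on its own shard; the outputs
    are additive, so they can be summed before being applied to the model."""
    return F(A, sum(delta(A, X_p, y_p) for X_p, y_p in shards))

# Toy run: partition the data D across P = 4 simulated workers and iterate.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
true_A = rng.normal(size=10)
y = X @ true_A
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

A = np.zeros(10)
for t in range(200):
    A = data_parallel_iteration(A, shards)
print("distance to true parameters:", np.linalg.norm(A - true_A))
```

The additivity is what allows a real system to batch many locally produced updates before sending them over the network, as noted above.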
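Similarly, a minimal, hypothetical sketch of the model-parallel pattern: a scheduling function assigns each simulated worker a disjoint block of parameters, each worker computes its restricted update from the same previous state $A^{(t-1)}$, and $F()$ merges the blocks back into the model. The round-robin `schedule`, the block coordinate-descent `delta_p`, and the toy least-squares problem are assumptions made for illustration; Petuum's real scheduler additionally prioritizes parameters by dependency structure and convergence status.

```python
# Hypothetical sketch (not Petuum's API): model-parallel updates, following
#   A(t) = F(A(t-1), {Delta_p(A(t-1), S_p(A(t-1)))}_{p=1..P}).
import numpy as np

def schedule(A, num_workers, t):
    """S_p(): assign each worker a disjoint block of model parameters for this
    iteration (simple round-robin here; the real scheduler also accounts for
    dependencies and convergence status)."""
    blocks = np.array_split(np.arange(len(A)), num_workers)
    return [blocks[(p + t) % num_workers] for p in range(num_workers)]

def delta_p(A, block, X, y):
    """Delta_p(): an exact block coordinate-descent step for a least-squares
    model, restricted to the scheduled block with all other parameters fixed."""
    residual = y - X @ A + X[:, block] @ A[block]
    return block, np.linalg.lstsq(X[:, block], residual, rcond=None)[0]

def model_parallel_iteration(A, X, y, num_workers, t):
    """One iteration: every worker computes its restricted update from the same
    previous state A(t-1); F() then merges the blocks into the new state A(t)."""
    updates = [delta_p(A, block, X, y) for block in schedule(A, num_workers, t)]
    A_next = A.copy()
    for block, values in updates:
        A_next[block] = values   # F(): write each worker's block back
    return A_next

# Toy run: partition the model A across 4 simulated workers and iterate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
true_A = rng.normal(size=12)
y = X @ true_A

A = np.zeros(12)
for t in range(30):
    A = model_parallel_iteration(A, X, y, num_workers=4, t=t)
print("distance to true parameters:", np.linalg.norm(A - true_A))
```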
Petuum topic model (LDA)
Settings: 4.5GB dataset (8.2m docs, 737m tokens, 141k vocab, 1000 topics), 50 machines (800 cores); Petuum v1.0 vs. YahooLDA; program completion = reaching a log-likelihood of -5.8e9.

Petuum sparse logistic regression
Settings: 29GB dataset (10m features, 50k samples), 8 machines (512 cores); Petuum v0.93 vs. Shotgun Lasso; program completion = reaching a loss of 0.5.

Petuum multi-class logistic regression
Settings: 20GB dataset (253k samples, 21k features, 5.4b nonzeros, 1000 classes), 4 machines (256 cores); Petuum v1.0 vs. a synchronous parameter server; program completion = reaching a loss of 0.0168.

Some speed highlights
• Logistic regression: learn a 10m-dimensional model from 30GB of sparse data in 20 minutes, on 8 machines with 16 cores each.
• LDA topic model: learn 1k topics on 8m documents (140k unique words) in 17 minutes, on 25 machines with 16 cores each.
• Matrix factorization (collaborative filtering): train on a 480k-by-20k matrix with rank 40 in 2 minutes, on 25 machines with 16 cores each.
• Convolutional neural network built on Caffe: train AlexNet (60m parameters) in under 24 hours, on 8 machines with a Tesla K20 GPU each.
• MedLDA supervised topic model: learn 1k topics on 1.1m documents (20 labels) in 85 minutes, on 20 machines with 12 cores each.
• Multiclass logistic regression: train on the MNIST dataset (19GB, 8m samples, 784 features) in 6 minutes, on 8 machines with 16 cores each.

What it doesn't do
• Petuum is primarily about allowing ML practitioners to implement and experiment with new data-/model-parallel ML algorithms on small-to-medium clusters.
• It lacks features that are necessary for clusters with ≥ 1000 machines, such as automatic recovery from machine failure.
• The experiments focused on clusters with 10-100 machines.

Thoughts
• Highly efficient and fast for its target users.
• Good library with 10+ algorithms.
• Said to lack features only for clusters of ≥ 1000 machines, yet the experiments come nowhere near even 500 machines; the largest uses only 50.
• Petuum is specifically designed for certain classes of algorithms, such as optimization and sampling algorithms.
• So it is not a silver bullet for all Big Data problems.

Questions?

Thank you
Nitin Saroha