Parallel Machine Learning for Large-Scale Graphs
Danny Bickson
The GraphLab Team: Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Jay Gu, Carlos Guestrin, Joe Hellerstein, Alex Smola
Carnegie Mellon University

Parallelism is Difficult
Wide array of different parallel architectures: GPUs, multicore, clusters, clouds, supercomputers.
Each architecture poses different challenges.
High-level abstractions make things easier.

How will we design and implement parallel learning systems?
... a popular answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.

Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-parallel (Map-Reduce): feature extraction, cross validation, computing sufficient statistics.
Graph-parallel: Lasso, label propagation, belief propagation, kernel methods, tensor factorization, deep belief networks, PageRank, neural networks.

Example of Graph Parallelism: PageRank
Iterate: R[i] = α + (1 − α) Σ_{j ∈ N[i]} R[j] / L[j]
Where: α is the random reset probability and L[j] is the number of links on page j.
[Figure: small six-page web graph illustrating the iteration.]

Properties of Graph-Parallel Algorithms
Dependency graph, local updates, iterative computation.
[Figure: "my rank" depends on my friends' ranks.]

Addressing Graph-Parallel ML
We need alternatives to Map-Reduce.
Data-parallel (Map-Reduce): feature extraction, cross validation, computing sufficient statistics.
Graph-parallel (Map-Reduce? Pregel/Giraph?): SVM, Lasso, kernel methods, tensor factorization, deep belief networks, belief propagation, PageRank, neural networks.

Pregel (Giraph)
Bulk Synchronous Parallel model: compute, communicate, barrier.
Problem: bulk synchronous computation can be highly inefficient.

BSP Systems Problem: Curse of the Slow Job
[Figure: iterations across CPU 1, CPU 2, and CPU 3 exchanging data; every iteration ends at a barrier, so each iteration runs at the pace of the slowest job.]

The Need for a New Abstraction
If not Pregel, then what?
Data-parallel (Map-Reduce): feature extraction, cross validation, computing sufficient statistics.
Graph-parallel (Pregel/Giraph?): SVM, kernel methods, tensor factorization, deep belief networks, belief propagation, PageRank, neural networks, Lasso.

The GraphLab Solution
Designed specifically for ML needs: expresses data dependencies, iterative.
Simplifies the design of parallel programs: abstracts away hardware issues, automatic data synchronization.
Addresses multiple hardware architectures: multicore, distributed / cloud computing, GPU implementation in progress.

What is GraphLab?

The GraphLab Framework
Graph-based data representation, update functions (user computation), scheduler, consistency model.

Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
Graph:
• Social network
Vertex data:
• User profile text
• Current interests estimates
Edge data:
• Similarity weights

Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get neighborhood data (R[i], w_ij, R[j]) from scope
  // Update the vertex data
  R[i] = α + (1 − α) Σ_{j ∈ N[i]} w_ji R[j]
  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i)
}

Dynamic computation.

The Scheduler
The scheduler determines the order in which vertices are updated.
[Figure: CPU 1 and CPU 2 pull vertices (a, b, c, ...) from a shared queue of pending updates.]
The process repeats until the scheduler is empty.
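To make the update-function and scheduler slides concrete, here is a minimal single-threaded sketch of dynamic PageRank in plain C++. It is not the GraphLab API: the Graph struct, the worklist standing in for the scheduler, and the ALPHA/EPSILON constants are all illustrative assumptions.

// Minimal sketch of dynamic PageRank with a worklist "scheduler".
// Illustrative only -- not the GraphLab API. ALPHA/EPSILON are assumed constants.
#include <cmath>
#include <cstdio>
#include <deque>
#include <vector>

const double ALPHA = 0.15;    // random reset probability
const double EPSILON = 1e-4;  // reschedule threshold

struct Graph {
    std::vector<double> rank;                 // vertex data R[i]
    std::vector<std::vector<int>> in_nbrs;    // in-neighbors of each vertex
    std::vector<int> out_degree;              // L[j]: number of links on page j
};

// "Update function": recompute R[i] from its neighborhood and
// reschedule neighbors if R[i] changed significantly (dynamic computation).
void pagerank_update(Graph& g, int i, std::deque<int>& scheduler, std::vector<char>& queued) {
    double sum = 0.0;
    for (int j : g.in_nbrs[i]) sum += g.rank[j] / g.out_degree[j];
    double new_rank = ALPHA + (1.0 - ALPHA) * sum;
    double change = std::fabs(new_rank - g.rank[i]);
    g.rank[i] = new_rank;
    if (change > EPSILON) {
        // Reschedule the out-neighbors of i; here we find them by scanning
        // every vertex that lists i as an in-neighbor.
        for (int v = 0; v < (int)g.rank.size(); ++v)
            for (int j : g.in_nbrs[v])
                if (j == i && !queued[v]) { scheduler.push_back(v); queued[v] = 1; }
    }
}

int main() {
    // Tiny 3-page example: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1.
    Graph g;
    g.rank = {1.0, 1.0, 1.0};
    g.in_nbrs = {{2}, {0, 2}, {1}};
    g.out_degree = {1, 1, 2};

    std::deque<int> scheduler = {0, 1, 2};    // start with all vertices scheduled
    std::vector<char> queued(3, 1);
    while (!scheduler.empty()) {              // repeat until the scheduler is empty
        int i = scheduler.front(); scheduler.pop_front();
        queued[i] = 0;
        pagerank_update(g, i, scheduler, queued);
    }
    for (int i = 0; i < 3; ++i) std::printf("R[%d] = %.4f\n", i, g.rank[i]);
}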
The GraphLab Framework
Graph-based data representation, update functions (user computation), scheduler, consistency model.

Ensuring Race-Free Code
How much can computation overlap?

Need for Consistency?
No consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.
[Plot: ALS on Netflix data, 8 cores — train RMSE vs. updates (millions), dynamic inconsistent vs. dynamic consistent execution.]

Even Simple PageRank Can Be Dangerous

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1-ALPHA) * sum
  ...

Read-write race: CPU 1 reads a bad PageRank estimate while CPU 2 is still computing the value.
[Figure: inconsistent PageRank — a vertex value read while it is unstable.]

Race Condition Can Be Very Subtle

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1-ALPHA) * sum
  ...

GraphLab_pagerank(scope) {
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1-ALPHA) * sum
  scope.center_value = sum
  ...

The two versions differ only in where the running sum lives: the first accumulates directly into the shared vertex value (through the ref), so concurrently running neighbors can observe a partially computed sum; the second accumulates into a local variable and writes the vertex value once at the end. This was actually encountered in user code.

GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: a parallel schedule on CPU 1 and CPU 2 and an equivalent sequential schedule on a single CPU.]

Consistency Rules
Guaranteed sequential consistency for all update functions.

Full Consistency

Obtaining More Parallelism: Edge Consistency
[Figure: under edge consistency, CPU 1 and CPU 2 can safely update nearby vertices whose scopes overlap only at reads.]

The GraphLab Framework
Graph-based data representation, update functions (user computation), scheduler, consistency model.

What algorithms are implemented in GraphLab?
Alternating least squares, CoEM, Lasso, SVD, Splash sampler, Bayesian tensor factorization, belief propagation, PageRank, LDA, Gibbs sampling, dynamic block Gibbs sampling, K-means, linear solvers, SVM, matrix factorization, ...many others...

GraphLab Libraries
Matrix factorization: SVD, PMF, BPTF, ALS, NMF, sparse ALS, weighted ALS, SVD++, time-SVD++, SGD
Linear solvers: Jacobi, GaBP, Shotgun Lasso, sparse logistic regression, CG
Clustering: K-means, fuzzy K-means, LDA, K-core decomposition
Inference: discrete BP, NBP, kernel BP

Efficient Multicore Collaborative Filtering
LeBuSiShu team – 5th place in track 1
Yao Wu, Qiang Yan, Qing Yang (Institute of Automation, Chinese Academy of Sciences); Danny Bickson, Yucheng Low (Machine Learning Dept., Carnegie Mellon University)
ACM KDD CUP Workshop 2011

ACM KDD CUP 2011
• Task: predict music score
• Two main challenges:
• Data magnitude – 260M ratings
• Taxonomy of data
[Figure: data taxonomy.]

Our approach
• Use ensemble method
• Custom SGD algorithm for handling taxonomy

Ensemble method
• Solutions are merged using linear regression

Performance results
Blended validation RMSE: 19.90

Classical Matrix Factorization
[Figure: sparse Users × Items rating matrix approximated by d-dimensional user and item feature vectors.]
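As a toy illustration of the classical matrix factorization model above, the sketch below fits d-dimensional user and item factors with plain stochastic gradient descent. It is not the team's taxonomy-aware SGD nor GraphLab's matrix factorization library code; the ratings, learning rate, and regularization constant are made-up values.

// Toy matrix factorization by SGD: r_ui ~ dot(P[u], Q[i]).
// Illustrative sketch only; not the KDD CUP taxonomy-aware algorithm.
#include <cmath>
#include <cstdio>
#include <vector>

struct Rating { int user, item; double value; };

int main() {
    const int num_users = 3, num_items = 3, d = 2;   // d = latent dimension
    const double lr = 0.05, reg = 0.02;              // learning rate, regularization
    std::vector<Rating> ratings = {
        {0, 0, 5}, {0, 1, 3}, {1, 1, 4}, {1, 2, 1}, {2, 0, 4}, {2, 2, 2}};

    // Factor matrices, initialized to small constants for reproducibility.
    std::vector<std::vector<double>> P(num_users, std::vector<double>(d, 0.1));
    std::vector<std::vector<double>> Q(num_items, std::vector<double>(d, 0.1));

    for (int epoch = 0; epoch < 200; ++epoch) {
        double sse = 0.0;
        for (const Rating& r : ratings) {
            double pred = 0.0;
            for (int k = 0; k < d; ++k) pred += P[r.user][k] * Q[r.item][k];
            double err = r.value - pred;
            sse += err * err;
            for (int k = 0; k < d; ++k) {            // SGD step on both factors
                double pu = P[r.user][k], qi = Q[r.item][k];
                P[r.user][k] += lr * (err * qi - reg * pu);
                Q[r.item][k] += lr * (err * pu - reg * qi);
            }
        }
        if (epoch % 50 == 0)
            std::printf("epoch %3d  train RMSE %.4f\n", epoch, std::sqrt(sse / ratings.size()));
    }
}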
MFITR
[Figure: the sparse rating matrix is factorized with separate artist features, album features, and item-specific features, combined into a d-dimensional "effective feature of an item".]
Intuitively, features of an artist and features of his/her album should be "similar". How do we express this?
• Penalty terms which ensure artist/album/track features are "close"
• Strength of penalty depends on "normalized rating similarity" (see neighborhood model)

Fine Tuning Challenge
Dataset has around 260M observed ratings; 12 different algorithms with 53 tunable parameters in total.
How do we train and cross validate all these parameters? USE GRAPHLAB!
[Plots: runtime and speedup on 16 cores.]

Who is using GraphLab?
Universities using GraphLab
Startups using GraphLab
Companies trying out GraphLab
2400+ unique downloads tracked (possibly many more from direct repository checkouts)
User community

Performance results: GraphLab vs. Pregel (BSP)
Multicore PageRank (25M vertices, 355M edges).
[Plots: L1 error vs. runtime (s) and vs. number of updates for GraphLab and for Pregel simulated via GraphLab; histogram of updates per vertex.]
51% of vertices are updated only once.

CoEM (Rosie Jones, 2005)
Named entity recognition task: Is "Dog" an animal? Is "Catalina" a place?
Vertices: 2 million. Edges: 200 million.
[Figure: bipartite graph between noun phrases ("the dog", "Catalina Island") and contexts ("<X> ran quickly", "Australia travelled to <X>", "<X> is pleasant").]
Hadoop: 95 cores, 7.5 hrs.

CoEM (Rosie Jones, 2005)
[Plot: speedup vs. number of CPUs; GraphLab is close to optimal.]
Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min.
GraphLab CoEM is 15x faster with 6x fewer CPUs!

GraphLab in the Cloud

CoEM (Rosie Jones, 2005)
[Plot: speedup vs. number of CPUs for the small and large problem sizes.]
Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. GraphLab in the Cloud: 32 EC2 machines, 80 secs — 0.3% of the Hadoop time.

Cost-Time Tradeoff
Video co-segmentation results: a few machines help a lot; then diminishing returns — more machines, higher cost.

Netflix Collaborative Filtering
Alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges.
[Plots: speedup vs. number of nodes for d = 100 (30M cycles), d = 50 (7.7M cycles), d = 20 (2.1M cycles), d = 5 (1.0M cycles); runtime (s) vs. number of nodes for Hadoop, MPI, and GraphLab.]

Multicore Abstraction Comparison
[Plot: Netflix matrix factorization, log test error vs. updates — dynamic scheduling vs. round robin.]
Dynamic computation, faster convergence.

The Cost of Hadoop
[Plots: cost ($) vs. runtime (s) for Hadoop and GraphLab; cost ($) vs. error (RMSE) for d = 100, 50, 20, 5.]

Fault Tolerance
Larger problems mean an increased chance of machine failure.
GraphLab2 introduces two fault tolerance (checkpointing) mechanisms: synchronous snapshots and Chandy-Lamport asynchronous snapshots.

Synchronous Snapshots
Timeline: run GraphLab → barrier + snapshot → run GraphLab → barrier + snapshot → run GraphLab → ...

Curse of the Slow Machine
[Plot: vertices updated vs. time elapsed (s) — no snapshot vs. synchronous snapshot.]
If one machine is delayed, every machine waits at the barrier + snapshot step.
[Plot: same experiment with a delayed machine — the delayed synchronous snapshot stalls overall progress.]
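A minimal sketch of the synchronous snapshot pattern, assuming plain C++20 threads and std::barrier rather than GraphLab's distributed implementation: every worker must reach the barrier before the snapshot is taken, so a single slow worker delays the whole checkpoint.

// Sketch of the synchronous snapshot pattern: compute phases separated by
// barrier + checkpoint. Plain C++20 (compile with -std=c++20 -pthread);
// not GraphLab's implementation.
#include <barrier>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int num_workers = 4, num_phases = 3;
    std::vector<long> work_done_ms(num_workers, 0);  // per-worker progress counters

    // Completion function runs once per phase, after ALL workers arrive:
    // this is where the snapshot would be written to stable storage.
    std::barrier sync(num_workers, []() noexcept {
        std::printf("barrier reached by all workers -> write snapshot\n");
    });

    auto worker = [&](int id) {
        for (int phase = 0; phase < num_phases; ++phase) {
            // Simulate a chunk of update-function work; worker 0 is the "slow
            // machine", and every other worker must wait for it at the barrier.
            int work_ms = (id == 0) ? 300 : 100;
            std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
            work_done_ms[id] += work_ms;
            sync.arrive_and_wait();                  // Barrier + Snapshot
        }
    };

    std::vector<std::thread> threads;
    for (int id = 0; id < num_workers; ++id) threads.emplace_back(worker, id);
    for (auto& t : threads) t.join();
    for (int id = 0; id < num_workers; ++id)
        std::printf("worker %d did %ld ms of work\n", id, work_done_ms[id]);
}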
Asynchronous Snapshots
The Chandy-Lamport algorithm is implementable as a GraphLab update function! Requires edge consistency.

struct chandy_lamport {
  void operator()(icontext_type& context) {
    save(context.vertex_data());
    foreach (edge_type edge, context.in_edges()) {
      if (edge.source() was not marked as saved) {
        save(context.edge_data(edge));
        context.schedule(edge.source(), chandy_lamport());
      }
    }
    // ... repeat for context.out_edges()
    // mark context.vertex() as saved
  }
};

Snapshot Performance
[Plot: vertices updated vs. time elapsed (s) — no snapshot vs. synchronous snapshot vs. asynchronous snapshot.]

Snapshot with 15s Fault Injection
Halt 1 out of 16 machines for 15s.
[Plot: vertices updated vs. time elapsed (s) — no snapshot vs. synchronous snapshot vs. asynchronous snapshot under the fault injection.]

New Challenges

Natural Graphs: Power Law
Yahoo! web graph: 1.4B vertices, 6.7B edges.
Top 1% of vertices is adjacent to 53% of the edges!
[Plot: log-log degree distribution — vertex count vs. degree.]

Problem: High Degree Vertices
High degree vertices limit parallelism: they touch a large amount of state, require heavy locking, and are processed sequentially.
High communication in distributed updates: when gather and scatter are split across machines, data from the vertex's many neighbors is transmitted separately across the network.
[Figure: a high-degree vertex whose neighbors span machine 1 and machine 2.]

High Degree Vertices are Common
Popular movies (Netflix users–movies graph), "social" people (celebrities in social networks), hyperparameters in graphical models, and common words such as "Obama" in LDA (docs–words graph).
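The degree concentration described on the Yahoo! web graph slide ("top 1% of vertices is adjacent to 53% of the edges") can be checked on any edge list with a few lines of code. The sketch below is illustrative only and unrelated to GraphLab's internals; it measures the share of edge endpoints owned by the top 1% highest-degree vertices, read from a hypothetical "src dst" edge list on stdin.

// Measure degree concentration in a graph: what fraction of edge endpoints
// belong to the top 1% highest-degree vertices? Reads "src dst" pairs from stdin.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<long, long> degree;     // vertex id -> degree
    long src, dst, num_edges = 0;
    while (std::scanf("%ld %ld", &src, &dst) == 2) {
        ++degree[src];
        ++degree[dst];
        ++num_edges;
    }
    if (degree.empty()) return 0;

    std::vector<long> degs;
    degs.reserve(degree.size());
    for (const auto& kv : degree) degs.push_back(kv.second);
    std::sort(degs.begin(), degs.end(), std::greater<long>());

    // Top 1% of vertices by degree (at least one vertex).
    size_t top = std::max<size_t>(1, degs.size() / 100);
    long covered = 0;
    for (size_t i = 0; i < top; ++i) covered += degs[i];

    // Each edge contributes two endpoints, so there are 2 * num_edges endpoints.
    std::printf("vertices: %zu, edges: %ld\n", degs.size(), num_edges);
    std::printf("top 1%% of vertices account for %.1f%% of edge endpoints\n",
                100.0 * covered / (2.0 * num_edges));
}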
Two Core Changes to Abstraction
Factorized update functors: monolithic updates are decomposed into Gather, Apply, and Scatter phases.
Delta update functors: monolithic updates become composable update "messages": f1, f2, (f1 ∘ f2)(Y).

Decomposable Update Functors
User defined: Gather(Y, edge) → Δ, accumulated over the scope with a parallel (commutative-associative) sum: Δ1 + Δ2 + Δ3 + ...
User defined: Apply(Y, Δ) → Y — apply the accumulated value to the center vertex.
User defined: Scatter(Y, edge) — update adjacent edges and vertices.
Locks are acquired only for the region within a scope: relaxed consistency.

Factorized PageRank

double gather(scope, edge) {
  return edge.source().value().rank /
         scope.num_out_edge(edge.source())
}

double merge(acc1, acc2) { return acc1 + acc2 }

void apply(scope, accum) {
  old_value = scope.center_value().rank
  scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
  scope.center_value().residual =
      abs(scope.center_value().rank - old_value)
}

void scatter(scope, edge) {
  if (scope.center_value().residual > EPSILON)
    reschedule(edge.target())
}

Factorized Updates: Significant Decrease in Communication
Split gather and scatter across machines, composing the partial results (F1 ∘ F2)(Y): only a small amount of data is transmitted over the network.

Factorized Consistency
Neighboring vertices may be updated simultaneously: A and B can both be gathering at the same time.

Factorized Consistency Locking
A gather on an edge cannot occur during an apply: vertex B gathers on its other neighbors while A is performing its apply.

Decomposable Loopy Belief Propagation
Gather: accumulates the product of in-messages.
Apply: updates the central belief.
Scatter: computes out-messages and schedules adjacent vertices.

Decomposable Alternating Least Squares (ALS)
Netflix users × movies, with user factors (W) and movie factors (X) and ratings ≈ w_i · x_j.
Gather: sum terms.
Apply: matrix inversion & multiply.

Comparison of Abstractions
Multicore PageRank (25M vertices, 355M edges).
[Plot: L1 error vs. runtime (s) — GraphLab1 vs. factorized updates.]

Need for Vertex-Level Asynchrony
A single change triggers a costly full gather!
Exploit the commutative-associative "sum": a change on a neighbor sends +Δ to the target vertex instead of forcing a re-gather.

Delta Updates: Vertex-Level Asynchrony
The incoming Δ is added to the old (cached) sum at the target vertex.

Delta Update

void update(scope, delta) {
  scope.center_value() = scope.center_value() + delta
  if (abs(delta) > EPSILON) {
    out_delta = delta * (1 - ALPHA) /
                scope.num_out_edges()
    reschedule_out_neighbors(out_delta)
  }
}

double merge(delta1, delta2) { return delta1 + delta2 }

Program starts with: schedule_all(ALPHA)

Multicore Abstraction Comparison
Multicore PageRank (25M vertices, 355M edges).
[Plot: L1 error vs. runtime (s) — delta updates, factorized updates, GraphLab 1, and simulated Pregel.]

Distributed Abstraction Comparison
Distributed PageRank (25M vertices, 355M edges).
[Plots: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine) — GraphLab1 vs. GraphLab2 (delta updates).]

PageRank on the Altavista Webgraph (2002)
1.4B vertices, 6.7B edges.
Hadoop: 800 cores, 9000 s. Prototype GraphLab2: 512 cores, 431 s.
Known inefficiencies: a further 2x gain is possible.
Summary of GraphLab2
Decomposed update functions: expose parallelism in high-degree vertices (Gather → Apply → Scatter).
Delta update functions: expose asynchrony in high-degree vertices.

Lessons Learned
Machine learning: asynchronous execution is often much faster than synchronous, and dynamic computation is often faster still; however, it can be difficult to define optimal thresholds (science to do!). Consistency can improve performance and is sometimes required for convergence, though there are cases where relaxed consistency is sufficient.
System: distributed asynchronous systems are harder to build, but having no distributed barriers means better scalability and performance. Scaling up by an order of magnitude requires rethinking design assumptions, e.g., the distributed graph representation. High-degree vertices and natural graphs can limit parallelism and require further assumptions on update functions.

Summary
An abstraction tailored to machine learning: targets graph-parallel algorithms; naturally expresses data/computational dependencies and dynamic iterative computation; simplifies parallel algorithm design; automatically ensures data consistency; achieves state-of-the-art parallel performance on a variety of problems.

Parallel GraphLab 1.1 (multicore) is available today. GraphLab2 (in the cloud) soon...
Documentation... Code... Tutorials... http://graphlab.org
Carnegie Mellon