GraphX: Unifying Data-Parallel and Graph-Parallel Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica Strata 2014 *These slides are best viewed in PowerPoint with anima Graphs are Central to Analytics Hyperlinks Raw Wikipedia XML Title PR Text Table Title Body <</ />> </> PageRank Top 20 Page Term-Doc Graph Topic Model (LDA) Word Topics Word Topic Discussion Table User Disc. Community User Editor Graph Detection Community User Com. Community Topic Topic Com. PageRank: Identifying Leaders Rank of user i Weighted sum of neighbors’ ranks Update ranks in parallel Iterate until convergence 3 The Graph-Parallel Pattern Model / Alg. State Computation depends only on the neighbors 4 Many Graph-Parallel Algorithms • Collaborative Filtering – CoEM – Alternating Least Squares • Community Detection – Stochastic Gradient – Triangle-Counting Descent – K-core Decomposition – Tensor Factorization – K-Truss • Structured Prediction – Loopy Belief Propagation – Max-Product Linear Programs – Gibbs Sampling • Semi-supervised ML – Graph SSL • Graph Analytics – – – – PageRank Personalized PageRank Shortest Path Graph Coloring • Classification – Neural Networks 5 Graph-Parallel Systems oogle Expose specialized APIs to simplify graph programming. Exploit graph structure to achieve orders-of-magnitude performance gains over more general 6 PageRank on the Live-Journal Graph Mahout/Hadoop 1340 354 Naïve Spark GraphLab 22 0 200 400 600 800 1000 1200 1400 1600 Runtime (in seconds, PageRank for 10 iterations) GraphLab is 60x faster than Hadoop GraphLab is 16x faster than Spark Graphs are Central to Analytics Hyperlinks Raw Wikipedia XML Title PR Text Table Title Body <</ />> </> PageRank Top 20 Page Term-Doc Graph Topic Model (LDA) Word Topics Word Topic Discussion Table User Disc. Community User Editor Graph Detection Community User Com. Community Topic Topic Com. Separate Systems to Support Each View Table View Graph View Table Dependency Graph Row Row Row Row Resul t Having separate systems for each view is difficult to use and inefficient 10 Difficult to Program and Use Users must Learn, Deploy, and Manage multiple systems Leads to brittle and often complex interfaces 11 Inefficient Extensive data movement and duplication across the network and file system <</ />> </> XML HDFS HDFS HDFS HDFS Limited reuse internal data-structures across stages 12 Solution: The GraphX Unified Approach New API New System Blurs the distinction between Tables and Graphs Combines Data-Parallel Graph-Parallel Systems Enabling users to easily and efficiently express the entire graph analytics pipeline Tables and Graphs are composable views of the same physical data Table View GraphX Unified Representation Graph View Each view has its own operators that exploit the semantics of the view to achieve efficient execution View a Graph as a Table Vertex Property Table Property Graph R F Id Property (V) Rxin (Stu., Berk.) Jegonzal (PstDoc, Berk.) Franklin (Prof., Berk) Istoica (Prof., Berk) Edge Property Table J I SrcId DstId Property (E) rxin jegonzal Friend franklin rxin Advisor istoica franklin Coworker franklin jegonzal PI Table Operators Table (RDD) operators are inherited from Spark: map reduce sample filter count take groupBy fold first sort reduceByKey partitionBy union groupByKey mapWith join cogroup pipe leftOuterJoin cross save rightOuterJoin zip ... 16 Graph Operators class Graph [ V, E ] { def Graph(vertices: Table[ (Id, V) ], edges: Table[ (Id, Id, E) ]) // Table Views ----------------def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ] def triplets: Table [ ((Id, V), (Id, V), E) ] // Transformations -----------------------------def reverse: Graph[V, E] def subgraph(pV: (Id, V) => Boolean, pE: Edge[V,E] => Boolean): Graph[V,E] def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T] // Joins ---------------------------------------def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)] // Computation ---------------------------------def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)], reduceF: (T, T) => T): Graph[T, E] } 17 Triplets Join Vertices and Edges The triplets operator joins vertices and edges: Vertices Triplets Edges A A B B A C A C C B C B C D C D C D The mrTriplets operator sums adjacent triplets. SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum FROM triplets AS t GROUPBY t.dstId Map Reduce Triplets Map-Reduce for each vertex B mapF(A B ) A1 mapF(A C ) A2 C A D reduceF(A1 ,A2 ) E A F 19 Example: Oldest Follower 23 What is the age of the oldest follower for each user? 42 B val oldestFollowerAge = graph .mrTriplets( e=> (e.dst.id, e.src.age),//Map (a,b)=> max(a, b) //Reduce ) D 19 .vertices C 30 A E F 16 20 75 We express the Pregel and GraphLab abstractions using the GraphX operators in less than 50 lines of code! By composing these operators we can construct entire graph-analytics pipelines. 21 DIY Demo this Afternoon GraphX System Design Distributed Graphs as Tables (RDDs) Property Graph Part. 1 B C A D A D 2D Vertex A Cut Heuristic D Vertex Table (RDD) Routing Table (RDD) A A 1 2 B C D F E Part. 2 Edge Table (RDD)B A A C B 1 B C C 1 C D D 1 2 A E A F E D E F E E 2 F F 2 Caching for Iterative mrTriplets Vertex Table (RDD) A A B B C C D D Edge Table (RDD) Mirror Cache A B A C B C C D A E A F E E D F E F A B C D Mirror Cache A E E FF D Incremental Updates for Iterative mrTriplets Vertex Table (RDD) Change A B C D Edge Table (RDD) Mirror Cache A B A C B C C D A E A F E E D F E F A B C D Mirror Cache Change E F D Scan A Aggregation for Iterative mrTriplets Vertex Table (RDD) Change Change Change Change Edge Table (RDD) Mirror Cache A B A B A C B C C D A E A F E E D F E F A Local Aggregate B C D C Mirror Cache D Change Change E F Local Aggregate D Scan A Reduction in Communication Due to Cached Updates Connected Components on Twitter Graph Network Comm. (MB) 10000 1000 100 10 Most vertices are within 8 hops of all vertices in their comp. 1 0.1 0 2 4 6 8 Iteration 10 12 14 16 Benefit of Indexing Active Edges Connected Components on Twitter Graph Runtime (Seconds) 30 Scan 25 Indexed 20 15 Scan All Edges 10 Index of “Active” Edges 5 0 0 2 4 6 8 Iteration 10 12 14 16 Join Elimination Identify and bypass joins for unused triplets fields Communication (MB) » Example: PageRank only accesses source PageRank on Twitter Three Way Join attribute 14000 Join Elimination 12000 10000 8000 6000 4000 2000 0 Factor of 2 reduction in communication 0 5 10 Iteration 15 20 30 Additional Query Optimizations Indexing and Bitmaps: » To accelerate joins across graphs » To efficiently construct sub-graphs Substantial Index and Data Reuse: » Reuse routing tables across graphs and subgraphs » Reuse edge adjacency information and indices 31 Performance Comparisons Live-Journal: 69 Million Edges Mahout/Hadoop 1340 Naïve Spark 354 Giraph 207 68 GraphX GraphLab 22 0 200 400 600 800 1000 1200 1400 1600 Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 3x slower than GraphLab GraphX scales to larger graphs Twitter Graph: 1.5 Billion Edges Giraph 749 451 GraphX GraphLab 203 0 200 400 600 800 Runtime (in seconds, PageRank for 10 iterations) GraphX is roughly 2x slower than GraphLab » Scala + Java overhead: Lambdas, GC time, … » No shared memory parallelism: 2x increase in comm. PageRank is just one stage…. What about a pipeline? A Small Pipeline in GraphX Raw Wikipedia Hyperlinks <</ />> </> PageRank HDFS XML Spark Preprocess Top 20 Pages HDFS Compute Spark Spark Post. 1492 Giraph + Spark 605 GraphX 342 GraphLab + Spark 375 0 200 400 600 800 1000 1200 1400 1600 Total Runtime (in Seconds) Timed end-to-end GraphX is faster than The GraphX Stack (Lines of Code) PageRan Connected k (5) Comp. (10) Shortest SVD Path (40) (10) ALS (40) Pregel (28) + GraphLab (50) GraphX (3575) Spark K-core (51) Triangl e Count (45) LDA (120) Status Alpha release as part of Spark 0.9 Seeking collaborators and feedback Conclusion and Observations Domain specific views: Tables and Graphs » tables and graphs are first-class composable objects » specialized operators which exploit view semantics Single system that efficiently spans the pipeline » minimize data movement and duplication » eliminates need to learn and manage multiple systems Graphs through the lens of database 38 Active Research Static Data Dynamic Data » Apply GraphX unified approach to time evolving data » Model and analyze relationships over time Serving Graph Structured Data » Allow external systems to interact with GraphX » Unify distributed graph databases with relational database technology 39 Thanks! http://amplab.github.io/graphx/ ankurd@eecs.berkeley.edu crankshaw@eecs.berkeley.edu rxin@eecs.berkeley.edu jegonzal@eecs.berkeley.edu Graph Property 1 Real-World Graphs Power-Law Degree Distribution Edges >> Vertices AltaVista WebGraph1.4B Vertices, 6.6B Edges 8 Facebook More than 10 vertices have one neighbor. Top 1% of vertices are adjacent to 50% of the edges! 8 10 6 10 count Number of Vertices 10 4 10 2 10 0 10 0 10 2 10 4 10 degree Degree 6 10 8 10 Ratio of Edges to Vertices 10 200 180 160 140 120 100 80 60 40 20 0 2008 2010 Year 2012 41 Graph Property 2 Active Vertices PageRank on Web Graph 100000000 51% updated only once! Num-Vertices 10000000 1000000 100000 10000 1000 100 10 1 0 10 20 30 40 50 Number of Updates 60 70 Graphs are Essential to Data Mining and Machine Learning Identify influential people and information Find communities Understand people’s shared interests Model complex data dependencies Recommending Products Users Ratings Item s Recommending Products f(i) Movie s f(j) User Factors (U) Movie s ≈ User s Netflix x f(1) r13 r14 f(2) f(3) f(4) r24 r25 f(5) Iterate: 45 Movie Factors (M) User s Low-Rank Matrix Factorization: Predicting User Behavior ? ? ? Liberal ? Conservative ? ? ? ? ? ? Post Post ? ? Post Post Post ? Post Post ? Post ? ? Post ? Post Post Post Post ? ? Conditional Random Field ? ? ? ? ? Belief Propagation ? ? Post Post Post ? ? ? ? 46 Finding Communities Count triangles passing through each vertex: 2 3 1 4 Measures “cohesiveness” of local community Fewer Triangles Weaker Community More Triangles Stronger Community Example Graph Analytics Pipeline Preprocessing Compute Post Proc. <</ />> </> XML Raw Data ETL Initial Graph Analyz Slice Compute e Subgraph PageRank Top Users Repeat 48