GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
Presented by Joseph Gonzalez
Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica
Strata 2014
*These slides are best viewed in PowerPoint with anima
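Graphs are Central to Analytics
[Figure: an analytics pipeline over Wikipedia. Raw Wikipedia XML is parsed into a Text Table (Title, Body). From it, a Hyperlinks graph feeds PageRank, producing a (Title, PR) table and the Top 20 Pages; a Term-Doc Graph feeds a Topic Model (LDA), producing Word Topics (Word, Topic); and a Discussion Table (User, Disc.) yields an Editor Graph for Community Detection, producing User Community (User, Com.) and Community Topic (Topic, Com.) tables.]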
PageRank: Identifying Leaders
Rank of user i: weighted sum of neighbors' ranks
Update ranks in parallel
Iterate until convergence
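The three steps above can be sketched in plain Scala collections. This is a minimal illustration rather than the GraphX implementation: the reset probability of 0.15, the 1/out-degree edge weights, the tolerance, and all names (`PageRankSketch`, `Edge`, `step`, `pageRank`) are assumptions made for the example.

```scala
object PageRankSketch {
  // Directed edge src -> dst (hypothetical type for this sketch).
  final case class Edge(src: Int, dst: Int)

  /** One synchronous update: every rank is recomputed from the previous ranks. */
  def step(ranks: Map[Int, Double], edges: Seq[Edge],
           outDeg: Map[Int, Int], resetProb: Double): Map[Int, Double] = {
    // Weighted sum of in-neighbors' ranks, assuming weight 1 / outDegree(src).
    val sums: Map[Int, Double] = edges
      .groupBy(_.dst)
      .map { case (dst, incoming) =>
        dst -> incoming.map(e => ranks(e.src) / outDeg(e.src)).sum
      }
    ranks.map { case (v, _) =>
      v -> (resetProb + (1.0 - resetProb) * sums.getOrElse(v, 0.0))
    }
  }

  /** Iterate until the largest per-vertex change drops below tol, or maxIter is reached. */
  def pageRank(vertices: Seq[Int], edges: Seq[Edge], resetProb: Double = 0.15,
               tol: Double = 1e-6, maxIter: Int = 100): Map[Int, Double] = {
    val outDeg = edges.groupBy(_.src).map { case (s, es) => s -> es.size }
    var ranks = vertices.map(_ -> 1.0).toMap
    var delta = Double.MaxValue
    var iter = 0
    while (delta > tol && iter < maxIter) {
      val next = step(ranks, edges, outDeg, resetProb)
      delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
      iter += 1
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical four-user graph: a cycle 1 -> 2 -> 3 -> 1 plus 4 -> 1.
    val edges = Seq(Edge(1, 2), Edge(2, 3), Edge(3, 1), Edge(4, 1))
    pageRank(Seq(1, 2, 3, 4), edges).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

Each call to `step` reads only the previous ranks, so all vertices could be updated in parallel; the loop is the "iterate until convergence" part.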
The Graph-Parallel Pattern
Model / Alg.
State
Computation depends
only on the neighbors
Many Graph-Parallel Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle-Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
Graph-Parallel Systems
Expose specialized APIs to simplify graph programming.
Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.
PageRank on the Live-Journal Graph
Runtime (in seconds, PageRank for 10 iterations)
System          Runtime (s)
Mahout/Hadoop   1340
Naïve Spark     354
GraphLab        22
GraphLab is 60x faster than Hadoop
GraphLab is 16x faster than Spark
Separate Systems to Support Each View
Table View: Table (Row, Row, Row, Row) → Result
Graph View: Dependency Graph
Having separate systems for each view is difficult to use and inefficient
Difficult to Program and Use
Users must Learn, Deploy, and
Manage multiple systems
Leads to brittle and often
complex interfaces
Inefficient
Extensive data movement and duplication across the network and file system
[Figure: XML input passes through several systems, with HDFS between the stages]
Limited reuse of internal data structures across stages
Solution: The GraphX Unified Approach
New API: blurs the distinction between Tables and Graphs
New System: combines Data-Parallel and Graph-Parallel Systems
Enabling users to easily and efficiently express the entire graph analytics pipeline
Tables and Graphs are composable
views of the same physical data
Table View
GraphX Unified
Representation
Graph View
Each view has its own operators that
exploit the semantics of the view
to achieve efficient execution
View a Graph as a Table
Property Graph
[Figure: property graph over the vertices R, J, F, I (rxin, jegonzal, franklin, istoica)]
Vertex Property Table
Id          Property (V)
rxin        (Stu., Berk.)
jegonzal    (PstDoc, Berk.)
franklin    (Prof., Berk)
istoica     (Prof., Berk)
Edge Property Table
SrcId       DstId       Property (E)
rxin        jegonzal    Friend
franklin    rxin        Advisor
istoica     franklin    Coworker
franklin    jegonzal    PI
Table Operators
Table (RDD) operators are inherited from
Spark:
map
reduce
sample
filter
count
take
groupBy
fold
first
sort
reduceByKey
partitionBy
union
groupByKey
mapWith
join
cogroup
pipe
leftOuterJoin
cross
save
rightOuterJoin
zip
...
Graph Operators
class Graph [ V, E ] {
  // Constructor: build a graph from a vertex table and an edge table
  def Graph(vertices: Table[ (Id, V) ],
            edges: Table[ (Id, Id, E) ])
  // Table Views -----------------------------------
  def vertices: Table[ (Id, V) ]
  def edges: Table[ (Id, Id, E) ]
  def triplets: Table[ ((Id, V), (Id, V), E) ]
  // Transformations -------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V,E] => Boolean): Graph[V,E]
  def mapV[T](m: (Id, V) => T ): Graph[T,E]
  def mapE[T](m: Edge[V,E] => T ): Graph[V,T]
  // Joins -----------------------------------------
  def joinV[T](tbl: Table[(Id, T)]): Graph[(V, T), E ]
  def joinE[T](tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation -----------------------------------
  def mrTriplets[T](mapF: (Edge[V,E]) => List[(Id, T)],
                    reduceF: (T, T) => T): Graph[T, E]
}
Triplets Join Vertices and Edges
The triplets operator joins vertices and edges:
[Figure: a Vertices table (A, B, C, D) joined with an Edges table (A→B, A→C, B→C, C→D) to form the Triplets table]
The mrTriplets operator sums adjacent triplets.
SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
FROM triplets AS t GROUP BY t.dstId
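The join and the grouped reduction above can be mimicked with ordinary Scala collections. The sketch below is an illustration only and does not use the GraphX API: `Triplet`, `triplets`, and `mrTriplets` are hypothetical helpers, the vertex values are made up, and the two function arguments are split into separate parameter lists purely to help type inference.

```scala
object TripletsSketch {
  type Id = String

  // One row of the triplets view: both endpoints with their properties, plus the edge property.
  final case class Triplet[V, E](src: (Id, V), dst: (Id, V), attr: E)

  /** Join each edge with the properties of its two endpoints. */
  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[Triplet[V, E]] =
    edges.map { case (s, d, e) => Triplet((s, vertices(s)), (d, vertices(d)), e) }

  /** Map every triplet to (target id, message) pairs, then reduce the messages per target id. */
  def mrTriplets[V, E, T](ts: Seq[Triplet[V, E]])(
      mapF: Triplet[V, E] => Seq[(Id, T)])(reduceF: (T, T) => T): Map[Id, T] =
    ts.flatMap(mapF)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

  def main(args: Array[String]): Unit = {
    // Hypothetical vertex values; the edges follow the figure (A→B, A→C, B→C, C→D).
    val vertices = Map("A" -> 1, "B" -> 2, "C" -> 3, "D" -> 4)
    val edges = Seq(("A", "B", "e1"), ("A", "C", "e2"), ("B", "C", "e3"), ("C", "D", "e4"))
    // Send each source value to the destination and sum per destination (GROUP BY t.dstId).
    val sums = mrTriplets(triplets(vertices, edges))(t => Seq(t.dst._1 -> t.src._2))(_ + _)
    sums.toSeq.sortBy(_._1).foreach(println)
  }
}
```

Vertices that receive no message (A in this data) simply do not appear in the result, which mirrors the SQL form: a group exists only for a `dstId` that occurs in some triplet.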
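Map Reduce Triplets
Map-Reduce for each vertex
[Figure: for vertex A in a graph over A–F, mapF is applied to the triplets (A, B) and (A, C), yielding messages A1 and A2; reduceF(A1, A2) combines them into the new value at A]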
Example: Oldest Follower
What is the age of the oldest follower for each user?
val oldestFollowerAge = graph
  .mrTriplets(
    e => (e.dst.id, e.src.age), // Map
    (a, b) => max(a, b)         // Reduce
  )
  .vertices
[Figure: follower graph over users A–F, each labeled with an age]
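A runnable approximation of the same query with plain Scala collections is shown below. It is a sketch under stated assumptions: `User`, `Follows`, and `oldestFollowerAge` are hypothetical names, and the ages and follow edges in `main` are invented for illustration rather than read from the slide's figure.

```scala
object OldestFollowerSketch {
  final case class User(id: String, age: Int)
  final case class Follows(src: String, dst: String) // src follows dst

  /** For every user with at least one follower, the age of the oldest follower. */
  def oldestFollowerAge(users: Seq[User], follows: Seq[Follows]): Map[String, Int] = {
    val ageOf = users.map(u => u.id -> u.age).toMap
    follows
      .map(e => e.dst -> ageOf(e.src))                      // Map: send the follower's age to the followed user
      .groupBy(_._1)
      .map { case (id, ages) => id -> ages.map(_._2).max }  // Reduce: keep the maximum age
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data, not taken from the slide.
    val users = Seq(User("A", 23), User("B", 42), User("C", 30),
                    User("D", 19), User("E", 75), User("F", 16))
    val follows = Seq(Follows("B", "A"), Follows("C", "A"),
                      Follows("D", "C"), Follows("E", "C"), Follows("F", "E"))
    oldestFollowerAge(users, follows).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Users without followers are absent from the result; a caller that needs an entry for every user could join the result back against the user table with a default.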
We express the Pregel and GraphLab
abstractions using the GraphX operators
in less than 50 lines of code!
By composing these operators we can
construct entire graph-analytics pipelines.
DIY Demo this Afternoon
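GraphX System Design
Distributed Graphs as Tables (RDDs)
[Figure: a Property Graph over vertices A–F is split by a 2D Vertex Cut Heuristic into two edge partitions (Part. 1 and Part. 2) and stored as three tables: an Edge Table (RDD), a Vertex Table (RDD), and a Routing Table (RDD) that lists, for each vertex, the partitions holding its edges]
Routing Table (RDD)
Vertex    Partitions
A         1, 2
B         1
C         1
D         1, 2
E         2
F         2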
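One way to picture the routing table is as a derived view of the edge partitions: a vertex is routed to every partition that holds at least one of its edges. The sketch below illustrates that idea with plain Scala collections. `routingTable` is a hypothetical helper, and the assignment of edges to the two partitions is an assumption inferred from the figure, not a description of how GraphX builds or stores the table.

```scala
object RoutingTableSketch {
  type Id = String

  /** Given partition id -> edges, return vertex id -> set of partitions that reference it. */
  def routingTable(edgePartitions: Map[Int, Seq[(Id, Id)]]): Map[Id, Set[Int]] =
    edgePartitions.toSeq
      .flatMap { case (pid, edges) =>
        // Both endpoints of an edge must be available in the edge's partition.
        edges.flatMap { case (s, d) => Seq(s -> pid, d -> pid) }
      }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  def main(args: Array[String]): Unit = {
    // Assumed edge placement for illustration.
    val parts = Map(
      1 -> Seq(("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")),
      2 -> Seq(("A", "E"), ("A", "F"), ("E", "D"), ("E", "F"))
    )
    routingTable(parts).toSeq.sortBy(_._1).foreach { case (v, ps) =>
      println(s"$v -> ${ps.toSeq.sorted.mkString(", ")}")
    }
  }
}
```

With this placement, A and D are cut across both partitions while the other vertices live in exactly one, which matches the routing entries listed on the slide.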
Caching for Iterative mrTriplets
[Figure: vertex properties from the Vertex Table (RDD) are replicated into a Mirror Cache next to each Edge Table (RDD) partition]
Incremental Updates for Iterative mrTriplets
[Figure: only the vertices marked Change in the Vertex Table (RDD) are sent to the Mirror Caches; each Edge Table (RDD) partition is then scanned]
Aggregation for Iterative mrTriplets
[Figure: each Edge Table (RDD) partition is scanned against its Mirror Cache, the results are combined in a Local Aggregate per partition, and the aggregated changes are sent back to the Vertex Table (RDD)]
Reduction in Communication Due to Cached Updates
[Figure: Connected Components on Twitter Graph; Network Comm. (MB), log scale from 0.1 to 10000, vs. Iteration (0 to 16)]
Most vertices are within 8 hops of all vertices in their comp.
Benefit of Indexing Active Edges
[Figure: Connected Components on Twitter Graph; Runtime (Seconds), 0 to 30, vs. Iteration (0 to 16), comparing Scan (Scan All Edges) with Indexed (Index of “Active” Edges)]
Join Elimination
Identify and bypass joins for unused triplets fields
» Example: PageRank only accesses source attribute
[Figure: PageRank on Twitter; Communication (MB), 0 to 14000, vs. Iteration (0 to 20), comparing Three Way Join with Join Elimination]
Factor of 2 reduction in communication
Additional Query Optimizations
Indexing and Bitmaps:
» To accelerate joins across graphs
» To efficiently construct sub-graphs
Substantial Index and Data Reuse:
» Reuse routing tables across graphs and subgraphs
» Reuse edge adjacency information and indices
Performance Comparisons
Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations)
System          Runtime (s)
Mahout/Hadoop   1340
Naïve Spark     354
Giraph          207
GraphX          68
GraphLab        22
GraphX is roughly 3x slower than GraphLab
GraphX scales to larger graphs
Twitter Graph: 1.5 Billion Edges
Runtime (in seconds, PageRank for 10 iterations)
System          Runtime (s)
Giraph          749
GraphX          451
GraphLab        203
GraphX is roughly 2x slower than GraphLab
» Scala + Java overhead: Lambdas, GC time, …
» No shared memory parallelism: 2x increase in comm.
PageRank is just one stage...
What about a pipeline?
A Small Pipeline in GraphX
[Figure: Raw Wikipedia (XML, HDFS) → Spark Preprocess → Hyperlinks → Compute (PageRank) → Spark Post. → Top 20 Pages (HDFS)]
Total Runtime (in Seconds)
System              Runtime (s)
Spark               1492
Giraph + Spark      605
GraphX              342
GraphLab + Spark    375
Timed end-to-end, GraphX is faster than GraphLab
The GraphX Stack (Lines of Code)
Algorithms: PageRank (5), Connected Comp. (10), Shortest Path (10), SVD (40), ALS (40), K-core (51), Triangle Count (45), LDA (120)
Abstractions: Pregel (28) + GraphLab (50)
GraphX (3575)
Spark
Status
Alpha release as part of Spark 0.9
Seeking collaborators and feedback
Conclusion and Observations
Domain specific views: Tables and Graphs
» tables and graphs are first-class composable
objects
» specialized operators which exploit view
semantics
Single system that efficiently spans the
pipeline
» minimize data movement and duplication
» eliminates need to learn and manage multiple
systems
Graphs through the lens of database
Active Research
Static Data → Dynamic Data
» Apply GraphX unified approach to time evolving
data
» Model and analyze relationships over time
Serving Graph Structured Data
» Allow external systems to interact with GraphX
» Unify distributed graph databases with relational
database technology
Thanks!
http://amplab.github.io/graphx/
ankurd@eecs.berkeley.edu
crankshaw@eecs.berkeley.edu
rxin@eecs.berkeley.edu
jegonzal@eecs.berkeley.edu
Graph Property 1
Real-World Graphs
Power-Law Degree Distribution
[Figure: AltaVista WebGraph (1.4B Vertices, 6.6B Edges); log-log plot of Number of Vertices (count) vs. Degree]
More than 10^8 vertices have one neighbor.
Top 1% of vertices are adjacent to 50% of the edges!
Edges >> Vertices
[Figure: Facebook; Ratio of Edges to Vertices (0 to 200) by Year, 2008 to 2012]
Graph Property 2
Active Vertices
[Figure: PageRank on Web Graph; Num-Vertices, log scale from 1 to 100000000, vs. Number of Updates (0 to 70)]
51% updated only once!
Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies
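Recommending Products
[Figure: bipartite graph of Users and Items connected by Ratings]
Recommending Products
Low-Rank Matrix Factorization:
[Figure: the Netflix Users × Movies ratings matrix is approximated by the product of User Factors (U) and Movie Factors (M), with factor vectors f(i) and f(j). As a graph, the factors f(1), f(2) sit on one side and f(3), f(4), f(5) on the other, with ratings r13, r14, r24, r25 on the edges. The factors are updated iteratively.]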
Predicting User Behavior
[Figure: users and their Posts modeled as a Conditional Random Field; a few users are labeled Liberal or Conservative, the remaining labels are unknown (?), and Belief Propagation is applied over the graph]
Finding Communities
Count triangles passing through each vertex:
[Figure: example graph with vertices 1, 2, 3, 4]
Measures “cohesiveness” of local community
Fewer Triangles → Weaker Community
More Triangles → Stronger Community
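Per-vertex triangle counting can be sketched with neighbor-set intersections. The code below is a small single-machine illustration in plain Scala, not the GraphX implementation; `triangleCounts` is a hypothetical helper and the example graph is an assumption, since the slide's figure does not survive in text form.

```scala
object TriangleCountSketch {
  /** Number of triangles passing through each vertex of an undirected graph. */
  def triangleCounts(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Symmetric neighbor sets; self-loops are dropped and duplicate edges collapse in the Set.
    val nbrs: Map[Int, Set[Int]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

    nbrs.map { case (v, ns) =>
      // For each neighbor u, count the neighbors of v that u is also connected to.
      // Every triangle {v, u, w} is found twice (via u and via w), hence the division by 2.
      val closedPairs = ns.toSeq.map(u => (nbrs(u) & ns).size).sum
      v -> (closedPairs / 2)
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: one triangle {1, 2, 3} plus a pendant vertex 4 attached to 3.
    val counts = triangleCounts(Seq((1, 2), (2, 3), (1, 3), (3, 4)))
    counts.toSeq.sortBy(_._1).foreach(println)
  }
}
```

In this example vertices 1, 2, and 3 each lie on one triangle and vertex 4 on none, so by the slide's measure vertex 4 sits in the weaker local community.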
Example Graph Analytics Pipeline
Preprocessing: Raw Data (XML) → ETL → Initial Graph → Slice → Subgraph
Compute: Subgraph → Compute → PageRank
Post Proc.: PageRank → Analyze → Top Users
Repeat