GraphLab A New Parallel Framework for Machine Learning

Parallel Machine Learning for Large-Scale Graphs
Danny Bickson
The GraphLab Team: Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Jay Gu, Carlos Guestrin, Joe Hellerstein, Alex Smola
Carnegie Mellon University
Parallelism is Difficult
Wide array of different parallel architectures:
GPUs
Multicore
Clusters
Clouds
Supercomputers
Different challenges for each architecture
We need high-level abstractions to make things easier.
How will we design and implement parallel learning systems?
... a popular answer:
Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map-Reduce):
Feature Extraction
Cross Validation
Computing Sufficient Statistics
Graph-Parallel:
Lasso
Label Propagation
Belief Propagation
Kernel Methods
Tensor Factorization
Deep Belief Networks
PageRank
Neural Networks
Example of Graph Parallelism
PageRank Example
Iterate:
$$R[i] \leftarrow \alpha + (1 - \alpha) \sum_{j \in \mathrm{InNbrs}(i)} \frac{R[j]}{L[j]}$$
Where:
α is the random reset probability
L[j] is the number of links on page j
[Figure: example graph of six linked web pages]
Properties of Graph Parallel Algorithms
Dependency Graph
Local Updates
Iterative Computation
[Figure: "My Rank" depends on "Friends' Ranks"]
Addressing Graph-Parallel ML
We need alternatives to Map-Reduce
Data-Parallel (Map-Reduce):
Feature Extraction
Cross Validation
Computing Sufficient Statistics
Graph-Parallel (Pregel/Giraph? Map-Reduce?):
SVM
Lasso
Kernel Methods
Tensor Factorization
Deep Belief Networks
Belief Propagation
PageRank
Neural Networks
Pregel (Giraph)
Bulk Synchronous Parallel Model:
Compute
Communicate
Barrier
Problem: bulk synchronous computation can be highly inefficient
BSP Systems Problem:
Curse of the Slow Job
[Figure: BSP iterations; CPUs 1-3 each process their data partition, then wait at a barrier, so every iteration is as slow as the slowest job]
The Need for a New Abstraction
If not Pregel, then what?
[Same data-parallel vs. graph-parallel taxonomy as before: Map-Reduce handles feature extraction, cross validation, and sufficient statistics; graph-parallel algorithms (SVM, Lasso, kernel methods, tensor factorization, deep belief networks, belief propagation, PageRank, neural networks) still need a better abstraction than Pregel (Giraph)]
The GraphLab Solution
Designed specifically for ML needs
Expresses data dependencies
Supports iterative computation
Simplifies the design of parallel programs:
Abstract away hardware issues
Automatic data synchronization
Addresses multiple hardware architectures
Multicore
Distributed
Cloud computing
GPU implementation in progress
What is GraphLab?
The GraphLab Framework
Graph Based
Data Representation
Scheduler
Update Functions
User Computation
Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
Graph:
• Social Network
Vertex Data:
• User profile text
• Current interests estimates
Edge Data:
• Similarity weights
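To make the data-graph idea concrete, here is a minimal, self-contained C++ sketch of the structure described above: arbitrary user data attached to every vertex and edge of an adjacency list. The VertexData/EdgeData fields and the tiny DataGraph container are illustrative assumptions, not GraphLab's actual graph API.

#include <cstddef>
#include <string>
#include <vector>

// Illustrative vertex data for the social-network example:
// profile text plus current estimates of the user's interests.
struct VertexData {
  std::string profile_text;
  std::vector<double> interest_estimates;
};

// Illustrative edge data: a similarity weight between two users.
struct EdgeData {
  double similarity;
};

// A minimal data graph: arbitrary data on every vertex and edge
// (hypothetical container, not the GraphLab graph class).
template <typename VData, typename EData>
class DataGraph {
 public:
  std::size_t add_vertex(const VData& data) {
    vertices_.push_back(data);
    out_edges_.emplace_back();
    return vertices_.size() - 1;
  }
  void add_edge(std::size_t src, std::size_t dst, const EData& data) {
    edges_.push_back({src, dst, data});
    out_edges_[src].push_back(edges_.size() - 1);
  }
  VData& vertex_data(std::size_t v) { return vertices_[v]; }

 private:
  struct Edge { std::size_t src, dst; EData data; };
  std::vector<VData> vertices_;
  std::vector<Edge> edges_;
  std::vector<std::vector<std::size_t>> out_edges_;
};

int main() {
  DataGraph<VertexData, EdgeData> g;
  std::size_t alice = g.add_vertex({"likes hiking", {0.7, 0.3}});
  std::size_t bob   = g.add_vertex({"likes ML", {0.2, 0.8}});
  g.add_edge(alice, bob, {0.45});  // similarity weight stored on the edge
  return 0;
}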
Update Functions
An update function is a user defined program which when
applied to a vertex transforms the data in the scope of the vertex
pagerank(i, scope) {
  // Get neighborhood data
  (R[i], W[i,j], R[j]) ← scope;

  // Update the vertex data
  R[i] ← α + (1 - α) * Σ_{j ∈ N[i]} W[j,i] * R[j];

  // Reschedule neighbors if needed
  if R[i] changes then
    reschedule_neighbors_of(i);
}
Dynamic computation
The Scheduler
Scheduler
The scheduler determines the order that vertices are updated
[Figure: CPU 1 and CPU 2 repeatedly pull the next scheduled vertices (a, b, c, ..., k) from a shared scheduling queue and apply the update function]
The process repeats until the scheduler is empty
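A minimal single-threaded sketch of that execute loop, assuming a FIFO scheduler of vertex ids and an update function that may push further vertices back onto the queue. The Scheduler class and run() function are illustrative assumptions, not GraphLab's engine API (GraphLab ships several scheduler policies, e.g. FIFO, priority, and sweep schedulers).

#include <cstddef>
#include <deque>
#include <functional>
#include <iostream>

// A toy FIFO scheduler of vertex ids (illustrative only).
class Scheduler {
 public:
  void schedule(std::size_t v) { queue_.push_back(v); }
  bool empty() const { return queue_.empty(); }
  std::size_t next() {
    std::size_t v = queue_.front();
    queue_.pop_front();
    return v;
  }
 private:
  std::deque<std::size_t> queue_;
};

// The engine repeatedly applies the user's update function to the next
// scheduled vertex; "the process repeats until the scheduler is empty".
void run(Scheduler& sched,
         const std::function<void(std::size_t, Scheduler&)>& update) {
  while (!sched.empty()) {
    std::size_t v = sched.next();
    update(v, sched);  // the update may reschedule neighbors
  }
}

int main() {
  Scheduler sched;
  for (std::size_t v = 0; v < 3; ++v) sched.schedule(v);  // "schedule all"
  run(sched, [](std::size_t v, Scheduler&) {
    std::cout << "update vertex " << v << "\n";
  });
  return 0;
}

A parallel engine runs this loop on several CPUs at once, which is exactly why the consistency model discussed next matters.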
The GraphLab Framework
Graph Based
Data Representation
Scheduler
Update Functions
User Computation
Consistency Model
Ensuring Race-Free Code
How much can computation overlap?
Need for Consistency?
[Figure: trade-off; no consistency gives higher throughput (#updates/sec) but potentially slower convergence of the ML algorithm]
[Figure: Inconsistent ALS on Netflix data, 8 cores; train RMSE vs. updates (millions), dynamic inconsistent vs. dynamic consistent execution]
Even Simple PageRank can be Dangerous
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1-ALPHA) * sum
  …
[Figure: inconsistent PageRank]
Even Simple PageRank can be Dangerous
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1-ALPHA) * sum
  …
Read-write race: CPU 1 reads a bad PageRank estimate while CPU 2 is still computing the value.
[Figure: the ranks remain stable under consistent execution but become unstable under the race]
Race Condition Can Be Very Subtle
GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1-ALPHA) * sum
  …

GraphLab_pagerank(scope) {
  sum = 0
  forall (neighbor in scope.in_neighbors)
    sum = sum + neighbor.value / neighbor.num_out_edges
  sum = ALPHA + (1-ALPHA) * sum
  scope.center_value = sum
  …

The first version aliases sum to the shared vertex value (ref), so neighbors can observe partially accumulated sums; the second accumulates locally and writes the vertex value once at the end.
This was actually encountered in user code.
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution
of update functions which produces the same result.
[Figure: an interleaved parallel execution on CPU 1 and CPU 2 corresponds to some sequential execution of the same update functions on a single CPU]
Consistency Rules
[Figure: update-function scopes over the data graph]
Guaranteed sequential consistency for all update functions
Full Consistency
Obtaining More Parallelism
Edge Consistency
[Figure: with edge consistency, CPU 1 and CPU 2 can safely read and update nearby vertices at the same time]
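One common way to serialize overlapping scopes is to lock the center vertex and its neighbors in a canonical (ascending id) order before running the update, so that two overlapping scopes can never deadlock. A minimal sketch, assuming one mutex per vertex; this illustrates the locking discipline only, not GraphLab's actual locking engine, which further distinguishes read and write locks to realize the vertex/edge/full consistency levels.

#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

// Lock the center vertex and its neighbors in ascending id order, run the
// update, then release. Ordered acquisition prevents deadlock when two
// overlapping scopes are being locked concurrently.
template <typename UpdateFn>
void edge_consistent_update(std::size_t v,
                            std::vector<std::size_t> neighbors,
                            std::vector<std::mutex>& vertex_locks,
                            UpdateFn update) {
  neighbors.push_back(v);
  std::sort(neighbors.begin(), neighbors.end());
  neighbors.erase(std::unique(neighbors.begin(), neighbors.end()),
                  neighbors.end());
  for (std::size_t u : neighbors) vertex_locks[u].lock();
  update(v);  // safe to read/write v's data and its adjacent edge data
  for (std::size_t u : neighbors) vertex_locks[u].unlock();
}

int main() {
  std::vector<std::mutex> locks(5);  // one mutex per vertex
  edge_consistent_update(2, {0, 4}, locks, [](std::size_t) { /* update */ });
  return 0;
}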
The GraphLab Framework
Graph Based
Data Representation
Scheduler
Update Functions
User Computation
Consistency Model
What algorithms
are implemented in
GraphLab?
Alternating Least Squares
CoEM
Lasso
SVD
Splash Sampler
Bayesian Tensor Factorization
Belief Propagation
PageRank
LDA
Gibbs Sampling
Dynamic Block Gibbs Sampling
K-Means
Linear Solvers
SVM
Matrix Factorization
...many others...
GraphLab Libraries
Matrix factorization: SVD, PMF, BPTF, ALS, NMF, Sparse ALS, Weighted ALS, SVD++, time-SVD++, SGD
Linear solvers: Jacobi, GaBP, Shotgun Lasso, sparse logistic regression, CG
Clustering: K-means, fuzzy K-means, LDA, K-core decomposition
Inference: discrete BP, NBP, kernel BP
Efficient Multicore Collaborative Filtering
LeBuSiShu team – 5th place in Track 1
Yao Wu, Qiang Yan, Qing Yang (Institute of Automation, Chinese Academy of Sciences)
Danny Bickson, Yucheng Low (Machine Learning Dept., Carnegie Mellon University)
ACM KDD CUP Workshop 2011
ACM KDD CUP 2011
• Task: predict music ratings
• Two main challenges:
• Data magnitude – 260M ratings
• Taxonomy of the data (tracks, albums, artists)
Data taxonomy
Our approach
• Use an ensemble method
• Custom SGD algorithm for handling the taxonomy
Ensemble method
• Solutions are merged using linear
regression
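To make the blend concrete: if model k produces prediction ŷ⁽ᵏ⁾ for a user/item pair, the merge learns one weight per model by least squares on a held-out validation set. This is a generic sketch of linear-regression blending; the exact features and weighting the team used may differ.

$$\hat{y}_{ui} = w_0 + \sum_{k=1}^{K} w_k\, \hat{y}^{(k)}_{ui},
\qquad
\mathbf{w} = \arg\min_{\mathbf{w}} \sum_{(u,i)\in\mathcal{V}} \Bigl(r_{ui} - w_0 - \sum_{k=1}^{K} w_k\, \hat{y}^{(k)}_{ui}\Bigr)^{2}$$

where $\mathcal{V}$ is the validation set and $r_{ui}$ the observed rating.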
Performance results
Blended Validation RMSE: 19.90
Classical Matrix Factorization
[Figure: users × items sparse rating matrix factorized into d-dimensional user and item features]
MFITR
[Figure: the "effective feature of an item" is composed of features of the artist, features of the album, and item-specific features, each of dimension d]
Intuitively, features of an artist and features of his/her album should be "similar". How do we express this?
• Penalty terms ensure that Artist/Album/Track features are "close"
• The strength of the penalty depends on the "normalized rating similarity" (see the neighborhood model)
[Figure: hierarchy of Artist → Album → Track features]
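A plausible form of such a penalty (a sketch under my own assumptions, not necessarily the exact regularizer the team used): for a track t with album a(t) and artist r(t), add squared-distance terms weighted by the normalized rating similarity s(·,·):

$$\Omega = \lambda \sum_{t} \Bigl[ s\bigl(t, a(t)\bigr)\,\lVert \mathbf{q}_t - \mathbf{q}_{a(t)} \rVert^2 + s\bigl(a(t), r(t)\bigr)\,\lVert \mathbf{q}_{a(t)} - \mathbf{q}_{r(t)} \rVert^2 \Bigr]$$

so that strongly co-rated track/album/artist triples are pulled toward each other in feature space.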
Fine Tuning Challenge
Dataset has around 260M observed ratings
12 different algorithms, total 53 tunable parameters
How do we train and cross-validate all these parameters?
USE GRAPHLAB!
[Figures: runtime on 16 cores and speedup plots for the individual algorithms]
Who is using
GraphLab?
Universities using GraphLab
Startups using GraphLab
Companies trying out GraphLab
2400+ unique downloads tracked
(possibly many more from direct repository checkouts)
User community
Performance results
GraphLab vs. Pregel (BSP)
Multicore PageRank (25M vertices, 355M edges)
[Figure: L1 error vs. runtime (s) and vs. number of updates; GraphLab converges with far fewer updates and far less time than Pregel (simulated via GraphLab)]
[Figure: histogram of the number of updates per vertex; 51% of vertices were updated only once]
CoEM (Rosie Jones, 2005)
Named Entity Recognition task: is "Dog" an animal? Is "Catalina" a place?
[Figure: bipartite graph of noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant")]
Vertices: 2 million
Edges: 200 million
Hadoop: 95 cores, 7.5 hrs
CoEM (Rosie Jones, 2005)
[Figure: speedup vs. number of CPUs (up to 16), GraphLab compared against optimal linear speedup]
Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
GraphLab CoEM is 15x faster with 6x fewer CPUs!
GraphLab in the Cloud
CoEM (Rosie Jones, 2005)
[Figure: speedup vs. number of CPUs for the small and large problem instances]
Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
GraphLab in the Cloud: 32 EC2 machines, 80 secs (0.3% of the Hadoop time)
Cost-Time Tradeoff
Video co-segmentation results
[Figure: cost vs. runtime; a few machines help a lot (faster), then diminishing returns; more machines, higher cost]
Netflix Collaborative Filtering
Alternating Least Squares Matrix Factorization
Model: 0.5 million nodes, 99 million edges
[Figure: Netflix users × movies rating matrix factorized with dimension d]
[Figure: speedup vs. number of nodes (4 to 64) for d = 100 (30M cycles), d = 50 (7.7M cycles), d = 20 (2.1M cycles), and d = 5 (1.0M cycles), compared against ideal speedup]
[Figure: runtime (s) vs. number of nodes for Hadoop, MPI, and GraphLab; GraphLab is orders of magnitude faster than Hadoop]
Multicore Abstraction Comparison
Netflix Matrix Factorization
[Figure: log test error vs. number of updates (0 to 10 million); dynamic scheduling converges faster than round-robin]
Dynamic computation, faster convergence
The Cost of Hadoop
[Figure: cost ($) vs. runtime (s) and cost ($) vs. error (RMSE) for D = 5, 20, 50, 100; GraphLab reaches the same error at a small fraction of Hadoop's cost]
Fault Tolerance
Fault-Tolerance
Larger problems → increased chance of machine failure
GraphLab2 introduces two fault-tolerance (checkpointing) mechanisms:
Synchronous snapshots
Chandy-Lamport asynchronous snapshots
Synchronous Snapshots
[Figure: timeline; all machines run GraphLab, stop at a global barrier to take a snapshot, then resume, and the cycle repeats]
Curse of the Slow Machine
[Figure: vertices updated vs. time elapsed (s); the no-snapshot run makes steady progress, while the synchronous-snapshot run pauses at each barrier]
Curse of the Slow Machine
[Figure: timeline; if one machine is slow, every other machine waits at the snapshot barrier]
Curse of the Slow Machine
[Figure: vertices updated vs. time elapsed (s); with one slow machine the synchronous snapshot is delayed and progress stalls, while the no-snapshot run is unaffected]
Asynchronous Snapshots
The Chandy-Lamport algorithm is implementable as a GraphLab update function! Requires edge consistency.
struct chandy_lamport {
  void operator()(icontext_type& context) {
    save(context.vertex_data());
    foreach (edge_type edge, context.in_edges()) {
      if (edge.source() was not marked as saved) {
        save(context.edge_data(edge));
        context.schedule(edge.source(), chandy_lamport());
      }
    }
    ... // repeat for context.out_edges()
    Mark context.vertex() as saved;
  }
};
Snapshot Performance
[Figure: vertices updated vs. time elapsed (s); with no failures, the asynchronous snapshot closely tracks the no-snapshot run, while the synchronous snapshot pauses at each barrier]
Snapshot with 15s Fault Injection
Halt 1 out of 16 machines for 15s
[Figure: vertices updated vs. time elapsed (s); the asynchronous snapshot keeps making progress through the fault, while the synchronous snapshot stalls at the barrier]
New challenges
Natural Graphs → Power-Law Degree Distributions
Yahoo! Web Graph: 1.4B vertices, 6.7B edges
Top 1% of vertices is adjacent to 53% of the edges!
[Figure: log-log plot of vertex count vs. degree, showing a heavy-tailed power-law distribution]
Problem: High Degree Vertices
High degree vertices limit parallelism:
They touch a large amount of state
They require heavy locking
They are processed sequentially
High Communication in Distributed Updates
Split gather and scatter across machines: data from the vertex's many neighbors is transmitted separately across the network (Machine 1, Machine 2)
High Degree Vertices are Common
Popular movies (Netflix users × movies graph)
"Social" people with many connections
Hyper-parameters (the α and β priors in LDA touch every document and word)
Common words (e.g., "Obama") in the LDA documents × words graph
Two Core Changes to Abstraction
Factorized update functors: monolithic updates are decomposed into Gather, Apply, and Scatter phases
Delta update functors: monolithic updates become composable update "messages", e.g. (f1 ∘ f2)(Y)
Decomposable Update Functors
Gather (user defined): Gather(Y_scope) → Δ, evaluated per neighbor and combined with a parallel sum: Δ1 + Δ2 → Δ3
Apply (user defined): Apply(Y, Δ) → Y; applies the accumulated value to the center vertex
Scatter (user defined): Scatter(Y_scope); updates adjacent edges and vertices
Locks are acquired only for the region within a scope → relaxed consistency
Factorized PageRank
double gather(scope, edge) {
  return edge.source().value().rank /
         scope.num_out_edge(edge.source())
}

double merge(acc1, acc2) { return acc1 + acc2 }

void apply(scope, accum) {
  old_value = scope.center_value().rank
  scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
  scope.center_value().residual =
    abs(scope.center_value().rank - old_value)
}

void scatter(scope, edge) {
  if (scope.center_vertex().residual > EPSILON)
    reschedule(edge.target())
}
Factorized Updates: Significant Decrease in Communication
Split gather and scatter across machines: each machine computes a partial gather locally, and only the small combined accumulator (F1 ∘ F2)(Y) is transmitted over the network.
Factorized Consistency
Neighboring vertices may be updated simultaneously:
[Figure: vertices A and B gather concurrently]
Factorized Consistency Locking
A gather on an edge cannot occur during an apply:
[Figure: vertex B gathers on its other neighbors while A is performing its Apply]
Decomposable Loopy Belief Propagation
Gather: accumulates the product of incoming messages
Apply: updates the central belief
Scatter: computes outgoing messages and schedules adjacent vertices
Decomposable Alternating Least Squares (ALS)
[Figure: Netflix users × movies rating matrix ≈ user factors (W) × movie factors (X); each user factor w_i is connected to the movie factors x_j of the movies that user rated]
Gather: sum terms
Apply: matrix inversion & multiply
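For reference, the standard ALS update that this Gather/Apply pair computes for a user factor w_i (ratings r_ij, movie factors x_j, regularization λ; the regularization term is my addition, the slides do not show it) is:

$$w_i \leftarrow \Bigl(\sum_{j \in N(i)} x_j x_j^{\top} + \lambda I\Bigr)^{-1} \sum_{j \in N(i)} r_{ij}\, x_j$$

Gather accumulates the per-edge terms $x_j x_j^{\top}$ and $r_{ij} x_j$ with a commutative-associative sum; Apply performs the matrix inversion and multiply.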
Comparison of Abstractions
Multicore PageRank (25M vertices, 355M edges)
[Figure: L1 error vs. runtime (s); factorized updates converge noticeably faster than GraphLab1]
Need for Vertex-Level Asynchrony
A single change at one neighbor forces a costly re-gather over all neighbors of Y.
Commutative-Associative Vertex-Level Asynchrony
Exploit the commutative-associative "sum": rather than re-gathering everything, send only the change +Δ to Y.
Delta Updates: Vertex-Level Asynchrony
Y keeps an old (cached) sum; each arriving Δ is simply added to it, and Y can in turn forward deltas to its own neighbors.
Delta Update
void update(scope, delta) {
  scope.center_value() = scope.center_value() + delta
  if (abs(delta) > EPSILON) {
    out_delta = delta * (1 - ALPHA) / scope.num_out_edges()
    reschedule_out_neighbors(out_delta)
  }
}

double merge(delta1, delta2) { return delta1 + delta2 }

Program starts with: schedule_all(ALPHA)
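Why this delta form still computes PageRank: each vertex accumulates R[i] = α + (1-α) Σ_j R[j]/L[j], so a change ΔR[j] at an in-neighbor j changes R[i] by (1-α) ΔR[j]/L[j], which is exactly the out_delta forwarded above:

$$R[i] = \alpha + (1-\alpha)\sum_{j \in \mathrm{InNbrs}(i)} \frac{R[j]}{L[j]} \quad\Longrightarrow\quad \Delta R[i] = (1-\alpha)\,\frac{\Delta R[j]}{L[j]}$$

Seeding every vertex with schedule_all(ALPHA) injects the α term once, and the propagated deltas build up the (1-α) sums.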
Multicore Abstraction Comparison
Multicore PageRank (25M vertices, 355M edges)
[Figure: L1 error vs. runtime (s) for delta updates, factorized updates, GraphLab 1, and simulated Pregel; delta updates converge fastest, simulated Pregel slowest]
Distributed Abstraction Comparison
Distributed PageRank (25M vertices, 355M edges)
[Figure: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine); GraphLab2 with delta updates runs faster and communicates far less than GraphLab1]
PageRank
Altavista Webgraph 2002: 1.4B vertices, 6.7B edges
Hadoop: 800 cores, 9000 s
Prototype GraphLab2: 512 cores, 431 s
(Known inefficiencies; a further 2x gain is possible)
Summary of GraphLab2
Decomposed update functions (Gather, Apply, Scatter) expose parallelism in high-degree vertices.
Delta update functions expose asynchrony in high-degree vertices.
Lessons Learned
Machine Learning:
Asynchronous execution is often much faster than synchronous execution
Dynamic computation is often faster, but it can be difficult to define optimal thresholds: science still to do!
Consistency can improve performance and is sometimes required for convergence, though there are cases where relaxed consistency is sufficient
System:
Distributed asynchronous systems are harder to build, but no distributed barriers means better scalability and performance
Scaling up by an order of magnitude requires rethinking design assumptions, e.g., the distributed graph representation
High-degree vertices and natural graphs can limit parallelism and need further assumptions on the update functions
Summary
An abstraction tailored to Machine Learning
Targets Graph-Parallel Algorithms
Naturally expresses
Data/computational dependencies
Dynamic iterative computation
Simplifies parallel algorithm design
Automatically ensures data consistency
Achieves state-of-the-art parallel performance
on a variety of problems
Parallel GraphLab 1.1 (multicore) is available today
GraphLab2 (in the Cloud) coming soon…
Documentation… Code… Tutorials…
Download: http://graphlab.org