GraphLab: A New Parallel Framework for Machine Learning

Joseph Gonzalez
Joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Alex Smola, Guy Blelloch, Joe Hellerstein, and David O’Hallaron
Carnegie Mellon
How will we design and implement parallel learning systems?

We could use…
Threads, Locks, & Messages: “low-level parallel primitives”
Threads, Locks, and Messages
ML experts repeatedly solve the same parallel design challenges:
Implement and debug a complex parallel system
Tune for a specific parallel platform
Two months later the conference paper contains:
“We implemented ______ in parallel.”
The resulting code:
is difficult to maintain
is difficult to extend
couples the learning model to the parallel implementation
... a better answer:
Map-Reduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions
MapReduce – Map Phase
[Figure: CPUs 1–4 each independently compute a score for one image.]
Embarrassingly parallel, independent computation
No communication needed
MapReduce – Map Phase
[Figure: each CPU picks up the next image and computes its feature score; finished scores (Image Features) accumulate on the left.]
MapReduce – Map Phase
[Figure: the map phase continues until a feature score has been computed for every image.]
Embarrassingly parallel, independent computation
No communication needed
MapReduce – Reduce Phase
[Figure: the per-image scores (Image Features) are aggregated by the reduce phase into “Attractive Face Statistics” and “Ugly Face Statistics”.]
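The pattern in these figures can be made concrete with a short sketch. The following is an illustrative single-process C++ sketch, not Hadoop code: a placeholder extract_score map step scores each image independently, and the reduce step aggregates per-label sufficient statistics (count and mean). The Image type, the labels, and the scoring function are assumptions for illustration.

// Minimal sketch of the data-parallel pattern in the figures (not the Hadoop API).
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Image { std::string label; double pixels_mean; };

// Map: independent per-image computation, no communication needed.
double extract_score(const Image& img) { return img.pixels_mean * 0.5; }

int main() {
  std::vector<Image> images = {
      {"attractive", 25.8}, {"ugly", 12.9}, {"attractive", 34.3}, {"ugly", 21.3}};

  // Map phase: embarrassingly parallel (could be split across CPUs 1..4).
  std::vector<std::pair<std::string, double>> mapped;
  for (const Image& img : images)
    mapped.emplace_back(img.label, extract_score(img));

  // Reduce phase: aggregate sufficient statistics (count, sum) per label.
  std::map<std::string, std::pair<int, double>> stats;  // label -> (count, sum)
  for (const auto& [label, score] : mapped) {
    stats[label].first += 1;
    stats[label].second += score;
  }
  for (const auto& [label, cs] : stats)
    std::cout << label << ": mean score = " << cs.second / cs.first << "\n";
}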
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel (is there more to Machine Learning?): Lasso, Label Propagation, Belief Propagation, Kernel Methods, Tensor Factorization, Deep Belief Networks, PageRank, Neural Networks
Concrete Example: Label Propagation

Label Propagation Algorithm
Social Arithmetic (what do I like?):
  50% × what I list on my profile (50% Cameras, 50% Biking)
+ 40% × what Sue Ann likes (80% Cameras, 20% Biking)
+ 10% × what Carlos likes (30% Cameras, 70% Biking)
= I Like: 60% Cameras, 40% Biking
Recurrence Algorithm:
  Likes[i] = Σ_{j ∈ Friends[i]} W_ij × Likes[j]
  iterate until convergence
Parallelism:
  Compute all Likes[i] in parallel
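To make the recurrence concrete, here is a minimal C++ sketch of one application of the update using the numbers from the slide. The 2-component interest vectors and the self-edge encoding of “what I list on my profile” are assumptions for illustration; this is not GraphLab code.

// One application of Likes[i] <- sum over j in Friends[i] of W_ij * Likes[j]
#include <array>
#include <iostream>
#include <utility>
#include <vector>

using Interests = std::array<double, 2>;  // {cameras, biking}

Interests update_likes(const std::vector<std::pair<int, double>>& friends_i,
                       const std::vector<Interests>& likes) {
  Interests sum = {0.0, 0.0};
  for (auto [j, w] : friends_i)
    for (int k = 0; k < 2; ++k) sum[k] += w * likes[j][k];
  return sum;
}

int main() {
  // 0 = Me (profile), 1 = Sue Ann, 2 = Carlos, with the values from the slide.
  std::vector<Interests> likes = {{0.5, 0.5}, {0.8, 0.2}, {0.3, 0.7}};
  // Me: 50% my own profile (self-edge), 40% Sue Ann, 10% Carlos.
  std::vector<std::pair<int, double>> me = {{0, 0.5}, {1, 0.4}, {2, 0.1}};

  Interests mine = update_likes(me, likes);
  std::cout << "I like: " << mine[0] * 100 << "% cameras, "
            << mine[1] * 100 << "% biking\n";  // 60% cameras, 40% biking
  // The full algorithm applies update_likes to every vertex, in parallel,
  // and iterates until no Likes[i] changes.
}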
Properties of Graph-Parallel Algorithms
Dependency Graph: what I like depends on what my friends like
Factored Computation
Iterative Computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel (Map Reduce?): Lasso, Label Propagation, Belief Propagation, Kernel Methods, Tensor Factorization, Deep Belief Networks, PageRank, Neural Networks
Why not use Map-Reduce for Graph-Parallel Algorithms?

Data Dependencies
Map-Reduce does not efficiently express dependent data:
  Data must be flattened into independent rows
  User must code substantial data transformations
  Costly data replication
Iterative Algorithms
Map-Reduce does not efficiently express iterative algorithms:
[Figure: every iteration re-reads and re-writes all of the data, with a barrier after each iteration; one slow processor stalls every other CPU at the barrier.]
MapAbuse: Iterative MapReduce
Only a subset of data needs computation:
[Figure: each iteration still touches all of the data and waits at a barrier, even when only a few items actually change.]
MapAbuse: Iterative MapReduce
System is not optimized for iteration:
[Figure: every iteration pays a startup penalty and a disk penalty to reload and rewrite the data.]
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel (Pregel / Giraph?): SVM, Lasso, Kernel Methods, Tensor Factorization, Deep Belief Networks, Belief Propagation, PageRank, Neural Networks
Pregel (Giraph)
Bulk Synchronous Parallel Model: Compute, Communicate, Barrier

Problem
Bulk synchronous computation can be highly inefficient.
Example: Loopy Belief Propagation
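For reference, the bulk synchronous pattern can be sketched as a double-buffered loop in which every vertex is recomputed in every superstep. This is an illustrative sketch, not the Pregel/Giraph API; the neighbor-averaging workload is a placeholder.

// Illustrative bulk synchronous (BSP) loop, not the Pregel/Giraph API.
#include <iostream>
#include <vector>

int main() {
  // Small undirected graph as adjacency lists.
  std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
  std::vector<double> value = {1.0, 0.0, 0.0, 0.0};

  for (int superstep = 0; superstep < 10; ++superstep) {
    std::vector<double> next(value.size());
    // Compute phase: every vertex, using only the previous superstep's values.
    for (std::size_t v = 0; v < nbrs.size(); ++v) {
      double sum = 0.0;
      for (int u : nbrs[v]) sum += value[u];
      next[v] = sum / nbrs[v].size();
    }
    value = next;  // communicate + barrier: swap in the new values all at once
  }
  for (double x : value) std::cout << x << " ";
  std::cout << "\n";
  // Even if only one vertex changed, every vertex is recomputed each superstep,
  // which is the inefficiency the Loopy BP example below makes concrete.
}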
Loopy Belief Propagation (Loopy BP)
• Iteratively estimate the “beliefs” about vertices
  – Read in-messages
  – Update marginal estimate (belief)
  – Send updated out-messages
• Repeat for all variables until convergence
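A minimal sketch of one Loopy BP vertex update for binary variables on a pairwise model follows: read the in-messages, update the belief, and recompute the out-messages. The potentials, the 3-cycle graph, and the fixed sweep count are illustrative assumptions, not the model used in the talk.

// One Loopy BP vertex update (binary variables, pairwise model); illustrative only.
#include <array>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

using Msg = std::array<double, 2>;  // distribution over a binary variable

Msg normalize(Msg m) {
  double z = m[0] + m[1];
  return {m[0] / z, m[1] / z};
}

// Read in-messages, update the belief at vertex i, send updated out-messages.
void bp_update(int i, const std::vector<int>& nbrs,
               const std::array<double, 2>& node_pot,                 // psi_i(x_i)
               const std::array<std::array<double, 2>, 2>& edge_pot,  // psi_ij(x_i, x_j)
               std::map<std::pair<int, int>, Msg>& msg,               // msg[{u,v}] = m_{u->v}
               Msg& belief) {
  // Belief: node potential times all incoming messages, normalized.
  Msg b = node_pot;
  for (int u : nbrs)
    for (int x = 0; x < 2; ++x) b[x] *= msg[{u, i}][x];
  belief = normalize(b);

  // Out-messages: marginalize x_i, excluding the recipient's own message.
  for (int j : nbrs) {
    Msg out = {0.0, 0.0};
    for (int xj = 0; xj < 2; ++xj)
      for (int xi = 0; xi < 2; ++xi) {
        double p = node_pot[xi] * edge_pot[xi][xj];
        for (int u : nbrs)
          if (u != j) p *= msg[{u, i}][xi];
        out[xj] += p;
      }
    msg[{i, j}] = normalize(out);
  }
}

int main() {
  std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 2}, {0, 1}};  // a 3-cycle
  std::array<double, 2> node_pot = {0.6, 0.4};
  std::array<std::array<double, 2>, 2> edge_pot = {{{1.2, 0.8}, {0.8, 1.2}}};
  std::map<std::pair<int, int>, Msg> msg;
  for (int u = 0; u < 3; ++u)
    for (int v : nbrs[u]) msg[{u, v}] = {0.5, 0.5};  // start from uniform messages
  std::vector<Msg> belief(3);
  for (int sweep = 0; sweep < 10; ++sweep)           // "repeat until convergence"
    for (int i = 0; i < 3; ++i)
      bp_update(i, nbrs[i], node_pot, edge_pot, msg, belief[i]);
  std::cout << "belief at vertex 0: " << belief[0][0] << ", " << belief[0][1] << "\n";
}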
Bulk Synchronous Loopy BP
• Often considered embarrassingly parallel
  – Associate a processor with each vertex
  – Receive all messages
  – Update all beliefs
  – Send all messages
• Proposed by:
  – Brunton et al. CRV’06
  – Mendiburu et al. GECC’07
  – Kang et al. LDMTA’10
  – …
Sequential Computational Structure
[Figure: a chain graphical model; information must propagate sequentially along the chain.]

Hidden Sequential Structure
[Figure: the same chain, redrawn to reveal its hidden sequential structure.]

Hidden Sequential Structure
[Figure: the chain with evidence at both ends; messages must flow across the whole chain.]
• Running Time = (time for a single parallel iteration) × (number of iterations)
Optimal Sequential Algorithm
Running time comparison:
  Bulk Synchronous: 2n²/p (for p ≤ 2n processors); every iteration recomputes every message, and many iterations are needed for evidence to cross the chain
  Forward-Backward: 2n with p = 1; a single sequential sweep in each direction
  Optimal Parallel: n with p = 2; run the forward and backward sweeps concurrently
  The gap between bulk synchronous and the sequential algorithm grows with the chain length.
The Splash Operation
• Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS spanning tree with fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
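A minimal sketch of the Splash pattern is below, under the assumption that a placeholder update(v) recomputes the messages at vertex v: grow a bounded BFS tree from a root, then sweep the tree once from the leaves toward the root and once from the root toward the leaves. The real Splash BP engine also prioritizes roots by residual; that is omitted here.

// Illustrative Splash pattern: bounded BFS tree plus two tree sweeps.
#include <cstddef>
#include <functional>
#include <queue>
#include <unordered_set>
#include <vector>

// 1) Grow a BFS spanning tree of bounded size rooted at `root`.
std::vector<int> grow_bfs_tree(const std::vector<std::vector<int>>& nbrs,
                               int root, std::size_t max_size) {
  std::vector<int> order;                 // vertices in BFS (root-to-leaf) order
  std::unordered_set<int> visited = {root};
  std::queue<int> frontier;
  frontier.push(root);
  while (!frontier.empty() && order.size() < max_size) {
    int v = frontier.front();
    frontier.pop();
    order.push_back(v);
    for (int u : nbrs[v])
      if (visited.insert(u).second) frontier.push(u);
  }
  return order;
}

// 2) + 3) One sweep in each direction over the tree ordering.
void splash(const std::vector<std::vector<int>>& nbrs, int root,
            std::size_t max_size, const std::function<void(int)>& update) {
  std::vector<int> order = grow_bfs_tree(nbrs, root, max_size);
  for (auto it = order.rbegin(); it != order.rend(); ++it)  // leaves toward root
    update(*it);
  for (int v : order)                                       // root toward leaves
    update(v);
}

int main() {
  std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
  splash(nbrs, /*root=*/0, /*max_size=*/3,
         [](int v) { /* recompute the messages at vertex v here */ (void)v; });
}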
Data-Parallel Algorithms can be Inefficient
[Figure: runtime in seconds vs. number of CPUs (1–8); optimized in-memory Bulk Synchronous vs. Asynchronous Splash BP, with Splash BP substantially faster.]
The limitations of the Map-Reduce abstraction can lead to inefficient parallel algorithms.
The Need for a New Abstraction
Map-Reduce is not well suited for Graph-Parallelism
Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel (Pregel (Giraph)): SVM, Kernel Methods, Tensor Factorization, Deep Belief Networks, Belief Propagation, PageRank, Neural Networks, Lasso
What is GraphLab?

The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
Graph:
• Social network
Vertex Data:
• User profile text
• Current interest estimates
Edge Data:
• Similarity weights
Implementing the Data Graph
Multicore Setting:
  In memory; relatively straightforward
  vertex_data(vid) → data
  edge_data(vid, vid) → data
  neighbors(vid) → vid_list
Cluster Setting:
  In memory; partition the graph with ParMETIS or random cuts
  Challenge: fast lookup, low overhead
  Solution: dense data structures, fixed Vdata & Edata types, immutable graph structure
  [Figure: cached ghosting: each node stores its partition of the graph (A, B, C, D) plus cached “ghost” copies of the boundary vertices owned by the other node.]
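A minimal in-memory sketch of the accessors listed above (vertex_data, edge_data, neighbors) is shown below. It is illustrative rather than the actual GraphLab API: the VertexData/EdgeData types, the dense vertex array, and the std::map edge store are assumptions, and the structure is treated as immutable once built.

// Illustrative in-memory data graph with the accessors from the slide.
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using vid_t = std::size_t;

struct VertexData {            // e.g. user profile text + interest estimate
  std::string profile;
  double interests[2];
};
struct EdgeData {              // e.g. similarity weight
  double weight;
};

class DataGraph {
 public:
  vid_t add_vertex(VertexData d) {
    vdata_.push_back(d);
    adj_.emplace_back();
    return vdata_.size() - 1;
  }
  void add_edge(vid_t u, vid_t v, EdgeData d) {
    edata_[{u, v}] = d;
    adj_[u].push_back(v);
  }
  VertexData& vertex_data(vid_t v) { return vdata_[v]; }                  // vertex_data(vid) -> data
  EdgeData& edge_data(vid_t u, vid_t v) { return edata_.at({u, v}); }     // edge_data(vid,vid) -> data
  const std::vector<vid_t>& neighbors(vid_t v) const { return adj_[v]; }  // neighbors(vid) -> vid_list
 private:
  std::vector<VertexData> vdata_;               // dense vertex storage
  std::vector<std::vector<vid_t>> adj_;         // immutable adjacency lists
  std::map<std::pair<vid_t, vid_t>, EdgeData> edata_;
};

int main() {
  DataGraph g;
  vid_t me = g.add_vertex({"me", {0.5, 0.5}});
  vid_t sue = g.add_vertex({"sue ann", {0.8, 0.2}});
  g.add_edge(me, sue, {0.4});
  return g.neighbors(me).size() == 1 ? 0 : 1;
}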
The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) ← scope;

  // Update the vertex data
  Likes[i] ← Σ_{j ∈ Friends[i]} W_ij × Likes[j];

  // Reschedule neighbors if needed
  if Likes[i] changes then
    reschedule_neighbors_of(i);
}
The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
The Scheduler
The scheduler determines the order in which vertices are updated.
[Figure: CPU 1 and CPU 2 pull vertices (a, b, c, …) from a shared scheduler queue; executing an update can push new vertices back onto the queue.]
The process repeats until the scheduler is empty.
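A single-threaded sketch of this dynamic scheduling loop is shown below: vertices are pulled from a FIFO queue, an update is applied, and neighbors are rescheduled only when the vertex's value changes. The neighbor-averaging update and the pinned vertex 0 are placeholders; GraphLab's real schedulers are parallel and support priorities.

// Illustrative FIFO scheduler with dynamic rescheduling (single-threaded).
#include <cmath>
#include <iostream>
#include <queue>
#include <vector>

int main() {
  // Toy graph: adjacency lists and per-vertex values.
  std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
  std::vector<double> value = {1.0, 0.0, 0.0, 0.0};

  std::queue<int> scheduler;
  std::vector<bool> scheduled(value.size(), true);
  for (int v = 0; v < (int)value.size(); ++v) scheduler.push(v);

  while (!scheduler.empty()) {                  // repeat until the scheduler is empty
    int v = scheduler.front();
    scheduler.pop();
    scheduled[v] = false;
    if (v == 0) continue;                       // vertex 0 is treated as observed (fixed)

    // Update function: average of neighbor values (placeholder computation).
    double sum = 0.0;
    for (int u : nbrs[v]) sum += value[u];
    double next = sum / nbrs[v].size();

    if (std::fabs(next - value[v]) > 1e-6) {    // changed: reschedule neighbors
      value[v] = next;
      for (int u : nbrs[v])
        if (!scheduled[u]) { scheduled[u] = true; scheduler.push(u); }
    }
  }
  for (double x : value) std::cout << x << " ";
  std::cout << "\n";
}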
Implementing the Schedulers
Multicore Setting:
  Challenging! Requires fine-grained locking and atomic operations
  Approximate FIFO/Priority queues
  Random placement of work across per-CPU queues
  Work stealing
Cluster Setting:
  Multicore scheduler on each node
  Schedules only “local” vertices
  Exchange update functions (e.g. an update f(v2) scheduled on Node 1 is sent to the node that owns v2)
  [Figure: per-CPU queues on Node 1 and Node 2; cross-node updates are forwarded to the owning node.]
The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Ensuring Race-Free Code
How much can computation overlap?
Importance of Consistency
Many algorithms require strict consistency, or perform significantly better under strict consistency.
[Figure: Alternating Least Squares, error (RMSE) vs. # iterations; consistent updates converge, inconsistent updates remain at high error.]
Importance of Consistency
Machine learning algorithms require “model debugging”:
Build → Test → Debug → Tweak Model (and repeat)
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: updates interleaved across CPU 1 and CPU 2 over time are equivalent to some single-CPU sequential execution.]
Consistency Rules
[Figure: overlapping update-function scopes over the data graph.]
Guaranteed sequential consistency for all update functions
Full Consistency
[Figure: full consistency: update functions with overlapping scopes (vertex, adjacent edges, and neighboring vertices) may not run concurrently.]

Obtaining More Parallelism
[Figure: weaker consistency models shrink the protected region, allowing more update functions to run at once.]

Edge Consistency
[Figure: edge consistency: CPU 1 and CPU 2 safely update two vertices that do not share an edge; neighboring vertex data is read-only.]
Consistency Through R/W Locks
Read/Write locks, acquired in a canonical lock ordering:
  Full Consistency: write-lock the vertex, its adjacent edges, and its neighboring vertices
  Edge Consistency: write-lock the vertex and its adjacent edges; read-lock the neighboring vertices
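A minimal sketch of edge-consistency locking with C++ reader/writer locks follows: write-lock the center vertex, read-lock its neighbors, and always acquire the locks in ascending vertex-id order so that overlapping scopes cannot deadlock. It uses std::shared_mutex for illustration and is not GraphLab's actual locking code.

// Illustrative edge-consistency locking with a canonical (sorted) lock order.
#include <algorithm>
#include <mutex>
#include <shared_mutex>
#include <vector>

struct LockedScope {
  // Holds the acquired locks for the duration of an update (RAII).
  std::vector<std::unique_lock<std::shared_mutex>> write_locks;
  std::vector<std::shared_lock<std::shared_mutex>> read_locks;
};

// Acquire the scope of `center` under the edge-consistency model:
// write access to the center vertex, read access to its neighbors,
// locks taken in ascending vertex-id order (canonical order, so no deadlock).
LockedScope lock_edge_scope(int center, std::vector<int> nbrs,
                            std::vector<std::shared_mutex>& vlocks) {
  LockedScope scope;
  nbrs.push_back(center);
  std::sort(nbrs.begin(), nbrs.end());
  for (int v : nbrs) {
    if (v == center)
      scope.write_locks.emplace_back(vlocks[v]);   // exclusive (write) lock
    else
      scope.read_locks.emplace_back(vlocks[v]);    // shared (read) lock
  }
  return scope;
}

int main() {
  std::vector<std::shared_mutex> vlocks(4);
  std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
  {
    // Lock vertex 2's scope; the locks release when `scope` goes out of scope.
    LockedScope scope = lock_edge_scope(2, nbrs[2], vlocks);
    /* apply the update function to vertex 2 here */
  }
}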
Consistency Through R/W Locks
Multicore setting: Pthread R/W locks
Distributed setting: distributed locking
  Prefetch locks and data: allow computation to proceed while locks/data are requested (a lock pipeline).
  [Figure: the data graph is partitioned across Node 1 and Node 2.]
Consistency Through Scheduling
Edge Consistency Model: two vertices can be updated simultaneously if they do not share an edge.
Graph Coloring: two vertices can be assigned the same color if they do not share an edge.
[Figure: execute one color per phase (Phase 1, Phase 2, Phase 3), with a barrier between phases.]
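A minimal sketch of this chromatic approach: greedily color the graph so that adjacent vertices get different colors, then execute one color per phase with a barrier in between. Within a phase no two scheduled vertices share an edge, so the edge consistency model is satisfied without locks. The greedy coloring and the toy graph are illustrative.

// Illustrative chromatic scheduling: greedy coloring + one phase per color.
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

int main() {
  std::vector<std::vector<int>> nbrs = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};

  // Greedy graph coloring: adjacent vertices never share a color.
  std::vector<int> color(nbrs.size(), -1);
  for (std::size_t v = 0; v < nbrs.size(); ++v) {
    std::set<int> used;
    for (int u : nbrs[v])
      if (color[u] >= 0) used.insert(color[u]);
    int c = 0;
    while (used.count(c)) ++c;
    color[v] = c;
  }
  int num_colors = 1 + *std::max_element(color.begin(), color.end());

  // One phase per color: all same-colored vertices could run in parallel,
  // since none of them are adjacent (edge consistency holds without locks).
  for (int phase = 0; phase < num_colors; ++phase) {
    for (std::size_t v = 0; v < nbrs.size(); ++v)
      if (color[v] == phase)
        std::cout << "update vertex " << v << " in phase " << phase << "\n";
    // a barrier between phases would go here
  }
}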
The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
Algorithms Implemented
PageRank
Loopy Belief Propagation
Gibbs Sampling
CoEM
Graphical Model Parameter Learning
Probabilistic Matrix/Tensor Factorization
Alternating Least Squares
Lasso with Sparse Features
Support Vector Machines with Sparse Features
Label-Propagation
…
Shared Memory Experiments
Shared memory setting: 16-core workstation
Loopy Belief Propagation
3D retinal image denoising
Data Graph: 1 million vertices, 3 million edges
Update Function: Loopy BP update equation
Scheduler: approximate priority
Consistency Model: edge consistency
Loopy Belief Propagation
[Figure: speedup vs. number of CPUs (1–16); SplashBP closely tracks the optimal (linear) speedup line.]
15.5x speedup
CoEM (Rosie Jones, 2005)
Named Entity Recognition task: Is “Dog” an animal? Is “Catalina” a place?
[Figure: bipartite graph linking noun phrases (“the dog”, “Australia”, “Catalina Island”) to the contexts they appear in (“<X> ran quickly”, “travelled to <X>”, “<X> is pleasant”).]
Vertices: 2 million, Edges: 200 million
Hadoop baseline: 95 cores, 7.5 hrs
CoEM (Rosie Jones, 2005)
[Figure: GraphLab CoEM speedup vs. number of CPUs (1–16), compared to optimal.]
Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
15x faster using 6x fewer CPUs!
Experiments
Amazon EC2, high-performance nodes
Video Cosegmentation
[Figure: video frames with corresponding regions linked; “segments mean the same”.]
Gaussian EM clustering + BP on a 3D grid
Model: 10.5 million nodes, 31 million edges

Video Coseg. Speedups
[Figure: speedup results for video cosegmentation.]

Prefetching Data & Locks
[Figure: effect of prefetching data and locks on performance.]
Matrix Factorization
Netflix collaborative filtering via Alternating Least Squares matrix factorization
Model: 0.5 million nodes, 99 million edges
[Figure: bipartite Users × Movies graph; each side carries a latent factor of dimension d.]

Netflix Speedup: increasing size of the matrix factorization
[Figure: speedup vs. number of nodes (1–64) for d = 5 (44.85 IPB), d = 20 (48.72 IPB), d = 50 (85.68 IPB), and d = 100 (159.91 IPB), compared to ideal.]
The Cost of Hadoop
[Figure: cost ($) vs. runtime (s) and cost ($) vs. error (RMSE) for D = 5, 20, 50, 100; GraphLab reaches the same error at a fraction of Hadoop's cost and runtime.]
Summary
An abstraction tailored to Machine Learning:
Targets graph-parallel algorithms
Naturally expresses data/computational dependencies and dynamic iterative computation
Simplifies parallel algorithm design
Automatically ensures data consistency
Achieves state-of-the-art parallel performance on a variety of problems
Check out GraphLab
Documentation… Code… Tutorials…
http://graphlab.org
Questions & Feedback: jegonzal@cs.cmu.edu
Carnegie Mellon
Current/Future Work
Out-of-core Storage
Hadoop/HDFS Integration
Graph Construction
Graph Storage
Launching GraphLab from Hadoop
Fault Tolerance through HDFS Checkpoints
Sub-scope parallelism
Address the challenge of very high-degree nodes
Improved graph partitioning
Support for dynamic graph structure