Distributed Cluster Computing Platforms (Tsinghua)
Outline
What is the purpose of Data Intensive Super Computing?
MapReduce
Pregel
Dryad
Spark/Shark
Distributed Graph Computing
Why DISC
DISC stands for Data Intensive Super Computing
Many applications:
◦ scientific data, web search engines, social networks
◦ economics, GIS
New data are continuously generated
People want to understand the data
Big-data analysis is now considered a very important method for scientific research.
What are the required features for a platform to handle DISC?
Application specific: it is very difficult, or even impossible, to construct one system that fits all applications. One example is the POSIX-compatible file system. Each system should be re-configured or even re-designed for a specific application. Think about the motivation for building the Google File System for the Google search engine.
Programmer-friendly interfaces: the application programmer should not have to consider how the infrastructure, such as machines and networks, is handled.
Fault tolerant: the platform should handle faulty components automatically, without any special treatment from the application.
Scalability: the platform should run on top of at least thousands of machines and harness the power of all of them. Load balancing should be achieved by the platform instead of by the application itself.
Try to understand all four of these features during the introduction of the concrete platforms below.
Google MapReduce
Programming Model
Implementation
Refinements
Evaluation
Conclusion
Motivation: large scale data
processing
Process lots of data to produce other derived data
◦ Input: crawled documents, web request logs etc.
◦ Output: inverted indices, web page graph structure,
top queries in a day etc.
◦ Want to use hundreds or thousands of CPUs
◦ but want to only focus on the functionality
MapReduce hides messy details in a library:
◦ Parallelization
◦ Data distribution
◦ Fault-tolerance
◦ Load balancing
Motivation: Large Scale Data
Processing
Want to process lots of data (> 1 TB)
Want to parallelize across hundreds/thousands of CPUs
… Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the raw
imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/how-muchdata-does-google-store.html
MapReduce
Automatic parallelization &
distribution
Fault-tolerant
Provides status and monitoring tools
Clean abstraction for programmers
Programming Model
Borrows from functional programming
Users implement interface of two functions:
◦ map (in_key, in_value) -> (out_key, intermediate_value) list
◦ reduce (out_key, intermediate_value list) -> out_value list
map
Records from the data source (lines out of
files, rows of a database, etc) are fed into
the map function as key*value pairs: e.g.,
(filename, line).
map() produces one or more intermediate
values along with an output key from the
input.
reduce
After the map phase is over, all the
intermediate values for a given output key
are combined together into a list
reduce() combines those intermediate
values into one or more final values for that
same output key
(in practice, usually only one final value per
key)
Architecture
[Figure: input key*value pairs from data stores 1..n feed parallel map tasks, which emit (key, values...) intermediates for keys 1, 2, 3; a barrier then aggregates intermediate values by output key; reduce tasks take (key, intermediate values) and produce the final values for each key.]
== Barrier == : aggregates intermediate values by output key
Parallelism
map() functions run in parallel, creating
different intermediate values from different
input data sets
reduce() functions also run in parallel, each
working on a different output key
All values are processed independently
Bottleneck: reduce phase can’t start until
map phase is completely finished.
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
Example vs. Actual Source Code
Example is written in pseudo-code
Actual implementation is in C++, using a
MapReduce library
Bindings for Python and Java exist via
interfaces
True code is somewhat more involved
(defines how the input key/values are
divided up and accessed, etc.)
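For intuition, here is a minimal, self-contained C++ sketch that simulates the same word-count job on a single machine; the function names and the in-memory "shuffle" are illustrative, not the Google library's API.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// map: (document name, contents) -> list of (word, 1) pairs.
void WordCountMap(const std::string& contents,
                  std::multimap<std::string, int>* intermediate) {
  std::istringstream in(contents);
  std::string word;
  while (in >> word) intermediate->emplace(word, 1);
}

// reduce: (word, list of counts) -> total count.
int WordCountReduce(const std::vector<int>& counts) {
  int total = 0;
  for (int c : counts) total += c;
  return total;
}

int main() {
  std::vector<std::string> pages = {
      "the weather is good", "today is good", "good weather is good"};

  // "Shuffle" stand-in: a real run partitions, sorts, and moves the
  // intermediate pairs across machines inside the library.
  std::multimap<std::string, int> intermediate;
  for (const std::string& page : pages) WordCountMap(page, &intermediate);

  // Group intermediate values by key, then reduce each group.
  std::map<std::string, std::vector<int>> grouped;
  for (const auto& kv : intermediate) grouped[kv.first].push_back(kv.second);
  for (const auto& kv : grouped)
    std::cout << kv.first << " " << WordCountReduce(kv.second) << "\n";
  return 0;
}

Running it prints good 4, is 3, the 1, today 1, weather 2, matching the reduce output shown in the example below.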
Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
Map output
Worker 1:
◦(the 1), (weather 1), (is 1), (good 1).
Worker 2:
◦(today 1), (is 1), (good 1).
Worker 3:
◦(good 1), (weather 1), (is 1), (good 1).
Reduce Input
Worker 1:
◦ (the 1)
Worker 2:
◦ (is 1), (is 1), (is 1)
Worker 3:
◦ (weather 1), (weather 1)
Worker 4:
◦ (today 1)
Worker 5:
◦ (good 1), (good 1), (good 1), (good 1)
Reduce Output
Worker 1:
◦ (the 1)
Worker 2:
◦ (is 3)
Worker 3:
◦ (weather 2)
Worker 4:
◦ (today 1)
Worker 5:
◦ (good 4)
Some Other Real Examples
Term frequencies through the whole
Web repository
Count of URL access frequency
Reverse web-link graph
Implementation Overview
Typical cluster:
◦ 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
◦ Limited bisection bandwidth
◦ Storage is on local IDE disks
◦ GFS: distributed file system manages data (SOSP'03)
◦ Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
Architecture
Execution
Parallel Execution
Task Granularity And
Pipelining
Fine granularity tasks: many more map tasks than
machines
◦ Minimizes time for fault recovery
◦ Can pipeline shuffling with map execution
◦ Better dynamic load balancing
Often use 200,000 map/5000 reduce tasks w/ 2000
machines
Locality
Master program divvies up tasks based on the location of data: it asks GFS for the locations of replicas of the input file blocks and tries to put map() tasks on the same machine as the physical file data, or at least on the same rack
map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks
Effect: thousands of machines read input at local disk speed
Without this, rack switches would limit the read rate
Fault Tolerance
Master detects worker failures
◦Re-executes completed & in-progress
map() tasks
◦Re-executes in-progress reduce() tasks
Master notices particular input key/values
cause crashes in map(), and skips those
values on re-execution.
◦ Effect: Can work around bugs in third-party libraries!
Fault Tolerance
On worker failure:
◦ Detect failure via periodic heartbeats
◦ Re-execute completed and in-progress map tasks
◦ Re-execute in progress reduce tasks
◦ Task completion committed through master
Master failure:
◦ Could handle, but don't yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine
Optimizations
No reduce can start until map is complete:
◦ A single slow disk controller can rate-limit the whole process
Slow workers significantly lengthen completion time:
◦ Other jobs consuming resources on the machine
◦ Bad disks with soft errors transfer data very slowly
◦ Weird things: processor caches disabled (!!)
Master redundantly executes "slow-moving" map tasks; whichever copy finishes first "wins" and its results are used
Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
Optimizations
“Combiner” functions can run on same machine as a mapper
Causes a mini-reduce phase to occur before the real reduce phase, to
save bandwidth
Under what conditions is it sound to use a combiner?
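A minimal sketch of the idea, reusing the word-count job above: a hypothetical combiner with the same shape as the reducer pre-aggregates each mapper's local output, which is sound here because addition is associative and commutative.

#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical combiner for word count: collapse one mapper's local
// (word, 1) pairs into (word, partial_count) pairs before the shuffle,
// so only one pair per distinct word leaves the machine.
std::vector<std::pair<std::string, int>> Combine(
    const std::vector<std::pair<std::string, int>>& map_output) {
  std::map<std::string, int> partial;
  for (const auto& kv : map_output) partial[kv.first] += kv.second;
  return {partial.begin(), partial.end()};
}

int main() {
  // e.g. {("good",1), ("good",1), ("is",1)} becomes {("good",2), ("is",1)}.
  auto combined = Combine({{"good", 1}, {"good", 1}, {"is", 1}});
  (void)combined;
  return 0;
}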
Refinement
Sorting guarantees within each reduce
partition
Compression of intermediate data
Combiner: useful for saving network
bandwidth
Local execution for debugging/testing
User-defined counters (see the sketch below)
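For example, a user-defined counter can track how many capitalized words the map phase saw. The sketch below is a rough, single-process rendering of that facility: the Counter/GetCounter names follow the MapReduce paper's counter example, everything else is illustrative.

#include <map>
#include <string>

// A named counter; the real library ships counter values from all workers
// to the master, which aggregates and reports the totals.
struct Counter {
  long value = 0;
  void Increment() { ++value; }
};

// Per-worker counter table (assumed): counters are created on first use.
std::map<std::string, Counter>& CounterTable() {
  static std::map<std::string, Counter> table;
  return table;
}
Counter* GetCounter(const std::string& name) { return &CounterTable()[name]; }

// Inside a map function one might then write, as in the paper's example:
//   Counter* uppercase = GetCounter("uppercase");
//   if (IsCapitalized(w)) uppercase->Increment();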
Performance
Tests run on cluster of 1800 machines:
◦ 4 GB of memory
◦ Dual-processor 2 GHz Xeons with Hyperthreading
◦ Dual 160 GB IDE disks
◦ Gigabit Ethernet per machine
◦ Bisection bandwidth approximately 100 Gbps
Two benchmarks:
MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
MR_Grep
Locality optimization helps:
◦ 1800 machines read 1 TB of data at peak of ~31 GB/s
◦ Without this, rack switches would limit to 10 GB/s
Startup overhead is significant for short jobs
MR_Sort
Backup tasks reduce job completion time significantly
System deals well with failures
[Figure: MR_Sort data-transfer rates over time for three runs: normal, no backup tasks, and 200 processes killed]
More and more MapReduce
MapReduce Programs In Google Source Tree
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
Real MapReduce : Rewrite of
Production Indexing System
Rewrote Google's production indexing system
using MapReduce
◦ Set of 10, 14, 17, 21, 24 MapReduce operations
◦ New code is simpler, easier to understand
◦ MapReduce takes care of failures, slow machines
◦ Easy to make indexing faster by adding more
machines
MapReduce Conclusions
MapReduce has proven to be a useful
abstraction
Greatly simplifies large-scale computations
at Google
Functional programming paradigm can be
applied to large-scale applications
Fun to use: focus on problem, let library
deal w/ messy details
MapReduce Programs
Sorting
Searching
Indexing
Classification
TF-IDF
Breadth-First Search / SSSP
PageRank
Clustering
MapReduce
for
PageRank
PageRank: Random Walks Over
The Web
If a user starts at a random web page and surfs by clicking links and
randomly entering new URLs, what is the probability that s/he will
arrive at a given page?
The PageRank of a page captures this notion
◦ More “popular” or “worthwhile” pages get a higher rank
PageRank: Visually
[Figure: a small web-link graph among www.cnn.com, en.wikipedia.org, and www.nytimes.com]
PageRank: Formula
Given page A, and pages T1 through Tn linking to A, PageRank is defined
as:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
C(P) is the cardinality (out-degree) of page P
d is the damping (“random URL”) factor
PageRank: Intuition
Calculation is iterative: PR_(i+1) is based on PR_i
Each page distributes its PR_i to all pages it links to. Linkees add up their awarded rank fragments to find their PR_(i+1)
d is a tunable parameter (usually = 0.85)
encapsulating the “random jump factor”
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank: First
Implementation
Create two tables 'current' and 'next' holding the PageRank for each
page. Seed 'current' with initial PR values
Iterate over all pages in the graph, distributing PR from 'current' into
'next' of linkees
current := next; next := fresh_table();
Go back to iteration step or end if converged
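A compact single-machine C++ sketch of this 'current'/'next' iteration, assuming a toy in-memory link graph, a fixed number of iterations, and d = 0.85 (all names are illustrative):

#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
  const double d = 0.85;
  // Toy link graph: page -> pages it links to.
  std::map<std::string, std::vector<std::string>> links = {
      {"A", {"B", "C"}}, {"B", {"C"}}, {"C", {"A"}}};

  std::map<std::string, double> current;
  for (const auto& p : links) current[p.first] = 1.0;       // seed PR values

  for (int iter = 0; iter < 20; ++iter) {
    std::map<std::string, double> next;
    for (const auto& p : links) next[p.first] = 1.0 - d;     // (1-d) term
    for (const auto& p : links) {                            // distribute fragments
      double share = current[p.first] / p.second.size();     // PR(T)/C(T)
      for (const auto& target : p.second) next[target] += d * share;
    }
    current = next;                                          // current := next
  }
  for (const auto& p : current)
    std::printf("%s %.3f\n", p.first.c_str(), p.second);
  return 0;
}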
Distribution of the Algorithm
Key insights allowing parallelization:
◦ The 'next' table depends on 'current', but not on any other rows of 'next'
◦ Individual rows of the adjacency matrix can be processed in parallel
◦ Sparse matrix rows are relatively small
Distribution of the Algorithm
Consequences of insights:
◦ We can map each row of 'current' to a list of PageRank “fragments” to
assign to linkees
◦ These fragments can be reduced into a single PageRank value for a page
by summing
◦ Graph representation can be even more compact; since each element is
simply 0 or 1, only transmit column numbers where it's 1
Map step: break page rank into even fragments to distribute to link targets
Reduce step: add together fragments into next PageRank
Iterate for next step...
Phase 1: Parse HTML
Map task takes (URL, page content) pairs and maps them to (URL,
(PRinit, list-of-urls))
◦ PRinit is the “seed” PageRank for URL
◦ list-of-urls contains all pages pointed to by URL
Reduce task is just the identity function
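A rough sketch of the Phase 1 mapper (illustrative C++; ExtractLinks stands in for real HTML parsing, and the record layout is just one possible encoding of (PRinit, list-of-urls)):

#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct PageRecord {
  double pr;                            // current PageRank (PR_init here)
  std::vector<std::string> out_links;   // list-of-urls pointed to by the page
};

// Stand-in for real HTML parsing: treat whitespace-separated tokens that
// start with "http" as outgoing links.
std::vector<std::string> ExtractLinks(const std::string& content) {
  std::vector<std::string> links;
  std::istringstream in(content);
  std::string token;
  while (in >> token)
    if (token.rfind("http", 0) == 0) links.push_back(token);
  return links;
}

// Phase 1 map: (URL, page content) -> (URL, (PR_init, list-of-urls)).
// The Phase 1 reduce is the identity, so it is not shown.
std::pair<std::string, PageRecord> Phase1Map(const std::string& url,
                                             const std::string& content) {
  return {url, PageRecord{1.0, ExtractLinks(content)}};
}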
Phase 2: PageRank Distribution
Map task takes (URL, (cur_rank, url_list))
◦ For each u in url_list, emit (u, cur_rank/|url_list|)
◦ Emit (URL, url_list) to carry the points-to list along through iterations
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Phase 2: PageRank Distribution
Reduce task gets (URL, url_list) and many (URL, val) values
◦ Sum vals and fix up with d
◦ Emit (URL, (new_rank, url_list))
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
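A rough sketch of one Phase 2 iteration in illustrative C++ (the shuffle that groups the fragments and the carried link lists by URL is assumed to happen between the two functions):

#include <map>
#include <string>
#include <vector>

const double kD = 0.85;                  // damping factor d

struct PageRecord {
  double pr;
  std::vector<std::string> out_links;
};

// Phase 2 map: for each u in url_list emit (u, cur_rank/|url_list|), and
// re-emit (URL, url_list) so the link structure survives the iteration.
void Phase2Map(const std::string& url, const PageRecord& rec,
               std::multimap<std::string, double>* fragments,
               std::map<std::string, std::vector<std::string>>* carried_links) {
  (*carried_links)[url] = rec.out_links;
  if (rec.out_links.empty()) return;
  double share = rec.pr / rec.out_links.size();
  for (const std::string& target : rec.out_links)
    fragments->emplace(target, share);
}

// Phase 2 reduce: new_rank = (1-d) + d * sum(fragments), re-attached to the
// carried link list so the next iteration has the graph structure.
PageRecord Phase2Reduce(const std::vector<double>& fragments,
                        const std::vector<std::string>& out_links) {
  double sum = 0;
  for (double f : fragments) sum += f;
  return PageRecord{(1 - kD) + kD * sum, out_links};
}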
Finishing up...
A non-parallelizable component determines whether convergence has
been achieved (Fixed number of iterations? Comparison of key
values?)
If so, write out the PageRank lists - done!
Otherwise, feed output of Phase 2 into another Phase 2 iteration
PageRank Conclusions
MapReduce isn't the greatest at iterated computation, but still helps
run the “heavy lifting”
Key element in parallelization is independent PageRank computations
in a given step
Parallelization requires thinking about minimum data partitions to
transmit (e.g., compact representations of graph rows)
◦ Even the implementation shown today doesn't actually scale to the whole
Internet; but it works for intermediate-sized graphs
So, do you think that MapReduce is suitable for PageRank?
(homework, give concrete reason for why and why not.)
Dryad
Dryad Design
Implementation
Policies as Plug-ins
Building on Dryad
Design Space
[Figure: design space with a latency-vs-throughput axis and a sharing axis spanning shared memory, private data center, and Internet; data-parallel computing sits in the throughput-oriented, private data center region]
Data Partitioning
[Figure: a dataset larger than any single machine's RAM is split into partitions stored across machines]
2-D Piping
Unix Pipes: 1-D
grep | sed | sort | awk | perl
Dryad: 2-D
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
Dryad = Execution Layer
Job (Application) : Dryad : Cluster ≈ Pipeline : Shell : Machine
Dryad Design
Implementation
Policies as Plug-ins
Building on Dryad
Virtualized 2-D Pipelines
[Figure sequence: the pipeline graph is built up step by step]
• 2D DAG
• multi-machine
• virtualized
Dryad Job Structure
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
[Figure: the job as a dataflow graph: input files feed grep vertices, then sed, sort, awk, and perl vertices, which write output files; vertices (processes) are grouped into stages and connected by channels]
Channels
Finite streams of items, carried between vertices as:
• distributed filesystem files (persistent)
• SMB/NTFS files (temporary)
• TCP pipes (inter-machine)
• memory FIFOs (intra-machine)
Architecture
[Figure: the job manager (JM) and name server (NS) form the control plane, which schedules vertices (V) onto cluster machines via per-machine process daemons (PD); the data plane moves channel data through files, TCP, FIFOs, and the network]
Staging
[Figure: the job manager (JM) code and vertex code are built into an .exe and staged onto the cluster:]
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
Fault Tolerance
Dryad Design
Implementation
Policies and Resource Management
Building on Dryad
Policy Managers
[Figure: each stage (stage R, stage X) has a stage manager and each connection (R-X) has a connection manager; these policy plug-ins run inside the Job Manager]
Duplicate Execution Manager
[Figure: vertices X[0], X[1], X[3] have completed; X[2] is slow, so a duplicate vertex X'[2] is started]
Duplication Policy = f(running times, data volumes)
Aggregation Manager
[Figure: a static plan with source vertices S feeding a single T is dynamically rewritten: per-rack aggregation vertices (#1A, #2A, #3A) are inserted based on the rack number of each source before the data reaches T]
Data Distribution (Group By)
[Figure: m source vertices connect to n destination vertices through an m x n distribution; each source partitions its output across all destinations]
Range-Distribution Manager
[Figure: the static plan sends the whole key range [0-100) from the S vertices to one T; dynamically, a histogram (Hist) of sampled keys picks split points such as [0-30) and [30-100), and distribution vertices D route records to range-partitioned T vertices]
Goal: Declarative Programming
[Figure: the same computation over X, S, and T vertices expressed as a static graph and as a dynamically generated graph]
Dryad Design
Implementation
Policies as Plug-ins
Building on Dryad
Software Stack
[Figure: the Dryad software stack. On top: machine learning and data analysis (C# vectors, DryadLINQ), legacy code (sed, awk, grep, Perl, C++, C#), SQL-server/SSIS queries, PSQL, and a distributed shell. These run on Dryad, which runs on a distributed filesystem (CIFS/NTFS), job queueing and monitoring, and cluster services, on Windows Server machines]
SkyServer Query 18
select distinct P.ObjID
into results
from photoPrimary U,
     neighbors N,
     photoPrimary L
where U.ObjID = N.ObjID
  and L.ObjID = N.NeighborObjID
  and P.ObjID < L.ObjID
  and abs((U.u-U.g)-(L.u-L.g))<0.05
  and abs((U.g-U.r)-(L.g-L.r))<0.05
  and abs((U.r-U.i)-(L.r-L.i))<0.05
  and abs((U.i-U.z)-(L.i-L.z))<0.05
[Figure: the corresponding Dryad query plan built from stages U, N, L, X, D, M, S, Y, and H, with some stages replicated n or 4n times]
SkyServer Q18 Performance
[Chart: speed-up (times) vs. number of computers (0-10) for Dryad In-Memory, Dryad Two-pass, and SQLServer 2005]
DryadLINQ
• Declarative programming
• Integration with Visual Studio
• Integration with .Net
• Type safety
• Automatic serialization
• Job graph optimizations
  ◦ static
  ◦ dynamic
• Conciseness
LINQ
Collection<T> collection;
bool IsLegal(Key);
string Hash(Key);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };
DryadLINQ = LINQ + Dryad
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };

[Figure: the query is compiled into a query plan (a Dryad job) plus vertex code; the partitioned data collection is processed by C# vertices, producing results]
Sort & Map-Reduce in DryadLINQ
[Figure: distributed sort plan: sample (Sampl) the key range [0-100) to choose split points such as [0-30) and [30-100), redistribute (D) records by range, then Sort each partition]
PLINQ
public static IEnumerable<TSource>
DryadSort<TSource, TKey>(IEnumerable<TSource> source,
                         Func<TSource, TKey> keySelector,
                         IComparer<TKey> comparer,
                         bool isDescending)
{
    return source.AsParallel().OrderBy(keySelector, comparer);
}
Machine Learning in DryadLINQ
[Figure: stack: data analysis and machine learning sit on a Large Vector library, which runs on DryadLINQ, which runs on Dryad]
Very Large Vector Library
PartitionedVector<T>: a large vector of T, split into partitions spread across the cluster
Scalar<T>: a single value of T
Operations on Large Vectors: Map 1
Apply f: T -> U element-wise to a partitioned vector; f preserves the partitioning
Map 2 (Pairwise)
Apply f: (T, U) -> V to corresponding elements of two partitioned vectors
Map 3 (Vector-Scalar)
Apply f: (T, U) -> V to each vector element T together with a scalar U
Reduce (Fold)
[Figure: the U values within each partition are combined with f, and the per-partition results are then combined with f into a single U]
Linear Algebra
[Figure: vector and matrix types (T, U, V, of dimensions n, m, n x m) and their linear-algebra operations are built from the Map and Reduce primitives above]
Linear Regression
Data: x_t in R^n, y_t in R^m, t in {1, ..., n}
Find A in R^{m x n} such that A x_t ≈ y_t
Analytic Solution
A = (Σ_t y_t x_t^T) (Σ_t x_t x_t^T)^{-1}
[Figure: the data partitions X[0..2] and Y[0..2] go through a Map that computes X×X^T and Y×X^T per partition, a Reduce that sums (Σ) the partial products, and finally a matrix inverse ([ ]^{-1}) and a multiply (*) that produce A]
Linear Regression Code
A = (Σ_t y_t x_t^T) (Σ_t x_t x_t^T)^{-1}

Vectors x = input(0), y = input(1);
Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
Expectation Maximization (Gaussians)
• 160 lines
• 3 iterations shown
Conclusions
Dryad = distributed execution environment
Application-independent (semantics oblivious)
Supports rich software ecosystem
◦ Relational algebra
◦ Map-reduce
◦ LINQ
◦ Etc.
DryadLINQ = A Dryad provider for LINQ
This is only the beginning!
Some other systems you should know about for big-data processing
Hadoop HDFS, MapReduce (open source version of GFS and
MapReduce)
HIVE/Pig/Sawzall (Query Language Processing)
Spark/Shark (efficient use of cluster memory, supporting iterative MapReduce programs)
Thank you! Any Questions?
Pregel as backup slides
Pregel
Introduction
Computation Model
Writing a Pregel Program
System Implementation
Experiments
Conclusion
Introduction (1/2)
Source: SIGMETRICS ’09 Tutorial – MapReduce: The Programming Model and Practice, by Jerry Zhao
Introduction (2/2)
Many practical computing problems concern large graphs
Large graph data: Web graph, transportation routes, citation relationships, social networks
Graph algorithms: PageRank, shortest path, connected components, clustering techniques
MapReduce is ill-suited for graph processing
◦ Many iterations are needed for parallel graph processing
◦ Materializations of intermediate results at every MapReduce iteration harm performance
Single Source Shortest Path
(SSSP)
Problem
◦ Find shortest path from a source node to all target nodes
Solution
◦ Single processor machine: Dijkstra’s algorithm
Example: SSSP – Dijkstra's Algorithm
[Figure sequence: Dijkstra's algorithm on a small weighted example graph. Starting from the source (distance 0, all other nodes ∞), each step settles the closest unsettled node and relaxes its outgoing edges; the tentative distances shrink (e.g., 14 → 13 → 9) until the final distances 0, 8, 9, 5, 7 are reached.]
Single Source Shortest Path
(SSSP)
Problem
◦ Find shortest path from a source node to all target nodes
Solution
◦ Single processor machine: Dijkstra’s algorithm
◦ MapReduce/Pregel: parallel breadth-first search (BFS)
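One relaxation round of this parallel BFS can be sketched in MapReduce style as follows (illustrative C++; node records carry the adjacency list, exactly as in the walkthrough below):

#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

const double kInf = std::numeric_limits<double>::infinity();

struct Node {
  double dist;                                        // current best distance
  std::vector<std::pair<std::string, double>> adj;    // (neighbor, edge weight)
};

// Map: re-emit the node's own distance, plus a candidate distance for every
// neighbor reachable through this node.
void SsspMap(const std::string& id, const Node& node,
             std::multimap<std::string, double>* candidates) {
  candidates->emplace(id, node.dist);
  if (node.dist == kInf) return;                      // nothing to relax yet
  for (const auto& e : node.adj)
    candidates->emplace(e.first, node.dist + e.second);
}

// Reduce: keep the minimum candidate for the node; the adjacency list is
// carried along unchanged so the next iteration can use it.
Node SsspReduce(const Node& node, const std::vector<double>& candidates) {
  Node out = node;
  for (double d : candidates)
    if (d < out.dist) out.dist = d;
  return out;
}

One such map/reduce round corresponds to one BFS frontier expansion; the job is repeated until no distance changes.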
MapReduce Execution
Overview
Example: SSSP – Parallel BFS in MapReduce
[Figure: the example graph with nodes A-E and weighted edges, its adjacency matrix, and the adjacency list below; source A starts at distance 0, all other nodes at ∞]
Adjacency List
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
Example: SSSP – Parallel BFS in MapReduce
Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Map output: <dest node ID, dist>
From A: <B, 10> <D, 5>
From B: <C, inf> <D, inf>
From C: <E, inf>
From D: <B, inf> <C, inf> <E, inf>
From E: <A, inf> <C, inf>
Flushed to local disk!!
Example: SSSP – Parallel BFS in MapReduce
Reduce input: <node ID, dist>
For A: <A, <0, <(B, 10), (D, 5)>>>, <A, inf>
For B: <B, <inf, <(C, 1), (D, 2)>>>, <B, 10>, <B, inf>
For C: <C, <inf, <(E, 4)>>>, <C, inf>, <C, inf>, <C, inf>
For D: <D, <inf, <(B, 3), (C, 9), (E, 2)>>>, <D, 5>, <D, inf>
For E: <E, <inf, <(A, 7), (C, 6)>>>, <E, inf>, <E, inf>
Example: SSSP – Parallel BFS in MapReduce
Reduce output: <node ID, <dist, adj list>> = Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
Flushed to DFS!!
Map output (next iteration): <dest node ID, dist>
From A: <B, 10> <D, 5>
From B: <C, 11> <D, 12>
From C: <E, inf>
From D: <B, 8> <C, 14> <E, 7>
From E: <A, inf> <C, inf>
Flushed to local disk!!
Example: SSSP – Parallel BFS in MapReduce
Reduce input: <node ID, dist>
For A: <A, <0, <(B, 10), (D, 5)>>>, <A, inf>
For B: <B, <10, <(C, 1), (D, 2)>>>, <B, 10>, <B, 8>
For C: <C, <inf, <(E, 4)>>>, <C, 11>, <C, 14>, <C, inf>
For D: <D, <5, <(B, 3), (C, 9), (E, 2)>>>, <D, 5>, <D, 12>
For E: <E, <inf, <(A, 7), (C, 6)>>>, <E, inf>, <E, 7>
Example: SSSP – Parallel BFS in MapReduce
Reduce output: <node ID, <dist, adj list>> = Map input for next iteration
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
Flushed to DFS!!
… the rest omitted …
Computation Model (1/3)
Input
Supersteps
(a sequence of iterations)
Output
Computation Model (2/3)
“Think like a vertex”
Inspired by Valiant’s Bulk Synchronous Parallel model (1990)
Source: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Computation Model (3/3)
Superstep: the vertices compute in parallel
◦ Each vertex
◦ Receives messages sent in the previous superstep
◦ Executes the same user-defined function
◦ Modifies its value or that of its outgoing edges
◦ Sends messages to other vertices (to be received in the next superstep)
◦ Mutates the topology of the graph
◦ Votes to halt if it has no further work to do
◦ Termination condition
◦ All vertices are simultaneously inactive
◦ There are no messages in transit
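A toy, single-process rendering of this superstep loop and its termination rule (not Pregel's real API; Compute stands for the user-defined vertex function):

#include <functional>
#include <map>
#include <string>
#include <vector>

struct VertexState {
  double value = 0;
  bool active = true;                 // set to false to "vote to halt"
  std::vector<double> inbox;          // messages delivered at the last barrier
};

using Graph = std::map<std::string, VertexState>;
using ComputeFn = std::function<void(const std::string& id, VertexState& v,
                                     std::multimap<std::string, double>& outbox)>;

void RunSupersteps(Graph& g, const ComputeFn& compute) {
  while (true) {
    // Termination: every vertex has voted to halt and no messages are pending.
    bool work_left = false;
    for (const auto& kv : g)
      work_left = work_left || kv.second.active || !kv.second.inbox.empty();
    if (!work_left) break;

    std::multimap<std::string, double> outbox;   // messages for the next superstep
    for (auto& kv : g) {
      if (!kv.second.inbox.empty()) kv.second.active = true;   // message reactivates
      if (kv.second.active) compute(kv.first, kv.second, outbox);
      kv.second.inbox.clear();
    }
    // The barrier: deliver all messages, to be consumed in the next superstep.
    for (const auto& m : outbox) g[m.first].inbox.push_back(m.second);
  }
}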
Example: SSSP – Parallel BFS in Pregel
[Figure sequence: the same example graph processed in supersteps. In superstep 0 only the source is active (distance 0, all other vertices ∞). In each superstep, every vertex that received messages takes the minimum candidate distance, updates its value if it improved, sends (value + edge weight) to its out-neighbors, and votes to halt. The distances converge to 0, 8, 9, 5, 7 (matching the MapReduce example: A=0, B=8, C=9, D=5, E=7), after which no messages are sent and all vertices are inactive.]
Differences from MapReduce
Graph algorithms can be written as a series of chained MapReduce invocations
Pregel
◦ Keeps vertices & edges on the machine that performs computation
◦ Uses network transfers only for messages
MapReduce
◦ Passes the entire state of the graph from one stage to the next
◦ Needs to coordinate the steps of a chained MapReduce
C++ API
Writing a Pregel program
◦ Subclass the predefined Vertex class and override its Compute() method, which receives the incoming messages and emits outgoing messages
Example: Vertex Class for SSSP
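A sketch in the spirit of the Pregel paper's shortest-path vertex (paraphrased from the paper's published example; it relies on the Pregel library's Vertex template and message iterators, so it will not compile on its own):

class ShortestPathVertex
    : public Vertex<int, int, int> {   // <vertex value, edge value, message>
 public:
  virtual void Compute(MessageIterator* msgs) {
    // Vertex value = best known distance; 0 for the source, INF otherwise.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;                        // relax this vertex
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())                 // propagate to neighbors
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();   // sleep until a smaller distance arrives as a message
  }
};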
System Architecture
Pregel system also uses the master/worker model
◦ Master
◦ Maintains the workers
◦ Recovers faults of workers
◦ Provides Web-UI monitoring tool of job progress
◦ Worker
◦ Processes its task
◦ Communicates with the other workers
Persistent data is stored as files on a distributed storage system (such
as GFS or BigTable)
Temporary data is stored on local disk
Execution of a Pregel Program
1. Many copies of the program begin executing on a cluster of machines
2. The master assigns a partition of the input to each worker
◦ Each worker loads the vertices and marks them as active
3. The master instructs each worker to perform a superstep
◦ Each worker loops through its active vertices & computes for each vertex
◦ Messages are sent asynchronously, but are delivered before the end of the superstep
◦ This step is repeated as long as any vertices are active, or any messages are in transit
4. After the computation halts, the master may instruct each worker to save its portion of the graph
Fault Tolerance
Checkpointing
◦ The master periodically instructs the workers to save the state of their partitions to persistent
storage
◦ e.g., Vertex values, edge values, incoming messages
Failure detection
◦ Using regular “ping” messages
Recovery
◦ The master reassigns graph partitions to the currently available workers
◦ The workers all reload their partition state from most recent available checkpoint
Experiments
Environment
◦ H/W: A cluster of 300 multicore commodity PCs
◦ Data: binary trees, log-normal random graphs (general graphs)
Naïve SSSP implementation
◦ The weight of all edges = 1
◦ No checkpointing
Experiments
SSSP – 1 billion vertex binary tree: varying # of worker tasks
Experiments
SSSP – binary trees: varying graph sizes on 800 worker tasks
Experiments
SSSP – Random graphs: varying graph sizes on 800 worker tasks
Conclusion & Future Work
Pregel is a scalable and fault-tolerant platform with an API that is
sufficiently flexible to express arbitrary graph algorithms
Future work
◦ Relaxing the synchronicity of the model
◦ Not to wait for slower workers at inter-superstep barriers
◦ Assigning vertices to machines to minimize inter-machine communication
◦ Handling dense graphs in which most vertices send messages to most other vertices