BIG DATA

BIG DATA
ALGORITHMS
GOOGLE TRENDS
BIG DATA
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...
IS THERE ANYTHING
FUNDAMENTALLY NEW?
• Massive Data vs Big Data
• The 3 V’s
• Volume
• Velocity
• Variety
BIG DATA ECOSYSTEM
BIG DATA APPLICATIONS
Big Data Algorithms
[Timeline: Parallel algorithms (1980), External memory algorithms (1988), Data stream algorithms (1999), Distributed algorithms (2006)]
COMPUTATIONAL MODELS
FOR BIG DATA
All models are wrong,
but some are useful.
George E. P. Box
WHAT’S THE BOTTLENECK?
• CPU speed approaching limit
• Does it matter?
• From CPU-intensive computing to data-intensive computing
• Algorithm has to be near-linear, linear, or
even sub-linear!
• Data movement, i.e., communication is the
bottleneck!
Random Access Machine Model (RAM)
• Standard theoretical model of computation:
  – Unlimited memory
  – Uniform access cost
• Simple model crucial for success of computer industry
Hierarchical Memory
[Figure: memory hierarchy with L1 and L2 caches in front of RAM]
• Modern machines have complicated memory hierarchy
  – Levels get larger and slower further away from CPU
  – Data moved between levels using large blocks
Slow I/O
• Disk access is 10^6 times slower than main memory access
[Figure: disk with read/write arm, tracks, and magnetic surface]
  "The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
  – Disk systems try to amortize large access time transferring large contiguous blocks of data (8-16 Kbytes)
  – Important to store/access data to take advantage of blocks (locality)
Scalability Problems
• Most programs developed in RAM-model
  – Run on large datasets because OS moves blocks as needed
  ⇒ Scalability problems!
• Modern OS utilizes sophisticated paging and prefetching strategies
  – But if program makes scattered accesses even good OS cannot take advantage of block access
[Figure: running time vs. data size]
External Memory Model
[Figure: CPU and main memory of size M connected to disk D; data moved in blocks via Block I/O]
N = # of items in the problem instance
B = # of items per disk block
M = # of items that fit in main memory
I/O: # blocks moved between memory and disk
CPU time is ignored
Successful model used extensively in massive data algorithms and database communities
Fundamental Bounds
                 Internal        External
• Scanning:      N               N/B
• Sorting:       N log N         (N/B) log_{M/B}(N/B)
• Permuting:     N               min{N, (N/B) log_{M/B}(N/B)}
• Searching:     log_2 N         log_B N

• Note:
  – Linear I/O: O(N/B)
  – Permuting not linear
  – Permuting and sorting bounds are equal in all practical cases
  – B factor VERY important: N/B < (N/B) log_{M/B}(N/B) << N
Queues and Stacks
• Queue:
  – Maintain push and pop blocks in main memory
  ⇒ O(1/B) I/O per operation (amortized)
• Stack:
  – Maintain push/pop block in main memory
  ⇒ O(1/B) I/O per operation (amortized)
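A minimal sketch (an assumed illustration, not the lecture's code) of the stack idea above: keep at most two blocks of B items in main memory, spill or load whole blocks only when the buffer over- or under-flows, so a sequence of pushes/pops costs O(1/B) I/Os amortized.

class ExternalStack:
    def __init__(self, B):
        self.B = B
        self.buffer = []        # in-memory buffer, at most 2*B items
        self.disk = []          # simulated disk: a list of full blocks
        self.io_count = 0       # number of block I/Os performed

    def push(self, x):
        self.buffer.append(x)
        if len(self.buffer) > 2 * self.B:                 # buffer overflows
            block, self.buffer = self.buffer[:self.B], self.buffer[self.B:]
            self.disk.append(block)                       # write one block to disk
            self.io_count += 1

    def pop(self):
        if not self.buffer and self.disk:
            self.buffer = self.disk.pop()                 # read one block back
            self.io_count += 1
        return self.buffer.pop()

stack = ExternalStack(B=4)
for i in range(20):
    stack.push(i)
print([stack.pop() for _ in range(20)], stack.io_count)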
Sorting
• Merge sort:
  – Create N/M memory-sized sorted lists (runs)
  – Repeatedly merge lists together Θ(M/B) at a time
  [Figure: number of runs shrinks N/M → (N/M)/(M/B) → (N/M)/(M/B)^2 → … → 1]
  ⇒ O(log_{M/B}(N/M)) phases using O(N/B) I/Os each ⇒ O((N/B) log_{M/B}(N/B)) I/Os
Sorting
• < M/B sorted lists (queues) can be merged in O(N/B) I/Os, keeping M/B blocks in main memory
• The M/B head elements are kept in a heap in main memory
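A minimal sketch (assumed example, not from the slides) of the merge phase just described: one "head" element per sorted run is kept in an in-memory heap, and the runs are merged in a single pass.

import heapq

def merge_runs(runs):
    """Merge sorted lists ('runs' on disk) into one sorted output stream."""
    heap = []
    for run_id, run in enumerate(runs):
        it = iter(run)
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, run_id, it))
    out = []
    while heap:
        val, run_id, it = heapq.heappop(heap)
        out.append(val)                    # in a real system: append to an output block
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, run_id, it))
    return out

print(merge_runs([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))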
Toy Experiment: Permuting
• Problem:
– Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8
* Each element knows its correct position
– Output: Store them on disk in the right order
• Internal memory solution:
– Just scan the original sequence and move every element in the
right place!
– O(N) time, O(N) I/Os
• External memory solution:
– Use sorting
  – O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os
Searching in External Memory
• Store N elements in a data structure such that
– Given a query element x, find it or its predecessor
B-trees
• BFS-blocking naturally corresponds to tree with fan-out Θ(B)
• B-trees balanced by allowing node degree to vary
– Rebalancing performed by splitting and merging nodes
(a,b)-tree
• T is an (a,b)-tree (a≥2 and b≥2a-1)
– All leaves on the same level
(contain between a and b elements)
– Except for the root, all nodes have
degree between a and b
– Root has degree between 2 and b
• (a,b)-tree uses linear space and has height O(log_a N)
  ⇒ Choosing a,b = Θ(B), each node/leaf stored in one disk block
  ⇒ O(N/B) space and O(log_B N) query
(2,4)-tree
(a,b)-Tree Insert
• Insert:
  Search and insert element in leaf v
  DO v has b+1 elements/children
    Split v:
      make nodes v' and v'' with ⌈(b+1)/2⌉ ≤ b and ⌊(b+1)/2⌋ ≥ a elements
      insert element (ref) in parent(v)
      (make new root if necessary)
    v = parent(v)
[Figure: node v with b+1 elements split into v' with ⌈(b+1)/2⌉ and v'' with ⌊(b+1)/2⌋ elements]
• Insert touches O(log_a N) nodes
(a,b)-Tree Delete
• Delete:
  Search and delete element from leaf v
  DO v has a-1 elements/children
    Fuse v with sibling v':
      move children of v' to v
      delete element (ref) from parent(v)
      (delete root if necessary)
    If v has > b (and ≤ a+b-1 < 2b) children, split v
    v = parent(v)
[Figure: node v with a-1 elements fused with a sibling into a node with ≤ 2a-1 elements]
• Delete touches O(log_a N) nodes
(a,b)-Tree
• (a,b)-tree properties:
  – Every update can cause O(log_a N) rebalancing operations
  – If b > 2a: O(1/B) rebalancing operations amortized
    * Why?
[Figure: insert/delete example on a (2,3)-tree]
Summary/Conclusion: B-tree
• B-trees: (a,b)-trees with a,b = Θ(B)
– O(N/B) space
– O(logB N) query
– O(logB N) update
• B-trees with elements in the leaves sometimes called B+-tree
– Now B-tree and B+tree are synonyms
• Construction in O((N/B) log_{M/B}(N/B)) I/Os
– Sort elements and construct leaves
– Build tree level-by-level bottom-up
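A minimal sketch (assumptions: elements already sorted, B chosen by the caller, routing key = max key in a subtree) of the bottom-up construction described above: pack sorted elements into leaves of B items, then build each internal level from groups of B children until a single root remains.

def build_btree(sorted_elems, B):
    leaves = [sorted_elems[i:i + B] for i in range(0, len(sorted_elems), B)]
    levels = [leaves]
    nodes = leaves
    keys = [leaf[-1] for leaf in leaves]          # routing key = max key in subtree
    while len(nodes) > 1:
        parents, parent_keys = [], []
        for i in range(0, len(nodes), B):
            parents.append({"keys": keys[i:i + B], "children": nodes[i:i + B]})
            parent_keys.append(keys[i:i + B][-1])
        levels.append(parents)
        nodes, keys = parents, parent_keys
    return levels                                  # levels[-1][0] is the root

tree = build_btree(list(range(1, 28)), B=3)
print(len(tree), "levels; root keys:", tree[-1][0]["keys"])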
Basic Structures: I/O-Efficient Priority Queue
Internal Priority Queues
• Operations:
– Required:
* Insert
* DeleteMax
* Max
– Optional:
* Delete
* Update
• Implementation:
– Binary tree
– Heap
[Figures: binary max-heap examples of Insertion and DeleteMax, step by step]
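A minimal sketch of the internal binary max-heap used above, with the sift-up / sift-down repair steps written out explicitly (the values come from the slide's example); the external heap below applies the same two repair steps to whole blocks instead of single elements.

class MaxHeap:
    def __init__(self):
        self.a = []

    def insert(self, x):
        self.a.append(x)
        i = len(self.a) - 1
        while i > 0 and self.a[(i - 1) // 2] < self.a[i]:     # sift-up
            self.a[(i - 1) // 2], self.a[i] = self.a[i], self.a[(i - 1) // 2]
            i = (i - 1) // 2

    def delete_max(self):
        top, last = self.a[0], self.a.pop()
        if self.a:
            self.a[0] = last
            i = 0
            while True:                                       # sift-down
                c = 2 * i + 1
                if c + 1 < len(self.a) and self.a[c + 1] > self.a[c]:
                    c += 1
                if c >= len(self.a) or self.a[c] <= self.a[i]:
                    break
                self.a[i], self.a[c] = self.a[c], self.a[i]
                i = c
        return top

h = MaxHeap()
for x in [100, 40, 90, 40, 50, 29, 23, 15, 65, 30]:
    h.insert(x)
print(h.delete_max(), h.delete_max())   # 100 90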
How to Make the Heap I/O-Efficient
I/O Technique 1: Make it many-way
I/O Technique 2: Buffering!
External Heap
[Figure: external heap with an insert buffer kept in main memory]
• Insert buffer in memory
• Heap has fan-out Θ(M/B); each node has Θ(M/B) blocks (a node may not be half-full)
• Heap property: All elements in a child are smaller than those in its parent
External Heap: Insert
[Figures: new elements accumulate in the in-memory insert buffer; when it is full its contents are pushed into the heap, repaired by a bottom-up chain of sift-ups and swaps between nodes]
External Heap: DeleteMax
[Figures: maxima are served from the root; when the root becomes less than half full it is refilled from its children, merging their blocks, possibly cascading down the heap]
External Heap: I/O Analysis
• What is the I/O cost for a sequence of N mixed insertions / deletemax operations? (the analysis in the original paper is too complicated)
• Height of heap: Θ(log_{M/B}(N/B))
• Insertions
  – Wait until insert buffer is full (it has served at least Ω(M) inserts)
  – Then do one (occasionally two) bottom-up chains of sift-ups
    * Cost: O((M/B)·log_{M/B}(N/B))
    * Amortized cost per insert: O((1/B)·log_{M/B}(N/B))
• DeleteMax:
  – Wait until root is below half full (it has served at least Ω(M) deletemax operations)
  – Then do one, two, sometimes a lot of refills… so a simple per-operation charging argument is dead
  – Do one sift-up: this is easy
External Heap: I/O Analysis
• Cost of all refills:
  – Need a global argument
  – Idea: trace individual elements
  – Total amount of "work": O(N log_{M/B}(N/B))
    * One unit of work: move one element up one level
    * Refills do positive work
    * Sift-ups do both positive and negative work
    * |positive work done by refills| + |positive work done by sift-ups| – |negative work done by sift-ups| = O(N log_{M/B}(N/B))
    * But note: |positive work done by sift-ups| > |negative work done by sift-ups|
    * So, |positive work done by refills| = O(N log_{M/B}(N/B))
External Heap: I/O Analysis
• Work done by refills: O(N log_{M/B}(N/B))
• Each refill spends Θ(M/B) I/Os and does Θ(M) work
• Total # I/Os for all refills:
    (N log_{M/B}(N/B) / M) · (M/B) = (N/B) log_{M/B}(N/B)
• How about merges?
• Amortized I/O per operation: O((1/B) log_{M/B}(N/B)) ⇒ sort(N) I/Os for N operations
• Another way of sorting
External Heap: In Practice
• In practice: Know the scale of your problem!
– Suppose M = 512M, B = 256K, then two levels can support
M*(M/B) = 1024G = 1T of data!
[Figure: two-level external heap with insert buffer in main memory; fan-out Θ(M/B), each node Θ(M/B) blocks]
• Amortized I/O per insert or delete-max: O(1/B)
Recap: Basic General I/O Techniques
(1) Make it many-way: Merge sort
(2) Buffering: External heap (priority queue)
(3) Reduce to sort + pqueue
Pointer Dereferencing
• “Almost every problem in computer science can be solved by
another level of indirection”
[Figure: pointer array P[i] = 5 3 7 3 6 4 8 pointing into data array D[i]]
• Dereferencing each pointer needs many random I/Os
• How do we get the values I/O-efficiently?
  – Output (i, data) pairs
I/O-Efficient Pointer Dereferencing
[Figure: pointer array P[i] = 5 3 7 3 6 4 8 and data array D[i]]
Total I/O: sort(N)
• Sort pointer array by pointers
  – Produce a list of (i, P[i]) pairs, sorted by P[i]
• Scan both arrays in parallel
  – Produce (i, data) pairs
• Sort the list back by i if needed
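A minimal sketch of the sort-based dereferencing above, simulated in memory: because the (index, pointer) pairs are sorted by pointer first, the accesses to D happen in increasing order, i.e., a sequential scan in the external model; the calls to sorted() stand in for external sorts.

def dereference(P, D):
    by_ptr = sorted(range(len(P)), key=lambda i: P[i])   # pass 1: sort by pointer
    pairs = [(i, D[P[i]]) for i in by_ptr]                # pass 2: accesses to D in sorted order
    return [data for _, data in sorted(pairs)]            # pass 3: sort back by index

P = [5, 3, 7, 3, 6, 4, 8]                                 # pointer array from the slide
D = list(range(100, 110))                                 # hypothetical data array
print(dereference(P, D))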
Time-Forward Processing
[Figure: a DAG with vertices laid out in topological order; values flow along edges from left to right]
Total I/O: sort(N)
• Scan sequence in order, create a priority queue
• For a cell:
  – For each incoming edge: DeleteMin from pq if there's a match, obtain the incoming value
  – Compute the outgoing value
  – For each outgoing edge: Insert (destination address, value) to pq, with destination as key
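A minimal sketch of time-forward processing on a DAG whose vertices are numbered in topological order (an assumed toy instance, not the slide's figure): each vertex computes an output value from its inputs and "sends" it forward through a priority queue keyed by the destination vertex.

import heapq

def time_forward(values, edges, combine=max):
    """values[v]: local value of v; edges: list of (u, v) with u < v."""
    out_edges = {}
    for u, v in edges:
        out_edges.setdefault(u, []).append(v)
    pq, result = [], {}
    for v in sorted(values):                          # scan vertices in topological order
        incoming = []
        while pq and pq[0][0] == v:                   # receive all messages addressed to v
            incoming.append(heapq.heappop(pq)[1])
        result[v] = combine([values[v]] + incoming)   # compute the outgoing value
        for w in out_edges.get(v, []):                # forward it to all out-neighbors
            heapq.heappush(pq, (w, result[v]))
    return result

print(time_forward({1: 2, 2: 7, 3: 1, 4: 9}, [(1, 2), (1, 3), (2, 4), (3, 4)]))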
Application: Maximal Independent Set
• Given an undirected graph G = (V,E) stored on disk
– A list of (vertex-id, vertex-id) pairs representing all edges
• An independent set is a set I of vertices so that no two vertices in I
are adjacent
• Set I is maximal if adding any other vertex to I makes it no longer independent
– Note: maximum independent set is NP-hard!
• Internal memory
– Add vertices one by one until no more vertices can be added
– Time: O(|E|)
I/O-Efficient Maximal Independent Set
[Figure: example graph on vertices 1–7 with edges directed from lower to higher id]
Total I/O: sort(N)
• Make all edges directed from a low vertex id to a high vertex id
• Sort all edges by source
• Now have a time-forward processing problem!
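A minimal sketch of the reduction above, simulated in memory on an assumed toy graph: edges are directed from lower to higher id and sorted by source, so each vertex decides "in / out" knowing only the decisions of its lower-id neighbors, delivered time-forward through a priority queue.

import heapq

def maximal_independent_set(n, edges):
    directed = sorted((min(u, v), max(u, v)) for u, v in edges)  # low id -> high id, by source
    pq, in_set = [], set()
    i = 0
    for v in range(1, n + 1):               # scan vertices in id order
        blocked = False
        while pq and pq[0][0] == v:         # messages from lower-id neighbors already in the set
            heapq.heappop(pq)
            blocked = True
        if not blocked:
            in_set.add(v)                   # no chosen neighbor so far: take v
        while i < len(directed) and directed[i][0] == v:
            u, w = directed[i]
            if v in in_set:
                heapq.heappush(pq, (w, v))  # tell higher-id neighbor w that v was chosen
            i += 1
    return in_set

print(maximal_independent_set(7, [(1, 2), (1, 4), (2, 3), (4, 6), (3, 7), (5, 7), (2, 5)]))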
Big Data Algorithms
[Timeline: Parallel algorithms (1980), External memory algorithms (1988), Data stream algorithms (1999), Distributed algorithms (2006)]
Problem One: Missing Card
• I take one from a deck of 52 cards, and pass the rest to you. Suppose
you only have a (very basic) calculator and bad memory, how can
you find out which card is missing with just one pass over the 51
cards?
• What if there are two missing cards?
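One standard answer, sketched under the assumption that cards are encoded as the numbers 1..52: keep a running sum (and, for two missing cards, also a running sum of squares), then solve for the missing value(s) at the end, all in one pass with constant memory.

def missing_one(seen, full=range(1, 53)):
    return sum(full) - sum(seen)                              # one pass, constant memory

def missing_two(seen, full=range(1, 53)):
    s = sum(full) - sum(seen)                                 # x + y
    q = sum(v * v for v in full) - sum(v * v for v in seen)   # x^2 + y^2
    # From x + y = s and x^2 + y^2 = q:  x*y = (s^2 - q) / 2, so x, y solve a quadratic.
    p = (s * s - q) // 2
    disc = int((s * s - 4 * p) ** 0.5)
    return (s - disc) // 2, (s + disc) // 2

deck = list(range(1, 53))
print(missing_one([c for c in deck if c != 17]))              # 17
print(missing_two([c for c in deck if c not in (5, 42)]))     # (5, 42)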
A data stream algorithm …
• Makes one pass over the input data
• Uses a small amount of memory (much smaller than the input
data)
• Computes something
Why do we need streaming algorithms
• Networking
– Often get to see the data once
– Don’t want to store the entire data
• Databases
– Data stored on disk, sequential scans are much faster
• Data stream algorithms have been a very active research area for the
past 15 years
• Problems considered today
– Missing card
– Reservoir sampling
– Majority
– Heavy hitters
Reservoir Sampling [Waterman ' ??; Vitter '85]
• Maintain a (uniform) sample (w/o replacement) of
size s from a stream of n items
• Every subset of size s has equal probability to be the
sample
• When the i-th item arrives
– With probability s/i, use it to replace an item in the
current sample chosen uniformly at random
– With probability 1-s/i, throw it away
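A minimal sketch of reservoir sampling as just described: keep the first s items, then replace a uniformly random slot with probability s/i.

import random

def reservoir_sample(stream, s):
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= s:
            sample.append(x)                   # fill the reservoir
        elif random.random() < s / i:          # with probability s/i ...
            sample[random.randrange(s)] = x    # ... replace a random current item
    return sample

print(reservoir_sample(range(1, 1001), s=10))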
Reservoir Sampling: Correctness Proof
• Sketch (induction on i): assume each of the first i-1 items is in the sample with probability s/(i-1). Item i enters with probability s/i. An old item stays iff it was in the sample and is not the one replaced: (s/(i-1)) · (1 − (s/i)·(1/s)) = (s/(i-1)) · ((i-1)/i) = s/i.
Problem two: Majority
• Given a sequence of items, find the majority if there is one
• AABCDBAABBAAAAAACCCDABAAA
• Answer: A
• Trivial if we have O(n) memory
• Can you do it with O(1) memory and two passes?
– First pass: find the possible candidate
– Second pass: compute its frequency and verify that it is > n/2
• How about one pass?
– Unfortunately, no
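A minimal sketch of the two-pass, O(1)-memory approach outlined above (the Boyer-Moore voting technique): pass 1 finds the only possible candidate, pass 2 verifies that it really occurs more than n/2 times.

def majority(seq):
    candidate, count = None, 0
    for x in seq:                    # pass 1: find the candidate
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    if sum(1 for x in seq if x == candidate) > len(seq) / 2:   # pass 2: verify
        return candidate
    return None

print(majority("AABCDBAABBAAAAAACCCDABAAA"))   # 'A'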
Problem three: Heavy hitters
• Problem: find all items with counts > φn, for some 0 < φ < 1
• Relaxation:
– If an item has count > φ n, it must be reported, together with its
estimated count with (absolute) error < εn
– If an item has count < (φ − ε) n, it cannot be reported
– For items in between, don’t care
• In fact, we will solve the most difficult case φ = ε
• Applications
– Frequent IP addresses
– Data mining
Heavy hitters
• Misra-Gries (MG) algorithm finds up to k items that occur more than 1/k fraction of the time in a stream [MG'82]
  – Estimate their frequencies with additive error ≤ N/(k+1)
• Keep k different candidates in hand. For each item in stream:
  – If item is monitored, increase its counter
  – Else, if < k items monitored, add new item with count 1
  – Else, decrease all counts by 1
[Figure: example run with k=5 counters on a stream of items 1–9]
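A minimal sketch of the Misra-Gries algorithm as described above, with k counters.

def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1                 # monitored: increase its counter
        elif len(counters) < k:
            counters[x] = 1                  # room left: start monitoring x
        else:
            for y in list(counters):         # full: decrease all counts by 1
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

print(misra_gries("1223334444555556666667777777", k=5))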
Streaming MG analysis
• N = total input size
• Error in any estimated count at most N/(k+1)
  – Estimated count a lower bound on true count
  – Each decrement spread over (k+1) items: 1 new one and k in MG
  – Equivalent to deleting (k+1) distinct items from stream
  – At most N/(k+1) decrement operations
  – Hence, can have "deleted" N/(k+1) copies of any item
  – So estimated counts have at most this much error
How about deletions?
• Any deterministic algorithm needs Ω(n) space
– Why?
– In fact, Las Vegas randomization doesn’t help
• Will design a randomized algorithm that works with high probability
– For any item x, we can estimate its actual count within error εn
with probability 1-δ for any small constant δ
The Count-Min sketch [Cormode, Muthukrishnan, 2003]
A Count-Min (CM) sketch with parameters (ε, δ) is represented by a two-dimensional array of counters with width w and depth d: count[1,1] … count[d,w].
Given parameters (ε, δ), set w = ⌈2/ε⌉ and d = ⌈log(1/δ)⌉. Each entry of the array is initially zero.
d hash functions are chosen uniformly at random from a 2-universal family:
  h_1, …, h_d : {1…n} → {1…w}
For example, we can choose a prime number p > u, and random a_j, b_j, j = 1,…,d. Define:
  h_j(x) = ((a_j·x + b_j) mod p) mod w
Property: for any x ≠ y, Pr[h_j(x) = h_j(y)] ≤ 1/w
Updating the sketch
Update procedure: when item x arrives, set, for all 1 ≤ j ≤ d:
  count[j, h_j(x)] ← count[j, h_j(x)] + 1
[Figure: item x is hashed by h_1, …, h_d; one counter in each of the d rows is incremented]
When item x is deleted, do the same except changing +1 to -1
Estimating the count of x
Query Q(x): â_x = min_j count[j, h_j(x)]
(a_x = actual count of x, â_x = estimated count)
Theorem 1: a_x ≤ â_x, and Pr[â_x > a_x + εn] ≤ δ
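A minimal sketch of the Count-Min sketch defined above, using the 2-universal hash family h_j(x) = ((a_j·x + b_j) mod p) mod w from the slide; the prime p is an assumed value larger than the universe size, and items are assumed to be integers.

import math, random

class CountMin:
    def __init__(self, eps, delta, p=2_147_483_647):    # p: an assumed prime > universe size
        self.w = math.ceil(2 / eps)                      # width  w = ceil(2/eps)
        self.d = math.ceil(math.log2(1 / delta))         # depth  d = ceil(log 1/delta)
        self.p = p
        self.hashes = [(random.randrange(1, p), random.randrange(p)) for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def _buckets(self, x):
        return [((a * x + b) % self.p) % self.w for a, b in self.hashes]

    def update(self, x, c=1):                            # c = +1 on arrival, -1 on deletion
        for j, h in enumerate(self._buckets(x)):
            self.count[j][h] += c

    def estimate(self, x):
        return min(self.count[j][h] for j, h in enumerate(self._buckets(x)))

cm = CountMin(eps=0.01, delta=0.01)
for x in [7] * 100 + list(range(1000)):
    cm.update(x)
print(cm.estimate(7))   # true count is 101; estimate is >= 101 and close with high probability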
Proof
We introduce indicator variables:
  I_{x,y,j} = 1 if (x ≠ y) and (h_j(x) = h_j(y)), and 0 otherwise
  E(I_{x,y,j}) = Pr[h_j(x) = h_j(y)] ≤ 1/w ≤ ε/2
Define the variable I_{x,j} = Σ_y I_{x,y,j} · a_y
By construction, count[j, h_j(x)] = a_x + I_{x,j}, so min_j count[j, h_j(x)] ≥ a_x.
For the other direction, observe that
  E(I_{x,j}) = E(Σ_y I_{x,y,j} · a_y) = Σ_y a_y · E(I_{x,y,j}) ≤ εn/2
  Pr[â_x > a_x + εn] = Pr[∀j: count[j, h_j(x)] > a_x + εn]
                     = Pr[∀j: a_x + I_{x,j} > a_x + εn]
                     ≤ Pr[∀j: I_{x,j} > 2·E(I_{x,j})] ≤ 2^{-d} ≤ δ
  (Markov inequality: Pr[X > t] ≤ E(X)/t for t > 0)
So, the Count-Min sketch has size O((1/ε)·log(1/δ)). ∎
Big Data Algorithms
[Timeline: Parallel algorithms (1980), External memory algorithms (1988), Data stream algorithms (1999), Distributed algorithms (2006)]
Distributed Systems
• Performance vs. programmability
  – MPI
• MapReduce provides
  – Automatic parallelization & distribution
  – Fault tolerance
  – Scalability
  – Restricted programming model
Map/Reduce
• map(key, val) is run on each item in set
  – emits new-key / new-val pairs
• reduce(key, vals) is run for each unique key emitted by map()
  – emits final output

Count words in docs
• Input consists of (url, contents) pairs
• map(key=url, val=contents):
  – For each word w in contents, emit (w, "1")
• reduce(key=word, values=uniq_counts):
  – Sum all "1"s in values list
  – Emit result "(word, sum)"
Count, Illustrated
Input: "see bob throw", "see spot run"
map(key=url, val=contents): for each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts): sum all "1"s in values list, emit (word, sum)

map output:    (see, 1) (bob, 1) (run, 1) (see, 1) (spot, 1) (throw, 1)
reduce output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
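A minimal sketch simulating the word-count job above in plain Python: a map phase over (url, contents) pairs, a shuffle that groups values by key, and a reduce phase; a real MapReduce run distributes exactly these steps across machines.

from collections import defaultdict

def map_fn(url, contents):
    for w in contents.split():
        yield (w, 1)                          # emit (word, 1) for each word

def reduce_fn(word, counts):
    yield (word, sum(counts))                 # sum all the 1s

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, val in inputs:
        for k, v in map_fn(key, val):         # map phase
            groups[k].append(v)               # shuffle: group by key
    out = []
    for k in sorted(groups):
        out.extend(reduce_fn(k, groups[k]))   # reduce phase
    return out

docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]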
Reverse Web-Link Graph
• Map
  – For each URL linking to target, …
  – Output <target, source> pairs
• Reduce
  – Concatenate list of all source URLs
  – Outputs: <target, list(source)> pairs
Inverted Index
• Map
  – For each (url, doc) pair
  – Emit (keyword, url) for each keyword in doc
• Reduce
  – For each keyword, output (keyword, list of urls)
Model is Widely Applicable
[Figure: growth of MapReduce programs in the Google source tree]
Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, ...
Implementation Overview
Typical cluster:
• 100s/1000s of multicore x86 machines, 4 GB of
memory
• One or two-level tree-shaped switched network with
100 Gbps of aggregate bandwidth at the root
• Storage is on local IDE disks
• GFS: distributed file system manages data
• Job scheduling system: jobs made up of tasks,
scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
Execution
How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition space of output map keys, and run reduce() in parallel
If map() or reduce() fails, re-execute!
Job Processing
[Figure: a JobTracker coordinating TaskTrackers 0–5 for a "grep" job]
1. Client submits "grep" job, indicating code and input files
2. JobTracker breaks input file into k chunks. Assigns work to trackers.
3. After map(), tasktrackers exchange map output to build reduce() keyspace
4. JobTracker breaks reduce() keyspace into m chunks. Assigns work.
5. reduce() output may go to NDFS
Execution
Parallel Execution
Task Granularity & Pipelining
• Fine granularity tasks: map tasks >> machines
  – Minimizes time for fault recovery
  – Can pipeline shuffling with map execution
  – Better dynamic load balancing
Computational Model for MapReduce?
• Complicated, many factors, still no consensus
• Communication (total vs. maximum)
• Space
• Time vs. work (parallelism)
• # rounds
MapReduce: Pros and Cons
• Pros
  – Simple programming model
  – Excellent scalability for one-round batch jobs
  – Fault tolerance
• Cons
  – Programming model too simple
  – Poor performance for iterative jobs
Google Pregel for Graph Data
• Master/Worker model
• Each worker assigned a subset of a graph's vertices
• Vertex-centric model. Each vertex has:
  – An arbitrary "value" that can be get/set
  – List of messages sent to it
  – List of outgoing edges (edges have a value too)
  – A binary state (active/inactive)
The Pregel model
• Bulk Synchronous Parallel model (Valiant, 1990)
• Synchronous iterations of asynchronous computation
• Master initiates each iteration (called a "superstep")
• At every superstep
  – Workers asynchronously execute a user function on all of their vertices
  – Vertices can receive messages sent to them in the last superstep
  – Vertices can send messages to other vertices to be received in the next superstep
  – Vertices can modify their value, modify values of edges, change the topology of the graph (add/remove vertices or edges)
  – Vertices can "vote to halt"
• Execution stops when all vertices have voted to halt and no vertices have messages
  – Vote to halt trumped by non-empty message queue
Illustration: vertex partitions
[Figure: graph vertices partitioned across Worker 1, Worker 2, and Worker 3; image from http://stochastix.files.wordpress.com/]
Loading the graph input
• Master assigns section of input to each worker
  – Vertex "ownership" determined by hash(v) mod N
    * N - number of partitions
    * Recall each worker is assigned one or more partitions
  – User can modify this to exploit data locality
• Worker reads its section of input:
  – Stores vertices belonging to it
  – Sends other vertices to the appropriate worker
• Input stored on something like GFS
  – Section assignments determined by data locality
Simple example: max propagation
  old_val := val
  for each message m
    if m > val then val := m
  if old_val == val then
    vote_to_halt
  else
    for each neighbor v
      send_message(v, val)
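A minimal sketch of a Pregel-style superstep loop running the max-propagation vertex program above, with an assumed initial superstep in which every vertex sends its value; the graph, vertex values, and synchronous message delivery are all simulated in one process.

def pregel_max(values, neighbors):
    # assumed superstep 0: every vertex sends its value to its neighbors
    inbox = {v: [values[u] for u in values if v in neighbors[u]] for v in values}
    halted = set()
    while True:
        outbox = {v: [] for v in values}
        for v in values:
            if not inbox[v] and v in halted:
                continue                          # inactive and no messages: skip
            old_val = values[v]
            for m in inbox[v]:                    # receive messages from last superstep
                if m > values[v]:
                    values[v] = m
            if values[v] == old_val:
                halted.add(v)                     # vote to halt
            else:
                halted.discard(v)
                for w in neighbors[v]:            # send new value to all neighbors
                    outbox[w].append(values[v])
        inbox = outbox
        if all(not msgs for msgs in outbox.values()) and len(halted) == len(values):
            return values

graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}        # an assumed path graph
print(pregel_max({1: 3, 2: 6, 3: 2, 4: 1}, graph))    # every vertex ends at 6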
Combiners
• Sometimes vertices only care about a summary value for the messages they are sent (e.g., previous example)
• Combiners allow for this (examples: min, max, sum, avg)
• Messages combined locally and remotely
• Reduces bandwidth overhead
• User-defined, not enabled by default
[Figure: messages from Workers 1–3 destined for the same vertex are merged by per-worker combiners before being sent]
Aggregators
• Compute aggregate statistics from vertex-reported values
• During a superstep, each worker aggregates values from its vertices to form a partially aggregated value
• At the end of a superstep, partially aggregated values from each worker are aggregated in a tree structure
  – Allows for the parallelization of this process
• Global aggregate is sent to the master
Fault Tolerance (1/2)
• At start of superstep, master tells workers to save their state:
  – Vertex values, edge values, incoming messages
  – Saved to persistent storage
  – Master saves aggregator values (if any)
• This isn't necessarily done at every superstep
  – That could be very costly
• Authors determine checkpoint frequency using a mean-time-to-failure model
Fault Tolerance (2/2)
• When master detects one or more worker failures:
  – All workers revert to last checkpoint
  – Continue from there
  – That's a lot of repeated work!
  – At least it's better than redoing the whole thing.
Example 1: PageRank
  PR(u) = 0.15 × (1/N) + 0.85 × Σ_{v→u} PR(v) / outdegree(v)
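A minimal sketch of the PageRank formula above as an iterative, Pregel-style computation (assumed toy graph and a fixed number of supersteps; the constants 0.15 / 0.85 are from the slide).

def pagerank(out_links, supersteps=10):
    N = len(out_links)
    pr = {v: 1.0 / N for v in out_links}
    for _ in range(supersteps):
        # each vertex sends PR(v) / outdegree(v) along its outgoing edges
        incoming = {v: 0.0 for v in out_links}
        for v, targets in out_links.items():
            for u in targets:
                incoming[u] += pr[v] / len(targets)
        # each vertex recomputes its value from the messages it received
        pr = {u: 0.15 / N + 0.85 * incoming[u] for u in out_links}
    return pr

print(pagerank({1: [2, 3], 2: [3], 3: [1]}))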
Example 2: Single Source Shortest Paths
At each superstep:
[Figure: a vertex with current distance d_v receives messages d_0, d_1 and has outgoing edges of weight w_s, w_t to vertices with distances d_s, d_t]
• the vertex receives messages
• if min(d_0, d_1) < d_v, it updates its new minimum distance from s and sends messages (in the figure, d_0 + w_s and d_0 + w_t) to its neighbors
• else, it votes to halt
After execution, each vertex's value is its minimum distance from s
Example 2: SSSP Combiner
• Each vertex interested only in minimum of its messages
• Might as well use a combiner!
Computational Model for Pregel
• # supersteps
  – L (maximum # edges in any shortest path from source)
• Communication
  – O(E log V)?
  – O(E·L)
• Better algorithms are known, but harder to implement
Conclusions
• Algorithm design facing new constraints/challenges in the big data era
• Resources other than time may be the main consideration
• Data movement cost often the primary concern
• Algorithmic ideas often independent of technological improvements
Thank you!