Map-Reduce and Its Children
Distributed File Systems
Map-Reduce and Hadoop
Dataflow Systems
Extensions for Recursion
Distributed File Systems
Chunking
Replication
Distribution on Racks
Distributed File Systems
Files are very large and are read or appended to, not updated in place.
They are divided into chunks.
 Typically 64 MB per chunk.
Chunks are replicated at several compute nodes.
A master (possibly replicated) keeps track of all locations of all chunks.
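A minimal sketch of the master's bookkeeping (illustrative Python; file and node names are hypothetical, not any real DFS's API):

```python
# Hypothetical sketch of a DFS master's chunk-location table.
# All names are illustrative; real systems (GFS, HDFS) differ.
CHUNK_SIZE = 64 * 2**20  # 64 MB per chunk

chunk_locations = {
    # (file name, chunk index) -> compute nodes holding a replica
    ("weblog", 0): ["rack1-node3", "rack2-node1", "rack3-node7"],
    ("weblog", 1): ["rack1-node5", "rack2-node2", "rack3-node4"],
}

def replicas(file, offset):
    """Which nodes hold the chunk containing this byte offset?"""
    return chunk_locations[(file, offset // CHUNK_SIZE)]
```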
Compute Nodes
Organized into racks.
Intra-rack connection typically gigabit
speed.
Inter-rack connection faster by a small
factor.
Racks of Compute Nodes
[Figure: file chunks distributed across racks of compute nodes; 3-way replication of files, with copies on different racks.]
Implementations
GFS (Google File System; proprietary).
HDFS (Hadoop Distributed File System; open source).
CloudStore (formerly Kosmos File System; open source).
Above the DFS
Map-Reduce
Key-Value Stores
SQL Implementations
The New Stack
[Figure: the new software stack. At the bottom, a distributed file system; above it, map-reduce (e.g., Hadoop) and an object store (key-value store), e.g., BigTable, Hbase, Cassandra; on top, SQL implementations, e.g., PIG (relational algebra) and HIVE.]
Map-Reduce Systems
Map-reduce (Google) and its open-source (Apache) equivalent, Hadoop.
Important specialized parallel
computing tool.
Cope with compute-node failures.
 Avoid restart of the entire job.
Key-Value Stores
BigTable (Google), Hbase, Cassandra
(Apache), Dynamo (Amazon).
 Each row is a key plus values over a
flexible set of columns.
 Each column component can be a set of
values.
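A minimal sketch of this data model (illustrative Python, not the API of any of these systems): a row key maps to a flexible set of columns, and each column holds a set of values.

```python
# Illustrative model of a key-value-store row: a row key maps to a
# flexible set of columns, and each column holds a *set* of values.
from collections import defaultdict

table = defaultdict(lambda: defaultdict(set))

table["user:42"]["email"].add("ann@example.com")
table["user:42"]["email"].add("ann@work.example.com")  # multiple values OK
table["user:42"]["city"].add("Palo Alto")
table["user:99"]["phone"].add("555-0100")  # different columns per row

print(sorted(table["user:42"]["email"]))
```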
SQL-Like Systems
PIG – Yahoo! implementation of
relational algebra.
 Translates to a sequence of map-reduce
operations, using Hadoop.
Hive – open-source (Apache)
implementation of a restricted SQL, called
QL, over Hadoop.
SQL-Like Systems – (2)
Sawzall – Google implementation of
parallel select + aggregation.
Scope – Microsoft implementation of
restricted SQL.
Map-Reduce
Formal Definition
Fault-Tolerance
Example: Join
Map-Reduce
You write two functions, Map and
Reduce.
 They each have a special form to be
explained.
System (e.g., Hadoop) creates a large
number of tasks for each function.
 Work is divided among tasks in a precise
way.
Map-Reduce Pattern
[Figure: input from the DFS flows to Map tasks, which emit "key"-value pairs; these flow to Reduce tasks, whose output is written back to the DFS.]
Map-Reduce Algorithms
Map tasks convert inputs to key-value
pairs.
 “keys” are not necessarily unique.
Outputs of Map tasks are sorted by key,
and each key is assigned to one Reduce
task.
Reduce tasks combine values associated
with a key.
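As a concrete illustration (not from the slides), here is a minimal word count in Python; the run function stands in for what the system does between the phases: sorting Map output by key and handing each key's values to one Reduce call.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Convert input to key-value pairs; keys (words) need not be unique.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Combine all values associated with one key.
    yield (key, sum(values))

def run(inputs):
    # Stand-in for the system: sort Map output by key, group, Reduce.
    pairs = sorted(kv for doc in inputs for kv in map_fn(doc))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reduce_fn(key, (v for _, v in group))

print(dict(run(["the cat sat", "the cat ran"])))  # {'cat': 2, ...}
```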
Coping With Failures
 Map-reduce is designed to deal with
compute-nodes failing to execute a task.
 Re-executes failed tasks, not whole jobs.
 Failure modes:
1. Compute node failure (e.g., disk crash).
2. Rack communication failure.
3. Software failures, e.g., a task requires Java
n; node has Java n-1.
Things Map-Reduce is Good At
1. Matrix-Matrix and Matrix-vector
multiplication.
 One step of the PageRank iteration was
the original application.
2. Relational algebra operations.
 We’ll do an example of the join.
3. Many other “embarrassingly parallel”
operations.
Joining by Map-Reduce
Suppose we want to compute
R(A,B) JOIN S(B,C), using k Reduce
tasks.
 I.e., find tuples with matching B-values.
R and S are each stored in a chunked
file.
Joining by Map-Reduce – (2)
Use a hash function h from B-values to
k buckets.
 Bucket = Reduce task.
The Map tasks take chunks from R and
S, and send:
 Tuple R(a,b) to Reduce task h(b).
• Key = b; value = R(a,b).
 Tuple S(b,c) to Reduce task h(b).
• Key = b; value = S(b,c).
Joining by Map-Reduce – (3)
[Figure: Map tasks send R(a,b) to Reduce task i if h(b) = i; Map tasks send S(b,c) to Reduce task i if h(b) = i. Reduce task i produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.]
Joining by Map-Reduce – (4)
Key point: If R(a,b) joins with S(b,c),
then both tuples are sent to Reduce
task h(b).
Thus, their join (a,b,c) will be produced
there and shipped to the output file.
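A minimal Python sketch of this join (illustrative only, not Hadoop code): Map tags each tuple with its relation and keys it by its B-value; the grouping below stands in for the hash h that routes each B-value to one of the k Reduce tasks.

```python
from collections import defaultdict

def map_join(relation_name, tuples):
    # Send each tuple to the Reduce task for its B-value.
    for t in tuples:
        b = t[1] if relation_name == "R" else t[0]
        yield (b, (relation_name, t))

def reduce_join(b, tagged):
    # All tuples sharing B-value b meet here; emit their join.
    r_side = [a for (name, (a, _)) in tagged if name == "R"]
    s_side = [c for (name, (_, c)) in tagged if name == "S"]
    for a in r_side:
        for c in s_side:
            yield (a, b, c)

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b2", "c2")]

groups = defaultdict(list)  # stand-in for hashing b to a Reduce task
for key, val in list(map_join("R", R)) + list(map_join("S", S)):
    groups[key].append(val)

out = [t for b, vals in groups.items() for t in reduce_join(b, vals)]
print(sorted(out))  # [('a1','b1','c1'), ('a2','b1','c1'), ('a3','b2','c2')]
```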
Dataflow Systems
Arbitrary Acyclic Flow Among Tasks
Preserving Fault Tolerance
The Blocking Property
Generalization of Map-Reduce
Map-reduce uses only two functions
(Map and Reduce).
 Each is implemented by a rank of tasks.
 Data flows from Map tasks to Reduce tasks
only.
Generalization – (2)
Natural generalization is to allow any
number of functions, connected in an
acyclic network.
Each function implemented by tasks
that feed tasks of successor function(s).
Key fault-tolerance (blocking) property: tasks produce all their output at the end.
Many Implementations
1. Clustera – University of Wisconsin.
2. Hyracks – University of California, Irvine.
3. Dryad/DryadLINQ – Microsoft.
4. Nephele/PACT – TU Berlin.
5. BOOM – Berkeley.
6. epiC – National University of Singapore.
Example: Join + Aggregation
Relations D(emp,dept) and S(emp,sal).
Compute the sum of the salaries for
each department.
D JOIN S computed by map-reduce.
 But each Reduce task can also group its
emp-dept-sal tuples by dept and sum the
salaries.
Example: Continued
A third function is needed to take the dept-SUM(sal) pairs from each Reduce task, organize them by dept, and compute the final sum for each department.
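A minimal sketch of the three ranks of tasks just described (illustrative Python, not a dataflow system's API): Map tasks route tuples by emp, join tasks group by dept and compute partial sums, and a final rank combines the partial sums.

```python
from collections import defaultdict

D = [("ann", "toys"), ("bob", "toys"), ("cal", "books")]   # (emp, dept)
S = [("ann", 100), ("bob", 150), ("cal", 120)]             # (emp, sal)

# Rank 1 (Map): hash both relations by emp so each join task
# sees all facts about its employees.
k = 2
join_inputs = defaultdict(list)
for emp, dept in D:
    join_inputs[hash(emp) % k].append(("D", emp, dept))
for emp, sal in S:
    join_inputs[hash(emp) % k].append(("S", emp, sal))

# Rank 2 (Join + Group): join on emp, then pre-aggregate by dept.
partials = []  # (dept, partial sum) pairs from every join task
for task in join_inputs.values():
    dept_of = {e: d for (tag, e, d) in task if tag == "D"}
    sums = defaultdict(int)
    for tag, e, v in task:
        if tag == "S" and e in dept_of:
            sums[dept_of[e]] += v
    partials.extend(sums.items())

# Rank 3 (Final Group + Aggregate): combine partial sums by dept.
total = defaultdict(int)
for dept, s in partials:
    total[dept] += s
print(dict(total))  # {'toys': 250, 'books': 120}
```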
3-Layer Dataflow
[Figure: relations D and S feed Map tasks, which hash tuples by emp. Their output feeds Join + Group tasks, which hash their output by dept. Those feed Final Group + Aggregate tasks, which produce the result.]
Recursion
Transitive-Closure Example
Fault-Tolerance Problem
Endgame Problem
Some Systems and Approaches
Recent research ideas contributed by
F. Afrati, V. Borkar, M. Carey, N. Polyzotis
Applications Requiring Recursion
1. PageRank, the original map-reduce application, is really a recursion implemented by many rounds of map-reduce.
2. Analysis of Web structure.
3. Analysis of social networks.
4. PDEs.
Transitive Closure
Many recursive applications involving large data are similar to transitive closure:

Nonlinear:
Path(X,Y) :- Arc(X,Y)
Path(X,Y) :- Path(X,Z) & Path(Z,Y)

(Right) Linear:
Path(X,Y) :- Arc(X,Y)
Path(X,Y) :- Arc(X,Z) & Path(Z,Y)
Implementing TC on a Cluster
Use k tasks.
Hash function h sends each node of the
graph to one of the k tasks.
Task i receives and stores Path(a,b) if
either h(a) = i or h(b) = i, or both.
Task i must join Path(a,c) with
Path(c,b) if h(c) = i.
TC on a Cluster – Basis
Data is stored as relation Arc(a,b).
Map tasks read chunks of the Arc
relation and send each tuple Arc(a,b) to
recursive tasks h(a) and h(b).
 Treated as if it were tuple Path(a,b).
 If h(a) = h(b), only one task receives.
TC on a Cluster – Recursive Tasks
[Figure: Task i receives Path(a,b). It stores Path(a,b) if new; otherwise it ignores the duplicate. It looks up stored Path(b,c) and/or Path(d,a), for any c and d, then sends Path(a,c) to tasks h(a) and h(c), and sends Path(d,b) to tasks h(d) and h(b).]
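A single-process sketch of this task logic (illustrative Python): message passing among the k tasks is simulated with one work queue, and h is a simple hash mod k.

```python
from collections import deque

def transitive_closure(arcs, k=4):
    h = lambda node: hash(node) % k
    stored = [set() for _ in range(k)]   # Path facts held by each task
    queue = deque()                      # simulated messages: (task, fact)

    def send(a, b):                      # route Path(a,b) to h(a) and h(b)
        for i in {h(a), h(b)}:
            queue.append((i, (a, b)))

    for a, b in arcs:                    # basis: treat each Arc as a Path
        send(a, b)

    while queue:
        i, (a, b) = queue.popleft()
        if (a, b) in stored[i]:          # duplicate: ignore
            continue
        stored[i].add((a, b))
        # join the new fact with facts already stored at task i
        for c, d in list(stored[i]):
            if c == b:                   # Path(a,b) + Path(b,d) = Path(a,d)
                send(a, d)
            if d == a:                   # Path(c,a) + Path(a,b) = Path(c,b)
                send(c, b)

    return set().union(*stored)

print(sorted(transitive_closure([(1, 2), (2, 3), (3, 4)])))
```

Both facts of any joinable pair Path(a,c), Path(c,b) are routed to task h(c), so every join is found; duplicates are harmless because storage is a set.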
Big Problem: Managing Failure
Map-reduce depends on the blocking
property.
Only then can you restart a failed task
without restarting the whole job.
But any recursive task has to deliver
some output and later get more input.
HaLoop (U. Washington)
Iterates Hadoop, once for each round
of the recursion.
 Like iterative PageRank.
Similar idea: Twister (U. Indiana).
The clever part is the way HaLoop tries to run each task in round i at a compute node where it can find its needed output from round i – 1.
Pregel (Google)
Views all computation as a recursion on
some graph.
Nodes send messages to one another.
 Messages bunched into supersteps.
Checkpoint all compute nodes after
some fixed number of supersteps.
On failure, rolls all tasks back to
previous checkpoint.
Example: Shortest Paths Via Pregel
[Figure: node N receives the message "I found a path from node M to you of length L." N consults its table of shortest paths to N and asks: is this the shortest path from M I know about? If so, N relays the news along its outgoing arcs of lengths 5, 3, and 6: "I found a path from node M to you of length L+5" (likewise L+3 and L+6).]
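A minimal sketch of the vertex logic in this figure (illustrative Python, not the Pregel API): each round of the while loop plays the role of one superstep, and only improvements are relayed.

```python
def shortest_paths(arcs, source):
    """arcs: dict node -> list of (neighbor, edge length)."""
    best = {source: 0}              # each node's best known distance
    messages = {source: 0}          # superstep 0: source knows length 0
    while messages:                 # one loop iteration = one superstep
        next_messages = {}
        for node, dist in messages.items():
            # relay "path of length dist + w" to each neighbor
            for nbr, w in arcs.get(node, ()):
                if dist + w < best.get(nbr, float("inf")):
                    best[nbr] = dist + w       # shortest path known so far
                    next_messages[nbr] = dist + w
        messages = next_messages
    return best

arcs = {"M": [("N", 2)], "N": [("A", 5), ("B", 3), ("C", 6)]}
print(shortest_paths(arcs, "M"))  # {'M': 0, 'N': 2, 'A': 7, 'B': 5, 'C': 8}
```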
Using Idempotence
Some recursive applications allow
restart of tasks even if they have
produced some output.
Example: TC is idempotent; you can
send a task a duplicate Path fact
without altering the result.
 But if you were counting paths, the
answer would be wrong.
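A small illustration of the distinction: a set-based store of Path facts is idempotent under replayed messages, but a count is not.

```python
# Idempotent: storing a Path fact in a set; duplicates change nothing.
paths = set()
for fact in [("a", "b"), ("a", "b")]:   # second delivery is a replay
    paths.add(fact)
print(len(paths))  # 1 -- the replay was harmless

# Not idempotent: counting paths; a replayed fact is counted twice.
count = 0
for fact in [("a", "b"), ("a", "b")]:
    count += 1
print(count)  # 2 -- the answer is now wrong
```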
Big Problem: The Endgame
Some recursions, like TC, take a large
number of rounds, but the number of
new discoveries in later rounds drops.
 T. Vassilakis (Google): searches forward on
the Web graph can take hundreds of
rounds.
Problem: in a cluster, transmitting small
files carries much overhead.
Approach: Merge Tasks
 Decide when to migrate tasks to fewer
compute nodes.
 Data for several tasks at the same node
are combined into a single file and
distributed at the receiving end.
 Downside: old tasks have a lot of state
to move.
 Example: “paths seen so far.”
Approach: Modify Algorithms
Nonlinear recursions can terminate in
many fewer steps than equivalent linear
recursions.
Example: TC.
 O(n) rounds on n-node graph for linear.
 O(log n) rounds for nonlinear.
Advantage of Linear TC
The data-volume cost (= sum of input
sizes of all tasks) for executing linear TC
is generally lower than that for
nonlinear TC.
Why? Each path is discovered only
once.
 Note: distinct paths between the same
endpoints may each be discovered.
Example: Linear TC
[Figure: Arc + Path = Path; each round extends a path by a single arc.]
Nonlinear TC Constructs
[Figure: Path + Path = Path, in many ways; the same path can be built from many different pairs of subpaths.]
Smart TC
 (Valduriez-Boral, Ioannidis) Construct a path from two paths:
1. The first has a length that is a power of 2.
2. The second is no longer than the first.
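A single-machine sketch of the doubling behind this rule (an assumed reading, not the authors' published algorithm; it shows only the round structure, not the unique-decomposition bookkeeping that makes the real algorithm efficient). After round k, all pairs joined by a path of at most 2**k arcs are known, so about log2(n) rounds suffice on an n-node graph.

```python
import math

def smart_tc(arcs):
    """pow2: pairs joined by a walk of exactly 2**k arcs;
    short: pairs joined by a walk of at most 2**k arcs."""
    nodes = {x for pair in arcs for x in pair}
    rounds = max(1, math.ceil(math.log2(max(2, len(nodes)))))
    pow2 = set(arcs)
    short = set(arcs)
    for _ in range(rounds):
        # power-of-2-length prefix joined with a suffix no longer than it
        short |= {(x, z) for (x, y) in pow2 for (w, z) in short if w == y}
        # two power-of-2 halves make the next power of 2
        pow2 = {(x, z) for (x, y) in pow2 for (w, z) in pow2 if w == y}
    return short

print(sorted(smart_tc({(1, 2), (2, 3), (3, 4), (4, 5)})))
```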
Example: Smart TC
[Figure: a path built from a prefix whose length is a power of 2 and a suffix no longer than the prefix.]
Other Nonlinear TC Algorithms
You can have the unique-decomposition
property with many variants of nonlinear
TC.
Example: Balance constructs paths from
two equal-length paths.
 Favor first path when length is odd.
Example: Balance
[Figure: a path built from two halves of equal length; the first half is favored (one arc longer) when the total length is odd.]
Incomparability of TC Algorithms
On different graphs, any of the unique-decomposition algorithms – left-linear, right-linear, smart, balanced – could have the lowest data-volume cost.
Other unique-decomposition algorithms are possible and also could win.
Extension Beyond TC
Can you convert any linear recursion
into an equivalent nonlinear recursion
that requires logarithmic rounds?
Answer: Not always, without increasing
arity and data size.
Positive Points
1. (Agrawal, Jagadish, Ness) All linear Datalog recursions reduce to TC.
2. Right-linear chain-rule Datalog
programs can be replaced by
nonlinear recursions with the same
arity, logarithmic rounds, and the
unique-decomposition property.
Example: Alternating-Color Paths
P(X,Y) :- Blue(X,Y)
P(X,Y) :- Blue(X,Z) & Q(Z,Y)
Q(X,Y) :- Red(X,Z) & P(Z,Y)
The Case of Reachability
Reach(X) :- Source(X)
Reach(X) :- Reach(Y) & Arc(Y,X)
Takes linear rounds as stated.
Can compute nonlinear TC to get Reach
in O(log n) rounds.
But then you compute O(n²) facts instead of O(n) facts on an n-node graph.
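An illustrative sketch of the linear recursion (plain Python, not from the slides): the frontier advances one arc per round, so a path graph of n nodes forces about n rounds, while only the O(n) Reach facts are ever stored.

```python
def reach_linear(source, arcs):
    """Reach(X) :- Source(X).  Reach(X) :- Reach(Y) & Arc(Y,X)."""
    reached, frontier, rounds = {source}, {source}, 0
    while frontier:
        frontier = {x for y in frontier for (y2, x) in arcs if y2 == y} - reached
        reached |= frontier
        rounds += 1
    return reached, rounds

# A path graph 0 -> 1 -> ... -> 9: the linear recursion needs ~n rounds,
# but it only ever stores the O(n) Reach facts.
arcs = {(i, i + 1) for i in range(9)}
print(reach_linear(0, arcs))  # ({0, 1, ..., 9}, 10)
```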
Reachability – (2)
Theorem: If you compute Reach using only unary recursive predicates, then it must take Ω(n) rounds on a graph of n nodes.
 Proof uses the ideas of Afrati, Cosmadakis, and Yannakakis from a generation ago.
Summary: Recursion
Key problems are “endgame” and
nonblocking nature of recursive tasks.
In some applications, endgame problem
can be handled by using a nonlinear
recursion that requires O(log n) rounds
and has the unique-decomposition
property.
Summary: Research Questions
1. How do you best support fault
tolerance when tasks are nonblocking?
2. How do you manage tasks when the
endgame problem cannot be avoided?
3. When can you replace linear recursion
with nonlinear recursion requiring
many fewer rounds and (roughly) the
same data-volume cost?