Map Reduce Extensions Cluster Computing on DFS

Cluster Computing,
Recursion and Datalog
Foto N. Afrati
National Technical University of Athens, Greece
Map-Reduce Pattern
[Figure: Input from DFS → Map tasks → “key”-value pairs → Reduce tasks → Output to DFS]
Tasks and Compute-Node Failures
• Low-end hardware
• Expect failures of the compute-nodes
- disk crashes
- outdated software
- etc.
We don’t want to restart the computation.
In map-reduce this can be easily handled.
Why?
Blocking property
• Map-reduce deals with node failures by
restricting the units of computation in an
important way.
• Both Map tasks and Reduce tasks have
the blocking property:
A task does not deliver output to any other
task until it has completely finished its
work.
Extensions: Dataflow systems
• Uses function prototypes for each kind of task
the job needs, just as Map-Reduce or Hadoop
use prototypes for the Map and Reduce tasks.
• Replicates the prototype into as many tasks as
are needed or requested by the user.
• DryadLINQ (Microsoft), Clustera (U. Wisconsin),
Hyracks (UC Irvine), Nephele/PACT (T. U.
Berlin), BOOM (UC Berkeley), epiC (NUS).
• Other extensions: High-level languages (PIG,
Hive, SCOPE)
Simple extension: Several ranks of
Map-Reduce computations
Map1 → Reduce1 → Map2 → Reduce2 → Map3 → Reduce3
A more advanced extension:
an acyclic network
The blocking property still holds.
Tasks are no longer restricted to Map and Reduce; they can be anything.
We need Recursion
• PageRank — the original problem for
which Map-Reduce was developed
• Studies of the structure of the web
• Discovering communities
• Centrality of persons
• Need full transitive closure
• Really operations on relational data
Outline
• Data-volume cost model, in which we can
evaluate different algorithms for executing
queries on a cluster, recursive or not
• Multiway join in map reduce
• Algorithms for implementing recursive
queries, starting with transitive closure
• Generalizing to all Datalog queries
• File transfer between compute nodes
involves substantial overhead, hence a
need to cope with many small files
Data-volume cost
• Total running time of all the tasks that
collaborate in the execution of the
algorithm.
• Equivalent to the cost of renting processor
time from a public cloud.
• Surrogate: the sum of the sizes of the
inputs to all the tasks that collaborate on a
job.
• Upper limit on the amount of data that can
be input to any one task
Details of the cost model
(other costs)
• Execution time of a task could, in principle,
be much larger than its input data volume
• Many algorithms (e.g., SQL operations,
hash join) perform operations in time
proportional to the input size plus output
size
• The data-volume model counts only input
size, not output size (the output of one task is
the input to another, and the final output is
assumed to be succinct)
Joining by Map-Reduce
[Figure: Map tasks send R(a,b) to Reduce task i if h(b) = i, and send S(b,c) to Reduce task i if h(b) = i. Reduce task i produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.]
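The routing in the figure can be sketched as follows. This is a minimal single-process illustration, not a cluster implementation; the function name `map_reduce_join`, the parameter `num_reducers`, and the use of Python's built-in `hash` for h are assumptions made for the example.

```python
from collections import defaultdict

def map_reduce_join(R, S, num_reducers=4, h=None):
    """Natural join of R(a,b) and S(b,c), partitioned by a hash of b."""
    # Assumption: h is any function from b-values to Reduce task ids.
    h = h or (lambda b: hash(b) % num_reducers)

    # Map phase: route every tuple to Reduce task h(b), tagged by relation.
    inbox = defaultdict(lambda: {"R": [], "S": []})
    for a, b in R:
        inbox[h(b)]["R"].append((a, b))
    for b, c in S:
        inbox[h(b)]["S"].append((b, c))

    # Reduce phase: each task joins only the tuples it received.
    result = set()
    for task in inbox.values():
        s_by_b = defaultdict(list)
        for b, c in task["S"]:
            s_by_b[b].append(c)
        for a, b in task["R"]:
            for c in s_by_b[b]:
                result.add((a, b, c))
    return result

R = [(0, 1), (1, 2)]
S = [(1, 2), (2, 3)]
joined = map_reduce_join(R, S)
```

Because tuples that agree on b always meet at the same Reduce task, no join result is missed and none is produced twice.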
Natural Join of Three Relations
A B
0 1
1 2

B C
1 2
2 3

A C
1 3
2 1

The join:
A B C
1 2 3
Multiway join on Map-Reduce
Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
• We use 100 Reduce tasks
• Hash X five ways and Y 20 ways
• h and g are the hash functions for X and Y
• Reduce task ID = [h(X), g(Y)]
• Send r(a,b) to all tasks [h(b),v]. Similarly, send t(c,d) to
all tasks [u,g(c)]
• r facts are sent to 20 tasks
• t facts are sent to 5 tasks
• s facts are sent to only one task
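The replication counts above can be checked with a small simulation. This is only a sketch under assumptions: `distribute`, `join_at_tasks`, and the sample relations are hypothetical names and data, and h and g are taken to be `hash` modulo 5 and 20.

```python
from collections import defaultdict

K1, K2 = 5, 20  # X is hashed K1 ways, Y is hashed K2 ways; K1 * K2 = 100 tasks

def h(x):
    return hash(x) % K1

def g(y):
    return hash(y) % K2

def distribute(r, s, t):
    """Send each fact to the Reduce tasks [h(X), g(Y)] that may need it."""
    tasks = defaultdict(lambda: {"r": set(), "s": set(), "t": set()})
    copies = {"r": 0, "s": 0, "t": 0}
    for w, x in r:                      # Y unknown: replicate over all g-buckets
        for v in range(K2):
            tasks[(h(x), v)]["r"].add((w, x))
            copies["r"] += 1
    for x, y in s:                      # both X and Y known: exactly one task
        tasks[(h(x), g(y))]["s"].add((x, y))
        copies["s"] += 1
    for y, z in t:                      # X unknown: replicate over all h-buckets
        for u in range(K1):
            tasks[(u, g(y))]["t"].add((y, z))
            copies["t"] += 1
    return tasks, copies

def join_at_tasks(tasks):
    """Each task joins its local r, s, and t facts."""
    answer = set()
    for bucket in tasks.values():
        for w, x in bucket["r"]:
            for x2, y in bucket["s"]:
                if x2 != x:
                    continue
                for y2, z in bucket["t"]:
                    if y2 == y:
                        answer.add((w, x, y, z))
    return answer

r = [(1, 2), (7, 2)]
s = [(2, 3)]
t = [(3, 4), (3, 9)]
tasks, copies = distribute(r, s, t)
answer = join_at_tasks(tasks)
```

With two r facts, one s fact, and two t facts, `copies` comes out as 40, 1, and 10, matching the 20-way, 1-way, and 5-way replication on the slide.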
Multiway join on Map-Reduce
(minimizing the data volume cost)
Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
• If we have k Reduce tasks, we can use two hash
functions h(X) and g(Y) that map X-values and
Y-values to k1 and k2 buckets, respectively.
• k1 k2 = k
• Then, a Reduce task corresponds to a pair of
buckets, one for h and the other for g.
• To minimize the number of tuples transmitted,
pick: k1 = sqrt(k|r|/|t|), k2 = sqrt(k|t|/|r|)
Data-volume cost = k1 |t| + k2 |r| = 2 sqrt(k|r||t|)
(plus |s|, since each s fact goes to only one task)
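Minimizing k1|t| + k2|r| subject to k1 k2 = k gives k1 = sqrt(k|r|/|t|) and k2 = sqrt(k|t|/|r|), for a replication cost of 2 sqrt(k|r||t|). The sketch below checks this numerically. The relation sizes are assumed purely for illustration (they are chosen so that |t| = 4|r|, which reproduces the 5-way and 20-way split of the earlier example), and the function names are hypothetical.

```python
import math

def optimal_shares(k, size_r, size_t):
    """Shares for h(X) and g(Y) that minimize k1*|t| + k2*|r| with k1*k2 = k."""
    k1 = math.sqrt(k * size_r / size_t)
    k2 = math.sqrt(k * size_t / size_r)
    return k1, k2

def replication_cost(k1, k2, size_r, size_t):
    # t facts are sent to k1 tasks each, r facts to k2 tasks each.
    return k1 * size_t + k2 * size_r

# Assumed sizes for the example only.
k, size_r, size_t = 100, 1000, 4000
k1, k2 = optimal_shares(k, size_r, size_t)
best = replication_cost(k1, k2, size_r, size_t)

# Brute-force comparison over all integer factorizations of k.
brute = min(
    replication_cost(a, k // a, size_r, size_t)
    for a in range(1, k + 1)
    if k % a == 0
)
```

In general the optimal shares are not integers, so a real job would round them to a nearby factorization of k.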
Transitive closure (TC)
Nonlinear
p(x,y) <- e(x,y)
p(x,y) <- p(x,z) & p(z,y)
Right-linear
p(x,y) <- e(x,y)
p(x,y) <- e(x,z) & p(z,y)
Lower Bound on Query Execution
Cost
• Number of Derivations: the sum, over all the
rules in the program, of the number of ways we
can assign values to the variables in order to
make the entire body (right side of the rule) true.
• Key point: An implementation of a Datalog
program that executes the rules as written must
take time on a single processor at least as great
as the number of derivations.
• Seminaive evaluation: time proportional to the
number of derivations
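As a rough illustration of seminaive evaluation, the sketch below evaluates the linear rules p(x,y) <- e(x,y) and p(x,y) <- e(x,z) & p(z,y) on a single processor and counts rule firings. The function name `seminaive_tc` and the sample graph are assumptions; the input is assumed to be a set of arcs without duplicates.

```python
def seminaive_tc(edges):
    """Seminaive evaluation of p(x,y) <- e(x,y); p(x,y) <- e(x,z) & p(z,y)."""
    preds = {}                      # z -> every x with e(x, z)
    for x, z in edges:
        preds.setdefault(z, set()).add(x)

    p = set(edges)
    delta = set(edges)              # facts discovered in the previous round
    derivations = len(edges)        # one derivation per arc for the basis rule
    while delta:
        new = set()
        for z, y in delta:          # join only the new p facts with e
            for x in preds.get(z, ()):
                derivations += 1    # one firing of the recursive rule
                if (x, y) not in p:
                    new.add((x, y))
        p |= new
        delta = new
    return p, derivations

chain = {(1, 2), (2, 3), (3, 4)}
closure, derivations = seminaive_tc(chain)
```

Each p fact enters `delta` exactly once, so the work done is proportional to the number of derivations rather than to the number of rounds times the size of p.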
Number of Derivations for
Transitive closure
• Nonlinear TC:
the sum over all nodes c of the number of
nodes that can reach c times the number
of nodes reachable from c.
• Right-linear TC:
the sum over all nodes c of the in-degree
of c times the number of nodes reachable
from c.
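Nonlinear TC has at least as many derivations as linear TC (and usually many more), since the in-degree of c never exceeds the number of nodes that can reach c.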
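The two sums can be compared on a concrete graph. The sketch below counts derivations of the recursive rule only, using the formulas stated above; `transitive_closure`, `derivation_counts`, and the five-node chain are hypothetical names and data chosen for the example.

```python
def transitive_closure(edges):
    """Plain fixpoint computation, used here only to obtain reachability."""
    closure = set(edges)
    while True:
        extra = {
            (x, y)
            for x, z in closure
            for z2, y in edges
            if z == z2
        } - closure
        if not extra:
            return closure
        closure |= extra

def derivation_counts(edges):
    """Return (nonlinear, linear) derivation counts for the recursive rule."""
    closure = transitive_closure(edges)
    nodes = {n for arc in edges for n in arc}
    reach_from = {c: sum(1 for x, _ in closure if x == c) for c in nodes}
    reach_to = {c: sum(1 for _, y in closure if y == c) for c in nodes}
    in_degree = {c: sum(1 for _, y in edges if y == c) for c in nodes}
    # Nonlinear: (nodes that can reach c) * (nodes reachable from c).
    nonlinear = sum(reach_to[c] * reach_from[c] for c in nodes)
    # Linear rule p(x,y) <- e(x,z) & p(z,y): in-degree(c) * (nodes reachable from c).
    linear = sum(in_degree[c] * reach_from[c] for c in nodes)
    return nonlinear, linear

chain = {(0, 1), (1, 2), (2, 3), (3, 4)}
nonlinear, linear = derivation_counts(chain)
```

On the five-node chain the nonlinear rule has 10 derivations and the linear rule has 6; the gap grows with the length of the paths.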
NonlinearTC. Algorithm Dup-Elim
• Join tasks, which perform the join of tuples
as on the earlier join slide.
• Dup-elim tasks, whose job is to catch
duplicate p-tuples before they can
propagate.
NonlinearTC. Algorithm Dup-Elim
A join task
• When p(a, b) is received by a task for the first
time, the task:
1. If this task is h(b), it searches for previously
received tuples p(b, x) for any x. For each such
tuple, it sends p(a, x) to two tasks: h(a) and h(x).
2. If this task is h(a), it searches for previously
received tuples p(y, a) for any y. For each such
tuple, it sends p(y, b) to the tasks h(y) and h(b).
3. The tuple p(a, b) is stored locally as an already
seen tuple.
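The three steps above can be simulated in one process with a message queue standing in for the network. This is a sketch under assumptions, not the authors' implementation: the function name `nonlinear_tc`, the number of tasks, the use of `hash` modulo the task count for h, and the choice to start by sending every arc e(a, b) as p(a, b) to tasks h(a) and h(b) are all illustrative.

```python
from collections import deque

def nonlinear_tc(edges, num_tasks=4, h=None):
    """Simulate join tasks that also eliminate duplicate p-tuples."""
    h = h or (lambda node: hash(node) % num_tasks)   # assumed hash function
    seen = [set() for _ in range(num_tasks)]         # per-task stored tuples
    queue = deque()                                  # (task, tuple) messages
    sent = 0

    def send(fact):
        nonlocal sent
        a, b = fact
        for task in {h(a), h(b)}:                    # the two tasks, or one if equal
            queue.append((task, fact))
            sent += 1

    for arc in edges:                                # assumed initialization
        send(arc)

    while queue:
        task, (a, b) = queue.popleft()
        if (a, b) in seen[task]:
            continue                                 # duplicate: caught here
        if task == h(b):                             # step 1: join with p(b, x)
            for u, x in list(seen[task]):
                if u == b:
                    send((a, x))
        if task == h(a):                             # step 2: join with p(y, a)
            for y, v in list(seen[task]):
                if v == a:
                    send((y, b))
        seen[task].add((a, b))                       # step 3: remember the tuple
    return set().union(*seen), sent

closure, sent = nonlinear_tc([(1, 2), (2, 3), (3, 4)])
```

Whichever of two joinable tuples arrives second at their common task performs the join, so every pair is combined once without any global coordination.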
NonlinearTC. Algorithm Dup-Elim
• The method described above for
combining join and duplicate-elimination
tasks communicates a number of tuples
that is at most the sum of:
1. The number of derivations plus
2. Twice the number of path facts plus
3. Three times the number of arcs.
NonlinearTC. Algorithm Smart
(improvement on the number of derivations)
• There is a path from node X to node Y if there
are two paths:
• One path from X to some node Z with length
which is a power of 2
• Another path from Z to Y whose length is no greater than that of the first path
• Key point: Each shortest path between two
points is discovered only once
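One common way to realize this idea keeps two relations per round: Q, the pairs whose shortest path has length exactly 2^i, and P, the pairs whose shortest path is shorter than 2^i. The sketch below follows that formulation on a single processor; it is an assumption about the details rather than the slide's exact algorithm, and `compose` and `smart_tc` are hypothetical names.

```python
def compose(A, B):
    """All (x, y) such that (x, z) is in A and (z, y) is in B."""
    by_first = {}
    for z, y in B:
        by_first.setdefault(z, set()).add(y)
    return {(x, y) for x, z in A for y in by_first.get(z, ())}

def smart_tc(edges):
    """Transitive closure in a logarithmic number of rounds."""
    q = set(edges)   # shortest path of length exactly 2**i (initially i = 0)
    p = set()        # shortest path of length below 2**i
    rounds = 0
    while q:
        # Power-of-2 path followed by a strictly shorter path.
        p_next = p | q | compose(q, p)     # shortest path below 2**(i+1)
        q = compose(q, q) - p_next         # shortest path exactly 2**(i+1)
        p = p_next
        rounds += 1
    return p, rounds

chain = [(i, i + 1) for i in range(8)]     # path 0 -> 1 -> ... -> 8
closure, rounds = smart_tc(chain)
cycle_closure, _ = smart_tc([(0, 1), (1, 2), (2, 0)])
```

On the nine-node chain the loop finishes after four rounds, whereas extending paths one arc at a time would need eight.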
Incomparability of TC Algorithms
• Infinite variety of algorithms that discover
shortest paths only once.
– Like Smart or Linear algorithms.
• None dominates the others on all graphs.
Implementing any Datalog program
• For each rule we create a collection of tasks
• Buckets by vectors of values, and each
component of the vector is obtained by
hashing a certain variable
• A task receives all facts P(a1, a2, . . . , ak)
discovered by any task, provided that each
component ai meets a constraint
• Alternatively rewrite the rules to have
bodies of at most two subgoals
Implementations handling recursion
• HaLoop: iteration implemented as a series
of Map-Reduce ranks (VLDB 2010)
• Pregel: checkpointing as the failure-recovery
mechanism (SIGMOD 2010)
HaLoop
• Implements recursion as an iteration of
map-reduce jobs
• Tries hard to ensure that each task in a
round runs at the node where its input was
created by the previous round.
• The jobs are not truly recursive, so no new
problem of dealing with failure arises.
Pregel
• Views all computation as a recursion on some
graph.
• Nodes send messages to one another bunched
into supersteps
• Checkpoints all tasks at intervals. If any task
fails, all tasks are rolled back to the previous
checkpoint.
• Does not scale perfectly (the more compute
nodes are involved, the more frequently we must
checkpoint).
Example: Shortest Paths Via Pregel
[Figure: Node N receives the message "I found a path from node M to you of length L" and asks itself "Is this the shortest path from M I know about?" If so, it sends along its outgoing arcs of lengths 5, 3, and 6 the messages "I found a path from node M to you of length L+5", "... of length L+3", and "... of length L+6".]
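The message pattern in the figure can be sketched as a loop over supersteps. This single-process simulation is only an illustration and does not use Pregel's actual API; the function name `pregel_shortest_paths`, the adjacency-list format, and the sample graph are assumptions, and arc lengths are assumed to be nonnegative.

```python
import math

def pregel_shortest_paths(arcs, source):
    """arcs maps node -> [(neighbor, length), ...]; returns distances from source."""
    nodes = set(arcs) | {n for outs in arcs.values() for n, _ in outs}
    best = {n: math.inf for n in nodes}
    inbox = {source: [0]}                   # messages delivered this superstep
    supersteps = 0
    while inbox:
        outbox = {}
        for node, lengths in inbox.items():
            shortest = min(lengths)
            # "Is this the shortest path from the source I know about?"
            if shortest < best[node]:
                best[node] = shortest
                # Tell each neighbor about the path through this node.
                for neighbor, length in arcs.get(node, []):
                    outbox.setdefault(neighbor, []).append(shortest + length)
        inbox = outbox                      # barrier between supersteps
        supersteps += 1
    return best, supersteps

arcs = {
    "M": [("N", 2)],
    "N": [("A", 5), ("B", 3), ("C", 6)],
    "B": [("A", 1)],
}
distances, supersteps = pregel_shortest_paths(arcs, "M")
```

A node forwards messages only when it learns of a strictly shorter path, so the computation stops once no distance improves.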
The endgame
(dealing with small files)
• In later rounds of a recursion, it is possible that
the number of new facts derived at a round
drops considerably.
• Recall that there is significant overhead involved
in transmitting files.
• Ideally, each recursive task has many tuples
for each of the other tasks whenever we
distribute data among the tasks.
• An approach: small number of rounds.
The polynomial fringe property
• It is highly unlikely that all Datalog programs can
be implemented in parallel in polylog time
• The division between those that can and those
that (almost surely) cannot was addressed in the
1980s, when the polynomial-fringe property
(hereafter PFP) was defined
• Programs with the PFP can be implemented in
parallel in polylog number of rounds
– Reduces to TC
Reachability Query
• R(X) <-- R(Y), e(X,Y)
• R(X) <-- Q(X)
• Can be computed in a logarithmic number of
rounds by computing the full transitive
closure.
• Theorem: Cannot be computed in
logarithmic number of rounds unless the
arity increases to 2.
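To see why the unary program is slow when evaluated as written, the sketch below applies the two rules round by round and counts the rounds that derive new facts. The function name `reachability_rounds` and the chain graph are assumptions for the example.

```python
def reachability_rounds(q, e):
    """Evaluate R(X) <-- Q(X) and R(X) <-- R(Y), e(X,Y) one round at a time."""
    preds = {}                          # y -> every x with e(x, y)
    for x, y in e:
        preds.setdefault(y, set()).add(x)

    reached = set(q)                    # facts from R(X) <-- Q(X)
    frontier = set(q)
    rounds = 0
    while True:
        # One application of R(X) <-- R(Y), e(X,Y) to the newest facts.
        frontier = {x for y in frontier for x in preds.get(y, ())} - reached
        if not frontier:
            break
        reached |= frontier
        rounds += 1
    return reached, rounds

e = [(i, i + 1) for i in range(8)]      # chain 0 -> 1 -> ... -> 8
reached, rounds = reachability_rounds({8}, e)
```

On a chain with eight arcs this takes eight rounds, one per arc, which is linear rather than logarithmic in the path length.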
Right linear chain programs
P(X,Y) :- Blue(X,Y)
P(X,Y) :- Blue(X,Z) & Q(Z,Y)
Q(X,Y) :- Red(X,Z) & P(Z,Y)
Theorem: can be executed in logarithmic
number of rounds keeping arity 2.
Thank you