Ullman

Cluster Computing and Datalog

Recursion Via Map-Reduce
Seminaïve Evaluation
Re-engineering Map-Reduce for Recursion
Acknowledgements
Joint work with Foto Afrati.
Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.
Implementing Datalog via
Map-Reduce
Joins are straightforward to implement
as a round of map-reduce.
Likewise, union/duplicate-elimination is
a round of map-reduce.
But a recursion implemented this way can take many rounds of map-reduce.
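As a concrete illustration, one such join round can be simulated in a few lines (a minimal single-process sketch; a real map-reduce job would spread the key groups over many reduce tasks, and `mapreduce_join` is an illustrative name):

```python
from collections import defaultdict

def mapreduce_join(r, s):
    """One map-reduce round computing {(w, y) | (w, x) in r, (x, y) in s}.
    Map: key each tuple by its join attribute X, tagged by relation.
    Shuffle: group by key (a dict stands in for routing to reduce tasks).
    Reduce: for each X value, pair every r-tuple with every s-tuple."""
    groups = defaultdict(lambda: ([], []))
    for w, x in r:
        groups[x][0].append(w)          # map output: (x, ('r', w))
    for x, y in s:
        groups[x][1].append(y)          # map output: (x, ('s', y))
    out = set()
    for x, (ws, ys) in groups.items():  # each key group = one reduce call
        out |= {(w, y) for w in ws for y in ys}
    return out

r = {(1, 2), (3, 2), (4, 5)}
s = {(2, 6), (5, 7)}
print(sorted(mapreduce_join(r, s)))     # [(1, 6), (3, 6), (4, 7)]
```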
Seminaïve Evaluation
Specific combination of joins and
unions.
Example: chain rule
q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
Let r, s, t = “old” relations; r’, s’, t’ =
incremental relations.
Simplification: assume |r’| = a|r|, etc.
A 3-Way Join Using Map-Reduce
q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
Use k compute nodes.
Give X and Y shares to determine the
reduce-task that gets each tuple.
Optimum strategy replicates r and t, not s, using communication
|s| + 2√(k|r||t|).
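A small sketch checks that bound (the relation sizes are made-up illustrations, and `chain_join_cost` is an illustrative name): with share x_share for X and y_share = k/x_share for Y, s is hashed on both attributes and sent once, r is replicated y_share times, and t is replicated x_share times.

```python
from math import isclose, sqrt

def chain_join_cost(size_r, size_s, size_t, k, x_share):
    """Communication of the 3-way join r(W,X) & s(X,Y) & t(Y,Z) on k
    reduce tasks in an x_share-by-y_share grid: s sent once, r sent to
    all y_share tasks for its X bucket, t to all x_share tasks for its
    Y bucket."""
    y_share = k / x_share
    return size_s + y_share * size_r + x_share * size_t

# Minimizing over the shares gives |s| + 2*sqrt(k*|r|*|t|):
size_r, size_s, size_t, k = 1_000, 50_000, 4_000, 100
best_x = sqrt(k * size_r / size_t)            # optimal share for X
optimum = size_s + 2 * sqrt(k * size_r * size_t)
assert isclose(chain_join_cost(size_r, size_s, size_t, k, best_x), optimum)
```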
Seminaïve Evaluation – (2)
Need to compute sum (union) of seven
terms (joins):
rst’+rs’t+r’st+rs’t’+r’st’+r’s’t+r’s’t’
Obvious method for computing a
round of seminaïve evaluation:
 Replicate r and r’; replicate t and t’; do not
replicate s or s’.
 Communication = (1+a)(|s| + 2k|r||t|)
6
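The seven-term sum can be computed directly from its definition (a toy in-memory sketch; `join3` and `delta_q` are illustrative names):

```python
def join3(r, s, t):
    """Chain join q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)."""
    return {(w, z) for (w, x) in r for (x2, y) in s if x == x2
                   for (y2, z) in t if y == y2}

def delta_q(r, dr, s, ds, t, dt):
    """Union of the seven join terms that use at least one incremental
    relation (r s t' + r s' t + ... + r' s' t'); the all-old term
    r s t is skipped."""
    new = set()
    for rr in (r, dr):
        for ss in (s, ds):
            for tt in (t, dt):
                if rr is r and ss is s and tt is t:
                    continue                  # skip r s t
                new |= join3(rr, ss, tt)
    return new

r, s, t = {(0, 1)}, {(1, 2)}, {(2, 3)}
dr, ds, dt = {(9, 1)}, set(), {(2, 8)}
assert delta_q(r, dr, s, ds, t, dt) == {(0, 8), (9, 3), (9, 8)}
```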
Seminaïve Evaluation – (3)
There are many other ways we might
use k nodes to do the same task.
Example: one group of nodes does
(r+r’)s’(t+t’); a second group does
r’s(t+t’); the third group does rst’.
Theorem: no grouping does better than
the obvious method for this example.
Networks of Processes for
Recursions
Is it possible to do a recursion without
multiple rounds of map-reduce and
their associated communication cost?
Note: tasks do not have to be Map or
Reduce tasks; they can have other
behaviors.
Example: Very Simple Recursion
p(X,Y) :- e(X,Z) & p(Z,Y);
p(X,Y) :- p0(X,Y);
Use k compute nodes.
Hash Y-values to one of k buckets h(Y).
Each node gets a complete copy of e.
p0 is distributed among the k nodes,
with p0(x,y) going to node h(y).
Example – Continued
p(X,Y) :- e(X,Z) & p(Z,Y)
Each node applies the recursive rule
and generates new tuples p(x,y).
Key point: since new tuples have a Y-value that hashes to the same node, no communication is necessary.
Duplicates are eliminated locally.
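A minimal sketch of the scheme, simulating the k nodes as k local sets (the function name is illustrative):

```python
def transitive_closure_partitioned(e, p0, k):
    """p(X,Y) :- e(X,Z) & p(Z,Y);  p(X,Y) :- p0(X,Y).
    Every node holds a complete copy of e; node h(y) holds exactly the
    p-tuples whose second component is y.  The recursive rule keeps the
    Y-value, so each node reaches its fixpoint with no cross-node
    communication (nodes simulated as a list of sets)."""
    h = lambda y: hash(y) % k
    nodes = [set() for _ in range(k)]
    for (x, y) in p0:
        nodes[h(y)].add((x, y))              # distribute p0 by h(Y)
    for node in nodes:                       # each node runs independently
        while True:
            new = {(x, y) for (x, z) in e for (z2, y) in node if z == z2}
            if new <= node:                  # local fixpoint reached
                break
            node |= new                      # duplicates eliminated locally
    return set().union(*nodes)

e = {(1, 2), (2, 3)}
p0 = {(3, 10), (2, 20)}
result = transitive_closure_partitioned(e, p0, 4)
assert result == {(3, 10), (2, 20), (2, 10), (1, 20), (1, 10)}
```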
Harder Case of Recursion
Consider a recursive rule
p(X,Y) :- p(X,Z) & p(Z,Y)
Responsibility divided among compute
nodes by hashing Z-values.
Node n gets tuple p(a,b) if either
h(a) = n or h(b) = n.
Compute Node for h(Z) = n
[Diagram] The node for h(Z) = n receives p(a,b) whenever h(a) = n or h(b) = n. It remembers all received tuples (eliminating duplicates), searches them for matches, and sends each produced tuple p(c,d) to the nodes for h(c) and h(d).
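That node behavior can be simulated with a queue standing in for the network (a single-process sketch; `nonlinear_tc` is an illustrative name):

```python
from collections import deque

def nonlinear_tc(p0, k):
    """p(X,Y) :- p(X,Z) & p(Z,Y), with work split by hashing Z.
    Node n stores p(a,b) if h(a) = n or h(b) = n; every produced tuple
    p(c,d) is sent to the nodes for h(c) and h(d), and the remembered
    tuples double as duplicate elimination."""
    h = lambda v: hash(v) % k
    store = [set() for _ in range(k)]
    queue = deque(p0)                      # tuples in flight
    while queue:
        a, b = queue.popleft()
        for n in {h(a), h(b)}:             # deliver to one or two nodes
            if (a, b) in store[n]:
                continue                   # duplicate: drop it here
            produced = set()
            if h(b) == n:                  # n is responsible for Z = b
                produced |= {(a, d) for (c, d) in store[n] if c == b}
            if h(a) == n:                  # n is responsible for Z = a
                produced |= {(c, b) for (c, d) in store[n] if d == a}
            store[n].add((a, b))
            queue.extend(produced)         # route to h(c) and h(d)
    return set().union(*store)

assert nonlinear_tc({(1, 2), (2, 3), (3, 4)}, 3) == \
    {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)}
```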
Comparison with Iteration
Advantage: Lets us avoid some
communication of data that would be
needed in iterated map-reduce rounds.
Disadvantage: tasks run longer and are more likely to fail.
Node Failures
To cope with failures, map-reduce
implementations rely on each task
getting its input at the beginning, and
on output not being consumed
elsewhere until the task completes.
But recursions can’t work that way.
What happens if a node fails after some
of its output has been consumed?
Node Failures – (2)
Actually, there is no problem!
We restart the tasks of the failed node
at another node.
The replacement task will send some
data that the failed task also sent.
But each node remembers tuples to
eliminate duplicates anyway.
Node Failures – (3)
But the “no problem” conclusion is
highly dependent on the Datalog
assumption that it is computing sets.
Argument would fail if we were
computing bags or aggregations of the
tuples produced.
Similar problems arise for other recursions, e.g., PDEs.
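A tiny sketch of the contrast: a receiver that remembers tuples (set semantics) is unaffected by a replacement task resending data, while a counting receiver (bag/aggregate semantics) is corrupted by the very same resend.

```python
stream = [(1, 2), (2, 3)]
replayed = stream + stream[:1]       # restarted task resends (1, 2)

seen = set()
for t in replayed:
    seen.add(t)                      # duplicate elimination absorbs resend
assert len(seen) == 2                # same answer as without the failure

count = 0
for t in replayed:
    count += 1                       # COUNT(*) sees the resent tuple twice
assert count == 3                    # wrong: should be 2
```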
Extension of Map-Reduce
Architecture for Recursion
Necessarily, all tasks need to operate in
rounds.
The master controller learns of all input
files that are part of the round-i input
to task T and records that T has
received these files.
Extension – (2)
Suppose some task S fails, and it never supplies the round-(i+1) input to T.
A replacement S’ for S is restarted at
some other node.
The master knows that T has received
up to round i from S, so it ignores the
first i output files from S’.
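A sketch of the master's bookkeeping under these assumptions (class and method names are illustrative): per (producer, consumer) pair, it records how many rounds the consumer has received, so replayed files from a replacement task are ignored.

```python
class Master:
    """Tracks, per (producer, consumer) pair, the highest round of
    input the consumer has received, and drops replayed files."""
    def __init__(self):
        self.delivered = {}                  # (producer, consumer) -> round

    def deliver(self, producer, consumer, round_no, file):
        got = self.delivered.get((producer, consumer), 0)
        if round_no <= got:
            return None                      # replayed file: ignore it
        self.delivered[(producer, consumer)] = round_no
        return file                          # pass the file to the consumer

m = Master()
assert m.deliver("S", "T", 1, "out-1") == "out-1"
assert m.deliver("S", "T", 2, "out-2") == "out-2"
# S fails after round 2; its replacement replays output as task "S":
assert m.deliver("S", "T", 1, "out-1") is None    # first 2 files ignored
assert m.deliver("S", "T", 2, "out-2") is None
assert m.deliver("S", "T", 3, "out-3") == "out-3" # new round gets through
```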
Extension – (3)
The master knows where all the inputs ever received by S came from, so it can provide those inputs to S'.
Checkpointing and State
Another approach is to design tasks so
that they can periodically write a state
file, which is replicated elsewhere.
Tasks take input + state.
 Initially, state is empty.
Master can restart a task from some
state and feed it only inputs received
after that state was written.
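A minimal sketch of such a task, whose state is the set of tuples received so far (names are illustrative):

```python
class CheckpointedTask:
    """A task that can snapshot its state; after a failure, the master
    restarts it from the last snapshot and refeeds only the inputs
    that arrived after the snapshot was written."""
    def __init__(self, state=None):
        self.state = set(state) if state else set()   # initially empty

    def consume(self, batch):
        self.state |= set(batch)

    def checkpoint(self):
        return frozenset(self.state)     # replicated elsewhere by master

task = CheckpointedTask()
task.consume([(1, 2)])
snap = task.checkpoint()
task.consume([(2, 3)])                   # ...and then the node fails

restarted = CheckpointedTask(snap)       # restart from the snapshot
restarted.consume([(2, 3)])              # replay only post-snapshot input
assert restarted.state == {(1, 2), (2, 3)}
```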
Example: Checkpointing
p(X,Y) :- p(X,Z) & p(Z,Y)
 Two groups of tasks:
1. Join tasks: hash on Z, using h(Z).
 Like tasks from previous example.
2. Eliminate-duplicates tasks: hash on X and
Y, using h’(X,Y).
 Receives tuples from join tasks.
 Distributes truly new tuples to join tasks.
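A single-process sketch of the two ranks (h and h' are simulated with Python's hash; all names are illustrative): join tasks pair tuples sharing a Z-value, and dup-elim tasks forward only truly new tuples back.

```python
def two_rank_round(join_state, dup_state, frontier, h, h2):
    """One exchange for p(X,Y) :- p(X,Z) & p(Z,Y): join tasks (bucketed
    by h on either component) produce candidates; each candidate goes
    to dup-elim task h2(x, y), which keeps it only if it is new."""
    candidates = set()
    for (a, b) in frontier:
        bucket = join_state[h(b)]          # task responsible for Z = b
        candidates |= {(a, d) for (c, d) in bucket if c == b}
        bucket = join_state[h(a)]          # task responsible for Z = a
        candidates |= {(c, b) for (c, d) in bucket if d == a}
        join_state[h(a)].add((a, b))
        join_state[h(b)].add((a, b))
    truly_new = set()
    for t in candidates:                   # dup-elim rank
        seen = dup_state[h2(*t)]
        if t not in seen:
            seen.add(t)
            truly_new.add(t)
    return truly_new                       # distributed back to join tasks

k = 3
h = lambda v: hash(v) % k
h2 = lambda x, y: hash((x, y)) % k
join_state = [set() for _ in range(k)]
dup_state = [set() for _ in range(k)]
frontier = {(1, 2), (2, 3), (3, 4)}
result = set(frontier)
for t in frontier:                         # seed dup-elim with p0
    dup_state[h2(*t)].add(t)
while frontier:
    frontier = two_rank_round(join_state, dup_state, frontier, h, h2)
    result |= frontier
assert result == {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)}
```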
Example – (2)
[Diagram] Each join task (state holds p(x,y) if h(x) or h(y) is its bucket) sends every produced tuple p(a,b) to the dup-elim task for h'(a,b); each dup-elim task (state holds p(x,y) if h'(x,y) is its bucket) forwards truly new tuples back to the join tasks for h(a) and h(b).
Example – Details
Each task writes “buffer” files locally,
one for each of the tasks in the other
rank.
The two ranks of tasks are run on
different racks of nodes, to minimize
the probability that tasks in both ranks
will fail at the same time.
Example – Details – (2)
Periodically, each task writes its state
(tuples received so far) incrementally
and lets the master controller replicate
it.
Problem: the controller can’t be too
eager to pass output files to their input,
or files become tiny.
24
Future Research
There is work to be done on
optimization, using map-reduce or
similar facilities, for restricted SQL such
as Datalog, Datalog–, Datalog +
aggregation.
 Check out Hive, PIG, as well as work on
multiway join optimization.
Future Research – (2)
Almost everything is open about
recursive Datalog implementation under
map-reduce or similar systems.
 Seminaïve evaluation in general case.
 Architectures for managing failures.
• Clustera and Hyrax are interesting examples of
(nonrecursive) extension of map-reduce.
 When can we avoid communication as with
p(X,Y) :- e(X,Z) & p(Z,Y)?