Cluster Computing and Datalog
Recursion Via Map-Reduce
Seminaïve Evaluation
Re-engineering Map-Reduce for Recursion
Acknowledgements
Joint work with Foto Afrati
Alkis Polyzotis and Vinayak Borkar
contributed to the architecture
discussions.
Implementing Datalog via Map-Reduce
Joins are straightforward to implement
as a round of map-reduce.
Likewise, union/duplicate-elimination is
a round of map-reduce.
But implementing a recursion can therefore take many rounds of map-reduce.
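A single join round can be sketched as follows (a minimal in-memory simulation; the function name mapreduce_join is illustrative, and a real cluster would shuffle the keyed tuples between map and reduce tasks):

```python
from collections import defaultdict

def mapreduce_join(r, s):
    """One round of map-reduce joining r(W,X) with s(X,Y) on X.
    Map: emit each tuple keyed by its X-value.  Reduce: for each
    X, pair every r-tuple with every s-tuple.  (In-memory
    simulation of the shuffle.)"""
    buckets = defaultdict(lambda: ([], []))
    for w, x in r:                       # map phase for r
        buckets[x][0].append(w)
    for x, y in s:                       # map phase for s
        buckets[x][1].append(y)
    joined = set()
    for x, (ws, ys) in buckets.items():  # reduce phase, one key at a time
        joined |= {(w, y) for w in ws for y in ys}
    return joined
```

Union with duplicate elimination is the same pattern with an identity map and a reduce that emits each key once.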
Seminaïve Evaluation
Specific combination of joins and
unions.
Example: chain rule
q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
Let r, s, t = “old” relations; r’, s’, t’ =
incremental relations.
Simplification: assume |r’| = a|r|, etc.
A 3-Way Join Using Map-Reduce
q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
Use k compute nodes.
Give X and Y shares to determine the
reduce-task that gets each tuple.
Optimum strategy replicates r and t,
not s, using communication
|s| + 2√(k|r||t|).
Seminaïve Evaluation – (2)
Need to compute sum (union) of seven
terms (joins):
rst’ + rs’t + r’st + rs’t’ + r’st’ + r’s’t + r’s’t’
Obvious method for computing a
round of seminaïve evaluation:
Replicate r and r’; replicate t and t’; do not
replicate s or s’.
Communication = (1+a)(|s| + 2√(k|r||t|))
Seminaïve Evaluation – (3)
There are many other ways we might
use k nodes to do the same task.
Example: one group of nodes does
(r+r’)s’(t+t’); a second group does
r’s(t+t’); a third group does rst’.
Theorem: no grouping does better than
the obvious method for this example.
Networks of Processes for Recursions
Is it possible to do a recursion without
multiple rounds of map-reduce and
their associated communication cost?
Note: tasks do not have to be Map or
Reduce tasks; they can have other
behaviors.
Example: Very Simple Recursion
p(X,Y) :- e(X,Z) & p(Z,Y);
p(X,Y) :- p0(X,Y);
Use k compute nodes.
Hash Y-values to one of k buckets h(Y).
Each node gets a complete copy of e.
p0 is distributed among the k nodes,
with p0(x,y) going to node h(y).
Example – Continued
p(X,Y) :- e(X,Z) & p(Z,Y)
Each node applies the recursive rule
and generates new tuples p(x,y).
Key point: since new tuples have a Y-value
that hashes to the same node, no
communication is necessary.
Duplicates are eliminated locally.
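A minimal sketch of this scheme, assuming a toy hash h(y) = y mod k (the function name recursive_nodes is illustrative, and the k nodes are simulated in one process):

```python
from collections import defaultdict

def recursive_nodes(e, p0, k):
    """Evaluate p(X,Y) :- p0(X,Y);  p(X,Y) :- e(X,Z) & p(Z,Y)
    on k simulated compute nodes.  Node h(y) holds every tuple
    p(x,y); the recursive rule never changes the Y-value, so each
    node runs to its own fixpoint with no cross-node traffic."""
    h = lambda y: y % k                  # toy hash: value mod k
    nodes = defaultdict(set)
    for x, y in p0:                      # distribute p0 by h(Y)
        nodes[h(y)].add((x, y))
    for n in list(nodes):                # each node works independently
        p = nodes[n]                     # every node has all of e
        while True:
            new = {(x, y) for x, z in e for z2, y in p if z == z2} - p
            if not new:                  # local fixpoint reached
                break
            p |= new                     # new tuples stay on this node
    return set().union(*nodes.values()) if nodes else set()
```

Note that subtracting p before adding implements the local duplicate elimination the slide describes.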
Harder Case of Recursion
Consider a recursive rule
p(X,Y) :- p(X,Z) & p(Z,Y)
Responsibility divided among compute
nodes by hashing Z-values.
Node n gets tuple p(a,b) if either
h(a) = n or h(b) = n.
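This division of responsibility can be sketched as follows (assuming a toy hash h(v) = v mod k; the helper names owners and node_round are illustrative):

```python
def owners(tpl, k):
    """Nodes that must store p(a,b) when work is split by hashing
    Z-values: node h(a), where the tuple can serve as p(Z,Y), and
    node h(b), where it can serve as p(X,Z)."""
    a, b = tpl
    h = lambda v: v % k                  # toy hash: value mod k
    return {h(a), h(b)}

def node_round(n, stored, k):
    """One round at node n: join stored tuples p(x,z) and p(z,y)
    whose shared value z hashes to n, and report each new tuple
    together with the set of nodes it must be sent to."""
    h = lambda v: v % k
    produced = {(x, y)
                for x, z in stored
                for z2, y in stored
                if z == z2 and h(z) == n}  # this node owns join value z
    return {t: owners(t, k) for t in produced if t not in stored}
```

Unlike the right-linear rule, a produced tuple here may hash to other nodes, so some communication is unavoidable.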
Compute Node for h(Z) = n
[Diagram: the node for h(Z) = n receives each tuple p(a,b) with h(a) = n or h(b) = n, searches for matches, and remembers all received tuples so it can eliminate duplicates; each produced tuple p(c,d) is sent to the nodes for h(c) and h(d).]
Comparison with Iteration
Advantage: Lets us avoid some
communication of data that would be
needed in iterated map-reduce rounds.
Disadvantage: Tasks run longer, more
likely to fail.
Node Failures
To cope with failures, map-reduce
implementations rely on each task
getting its input at the beginning, and
on output not being consumed
elsewhere until the task completes.
But recursions can’t work that way.
What happens if a node fails after some
of its output has been consumed?
Node Failures – (2)
Actually, there is no problem!
We restart the tasks of the failed node
at another node.
The replacement task will send some
data that the failed task also sent.
But each node remembers tuples to
eliminate duplicates anyway.
Node Failures – (3)
But the “no problem” conclusion is
highly dependent on the Datalog
assumption that it is computing sets.
The argument would fail if we were
computing bags or aggregations of the
tuples produced.
Similar problems arise for other
recursions, e.g., PDEs.
Extension of Map-Reduce Architecture for Recursion
Necessarily, all tasks need to operate in
rounds.
The master controller learns of all input
files that are part of the round-i input
to task T and records that T has
received these files.
Extension – (2)
Suppose some task S fails, and it never
supplies the round-(i+1) input to T.
A replacement S’ for S is restarted at
some other node.
The master knows that T has received
up to round i from S, so it ignores the
first i output files from S’.
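The master's bookkeeping for this step is simple (a sketch; the helper name files_to_forward is hypothetical):

```python
def files_to_forward(rounds_received, replacement_outputs):
    """Master-side rule: T already received rounds 0..i-1 from the
    failed task S, so of the replacement S''s regenerated output
    files, only those from round i onward are forwarded to T; the
    first i files are ignored as duplicates."""
    i = rounds_received
    return replacement_outputs[i:]
```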
Extension – (3)
The master knows where all the inputs
ever received by S came from, so it can
provide them to S’.
Checkpointing and State
Another approach is to design tasks so
that they can periodically write a state
file, which is replicated elsewhere.
Tasks take input + state.
Initially, state is empty.
Master can restart a task from some
state and feed it only inputs received
after that state was written.
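A toy model of restart-from-state (the class name Task is illustrative; a real system would write the state file to replicated storage):

```python
class Task:
    """Toy task whose state is the set of tuples received so far;
    a checkpoint is simply a replicated copy of that set."""
    def __init__(self, state=None):
        self.state = set(state or ())    # empty, or a checkpoint

    def feed(self, tuples):
        new = set(tuples) - self.state   # duplicates are ignored
        self.state |= new
        return new                       # output only truly new tuples

task = Task()                            # initially, state is empty
task.feed({(1, 2)})
checkpoint = set(task.state)             # state written and replicated
later = [{(2, 3)}, {(1, 2)}]             # inputs after the checkpoint
for batch in later:
    task.feed(batch)

# After a failure, the master restarts the task from the last
# checkpoint and replays only the inputs received after it.
replacement = Task(checkpoint)
for batch in later:
    replacement.feed(batch)
```

The replacement ends in the same state as the failed task without replaying inputs from before the checkpoint.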
Example: Checkpointing
p(X,Y) :- p(X,Z) & p(Z,Y)
Two groups of tasks:
1. Join tasks: hash on Z, using h(Z).
Like tasks from previous example.
2. Eliminate-duplicates tasks: hash on X and
Y, using h’(X,Y).
Receives tuples from join tasks.
Distributes truly new tuples to join tasks.
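The two ranks can be sketched together as one loop (a simulation under assumed toy hashes h(v) = v mod k and h’(x,y) = (x+y) mod k, written h2 in the code; the function name transitive_closure is illustrative):

```python
def transitive_closure(p0, k):
    """Nonlinear TC p(X,Y) :- p(X,Z) & p(Z,Y) with two task ranks:
    join tasks keyed by h(Z), dup-elim tasks keyed by h2(X,Y).
    Only the dup-elim state decides whether a tuple is new."""
    h = lambda v: v % k
    h2 = lambda x, y: (x + y) % k
    join_state = [set() for _ in range(k)]   # one state per join task
    seen = [set() for _ in range(k)]         # one state per dup-elim task

    def to_join(t):                          # route a tuple to h(a), h(b)
        a, b = t
        for n in {h(a), h(b)}:
            join_state[n].add(t)

    frontier = set(p0)
    while frontier:
        for t in frontier:                   # dup-elim rank: record, forward
            seen[h2(*t)].add(t)
            to_join(t)
        candidates = set()
        for n in range(k):                   # join rank: match on local Z
            st = join_state[n]
            candidates |= {(x, y) for x, z in st for z2, y in st
                           if z == z2 and h(z) == n}
        frontier = {t for t in candidates if t not in seen[h2(*t)]}
    return set().union(*seen)
```

Splitting duplicate elimination from joining is what lets each tuple be remembered at exactly one dup-elim task, even though it may live at two join tasks.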
Example – (2)
[Diagram: each join task, whose state holds p(x,y) if h(x) or h(y) is right, sends every tuple p(a,b) it produces to the dup-elim task for h’(a,b); each dup-elim task, whose state holds p(x,y) if h’(x,y) is right, forwards p(a,b) to the join tasks for h(a) and h(b) only if the tuple is new.]
Example – Details
Each task writes “buffer” files locally,
one for each of the tasks in the other
rank.
The two ranks of tasks are run on
different racks of nodes, to minimize
the probability that tasks in both ranks
will fail at the same time.
Example – Details – (2)
Periodically, each task writes its state
(tuples received so far) incrementally
and lets the master controller replicate
it.
Problem: the master controller can’t be
too eager to pass output files along as
input, or the files become tiny.
Future Research
There is work to be done on
optimization, using map-reduce or
similar facilities, for restricted languages
such as SQL, Datalog, Datalog–, and
Datalog + aggregation.
Check out Hive and Pig, as well as work
on multiway-join optimization.
Future Research – (2)
Almost everything is open about
recursive Datalog implementation under
map-reduce or similar systems.
Seminaïve evaluation in general case.
Architectures for managing failures.
• Clustera and Hyrax are interesting examples of
(nonrecursive) extension of map-reduce.
When can we avoid communication as with
p(X,Y) :- e(X,Z) & p(Z,Y)?