Cluster Computing and Datalog
Recursion Via Map-Reduce
Seminaïve Evaluation
Re-engineering Map-Reduce for Recursion

Acknowledgements
• Joint work with Foto Afrati.
• Alkis Polyzotis and Vinayak Borkar contributed to the architecture discussions.

Implementing Datalog via Map-Reduce
• Joins are straightforward to implement as a round of map-reduce.
• Likewise, union/duplicate-elimination is a round of map-reduce.
• But implementing a recursion can thus take many rounds of map-reduce.

Seminaïve Evaluation
• A specific combination of joins and unions.
• Example: the chain rule q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z).
• Let r, s, t = "old" relations; r', s', t' = incremental relations.
• Simplification: assume |r'| = a|r|, etc.

A 3-Way Join Using Map-Reduce
• q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
• Use k compute nodes.
• Give X and Y "shares" that determine the reduce task that gets each tuple.
• The optimum strategy replicates r and t, but not s, using communication |s| + 2√(k|r||t|).

Seminaïve Evaluation – (2)
• Need to compute the sum (union) of seven terms (joins): rst' + rs't + r'st + rs't' + r'st' + r's't + r's't'.
• Obvious method for computing a round of seminaïve evaluation: replicate r and r'; replicate t and t'; do not replicate s or s'.
• Communication = (1+a)(|s| + 2√(k|r||t|)).

Seminaïve Evaluation – (3)
• There are many other ways we might use k nodes to do the same task.
• Example: one group of nodes does (r+r')s'(t+t'); a second group does r's(t+t'); a third group does rst'.
• Theorem: no grouping does better than the obvious method for this example.

Networks of Processes for Recursions
• Is it possible to do a recursion without multiple rounds of map-reduce and their associated communication cost?
• Note: tasks do not have to be Map or Reduce tasks; they can have other behaviors.

Example: Very Simple Recursion
• p(X,Y) :- e(X,Z) & p(Z,Y)
• p(X,Y) :- p0(X,Y)
• Use k compute nodes; hash Y-values to one of k buckets, h(Y).
• Each node gets a complete copy of e.
• p0 is distributed among the k nodes, with p0(x,y) going to node h(y).
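As a sanity check on this partitioning scheme, here is a minimal Python simulation of the k nodes evaluating the two rules above. All names (`local_recursion`, `e`, `p0`) are illustrative, not from any real system; the point is that because every derived tuple keeps the Y-value of the tuple it came from, each node reaches its fixpoint with no inter-node communication.

```python
def local_recursion(e, p0, k):
    """Simulate k nodes evaluating p(X,Y) :- e(X,Z) & p(Z,Y); p(X,Y) :- p0(X,Y).
    p-tuples are partitioned by h(Y); e is replicated to every node."""
    h = lambda y: hash(y) % k
    # Distribute p0: node h(y) gets p0(x, y).
    nodes = [set() for _ in range(k)]
    for x, y in p0:
        nodes[h(y)].add((x, y))
    # Each node iterates to a local fixpoint.  A derived tuple p(x, y)
    # inherits the Y-value of the p-tuple it was derived from, so it
    # hashes to the same node: no tuple ever needs to be shipped elsewhere.
    for n in range(k):
        p = nodes[n]
        while True:
            new = {(x, y) for (x, z) in e for (z2, y) in p if z == z2} - p
            if not new:
                break
            p |= new                  # duplicates eliminated locally by the set
    return set().union(*nodes)
```

For example, with e = {(1,2), (2,3)} and p0 = {(3,4)}, every node's partition is closed under the rule, and the union of the partitions is {(3,4), (2,4), (1,4)}.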
Example – Continued
• p(X,Y) :- e(X,Z) & p(Z,Y)
• Each node applies the recursive rule and generates new tuples p(x,y).
• Key point: since each new tuple has a Y-value that hashes to the same node, no communication is necessary.
• Duplicates are eliminated locally.

Harder Case of Recursion
• Consider the recursive rule p(X,Y) :- p(X,Z) & p(Z,Y).
• Responsibility is divided among compute nodes by hashing Z-values.
• Node n gets tuple p(a,b) if either h(a) = n or h(b) = n.

Compute Node for h(Z) = n
[Diagram: the node for h(Z) = n receives p(a,b) whenever h(a) = n or h(b) = n; it remembers all received tuples (eliminating duplicates), searches them for matches, and sends each produced tuple p(c,d) to the nodes for h(c) and h(d).]

Comparison with Iteration
• Advantage: lets us avoid some communication of data that would be needed in iterated map-reduce rounds.
• Disadvantage: tasks run longer, so they are more likely to fail.

Node Failures
• To cope with failures, map-reduce implementations rely on each task getting its input at the beginning, and on output not being consumed elsewhere until the task completes.
• But recursions can't work that way.
• What happens if a node fails after some of its output has been consumed?

Node Failures – (2)
• Actually, there is no problem!
• We restart the tasks of the failed node at another node.
• The replacement task will send some data that the failed task also sent.
• But each node remembers tuples to eliminate duplicates anyway.

Node Failures – (3)
• But the "no problem" conclusion depends heavily on the Datalog assumption that we are computing sets.
• The argument would fail if we were computing bags or aggregations of the tuples produced.
• Similar problems arise for other recursions, e.g., PDEs.

Extension of Map-Reduce Architecture for Recursion
• Necessarily, all tasks need to operate in rounds.
• The master controller learns of all input files that are part of the round-i input to task T and records that T has received these files.

Extension – (2)
• Suppose some task S fails, and it never supplies the round-(i+1) input to T.
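The behavior of the join nodes for p(X,Y) :- p(X,Z) & p(Z,Y), including the receiver-side duplicate elimination that also makes restarts harmless, can be simulated as below. This is a sketch with invented names (`nonlinear_tc`, `outbox`), not the actual architecture; it models node n as joining only on Z-values it is responsible for, then routing each result p(c,d) to the nodes for h(c) and h(d).

```python
def nonlinear_tc(p0, k):
    """Simulate the network of join nodes for p(X,Y) :- p(X,Z) & p(Z,Y).
    Node n stores p(a,b) if h(a) = n or h(b) = n, and joins only on
    Z-values that hash to n."""
    h = lambda v: hash(v) % k
    nodes = [set() for _ in range(k)]
    for a, b in p0:
        nodes[h(a)].add((a, b))
        nodes[h(b)].add((a, b))
    changed = True
    while changed:
        changed = False
        outbox = []                       # tuples produced this round
        for n in range(k):
            for (x, z) in nodes[n]:
                if h(z) != n:
                    continue              # node n is not responsible for this Z
                for (z2, y) in nodes[n]:
                    if z2 == z:
                        outbox.append((x, y))
        for c, d in outbox:               # route p(c,d) to h(c) and h(d)
            for n in (h(c), h(d)):
                if (c, d) not in nodes[n]:   # duplicate elimination at receiver
                    nodes[n].add((c, d))
                    changed = True
    return set().union(*nodes)
```

Because receivers discard tuples they have already seen, re-sending a failed task's output from a replacement task changes nothing, which is the "no problem" observation above.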
• A replacement S' for S is restarted at some other node.
• The master knows that T has received up to round i from S, so it ignores the first i output files from S'.

Extension – (3)
• The master knows where all the inputs ever received by S came from, so it can provide them to S'.

Checkpointing and State
• Another approach is to design tasks so that they can periodically write a state file, which is replicated elsewhere.
• Tasks take input + state. Initially, the state is empty.
• The master can restart a task from some state and feed it only the inputs received after that state was written.

Example: Checkpointing
• p(X,Y) :- p(X,Z) & p(Z,Y)
• Two groups of tasks:
  1. Join tasks: hash on Z, using h(Z). Like the tasks from the previous example.
  2. Eliminate-duplicates tasks: hash on X and Y, using h'(X,Y). Each receives tuples from the join tasks and distributes truly new tuples to the join tasks.

Example – (2)
[Diagram: a join task (state: p(x,y) if h(x) or h(y) is right) sends each p(a,b) to the dup-elim task for h'(a,b); a dup-elim task (state: p(x,y) if h'(x,y) is right) sends p(a,b), if new, to the join tasks for h(a) and h(b).]

Example – Details
• Each task writes "buffer" files locally, one for each of the tasks in the other rank.
• The two ranks of tasks are run on different racks of nodes, to minimize the probability that tasks in both ranks fail at the same time.

Example – Details – (2)
• Periodically, each task writes its state (the tuples received so far) incrementally and lets the master controller replicate it.
• Problem: the controller can't be too eager to pass output files along as input, or the files become tiny.

Future Research
• There is work to be done on optimization, using map-reduce or similar facilities, for restricted query languages such as Datalog, Datalog with negation, and Datalog + aggregation.
• Check out Hive and Pig, as well as work on multiway join optimization.

Future Research – (2)
• Almost everything is open about recursive Datalog implementation under map-reduce or similar systems.
• Seminaïve evaluation in the general case.
• Architectures for managing failures.
• Clustera and Hyrax are interesting examples of (nonrecursive) extensions of map-reduce.
• When can we avoid communication, as with p(X,Y) :- e(X,Z) & p(Z,Y)?
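As a baseline for the open question of seminaïve evaluation in the general case, here is the idea in its simplest single-node form for p(X,Y) :- p(X,Z) & p(Z,Y) (an illustrative sketch; the delta-join decomposition is the standard one, but the names are mine). Each round joins only the newly derived tuples (the delta) against the full relation, instead of re-joining everything.

```python
def seminaive_tc(p0):
    """Single-node seminaïve evaluation of p(X,Y) :- p(X,Z) & p(Z,Y)."""
    p, delta = set(p0), set(p0)
    while delta:
        # delta ⋈ p together with p ⋈ delta covers every join in which
        # at least one operand is a newly derived tuple.
        derived = {(x, y) for (x, z) in delta for (z2, y) in p if z == z2}
        derived |= {(x, y) for (x, z) in p for (z2, y) in delta if z == z2}
        delta = derived - p      # keep only the truly new tuples
        p |= delta
    return p
```

Distributing exactly this delta discipline across the task networks above, for general rules, is the open problem the slide refers to.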