Cluster Computing and Datalog
Recursion Via Map-Reduce
Seminaïve Evaluation
Re-engineering Map-Reduce for Recursion
Acknowledgements
Joint work with Foto Afrati
Alkis Polyzotis and Vinayak Borkar
contributed to the architecture
discussions.
Implementing Datalog via Map-Reduce
Joins are straightforward to implement
as a round of map-reduce.
Likewise, union/duplicate-elimination is
a round of map-reduce.
But implementing a recursion can therefore take many rounds of map-reduce.
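A single join round can be sketched as follows (a minimal in-memory simulation; the function name mapreduce_join is illustrative, and a real cluster would shuffle the keyed tuples between map and reduce tasks):

```python
from collections import defaultdict

def mapreduce_join(r, s):
    """One round of map-reduce joining r(W,X) with s(X,Y) on X.
    Map: emit each tuple keyed by its X-value.  Reduce: for each
    X, pair every r-tuple with every s-tuple.  (In-memory
    simulation of the shuffle.)"""
    buckets = defaultdict(lambda: ([], []))
    for w, x in r:                       # map phase for r
        buckets[x][0].append(w)
    for x, y in s:                       # map phase for s
        buckets[x][1].append(y)
    joined = set()
    for x, (ws, ys) in buckets.items():  # reduce phase, one key at a time
        joined |= {(w, y) for w in ws for y in ys}
    return joined
```

Union with duplicate elimination is the same pattern with an identity map and a reduce that emits each key once.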
Seminaïve Evaluation
Specific combination of joins and
unions.
Example: chain rule
q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
Let r, s, t = “old” relations; r’, s’, t’ =
incremental relations.
Simplification: assume |r’| = a|r|, etc.
A 3-Way Join Using Map-Reduce
q(W,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
Use k compute nodes.
Give X and Y shares to determine the
reduce-task that gets each tuple.
Optimum strategy replicates r and t,
not s, using communication
|s| + 2√(k|r||t|).
Seminaïve Evaluation – (2)
Need to compute sum (union) of seven
terms (joins):
rst’ + rs’t + r’st + rs’t’ + r’st’ + r’s’t + r’s’t’
Obvious method for computing a
round of seminaïve evaluation:
Replicate r and r’; replicate t and t’; do not
replicate s or s’.
Communication = (1+a)(|s| + 2√(k|r||t|))
Seminaïve Evaluation – (3)
There are many other ways we might
use k nodes to do the same task.
Example: one group of nodes does
(r+r’)s’(t+t’); a second group does
r’s(t+t’); a third group does rst’.
Theorem: no grouping does better than
the obvious method for this example.
Networks of Processes for Recursions
Is it possible to do a recursion without
multiple rounds of map-reduce and
their associated communication cost?
Note: tasks do not have to be Map or
Reduce tasks; they can have other
behaviors.
Example: Very Simple Recursion
p(X,Y) :- e(X,Z) & p(Z,Y);
p(X,Y) :- p0(X,Y);
Use k compute nodes.
Hash Y-values to one of k buckets h(Y).
Each node gets a complete copy of e.
p0 is distributed among the k nodes,
with p0(x,y) going to node h(y).
Example – Continued
p(X,Y) :- e(X,Z) & p(Z,Y)
Each node applies the recursive rule
and generates new tuples p(x,y).
Key point: since new tuples have a Y-value
that hashes to the same node, no
communication is necessary.
Duplicates are eliminated locally.
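A minimal sketch of this scheme, assuming a toy hash h(y) = y mod k (the function name recursive_nodes is illustrative, and the k nodes are simulated in one process):

```python
from collections import defaultdict

def recursive_nodes(e, p0, k):
    """Evaluate p(X,Y) :- p0(X,Y);  p(X,Y) :- e(X,Z) & p(Z,Y)
    on k simulated compute nodes.  Node h(y) holds every tuple
    p(x,y); the recursive rule never changes the Y-value, so each
    node runs to its own fixpoint with no cross-node traffic."""
    h = lambda y: y % k                  # toy hash: value mod k
    nodes = defaultdict(set)
    for x, y in p0:                      # distribute p0 by h(Y)
        nodes[h(y)].add((x, y))
    for n in list(nodes):                # each node works independently
        p = nodes[n]                     # every node has all of e
        while True:
            new = {(x, y) for x, z in e for z2, y in p if z == z2} - p
            if not new:                  # local fixpoint reached
                break
            p |= new                     # new tuples stay on this node
    return set().union(*nodes.values()) if nodes else set()
```

Note that subtracting p before adding implements the local duplicate elimination the slide describes.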
Harder Case of Recursion
Consider a recursive rule
p(X,Y) :- p(X,Z) & p(Z,Y)
Responsibility divided among compute
nodes by hashing Z-values.
Node n gets tuple p(a,b) if either
h(a) = n or h(b) = n.
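This division of responsibility can be sketched as follows (assuming a toy hash h(v) = v mod k; the helper names owners and node_round are illustrative):

```python
def owners(tpl, k):
    """Nodes that must store p(a,b) when work is split by hashing
    Z-values: node h(a), where the tuple can serve as p(Z,Y), and
    node h(b), where it can serve as p(X,Z)."""
    a, b = tpl
    h = lambda v: v % k                  # toy hash: value mod k
    return {h(a), h(b)}

def node_round(n, stored, k):
    """One round at node n: join stored tuples p(x,z) and p(z,y)
    whose shared value z hashes to n, and report each new tuple
    together with the set of nodes it must be sent to."""
    h = lambda v: v % k
    produced = {(x, y)
                for x, z in stored
                for z2, y in stored
                if z == z2 and h(z) == n}  # this node owns join value z
    return {t: owners(t, k) for t in produced if t not in stored}
```

Unlike the right-linear rule, a produced tuple here may hash to other nodes, so some communication is unavoidable.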
Compute Node for h(Z) = n
[Diagram: the node for h(Z) = n receives each tuple p(a,b) with h(a) = n or h(b) = n, searches for matches, and remembers all received tuples so it can eliminate duplicates; each produced tuple p(c,d) is sent to the nodes for h(c) and h(d).]
Comparison with Iteration
Advantage: Lets us avoid some
communication of data that would be
needed in iterated map-reduce rounds.
Disadvantage: Tasks run longer, more
likely to fail.
Node Failures
To cope with failures, map-reduce
implementations rely on each task
getting its input at the beginning, and
on output not being consumed
elsewhere until the task completes.
But recursions can’t work that way.
What happens if a node fails after some
of its output has been consumed?
Node Failures – (2)
Actually, there is no problem!
We restart the tasks of the failed node
at another node.
The replacement task will send some
data that the failed task also sent.
But each node remembers tuples to
eliminate duplicates anyway.
Node Failures – (3)
But the “no problem” conclusion is
highly dependent on the Datalog
assumption that it is computing sets.
The argument would fail if we were
computing bags or aggregations of the
tuples produced.
Similar problems arise for other
recursions, e.g., PDEs.
Extension of Map-Reduce Architecture for Recursion
Necessarily, all tasks need to operate in
rounds.
The master controller learns of all input
files that are part of the round-i input
to task T and records that T has
received these files.
Extension – (2)
Suppose some task S fails, and it never
supplies the round-(i+1) input to T.
A replacement S’ for S is restarted at
some other node.
The master knows that T has received
up to round i from S, so it ignores the
first i output files from S’.
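The master's bookkeeping for this step is simple (a sketch; the helper name files_to_forward is hypothetical):

```python
def files_to_forward(rounds_received, replacement_outputs):
    """Master-side rule: T already received rounds 0..i-1 from the
    failed task S, so of the replacement S''s regenerated output
    files, only those from round i onward are forwarded to T; the
    first i files are ignored as duplicates."""
    i = rounds_received
    return replacement_outputs[i:]
```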
Extension – (3)
The master knows where all the inputs
ever received by S came from, so it can
provide them to S’.
Checkpointing and State
Another approach is to design tasks so
that they can periodically write a state
file, which is replicated elsewhere.
Tasks take input + state.
Initially, state is empty.
Master can restart a task from some
state and feed it only inputs received
after that state was written.
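A toy model of restart-from-state (the class name Task is illustrative; a real system would write the state file to replicated storage):

```python
class Task:
    """Toy task whose state is the set of tuples received so far;
    a checkpoint is simply a replicated copy of that set."""
    def __init__(self, state=None):
        self.state = set(state or ())    # empty, or a checkpoint

    def feed(self, tuples):
        new = set(tuples) - self.state   # duplicates are ignored
        self.state |= new
        return new                       # output only truly new tuples

task = Task()                            # initially, state is empty
task.feed({(1, 2)})
checkpoint = set(task.state)             # state written and replicated
later = [{(2, 3)}, {(1, 2)}]             # inputs after the checkpoint
for batch in later:
    task.feed(batch)

# After a failure, the master restarts the task from the last
# checkpoint and replays only the inputs received after it.
replacement = Task(checkpoint)
for batch in later:
    replacement.feed(batch)
```

The replacement ends in the same state as the failed task without replaying inputs from before the checkpoint.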
Example: Checkpointing
p(X,Y) :- p(X,Z) & p(Z,Y)
Two groups of tasks:
1. Join tasks: hash on Z, using h(Z).
Like tasks from previous example.
2. Eliminate-duplicates tasks: hash on X and
Y, using h’(X,Y).
Receives tuples from join tasks.
Distributes truly new tuples to join tasks.
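The two ranks can be sketched together as one loop (a simulation under assumed toy hashes h(v) = v mod k and h’(x,y) = (x+y) mod k, written h2 in the code; the function name transitive_closure is illustrative):

```python
def transitive_closure(p0, k):
    """Nonlinear TC p(X,Y) :- p(X,Z) & p(Z,Y) with two task ranks:
    join tasks keyed by h(Z), dup-elim tasks keyed by h2(X,Y).
    Only the dup-elim state decides whether a tuple is new."""
    h = lambda v: v % k
    h2 = lambda x, y: (x + y) % k
    join_state = [set() for _ in range(k)]   # one state per join task
    seen = [set() for _ in range(k)]         # one state per dup-elim task

    def to_join(t):                          # route a tuple to h(a), h(b)
        a, b = t
        for n in {h(a), h(b)}:
            join_state[n].add(t)

    frontier = set(p0)
    while frontier:
        for t in frontier:                   # dup-elim rank: record, forward
            seen[h2(*t)].add(t)
            to_join(t)
        candidates = set()
        for n in range(k):                   # join rank: match on local Z
            st = join_state[n]
            candidates |= {(x, y) for x, z in st for z2, y in st
                           if z == z2 and h(z) == n}
        frontier = {t for t in candidates if t not in seen[h2(*t)]}
    return set().union(*seen)
```

Splitting duplicate elimination from joining is what lets each tuple be remembered at exactly one dup-elim task, even though it may live at two join tasks.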
Example – (2)
[Diagram: each join task, whose state holds p(x,y) if h(x) or h(y) is right, sends every tuple p(a,b) it produces to the dup-elim task for h’(a,b); each dup-elim task, whose state holds p(x,y) if h’(x,y) is right, forwards p(a,b) to the join tasks for h(a) and h(b) only if the tuple is new.]
Example – Details
Each task writes “buffer” files locally,
one for each of the tasks in the other
rank.
The two ranks of tasks are run on
different racks of nodes, to minimize
the probability that tasks in both ranks
will fail at the same time.
Example – Details – (2)
Periodically, each task writes its state
(tuples received so far) incrementally
and lets the master controller replicate
it.
Problem: the master controller can’t be
too eager to pass output files along as
input, or the files become tiny.
Future Research
There is work to be done on
optimization, using map-reduce or
similar facilities, for restricted languages
such as SQL, Datalog, Datalog–, and
Datalog + aggregation.
Check out Hive and Pig, as well as work
on multiway-join optimization.
Future Research – (2)
Almost everything is open about
recursive Datalog implementation under
map-reduce or similar systems.
Seminaïve evaluation in general case.
Architectures for managing failures.
• Clustera and Hyrax are interesting examples of
(nonrecursive) extension of map-reduce.
When can we avoid communication as with
p(X,Y) :- e(X,Z) & p(Z,Y)?