Cluster Computing, Recursion and Datalog
Foto N. Afrati
National Technical University of Athens, Greece

Map-Reduce Pattern
[Figure: input from DFS → Map tasks → “key”-value pairs → Reduce tasks → output to DFS]

Tasks and Compute-Node Failures
• Low-end hardware
• Expect failures of the compute nodes
  - disk crashes
  - outdated software
  - etc.
We don’t want to restart the whole computation. In Map-Reduce this is easily handled. Why?

Blocking property
• Map-Reduce deals with node failures by restricting the units of computation in an important way.
• Both Map tasks and Reduce tasks have the blocking property: a task does not deliver output to any other task until it has completely finished its work.
• Consequence: a failed task can be rerun on its own, because no other task has consumed any of its output.

Extensions: Dataflow systems
• Use function prototypes for each kind of task the job needs, just as Map-Reduce or Hadoop use prototypes for the Map and Reduce tasks.
• Replicate each prototype into as many tasks as are needed or requested by the user.
• DryadLINQ (Microsoft), Clustera (U. Wisconsin), Hyracks (UC Irvine), Nephele/PACT (T. U. Berlin), BOOM (UC Berkeley), epiC (NUS).
• Other extensions: high-level languages (PIG, Hive, SCOPE)

Simple extension: Several ranks of Map-Reduce computations
Map1 → Reduce1 → Map2 → Reduce2 → Map3 → Reduce3

A more advanced extension: an acyclic network
• The blocking property still holds.
• The tasks are no longer Map and Reduce tasks; they could be anything.
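
Sketch: the blocking property in a toy Map-Reduce run
The blocking property described above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the word-count job, the names map_task, reduce_task, run_blocking and NodeFailure, and the failures counter that simulates crashes are all hypothetical and are not taken from any real Map-Reduce implementation. The point it tries to show is that a task's output is buffered and handed over only after the task finishes, so a crashed attempt can be rerun in isolation.

```python
from collections import defaultdict


class NodeFailure(Exception):
    """Simulated compute-node failure (hypothetical, for illustration)."""


def map_task(chunk):
    # Emit (key, value) pairs for one input chunk; here: (word, 1).
    return [(word, 1) for line in chunk for word in line.split()]


def reduce_task(key, values):
    # Combine all values that share one key.
    return [(key, sum(values))]


def run_blocking(task, args, failures):
    """Run one task; deliver its output only if the attempt completes."""
    while True:
        try:
            output = list(task(*args))  # buffered locally, not yet delivered
            if failures.get(task.__name__, 0) > 0:
                failures[task.__name__] -= 1
                raise NodeFailure(task.__name__)  # crash before delivery
            return output  # delivery happens only here
        except NodeFailure:
            continue  # rerun just this task; nobody saw the lost output


def map_reduce(chunks, failures=None):
    failures = dict(failures or {})  # remaining simulated crashes per task kind
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in run_blocking(map_task, (chunk,), failures):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():
        result.update(run_blocking(reduce_task, (key, values), failures))
    return result
```

Calling map_reduce with and without simulated failures should give the same result, which is the behaviour the blocking property is meant to guarantee.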

We need Recursion
• PageRank: the original problem for which Map-Reduce was developed
• Studies of the structure of the web
• Discovering communities
• Centrality of persons
• Need full transitive closure
• Really operations on relational data

Outline
• Data-volume cost model, in which we can evaluate different algorithms for executing queries on a cluster, recursive or not
• Multiway join in Map-Reduce
• Algorithms for implementing recursive queries, starting with transitive closure
• Generalizing to all Datalog queries
• File transfer between compute nodes involves substantial overhead, hence a need to cope with many small files

Data-volume cost
• Total running time of all the tasks that collaborate in the execution of the algorithm.
• = Renting time on processors from a public cloud.
• Surrogate: the sum of the sizes of the inputs to all the tasks that collaborate on a job.
• Upper limit on the amount of data that can be input to any one task

Details of the cost model (other costs)
• Execution time of a task could, in principle, be much larger than its input data volume
• Many algorithms (e.g., SQL operations, hash join) run in time proportional to the input size plus the output size
• The data-volume model counts only input size, not output size (the output of one task is the input to another, and the final output is succinct)

Joining by Map-Reduce
• Map tasks send R(a,b) to Reduce task i if h(b) = i
• Map tasks send S(b,c) to Reduce task i if h(b) = i
• Reduce task i produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.

Natural Join of Three Relations
  A B      B C      A C
  0 1      1 2      1 3
  1 2      2 3      2 1
The join:
  A B C
  1 2 3

Multiway join on Map-Reduce
Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
• We use 100 Reduce tasks
• Hash X five ways and Y 20 ways
• h and g are the hash functions for X and Y
• Reduce task ID = [h(X), g(Y)].
• Send r(a,b) to all tasks [h(b),v].
• Similarly for t(c,d): send it to all tasks [u,g(c)]
• r facts are sent to 20 tasks
• t facts are sent to 5 tasks
• s facts are sent to only one task

Multiway join on Map-Reduce (minimizing the data-volume cost)
Answer(W,X,Y,Z) :- r(W,X) & s(X,Y) & t(Y,Z)
• If we have k Reduce tasks, we can use two hash functions h(X) and g(Y) that map X-values to k1 buckets and Y-values to k2 buckets.
• k1 k2 = k
• Then a Reduce task corresponds to a pair of buckets, one for h and the other for g.
• To minimize the number of tuples transmitted, pick: k1 = √(k|r|/|t|), k2 = √(k|t|/|r|)
• Data-volume cost (for r and t) = k1|t| + k2|r| = 2√(k|r||t|)

Transitive closure (TC)
Nonlinear
  p(x,y) <- e(x,y)
  p(x,y) <- p(x,z) & p(z,y)
Right-linear
  p(x,y) <- e(x,y)
  p(x,y) <- e(x,z) & p(z,y)

Lower Bound on Query Execution Cost
• Number of derivations: the sum, over all the rules in the program, of the number of ways we can assign values to the variables so as to make the entire body (right side of the rule) true.
• Key point: an implementation of a Datalog program that executes the rules as written must take time on a single processor at least as great as the number of derivations.
• Seminaive evaluation: time proportional to the number of derivations

Number of Derivations for Transitive closure
• Nonlinear TC: the sum over all nodes c of the number of nodes that can reach c times the number of nodes reachable from c.
• Right-linear TC: the sum over all nodes c of the in-degree of c times the number of nodes reachable from c.
• Nonlinear TC has more derivations than linear TC.

Nonlinear TC. Algorithm Dup-Elim
• Join tasks, which perform the join of tuples as in the earlier slide.
• Dup-elim tasks, whose job is to catch duplicate p-tuples before they can propagate.

Nonlinear TC. Algorithm Dup-Elim: a join task
• When p(a, b) is received by a task for the first time, the task:
  1. If this task is h(b), it searches for previously received tuples p(b, x) for any x. For each such tuple, it sends p(a, x) to two tasks: h(a) and h(x).
  2.
If this task is h(a), it searches for previously received tuples p(y, a) for any y. For each such tuple, it sends p(y, b) to the tasks h(y) and h(b).
  3. The tuple p(a, b) is stored locally as an already seen tuple.

Nonlinear TC. Algorithm Dup-Elim
• The method described above for combining join and duplicate-elimination tasks communicates a number of tuples that is at most the sum of:
  1. The number of derivations, plus
  2. Twice the number of path facts, plus
  3. Three times the number of arcs.

Nonlinear TC. Algorithm Smart (improvement on the number of derivations)
• There is a path from node X to node Y if there are two paths:
  • One path from X to some node Z whose length is a power of 2
  • Another path from Z to Y whose length is no greater than that of the first
• Key point: each shortest path between two points is discovered only once

Incomparability of TC Algorithms
• Infinite variety of algorithms that discover shortest paths only once.
  – Like the Smart or Linear algorithms.
• None dominates the others on all graphs.

Implementing any Datalog program
• For each rule we create a collection of tasks
• Tasks are bucketed by vectors of values; each component of the vector is obtained by hashing a certain variable
• A task receives all facts P(a1, a2, ..., ak) discovered by any task, provided that each component ai meets a constraint
• Alternatively, rewrite the rules to have bodies of at most two subgoals

Implementations handling recursion
• HaLoop: iteration implemented as a series of Map-Reduce ranks (VLDB 2010)
• Pregel: checkpointing as the failure-recovery mechanism (SIGMOD 2010)

HaLoop
• Implements recursion as an iteration of Map-Reduce jobs
• Tries hard to place the tasks for one round at the node where their input was created by the previous round.
• The tasks are not really recursive, so no problem of dealing with failure arises.

Pregel
• Views all computation as a recursion on some graph.
• Nodes send messages to one another, bunched into supersteps
• Checkpoints all tasks at intervals. If any task fails, all tasks are rolled back to the previous checkpoint.
• Does not scale completely (the more compute nodes are involved, the more frequently we must checkpoint).

Example: Shortest Paths Via Pregel
[Figure: node N, with outgoing arcs of lengths 5, 3 and 6]
• Node N receives the message “I found a path from node M to you of length L”.
• N asks: “Is this the shortest path from M I know about?”
• If so, N sends along its arcs: “I found a path from node M to you of length L+5”, “I found a path from node M to you of length L+3”, and “I found a path from node M to you of length L+6”.

The endgame (dealing with small files)
• In later rounds of a recursion, the number of new facts derived in a round may drop considerably.
• Recall that there is significant overhead involved in transmitting files.
• Ideally, each recursive task has many tuples for each of the other tasks whenever we distribute data among the tasks.
• An approach: use a small number of rounds.

The polynomial fringe property
• It is highly unlikely that all Datalog programs can be implemented in parallel in polylog time
• The division between those that can and those that (almost surely) cannot was addressed in the 80’s
• The polynomial-fringe property (hereafter PFP) was defined then
• Programs with the PFP can be implemented in parallel in a polylog number of rounds
  – Reduces to TC

Reachability Query
• R(X) <-- R(Y), e(X,Y)
• R(X) <-- Q(X)
• Can be computed in a logarithmic number of rounds by computing the full transitive closure.
• Theorem: cannot be computed in a logarithmic number of rounds unless the arity increases to 2.

Right linear chain programs
P(X,Y) :- Blue(X,Y)
P(X,Y) :- Blue(X,Z) & Q(Z,Y)
Q(X,Y) :- Red(X,Z) & P(Z,Y)
Theorem: can be executed in a logarithmic number of rounds keeping arity 2.

Thank you
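
Supplementary sketch: rounds needed for the Reachability Query
The trade-off stated on the Reachability Query slide can be made concrete with a small single-process simulation. This is a hedged sketch, not a cluster implementation: the function names reach_linear, transitive_closure_nonlinear and reach_via_tc are hypothetical, one loop iteration stands in for one round, and the nonlinear closure is evaluated naively rather than seminaively. The first function applies the unary rules R(X) <-- Q(X) and R(X) <-- R(Y), e(X,Y) directly; the second raises the arity to 2 by computing the full transitive closure with the nonlinear rule, which doubles the reachable path length each round.

```python
from collections import defaultdict


def reach_linear(edges, seeds):
    """Apply R(X) <-- Q(X) and R(X) <-- R(Y), e(X,Y); count iterations."""
    reached = set(seeds)   # facts from R(X) <-- Q(X)
    frontier = set(seeds)  # facts that were new in the previous iteration
    rounds = 0
    while frontier:
        rounds += 1
        # One step of the recursive rule, using only the newest R-facts.
        new = {x for (x, y) in edges if y in frontier} - reached
        reached |= new
        frontier = new
    return reached, rounds


def transitive_closure_nonlinear(edges):
    """Apply p(x,y) <- p(x,z) & p(z,y) until nothing new appears."""
    paths = set(edges)     # facts from p(x,y) <- e(x,y)
    rounds = 0
    while True:
        rounds += 1
        by_source = defaultdict(set)
        for z, y in paths:
            by_source[z].add(y)
        new = {(x, y) for (x, z) in paths for y in by_source.get(z, ())} - paths
        if not new:
            return paths, rounds
        paths |= new


def reach_via_tc(edges, seeds):
    """Answer the reachability query from the binary closure relation."""
    paths, rounds = transitive_closure_nonlinear(edges)
    return set(seeds) | {x for (x, y) in paths if y in seeds}, rounds
```

On a 16-node chain 0 → 1 → ... → 15 with Q = {15}, both functions return all 16 nodes; the unary version needs 16 iterations and the closure-based version 5, at the price of materializing a binary relation.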