TDD: Topics in Distributed Databases
Distributed Query Processing
• MapReduce
• Vertex-centric models for querying graphs
• Distributed query evaluation by partial evaluation
1

Coping with the sheer volume of big data
Using an SSD with a 6 GB/s transfer rate, a linear scan of a data set D would take
• 1.9 days when D is 1 PB (10^15 bytes)
• 5.28 years when D is 1 EB (10^18 bytes)
Can we do better if we have n processors/SSDs?
Divide and conquer: split the work across n processors
[Figure: a shared-nothing architecture -- processors (P), each with its memory (M) and database (DB), connected by an interconnection network]
How to make use of parallelism in query answering?
2

MapReduce
3

MapReduce
A programming model (from Google) with two primitive functions:
• Map: <k1, v1> → list(k2, v2)
• Reduce: <k2, list(v2)> → list(k3, v3)
Input: a list of key-value pairs <k1, v1>
Map: applied to each pair, computes intermediate key-value pairs <k2, v2>
• The intermediate key-value pairs are hash-partitioned on k2; each partition (k2, list(v2)) is sent to a reducer
Reduce: takes a partition as input, and computes key-value pairs <k3, v3>
The process may reiterate: multiple map/reduce steps
How does it work?
4

Architecture (Hadoop)
[Figure: input <k1, v1> pairs flow to mappers, which emit <k2, v2>; after hash partitioning on k2, reducers produce <k3, v3>]
• The input is stored in a distributed file system (DFS), partitioned into blocks (64 MB); one block for each mapper (a map task)
• Intermediate <k2, v2> pairs are kept in the local store of the mappers
• Hash partitioning on k2 assigns each partition to a reducer
• Reducers produce <k3, v3> and the results are aggregated; multiple map/reduce steps are possible
No need to worry about how the data is stored and sent
5

Connection with parallel database systems
[Figure: mappers and reducers as parallel computations over partitioned input]
What parallelism? Data-partitioned parallelism
6

Parallel database systems
[Figure: the mapper/reducer dataflow alongside a shared-nothing parallel database: processors, memories and databases over an interconnection network]
Restricted query language: only two primitives (map and reduce)
7

MapReduce implementation of relational operators
Projection π_A R (A is not necessarily a key of R)
Input: for each tuple t in R, a pair (key, value), where value = t
Map(key, t)
• emit (t[A], t[A])
Apply to each input tuple, in parallel; emit new tuples with the projected attributes
Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)
The reducer is not strictly necessary, but it eliminates duplicates. Why?
Mappers: process each tuple in parallel
8

Selection
Selection σ_C R
Input: for each tuple t in R, a pair (key, value), where value = t
Map(key, t)
• if C(t) then emit (t, "1")
Apply to each input tuple, in parallel; select the tuples that satisfy condition C
Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)
How does it eliminate duplicates?
Reducers: eliminate duplicates
9

Union
Union R1 ∪ R2 (a mapper is assigned chunks from either R1 or R2)
Input: for each tuple t in R1 and s in R2, a pair (key, value)
Map(key, t)
• emit (t, "1")
A mapper just passes each input tuple on to a reducer
Reduce(hkey, hvalue[ ])
• emit (hkey, hkey)
Reducers simply eliminate duplicates
Map: process tuples of R1 and R2 uniformly
10

Set difference
Set difference R1 − R2 (tuples from R1 and R2 must be distinguishable)
Input: for each tuple t in R1 and s in R2, a pair (key, value)
Map(key, t)
• if t is in R1 then emit (t, "1")
• else emit (t, "2")
Tag each tuple with its source relation
Reduce(hkey, hvalue[ ])
• if only "1" appears in the list hvalue then emit (hkey, hkey)
Why does it work? Reducers do the checking
11
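To make the pseudocode above concrete, here is a minimal Python sketch (my own illustration, not from the slides) that simulates a single MapReduce round in memory and uses it to run the set-difference operator just described. The run_mapreduce driver, the tagging convention and the example relations are all assumptions made for illustration; a real job would run on Hadoop rather than in one process.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce round in memory.

    records: list of (key, value) input pairs.
    map_fn(key, value): yields intermediate (k2, v2) pairs.
    reduce_fn(k2, values): yields output (k3, v3) pairs.
    """
    groups = defaultdict(list)                 # hash partitioning on k2
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for k2, values in groups.items():          # one reducer call per group
        output.extend(reduce_fn(k2, values))
    return output

# Set difference R1 - R2: tag each tuple with its source, keep tuples tagged only "1".
def diff_map(key, tagged_tuple):
    source, t = tagged_tuple                   # source is "R1" or "R2"
    yield (t, "1" if source == "R1" else "2")

def diff_reduce(t, tags):
    if "2" not in tags:                        # t never appeared in R2
        yield (t, t)

R1 = [("a", 1), ("b", 2), ("c", 3)]
R2 = [("b", 2)]
records = [(i, ("R1", t)) for i, t in enumerate(R1)] + \
          [(i, ("R2", t)) for i, t in enumerate(R2)]
print(run_mapreduce(records, diff_map, diff_reduce))
# [(('a', 1), ('a', 1)), (('c', 3), ('c', 3))]
```

The same driver runs the projection, selection and union operators above by swapping in their map and reduce functions.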
Reduce-side join
Natural join R1 ⋈_{R1.A = R2.B} R2, where R1[A, C] and R2[B, D]
Input: for each tuple t in R1 and s in R2, a pair (key, value)
Map(key, t)
• if t is in R1 then emit (t[A], ("1", t[C]))
• else emit (t[B], ("2", t[D]))
Hashing on the join attributes
Reduce(hkey, hvalue[ ])
• for each ("1", t[C]) and each ("2", s[D]) in the list hvalue
•   emit ((hkey, t[C], s[D]), hkey)
How to implement R1 ⋈ R2 ⋈ R3?
Inefficient: a nested loop in the reducer. Why?
Reduce-side join (based on hash partitioning)
12

Map-side join
Recall R1 ⋈_{R1.A = R2.B} R2
• Partition R1 and R2 into n partitions, by the same partitioning function on R1.A and R2.B, via either range or hash partitioning
• Compute R1_i ⋈_{R1.A = R2.B} R2_i locally at processor i
• Merge the local results
Partitioned join
Map-side join: the input relations are partitioned and sorted on the join keys; map over R1 and read from the corresponding partition of R2 (a merge join)
map(key, t)
• read R2_i
• for each tuple s in relation R2_i
•   if t[A] = s[B] then emit ((t[A], t[C], s[D]), t[A])
Limitation: sort order and partitioning
13

In-memory join
Recall R1 ⋈_{R1.A < R2.B} R2
• Partition R1 into n partitions, by any partitioning method, and distribute them across n processors
• Replicate the other relation R2 across all processors
• Compute R1_j ⋈_{R1.A < R2.B} R2 locally at processor j
• Merge the local results
Fragment-and-replicate join
Broadcast join: the smaller relation is broadcast to each node and stored in its local memory; the other relation is partitioned and distributed across the mappers
map(key, t)
• for each tuple s in the (local) relation R2
•   if t[A] < s[B] then emit ((t[A], t[C], s[D]), t[A])
Limitation: memory
14

Aggregation
R(A, B, C): compute sum(B) group by A
Map(key, t)
• emit (t[A], t[B])
Grouping: done by the MapReduce framework
Reduce(hkey, hvalue[ ])
• sum := 0;
• for each value s in the list hvalue
•   sum := sum + s;
• emit (hkey, sum)
Compute the aggregate for each group
Leveraging the MapReduce framework
15

Practice: validation of functional dependencies
A functional dependency (FD) defined on schema R: X → Y
– For any instance D of R, D satisfies the FD if for any pair of tuples t and t', if t[X] = t'[X] then t[Y] = t'[Y]
– Violations of the FD in D: { t | there exists t' in D such that t[X] = t'[X] but t[Y] ≠ t'[Y] }
Develop a MapReduce algorithm to find all the violations of the FD in D
– Map: for each tuple t, emit (t[X], t), with t[X] as the key
– Reduce: find all tuples t such that there exists t' with t[X] = t'[X] but t[Y] ≠ t'[Y]; emit (1, t)
Does MapReduce support recursion?
Exercise: write a MapReduce algorithm to validate a set of FDs
16

Transitive closures
The transitive closure TC of a relation R[A, B]:
– R is a subset of TC
– if (a, b) and (b, c) are in TC, then (a, c) is in TC
That is,
• TC(x, y) :- R(x, y);
• TC(x, z) :- TC(x, y), TC(y, z).
Develop a MapReduce algorithm that, given R, computes TC
A fixpoint computation
– How to determine when to terminate?
– How to minimize redundant computation?
Exercise: write a MapReduce algorithm
17

A naïve MapReduce algorithm
Given R(A, B), compute TC
Initially, the input is the relation R
Map((a, b), value)
• emit (a, ("r", b)); emit (b, ("l", a));
Reduce(b, hvalue[ ])
• for each ("l", a) in hvalue
•   for each ("r", c) in hvalue
•     emit (a, c); emit (b, c);
•   emit (a, b);
One round of recursive computation: apply the transitivity rule
Restore (a, b) and (b, c). Why?
Iteration: the output of the reducers becomes the input of the mappers in the next round of MapReduce computation
Termination?
18
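A minimal Python sketch of one round of this naïve algorithm (my own illustration, not from the slides), reusing the in-memory run_mapreduce driver sketched earlier; the fixpoint loop at the bottom plays the role of the external, non-MapReduce driver mentioned on the next slide.

```python
def tc_map(edge, _value):
    a, b = edge
    yield (a, ("r", b))           # b is a right-neighbour of a
    yield (b, ("l", a))           # a is a left-neighbour of b

def tc_reduce(b, hvalue):
    lefts  = [a for tag, a in hvalue if tag == "l"]
    rights = [c for tag, c in hvalue if tag == "r"]
    for a in lefts:
        for c in rights:
            yield ((a, c), None)  # transitivity: (a,b) and (b,c) give (a,c)
        yield ((a, b), None)      # restore the input edge (a, b)
    for c in rights:
        yield ((b, c), None)      # restore the input edge (b, c)

def transitive_closure(R):
    tc = set(R)
    while True:                   # fixpoint loop = the non-MapReduce driver
        records = [(e, None) for e in tc]
        # uses run_mapreduce from the earlier sketch
        new_tc = {e for e, _ in run_mapreduce(records, tc_map, tc_reduce)} | tc
        if new_tc == tc:
            return tc             # no change: the fixpoint is reached
        tc = new_tc

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```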
A MapReduce algorithm
Given R(A, B), compute TC
Map((a, b), value)
• emit (a, ("r", b)); emit (b, ("l", a));
Reduce(b, hvalue[ ])
• for each ("l", a) in hvalue
•   for each ("r", c) in hvalue
•     emit (a, c); emit (b, c);
•   emit (a, b);
Termination: when the intermediate result no longer changes
How many rounds? Controlled by a non-MapReduce driver
Naïve: not very efficient. Why? How to improve it?
19

Smart transitive closure (recursive doubling)
• R: the edge relation
• P_i(x, y): the distance from x to y is in [0, 2^i − 1]
• Q_i(x, y): the shortest distance from x to y is exactly 2^i
1. P_0(x, y) := { (x, x) | x is a node in R };
2. Q_0 := R;
3. i := 0;
4. while Q_i ≠ ∅ do
   a) i := i + 1;
   b) P_i(x, y) := π_{x,y} (Q_{i−1}(x, z) ⋈ P_{i−1}(z, y));
   c) P_i := P_i ∪ P_{i−1};
   d) Q_i(x, y) := π_{x,y} (Q_{i−1}(x, z) ⋈ Q_{i−1}(z, y));
   e) Q_i := Q_i − P_i;   (remove the pairs that already have a shorter path)
Recursive doubling
How many rounds? Recursive doubling: log(|R|)
A project: you already know how to do join, union and set difference in MapReduce; implement smart TC in MapReduce
20
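The loop above uses only join, union and set difference, the operators the project asks you to express in MapReduce. As a plain sequential reference (my own set-based sketch, not a MapReduce implementation), it can be written as follows; it returns the reflexive-transitive closure P computed by the loop.

```python
def smart_tc(R):
    """Smart transitive closure by recursive doubling over an edge set R.

    P holds pairs at distance in [0, 2^i - 1]; Q holds pairs whose shortest
    distance is exactly 2^i. Returns P, the reflexive-transitive closure.
    """
    nodes = {x for edge in R for x in edge}
    P = {(x, x) for x in nodes}                      # paths of length 0
    Q = set(R)                                       # shortest distance exactly 1
    while Q:
        # b) join Q(x, z) with P(z, y) and project onto (x, y)
        P_new = {(x, y) for (x, z) in Q for (w, y) in P if z == w}
        P = P_new | P                                # c) union: distance in [0, 2^i - 1]
        # d) join Q with itself: paths of length exactly 2^i
        Q = {(x, y) for (x, z) in Q for (w, y) in Q if z == w}
        Q = Q - P                                    # e) drop pairs with a shorter path
    return P

print(sorted(smart_tc({(1, 2), (2, 3), (3, 4)})))
# [(1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4), (4, 4)]
```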
Advantages of MapReduce
Simple: one only needs to define two functions
• no need to worry about how the data is stored and distributed, or how the operations are scheduled
Scalability: a large number of low-end machines
• scale out (scale horizontally): add new computers to the distributed application; low-cost "commodity" hardware
• scale up (scale vertically): upgrade, i.e., add (costly) resources to a single node
Independence: it can work with various storage layers
Flexibility: independent of data models or schemas
Fault tolerance: why?
21

Fault tolerance
[Figure: the mapper/reducer dataflow, with the input triplicated (three replicas of each block)]
• Detect failures and reassign the tasks of failed nodes to healthy nodes
• Redundancy checking, to achieve load balancing
• Able to handle an average of 1.2 failures per analysis job
22

Pitfalls of MapReduce
• No schema: schema-free
• No index: index-free
  Why bad? Inefficient to do joins
• A single dataflow: a single input and a single output
• No high-level languages: no SQL
• Functional programming: no support for incremental computation; invariant input data is carried in each round, leading to redundant computation
• The MapReduce model provides no mechanism to maintain global data structures that can be accessed and updated by all mappers and reducers
• Communication is maximized: all-to-all between mappers and reducers
• Low efficiency: I/O, optimization, resource utilization
• Designed to minimize human effort, not to maximize performance
Why low efficiency?
23

Inefficiency of MapReduce
[Figure: the mapper/reducer dataflow]
• Unnecessary shipment of invariant input data in each MapReduce round (HaLoop fixes this)
• Blocking: Reduce does not start until all Map tasks have completed
Despite these, MapReduce is popular in industry
24

MapReduce platforms
Apache Hadoop, used by Facebook, Yahoo, …
– Hive, Facebook, HiveQL (SQL)
– Pig, Yahoo, Pig Latin (SQL-like)
– SCOPE, Microsoft, SQL
NoSQL
– Cassandra, Facebook, CQL (no join)
– HBase, a distributed store modeled after Google's BigTable
– MongoDB, document-oriented (NoSQL)
Distributed graph query engines
– Pregel, Google: a vertex-centric model
– TAO, Facebook
– GraphLab, machine learning and data mining
– Neo4j, Neo Technology; Trinity, Microsoft; HyperGraphDB (knowledge bases)
Study some of these
25

Vertex-centric models
26

The BSP model (Pregel)
Bulk Synchronous Parallel
Vertex-centric: computation is defined to run on each vertex
Supersteps (analogous to MapReduce rounds): within each, all vertices compute in parallel
– Each vertex modifies its own state and that of its outgoing edges
– Sends messages to other vertices (delivered in the next superstep)
– Receives messages sent to it (from the previous superstep)
Message passing
Termination:
– Each vertex votes to halt
– Terminate when all vertices are inactive and no messages are in transit
Synchronization across supersteps; asynchronous execution of all vertices within each superstep
Vertex-centric, message passing
27

The vertex-centric model of GraphLab
No supersteps
Vertex-centric: computation is defined to run on each vertex
All vertices compute in parallel, asynchronously
– Each vertex reads and writes data on adjacent nodes or edges
Consistency: serializability
– Full consistency: no overlap for concurrent updates
– Edge consistency: exclusive read-write on the vertex and its adjacent edges; read-only on adjacent vertices
– Vertex consistency: all updates in parallel (sync operations)
Asynchronous: all vertices
Targets machine learning and data mining
28
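As an illustration of the vertex-centric, message-passing style just described, here is a toy superstep loop in Python (my own sketch, not Pregel's or GraphLab's actual API) that computes single-source shortest distances: each vertex keeps a distance value, takes the minimum of its incoming messages, and sends improved distances along its outgoing edges until no messages remain in transit, i.e., every vertex has voted to halt. The graph layout and weights are hypothetical.

```python
import math
from collections import defaultdict

def pregel_sssp(edges, source):
    """Single-source shortest paths in a BSP / Pregel-like style.

    edges: dict vertex -> list of (neighbour, weight).
    Returns dict vertex -> shortest distance from source.
    """
    dist = {v: math.inf for v in edges}
    inbox = defaultdict(list)
    inbox[source].append(0)                       # initial message to the source
    while any(inbox.values()):                    # supersteps, until no messages in transit
        outbox = defaultdict(list)
        for v in edges:                           # all vertices "compute in parallel"
            msgs = inbox[v]
            if not msgs:
                continue                          # inactive vertex: it has voted to halt
            best = min(msgs)
            if best < dist[v]:                    # state improved: stay active
                dist[v] = best
                for u, w in edges[v]:             # messages delivered in the next superstep
                    outbox[u].append(best + w)
        inbox = outbox
    return dist

graph = {                                         # hypothetical example graph
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
    "d": [],
}
print(pregel_sssp(graph, "a"))                    # {'a': 0, 'b': 1, 'c': 3, 'd': 6}
```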
Vertex-centric models vs. MapReduce
Vertex-centric: think like a vertex; MapReduce: think like a graph
Vertex-centric: maximizes parallelism
– asynchronous; minimizes data shipment via message passing
MapReduce: inefficiency caused by blocking and by shipping intermediate results
Vertex-centric: limited to graphs; MapReduce: general
Lack of global control: ordering for processing vertices in recursive computation, incremental computation, etc.
Algorithms have to be recast in these models; hard to reuse existing (incremental) algorithms
Can we do better?
29

GRAPE: a parallel model based on partial evaluation
30

Querying distributed graphs
Given a big graph G and n processors S1, …, Sn:
• G is partitioned into fragments (G1, …, Gn)
• G is distributed to the n processors: Gi is stored at Si
Parallel query answering
• Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
• Output: Q(G), the answer to Q in G
Each processor Si processes its local fragment Gi in parallel; Q(G) is assembled from Q(G1), Q(G2), …, Q(Gn)
How does it work?
Dividing a big G into small fragments of manageable size
31

GRAPE (GRAPh Engine)
Divide and conquer
• partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
• upon receiving a query Q, evaluate Q on the smaller fragments Gi:
  – evaluate Q(Gi) in parallel
  – collect the partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Each machine (site) Si processes the same query Q, using only the data stored in its local fragment Gi
Data-partitioned parallelism
32

Partial evaluation
To compute f(x) with x = (s, d):
• s: the part of the input that is known; d: the part of the input that is yet unavailable
• conduct the part of the computation that depends only on s, and generate a partial answer: a residual function
Partial evaluation in distributed query processing: at each site, Gi is the known input, and the other fragments Gj are the yet unavailable input
• evaluate Q(Gi) in parallel
• collect the partial matches (residual functions) at a coordinator site, and assemble them to find the answer Q(G) in the entire G
The connection between partial evaluation and parallel processing
33

Coordinator
Each machine (site) Si is either
• a coordinator, or
• a worker: it conducts local computation and produces partial answers
Coordinator: receives and posts queries, controls termination, and assembles the answers
Upon receiving a query Q:
• post Q to all workers
• initialize a status flag for each worker, mutable by that worker
• terminate the computation when all flags are true
• assemble the partial answers from the workers, and produce the final answer Q(G)
Termination and partial-answer assembling
34

Workers
Worker: conducts local computation and produces partial answers
Upon receiving a query Q, using local data Gi only:
• evaluate Q(Gi) in parallel
• send messages to request data for "border nodes" (nodes with edges to other fragments)
Incremental computation: upon receiving new messages M,
• evaluate Q(Gi + M) in parallel
• set its flag to true if there are no more changes to the partial results, and send the partial answer to the coordinator
This step repeats until the partial answer at site Si is ready
Local computation, partial evaluation, recursion, partial answers
35

Subgraph isomorphism
Input: a pattern graph Q and a graph G
Output: all the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
• a bijective function f on nodes: (u, u') ∈ Q iff (f(u), f(u')) ∈ G
Partition G into fragments (G1, …, Gn), distributed to n workers
Coordinator: upon receiving a pattern query Q,
• post the pattern Q to each worker
• set flag[i] to false for each worker Si
How to compute partial answers?
36

Local computation at each worker
Worker: upon receiving a query Q, with d the diameter of Q,
• send messages to request the d-neighbourhood of each border node; denote the requested data by Gd
• compute the partial answer Q(Gi ∪ Gd), by calling any existing subgraph isomorphism algorithm
• set flag[i] to true, and send Q(Gi ∪ Gd) to the coordinator
No incremental computation is needed
Coordinator: the final answer Q(G) is simply the union of all Q(Gi ∪ Gd)
Why is the algorithm correct? The data locality of subgraph isomorphism
Correctness analysis
37
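A toy Python sketch of this worker/coordinator split (my own illustration, not GRAPE's implementation): each worker matches the pattern against its fragment extended with the d-neighbourhood of its border nodes, and the coordinator takes the union of the partial answers. The brute-force matcher, the edge-partitioned fragment layout, and the neighbourhood request (simulated by giving each worker read access to the full edge set) are all simplifying assumptions.

```python
from itertools import permutations

def matches(pattern_edges, graph_edges, graph_nodes):
    """Brute-force subgraph isomorphism: (u,u') in Q iff (f(u),f(u')) in G (small patterns only)."""
    Q = set(pattern_edges)
    p_nodes = sorted({x for e in Q for x in e})
    found = set()
    for image in permutations(graph_nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))                     # a candidate bijection on nodes
        if all(((u, v) in Q) == ((f[u], f[v]) in graph_edges)
               for u in p_nodes for v in p_nodes):
            found.add(tuple(sorted(f.items())))
    return found

def d_neighbourhood(border, all_edges, d):
    """Edges within d hops of the border nodes (the data requested from other fragments)."""
    frontier, seen, extra = set(border), set(border), set()
    for _ in range(d):
        step = {(u, v) for (u, v) in all_edges if u in frontier}
        extra |= step
        frontier = {v for (_, v) in step} - seen
        seen |= frontier
    return extra

def grape_subgraph_matching(fragments, all_edges, pattern_edges, d):
    partial_answers = []
    for frag_edges, border in fragments:                  # each worker, in parallel in GRAPE
        local = set(frag_edges) | d_neighbourhood(border, all_edges, d)
        nodes = {x for e in local for x in e}
        partial_answers.append(matches(pattern_edges, local, nodes))
    return set().union(*partial_answers)                  # coordinator: union of partial answers

all_edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")}   # a 4-node directed cycle
fragments = [({("a", "b"), ("b", "c")}, {"c"}),                # (local edges, border nodes)
             ({("c", "d"), ("d", "a")}, {"a"})]
pattern   = {("x", "y"), ("y", "z")}                           # a path of length 2, diameter d = 2
print(len(grape_subgraph_matching(fragments, all_edges, pattern, d=2)))   # 4 matches
```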
Performance guarantee
Complexity analysis: let
• t(|G|, |Q|) be the sequential complexity, and
• T(|G|, |Q|, n) be the parallel complexity, where n is the number of processors
Then T(|G|, |Q|, n) = O(t(|G|, |Q|) / n)   (a polynomial reduction)
Parallel scalability: the more processors, the faster the algorithm (provided that n << |G|). Why?
Proof: computational costs + communication costs
Reuse of existing algorithms: a worker can employ any existing sequential algorithm for subgraph isomorphism
Compare this with MapReduce
Necessary parts for your project: correctness proof, complexity analysis, performance guarantees
Partial evaluation + data-partitioned parallelism
38

Transitive closures
TC(x, y) :- R(x, y);
TC(x, z) :- TC(x, y), TC(y, z).
Represent R as a graph G: elements as nodes, and R as its edge relation
Partition G into fragments (G1, …, Gn), distributed to n workers
Worker:
• compute TC(Gi) in parallel
• send messages to request data for "border nodes"
• incremental step: given new messages M, incrementally compute TC(Gi + M)
Termination: repeat until there are no more changes
Compared with MapReduce: minimize unnecessary recomputation
39

Summary and review
• What is the MapReduce framework? Pros? Pitfalls?
• How to implement joins in the MapReduce framework?
• Develop algorithms in MapReduce
• Vertex-centric models for querying big graphs
• Distributed query processing with performance guarantees by partial evaluation
• What performance guarantees do we want for evaluating graph queries on distributed graphs?
• Compare the four parallel models: MapReduce, BSP, vertex-centric, and GRAPE
• Correctness proof, complexity analysis and performance guarantees for your algorithms
40

Project (1)
A "smart" algorithm for computing the transitive closure of R is given in "Distributed Algorithms for the Transitive Closure":
http://asingleneuron.files.wordpress.com/2013/10/distributedalgorithmsfortransitiveclosure.pdf
1. TC(x, y) := ∅; Q := R;
2. while Q ≠ ∅ do
   a) TC := Q ∪ TC;
   b) TC := (Q ⋈ TC) ∪ TC;
   c) Q := (Q ⋈ Q) − TC;
Implement the algorithm in the programming models of
• MapReduce (recall that you can implement union and join in MapReduce)
• GRAPE
Give a correctness proof and complexity analysis
Experimentally evaluate and compare the implementations
41

Flashback: graph simulation
Input: a pattern graph Q and a graph G
Output: Q(G), a binary relation S on the nodes of Q and G such that
• each node u in Q is mapped to a node v in G with (u, v) ∈ S
• for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G such that (u', v') ∈ S
Complexity: O((|V| + |VQ|) (|E| + |EQ|)) time
42
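For reference, below is a naive sequential sketch of graph simulation in Python (my own illustration of the definition above, using a simple remove-until-fixpoint loop rather than the O((|V| + |VQ|)(|E| + |EQ|)) algorithm; node labels, which practical patterns use to prune the initial candidate pairs, are omitted just as they are in the definition). Project (2) below asks for parallel versions of this computation.

```python
def graph_simulation(Q_edges, G_edges, Q_nodes=None, G_nodes=None):
    """Compute the largest simulation relation S of pattern Q in graph G.

    A pair (u, v) survives iff every pattern edge (u, u') can be matched by
    some graph edge (v, v') with (u', v') still in S.
    """
    if Q_nodes is None:
        Q_nodes = {x for e in Q_edges for x in e}
    if G_nodes is None:
        G_nodes = {x for e in G_edges for x in e}
    q_succ = {u: {y for (x, y) in Q_edges if x == u} for u in Q_nodes}
    g_succ = {v: {y for (x, y) in G_edges if x == v} for v in G_nodes}
    S = {(u, v) for u in Q_nodes for v in G_nodes}    # start from all candidate pairs
    changed = True
    while changed:                                    # fixpoint: repeatedly remove bad pairs
        changed = False
        for (u, v) in list(S):
            ok = all(any((u2, v2) in S for v2 in g_succ[v]) for u2 in q_succ[u])
            if not ok:
                S.discard((u, v))
                changed = True
    matched = {u for (u, _) in S}
    return S if Q_nodes <= matched else set()         # every pattern node must have a match

Q = {("x", "y")}                                      # pattern: one edge x -> y
G = {("a", "b"), ("b", "c"), ("c", "a")}              # a directed triangle
print(sorted(graph_simulation(Q, G)))
# [('x', 'a'), ('x', 'b'), ('x', 'c'), ('y', 'a'), ('y', 'b'), ('y', 'c')]
```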
Project (2)
Develop two algorithms that, given a graph G and a pattern Q, compute Q(G) via graph simulation, based on two of the following models:
• MapReduce
• BSP (Pregel)
• The vertex-centric model of GraphLab
• GRAPE
Give a correctness proof and complexity analysis of your algorithms
Experimentally evaluate your algorithms
43

Project (3)
Given a directed graph G and a pair of nodes (s, t) in G, the distance between s and t is the length of a shortest path from s to t.
Develop two algorithms that, given a graph G, compute the shortest distance for all pairs of nodes in G, based on two of the following models:
• MapReduce
• BSP (Pregel)
• The vertex-centric model of GraphLab
• GRAPE
Give a correctness proof and complexity analysis of your algorithms
Experimentally evaluate your algorithms
44

Project (4)
Show that GRAPE can be revised to support relational queries, based on what you have learned about parallel/distributed query processing.
Show
• how to support relational operators
• how to optimize your queries
Implement a lightweight system to support relational queries in GRAPE
• basic relational operators
• SPC queries
Demonstrate your system
45

Project (5)
Write a survey on any of the following topics, by evaluating 5-6 representative papers/systems:
• Parallel models for query evaluation
• Distributed graph query engines
• Distributed database systems
• MapReduce platforms
• NoSQL systems
• …
Develop a set of criteria, and evaluate the techniques/systems based on those criteria
Demonstrate your understanding of the topics
46

Reading for the next week
http://homepages.inf.ed.ac.uk/wenfei/publication.html
1. W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013. (BD-tractability)
2. Y. Tao, W. Lin, and X. Xiao. Minimal MapReduce Algorithms. SIGMOD 2013. (MMC) http://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf
3. J. Hellerstein. The Declarative Imperative: Experiences and Conjectures in Distributed Logic. SIGMOD Record 39(1), 2010. (communication complexity) http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS2010-90.pdf
4. Y. Cao, W. Fan, and W. Yu. Bounded Conjunctive Queries. VLDB 2014. (scale independence)
5. W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression. SIGMOD 2012. (query-preserving compression)
6. W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries Using Views. ICDE 2014. (query answering using views)
7. F. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. MapReduce Extensions and Recursive Queries. EDBT 2011. https://asterix.ics.uci.edu/pub/EDBT11-afrati.pdf
47