Web-Scale Data Management Systems
Prof. Oliver Kennedy
Monday, September 10, 2012
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012.html
okennedy@buffalo.edu

About Me
• New to UB (super excited to be here)
• 4+ years working in databases
  • And 3+ years in distributed systems before that
• Worked with analytics teams at Microsoft Azure

Office Hours
• Davis Hall 338H
  • Monday: 2:00-4:00 PM
  • Thursday: 2:00-4:00 PM
  • or by appointment

Course Goals
• Familiarize you with systems for storing and analyzing really, really big data
  • Real-world systems (if you want to go into industry)
  • Interesting problems (if you want to go into academia)
  • Past, present, and future systems
• Give (some of) you experience working with these systems.

Course Requirements (Everyone)
• Submit paper overviews each week
  • 1-2 paragraphs per paper
  • What's the message of the paper?
  • What are the pros and cons of the paper's approach?
• Everyone gets 2 weeks of missed overviews

Course Requirements (2+ Credits)
• Present 2 papers
  • 30-40 minute presentation
  • 20-30 minutes of discussion
• Arrange to go over your presentation with me by the prior Thursday.

Course Requirements (3+ Credits)
• A simple implementation project
  • Hadoop-based Join Algorithms
  • A Ring DHT
  • Standalone Distributed Join Algorithms
  • Or develop your own project!

Course Requirements (3+ Credits; Continued)
• Available Resources for Projects
  • A 12-core development testbed
  • $100 of Amazon Cloud Credit
    • Be careful with this credit! Avoid running out early!

Course Requirements (3+ Credits; Continued)
• Project Requirements
  • A 1-page milestone report on Nov 1
  • A short (~4 page) final report
  • A short (10-15 minute) presentation
• Confirm your project with me by Oct. 1

Course Topics
• Dryad: Data-Flow Computation
• Map Reduce & HDFS: Map/Reduce
• Hive & HadoopDB: SQL on M/R
• Pig & Dremel: Other Languages on M/R
• MonetDB & DataCyclotron: Column Stores
• Cassandra & BigTable/HBase: Semistructured Databases
• Zookeeper & Percolator: Distributed Consistency
• Chord & Dynamo: Distributed Hash Tables
• PIQL: Enforced Scalability
• Lipstick: Workflow Provenance
• Borealis & DBToaster: Stream Processors
Very recent (presented last month)

Questions about the Course?

Dryad
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
Microsoft Research: Silicon Valley
(presented by Oliver Kennedy)

Parallel Programming
Hard Problem!
Locks! Deadlock! IPC! Error Handling!

What Has Worked?
The programmer knows what's data-parallel.
The computer understands the infrastructure.
What Has Worked?
cat users.dat | sed 's/\([^ ]*\).*/\1/' | grep 'buffalo' | wc -l
Unix pipes allow programmers to compose 'nuggets' of computation.

Graph Programming
Use the same metaphor! Join nuggets of code into a graph.

Graph Programming
How about an example?

SELECT distinct p.objID
FROM photoObjAll p
JOIN neighbors n
  ON p.objID = n.objID
  AND n.objID < n.neighborObjID
JOIN photoObjAll l
  ON l.objID = n.neighborObjID
  AND SimilarColor(l.rgb, n.rgb)

The query becomes a dataflow graph, built up step by step:
• Inputs p (photoObjAll) and n (neighbors) feed join X, producing p⋈n.
• A projection π reads p⋈n and emits n.neighborObjID : p⋈n.
• A sort S reads n.neighborObjID : p⋈n and emits p⋈n sorted by n.neighborObjID.
• The sorted stream and input l (photoObjAll) feed join Y, which produces the final result.
• Each edge is a channel: a TCP stream, an in-memory FIFO, or a file.
• The final build shows two copies of this graph side by side, running in parallel.

• Graph Programming
• Execution Model
• Evaluation

Building a Graph
A graph is a quadruple ⟨V, E, I, O⟩: vertices, edges, input vertices, and output vertices.

A single vertex A is the graph ⟨{A}, {}, {A}, {A}⟩.

A^k clones A k times:
  ⟨{A1, A2, …}, {}, {A1, A2, …}, {A1, A2, …}⟩
Everything is cloned.

A^k >= B^k connects A's outputs to B's inputs, one-to-one:
  ⟨V_A ⊕ V_B, E_A ∪ E_B ∪ E_new, I_A, O_B⟩

A^k >> B^k connects A's outputs to B's inputs, all-to-any:
  ⟨V_A ⊕ V_B, E_A ∪ E_B ∪ E_new, I_A, O_B⟩

(A>=C>=D>=B) || (A>=E>=B) merges two graphs:
  ⟨V_A ⊕* V_B, E_A ∪* E_B, I_A ∪* I_B, O_A ∪* O_B⟩
  (* = except merging duplicates)

Summary
A^k      Parallelize A, with k replicas
A >= B   Connect A's outputs to B's inputs (one-to-one)
A >> B   Connect A's outputs to B's inputs (all-to-any)
A || B   Merge graphs A and B (deduplicating nodes in both A and B)
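To make the composition rules concrete, here is a minimal sketch of the ⟨V, E, I, O⟩ algebra in Python. This is not Dryad's actual API (the paper builds graphs in C++ with overloaded operators); the Graph class, the string vertex names, and the round-robin handling of mismatched output/input counts are all illustrative assumptions.

# A minimal sketch of the <V, E, I, O> graph algebra above. Not Dryad's
# actual API; the Graph class, string vertex names, and round-robin
# wiring are assumptions made for illustration.

class Graph:
    def __init__(self, vertices, edges, inputs, outputs):
        self.V = list(vertices)   # all vertices
        self.E = list(edges)      # (source, destination) pairs
        self.I = list(inputs)     # input vertices
        self.O = list(outputs)    # output vertices

    @staticmethod
    def vertex(name):
        # A single vertex A is the graph <{A}, {}, {A}, {A}>.
        return Graph([name], [], [name], [name])

    def __xor__(self, k):
        # A ^ k: clone the whole graph k times; everything is cloned.
        def rename(names, i):
            return [f"{n}_{i}" for n in names]
        parts = [Graph(rename(self.V, i),
                       [(f"{s}_{i}", f"{d}_{i}") for s, d in self.E],
                       rename(self.I, i), rename(self.O, i))
                 for i in range(k)]
        return Graph([v for p in parts for v in p.V],
                     [e for p in parts for e in p.E],
                     [v for p in parts for v in p.I],
                     [v for p in parts for v in p.O])

    def __ge__(self, other):
        # A >= B: connect A's outputs to B's inputs one-to-one
        # (round-robin if the counts differ, a simplification).
        new_edges = [(o, other.I[i % len(other.I)])
                     for i, o in enumerate(self.O)]
        return Graph(self.V + other.V, self.E + other.E + new_edges,
                     self.I, other.O)

    def __rshift__(self, other):
        # A >> B: connect every output of A to every input of B (all-to-any).
        new_edges = [(o, i) for o in self.O for i in other.I]
        return Graph(self.V + other.V, self.E + other.E + new_edges,
                     self.I, other.O)

    def __or__(self, other):
        # A || B: merge the two graphs, deduplicating shared vertices/edges.
        dedup = lambda xs, ys: list(dict.fromkeys(xs + ys))
        return Graph(dedup(self.V, other.V), dedup(self.E, other.E),
                     dedup(self.I, other.I), dedup(self.O, other.O))


# The skeleton used later in the talk: (A>=C>=D>=B) || (A>=E>=B).
# Parenthesized explicitly, because a bare A >= C >= D would be a
# chained comparison in Python.
A, B, C, D, E = (Graph.vertex(x) for x in "ABCDE")
g = (((A >= C) >= D) >= B) | ((A >= E) >= B)
print(g.V)   # shared A and B appear once
print(g.E)   # edges from both branches

The point of the operator notation is the same as with Unix pipes: the programmer writes small, composable pieces and a declarative wiring, and the system decides how to run them.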
• Graph Programming
• Execution Model
• Evaluation

Job Execution
• The Job Manager coordinates graph execution.
• Edges can be implemented as files, TCP streams, in-memory FIFOs, etc. (see the sketch after this list)
  • The programmer doesn't need to worry about the channels.
• For in-memory FIFOs, downstream nodes are spawned immediately.
• For files, the source node runs to completion and the file is closed before the downstream node is spawned.
• If a downstream node fails, its connection to the Job Manager dies, and the Job Manager restarts it.
  • The restarted node re-runs its computation from the start of the file.
• For a FIFO or stream, the Job Manager recreates the data by also restarting the downstream node.
  • The downstream node is assumed to be deterministic (and side-effect free).
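A rough sketch of what that channel abstraction looks like from a vertex's point of view. The classes and function below are invented names, not Dryad's implementation; they only illustrate that vertex code is identical whether the Job Manager backs an edge with a file or an in-memory FIFO, while the two differ in how failures are recovered.

# Sketch only: FileChannel, FifoChannel, and run_vertex are invented names
# illustrating the channel abstraction. Dryad's real channels also include
# TCP streams, which behave like the FIFO case for fault tolerance.

import pickle, queue, tempfile

class FileChannel:
    """Durable: the producer finishes and the file is closed before the
    consumer is spawned, so a failed consumer simply re-reads the file."""
    def __init__(self):
        self.path = tempfile.NamedTemporaryFile(delete=False).name
    def write_all(self, items):
        with open(self.path, "wb") as f:
            pickle.dump(list(items), f)
    def read_all(self):
        with open(self.path, "rb") as f:
            return pickle.load(f)

class FifoChannel:
    """Volatile: producer and consumer run concurrently, and the data is
    gone if either side fails, so the Job Manager must also restart the
    node on the other end of the channel."""
    def __init__(self):
        self.q = queue.Queue()
    def write_all(self, items):
        for x in items:
            self.q.put(x)
        self.q.put(None)                 # end-of-stream marker
    def read_all(self):
        out = []
        while (item := self.q.get()) is not None:
            out.append(item)
        return out

def run_vertex(fn, in_channel, out_channel):
    # A vertex just reads from and writes to "a channel"; the Job Manager
    # decides which concrete kind backs each edge.
    out_channel.write_all(fn(in_channel.read_all()))

Re-running vertices like this is only safe because, as the slide notes, vertices are assumed to be deterministic and side-effect free.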
Runtime Graph Refinement
• What about aggregation? Many A vertices feeding a single B is expensive: lots of work at B, and lots of network traffic.
• We can use runtime information to optimize: once the Job Manager assigns nodes to machines (e.g., Server 1 and Server 2), we can apply a user-provided function to pre-aggregate.
• Each machine's local A replicas feed a pre-aggregation vertex over in-memory FIFOs, and only one pre-aggregated network stream per machine goes to B: less work and lower bandwidth. (Sketched below.)
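Roughly what that refinement amounts to, as a sketch: given the machine placement discovered at run time, group each machine's local replicas behind one pre-aggregation vertex, so only one partially aggregated stream per machine crosses the network. The function and vertex names below are invented for illustration, not Dryad's API.

# Sketch of run-time pre-aggregation; insert_preaggregators, the vertex
# names, and the channel labels are illustrative assumptions.

from collections import defaultdict

def insert_preaggregators(edges, placement, agg_prefix="B"):
    """edges: (src, dst) pairs from upstream replicas to the aggregator;
    placement: dict mapping each upstream vertex to its machine."""
    by_machine = defaultdict(list)
    for src, dst in edges:
        by_machine[(placement[src], dst)].append(src)

    refined = []
    for (machine, dst), srcs in sorted(by_machine.items()):
        pre = f"{agg_prefix}_{machine}"  # one pre-aggregation vertex per machine
        refined += [(s, pre, "in-memory FIFO") for s in srcs]  # local, cheap
        refined.append((pre, dst, "network stream"))           # one per machine
    return refined

# Example: six A replicas on two machines, all feeding one aggregator B.
placement = {f"A{i}": ("server1" if i <= 3 else "server2") for i in range(1, 7)}
edges = [(f"A{i}", "B") for i in range(1, 7)]
for edge in insert_preaggregators(edges, placement):
    print(edge)

The pre-aggregation function itself is supplied by the user, but where and how many of these vertices to insert is decided by the Job Manager from runtime information.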
Each subwhich is somewhat faster DryadLINQ’s automatic Evaluation 16.0 Each Dryad In-Memory 14.0 Speed-up 12.0 10.0 8.0 6.0 4.0 2.0 0.0 P: D: S: MS: C: Read/Parse Input Distribute Copies In-Memory Sort Merge Sort Count Dryad Two-Pass k SQLServer 2005 0 2 4 6 Number of Computers 8 R Q k n Q C R Q S is: C k Each R S is: D C P MS 10 n gure 8: The speedup of the SQL query computation is nearnear in the number of computers used. The baseline is relative Dryad running on a single computer and times are given in Table 2. Figure 9: The communication graph to compute a query histogram. Details are in Section 6.3. This figure shows the first cut “naive” encapsulated version that doesn’t scale well. Time: ?? (really bad) = 6 and up, again with close to linear speed-up, and pproximately twice as fast as the two-pass variant. The QLServer result matches our expectations: our specialed Dryad program runs significantly, but not outrageously, Monday, September 10, 12 ster than SQLServer’s general-purpose query engine. We After trying a number of different encapsulation and dynamic refinement schemes we arrived at the communication graphs shown in Figure 10 for our experiment. Each sub- Evaluation Dynamically Refined Into a Each 450 R 450 Q' is: b 450 Each T R C Each is: R 450 33.4 GB R R T Q' 99,713 S is: 118 GB D P C C M MS MS T 217 T 154 GB Q' 10,405 99,713 Q' 10.2 TB Time: 11 min 30 sec Figure 10: Rearranging the vertices gives better scaling performance compared with Figure 9. The user supplies graph (a) specifying that 450 buckets should be used when distributing the output, and that each Q! vertex may receive up to 1GB of input while each T may receive up to 600MB. The number of Q! and T vertices is determined at run time based on the number of partitions in the input and the network locations and output sizes of preceding vertices in the graph, and the refined graph (b) is executed by the system. Details are in Section 6.3. Monday, September 10, 12 tial data reduction and the reduce phase (our C vertex) performs some additional relatively minor data reduction. A different topology might give better perfor- each taking inputs from one or more previous stages or the file system. Nebula transforms Dryad into a generalization of the Unix piping mechanism and it allows programmers to Evaluation Figure 7: Sorting increasing amounts of data while keeping the volume of data per computer fixed. The total data sorted by an -machine experiment is around GBytes, or Bytes when . Monday, September 10, 12 Computers Dryad 2167 451 242 135 92 Figure 8: The speed-up of t ber of computers is varied. running on a single comput to a hand-tuned sort st Conclusions • The hard part of distributed programming is infrastructure. • • Programmers like thinking in code nuggets. • Dryad isn’t used much, but production dataflow systems use virtually identical techniques. Graphs are a natural representation of distributed computations. • Monday, September 10, 12 E.g., Storm (Twitter’s analytics infrastructure) Monday, September 10, 12 Conclusions • The hard part of distributed programming is infrastructure. • • Programmers like thinking in code nuggets. • Dryad isn’t used much, but production dataflow systems use virtually identical techniques. Graphs are a natural representation of distributed computations. • Monday, September 10, 12 E.g., Storm (Twitter’s analytics infrastructure) Conclusions • The hard part of distributed programming is infrastructure. • • Programmers like thinking in code nuggets. 
Conclusions
• The hard part of distributed programming is infrastructure.
  • Programmers like thinking in code nuggets.
  • Graphs are a natural representation of distributed computations.
• Dryad isn't used much, but production dataflow systems use virtually identical techniques.
  • E.g., Storm (Twitter's analytics infrastructure)

Questions?