Hadoop/MapReduce as a Platform for Data-Intensive Computing
Jimmy Lin, University of Maryland (currently at Twitter)
Friday, December 2, 2011

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Our World: Large Data
(Source: Wikipedia, Hard disk drive)

How much data?
- 20 PB processed a day (2008)
- 150 PB on 50k+ servers running 15k apps
- 9 PB of user data, growing by >50 TB/day (11/2011)
- Wayback Machine: 3 PB, growing by 100 TB/month (3/2009)
- 36 PB of user data, growing by 80-90 TB/day (6/2010)
- LHC: ~15 PB a year (at full capacity)
- S3: 449 billion objects, peak of 290k requests/second (7/2011)
- LSST: 6-10 PB a year (~2015)
"640K ought to be enough for anybody."

Why large data? Science. Engineering. Commerce.
(Source: Wikipedia, Everest)

Science
- Emergence of the 4th Paradigm: data-intensive e-Science
(Photo: Maximilien Brice, © CERN)

Engineering
- The unreasonable effectiveness of data
- Count and normalize!
(Source: Wikipedia, Three Gorges Dam)

Commerce
- Know thy customers
- Data → Insights → Competitive advantages
(Source: Wikipedia, Shinjuku, Tokyo)

Data nirvana requires the right infrastructure: store, manage, organize, analyze, distribute, visualize, ...

Why large data? How large data?
Cheap commodity clusters (or utility computing)
+ simple, distributed programming models
= data-intensive computing for the masses!
(Source: flickr, turtlemom_nancy/2046347762)

Divide et impera (divide and conquer)
- Chop the problem into smaller parts: partition the "Work" into w1, w2, w3
- Workers compute partial results r1, r2, r3
- Combine the partial results into the final "Result"
(Source: Wikipedia, Forest)

Parallel computing is hard!

Fundamental issues:
- Different programming models: message passing vs. shared memory
- Scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, ...

Architectural issues:
- Flynn's taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth
- UMA vs. NUMA, cache coherence

Different programming constructs:
- Mutexes, condition variables, barriers, ...
- Masters/slaves, producers/consumers, work queues, ...

Common problems:
- Livelock, deadlock, data starvation, priority inversion, ...
- Dining philosophers, sleeping barbers, cigarette smokers, ...

The reality: the programmer shoulders the burden of managing concurrency. (I want my students developing new algorithms, not debugging race conditions.)
(Sources: Ricardo Guimarães Herrmann; MIT OpenCourseWare)

The datacenter is the computer!
(Source: NY Times, 6/14/2006)

MapReduce

Functional programming meets distributed processing:
- Independent per-record processing in parallel
- Aggregation of intermediate results to generate the final output

Programmers specify two functions:
- map (k, v) → <k', v'>*
- reduce (k', v') → <k', v'>*
- All values with the same key are sent to the same reducer

The execution framework handles everything else:
- Scheduling
- Data management, transport, etc.
- Synchronization
- Errors and faults

Recall "count and normalize"? Perfect!
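To make the two-function programming model concrete, here is a minimal word-count sketch in Python. The in-memory driver stands in for the execution framework and is purely illustrative; Hadoop's actual API is in Java, and the function names here are assumptions for the sketch.

```python
# Word count, the "hello world" of MapReduce: a minimal sketch of the
# map/reduce programming model with an in-memory stand-in for the
# framework's shuffle-and-sort.
from collections import defaultdict

def map_fn(doc_id, text):
    # map(k, v) -> <k', v'>*: emit (word, 1) for every token
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce(k', v'*) -> <k', v'>*: sum the partial counts for one key
    yield (word, sum(counts))

def run_mapreduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in mapper(k, v):       # map phase
            groups[k2].append(v2)         # shuffle/sort: group values by key
    return [out for k2, vs in sorted(groups.items())
                for out in reducer(k2, vs)]  # reduce phase

print(run_mapreduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn))
# [('a', 2), ('b', 2), ('c', 1)]
```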
[Figure: canonical MapReduce dataflow. Mappers consume input pairs (k1,v1) ... (k6,v6) and emit intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8). Shuffle and sort aggregates values by key (a: [1, 5]; b: [2, 7]; c: [2, 3, 6, 8]) and three reducers produce the final pairs (r1,s1), (r2,s2), (r3,s3).]

[Figure: MapReduce execution overview, adapted from Dean and Ghemawat (OSDI 2004). (1) The user program submits the job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read input splits; (4) map output is written to local disk as intermediate files; (5) reduce workers remotely read the intermediate files; (6) reducers write the output files.]

MapReduce Implementations
- Google has a proprietary implementation in C++, with bindings in Java and Python
- Hadoop is an open-source implementation in Java: development led by Yahoo, used in production, now an Apache project with a rapidly expanding software ecosystem

Statistical Machine Translation (Chris Dyer)
(Source: Wikipedia, Rosetta Stone)

[Figure: the statistical MT pipeline. Parallel sentences (e.g., "vi la mesa pequeña" / "i saw the small table") feed word alignment and then phrase extraction, yielding phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table) for the translation model. Target-language text (e.g., "he sat at the table", "the service was good") trains the language model. The decoder combines both models to turn a foreign input sentence ("maria no daba una bofetada a la bruja verde") into an English output sentence ("mary did not slap the green witch").]

The noisy channel formulation:

  \hat{e}_1^I = \arg\max_{e_1^I} P(e_1^I \mid f_1^J) = \arg\max_{e_1^I} P(e_1^I) \, P(f_1^J \mid e_1^I)

Translation as a Tiling Problem
[Figure: candidate phrase translations tile the source sentence "Maria no dio una bofetada a la bruja verde". Competing tiles include "Mary", "not" / "did not" / "no", "give a slap" / "a slap" / "slap", "to the" / "the", "green witch" / "witch green", etc.; the decoder searches for the highest-scoring tiling under the equation above.]

The Data Bottleneck
"Every time I fire a linguist, the performance of our ... system goes up." - Fred Jelinek

We've built MapReduce implementations of two components of this pipeline: word alignment and phrase extraction! (2008)

HMM Alignment: Giza vs. MapReduce
[Charts: alignment running time on a single-core commodity server (Giza) versus a 38-processor cluster (MapReduce), with 1/38 of the single-core time shown as a reference line.]

What's the point? The optimally-parallelized version doesn't exist!
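To connect "count and normalize" to the translation model above, here is a toy sketch of estimating phrase translation probabilities P(e | f) by relative frequency over extracted phrase pairs. The pair list and all names are illustrative assumptions, not Chris Dyer's actual implementation; in MapReduce terms, the first pass maps each pair to ((f, e), 1) and reduces by summing, and the second pass normalizes by the marginal count of f.

```python
# "Count and normalize" for a toy phrase translation model:
# estimate P(e | f) by relative frequency of extracted phrase pairs.
from collections import Counter, defaultdict

extracted_pairs = [               # (foreign phrase, english phrase)
    ("la mesa pequeña", "the small table"),
    ("la mesa pequeña", "the little table"),
    ("vi", "i saw"),
    ("la mesa pequeña", "the small table"),
]

# Count: map each pair to ((f, e), 1), reduce by summing.
pair_counts = Counter(extracted_pairs)

# Normalize: divide each pair count by the marginal count of f.
f_totals = defaultdict(int)
for (f, e), c in pair_counts.items():
    f_totals[f] += c

p_e_given_f = {(f, e): c / f_totals[f] for (f, e), c in pair_counts.items()}
print(p_e_given_f[("la mesa pequeña", "the small table")])  # 2/3
```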
MapReduce occupies a sweet spot in the design space for a large class of problems:
- Fast, in terms of running time and scaling characteristics
- Easy, in terms of programming effort
- Cheap, in terms of hardware costs

Sequence Assembly (Michael Schatz)
(Source: Wikipedia, DNA)

Strangely-Formatted Manuscript
- Dickens: A Tale of Two Cities, with the text written on one long spool:
  "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, ..."

With Duplicates
- The same manuscript, "backed up" on four more copies, each repeating the identical text.

Shredded Book Reconstruction
- Dickens accidentally shreds all five copies of the manuscript
  [Figure: the five copies cut into overlapping five-word fragments, jumbled together.]
- How can he reconstruct the text?
  5 copies × 138,656 words ÷ 5 words per fragment = ~138k fragments
- The short fragments from every copy are mixed together
- Some fragments are identical

Greedy Assembly
The sorted fragments:
  It was the best of
  age of wisdom, it was
  best of times, it was
  it was the age of
  it was the worst of
  of times, it was the
  of wisdom, it was the
  the age of wisdom, it
  the best of times, it
  the worst of times, it
  times, it was the age
  times, it was the worst
  was the age of foolishness,
  was the age of wisdom,
  was the best of times,
Chaining fragments by four-word overlap works until a repeat is reached: after "... of times, it was the", both "times, it was the worst" and "times, it was the age" match. The repeated sequences make the correct reconstruction ambiguous!
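Here is a minimal sketch of the shredding and greedy-overlap idea, assuming five-word fragments; the function names (shred, greedy_extend) are illustrative, not Schatz's actual code. It stalls exactly where the slide says it must: at the repeated "times, it was the ...".

```python
# Shred a text into overlapping k-word fragments, then greedily chain
# fragments by (k-1)-word overlap until the next step is ambiguous.
def shred(text, k=5):
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def greedy_extend(fragments, start, k=5):
    contig = start.split()
    frags = set(fragments)            # identical fragments collapse
    while True:
        suffix = contig[-(k - 1):]
        nexts = [f for f in frags if f.split()[:k - 1] == suffix]
        if len(nexts) != 1:           # zero or several choices: stop
            return " ".join(contig), sorted(nexts)
        contig.append(nexts[0].split()[k - 1])

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness,")
frags = shred(text)
contig, choices = greedy_extend(frags, frags[0])
print(contig)   # 'It was the best of times, it was the' -- then it stalls
print(choices)  # the two competing fragments at the branch point
```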
Alternative: model sequence reconstruction as a graph problem...

de Bruijn Graph Construction
- D_k = (V, E)
  - V = all length-k subfragments (k < l)
  - E = directed edges between consecutive subfragments (nodes overlap by k-1 words)
- Example: the original fragment "It was the best of" yields a directed edge from node "It was the best" to node "was the best of"
- The locally constructed graph reveals the global structure; overlaps between sequences are implicitly computed
(de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001)

de Bruijn Graph Assembly
[Figure: the de Bruijn graph of the fragments, before and after compressing non-branching paths into nodes such as "It was the best of times, it", "it was the worst of times, it", "of times, it was the", "it was the age of", "the age of wisdom, it was the", and "the age of foolishness".]
- A unique Eulerian tour of the graph reconstructs the original text
- If a unique tour does not exist, try to simplify the graph as much as possible

Short Read Assembly
[Figure: a sequencer turns the subject genome into overlapping short reads, e.g., GATGCTTACTATGCGGGCCCC, CGGTCTAATGCTTACTATGC, ..., some containing errors; the assembler must reconstruct the genome from the reads.]
- Human genome: 3 Gbp; a few billion short reads (~100 GB of compressed data)
- Genome assembly as finding an Eulerian tour of the de Bruijn graph
- Present short-read assemblers require tremendous computation (human genome: >3B nodes, >10B edges):
  - Velvet (serial): >2 TB of RAM (Zerbino & Birney, 2008)
  - ABySS (MPI): 168 cores × ~96 hours (Simpson et al., 2009)
  - SOAPdenovo (pthreads): 40 cores × 40 hours, >140 GB of RAM (Li et al., 2010)
- Can we get by with MapReduce on commodity clusters? Horizontal scale-out in the cloud!

Fast Graph Compression
- Challenges: nodes are stored on different machines, and a node can only access its direct neighbors
- Randomized solution: randomly assign heads (H) or tails (T) to each compressible node, then compress the H → T links
- Progress on an example graph: initial, 42 nodes; round 1, 26 nodes (38% savings); round 2, 15 nodes (64%); round 3, 6 nodes (86%); round 4, 5 nodes (88%)

Contrail: De Novo Assembly of the Human Genome
- African male NA18507 (SRA000271; Bentley et al., 2008)
- Input: 3.5B 36-bp reads, 210-bp insert (~40x coverage)

  Stage        | Nodes  | Max contig length
  -------------+--------+------------------
  Initial      | >7 B   | 27 bp
  Compressed   | >1 B   | 303 bp
  Clip tips    | 5.0 M  | 14,007 bp
  Pop bubbles  | 4.2 M  | 20,594 bp

Aside: how to do this better...
- MapReduce is a poor abstraction for graphs: no separation of computation from graph structure (the adjacency structure must be shuffled on every iteration), and poor locality causes unnecessary data movement
- Bulk synchronous parallel (BSP) as a better model: Google's Pregel, and its open-source clone Giraph (see the sketch below)
- Interesting (open?) question: how many hammers and how many nails?
(Source: flickr, stuckincustoms/4051325193)
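To illustrate the BSP alternative, here is a toy vertex-centric superstep loop in the spirit of Pregel: each active vertex sends messages to its neighbors, a barrier separates supersteps, and the graph structure stays put instead of being reshuffled. The graph, the label-propagation task (connected components), and all names are assumptions for the sketch, not Pregel's or Giraph's actual API.

```python
# Toy vertex-centric BSP: connected components by propagating the
# minimum vertex id. One loop iteration = one superstep.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}  # adjacency lists
label = {v: v for v in graph}      # each vertex starts as its own component

active = set(graph)
while active:
    messages = {}
    for v in active:               # active vertices message their neighbors
        for u in graph[v]:
            messages.setdefault(u, []).append(label[v])
    active = set()
    for u, msgs in messages.items():   # after the barrier, receive
        m = min(msgs)
        if m < label[u]:           # adopt a smaller label and stay active
            label[u] = m
            active.add(u)

print(label)   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```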
Science. Engineering. Commerce.
The commoditization of large-data processing capabilities allows us to ride the rising tide!
(Sources: Wikipedia, Tide; flickr, 60in3/2338247189)

Best thing since sliced bread?

Distributed programming models: it's all about the right level of abstraction
- MapReduce is the first; definitely not the only, and probably not even the best
- Alternatives: Pig, Dryad/DryadLINQ, Pregel, etc.
- The von Neumann architecture won't cut it anymore

Separating the what from the how
- The developer specifies the computation that needs to be performed
- The execution framework handles the actual execution
- The framework hides system-level details from developers

The datacenter is the computer!
(Source: NY Times, 6/14/2006)

- What exciting applications do new abstractions enable?
- What are the appropriate abstractions for the datacenter computer?
- What new abstractions do applications demand?
- How do we achieve true impact and change the world?

Education
- Teaching students to "think at web scale"
- Rise of the data scientist: a necessary skill set?
(Source: flickr, infidelic/3008675635)

Questions?