COP5725 Advanced Database Systems, Spring 2016
MapReduce
Tallahassee, Florida, 2016

What is MapReduce?
• Programming model
  – expressing distributed computations at a massive scale
  – “the computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce”
• Execution framework
  – organizing and performing data-intensive computations
  – processing parallelizable problems across huge datasets using a large number of computers (nodes)
• Open-source implementation: Hadoop and others

How Much Data?
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s LHC (Large Hadron Collider) will generate 15 PB a year
(“640K ought to be enough for anybody”)

Who cares?
• Ready-made large-data problems
  – Lots of user-generated content, even more user-behavior data
    • Examples: Facebook friend suggestions, Google ad placement
  – Business intelligence: gather everything in a data warehouse and run analytics to generate insight
• Utility computing
  – Provision Hadoop clusters on demand in the cloud
  – Lower barriers to entry for tackling large-data problems
  – Commoditization and democratization of large-data capabilities

Spread Work Over Many Machines
• Challenges
  – Workload partitioning: how do we assign work units to workers?
  – Load balancing: what if we have more work units than workers?
  – Synchronization: what if workers need to share partial results?
  – Aggregation: how do we aggregate partial results?
  – Termination: how do we know all the workers have finished?
  – Fault tolerance: what if workers die?
• Common theme
  – Communication between workers (e.g., to exchange state)
  – Access to shared resources (e.g., data)
• We need a synchronization mechanism

Current Methods
• Programming models
  – Shared memory (pthreads)
  – Message passing (MPI)
• Design Patterns
  – Master-slaves
  – Producer-consumer flows
  – Shared work queues
[Figure: a shared-memory architecture (processes P1–P5 attached to a common memory) vs. a message-passing architecture (processes P1–P5 exchanging messages); and the master-slaves, producer-consumer, and shared work queue patterns.]

Problem with Current Solutions
• Lots of programming work
  – communication and coordination
  – workload partitioning
  – status reporting
  – optimization
  – locality
• Repeat for every problem you want to solve
• Stuff breaks
  – One server may stay up three years (~1,000 days)
  – If you have 10,000 servers, expect to lose 10 a day

What We Need
• A distributed system that is:
  – Scalable
  – Fault-tolerant
  – Easy to program
  – Applicable to many problems
  – …

How Do We Scale Up?
• Divide & Conquer
[Figure: the “Work” is partitioned into w1, w2, w3; each partition is handled by a “worker”, producing partial results r1, r2, r3, which are combined into the final “Result”.]
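The partition/worker/combine structure in the figure can be written down directly. The sketch below is only an illustration of the pattern, not material from the slides; it assumes Python's multiprocessing pool and a toy summation task.

```python
# Illustration only: the partition / worker / combine pattern from the figure,
# using a process pool and a toy summation task.
from multiprocessing import Pool

def worker(chunk):
    # Each worker computes a partial result r_i from its partition w_i.
    return sum(chunk)

def combine(partials):
    # Combine the partial results r_1..r_n into the final result.
    return sum(partials)

if __name__ == "__main__":
    work = list(range(1_000_000))
    # Partition the work into fixed-size chunks w_1..w_n.
    chunks = [work[i:i + 100_000] for i in range(0, len(work), 100_000)]
    with Pool() as pool:
        partials = pool.map(worker, chunks)   # run the workers in parallel
    print(combine(partials))                  # 499999500000
```

Even in this toy form, the programmer still owns partitioning, scheduling, and failure handling; the point of MapReduce is to fix this pattern once and let the framework take over those concerns.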
General Ideas
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
• Key idea: provide a functional abstraction for these two operations
  – map (k, v) → <k’, v’>
  – reduce (k’, v’) → <k’’, v’’>
• All values with the same key are sent to the same reducer
  – The execution framework handles everything else…

General Ideas
[Figure: mappers consume (k1,v1) … (k6,v6) and emit intermediate pairs (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8); shuffle and sort aggregates values by key into a→{1,5}, b→{2,7}, c→{2,3,6,8}; three reducers then emit (r1,s1), (r2,s2), (r3,s3).]

Two More Functions
• Apart from map and reduce, the execution framework handles everything else…
• Not quite… usually, programmers can also specify:
  – partition (k’, number of partitions) → partition for k’
    • Divides up the key space for parallel reduce operations
    • Often a simple hash of the key, e.g., hash(k’) mod n
  – combine (k’, v’) → <k’, v’>*
    • Mini-reducers that run in memory after the map phase
    • Used as an optimization to reduce network traffic
[Figure: the same dataflow with combiners and partitioners. Each mapper's output is first combined locally, e.g., (c,3) and (c,6) from one mapper become (c,9), and then partitioned; after shuffle and sort the reducers see a→{1,5}, b→{2,7}, c→{2,9,8} and emit (r1,s1), (r2,s2), (r3,s3).]

Motivation for Local Aggregation
• Ideal scaling characteristics:
  – Twice the data, twice the running time
  – Twice the resources, half the running time
• Why can’t we achieve this?
  – Synchronization requires communication
  – Communication kills performance
• Thus… avoid communication!
  – Reduce intermediate data via local aggregation
  – Combiners can help

Word Count v1.0
• Input: {<document-id, document-contents>}
• Output: <word, num-occurrences-in-web>, e.g., <“obama”, 1000>
• Example dataflow:
  – Map:
    <doc1, “obama is the president”> → <“obama”, 1>, <“is”, 1>, <“the”, 1>, <“president”, 1>
    <doc2, “hennesy is the president of stanford”> → <“hennesy”, 1>, <“is”, 1>, <“the”, 1>, <“president”, 1>, …
    …
    <docn, “this is an example”> → <“this”, 1>, <“is”, 1>, <“an”, 1>, <“example”, 1>
  – Group by reduce key:
    <“the”, {1, 1}>, <“obama”, {1}>, <“is”, {1, 1, 1}>, …
  – Reduce:
    <“the”, 2>, <“obama”, 1>, <“is”, 3>, …
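The v1.0 implementation itself appears on the slide as code and is not reproduced in this text. The following is a rough sketch of the same logic in plain Python, with a small driver standing in for the framework's shuffle and sort; the function names and the simulated driver are illustrative assumptions, not the Hadoop API.

```python
# Illustrative sketch of Word Count v1.0 (not the original slide code):
# the mapper emits (word, 1) for every token, the reducer sums the counts.
from collections import defaultdict

def map_fn(doc_id, contents):
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def run_job(documents):
    # Simulate the framework: run mappers, group by key (shuffle/sort), run reducers.
    groups = defaultdict(list)
    for doc_id, contents in documents:
        for key, value in map_fn(doc_id, contents):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results

docs = [("doc1", "obama is the president"),
        ("doc2", "hennesy is the president of stanford")]
print(run_job(docs))
# {'hennesy': 1, 'is': 2, 'obama': 1, 'of': 1, 'president': 2, 'stanford': 1, 'the': 2}
```

The Word Count v2.0 and v3.0 slides that follow refine this job; their code is likewise not reproduced here, but a typical refinement at this point (an assumption, not something shown above) is to aggregate counts locally, in a combiner or inside the mapper, so that fewer (word, 1) pairs cross the network, which is exactly the local aggregation motivated earlier.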
Word Count v2.0

Word Count v3.0

Combiner Design
• Combiners and reducers share the same method signature
  – Sometimes, reducers can serve as combiners
  – Often, not…
• Remember: combiners are optional optimizations
  – They should not affect algorithm correctness
  – They may be run 0, 1, or multiple times
• Example: find the average of all integers associated with the same key

Computing the Mean v1.0
• Why can’t we use the reducer as a combiner?

Computing the Mean v2.0
• Why doesn’t this work?
  – Combiners must have the same input and output key-value types, which must also match the mapper output type and the reducer input type

Computing the Mean v3.0

Computing the Mean v4.0
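The v3.0 and v4.0 implementations are again shown only as code on the slides. A sketch that satisfies the type constraint just stated is to have the mapper emit (sum, count) pairs, so the combiner's input and output types match the mapper output and the reducer input, and running the combiner zero, one, or many times gives the same answer. Whether this matches the v3.0 slide exactly is an assumption; the plain Python below is only an illustration.

```python
# Illustrative sketch of computing the mean with a legal combiner
# (not the original slide code).
def map_fn(key, value):
    yield (key, (value, 1))          # emit (sum, count) pairs, not raw values

def combine_fn(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, (total, count))      # same type as the mapper output

def reduce_fn(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, total / count)       # the mean is only computed here

if __name__ == "__main__":
    data = [("x", 2), ("x", 4), ("y", 10)]
    grouped = {}
    for k, v in data:
        for key, pair in map_fn(k, v):
            grouped.setdefault(key, []).append(pair)
    # Optional combine step: may run zero or more times without changing the result.
    combined = {k: list(combine_fn(k, pairs))[0][1] for k, pairs in grouped.items()}
    for k, pair in combined.items():
        for key, mean in reduce_fn(k, [pair]):
            print(key, mean)         # x 3.0, y 10.0
```

A plausible reading of v4.0 is that it pushes this aggregation into the mapper itself (in-mapper combining), but that, too, is an assumption since the slide code is not reproduced here.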
MapReduce Runtime
• Handles scheduling
  – Assigns workers to map and reduce tasks
• Handles “data distribution”
  – Moves processes to data
• Handles synchronization
  – Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
  – Detects worker failures and restarts their tasks
• Everything happens on top of a distributed file system

Execution
[Figure: execution overview. (1) The user program submits a job to the master; (2) the master schedules map and reduce tasks onto workers; (3) map workers read their input splits (split 0 through split 4); (4) map output is written to local disk; (5) reduce workers remote-read the intermediate files; (6) reduce workers write the output files (output file 0, output file 1). Phases: input files → map phase → intermediate files (on local disk) → reduce phase → output files.]

Implementation
• Google has a proprietary implementation in C++
  – Bindings in Java, Python
• Hadoop is an open-source implementation in Java
  – Development led by Yahoo, used in production
  – Now an Apache project
  – Rapidly expanding software ecosystem
• Lots of custom research implementations
  – For GPUs, Cell processors, etc.

Distributed File System
• Don’t move data to workers… move workers to the data!
  – Store data on the local disks of nodes in the cluster
  – Start up the workers on the nodes that hold the data locally
• Why?
  – Not enough RAM to hold all the data in memory
  – Disk access is slow, but disk throughput (data transfer rate) is reasonable
• A distributed file system is the answer
  – GFS (Google File System) for Google’s MapReduce
  – HDFS (Hadoop Distributed File System) for Hadoop

GFS
• Commodity hardware over “exotic” hardware
  – Scale “out”, not “up”
    • Scale out (horizontally): add more nodes to a system
    • Scale up (vertically): add resources to a single node in a system
• High component failure rates
  – Inexpensive commodity components fail all the time
• “Modest” number of huge files
  – Multi-gigabyte files are common, if not encouraged
• Files are write-once, mostly appended to
  – Perhaps concurrently
• Large streaming reads over random access
  – High sustained throughput over low latency

Seeks vs. Scans
• Consider a 1 TB database with 100-byte records
  – We want to update 1 percent of the records
• Scenario 1: random access
  – Each update takes ~30 ms (seek, read, write)
  – 10^8 updates × 30 ms ≈ 35 days
• Scenario 2: rewrite all records
  – Assume 100 MB/s throughput
  – Reading and rewriting the full 1 TB takes about 2 × 10^4 s ≈ 5.6 hours(!)
• Lesson: avoid random seeks!

GFS
• Files stored as chunks
  – Fixed size (64 MB)
• Reliability through replication
  – Each chunk replicated across 3+ chunk servers
• Single master to coordinate access and keep metadata
  – Simple centralized management
• No data caching
  – Little benefit due to large datasets and streaming reads
• Simplified API
  – Pushes some of the issues onto the client (e.g., data layout)

Relational Databases vs. MapReduce
• Relational databases:
  – Multipurpose: analysis and transactions; batch and interactive
  – Data integrity via ACID transactions
  – Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
  – Support for SQL (and SQL integration, e.g., JDBC)
  – Automatic SQL query optimization
• MapReduce (Hadoop):
  – Designed for large clusters, fault tolerant
  – Data is accessed in “native format”
  – Supports many query languages
  – Programmers retain control over performance
  – Open source

Workloads
• OLTP (online transaction processing)
  – Typical applications: e-commerce, banking, airline reservations
  – User-facing: real-time, low latency, highly concurrent
  – Tasks: relatively small set of “standard” transactional queries
  – Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• OLAP (online analytical processing)
  – Typical applications: business intelligence, data mining
  – Back-end processing: batch workloads, less concurrency
  – Tasks: complex analytical queries, often ad hoc
  – Data access pattern: table scans, large amounts of data involved per query

OLTP/OLAP Integration
• OLTP database for user-facing transactions
  – Retain records of all activity
  – Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
  – Extract records from the source
  – Transform: clean data, check integrity, aggregate, etc.
  – Load into the OLAP database
• OLAP database for data warehousing
  – Business intelligence: reporting, ad hoc queries, data mining, etc.
  – Feedback to improve OLTP services

Relational Algebra in MapReduce
• Projection
  – Map over tuples, emit new tuples with the appropriate attributes
  – No reducers, unless for regrouping or resorting tuples
  – Alternatively: perform in the reducer, after some other processing
• Selection
  – Map over tuples, emit only tuples that meet the criteria
  – No reducers, unless for regrouping or resorting tuples
  – Alternatively: perform in the reducer, after some other processing

Relational Algebra in MapReduce
• Group by
  – Example: what is the average time spent per URL?
  – In SQL:
    • SELECT url, AVG(time) FROM visits GROUP BY url
  – In MapReduce:
    • Map over tuples, emit time, keyed by url
    • Framework automatically groups values by key
    • Compute the average in the reducer
    • Optimize with combiners

Join in MapReduce
• Reduce-side join: group by join key
  – Map over both sets of tuples
  – Emit each tuple as the value, with the join key as the intermediate key
  – The execution framework brings together tuples sharing the same key
  – Perform the actual join in the reducer
  – Similar to a “sort-merge join” in database terminology

Reduce-side Join: Example
[Figure: in the map phase, tuples R1, R4, S2, S3 are emitted keyed by their join keys; in the reduce phase, tuples sharing a join key are grouped together, e.g., one key receives {R1, S2, S3} and another receives {R4}. Note: there is no guarantee whether the R tuple or the S tuples arrive first.]

Join in MapReduce
• Map-side join: parallel scans
  – Assume the two datasets are sorted by the join key
[Figure: two datasets, R1–R4 and S1–S4, each sorted by join key, scanned side by side.]
  – A sequential scan through both datasets produces the join (called a “merge join” in database terminology)

Join in MapReduce
• Map-side join
  – If the datasets are sorted by the join key, the join can be accomplished by a scan over both datasets
  – How can we accomplish this in parallel?
    • Partition and sort both datasets in the same manner
  – In MapReduce:
    • Map over one dataset, read the corresponding partition from the other
    • No reducers necessary (unless to repartition or resort)

Join in MapReduce
• In-memory join
  – Basic idea: load one dataset into memory, stream over the other dataset
    • Works if R << S and R fits into memory
    • Called a “hash join” in database terminology
  – MapReduce implementation
    • Distribute R to all nodes
    • Map over S; each mapper loads R into memory, hashed by the join key
    • For every tuple in S, look up the join key in R
    • No reducers, unless for regrouping or resorting tuples
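As a concrete illustration of the reduce-side strategy described above, here is a rough sketch in plain Python (not slide code); tagging each tuple with its source relation and the simulated shuffle are illustrative assumptions.

```python
# Illustrative sketch of a reduce-side join: both relations are mapped to
# (join_key, (relation_tag, tuple)) pairs, the shuffle groups tuples by join
# key, and the reducer pairs them up.
from collections import defaultdict

def map_r(tup):                      # tup = (join_key, rest of R tuple)
    yield (tup[0], ("R", tup))

def map_s(tup):                      # tup = (join_key, rest of S tuple)
    yield (tup[0], ("S", tup))

def reduce_join(key, tagged_values):
    # No ordering guarantee between R and S tuples, so separate them by tag first.
    r_side = [t for tag, t in tagged_values if tag == "R"]
    s_side = [t for tag, t in tagged_values if tag == "S"]
    for r in r_side:
        for s in s_side:
            yield (key, (r, s))

# Simulated run with toy relations R(join_key, a) and S(join_key, b).
R = [(1, "r1"), (3, "r4")]
S = [(1, "s2"), (1, "s3")]
groups = defaultdict(list)
for tup in R:
    for k, v in map_r(tup):
        groups[k].append(v)
for tup in S:
    for k, v in map_s(tup):
        groups[k].append(v)
for k, vals in groups.items():
    for joined in reduce_join(k, vals):
        print(joined)   # key 1 joins r1 with s2 and s3; key 3 has no match
```

In practice a secondary sort is often used so that the R tuple for a key arrives at the reducer before the S tuples, which avoids buffering one side in memory; that refinement is omitted here.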
Which Join Algorithm to Use?
• In-memory join > map-side join > reduce-side join
  – Why?
• Limitations of each?
  – In-memory join: memory
  – Map-side join: sort order and partitioning
  – Reduce-side join: no special requirements; it is the general-purpose (and slowest) option

Processing Relational Data: Summary
• MapReduce algorithms for processing relational data:
  – Group by, sorting, and partitioning are handled automatically by the shuffle/sort in MapReduce
  – Selection, projection, and other computations (e.g., aggregation) are performed in either the mapper or the reducer
  – Multiple strategies for relational joins
• Complex operations require multiple MapReduce jobs
  – Example: the top ten URLs in terms of average time spent
  – Opportunities for automatic optimization
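The summary's multi-job example, the top ten URLs by average time spent, can be sketched as two chained jobs: the first computes the average time per URL (the mean pattern from earlier), and the second routes every (url, average) pair to a single reducer that keeps the ten largest. The sketch below is an illustration over made-up data in plain Python, not material from the slides.

```python
# Illustrative sketch of "top ten URLs by average time spent" as two chained
# MapReduce jobs; the driver simulates the framework's shuffle and sort.
import heapq
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    out = []
    for k, vals in groups.items():
        out.extend(reduce_fn(k, vals))
    return out

# Job 1: group visits by URL and compute the average time per URL.
def map1(visit):                     # visit = (url, time)
    url, time = visit
    yield (url, (time, 1))

def reduce1(url, pairs):
    total = sum(t for t, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (url, total / count)

# Job 2: send every (url, avg) pair to a single reducer, which keeps the ten largest.
def map2(record):
    yield (None, record)             # one key, so one reducer sees everything

def reduce2(_, url_avgs):
    yield ("top10", heapq.nlargest(10, url_avgs, key=lambda x: x[1]))

visits = [("a.com", 10), ("a.com", 30), ("b.com", 5), ("c.com", 50)]
averages = run_job(visits, map1, reduce1)   # [("a.com", 20.0), ("b.com", 5.0), ("c.com", 50.0)]
print(run_job(averages, map2, reduce2))     # [("top10", [("c.com", 50.0), ("a.com", 20.0), ("b.com", 5.0)])]
```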