Map/Reduce Programming Model
Ahmed Abdelsadek

Outline
• Introduction
• What is Map/Reduce?
• Framework Architecture
• Map/Reduce Algorithm Design
• Tools and Libraries built on top of Map/Reduce

Introduction
• Big Data
• Scaling 'out', not 'up'
• Scaling 'everything' linearly with data size
• Data-intensive applications

Map/Reduce
• Origins
 ▫ Google Map/Reduce
 ▫ Hadoop Map/Reduce
• The Map and Reduce functions are both defined with respect to data structured in (key, value) pairs.

Mapper
• The Map function takes a key/value pair, processes it, and generates zero or more output key/value pairs.
• The input and output types of the mapper can differ from each other.

Reducer
• The Reduce function takes a key and the list of all values associated with it, processes them, and generates zero or more output key/value pairs.
• The input and output types of the reducer can differ from each other.

Mappers/Reducers
• map: (k1, v1) -> [(k2, v2)]
• reduce: (k2, [v2]) -> [(k3, v3)]

WordCount Example
• Problem: count the number of occurrences of every word in a text collection.

  Map(docid a, doc d)
    for all term t in doc d do
      Emit(term t, count 1)

  Reduce(term t, counts [c1, c2, ...])
    sum = 0
    for all count c in counts [c1, c2, ...] do
      sum = sum + c
    Emit(term t, count sum)

• (A concrete Hadoop Java version of this example is sketched at the end of this part, just before Algorithm Design.)

Map/Reduce Framework
Architecture and Execution Overview

Architecture - Overview
• Map/Reduce runs on top of a distributed file system (DFS)

Data Flow

Job Timeline

Job Work Flow

Fault Tolerance
• Task fails
 ▫ Re-execution
• TaskTracker fails
 ▫ Remove the node from the pool of TaskTrackers
 ▫ Re-schedule its tasks
• JobTracker fails
 ▫ Single point of failure: the job fails

Map/Reduce Framework Features
• Locality
 ▫ Move code to the data
• Task granularity
 ▫ The number of map and reduce tasks should be much larger than the number of machines (dynamic load balancing!), but not too large!
• Backup tasks
 ▫ Avoid slow workers
 ▫ Launched near completion

Map/Reduce Framework Features
• Skipping bad records
 ▫ Many failures on the same record
• Local execution
 ▫ Debug in isolation
• Status information
 ▫ Progress of computations
• User counters, report progress
 ▫ Periodically propagated to the master node

Hadoop Streaming and Pipes
• APIs to MapReduce that allow you to write your map and reduce functions in languages other than Java
• Hadoop Streaming
 ▫ Uses Unix standard streams as the interface between Hadoop and your program
 ▫ You can use any language that can read standard input and write to standard output
• Hadoop Pipes (for C++)
 ▫ Pipes uses sockets as the channel to communicate with the process running the C++ map or reduce function
 ▫ JNI is not used

Keep in Mind
• The programmer has little control over many aspects of execution
 ▫ Where a mapper or reducer runs (i.e., on which node in the cluster)
 ▫ When a mapper or reducer begins or finishes
 ▫ Which input key-value pairs are processed by a specific mapper
 ▫ Which intermediate key-value pairs are processed by a specific reducer
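To make the WordCount example above concrete, here is a minimal sketch of the mapper and reducer using the Hadoop Java API (org.apache.hadoop.mapreduce). The class names and the whitespace tokenization are illustrative choices, and the job driver that wires the two classes together is omitted; normally each class lives in its own file or as a static nested class of the driver.

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: for each input line, emit (term, 1) for every token.
  public class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);           // (term t, count 1)
      }
    }
  }

  // Reducer: sum all counts associated with the same term.
  class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(term, new IntWritable(sum));  // (term t, count sum)
    }
  }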
Map/Reduce Algorithm Design

Partitioners
• Dividing up the intermediate key space
• Simplest: hash value of the key mod the number of reducers
 ▫ Assigns roughly the same number of keys to each reducer
 ▫ Only considers the key and ignores the value
 ▫ May yield large differences in the number of values sent to each reducer
• More complex partitioning algorithms handle the imbalance in the amount of data associated with each key

Combiners
• In the WordCount example, the amount of intermediate data is larger than the input collection itself
• Combiners are an optimization for local aggregation before the shuffle and sort phase
 ▫ Compute a local count for a word over all the documents processed by the mapper
• Think of combiners as "mini-reducers"
 ▫ However, combiners and reducers are not always interchangeable
• Combiner input and output pairs are the same type as mapper output pairs
 ▫ Which is also the reducer input pair type
• A combiner may be invoked zero, one, or multiple times
• A combiner can emit any number of key-value pairs

Complete View of Map/Reduce

Local Aggregation
• Network and disk latency are high!
• Features that help local aggregation
 ▫ A single (Java) Mapper object handles multiple (key, value) pairs in an input split (preserves state across multiple calls of the map() method)
 ▫ Share in-object data structures and counters
 ▫ Initialization and finalization code shared across all map() calls in a single task
 ▫ JVM reuse across multiple tasks on the same machine

Basic WordCount Example

Per-Document Aggregation
• Use an associative array inside the map() call to sum up term counts within a single document
• Emit a key-value pair for each unique term, instead of emitting a key-value pair for each term occurrence in the document
 ▫ Substantial savings in the number of intermediate key-value pairs emitted

Per-Mapper Aggregation
• Use an associative array inside the Mapper object to sum up term counts across multiple documents

In-Mapper Combining
• Pros
 ▫ More control over when local aggregation occurs and how exactly it takes place (recall: no guarantees on combiners)
 ▫ More efficient than using actual combiners: no additional overhead for object creation and for serializing, reading, and writing the key-value pairs
• Cons
 ▫ Breaks the functional programming model (not a big deal!)
 ▫ Scalability bottleneck: needs sufficient memory to store intermediate results
   Solution: block and flush, after every N key-value pairs have been processed or every M bytes have been used.
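A minimal sketch of the in-mapper combining pattern applied to WordCount, using the Hadoop Java API. The in-memory HashMap and the FLUSH_SIZE threshold are illustrative choices, not part of any standard interface; together they implement the "block and flush" idea from the slide above.

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // WordCount mapper with in-mapper combining: partial counts are
  // accumulated in an in-memory map and emitted only when the map
  // grows too large (block and flush) or when the task finishes.
  public class InMapperCombiningMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int FLUSH_SIZE = 100_000;   // illustrative threshold
    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
      counts = new HashMap<>();                      // state shared across map() calls
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        counts.merge(token, 1, Integer::sum);        // local aggregation
      }
      if (counts.size() >= FLUSH_SIZE) {
        flush(context);                              // bound memory usage
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      flush(context);                                // emit whatever is left
    }

    private void flush(Context context)
        throws IOException, InterruptedException {
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
      counts.clear();
    }
  }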
Correctness with Local Aggregation
• Combiners are viewed as optional optimizations
 ▫ The correctness of the algorithm should not depend on their computations
• Combiners and reducers are not interchangeable
 ▫ Unless the reduce computation is both commutative and associative
• Make sure of the semantics of your aggregation algorithm
 ▫ Notice for example

Pairs and Stripes
• In some problems, a common approach is to construct complex keys and values to achieve more efficiency
• Example: building a word co-occurrence matrix from a large document collection
 ▫ Formally, the co-occurrence matrix of a corpus is a square N x N matrix, where N is the number of unique words in the corpus
 ▫ Cell Mij contains the number of times word Wi co-occurred with word Wj

Pairs Approach
• Mapper: emits the co-occurring word pair as the key and the integer one as the value
• Reducer: sums up all the values associated with the same co-occurring word pair

Pairs Approach
• The pairs algorithm generates a massive number of key-value pairs
• Combiners have few opportunities to perform local aggregation
• The sparsity of the key space also limits the effectiveness of in-memory combining

Stripes Approach
• Store co-occurrence information in an associative array
• Mapper: emits words as keys and associative arrays as values
• Reducer: element-wise sum of all associative arrays with the same key

Stripes Approach
• Much more compact representation
• Far fewer intermediate key-value pairs
• More opportunities to perform local aggregation
• May cause potential scalability bottlenecks in the algorithm

Which approach is faster?
• APW (Associated Press Worldstream): a corpus of 2.27 million documents totaling 5.7 GB

Computing Relative Frequencies
• In the previous example, the (Wi, Wj) co-occurrence count may be high just because one of the words is very common!
• Solution: compute relative frequencies

Relative Frequencies with Stripes
• Straightforward!
• In the Reducer:
 ▫ Sum the counts of all words that co-occur with the key word
 ▫ Divide each count by that sum to get the relative frequency!
• Lessons:
 ▫ Use of complex data structures to coordinate distributed computations
 ▫ Appropriate structuring of keys and values brings together all the pieces of data required to perform a computation
• Drawback?
 ▫ As before, this algorithm assumes that each associative array fits into memory (scalability bottleneck!)

Relative Frequencies with Pairs
• The Reducer receives (Wi, Wj) as the key and the counts as the value
 ▫ From this alone it is not possible to compute f(Wj | Wi)
• Hint: Reducers, like Mappers, can preserve state across multiple keys
• Solution: at the reducer side, buffer in memory all the words that co-occur with Wi
 ▫ In essence, building the associative array of the stripes approach
• Problem?
 ▫ Word pairs can arrive in any arbitrary order!
• Solution: we must define the sort order of the pair
 ▫ Keys are first sorted by the left word, and then by the right word
• So that, when the left word changes ->
 ▫ Sum, calculate and emit the results, and flush the memory

Relative Frequencies with Pairs
• Problem?
 ▫ Pairs with the same left word may be sent to different reducers!
• Solution?
 ▫ We must ensure that all pairs with the same left word are sent to the same reducer
• How?
 ▫ Custom Partitioners! Pay attention to the left word and partition based on its hash only
• Will it work?
 ▫ Yeah!
• Drawback?
 ▫ Still a scalability bottleneck!
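A minimal sketch of the custom partitioner described above, assuming the pair key is encoded as a Text of the form "leftWord rightWord" (a real implementation would more likely use a dedicated WritableComparable pair type). Only the left word determines the target reducer.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Partitioner for the pairs approach: route every (Wi, Wj) key to the
  // reducer chosen by Wi alone, so all pairs sharing a left word meet
  // at the same reducer.
  public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
      // Assumed key encoding: "leftWord rightWord" separated by a single space.
      String left = key.toString().split(" ", 2)[0];
      // Mask off the sign bit before taking the modulus.
      return (left.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
  }

The driver would register it with job.setPartitionerClass(LeftWordPartitioner.class).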
Relative Frequencies with Pairs
• Another approach? With no bottlenecks?
• Can we compute, or 'have', the sum before processing the pair counts?
• The notion of 'before' and 'after' can be captured in the ordering of the key-value pairs
• The insight lies in properly sequencing the data presented to the reducer
 ▫ The programmer should define the sort order of keys so that data needed earlier is presented earlier to the reducer
• So now, we need two things
 ▫ Compute the sum for a given word Wi
 ▫ Send that sum to the reducer before any word pair where Wi is its left side

Relative Frequencies with Pairs
• How?
• To get the sum
 ▫ Modify the Mapper to additionally emit a 'special' key (Wi, *), with a value of one
• To ensure the order
 ▫ Define the sort order of the keys so that pairs with the special symbol, of the form (Wi, *), are ordered before any other key-value pairs whose left word is Wi
• In addition:
 ▫ Make the partitioner pay attention to only the left word

Relative Frequencies with Pairs
• Example
• Memory bottlenecks?
 ▫ No!

Order Inversion Design Pattern
• To summarize
 ▫ Emit a special key-value pair for the sum
 ▫ Control the sort order of the intermediate key
 ▫ Define a custom partitioner
 ▫ Preserve state across multiple keys in the reducer
• Quite a common pattern in many problems
• The key insight
 ▫ Convert the sequencing of computations into a sorting problem

Secondary Sort
• In addition to sorting by key, we also need to sort by value
• Implemented in Google's MapReduce, but not in Hadoop
• Two main techniques
 ▫ Buffer all the readings in memory and then sort
   May lead to too much memory consumption
 ▫ Value-to-key conversion
   Move part of the value into the intermediate key to form a composite key
   We must define the intermediate key sort order
   We must define the partitioner so that all pairs associated with the same key are sent to the same reducer
   The reducer will need to preserve state across multiple pairs
   May lead to too many intermediate pairs

Relational Joins
• For databases, data warehousing, and data analytics
• Semi-structured data
• Example of a join
 ▫ Where S and T are datasets (relations), k is the key we want to join on, si and ti are the unique IDs of tuples in S and T respectively, and Si and Ti are the rest of the tuple attributes

Reduce-side Join
• One-to-one join
 ▫ Emit the tuple's join attribute as the key and the rest of the attributes as the value
• One-to-many join
 ▫ Buffer all tuples in memory, or
 ▫ Use the value-to-key pattern

Reduce-side Join
• Many-to-many join
 ▫ The previous algorithm works as well
 ▫ The smaller set should come first
 ▫ The reducer will buffer it in memory
• Lessons
 ▫ The basic idea is to repartition the two datasets by the join key
 ▫ Not efficient, since it shuffles both datasets across the network
• (A Java sketch of a reduce-side join follows the map-side join slides below.)

Map-side Joins
• Assume the datasets are
 ▫ Both sorted by the join key
 ▫ Divided into the same number of files
 ▫ Partitioned in the same manner by the join key
 ▫ In each file, tuples are sorted by the join key
• We can perform a join by scanning through both datasets simultaneously
 ▫ This is known as a merge join
• Parallelize by partitioning and sorting both datasets in the same manner
 ▫ Map over one of the datasets (the larger one)
 ▫ Inside the mapper, read the corresponding part of the other dataset (a non-local read)
 ▫ Perform the merge join

Map-side Joins
• More efficient than a reduce-side join
 ▫ Doesn't shuffle both datasets across the network
• Drawback:
 ▫ Strong assumptions about the input file format
• Advice
 ▫ If used in a workflow with multiple Map/Reduce jobs, ensure the previous reducer writes its output in a convenient format.
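A minimal sketch of the reduce-side join described above, assuming both datasets are tab-separated text files with the join key as the first field. The "S" and "T" tags and the record layout are illustrative assumptions; in the driver, each mapper could be attached to its own input path with MultipleInputs.addInputPath(...).

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper for dataset S: input lines of the form "k<TAB>attributes".
  // Emits the join key and the tuple tagged with its source relation.
  public class SJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      context.write(new Text(fields[0]), new Text("S\t" + fields[1]));
    }
  }

  // Mapper for dataset T: same idea, different tag.
  class TJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      context.write(new Text(fields[0]), new Text("T\t" + fields[1]));
    }
  }

  // Reducer: for each join key, buffer the (assumed smaller) S side and
  // cross it with every T tuple. Value-to-key conversion would avoid
  // buffering by forcing the S tuples to arrive first.
  class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> sTuples = new ArrayList<>();
      List<String> tTuples = new ArrayList<>();
      for (Text v : values) {
        String[] tagged = v.toString().split("\t", 2);
        if ("S".equals(tagged[0])) sTuples.add(tagged[1]);
        else tTuples.add(tagged[1]);
      }
      for (String s : sTuples) {
        for (String t : tTuples) {
          context.write(key, new Text(s + "\t" + t));   // joined tuple
        }
      }
    }
  }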
Memory-backed Join
• If one of the datasets can fit in memory
• Load it in memory
• Map over the other dataset
• Use random access to tuples based on the join key
• Great performance improvement
• (A Java sketch of this pattern appears later in this part.)

Summary
• In-mapper combining
 ▫ Aggregates partial results
 ▫ Emits fewer intermediate pairs
• Pairs and Stripes
 ▫ Keep track of joint events
   One by one (pairs)
   In stripe fashion
• Order inversion
 ▫ Convert the sequencing of computations into a sorting problem
• Value-to-key conversion
 ▫ Scalable solution for secondary sorting
 ▫ Moving part of the value into the key

Before we go!
• Remember: limitations of the Map/Reduce model
 ▫ Map/Reduce is mainly designed for batch processing, not for online queries
 ▫ It prevents modifying or adding input data while the job is running, as well as modifying the number of machines
 ▫ A Map/Reduce job has a single entry and a single exit
   We cannot keep it alive waiting for an event to trigger it
 ▫ Map/Reduce works on flat files
   Lack of schema support

What's Next?

Map/Reduce vs RDBMS
• A living debate in the database and data analytics communities
• In 2008, D. DeWitt and M. Stonebraker wrote "MapReduce: A major step backwards"
 ▫ A giant step backward in the programming paradigm
 ▫ An implementation that uses brute force instead of indexing
 ▫ Not novel at all -- well-known techniques developed nearly 25 years ago
 ▫ Missing most of the features that are routinely included in current DBMSs
 ▫ Incompatible with all of the tools DBMS users have come to depend on
• MapReduce is missing features
 ▫ Indexing, bulk loading, updates, transactions, integrity constraints, referential integrity, views
• MapReduce is incompatible with DBMS tools
 ▫ Report writers, business intelligence tools, data mining tools, replication tools, database design tools

Map/Reduce vs RDBMS
• In 2010, the same authors and others wrote "MapReduce and Parallel DBMSs: Friends or Foes?"
• Where they argue that
 ▫ Map/Reduce is a complement to DBMSs, not a competitor
 ▫ They are used in different application domains
• Parallel DBMSs excel at efficient querying of large data sets
• MR-style systems excel at ETL (extract-transform-load) tasks

NoSQL
• Mechanisms for storage and retrieval of data that use looser consistency models than traditional relational databases
 ▫ To achieve higher scalability and availability
• Usually in the form of a key-value store
• Built on top of distributed file systems
• Examples
 ▫ Google Bigtable
 ▫ Apache HBase
 ▫ Apache Cassandra
 ▫ Amazon Dynamo

Tools on top of Hadoop
• Apache Pig
 ▫ Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce
 ▫ Apache Pig features "Pig Latin", a relational data-flow language that enables SQL-like queries to be performed on distributed datasets within Hadoop applications
 ▫ Pig originated as a Yahoo! Research project
 ▫ In 2007, Pig became an open source project of the Apache Software Foundation

Apache Pig
• Pig Latin Example

Apache Pig
• Pig execution flow

Tools on top of Hadoop
• Apache Hive
 ▫ Hive is a data warehouse system for the open source Apache Hadoop project
 ▫ Hive features a SQL-like HiveQL language that facilitates data analysis and summarization for large datasets stored in Hadoop-compatible file systems
 ▫ Hive originated as a Facebook project
 ▫ It later became an open source project under the Apache Software Foundation

Apache Hive
• HiveQL Example
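Referring back to the memory-backed join slide earlier in this part, here is a minimal sketch of a map-side hash join in the Hadoop Java API. The configuration key join.small.table.path and the tab-separated record layout are illustrative assumptions; the small dataset is assumed to be available locally on every node (for example, shipped via the distributed cache).

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Memory-backed (map-side hash) join: the small dataset is loaded into
  // an in-memory hash table once per task; the mapper then streams over
  // the large dataset and joins by direct lookup, with no shuffle at all.
  public class MemoryBackedJoinMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
      // Hypothetical configuration key; the file is assumed to be readable
      // from a local path on every node.
      String path = context.getConfiguration().get("join.small.table.path");
      try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split("\t", 2);   // assumed "k<TAB>attributes"
          smallTable.put(fields[0], fields[1]);
        }
      }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      String match = smallTable.get(fields[0]);    // random access by join key
      if (match != null) {
        context.write(new Text(fields[0]), new Text(fields[1] + "\t" + match));
      }
    }
  }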
Pig vs Hive
• They are/were independent projects and there was no centrally coordinated goal
• They were in different spaces early on and have grown to overlap over time as both projects expanded
• Some differences are
 ▫ Pig Latin is procedural, whereas HiveQL is declarative
 ▫ Pig Latin allows developers to insert their own code almost anywhere in the data pipeline
• Both compile to Map and Reduce jobs

Libraries on top of Hadoop
• Mahout
 ▫ Machine learning library used to build scalable machine learning algorithms

Libraries on top of Hadoop
• HIPI (Hadoop Image Processing Interface)
 ▫ Framework that provides an API for performing image processing tasks in a distributed computing environment

Summary
• Map/Reduce
• Framework Architecture
• Map/Reduce Algorithm Design
• Tools and Libraries built on top of Map/Reduce

Demo
• Starting the Hadoop cluster
• Copying data to HDFS
• Compiling our Java Map/Reduce code and creating the Jar file
• Submitting the Hadoop job
• Showing progress and dashboards
• Retrieving the output from HDFS
• Shutting down the Hadoop cluster

Appendix
• Hadoop Configurations
• Single node
 ▫ Simple guide: http://hadoop.apache.org/docs/stable/single_node_setup.html
 ▫ More detailed: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
• Cluster setup
 ▫ Simple guide: http://hadoop.apache.org/docs/stable/cluster_setup.html
 ▫ More detailed: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Appendix
• Packages to install on Linux
 ▫ Hadoop: http://apache.mirror.nexicom.net/hadoop/common/hadoop-1.1.2/hadoop-1.1.2.tar.gz
 ▫ Oracle Java 7: http://download.oracle.com/otn-pub/java/jdk/7u25-b15/jdk-7u25-linux-x64.tar.gz
 ▫ SSH
   $ sudo apt-get install ssh
   $ sudo apt-get install rsync

Appendix
• Studying materials
 ▫ "Data-Intensive Text Processing with MapReduce", Jimmy Lin and Chris Dyer
 ▫ "Hadoop: The Definitive Guide", Tom White
 ▫ "MapReduce Design Patterns", Donald Miner and Adam Shook

Questions?