MapReduce-I MapReduce I MapReduce-I September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-I Knowledge objectives 1. 2. 3. 4. Enumerate several use cases of MapReduce Describe what the MapReduce environment is Explain 6 benefits of using MapReduce Explain four problems of MapReduce September 2013 Alberto Abelló & Oscar Romero 2 MapReduce-I Understanding Objectives 1. Describe the different steps of a MapReduce execution September 2013 Alberto Abelló & Oscar Romero 3 MapReduce-I Application Objectives 1. Identify the usefulness of MapReduce in a given use case September 2013 Alberto Abelló & Oscar Romero 4 MapReduce-I Typical uses Find which source pages link to a target page Count the number of accesses to each Web page Count the number of accesses to each domain Create an index structure that maps search terms to doc ment IDs document Retrieve introductory paragraph of all Web pages so that “x” Find all pairs of users accessing the same URL Find the average age of users accessing a given URL Find all friends of a given user Find all friends of friends of a given user Find all women friends of men friends of a given user G Grouping i diff different manifestations if i off the h same reall world object September 2013 Alberto Abelló & Oscar Romero 5 MapReduce-I Not so typical uses Mobile Commerce Electricity Agricultural Planning Fuel Conservation National N ti l IIntelligence t lli Drug Development and Personalization Financial Service Security September 2013 Alberto Abelló & Oscar Romero 6 MapReduce-I Opinions TDWI Best Practices Report, 2013 September 2013 Alberto Abelló & Oscar Romero 7 MapReduce-I Busting 10 Myths about Hadoop Hadoop consists in multiple products Hadoop is open source but available from vendors, vendors too Hadoop is an ecosystem, not a single product HDFS is a file system, s stem not a DBMS Hive resembles SQL but is not standard SQL Hadoop p and MapReduce p are related but don’t require q each other MapReduce provides control for analytics, not analytics y per p se Hadoop is about data diversity, not just data volume Hadoop complements a DW; it’s rarely a replacement Hadoop enables many types of analytics analytics, not just Web analytics Philip Russom September 2013 Alberto Abelló & Oscar Romero 8 MapReduce-I Google ecosystem High-performance is mainly achieved by means off parallelism ll li Divide-and-conquer principle M R d MapReduce It is a query language that provides parallelism in a transparent manner Query Language MapReduce Hadoop File System Database BigTable … September 2013 Alberto Abelló & Oscar Romero 9 MapReduce-I Hadoop ecosystem By HortonWorks September 2013 Alberto Abelló & Oscar Romero 10 MapReduce-I Sequential access Ben Stopford p Progscon & JAX Finance 2015 September 2013 Alberto Abelló & Oscar Romero 11 MapReduce-I MapReduce Basics Map p MergeSort Reduce Simple model to express relatively sophisticated distributed programs Processes pairs [key, value] Signature: September 2013 Alberto Abelló & Oscar Romero 12 [“Th he”,[1,1,1,1,…]] MapReduce-I WordCount Execution Example <#line, text> Map The Project Gutemberg Ebook of The Outline Of Science, Vol. 1 (of 4), by September 2013 Alberto Abelló & Oscar Romero Merge-Short 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Reduce The 57631 13 MapReduce-I WordCount Code Example public void map(LongWritable key, Text value) { Value Key String line = value.toString(); Blackbox StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { write(new Text(tokenizer.nextToken()), Text(tokenizer nextToken()) new IntWritable(1)); Value Key } } public void reduce(Text key, Iterable<IntWritable> values) { Key Values int sum = 0; for (IntWritable val : values) { Blackbox sum += val.get(); } write(key, new IntWritable(sum)); Key Value } September 2013 Alberto Abelló & Oscar Romero 14 MapReduce-I Benefits Programming (functional) model simple yet expressive Without joins Able to process structured or unstructured Elastically scalable Transparently distributes data Exploits data locality Hides parallelization Balances workload Provides fine grained fault tolerance September 2013 Alberto Abelló & Oscar Romero 15 MapReduce-I Problems Writes intermediate results to disk Reduce tasks pull intermediate data Defines the execution p plan on the fly y Schedules one block at a time Does not provide transactions Does not benefit from compression September 2013 Alberto Abelló & Oscar Romero 16 MapReduce-I Market tools “… But to really unlock the power of Hadoop, you must be able to efficiently extract data stored across multiple (often tens or hundreds) of nodes with a userfriendly ETL (extract, transform and load) tool that will then allow you to move your Hadoop data into a relational data mart or warehouse where you can use BI tools for analysis. “ Ian Fyfe Pentaho September 2013 Alberto Abelló & Oscar Romero 17 MapReduce-I Reference Architecture DM DM DM DW ETL (Extraction, Transformation and Load) WWW September 2013 Alberto Abelló & Oscar Romero 18 MapReduce-I Bigg elephant p or little elephant? p T d Trademark k • Expensive • Many functionalities • Mature Open source • Free • Simple functionalities • Young September 2013 Alberto Abelló & Oscar Romero 19 MapReduce-I Friends or foes MapRed ce MapReduce Ja a Java O acle Oracle Oracle Grid Engine Hadoop Distributed Dist ib ted File System S stem (HDFS) September 2013 Alberto Abelló & Oscar Romero 20 MapReduce-I Activity Objective: Understand the usefulness of M R d MapReduce Tasks: 1. (6’) Read two use cases 2. (12’) Explain the use cases to the others 3. (5’) Find Fi d th the main i characteristics h t i ti needed d d to t use Hadoop 4 Hand in a prioritized list of characteristics 4. Roles for the team-mates during task 2: a) Explains his/her material b) Asks for clarification of blur concepts c) Mediates and controls time September 2013 Alberto Abelló & Oscar Romero 21 MapReduce-I Algorithm: Data Load The input data is partitioned into blocks 1. o 2. It can be done by using HDFS or any other storage (e.g., HadoopDB, MongoDB, Cassandra, CouchDB, etc.) Replicate them in different nodes Input file Block Replicate … September 2013 Alberto Abelló & Oscar Romero 22 MapReduce-I Algorithm: Map Phase (I) 3. Each map subplan reads a subset of blocks (i.e., split) 4 Divides it into records 4. 5. Executes the map for each record and leaves them in memory divided into spills Split Itemize Map … Nodemapper September 2013 Alberto Abelló & Oscar Romero 23 MapReduce-I Algorithm: Map Phase (II) Each spill is then partitioned per reducers 6. o Each partition is sorted independently 7. o 8. Using g a hash function f over the new key y If a combine is defined, it is executed locally after sorting Store the spills p into disk ((massive writing) g) Partition spills Store Combine Nodemapper September 2013 Alberto Abelló & Oscar Romero 24 MapReduce-I Algorithm: Map phase (III) 9. Spill partitions are merged o Each merge is sorted independently 10. Store the result into disk Merge partitions per reducer Store Combine Nodemapper September 2013 Alberto Abelló & Oscar Romero 25 MapReduce-I Algorithm: Shuffle and Reduce 11. 12 12. 13. 14. Reducers fetch data from mappers (massive data transfer) Mappers output is merged and sorted Reduce function is executed per key Store the result into disk Fetch Reduce Store … Nodereducer September 2013 Alberto Abelló & Oscar Romero 26 MapReduce-I Local-Global Aggregation: Combine Combine is executed locally Only possible when the reducer function is: Assumes uniform random distribution of input Reduces the number of tuples sent to reducers Commutative Associative Only makes sense if |I|/|O|>>#CPU September 2013 Alberto Abelló & Oscar Romero 27 MapReduce-I Summary MapReduce MapReduce MapReduce MapReduce September 2013 usefulness benefits problems algorithm Alberto Abelló & Oscar Romero 28 MapReduce-I Bibliography J. Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI’04 P. Sadagale and M. Fowler. NoSQL distilled. Addison-Wesley, y, 2013 S. Abiteboul et al. Web data management. Cambridge University Press, 2011 M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Communication of ACM 53(1), 2010 September 2013 Alberto Abelló & Oscar Romero 29 MapReduce-I Resources http://hadoop.apache.org http://www.cloudera.com http://flink.incubator.apache.org Former http://stratosphere.eu https://spark apache org https://spark.apache.org September 2013 Alberto Abelló & Oscar Romero 30