Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang (Yahoo!), Ali Dasdan (Yahoo!), Ruey-Lung Hsiao (UCLA), D. Stott Parker (UCLA)
SIGMOD 2007 (Industrial)
Presented by Kisung Kim, 2010. 7. 14

Contents
– Introduction
– Map-Reduce
– Map-Reduce-Merge
– Applications to Relational Data Processing
– Case Study
– Conclusion

Introduction
New challenges of data processing
– A vast amount of data is collected from the entire WWW
Solutions of search-engine companies
– Customized parallel data-processing systems
– Use large clusters of shared-nothing commodity nodes
– Ex) Google's GFS, BigTable, and MapReduce; Ask.com's Neptune; Microsoft's Dryad; Yahoo!'s Hadoop

Properties of these data-intensive systems
– Simple: adopt only a selected subset of database principles
– Sufficiently generic and effective
– Parallel data processing deployed on large clusters of shared-nothing commodity nodes
– Data processing refactored into two primitives: a Map function and a Reduce function
Map-Reduce frees users from the nuisance details of
– Coordinating parallel sub-tasks
– Maintaining distributed file storage
This abstraction can greatly increase user productivity.

The Map-Reduce framework is best at handling homogeneous datasets
– Ex) Joining multiple heterogeneous datasets does not quite fit into the Map-Reduce framework
Extending Map-Reduce to process heterogeneous datasets simultaneously
– Processing data relationships is ubiquitous
– A join-enabled Map-Reduce system can provide a highly parallel yet cost-effective alternative
– Include relational algebra in the adopted subset of database principles: relational operators can be modeled using various combinations of the three primitives Map, Reduce, and Merge

Map-Reduce
The input dataset is stored in GFS.
Mapper
– Reads splits of the input dataset
– Applies the map function to the input records
– Produces intermediate key/value sets
– Partitions the intermediate sets into as many partitions as there are reducers
Reducer
– Reads its part of the intermediate sets from the mappers
– Applies the reduce function to all values of the same key
– Outputs the final results
Signatures of the map and reduce functions:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [v3]
[Figure: input splits → mappers → intermediate sets → reducers → final results]

Join using Map-Reduce: the homogenization procedure
– Apply one map/reduce task to each dataset
– Insert a data-source tag into every value
– Extract a key attribute common to all heterogeneous datasets
– The transformed datasets then share two common attributes: the key and the data-source tag
Problems
– Takes a lot of extra disk space and incurs excessive map-reduce communication
– Limited to queries that can be rendered as equi-joins

Join using Map-Reduce: homogenization example
Dataset 1 after map:        Dataset 2 after map:
Key  Value                  Key  Value
10   1, "Value1"            10   2, "Value4"
85   1, "Value2"            54   2, "Value5"
320  1, "Value3"            320  2, "Value6"
Reduce then collects the records that share the same key.

Map-Reduce-Merge
Signatures
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → (k2, [v3])
merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ
– α, β, and γ represent dataset lineages
– The reduce function produces a key/value list instead of just values
– The merge function reads data from both lineages
These three primitives can be used to implement parallel versions of several join algorithms.

Merge Modules
– Merge function: processes two pairs of key/values, one from each lineage
– Processor functions: each processes data from one lineage only; users can define two of them
– Partition selector: determines, based on the merger number, from which reducers this merger retrieves its input data
– Configurable iterators: a merger has two logical iterators, and users control their relative movement against each other
[Figure: a merger with its partition selector, two processors, iterators, and merge function reading the reducer outputs of the 1st and 2nd datasets]
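As a rough illustration of the three signatures above, here is a toy, single-process sketch. All names (run_mapreduce, merge, the bonus/department data) are illustrative inventions, not the paper's API; the real system is distributed, and mergers pull reducer outputs through partition selectors and configurable iterators rather than scanning in-memory dicts.

```python
# Toy, single-process sketch of the Map-Reduce-Merge primitives.
# Illustrative only: the real framework runs mappers/reducers/mergers
# on a cluster and feeds mergers via partition selectors and iterators.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """map: (k1, v1) -> [(k2, v2)]; reduce: (k2, [v2]) -> (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Unlike classic Map-Reduce, reduce keeps the key so a merger can use it.
    return dict(reduce_fn(k2, vs) for k2, vs in sorted(groups.items()))

def merge(lineage_a, lineage_b, merge_fn):
    """merge: ((k2, [v3])_alpha, (k3, [v4])_beta) -> [(k4, v5)]_gamma.
    Here both reduced lineages are walked over matching keys (a sort-merge
    style iteration); the paper's configurable iterators generalize this."""
    out = []
    for k in sorted(set(lineage_a) & set(lineage_b)):
        out.extend(merge_fn(k, lineage_a[k], lineage_b[k]))
    return out

# Hypothetical example: join per-department bonus totals with dept names.
emps = [(1, ("d1", 100)), (2, ("d2", 200)), (3, ("d1", 50))]
depts = [("d1", "Sales"), ("d2", "Eng")]

emp_red = run_mapreduce(emps,
                        map_fn=lambda eid, v: [(v[0], v[1])],    # key by dept
                        reduce_fn=lambda k, vs: (k, [sum(vs)]))  # total bonus
dept_red = run_mapreduce(depts,
                         map_fn=lambda d, name: [(d, name)],
                         reduce_fn=lambda k, vs: (k, vs))

joined = merge(emp_red, dept_red,
               lambda k, bonuses, names: [(k, (names[0], bonuses[0]))])
print(joined)  # [('d1', ('Sales', 150)), ('d2', ('Eng', 200))]
```

The point of the sketch is the signature change: because reduce emits (key, value-list) pairs instead of bare values, the merge step can align two independently reduced lineages on their keys.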
Applications to Relational Data Processing
Map-Reduce-Merge can be used to implement the primitive and some derived relational operators:
– Projection
– Aggregation
– Generalized selection
– Joins
– Set union
– Set intersection
– Set difference
– Cartesian product
– Rename
Map-Reduce-Merge is therefore relationally complete, while remaining load-balanced, scalable, and parallel.

Example: Hash Join
– Each reducer reads from every mapper for one designated partition; a hash partitioner is used
– Each merger reads from the two sets of reducer outputs that share the same hashing buckets; one set is used as the build set and the other as the probe set
[Figure: input splits → mappers → reducers of the two lineages → mergers]

Case Study: TPC-H Query 2
– Involves 5 tables, 1 nested query, 1 aggregate with a group-by clause, and 1 order-by clause
Map-Reduce-Merge workflow
– 13 passes of Map-Reduce-Merge: 10 mappers, 10 reducers, and 4 mergers
Combining phases
– 6 passes of Map-Reduce-Merge: 5 mappers, 4 reduce-merge-mappers, 1 reduce-mapper, and 1 reducer

Conclusion
The Map-Reduce-Merge programming model
– Retains many of Map-Reduce's great features
– Adds relational algebra to the list of database principles it upholds
– Contains several configurable components that enable many data-processing patterns
Next steps
– Develop an SQL-like interface and an optimizer to simplify the process of developing a Map-Reduce-Merge workflow; this work can readily reuse well-studied RDBMS techniques
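The hash-join example above can be sketched as follows. This is a toy, in-memory sketch under stated assumptions: the function names (hash_partition, hash_join_merger) are illustrative, a single Python hash stands in for the framework's hash partitioner, and each merger handles one bucket pair, building a hash table from one reduced lineage and probing it with the other.

```python
# Toy sketch of the hash-join pattern: both lineages are hash-partitioned
# into the same buckets, and merger i builds on bucket i of one lineage
# and probes with bucket i of the other. Names are illustrative only.
def hash_partition(pairs, num_buckets):
    """Stand-in for the hash partitioner shared by both lineages."""
    buckets = [[] for _ in range(num_buckets)]
    for k, v in pairs:
        buckets[hash(k) % num_buckets].append((k, v))
    return buckets

def hash_join_merger(build_part, probe_part):
    """One merger: build a hash table on one partition, probe with the other."""
    table = {}
    for k, v in build_part:
        table.setdefault(k, []).append(v)
    return [(k, (bv, pv)) for k, pv in probe_part for bv in table.get(k, [])]

NUM_MERGERS = 2
left = [("a", 1), ("b", 2), ("a", 3)]          # reduced outputs, lineage 1
right = [("a", "x"), ("c", "y"), ("b", "z")]   # reduced outputs, lineage 2
lb = hash_partition(left, NUM_MERGERS)
rb = hash_partition(right, NUM_MERGERS)
result = sorted(p for i in range(NUM_MERGERS)
                for p in hash_join_merger(lb[i], rb[i]))
print(result)  # [('a', (1, 'x')), ('a', (3, 'x')), ('b', (2, 'z'))]
```

Because both lineages use the same partitioner, matching keys always land in the same bucket, so each merger can join its bucket pair independently of the others — which is what makes the pattern parallel and load-balanced.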