YSmart: Yet Another SQL-to-MapReduce Translator
Rubao Lee (1), Tian Luo (1), Yin Huai (1), Fusheng Wang (2), Yongqiang He (3), Xiaodong Zhang (1)
(1) Dept. of Computer Sci. & Eng., The Ohio State University
(2) Center for Comprehensive Informatics, Emory University
(3) Data Infrastructure Team, Facebook

Data Explosion in Human Society
(Chart: the global storage capacity, analog vs. digital, and the amount of digital information created and replicated in a year. In 2000 storage was predominantly analog; by 2007 it was 18.86 billion GB analog vs. 276.12 billion GB digital. Source: "Exabytes: Documenting the 'digital age' and huge growth in computing capacity", The Washington Post.)

Challenges of Big Data Management and Analytics
Existing DB technology is not prepared for such huge volumes
• Storage and system scale (e.g., 70 TB/day and 3,000 nodes at Facebook)
• Many applications in science, medicine, and business follow the same trend
Big data demands more complex analytics than data transactions
• Data mining, time-series analysis, etc., to gain deep insights
The conventional DB business model is not affordable
• Software licenses, maintenance fees, $10,000/TB*
The conventional DB processing model is to "scale up"
• By upgrading CPUs/memory/storage/networks (the HPC model)
The big data processing model is to "scale out"
• By continuously adding low-cost computing and storage nodes in a distributed way
• The MapReduce (MR) programming model has become an effective data processing engine for big data analytics
• Hadoop is the most widely used implementation of MapReduce, deployed in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, EBay, Facebook, IBM, NY Times, Yahoo!, ...
*: http://www.dbms2.com/2010/10/15/pricing-of-data-warehouse-appliances/

MapReduce Overview
A simple but effective programming model designed to process huge volumes of data concurrently on large-scale clusters
Key/value pair transitions:
• Map: (k1, v1) → (k2, v2)
• Reduce: (k2, a list of v2) → (k3, v3)
• Shuffle: driven by the Partition Key (which may or may not be the same as k2)
• Partition Key: determines how a key/value pair in the map output is transferred to a reduce task (a code sketch follows the execution steps below)

MR (Hadoop) Job Execution Overview
Data is stored in a distributed file system (e.g., the Hadoop Distributed File System). The execution of an MR job involves 6 steps:
1: Job submission: the MR program (job) is submitted to the master node
2: Task assignment: the master node does the control-level work (e.g., job scheduling and task assignment); the worker nodes do the data processing work specified by the Map or Reduce function
3: Map phase: concurrent map tasks
4: Shuffle phase: map output is shuffled to different reduce tasks based on Partition Keys (PKs), usually the map output keys
5: Reduce phase: concurrent reduce tasks
6: Output is stored back to the distributed file system
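As a concrete illustration of the partition key (this sketch is ours, not part of the original slides): in the Hadoop API, the routing decision of step 4 is made by a Partitioner. A minimal sketch, assuming integer map-output keys:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every map-output pair is assigned to one of the reduce tasks; all pairs
// sharing the same partition key land on the same reduce task.
public class OrderKeyPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // Here the partition key is the map output key itself.
        return (key.get() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Hadoop's built-in HashPartitioner does essentially this using key.hashCode(); a translator can substitute its own class via job.setPartitionerClass(...).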
A MapReduce (MR) job is resource-consuming:
• Input data scans in the Map phase => local or remote I/Os
• Storing the intermediate results of the Map output => local I/Os
• Transferring data across nodes in the Shuffle phase => network costs
• Storing the final results of the MR job => local I/Os + network costs (replicated data)

MR programming is not that "simple"!
This complex code is for a simple MR job (the first MR job of TPC-H Q18):

package tpch;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Q18Job1 extends Configured implements Tool {

    // Map: key each row by its order key (the first column of both tables)
    // and tag it with its source table ("l" for lineitem, "o" for orders).
    public static class Map extends Mapper<Object, Text, IntWritable, Text> {
        private final static Text value = new Text();
        private IntWritable word = new IntWritable();
        private String inputFile;
        private boolean isLineitem = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
            if (inputFile.compareTo("lineitem.tbl") == 0) {
                isLineitem = true;
            }
            System.out.println("isLineitem:" + isLineitem + " inputFile:" + inputFile);
        }

        public void map(Object key, Text line, Context context) throws IOException, InterruptedException {
            String[] tokens = (line.toString()).split("\\|");
            word.set(Integer.valueOf(tokens[0]));
            if (isLineitem) {
                value.set(tokens[4] + "|l");
            } else {
                value.set(tokens[1] + "|" + tokens[4] + "|" + tokens[3] + "|o");
            }
            context.write(word, value);
        }
    }

    // Reduce: sum the lineitem quantities per order and keep the orders whose
    // total quantity exceeds 314.
    public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
        private Text result = new Text();

        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sumQuantity = 0.0;
            IntWritable newKey = new IntWritable();
            boolean isDiscard = true;
            String thisValue = new String();
            int thisKey = 0;
            for (Text val : values) {
                String[] tokens = val.toString().split("\\|");
                if (tokens[tokens.length - 1].compareTo("l") == 0) {
                    sumQuantity += Double.parseDouble(tokens[0]);
                } else if (tokens[tokens.length - 1].compareTo("o") == 0) {
                    thisKey = Integer.valueOf(tokens[0]);
                    thisValue = key.toString() + "|" + tokens[1] + "|" + tokens[2];
                }
            }
            if (sumQuantity > 314) {
                isDiscard = false;
            }
            if (!isDiscard) {
                thisValue = thisValue + "|" + sumQuantity;
                newKey.set(thisKey);
                result.set(thisValue);
                context.write(newKey, result);
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: Q18Job1 <orders> <lineitem> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "TPC-H Q18 Job1");
        job.setJarByClass(Q18Job1.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Q18Job1(), args);
        System.exit(res);
    }
}

Low Productivity! Do you miss something like … "SELECT * FROM Book WHERE price > 100.00"?

High-Level Programming and Data Processing Environment on top of Hadoop
A job description in an SQL-like declarative language goes to a SQL-to-MapReduce translator, which writes the MR programs (jobs); the translator is an interface between users and MR programs (jobs). The generated MR programs (jobs) run on the workers over the Hadoop Distributed File System (HDFS).
• Hive: a data warehousing system (Facebook)
• Pig: a high-level programming environment (Yahoo!)
Both improve productivity over hand-coding MapReduce programs:
• 95%+ of Hadoop jobs in Facebook are generated by Hive
• 75%+ of Hadoop jobs in Yahoo! are invoked by Pig*
*: http://hadooplondon.eventbrite.com/

Translating SQL-like Queries to MapReduce Jobs: The Existing Approach
"Sentence by sentence" translation
• [C. Olston et al., SIGMOD 2008], [A. Gates et al., VLDB 2009], and [A. Thusoo et al., ICDE 2010]
• Implementations: Hive and Pig
Three steps:
• Identify the major sentences with operations that shuffle the data, such as Join, Group By, and Order By
• For every operation in a major sentence, generate a corresponding MR job (e.g., a join op. => a join MR job)
• Add the other operations, such as selection and projection, into the corresponding MR jobs
Can existing SQL-to-MapReduce translators provide performance comparable to optimized MR job(s)?

An Example: TPC-H Q21
One of the most complex and time-consuming queries in the TPC-H benchmark for data warehousing performance.
Optimized MR jobs vs. Hive in a Facebook production cluster: (Chart: execution time in minutes; Hive is 3.7x slower than the optimized MR jobs.) What's wrong?
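Part of the answer is how sentence-by-sentence plans execute: each shuffle operator becomes its own MR job, and a driver chains the jobs through HDFS. A minimal sketch of such chaining (ours, with hypothetical paths and job names; mapper/reducer setup elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedPlan {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path joinOut = new Path("/tmp/q21/join1");  // hypothetical intermediate path

        Job join = new Job(conf, "join: lineitem x orders");
        // ... set the mapper/reducer classes for the join operator ...
        FileInputFormat.addInputPath(join, new Path("/warehouse/lineitem"));
        FileInputFormat.addInputPath(join, new Path("/warehouse/orders"));
        FileOutputFormat.setOutputPath(join, joinOut);
        // Barrier: the join's entire output is materialized in HDFS.
        if (!join.waitForCompletion(true)) System.exit(1);

        Job agg = new Job(conf, "group-by on join output");
        // ... set the mapper/reducer classes for the aggregation operator ...
        FileInputFormat.addInputPath(agg, joinOut);  // full re-scan of materialized data
        FileOutputFormat.setOutputPath(agg, new Path("/tmp/q21/agg"));
        System.exit(agg.waitForCompletion(true) ? 0 : 1);
    }
}

Every job boundary pays a full materialize-and-rescan (plus a fresh shuffle), which is exactly the redundancy that merging correlated jobs can avoid.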
The Execution Plan of TPC-H Q21
(Figure: the query execution plan tree. SORT at the top over AGG3; below it, Join4 and a left-outer join connecting to Join3 with the supplier and nation tables; at the bottom, Join2 combines Join1 (lineitem x orders) with AGG1 (lineitem), alongside AGG2 (lineitem).)
Hive handles the boxed sub-tree (the joins and aggregations over lineitem and orders) in a different way from our optimized MR jobs. That sub-tree is the dominant part of the execution time (~90%).

A Composite MR Job
The sub-tree translates into five MR jobs: a JOIN job J1 (lineitem x orders), an AGG job J2 (lineitem), a JOIN job J3 over the outputs of J1 and J2, an AGG job J4 (lineitem), and a composite left-outer JOIN job J5 over the outputs of J3 and J4. However, inter-job correlations exist. Look at the partition keys:
• J1 to J5 all use the same partition key, 'l_orderkey'
• J1, J2, and J4 all need the same input table, 'lineitem'

What's wrong with existing SQL-to-MR translators?
Existing translators are correlation-unaware:
1. They ignore common data input
2. They ignore common data transitions
3. They add unnecessary data re-partitions

Performance Approaches of Big Data Analytics in MR: The Landscape
Hand-coding MR jobs
• Pro: high performance
• Con: 1) lots of coding even for a simple job; 2) redundant coding is inevitable; 3) hard to debug [J. Tan et al., ICDCS 2010]
Existing SQL-to-MR translators
• Pro: easy programming, high productivity
• Con: poor performance on complex queries (complex queries are usual in daily operations)
A correlation-aware SQL-to-MR translator aims to deliver both high productivity and high performance.

Outline
Background and Motivation
Our Solution: YSmart, a correlation-aware SQL-to-MapReduce translator
Performance Evaluation
YSmart in Hive
Conclusion

Our Approaches and Critical Challenges
SQL-like queries are first translated into primitive MR jobs; the translator then identifies correlations and merges correlated MR jobs into MR jobs with the best performance. The critical challenges:
1: Correlation possibilities and their detection
2: Rules for automatically exploiting correlations
3: Implementing high-performance and low-overhead MR jobs

Primitive MR Jobs
Selection-Projection (SP) MR job:
• A simple query with only selection and/or projection
Aggregation (AGG) MR job:
• Groups the input relation and applies aggregation functions
• e.g., SELECT count(*) FROM faculty GROUP BY age
• The partition key can be any column(s) in the GROUP BY clause
Join (JOIN) MR job:
• An equi-join on two relations (inner or left/right/full outer join)
• e.g., SELECT * FROM faculty LEFT OUTER JOIN awards ON …
• The partition key is in the JOIN predicate
Sort (SORT) MR job:
• Sorts the input relation
• Usually the last step in a complex query

Input Correlation (IC)
Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint.
Example: J1 (lineitem x orders) and J2 (an AGG job on lineitem) share the input relation 'lineitem'.

Transit Correlation (TC)
Multiple MR jobs have transit correlation (TC) if
• they have input correlation (IC), and
• they have the same partition key.
Example: J1 and J2 above both partition on 'l_orderkey'.

Job Flow Correlation (JFC)
An MR job has job flow correlation (JFC) with one of its child MR jobs if it has the same partition key as that child job.
Example: J1 consumes the output of its child job J2 (a join of lineitem and orders), and both use the same partition key. (A toy sketch of these three checks follows.)
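The three definitions are mechanical enough to sketch in code. The descriptors and checks below are a toy illustration of ours, not YSmart's actual internals:

import java.util.HashSet;
import java.util.Set;

// Toy job descriptor: just enough state to express the three correlations.
class MRJobDesc {
    Set<String> inputs = new HashSet<>();        // names of the input relations
    Set<MRJobDesc> producers = new HashSet<>();  // jobs whose output this job reads
    String partitionKey;                         // the shuffle key
}

class CorrelationDetector {
    // Input Correlation: the input relation sets are not disjoint.
    static boolean hasIC(MRJobDesc a, MRJobDesc b) {
        Set<String> shared = new HashSet<>(a.inputs);
        shared.retainAll(b.inputs);
        return !shared.isEmpty();
    }

    // Transit Correlation: input correlation plus the same partition key.
    static boolean hasTC(MRJobDesc a, MRJobDesc b) {
        return hasIC(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    // Job Flow Correlation: one job reads another job's output and uses
    // the same partition key as that job.
    static boolean hasJFC(MRJobDesc consumer, MRJobDesc producer) {
        return consumer.producers.contains(producer)
            && consumer.partitionKey.equals(producer.partitionKey);
    }
}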
Query Optimization Rules for Automatically Exploiting Correlations
• JOB-Merging: exploits both input correlation and transit correlation
• AGG-Pushdown: exploits the job flow correlation associated with aggregation jobs
• JOIN-Merging-Full: exploits the job flow correlation associated with JOIN jobs and their transit-correlated parent jobs
• JOIN-Merging-Half: exploits the job flow correlation associated with JOIN jobs

Rule 1: JOB-Merging
Condition: Job 1 and Job 2 have both IC and TC.
Action: merge the Map/Reduce functions of Job 1 and Job 2 into a single Map/Reduce function. The merged reduce function does all the work of Job 1's reduce function and Job 2's reduce function, and generates two outputs.

Rule 2: AGG-Pushdown
Condition: an AGG job and its parent Job 1 have JFC.
Action: merge Job 1's reduce function, the AGG map function, and the AGG reduce function into a single reduce function, which does all of their work.

Rule 3: JOIN-Merging-Full
Condition 1: Job 1 and Job 2 have transit correlation (TC).
Condition 2: the JOIN job has JFC with both Job 1 and Job 2.
Action: merge Job 1's reduce function, Job 2's reduce function, and the JOIN job into a single reduce function, which does all the work of Job 1's reduce function, Job 2's reduce function, the JOIN map function, and the JOIN reduce function.

Rule 4: JOIN-Merging-Half
Condition: Job 2 and the JOIN job have JFC.
Action: merge the reduce function of Job 2 and that of the JOIN job into a single reduce function.
• For data from Job 1: do the work of the JOIN reduce function.
• For data from Job 2: do all the work of Job 2's reduce function, the JOIN map function, and the JOIN reduce function.
Data dependency must be considered: Job 1 must be executed before the merged job.

The Common MapReduce Framework
Aims to provide a framework for executing merged tasks with low overhead:
• Minimize the size of the intermediate data
• Minimize data accesses in the reduce phase
Common Map Func. and Common Reduce Func.:
• Common Map Func.: tags a record with the Job-IDs it does not belong to (less intermediate results)
• Common Reduce Func.: dispatches key/value pairs to post-job computations
Refer to our paper for details. A toy sketch of the dispatch follows.
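A minimal sketch of the reduce-side dispatch in a merged job, with details assumed by us (see the paper for the real design). For simplicity this sketch tags records with the jobs they do belong to; YSmart's common map function uses the complementary tagging to shrink the intermediate data:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CommonReduce extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> job1Rows = new ArrayList<>();
        List<String> job2Rows = new ArrayList<>();
        for (Text val : values) {
            String row = val.toString();
            // The tag is appended by the common map function.
            String tag = row.substring(row.lastIndexOf('|') + 1);
            if (tag.contains("1")) job1Rows.add(row);
            if (tag.contains("2")) job2Rows.add(row);
        }
        // Post-job computations: apply each constituent job's reduce logic
        // to its slice of the values, writing to per-job outputs.
        reduceForJob1(key, job1Rows, context);
        reduceForJob2(key, job2Rows, context);
    }

    private void reduceForJob1(IntWritable key, List<String> rows, Context ctx) { /* merged Job 1 logic */ }
    private void reduceForJob2(IntWritable key, List<String> rows, Context ctx) { /* merged Job 2 logic */ }
}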
Outline
Background and Motivation
Our Solution: YSmart, a correlation-aware SQL-to-MapReduce translator
Performance Evaluation
YSmart in Hive
Conclusion

Exp1: Four Cases of TPC-H Q21
1: Sentence-to-sentence translation
• 5 MR jobs
2: Input Correlation + Transit Correlation
• 3 MR jobs
3: Input Correlation + Transit Correlation + Job Flow Correlation
• 1 MR job
4: Hand-coding (similar to Case 3)
• In the reduce function, the code is optimized according to the query semantics
(Plan diagrams omitted: the same operator tree over lineitem and orders executed as 5, 3, and 1 MR jobs.)

Breakdowns of Execution Time (sec)
(Chart: stacked map/reduce time breakdowns of Jobs 1-5 for the four cases. Exploiting correlations cuts one total from 888 sec to 510 sec and another from 768 sec to 567 sec; the fully merged plan is within only 17% of hand-coding.)

Exp2: Clickstream Analysis
A typical query in production clickstream analysis: "what is the average number of pages a user visits between a page in category 'X' and a page in category 'Y'?"
In YSmart, JOIN1, AGG1, AGG2, JOIN2, and AGG3 are executed in a single MR job.
(Chart: execution time in minutes; YSmart outperforms Hive by 4.8x and Pig by 8.4x.)

YSmart in the Hadoop Ecosystem
Hive + YSmart runs on top of the Hadoop Distributed File System (HDFS). See patch HIVE-2206 at apache.org.

Conclusion
• YSmart is a correlation-aware SQL-to-MapReduce translator
• YSmart can outperform Hive by 4.8x and Pig by 8.4x
• YSmart is being integrated into Hive
• The standalone version of YSmart will be released soon
Thank You!

Backup Slides

Performance Evaluation
Exp 1: How can correlations affect query execution performance?
• A small local cluster; the worker node is a quad-core machine
• Workload: TPC-H Q21, 10 GB dataset
Exp 2: How does YSmart outperform Hive and Pig?
• An Amazon EC2 cluster
• Workload: clickstream analysis, 20 GB
• Compare the execution times of YSmart, Hive, and Pig

Why is MapReduce an effective data processing engine for big data analytics?
Two unique properties:
• Minimum dependency among tasks (almost nothing is shared)
• Simple task operations on each node (low-cost machines are sufficient)
Two strong merits for big data analytics:
• Scalability (Amdahl's Law): increase throughput by increasing the number of nodes
• Fault tolerance: quick and low-cost recovery from task failures
Hadoop is the most widely used implementation of MapReduce, deployed in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, EBay, Facebook, IBM, NY Times, Yahoo!, ...
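As a reminder of what the scalability bullet appeals to (our gloss, not on the slide): Amdahl's Law bounds the speedup of a workload whose parallelizable fraction is $p$ when it runs on $n$ nodes,

$$ S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}. $$

Because MapReduce tasks share almost nothing, $p$ is close to 1 for typical scan-heavy analytics, so adding nodes keeps increasing throughput nearly linearly.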
An Example
A typical question in click stream analysis: "what is the average number of pages a user visits between a page in category 'Travel' and a page in category 'Shopping'?"

User  | Category | Timestamp (TS)
Alice | Travel   | 2011-01-01 04:05:06
Alice | Travel   | 2011-01-01 04:05:07
Alice | Sports   | 2011-01-01 04:06:01
Alice | Shopping | 2011-01-01 04:07:06
Alice | Shopping | 2011-01-01 04:07:20

3 steps:
1: Identify the time window for every user-visit path from a page in category 'Travel' to a page in category 'Shopping';
2: Count the number of pages in each time window (excluding the start and end pages);
3: Calculate the average of the counts generated by step 2.

Step 1-1: For each user, identify pairs of clicks from category 'Travel' to category 'Shopping'
Join1: Clicks (C1) x Clicks (C2) on C1.user = C2.user

User  | TS_Travel           | TS_Shopping
Alice | 2011-01-01 04:05:06 | 2011-01-01 04:07:06
Alice | 2011-01-01 04:05:06 | 2011-01-01 04:07:20
Alice | 2011-01-01 04:05:07 | 2011-01-01 04:07:06
Alice | 2011-01-01 04:05:07 | 2011-01-01 04:07:20

Step 1-2: Find the end TS for every time window
AGG1: group by user, TS_Travel
Step 1-3: Find the start TS for every time window
AGG2: group by user, TS_End

User  | TS_Start            | TS_End
Alice | 2011-01-01 04:05:07 | 2011-01-01 04:07:06

Step 2-1: Identify the clicks in each time window
Join2: (time windows) x Clicks (C3) on user = C3.user

User  | Category | TS_Start            | Timestamp (TS)
Alice | Travel   | 2011-01-01 04:05:07 | 2011-01-01 04:05:06
Alice | Travel   | 2011-01-01 04:05:07 | 2011-01-01 04:05:07
Alice | Sports   | 2011-01-01 04:05:07 | 2011-01-01 04:06:01
Alice | Shopping |                     | 2011-01-01 04:07:06
Alice | Shopping |                     | 2011-01-01 04:07:20

Step 2-2: Count the number of pages in each time window (excluding the start and end pages)
AGG3: group by user, TS_Start

User  | Count
Alice | 1

Step 3: Calculate the average of the counts generated by step 2
AGG4 over the per-window counts:

Average Count
1

This is the average number of pages a user visits between a page in category 'Travel' and a page in category 'Shopping'. The full plan is: Join1 (Clicks C1 x Clicks C2 on C1.user = C2.user), AGG1 (group by user, TS_Travel), AGG2 (group by user, TS_End), Join2 (with Clicks C3 on user = C3.user), AGG3 (group by user, TS_Start), and AGG4.
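Because every operator in this plan partitions on user, all of a user's clicks can be co-located on one reduce task and the whole pipeline can run there. A sketch of that per-user, reduce-side logic (our illustration with epoch-second timestamps, not YSmart's generated code):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class Click {
    final String category;
    final long ts;
    Click(String category, long ts) { this.category = category; this.ts = ts; }
}

class WindowCounter {
    // Steps 1 and 2 of the plan, applied to one user's clicks.
    static List<Integer> countsPerWindow(List<Click> clicks, String from, String to) {
        // Step 1-1 + 1-2 (Join1 + AGG1): for each 'from' click, the window end
        // is the earliest 'to' click after it.
        // Step 1-3 (AGG2): for each end, the window start is the latest 'from'
        // click, i.e., keep the maximum start TS per end TS.
        Map<Long, Long> startByEnd = new TreeMap<>();
        for (Click c1 : clicks) {
            if (!c1.category.equals(from)) continue;
            long end = Long.MAX_VALUE;
            for (Click c2 : clicks)
                if (c2.category.equals(to) && c2.ts > c1.ts && c2.ts < end) end = c2.ts;
            if (end == Long.MAX_VALUE) continue;
            startByEnd.merge(end, c1.ts, Math::max);
        }
        // Step 2 (Join2 + AGG3): count the clicks strictly inside each window,
        // which excludes the start and end pages themselves.
        List<Integer> counts = new ArrayList<>();
        for (Map.Entry<Long, Long> w : startByEnd.entrySet()) {
            int n = 0;
            for (Click c : clicks)
                if (c.ts > w.getValue() && c.ts < w.getKey()) n++;
            counts.add(n);
        }
        return counts;  // Step 3 (AGG4) averages these counts across all users.
    }
}

On Alice's clicks above, this yields one window (04:05:07 to 04:07:06) containing only the Sports click, hence a count of 1, matching the tables.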
Hand-coded MR jobs vs. Hive on this query (Q-CSA):
• Hive compiles the plan into 6 MR jobs; hand-coding needs only 2 MR jobs
• These 5 MR jobs (Join1, AGG1, AGG2, Join2, and AGG3) are correlated (sharing the same partition key, user), and thus can be merged into 1 MR job
(Chart: execution time in seconds; the hand-coded MR jobs are ~3x faster than Hive.)

Exp2: Evaluation Environment
Amazon EC2 cluster
• 11 nodes
• 20 GB click-stream dataset
• Compare the execution times of YSmart, Hive, and Pig
Facebook cluster
• 747 nodes
• 8 cores per node
• 1 TB dataset of the TPC-H benchmark
• Compare the execution times of YSmart and Hive
• Every query was executed three times for both YSmart and Hive

Amazon EC2 11-node cluster
Query: "what is the average number of pages a user visits between a page in category 'X' and a page in category 'Y'?"
In YSmart, JOIN1, AGG1, AGG2, JOIN2, and AGG3 are executed in a single MR job.
(Chart: execution time in seconds; YSmart outperforms Hive by 4.8x and Pig by 8.4x.)

Facebook Cluster
Query: TPC-H Q21, the "Suppliers Who Kept Orders Waiting" query
(Chart: execution time in seconds for three instances of Q21; YSmart outperforms Hive by 3.1x, 3.7x, and 3.3x on instances 1, 2, and 3.)