Maximizing Network and Storage Performance for Big Data Analytics
Xiaodong Zhang, Ohio State University
Collaborators: Rubao Lee, Ying Huai, Tian Luo, Yuan Yuan (Ohio State University); Yongqiang He and the Data Infrastructure Team, Facebook; Fusheng Wang, Emory University; Zhiwei Xu, Institute of Computing Technology, Chinese Academy of Sciences

Digital Data Explosion in Human Society
The global storage capacity, and the amount of digital information created and replicated in a year:
• 1986: analog storage 2.62 billion GB; digital storage 0.02 billion GB
• 2007: analog storage 18.86 billion GB; digital storage 276.12 billion GB, of which PC hard disks account for 123 billion GB (44.5%)
Source: Exabytes: Documenting the 'digital age' and huge growth in computing capacity, The Washington Post

Challenge of Big Data Management and Analytics (1)
Existing DB technology is not prepared for the huge volume
• Until 2007, Facebook had a 15TB data warehouse from a big DBMS vendor
• Now, ~70TB of compressed data is added to the Facebook data warehouse every day (4x the total capacity of its data warehouse in 2007)
• Commercial parallel DBs rarely have 100+ nodes
• Yahoo!'s Hadoop cluster has 4000+ nodes; Facebook's data warehouse has 2750+ nodes; Google sorted 10 PB of data on a cluster of 8,000 nodes (2011)
• Typical science and medical research examples:
– The Large Hadron Collider at CERN generates over 15 PB of data per year
– The Pathology Analytical Imaging Standards databases at Emory have reached 7TB and are heading to the PB scale

Challenge of Big Data Management and Analytics (2)
Big data is about all kinds of data
• Online services (social networks, retailers, ...) focus on big data from online and offline click-streams for deep analytics
• Medical image analytics is crucial to both biomedical research and clinical diagnosis
Complex analytics to gain deep insights from big data
• Data mining
• Pattern recognition
• Data fusion and integration
• Time series analysis
Goal: gain deep insights and new knowledge

Challenge of Big Data Management and Analytics (3)
The conventional database business model is not affordable
• Expensive software licenses (e.g. Oracle DB, $47,500 per processor, 2010)
• High maintenance fees, even for open source DBs
• Storing and managing data in such a system costs at least $10,000/TB*
• In contrast, Hadoop-like systems cost only about $1,500/TB**
Increasingly more non-profit organizations work on big data: hospitals, bio-research institutions, social networks, online services, ...
Low-cost software infrastructure is the key.
*: http://www.dbms2.com/2010/10/15/pricing-of-data-warehouse-appliances/
**: http://www.slideshare.net/jseidman/data-analysis-with-hadoop-and-hive-chicagodb-2212011

Challenge of Big Data Management and Analytics (4)
The conventional parallel processing model is "scale-up" based
• BSP model, CACM, 1990: optimizations in both hardware and software
• Hardware: a low ratio of computation to communication, fast locks, large caches and memory
• Software: overlapping computation and communication, exploiting locality, co-scheduling, ...
The big data processing model is "scale-out" based
• DOT model, SOCC'11: hardware-independent software design
• Scalability: maintain sustained throughput growth by continuously adding low-cost computing and storage nodes to the distributed system
• Constraints in computing patterns: communication- and data-sharing-free
The MapReduce programming model has become an effective data processing engine for big data analytics.
Why MapReduce?
A simple but effective programming model designed to process huge volumes of data concurrently.
Two unique properties
• Minimum dependency among tasks (almost nothing is shared)
• Simple task operations in each node (low-cost machines are sufficient)
Two strong merits for big data analytics
• Scalability (Amdahl's Law): increase throughput by increasing the number of nodes
• Fault tolerance: quick and low-cost recovery from task failures
Hadoop is the most widely used implementation of MapReduce
• used in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, eBay, Facebook, IBM, NY Times, Yahoo!, ...

An Example of a MapReduce Job on Hadoop
Calculate the average salary of each of two organizations in a huge file: {name: (org., salary)} => {org.: avg. salary}
• Original key/value pairs: every person's name associated with an org name and a salary, e.g. Alice: (Org-1, 3000), Bob: (Org-2, 3500), ...
• Result key/value pairs: two entries, each showing an org name and its average salary
The input file is stored as blocks in the Hadoop Distributed File System (HDFS).
• Map phase: each map task takes HDFS blocks as its input and extracts {org.: salary} as new key/value pairs, e.g. {Alice: (org-1, 3000)} becomes {org-1: 3000}; three map tasks process the input data concurrently
• Shuffle phase: the map output is shuffled using the org name as the Partition Key (PK), so the records of "org-1" and the records of "org-2" go to different reduce tasks
• Reduce phase: one reduce task calculates the average salary for "org-1", another for "org-2", and the results are written back to HDFS

Key/Value Pairs in MapReduce
A simple but effective programming model designed to process huge volumes of data concurrently on a cluster.
Map: (k1, v1) => (k2, v2), e.g. (name, org & salary) => (org, salary)
Reduce: (k2, v2) => (k3, v3), e.g. (org, salary) => (org, avg. salary)
Shuffle: Partition Key (it may or may not be the same as k2)
• The Partition Key determines how a key/value pair in the map output is transferred to a reduce task
• e.g. the org name is used to partition the map output file accordingly
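To make the example concrete, here is a minimal sketch of what the map and reduce functions of this average-salary job could look like on Hadoop. The class names (AvgSalary, OrgSalaryMapper, AvgReducer) and the assumed input format (one "name|org|salary" record per text line) are illustrative assumptions, not code from the original system.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgSalary {

    // Map: (name, (org, salary)) => (org, salary)
    public static class OrgSalaryMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text org = new Text();
        private final DoubleWritable salary = new DoubleWritable();

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().split("\\|");   // [name, org, salary]
            org.set(tokens[1]);
            salary.set(Double.parseDouble(tokens[2]));
            context.write(org, salary);                        // shuffled by org
        }
    }

    // Reduce: (org, [salaries]) => (org, avg. salary)
    public static class AvgReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text org, Iterable<DoubleWritable> salaries, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable s : salaries) {
                sum += s.get();
                count++;
            }
            context.write(org, new DoubleWritable(sum / count));
        }
    }
}

With Hadoop's default hash partitioner, the map output key (the org name) also acts as the Partition Key, so all salaries of one organization reach the same reduce task, exactly as in the shuffle step described above.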
MR (Hadoop) Job Execution Patterns
Data is stored in a distributed file system (e.g. the Hadoop Distributed File System). The master node does control-level work, such as job scheduling and task assignment; the worker nodes do the data processing work specified by the Map or Reduce function. The execution of a MR job involves 6 steps:
1: Job submission: a MR program (job) consisting of Map tasks and Reduce tasks is submitted
2: Task assignment: the master node assigns tasks to worker nodes
3: Map phase: concurrent map tasks process the input data
4: Shuffle phase: the map output is shuffled to different reduce tasks based on Partition Keys (PKs), usually the map output keys
5: Reduce phase: concurrent reduce tasks process the shuffled data
6: Output: the reduce output is stored back to the distributed file system
A MapReduce (MR) job is resource-consuming:
1: Input data scans in the Map phase => local or remote I/Os
2: Storing intermediate results of the Map output => local I/Os
3: Transferring data in the Shuffle phase => network costs
4: Storing the final results of the MR job => local I/Os + network costs (replicated data)

Two Critical Challenges in Production Systems
Background: conventional databases have been moved to the MapReduce environment, e.g. Hive (Facebook) and Pig (Yahoo!).
Challenge 1: How to initially store the data in distributed systems
• subject to minimizing network and storage costs
Challenge 2: How to automatically convert relational database queries into MapReduce jobs
• subject to minimizing network and storage costs
By addressing these two challenges, we maximize
• the performance of big data analytics
• the productivity of big data analytics

Challenge 1: Four Requirements of Data Placement
Data loading (L)
• the overhead of writing data to the distributed file system and local disks
Query processing (P)
• local storage bandwidth for query processing
• the amount of network transfers
Storage space utilization (S)
• the data compression ratio
• the convenience of applying efficient compression algorithms
Adaptivity to dynamic workload patterns (W)
• additional overhead on certain queries
Objective: to design and implement a data placement structure meeting these requirements in MapReduce-based data warehouses

Initial Stores of Big Data in a Distributed Environment
The NameNode (a part of the master node) distributes HDFS (Hadoop Distributed File System) blocks to DataNodes, e.g. Block 1 to DataNode 1, Block 2 to DataNode 2, Block 3 to DataNode 3.
Users have a limited ability to specify a customized data placement policy
• e.g. to specify which blocks should be co-located
Goal: minimizing I/O costs on local disks and intra-network communication
MR programming is not that "simple"!
This complex code is for a simple MR job (TPC-H Q18, Job 1):

package tpch;

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Q18Job1 extends Configured implements Tool {

    public static class Map extends Mapper<Object, Text, IntWritable, Text> {
        private final static Text value = new Text();
        private IntWritable word = new IntWritable();
        private String inputFile;
        private boolean isLineitem = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
            if (inputFile.compareTo("lineitem.tbl") == 0) {
                isLineitem = true;
            }
            System.out.println("isLineitem:" + isLineitem + " inputFile:" + inputFile);
        }

        public void map(Object key, Text line, Context context) throws IOException, InterruptedException {
            String[] tokens = (line.toString()).split("\\|");
            if (isLineitem) {
                word.set(Integer.valueOf(tokens[0]));
                value.set(tokens[4] + "|l");
                context.write(word, value);
            } else {
                word.set(Integer.valueOf(tokens[0]));
                value.set(tokens[1] + "|" + tokens[4] + "|" + tokens[3] + "|o");
                context.write(word, value);
            }
        }
    }

    public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
        private Text result = new Text();

        public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            double sumQuantity = 0.0;
            IntWritable newKey = new IntWritable();
            boolean isDiscard = true;
            String thisValue = new String();
            int thisKey = 0;
            for (Text val : values) {
                String[] tokens = val.toString().split("\\|");
                if (tokens[tokens.length - 1].compareTo("l") == 0) {
                    sumQuantity += Double.parseDouble(tokens[0]);
                } else if (tokens[tokens.length - 1].compareTo("o") == 0) {
                    thisKey = Integer.valueOf(tokens[0]);
                    thisValue = key.toString() + "|" + tokens[1] + "|" + tokens[2];
                } else {
                    continue;
                }
            }
            if (sumQuantity > 314) {
                isDiscard = false;
            }
            if (!isDiscard) {
                thisValue = thisValue + "|" + sumQuantity;
                newKey.set(thisKey);
                result.set(thisValue);
                context.write(newKey, result);
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: Q18Job1 <orders> <lineitem> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "TPC-H Q18 Job1");
        job.setJarByClass(Q18Job1.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Q18Job1(), args);
        System.exit(res);
    }
}

Low productivity! We all want to simply write: "SELECT * FROM Book WHERE price > 100.00".
Challenge 2: High-Quality MapReduce Jobs in Automation
Instead of hand-writing MR programs (jobs), a job description in a SQL-like declarative language is fed to a SQL-to-MapReduce translator, which generates the MR programs (jobs) that run on the workers over the Hadoop Distributed File System (HDFS). The translator is an interface between users and MR programs (jobs).
• Hive: a data warehousing system (Facebook)
• Pig: a high-level programming environment (Yahoo!)
Both improve productivity over hand-coding MapReduce programs
• 95%+ of Hadoop jobs in Facebook are generated by Hive
• 75%+ of Hadoop jobs in Yahoo! are invoked by Pig*
* http://hadooplondon.eventbrite.com/

Outline
RCFile: a fast and space-efficient placement structure
• Re-examination of existing structures
• A mathematical model as the analytical basis for RCFile
• Experiment results
YSmart: a highly efficient query-to-MapReduce translator
• Correlation-awareness is the key
• Fundamental rules in the translation process
• Experiment results
Impact of RCFile and YSmart in production systems
Conclusion

Row-Store: Merits/Limits with MapReduce
Consider a table with columns A (101, 102, 103, 104, 105, ...), B (201, 202, ...), C (301, 302, ...), and D (401, 402, ...). A row-store lays out complete rows in HDFS blocks.
• Data loading is fast (no additional processing); all columns of a data row are located in the same HDFS block
• Not all columns are used (unnecessary storage bandwidth is consumed)
• Compressing columns of different types together may add additional overhead
Distributed among nodes, each DataNode holds blocks containing values of all columns A, B, C and D.

Column-Store: Merits/Limits with MapReduce
The same table can be stored in column groups, e.g. column group 1: A; column group 2: B; column group 3: C and D, with each group in its own HDFS blocks.
• Unnecessary I/O costs can be avoided: only the needed columns are loaded, and compression is easy
• Additional network transfers are needed for column grouping (row construction)
Distributed among nodes, different DataNodes hold the blocks of different columns (or column groups).

Optimization of the Data Placement Structure
Consider the four requirements comprehensively. The optimization problem becomes:
• In an environment with a dynamic workload (W) and with a suitable data compression algorithm (S) to improve the utilization of data storage, find a data placement structure (DPS) that minimizes the processing time of a basic operation (OP) on a table (T) with n columns
Two basic operations
• Write: the essential operation of data loading (L)
• Read: the essential operation of query processing (P)
Write Operations
To write, a table with columns A, B, C and D is loaded into HDFS blocks according to the data placement structure, and the HDFS blocks are distributed to the DataNodes.

Read Operation in Row-Store
Local rows are read concurrently and unneeded columns are discarded: a query that needs columns A and C still reads B and D from every DataNode.

Read Operations in Column-Store
With column groups (group 1: A; group 2: B; group 3: C and D), a projection on A and C, or on C and D, reads the groups from different DataNodes and transfers them to a common place via the network for row construction.

An Expected-Value-Based Statistical Model
In probability theory, the expected value of a random variable is the weighted average of all possible values the random variable can take on.
The estimated big data access time consists of the time of a "read" T(r) and a "write" T(w), each with a probability (p(r) and p(w)). Since p(r) + p(w) = 1, and read and write have equal weights, the estimated access time is the weighted average
E[T] = p(r) * T(r) + p(w) * T(w)

Overview of the Optimization Model
\[
\min_{DPS} E(op \mid DPS) = \omega_w\, E(write \mid DPS) + \omega_r\, E(read \mid DPS), \qquad \omega_r + \omega_w = 1,
\]
where \(\omega_w\) and \(\omega_r\) are the probabilities of write and read operations, and \(E(write \mid DPS)\) and \(E(read \mid DPS)\) are the processing times of a write and a read operation under a data placement structure (DPS).
The majority of operations are reads, so \(\omega_r \gg \omega_w\): ~70TB of compressed data is added per day, while PB-level compressed data is scanned per day in Facebook (90%+). We therefore focus on minimizing the processing time of read operations.

Modeling the Processing Time of a Read Operation
The number of column combinations a query on an n-column table may need is up to
\[
\binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n - 1.
\]
The expected processing time of a read operation is
\[
E(read \mid DPS) = \sum_{i=1}^{n} \sum_{j=1}^{\binom{n}{i}} f(i,j,n)\left( \alpha(DPS)\,\frac{S}{\rho\, B_{local}} + \lambda(DPS,i,j,n)\,\frac{i}{n}\cdot\frac{S}{\rho\, B_{network}} \right),
\]
where
• n: the number of columns of table T
• i: the number of columns a query needs; j: the j-th column combination among all combinations with i columns
• f(i, j, n): the frequency of occurrence of the j-th column combination among all combinations with i columns; we fix it as a constant to represent an environment with a highly dynamic workload, so that \(\sum_{i=1}^{n}\sum_{j=1}^{\binom{n}{i}} f(i,j,n) = 1\)
• The first term is the time used to read the needed columns from local disks in parallel: S is the size of the compressed table (we assume that, with efficient data compression algorithms and configurations, different DPSs achieve comparable compression ratios); \(B_{local}\) is the bandwidth of local disks; \(\rho\) is the degree of parallelism, i.e. the total number of concurrent nodes
• \(\alpha(DPS)\): read efficiency, the fraction of columns read from local disks; \(\alpha = i/n\) for column-store (only the necessary columns are read) and \(\alpha = 1\) for row-store (all columns are read, including unnecessary ones)
• The second term is the extra time spent on network transfers for row construction
• \(\lambda(DPS, i, j, n)\): communication overhead, the additional network transfers for row construction; \(\lambda = 0\) when all needed columns are in one node (no communication) and \(\lambda = \beta\) when at least two needed columns are in two different nodes and data is transferred via the network
• \(\beta\): the percentage of data transferred via the network (DPS- and workload-dependent), 0% ≤ β ≤ 100%
• \(B_{network}\): the bandwidth of the network

Finding the Optimal Data Placement Structure

                              Row-Store       Column-Store          Ideal
Read efficiency α             1               i/n (optimal)         i/n (optimal)
Communication overhead λ      0 (optimal)     β (0% ≤ β ≤ 100%)     0 (optimal)

Can we find a data placement structure with both optimal read efficiency and optimal communication overhead?
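To make the trade-off concrete, here is a small worked example with purely hypothetical numbers (they are assumptions for illustration, not measurements from this work): a table with n = 10 columns, compressed size S = 100 GB, ρ = 10 nodes, B_local = 100 MB/s per node, B_network = 100 MB/s, and a query that needs i = 2 columns.
• Row-store: α = 1 and λ = 0, so the read time is about S / (ρ · B_local) = 100 GB / (10 × 100 MB/s) ≈ 100 s
• Column-store: α = i/n = 0.2, so the local read takes about 0.2 × 100 s = 20 s; if row construction transfers β = 50% of the needed data over the network, it adds λ · (i/n) · S / (ρ · B_network) = 0.5 × 0.2 × 100 s = 10 s, for a total of about 30 s
• A structure with α = i/n and λ = 0 would finish in about 20 s, which is exactly the combination RCFile is designed to achieve, as described next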
Goals of RCFile
• Eliminate unnecessary I/O costs, like column-store: only read the needed columns from disks
• Eliminate network costs in row construction, like row-store
• Keep the fast data loading speed of row-store
• Conveniently apply efficient data compression algorithms, like column-store
In short, eliminate all the limits of row-store and column-store.

RCFile: Partitioning a Table into Row Groups
An HDFS block consists of one or multiple row groups. A row group holds a horizontal slice of the table, i.e. all columns (A, B, C, D) of a contiguous set of rows.

RCFile: Distributed Row-Group Data among Nodes
For example, if each HDFS block has three row groups, the NameNode places row groups 1-3 on DataNode 1, row groups 4-6 on DataNode 2, and row groups 7-9 on DataNode 3.

RCFile: Inside Each Row Group
A row group starts with metadata, followed by the data of its rows stored column by column: compressed metadata, then compressed column A (e.g. 101-105), compressed column B (201-205), compressed column C (301-305), and compressed column D (401-405).

Benefits of RCFile
Eliminates unnecessary I/O costs
• Within a row group, the table is partitioned by columns
• Only the needed columns are read from disks
Eliminates network costs in row construction
• All columns of a row are located in the same HDFS block
Comparable data loading speed to row-store
• Only a vertical-partitioning operation is added to the data loading procedure of row-store
Efficient data compression algorithms can be applied conveniently
• The compression schemes used in column-store can be reused

Expected Time of a Read Operation (now including RCFile)

                              Row-Store       Column-Store          RCFile
Read efficiency α             1               i/n (optimal)         i/n (optimal)
Communication overhead λ      0 (optimal)     β (0% ≤ β ≤ 100%)     0 (optimal)

RCFile achieves both optimal read efficiency and optimal communication overhead.
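As a toy illustration of the layout just described (not the actual RCFile implementation inside Hive), the sketch below keeps all columns of a group of rows together, so no network transfer is needed to rebuild a row, while storing the values column by column so a reader touches only the columns it needs. The class and method names (RowGroup, append, readColumns) are invented for this sketch.

import java.util.ArrayList;
import java.util.List;

// A toy row group: all columns of its rows live together (no network needed to
// rebuild a row), but values are kept column by column (only needed columns are read).
public class RowGroup {
    private final int numColumns;
    private final List<List<String>> columns = new ArrayList<>();  // one buffer per column

    public RowGroup(int numColumns) {
        this.numColumns = numColumns;
        for (int c = 0; c < numColumns; c++) {
            columns.add(new ArrayList<>());
        }
    }

    // Data loading: like row-store, rows arrive one by one; the only extra work
    // is vertical partitioning into per-column buffers.
    public void append(String[] row) {
        for (int c = 0; c < numColumns; c++) {
            columns.get(c).add(row[c]);
        }
    }

    // Query processing: like column-store, read only the needed columns.
    public List<List<String>> readColumns(int[] needed) {
        List<List<String>> result = new ArrayList<>();
        for (int c : needed) {
            result.add(columns.get(c));   // each column buffer could be compressed on disk
        }
        return result;
    }

    public static void main(String[] args) {
        RowGroup rg = new RowGroup(4);
        rg.append(new String[] {"101", "201", "301", "401"});
        rg.append(new String[] {"102", "202", "302", "402"});
        // A query needing only columns A and C touches two of the four column buffers.
        System.out.println(rg.readColumns(new int[] {0, 2}));
    }
}

Loading remains row-oriented (rows are appended one at a time, with only a vertical-partitioning step added), while reading is column-oriented, which mirrors the two benefits claimed for RCFile above.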
Row-Group Size
Row-group size is a performance-sensitive and workload-dependent parameter. In RCFile, the row-group size is bounded by the size of an HDFS block (64MB or 128MB).
Could a row group be larger than an HDFS block? (VLDB'11)
– Store one column of a row group per HDFS block
– Co-locate the columns of a row group
– This can provide flexibility for changing schemas
• A concern about fault tolerance
– If a node fails, the recovery time is longer than with row groups under the current RCFile restriction
– It needs additional maintenance

Discussion on Choosing the Right Row-Group Size
RCFile restriction: row-group size ≤ HDFS block size
Average row size ↑, number of columns ↑ => row-group size ↑
• Avoid prefetching unnecessary columns caused by short columns
– An unsuitable row-group size may introduce ~10x the data read from disks, including ~90% unnecessary data
• Relatively short columns may also cause inefficient compression
Query selectivity ↑ => row-group size ↓
• High-selectivity queries (a high percentage of rows to read) with a large row-group size cannot take advantage of lazy decompression, but low-selectivity queries can
– Lazy decompression: a column is only decompressed when it is really useful for query execution

Evaluation Environment
Cluster: a 40-node cluster in Facebook
Each node:
• Intel Xeon CPU with 8 cores
• 32GB main memory
• 12 x 1TB disks
• Linux, kernel version 2.6
• Hadoop 0.20.1
The Zebra library* is used for column-store and column-group
• Column-store: an HDFS block only contains the data of a single column
• Column-group: some columns are grouped together and stored in the same HDFS block
The benchmark is from [SIGMOD09]**
*: http://wiki.apache.org/pig/zebra
**: Pavlo et al., "A comparison of approaches to large-scale data analysis," in SIGMOD Conference, 2009, pp. 165-178.

Data Loading Evaluation
Data loading time (s) for loading the table USERVISITS (~120GB):
• Row-store has the fastest data loading time
• RCFile's data loading time is comparable to row-store (7% slower)
• With Zebra (column-store and column-group), each record is written to multiple HDFS blocks => higher network overhead

Storage Utilization Evaluation
Compressed size (GB) of the table USERVISITS (~120GB), using the Gzip algorithm for each structure:
• RCFile has the best data compression ratio (2.24)
• Zebra compresses metadata and data together, which is less efficient

A Case of Query Execution Time Evaluation
Query: SELECT pagerank, pageurl FROM RANKING WHERE pagerank < 400
In column-store, the two columns are stored in different HDFS blocks; in column-group, they are in the same column group, i.e. stored in the same HDFS block.
• Column-group has the fastest query execution time
• RCFile's execution time is comparable to column-group (3% slower)

Facebook Data Analytics Workloads Managed by RCFile
Reporting
• e.g. daily/weekly aggregations of impression/click counts
Ad hoc analysis
• e.g. geographical distributions and activities of users in the world
Machine learning
• e.g. online advertising optimization and effectiveness studies
Many other data analysis tasks on user behavior and patterns
User workloads and the related analysis cannot be published; an RCFile evaluation with publicly available workloads shows excellent performance (ICDE'11).

RCFile in Facebook
The web servers (the interface to 800+ million users) generate a large amount of log data; data loaders write ~70TB of compressed data per day into the RCFile-based warehouse. Warehouse capacity: 21PB in May 2010, 30PB+ today.
Picture source: Visualizing Friendships, http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

Summary of RCFile
The data placement structure lays a foundation for MapReduce-based big data analytics. Our optimization model shows that RCFile meets all the basic requirements.
RCFile: an operational system for daily tasks of big data analytics
• A part of Hive, a data warehouse infrastructure on top of Hadoop
• The default option for the Facebook data warehouse
• Integrated into Apache Pig since version 0.7.0 (for expressing data analytics tasks and producing MapReduce programs)
• Customized RCFile systems exist for special applications
We are refining RCFile and the optimization model to make RCFile a standard data placement structure for big data analytics.

Outline
RCFile: a fast and space-efficient placement structure
• Re-examination of existing structures
• A mathematical model as the basis of RCFile
• Experiment results
YSmart: a highly efficient query-to-MapReduce translator
• Correlation-awareness is the key
• Fundamental rules in the translation process
• Experiment results
Impact of RCFile and YSmart in production systems
Conclusion

Translating SQL-like Queries to MapReduce Jobs: The Existing Approach
"Sentence by sentence" translation
• [C. Olston et al., SIGMOD 2008], [A. Gates et al., VLDB 2009] and [A. Thusoo et al., ICDE 2010]
• Implementation: Hive and Pig
Three steps
• Identify the major sentences with operations that shuffle the data
– such as Join, Group by and Order by
• For every operation in a major sentence that shuffles the data, generate a corresponding MR job
– e.g. a join op. => a join MR job
• Add other operations, such as selection and projection, into the corresponding MR jobs
Existing SQL-to-MapReduce translators give unacceptable performance.

An Example: TPC-H Q21
One of the most complex and time-consuming queries in the TPC-H benchmark for data warehousing performance.
Optimized MR jobs vs. Hive in a Facebook production cluster: the optimized MR jobs are 3.7x faster (execution time in minutes). What's wrong?

The Execution Plan of TPC-H Q21
The plan is an operator tree (SORT and AGG3 at the top, a left-outer join Join4, joins Join1-Join3 and aggregations AGG1 and AGG2 over the tables lineitem, orders, supplier and nation). The only difference is that Hive handles one subtree in a different way than the optimized MR jobs do, and that subtree dominates the execution time (~90% of the total).

However, inter-job correlations exist: look at the partition keys. Viewed as a composite MR job, the subtree contains a JOIN MR job J1 (over lineitem and orders), an AGG MR job J2 (over lineitem), and jobs J3, J4 and J5; J1 to J5 all use the same partition key, l_orderkey, and J1, J2 and J4 all need the same input table, lineitem.
What's wrong with existing SQL-to-MR translators? They are correlation-unaware:
1. They ignore common data inputs
2. They ignore common data transitions

Performance Approaches of Big Data Analytics in MR: The Landscape
Hand-coding MR jobs
• Pro: high-performance MR programs
• Con: low productivity: lots of coding even for a simple job, redundant coding is inevitable, and it is hard to debug [J. Tan et al., ICDCS 2010]
Existing SQL-to-MR translators
• Pro: easy programming, high productivity
• Con: poor performance on complex queries (complex queries are usual in daily operations)
A correlation-aware SQL-to-MR translator aims at both high performance and high productivity.

Our Approaches and Critical Challenges
SQL-like queries go through a correlation-aware SQL-to-MR translator: generate primitive MR jobs, identify correlations, and merge correlated MR jobs into the MR jobs with the best performance.
1: Correlation possibilities and their detection
2: Rules for automatically exploiting correlations
3: Implementing high-performance and low-overhead MR jobs

Input Correlation (IC)
Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint, e.g. J1 (over lineitem and orders) and J2 (over lineitem) share the input relation lineitem: a shared input relation set feeds the map functions of both MR Job 1 and MR Job 2.
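A rough sketch of how input correlation can be exploited, in the spirit of the hand-written Q18Job1 code shown earlier: one map function scans the shared table a single time and tags each emitted record with the logical job it belongs to, so that a single MR job can replace two correlated jobs. The class name (MergedMapper), the tags ("j1", "j2", "orders") and the chosen fields are illustrative assumptions, not YSmart's actual generated code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One map function serving two correlated jobs that both scan 'lineitem'
// and both partition by l_orderkey: the shared input is read only once,
// and each output value is tagged so the reduce side can tell the
// logical jobs apart.
public class MergedMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable orderKey = new IntWritable();
    private final Text tagged = new Text();
    private boolean isLineitem;

    @Override
    protected void setup(Context context) {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        isLineitem = file.startsWith("lineitem");
    }

    @Override
    protected void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split("\\|");
        orderKey.set(Integer.parseInt(tokens[0]));   // l_orderkey: the shared partition key
        if (isLineitem) {
            // The same scan feeds both logical jobs (e.g. a join and an aggregation).
            tagged.set("j1|" + tokens[4]);           // record for logical job J1
            context.write(orderKey, tagged);
            tagged.set("j2|" + tokens[4]);           // record for logical job J2
            context.write(orderKey, tagged);
        } else {
            tagged.set("orders|" + tokens[1]);       // the other input relation
            context.write(orderKey, tagged);
        }
    }
}

A single reduce function can then branch on the tag, which is essentially how the hand-optimized TPC-H Q18 job above folds a join and an aggregation that share the partition key l_orderkey into one job.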
Transit Correlation (TC)
Multiple MR jobs have transit correlation (TC) if
• they have input correlation (IC), and
• they have the same Partition Key.
e.g. J1 (over lineitem and orders) and J2 (over lineitem) both use l_orderkey as the partition key over their shared input relation set (each record consists of the partition key plus other data).

Job Flow Correlation (JFC)
A MR job has job flow correlation (JFC) with one of its child MR jobs if it has the same partition key as that job, e.g. J1 (over lineitem and orders) consumes the output of MR Job 2 (J2) together with other data, and the map and reduce functions of both jobs use the same partition key.

Query Optimization Rules for Automatically Exploiting Correlations
• Exploit both input correlation and transit correlation
• Exploit the job flow correlation associated with aggregation jobs
• Exploit the job flow correlation associated with JOIN jobs and their transit-correlated parent jobs
• Exploit the job flow correlation associated with JOIN jobs

Exp1: Four Cases of TPC-H Q21
1: Sentence-to-sentence translation: 5 MR jobs
2: Input correlation + transit correlation: 3 MR jobs
3: Input correlation + transit correlation + job flow correlation: 1 MR job
4: Hand-coding (similar to case 3): in the reduce function, the code is optimized according to the query semantics

Breakdowns of Execution Time (sec)
The per-job map and reduce time breakdowns annotate two reductions in total execution time, from 888s to 510s and from 768s to 567s, as more correlations are exploited; the fully correlation-aware case is within only 17% of hand-coding.

Exp2: Clickstream Analysis
A typical query in production clickstream analysis: "what is the average number of pages a user visits between a page in category 'X' and a page in category 'Y'?"
In YSmart, JOIN1, AGG1, AGG2, JOIN2 and AGG3 are executed in a single MR job. On this query, YSmart outperforms Hive by 4.8x and Pig by 8.4x.

YSmart in the Hadoop Ecosystem
YSmart is being integrated into Hive (see patch HIVE-2206 at apache.org): Hive + YSmart runs on top of the Hadoop Distributed File System (HDFS).

Summary of YSmart
• YSmart is a correlation-aware SQL-to-MapReduce translator
• YSmart can outperform Hive by 4.8x and Pig by 8.4x
• YSmart is being integrated into Hive
• The standalone version of YSmart was released in January 2012
YSmart translates SQL-like queries into MapReduce jobs for a Hadoop-powered data warehousing system (Hive) whose data is stored in RCFile, behind the web servers that serve 800M Facebook users.

Conclusion
We have contributed two important system components, RCFile and YSmart, on the critical path of the big data analytics ecosystem.
The ecosystem of Hadoop-based big data analytics has been created: Hive and Pig will soon merge into a unified system.
RCFile and YSmart provide two basic standards in this new ecosystem.
Thank You!