A Hadoop MapReduce Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China

Background
• Hadoop MapReduce job
  [Figure: input data in HDFS is split, each split is processed by a Map task, the (key, value) output is partitioned, and each partition is merged by a Reduce task]

Background
• Hadoop
  – The Map stage and the Reduce stage each contain many internal steps
  – Different steps consume different types of resources
  [Figure: Map-task pipeline — read, map, sort, merge, output]

Motivation
• Problems
  – Scheduling: the scheduler takes no account of execution time or of the different types of resources consumed, so e.g. two CPU-intensive jobs may be scheduled together
  – Parameter tuning: Hadoop exposes numerous parameters, and the default values are not optimal, yet jobs are run with the default configuration

Motivation
• Solution
  – Predict the performance of Hadoop jobs, to guide both scheduling and parameter tuning

Related Work
• Existing prediction method 1: black-box based
  – Feed job features into statistical or machine-learning models to predict the execution time
  – It is hard to choose the job features, and there is no analysis of how Hadoop actually processes the job

Related Work
• Existing prediction method 2: cost-model based
  – F(map) = f(read, map, sort, spill, merge, write)
  – F(reduce) = f(read, write, merge, reduce, write)
  – Many concurrent processes make the stages hard to divide, so accuracy is difficult to ensure

Related Work
• A brief summary of existing prediction methods

  Black box
    Advantages: simple and effective; high accuracy when jobs are highly isomorphic (very similar)
    Shortcomings: no job-feature extraction; no analysis of Hadoop's execution

  Cost model
    Advantages: detailed analysis of Hadoop processing; flexible division (by stage, by resource); multiple prediction targets
    Shortcomings: no job-feature extraction; hard to divide each step and resource; many concurrent steps are hard to model, so it is better suited to theoretical analysis than to prediction

  Common limitations: simple, single-valued prediction; no analysis of the job itself (jar package + input data)

Goal
• Design a Hadoop MapReduce performance prediction system that takes a job as input and predicts:
  – the job's consumption of each type of resource (CPU, disk I/O, network)
  – the execution time of the Map phase and the Reduce phase
  Outputs: Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time

Design - 1
• Cost model
  [Figure: Job → COST MODEL → Map execution time, Reduce execution time, CPU / disk / network occupation time]

Cost Model [1]
• Analysis of the Map task
  – Model the consumption of each resource (CPU, disk, network)
  – Each stage involves only one type of resource
  [Figure: Map-task timeline with one lane per resource (CPU, disk, network); stages shown: initiation, read data, network transfer, create object, map function, sort in memory, merge sort, serialization, read/write disk, write disk]

Cost Model [1]
• Cost-function parameter analysis
  – Type One: constants
    • Hadoop system overhead, initialization overhead
  – Type Two: job-related parameters
    • computational complexity of the map function, number of map input records
  – Type Three: parameters defined by the cost model
    • sorting coefficient, complexity factor

[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.
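The cost model is only outlined on the slides; the minimal Java sketch below shows one way its three parameter types could be combined into per-resource occupation times for the Map phase. This is not the authors' implementation of [1]: the class, field and method names (MapCostModel, sortCoeff, diskRate, ...) are illustrative assumptions, and the grouping of stages per resource follows the timeline figure only loosely.

```java
// Minimal sketch (assumed names, not the code of [1]): per-resource occupation
// times for the Map phase, built from the three parameter types of the slides.
public class MapCostModel {
    // Type One: constants measured once per cluster (e.g. from empty-map runs).
    double initCost;          // Hadoop system / initialization overhead
    // Type Three: coefficients defined by the cost model itself.
    double sortCoeff;         // sorting coefficient
    double complexityFactor;  // complexity factor

    /** CPU occupation time: initialization + map function + in-memory sort.
     *  records and mapComplexity are Type Two (job-related) parameters. */
    double cpuTime(long records, double mapComplexity) {
        double sort = sortCoeff * records * Math.log(Math.max(records, 2));
        return initCost + complexityFactor * mapComplexity * records + sort;
    }

    /** Disk occupation time: read the input split plus write the map output;
     *  conversionRate (output/input) is a Type Two parameter, diskRate an
     *  assumed throughput in bytes per second. */
    double diskTime(long inputBytes, double conversionRate, double diskRate) {
        long outputBytes = (long) (inputBytes * conversionRate);
        return (inputBytes + outputBytes) / diskRate;
    }

    /** Network occupation time: transfer of any non-local part of the input. */
    double networkTime(long remoteBytes, double netRate) {
        return remoteBytes / netRate;
    }
}
```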
Parameters Collection
• Type One and Type Three
  – Type One: run empty map tasks and compute the system overhead from the logs
  – Type Three: extract the sort code from the Hadoop source and sort a fixed number of records
• Type Two
  – Option 1: run a new job and analyze its log → high latency, large overhead
  – Option 2: sample the input data and analyze only the behavior of the map and reduce functions → almost no latency, very low extra overhead → the Job Analyzer

Job Analyzer - Implementation
• Job Analyzer – implementation
  – Hadoop virtual execution environment
    • accepts the job jar file and the input data
  – Sampling module
    • samples the input data at a fixed rate (less than 5%)
  – MR module
    • instantiates the user job's classes using Java reflection (illustrated in the code sketch further below)
  – Analyze module
    • input data (amount and number of records)
    • relative computational complexity
    • data conversion rate (output/input)
  [Figure: jar file + input data → Hadoop virtual execution environment (sampling module, MR module, analyze module) → job features]

Job Analyzer - Feasibility
  – Data similarity: log records have a uniform format
  – Execution similarity: every record is processed repeatedly by the same map and reduce functions
  [Figure: the input data is split and every split flows through identical Map and Reduce functions]

Design - 2
• Parameters collection
  – Job Analyzer: collects the Type Two parameters
  – Static parameters collection module: collects the Type One and Type Three parameters
  [Figure: both feed the COST MODEL, which outputs Map execution time, Reduce execution time, CPU / disk / network occupation time]

Prediction Model
• Problem analysis
  – Many steps run concurrently, so the total time cannot be obtained by adding up the time of each part
  [Figure: Map-task timeline with overlapping CPU, disk and network stages]

Prediction Model
• Main factors (according to the performance model) - Map stage
  – the amount of input data
  – the number of input records (N)
  – N log N (sorting)
  – the complexity of the map function
  – the conversion rate of the map data

  Tmap = α0 + α1·MapInput + α2·N + α3·N·log(N) + α4·(complexity of the map function) + α5·(conversion rate of the map data)

Prediction Model
• Experimental analysis
  – Run 4 kinds of jobs (0–10000 records)
  – Extract the features and fit a linear regression
  – Compute the correlation coefficient (R2)

  Job        R2
  Dedup      0.9982
  WordCount  0.9992
  Project    0.9991
  Grep       0.9949
  Total      0.6157

Prediction Model
  [Figure: Map execution time vs. number of records for Dedup, Grep, Project and WordCount]
  – Very good linear relationship within the same kind of job
  – But no linear relationship across different kinds of jobs
  → Find the nearest jobs!
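To make the Job Analyzer more concrete, here is a minimal Java sketch (not the authors' implementation) of its three modules: load the user's classes from the jar by reflection (MR module), keep a small sample of the input (sampling module), and time the map function on that sample to estimate the relative complexity and the data conversion rate (analyze module). The MapFunction interface, the class names and the 5% sampling rate are simplifying assumptions; the real system drives Hadoop Mapper/Reducer classes inside a virtual execution environment.

```java
// Hedged sketch of the Job Analyzer idea: reflection + sampling + measurement.
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class JobAnalyzerSketch {

    /** Simplified stand-in for the user-defined map function (assumption). */
    public interface MapFunction {
        String map(String record);
    }

    public static void main(String[] args) throws Exception {
        Path jar = Paths.get(args[0]);     // user job jar file
        String mapperClass = args[1];      // fully qualified class name
        Path input = Paths.get(args[2]);   // input data file

        // MR module: instantiate the user's class by reflection.
        URLClassLoader loader = new URLClassLoader(new URL[]{jar.toUri().toURL()});
        MapFunction mapper = (MapFunction) loader
                .loadClass(mapperClass).getDeclaredConstructor().newInstance();

        // Sampling module: keep roughly 5% of the records.
        List<String> sample = Files.lines(input)
                .filter(r -> Math.random() < 0.05)
                .collect(Collectors.toList());

        // Analyze module: per-record cost and output/input conversion rate.
        long inBytes = 0, outBytes = 0;
        long start = System.nanoTime();
        for (String record : sample) {
            inBytes += record.length();
            outBytes += mapper.map(record).length();
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("records sampled      : %d%n", sample.size());
        System.out.printf("avg ns per record    : %.1f%n", (double) elapsed / sample.size());
        System.out.printf("data conversion rate : %.3f%n", (double) outBytes / inBytes);
    }
}
```

The per-record time stands in for the relative computational complexity of the map function, and the byte ratio for the data conversion rate; both are the Type Two parameters the cost and prediction models need.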
Prediction Model
• Instance-based linear regression (sketched in code after the last slide)
  – Find the samples nearest to the job to be predicted in the history logs
  – "Nearest" → similar jobs (top-K nearest, with K = 10%–15% of the history)
  – Fit a linear regression on the samples found
  – Compute the predicted value
• Nearest:
  – the weighted distance between job features (weights w)
  – high contribution to job classification: map/reduce complexity, map/reduce data conversion rate
  – low contribution to job classification: data amount, number of records

Prediction Module
• Procedure
  [Figure: job features → search for the nearest samples → main factors → cost model → Tmap = α0 + α1·MapInput + α2·N + α3·N·log(N) + α4·(complexity of the map function) + α5·(conversion rate of the map data) → prediction function → prediction results]

Prediction Module
• Procedure
  [Figure: cost model + find-neighbor module + training set → prediction function → prediction results]

Design - 3
• Parameters collection
  – Job Analyzer: collects the Type Two parameters
  – Static parameters collection module: collects the Type One and Type Three parameters
  [Figure: both feed the COST MODEL, followed by the prediction module, which outputs Map execution time, Reduce execution time, CPU / disk / network occupation time]

Experiments
• Task execution time (error rate)
  – K = 12%, with a different weight w for each feature
  – K = 12%, with the same weight w for every feature
  – K = 25%, with a different weight w for each feature
  – 4 kinds of jobs, input sizes from 64 MB to 8 GB
  [Figure: error rate per job ID for Map tasks and Reduce tasks under the three settings above]

Conclusion
• Job Analyzer:
  – analyzes the job jar and the input file
  – collects the parameters
• Prediction module:
  – identifies the main factors
  – proposes a linear prediction equation
  – classifies jobs
  – predicts multiple targets

Thank you! Questions?

Cost Model [1]
• Analysis of the Reduce task
  – Model the consumption of each resource (CPU, disk, network)
  – Each stage involves only one type of resource
  [Figure: Reduce-task timeline with one lane per resource (CPU, disk, network); stages shown: initiation, read data, network transfer, merge sort, read/write disk, serialization, deserialization, create object, reduce function, write disk, network]

Prediction Model
• Main factors (according to the performance model) - Reduce stage
  – the amount of input data
  – the number of input records (N)
  – N log N (sorting)
  – the complexity of the reduce function
  – the conversion rate of the map data
  – the conversion rate of the reduce data

  Treduce = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the reduce function) + β5·(conversion rate of the map data) + β6·(conversion rate of the reduce data)
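As a final illustration, the instance-based linear regression described on the "Instance-Based Linear Regression" slide can be sketched as below, assuming the feature set of the slides (input size, record count, complexity, conversion rate). The class and method names, the feature weights and the value of K are assumptions; the real prediction module applies the same idea to both the Map and Reduce models and to the resource occupation times.

```java
// Hedged sketch: weighted top-K neighbor selection + least-squares fit.
import java.util.Comparator;
import java.util.List;

public class InstanceBasedPredictor {

    /** One historical job: its feature vector and its measured map time. */
    record Sample(double[] features, double time) {}

    // Feature order: inputBytes, records, complexity, conversionRate.
    // Higher weight on complexity / conversion rate, lower on data volume.
    static final double[] WEIGHTS = {0.1, 0.1, 1.0, 1.0};

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += WEIGHTS[i] * diff * diff;
        }
        return Math.sqrt(d);
    }

    /** Predict the map time of `query` from its k nearest historical samples. */
    static double predict(List<Sample> history, double[] query, int k) {
        List<Sample> nearest = history.stream()
                .sorted(Comparator.comparingDouble((Sample s) -> distance(s.features(), query)))
                .limit(k).toList();

        int n = query.length + 1;                 // intercept + one coefficient per feature
        double[][] ata = new double[n][n];
        double[] atb = new double[n];
        for (Sample s : nearest) {                // accumulate normal equations X'X a = X'y
            double[] row = new double[n];
            row[0] = 1;
            System.arraycopy(s.features(), 0, row, 1, query.length);
            for (int i = 0; i < n; i++) {
                atb[i] += row[i] * s.time();
                for (int j = 0; j < n; j++) ata[i][j] += row[i] * row[j];
            }
        }
        double[] coeff = solve(ata, atb);

        double t = coeff[0];
        for (int i = 0; i < query.length; i++) t += coeff[i + 1] * query[i];
        return t;
    }

    /** Gaussian elimination with partial pivoting for the small normal system. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int p = 0; p < n; p++) {
            int max = p;
            for (int i = p + 1; i < n; i++)
                if (Math.abs(a[i][p]) > Math.abs(a[max][p])) max = i;
            double[] tmpRow = a[p]; a[p] = a[max]; a[max] = tmpRow;
            double tmpB = b[p]; b[p] = b[max]; b[max] = tmpB;
            for (int i = p + 1; i < n; i++) {
                double f = a[i][p] / a[p][p];
                b[i] -= f * b[p];
                for (int j = p; j < n; j++) a[i][j] -= f * a[p][j];
            }
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int j = i + 1; j < n; j++) sum -= a[i][j] * x[j];
            x[i] = sum / a[i][i];
        }
        return x;
    }
}
```

A caller would pass the history collected by the Job Analyzer and, mirroring the experiments, something like k = max(1, history.size() * 12 / 100); the fit is local to the selected neighbors, which is what restores the within-class linearity observed in the R2 table above.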