Improving MapReduce Performance in Heterogeneous Environments
Sturzu Antonio-Gabriel, SCPD

Table of Contents
– Intro
– Scheduling in Hadoop
– Heterogeneity in Hadoop
– The LATE Scheduler (Longest Approximate Time to End)
– Evaluation of Performance

Intro
The large volume of data that internet services work on has led to the need for parallel processing.
The leading example is Google, which uses MapReduce to process 20 petabytes of data per day.
MapReduce breaks a computation into small tasks that run in parallel on multiple machines and scales easily to very large clusters.
The two key benefits of MapReduce are:
– Fault tolerance
– Speculative execution

Intro
Google has noted that speculative execution improves response time by 44%.
The paper shows an efficient way to do speculative execution in order to maximize performance.
It also shows that Hadoop's simple speculative algorithm, which compares each task's progress to the average progress, breaks down in heterogeneous systems.

Intro
The proposed scheduling algorithm improves Hadoop's response time by up to a factor of two.
Most of the examples and tests are done on Amazon's Elastic Compute Cloud (EC2).
The paper addresses two important problems in speculative execution:
– Choosing the best node to run the speculative task
– Distinguishing between nodes slightly slower than the mean and stragglers

Scheduling in Hadoop
Hadoop divides each MapReduce job into tasks.
The input file is split into even-sized chunks, replicated for fault tolerance.
Each chunk of input is first processed by a map task that outputs a set of key-value pairs.
Map outputs are split into buckets based on the key.

Scheduling in Hadoop
When all map tasks finish, reducers apply a reduce function on the set of values associated with each key.

Scheduling in Hadoop
Hadoop runs several maps and reduces concurrently on each slave in order to overlap I/O with computation.
Each slave tells the master when it has empty task slots; the master then assigns tasks in priority order:
– First, any failed task is given priority
– Second, a non-running task
– Third, a task to execute speculatively

Scheduling in Hadoop
To select speculative tasks, Hadoop uses a progress score between 0 and 1.
For a map, the progress score is the fraction of input data read.
For a reduce task, the execution is divided into three phases, each of which accounts for 1/3 of the score:
– The copy phase
– The sort phase
– The reduce phase

Scheduling in Hadoop
In each phase the score is the fraction of data processed.
Hadoop calculates an average score for each category of tasks in order to define a threshold for speculative execution.
When a task's progress score is less than the average for its category minus 0.2, and the task has run for at least a minute, it is marked as a straggler (see the sketch below).

Scheduling in Hadoop
All tasks beyond the threshold are considered equally slow, and ties between them are broken by data locality.
This threshold works well in homogeneous systems because tasks tend to start and finish in "waves" at roughly the same times.
When running multiple jobs, Hadoop uses a FIFO discipline.

Scheduling in Hadoop
Assumptions made by the Hadoop scheduler:
– Nodes can perform work at roughly the same rate
– Tasks progress at a constant rate throughout time
– There is no cost to launching a speculative task on a node that would otherwise have an idle slot

Scheduling in Hadoop
– A task's progress score is representative of the fraction of its total work that it has done
– Tasks tend to finish in waves, so a task with a low progress score is likely a straggler
– Tasks in the same category (map or reduce) require roughly the same amount of work
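To make the native heuristic above concrete, here is a minimal sketch in Python (my own illustration, not Hadoop's source code; the names reduce_progress and is_straggler are hypothetical) of the per-phase progress score and the average-minus-0.2 straggler test:

    # Illustrative model of Hadoop's native straggler heuristic; names are hypothetical.
    def reduce_progress(phase, fraction_done):
        """Each reduce phase (copy, sort, reduce) contributes 1/3 of the progress score."""
        phase_index = {"copy": 0, "sort": 1, "reduce": 2}[phase]
        return (phase_index + fraction_done) / 3.0

    def is_straggler(task_score, category_scores, runtime_seconds):
        """Straggler = scores 0.2 below its category average and has run for at least 1 minute."""
        average = sum(category_scores) / len(category_scores)
        return runtime_seconds >= 60 and task_score < average - 0.2

    # A reduce task halfway through its sort phase has score (1 + 0.5) / 3 = 0.5.
    print(reduce_progress("sort", 0.5))              # 0.5
    # With category average 0.6, a task at 0.3 that has run two minutes is a straggler.
    print(is_straggler(0.3, [0.3, 0.7, 0.8], 120))   # True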
Heterogeneity in Hadoop
In a heterogeneous cluster, nodes work at different rates, so assumptions 1 and 2 no longer hold and many tasks on slower nodes fall below the fixed threshold.
Too many speculative tasks are launched because of the fixed threshold (assumption 3 falls).
Because the scheduler uses data locality to rank candidates for speculative execution, the wrong tasks may be chosen first.
Assumptions 3, 4 and 5 fall on both homogeneous and heterogeneous clusters.

The LATE Scheduler
The main idea is to speculatively execute the task that will finish farthest in the future.
The progress rate is estimated as ProgressScore / T, where T is the amount of time the task has been running.
The estimated time to completion is (1 - ProgressScore) / ProgressRate.

The LATE Scheduler
In order to get the best chance of beating the original task that was speculated, the algorithm launches speculative tasks only on fast nodes.
It does this using a SlowNodeThreshold, which is based on the total work performed by each node.
Because speculative tasks cost resources, LATE uses two additional heuristics:
– A limit on the number of speculative tasks executed (SpeculativeCap)
– A SlowTaskThreshold that determines whether a task is slow enough to be speculated (uses the progress rate for comparison)

The LATE Scheduler
When a node asks for a new task and the number of speculative tasks is less than SpeculativeCap (a sketch of this decision procedure follows at the end of this section):
– If the node's total progress is below SlowNodeThreshold, ignore the request
– Rank currently running tasks that are not being speculated by estimated time left
– Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
The algorithm does not take data locality into account.

The LATE Scheduler
Advantages of the algorithm:
– Robust to node heterogeneity because it speculates only the slowest tasks, and only a few of them
– Prioritizes among slow tasks based on how much they hurt response time
– Takes node heterogeneity into account when choosing the node on which to run a speculative task
– Speculates only tasks that will improve the total response time, not any slow task

The LATE Scheduler
The completion-time estimate can produce errors when a task's progress rate decreases, but in general it gives correct approximations for typical MapReduce jobs.
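Below is a minimal sketch, under illustrative assumptions, of LATE's selection logic as described in this section. The names (Task, pick_speculative_task) and the way the thresholds are passed in are hypothetical; the real scheduler derives SlowNodeThreshold and SlowTaskThreshold from percentiles observed across the cluster.

    # Illustrative sketch of LATE's speculative-task selection; all names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Task:
        progress_score: float      # between 0 and 1
        runtime: float             # seconds the task has been running
        being_speculated: bool = False

        @property
        def progress_rate(self):
            # ProgressRate = ProgressScore / T
            return self.progress_score / self.runtime

        @property
        def estimated_time_left(self):
            # (1 - ProgressScore) / ProgressRate
            return (1.0 - self.progress_score) / self.progress_rate

    def pick_speculative_task(tasks, node_total_progress, num_speculative,
                              speculative_cap, slow_node_threshold, slow_task_threshold):
        """Return the task to speculate on the requesting node, or None."""
        if num_speculative >= speculative_cap:
            return None                      # SpeculativeCap reached
        if node_total_progress < slow_node_threshold:
            return None                      # never speculate onto a slow node
        candidates = [t for t in tasks if not t.being_speculated]
        # Rank by estimated time left, farthest-to-finish first.
        candidates.sort(key=lambda t: t.estimated_time_left, reverse=True)
        for t in candidates:
            if t.progress_rate < slow_task_threshold:
                return t                     # slow task with the most time remaining
        return None

    # Example: the task at 20% progress has rate 0.002 and about 400 s left, so it is chosen.
    tasks = [Task(0.9, 100), Task(0.2, 100), Task(0.5, 100)]
    print(pick_speculative_task(tasks, node_total_progress=0.8, num_speculative=0,
                                speculative_cap=2, slow_node_threshold=0.5,
                                slow_task_threshold=0.004))

Ranking by estimated time left, rather than by raw progress score, is what lets LATE prioritize the tasks that actually hurt the job's response time.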
Evaluation
In order to create heterogeneity, they mapped a variable number of virtual machines (from 1 to 8) onto each host in the EC2 cluster.
They measured the impact of contention on I/O performance and on application-level performance.

Evaluation
For application-level performance they sorted 100 GB of random data using Hadoop's Sort benchmark with speculative execution disabled.
With isolated VMs the job completed in 408 s; with VMs packed densely onto physical hosts (7 VMs per host) it took 1094 s.
For evaluating the scheduling algorithms they used clusters of about 200 VMs and performed 5-7 runs per experiment.

Evaluation
Results for scheduling in a heterogeneous cluster of 243 VMs using the Sort job (128 MB per host, about 30 GB of data in total):
On average, LATE finished jobs 27% faster than Hadoop's native scheduler and 31% faster than with no speculation.

Evaluation
Results for scheduling with stragglers: to simulate stragglers, they manually slowed down eight VMs in a cluster of 100.
For each run they sorted 256 MB per host, for a total of 25 GB.

Evaluation
They also ran two other workloads on a heterogeneous cluster with stragglers:
– Grep
– WordCount
They used a 204-node cluster with 1 to 8 VMs per host.
For the Grep test they searched 43 GB of text data, or about 200 MB per host.
On average, LATE finished jobs 36% faster than Hadoop's native scheduler and 56% faster than with no speculation.
For the WordCount test they used a data set of 21 GB, or 100 MB per host.

Evaluation
Sensitivity analysis:
– SpeculativeCap
– SlowTaskThreshold
– SlowNodeThreshold

Evaluation
SpeculativeCap results: they ran experiments at six SpeculativeCap values from 2.5% to 100%, repeating each experiment 5 times.

Evaluation
Sensitivity to SlowTaskThreshold: the idea is to not speculate tasks that are progressing fast, even if they are the only tasks left.
They tested 6 values from 5% to 100%.
We observe that values past 25% all work well, with 25% being the optimum value.

Evaluation
Sensitivity to SlowNodeThreshold
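The thresholds swept in this sensitivity analysis are expressed as percentages. Assuming they are applied as percentile cutoffs over the observed progress rates (an assumption on my part, consistent with the 25% optimum noted above), a small illustrative helper could derive the concrete cutoff used in the earlier sketch like this:

    # Illustrative only: derive a concrete cutoff from a percentile-style threshold setting.
    def percentile_cutoff(values, percentile):
        """Return the value below which roughly `percentile` of the observations fall."""
        ranked = sorted(values)
        index = max(0, int(percentile * len(ranked)) - 1)
        return ranked[index]

    progress_rates = [0.001, 0.002, 0.003, 0.006, 0.007, 0.008, 0.009, 0.010]
    print(percentile_cutoff(progress_rates, 0.25))   # 0.002: only the slowest tasks qualify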