Lecture 14: Combating Outliers in MapReduce Clusters
Xiaowei Yang

• References:
– Reining in the Outliers in Map-Reduce Clusters using Mantri, by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris
– http://research.microsoft.com/enus/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx

[Figure: log(size of cluster) vs. log(size of dataset), GB (10^9) through EB (10^18). MapReduce occupies the top-right region: clusters of 10^3 to 10^5 machines over the largest datasets, e.g., the Internet, click logs, bio/genomic data; HPC and parallel databases occupy the smaller-cluster region.]

MapReduce
• Decouples customized data operations from the mechanisms that scale them
• Is widely used
– Cosmos (based on SVC's Dryad) + Scope @ Bing
– MapReduce @ Google
– Hadoop inside Yahoo! and on Amazon's cloud (AWS)

An Example
• Goal: find frequent search queries to Bing
• What the user says:

    SELECT Query, COUNT(*) AS Freq
    FROM QueryTable
    GROUP BY Query
    HAVING Freq > X

• How it works:
[Diagram: a job manager assigns work to tasks and gets progress reports; map tasks read file blocks 0 through 3 and write their output locally; reduce tasks read the map outputs and write output blocks 0 and 1.]

We find that: outliers slow down map-reduce jobs
[Diagram: workflow of an example job. File System → Map.Read (22K) → Map.Move (15K) → Map (13K) → Barrier → Reduce (51K).]

Goals
• Speeding up jobs improves productivity
• Predictability supports SLAs
• … while using resources efficiently

What is an outlier?
• A phase (map or reduce) has n tasks and s slots (available compute resources)
• Ideally every task takes T seconds, so run time = ceiling(n/s) × T
• In practice task runtimes vary: ti = f(datasize, code, machine, network)
• A naïve scheduler then needs roughly (Σi ti) / s
• Goal: finish closer to the lower bound max( maxi ti, (Σi ti) / s )

From a phase to a job
• A job may have many phases
• An outlier in an early phase has a cumulative effect
• Data loss may cause multi-phase recompute outliers

Why outliers?
• Problem: when input becomes unavailable, tasks have to be recomputed
[Diagram: map → sort → reduce pipeline; a delay due to a recompute in one phase readily cascades into the phases downstream.]

Previous work
• The original MapReduce paper observed the problem
– But didn't deal with it in depth
• Its solution was to duplicate slow tasks
• Drawbacks:
– Some duplicates may be unnecessary
– Duplicates use extra resources
– Placement, not the task itself, may be the problem

Quantifying the Outlier Problem
• Approach: understand the problem before proposing solutions
– Understanding often leads to solutions
1. Prevalence of outliers
2. Causes of outliers
3. Impact of outliers

Why bother? Frequency of outliers
• stragglers = tasks that take more than 1.5 times the median task in that phase
• recomputes = tasks that are re-run because their output was lost
[Figure: CDF over phases of the fraction of stragglers and recomputes per phase.]
• 50% of phases have 10% stragglers and no recomputes
• 10% of the stragglers take >10× longer
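To make the phase model and the straggler rule concrete, here is a minimal Python sketch. It is illustrative only (not code from Mantri or Cosmos): it computes the completion-time bounds for one phase and flags stragglers by the 1.5×-median rule above.

    import math
    import statistics

    def phase_stats(task_times, slots):
        # One phase: n tasks run on s slots.
        n, s = len(task_times), slots
        total = sum(task_times)
        # The phase can finish no sooner than its longest task, nor sooner
        # than the total work spread perfectly across all s slots.
        lower_bound = max(max(task_times), total / s)
        # If every task took the median time T, run time would be ceil(n/s) * T.
        median = statistics.median(task_times)
        ideal = math.ceil(n / s) * median
        # Straggler rule from the lecture: more than 1.5x the median runtime.
        stragglers = [i for i, t in enumerate(task_times) if t > 1.5 * median]
        return lower_bound, ideal, stragglers

    # One 48 s task among roughly 10 s tasks: the straggler, not the total
    # amount of work, now sets the lower bound on phase completion.
    times = [10, 11, 9, 10, 10, 48]
    print(phase_stats(times, slots=3))   # -> (48, 20.0, [5])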
Causes of outliers: data skew
• Duplicating the task will not help!
• In 40% of the phases, all the tasks with high runtimes (>1.5× the median task) correspond to large amounts of data over the network

Non-outliers can be improved as well
• 20% of them take 55% longer than the median
• Problem: tasks reading input over the network experience variable congestion
[Figure: a reduce task reading map output across the network.]
• Uneven placement is typical in production: reduce tasks are placed at the first available slot

Causes of outliers: cross-rack traffic
• 50% of phases take 62% longer to finish than under ideal placement
• 70% of cross-rack traffic is reduce traffic
• Tasks in a spot with a slow network run slower
• Tasks compete for the network among themselves
• A reduce reads from every map
• A reduce is put into any spare slot

Causes of outliers: bad and busy machines
• 50% of recomputes happen on 5% of the machines
• Recomputes increase resource usage
• Outliers cluster in time
– Resource contention might be the cause
• Recomputes cluster by machine
– Data loss may cause multiple recomputes

Why bother? Cost of outliers
• What-if analysis: replay logs in a trace-driven simulator
• At the median, jobs are slowed down by 35% due to outliers

Mantri Design
High-level idea
• Cause-aware and resource-aware
• Runtime = f(input, network, machine, dataToProcess, …)
• Fix each problem with a different strategy

Resource-aware restarts
• Duplicate or kill long-running outliers

When to restart
• Every ∆ seconds, tasks report progress
• Estimate trem (time for the running copy to finish) and tnew (predicted runtime of a fresh copy)
• Kill and restart a task at most γ = 3 times
• Schedule a duplicate only if the total running time, summed over all copies, is expected to shrink: P(c · trem > (c + 1) · tnew) > δ, where c is the number of copies currently running
• When there are available slots, restart if the expected saving exceeds the restart cost: E(trem − tnew) > ρ · ∆

Network-aware placement
• Compute the rack location of each task
• Find the placement that minimizes the maximum data-transfer time
• If rack i holds di of the map output and has bandwidths ui, vi available on its uplink and downlink, place a fraction ai of the reduces in rack i so as to

    minimize over {ai}: maxi max( di (1 − ai) / ui , (D − di) ai / vi ), where D = Σj dj

Avoid recomputation
• Replicate task output
– Restart a task early if its output is lost
– Preferentially replicate the most costly-to-recompute output

Data-aware task ordering
• Outliers also arise from large inputs
• Schedule tasks in descending order of dataToProcess
• At most 33% worse than optimal scheduling

Estimation of trem
• d: input data size
• dread: the amount read so far
• Assume the read rate observed so far continues: trem ≈ telapsed · (d − dread) / dread

Estimation of tnew
• processRate: per-byte rate estimated from all tasks in the phase
• locationFactor: relative performance of the machine
• d: input size
• tnew = processRate × locationFactor × d
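A minimal Python sketch of the two estimators, under the assumptions stated on the slides: the read rate observed so far persists, and a fresh copy's runtime scales linearly with input size. The function and parameter names are illustrative, not Mantri's actual interfaces.

    def estimate_t_rem(t_elapsed, d, d_read):
        # Remaining time of a running task: assume the rate observed so far
        # (d_read bytes in t_elapsed seconds) holds for the rest of the input.
        if d_read == 0:
            return float("inf")   # no progress yet, nothing to extrapolate
        rate = d_read / t_elapsed
        return (d - d_read) / rate

    def estimate_t_new(process_rate, location_factor, d):
        # Runtime of a fresh copy: the phase-wide per-byte rate, scaled by
        # the candidate machine's relative speed and the task's input size.
        return process_rate * location_factor * d

    # A task has read 2 GB of an 8 GB input in 100 s; a fresh copy would run
    # at 40 ns/byte on a machine 1.2x slower than average.
    print(estimate_t_rem(t_elapsed=100.0, d=8e9, d_read=2e9))              # 300.0
    print(estimate_t_new(process_rate=40e-9, location_factor=1.2, d=8e9))  # 384.0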
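Building on those estimates, a hedged sketch of the restart and duplicate conditions above. The empirical-sample probability model and the default values for δ, ρ, and ∆ are stand-ins, not values taken from the slides.

    def should_duplicate(t_rem, t_new_samples, copies, delta=0.25):
        # Duplicate only if total resource use is likely to drop, i.e.
        # P(c * t_rem > (c + 1) * t_new) > delta, with c copies running.
        # The distribution of t_new is approximated by an empirical sample.
        c = copies
        hits = sum(1 for t_new in t_new_samples if c * t_rem > (c + 1) * t_new)
        return hits / len(t_new_samples) > delta

    def should_restart(t_rem, t_new_expected, rho=2.0, report_period=10.0):
        # With a slot available, kill-and-restart pays off when the expected
        # saving beats the restart overhead: E(t_rem - t_new) > rho * Delta.
        return t_rem - t_new_expected > rho * report_period

    samples = [120.0, 150.0, 200.0, 90.0]   # fresh-copy runtimes in this phase
    print(should_duplicate(t_rem=400.0, t_new_samples=samples, copies=1))  # True
    print(should_restart(t_rem=400.0, t_new_expected=140.0))               # True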
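Finally, a sketch of the network-aware placement computation. It bisects on the achievable transfer time t, using the fact that rack i can meet time t only if ai ≥ 1 − t·ui/di (uplink) and ai ≤ t·vi/(D − di) (downlink). The bisection solver is an illustration; the lecture specifies only the min-max objective. It assumes every rack holds some map output (di > 0).

    def min_max_placement(d, u, v, iters=60):
        # Choose fractions a[i] of reduce tasks per rack to minimize
        #   max_i max( d[i] * (1 - a[i]) / u[i], (D - d[i]) * a[i] / v[i] ),
        # the worse of rack i's uplink time (its own map output leaving)
        # and downlink time (all other racks' map output arriving).
        D = sum(d)

        def bounds(t):
            # Feasible interval for each a[i] if every transfer must finish by t.
            lb = [max(0.0, 1.0 - t * ui / di) for di, ui in zip(d, u)]
            ub = [min(1.0, t * vi / (D - di)) for di, vi in zip(d, v)]
            return lb, ub

        def feasible(t):
            lb, ub = bounds(t)
            return (all(l <= h for l, h in zip(lb, ub))
                    and sum(lb) <= 1.0 <= sum(ub))

        lo = 0.0
        hi = max(max(di / ui, (D - di) / vi) for di, ui, vi in zip(d, u, v))
        for _ in range(iters):          # bisect on the achievable time
            mid = (lo + hi) / 2.0
            if feasible(mid):
                hi = mid
            else:
                lo = mid
        lb, ub = bounds(hi)
        a, spare = list(lb), 1.0 - sum(lb)
        for i in range(len(a)):         # top fractions up until they sum to 1
            bump = min(ub[i] - a[i], spare)
            a[i] += bump
            spare -= bump
        return a, hi

    # Rack 0 holds most of the map output but has a thin uplink, so most of
    # the reduces are placed on it to keep that data off the network.
    a, t = min_max_placement(d=[8.0, 2.0, 2.0], u=[1.0, 2.0, 2.0], v=[2.0, 2.0, 2.0])
    print([round(x, 2) for x in a], round(t, 2))   # [0.8, 0.2, 0.0] 1.6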
Results
• Deployed in production Cosmos clusters
– Prototype in Jan '10; baking on pre-production clusters; released May '10
• Trace-driven simulations
– Thousands of jobs
– Mimic workflow, task runtime, data skew, failure probability
– Compare against existing schemes and idealized oracles

Evaluation Methodology
• Mantri run on production clusters
• Baseline is results from Dryad without Mantri
• Trace-driven simulations compare against other systems

Comparing jobs in the wild
• 340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release)
• That is, with and without Mantri over one month of jobs in the Bing production cluster

In production, restarts…
• …improve on native Cosmos by 25% while using fewer resources
[Figure: CDFs of the reduction in completion time and of % cluster resources used.]

In trace-replay simulations, restarts…
• …are dealt with much better in a cause-aware, resource-aware manner
• Each job repeated thrice

Network-aware Placement
• Equal: assumes all links have the same bandwidth
• Start: uses the bandwidths available at the start
• Ideal: uses the bandwidth actually available at run time

Protecting against recomputes
[Figure: CDF of % cluster resources used.]

Summary
a) Reduce recomputation: preferentially replicate the output of costly-to-recompute tasks
b) Poor network: each job locally avoids network hot-spots
c) Bad machines: quarantine persistently faulty machines
d) DataToProcess: schedule tasks in descending order of data size
e) Others: restart or duplicate tasks, cognizant of resource cost, and prune (kill) copies that lag behind

Conclusion
• Outliers in map-reduce clusters are a significant problem
• They happen due to many causes
– An interplay among storage, network, and map-reduce computation
• Cause-aware, resource-aware mitigation improves on prior art