Lecture 14: Combating Outliers in MapReduce Clusters
Xiaowei Yang

• References:
– Reining in the Outliers in Map-Reduce Clusters using Mantri, by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris
– http://research.microsoft.com/enus/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx

[Figure: log(size of cluster) vs. log(size of dataset), GB (10^9) through EB (10^18). MapReduce occupies the top-right region: clusters of 10^3 to 10^5 machines over the largest datasets, e.g., the Internet, click logs, bio/genomic data; HPC and parallel databases occupy the smaller-cluster region.]

MapReduce
• Decouples customized data operations from the mechanisms that scale them
• Is widely used
– Cosmos (based on SVC's Dryad) + Scope @ Bing
– MapReduce @ Google
– Hadoop inside Yahoo! and on Amazon's cloud (AWS)

An Example
• Goal: find frequent search queries to Bing
• What the user says:

    SELECT Query, COUNT(*) AS Freq
    FROM QueryTable
    GROUP BY Query
    HAVING Freq > X

• How it works:
[Diagram: a job manager assigns work to tasks and gets progress reports; map tasks read file blocks 0 through 3 and write their output locally; reduce tasks read the map outputs and write output blocks 0 and 1.]

We find that: outliers slow down map-reduce jobs
[Diagram: workflow of an example job. File System → Map.Read (22K) → Map.Move (15K) → Map (13K) → Barrier → Reduce (51K).]

Goals
• Speeding up jobs improves productivity
• Predictability supports SLAs
• … while using resources efficiently

What is an outlier?
• A phase (map or reduce) has n tasks and s slots (available compute resources)
• Ideally every task takes T seconds, so run time = ceiling(n/s) × T
• In practice task runtimes vary: ti = f(datasize, code, machine, network)
• A naïve scheduler then needs roughly (Σi ti) / s
• Goal: finish closer to the lower bound max( maxi ti, (Σi ti) / s )

From a phase to a job
• A job may have many phases
• An outlier in an early phase has a cumulative effect
• Data loss may cause multi-phase recompute outliers

Why outliers?
• Problem: when input becomes unavailable, tasks have to be recomputed
[Diagram: map → sort → reduce pipeline; a delay due to a recompute in one phase readily cascades into the phases downstream.]

Previous work
• The original MapReduce paper observed the problem
– But didn't deal with it in depth
• Its solution was to duplicate slow tasks
• Drawbacks:
– Some duplicates may be unnecessary
– Duplicates use extra resources
– Placement, not the task itself, may be the problem

Quantifying the Outlier Problem
• Approach: understand the problem before proposing solutions
– Understanding often leads to solutions
1. Prevalence of outliers
2. Causes of outliers
3. Impact of outliers

Why bother? Frequency of outliers
• stragglers = tasks that take more than 1.5 times the median task in that phase
• recomputes = tasks that are re-run because their output was lost
[Figure: CDF over phases of the fraction of stragglers and recomputes per phase.]
• 50% of phases have 10% stragglers and no recomputes
• 10% of the stragglers take >10× longer
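To make the phase model and the straggler rule concrete, here is a minimal Python sketch. It is illustrative only (not code from Mantri or Cosmos): it computes the completion-time bounds for one phase and flags stragglers by the 1.5×-median rule above.

    import math
    import statistics

    def phase_stats(task_times, slots):
        # One phase: n tasks run on s slots.
        n, s = len(task_times), slots
        total = sum(task_times)
        # The phase can finish no sooner than its longest task, nor sooner
        # than the total work spread perfectly across all s slots.
        lower_bound = max(max(task_times), total / s)
        # If every task took the median time T, run time would be ceil(n/s) * T.
        median = statistics.median(task_times)
        ideal = math.ceil(n / s) * median
        # Straggler rule from the lecture: more than 1.5x the median runtime.
        stragglers = [i for i, t in enumerate(task_times) if t > 1.5 * median]
        return lower_bound, ideal, stragglers

    # One 48 s task among roughly 10 s tasks: the straggler, not the total
    # amount of work, now sets the lower bound on phase completion.
    times = [10, 11, 9, 10, 10, 48]
    print(phase_stats(times, slots=3))   # -> (48, 20.0, [5])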
Causes of outliers: data skew
• Duplicating the task will not help!
• In 40% of the phases, all the tasks with high runtimes (>1.5× the median task) correspond to large amounts of data over the network

Non-outliers can be improved as well
• 20% of them take 55% longer than the median
• Problem: tasks reading input over the network experience variable congestion
[Figure: a reduce task reading map output across the network.]
• Uneven placement is typical in production: reduce tasks are placed at the first available slot

Causes of outliers: cross-rack traffic
• 50% of phases take 62% longer to finish than under ideal placement
• 70% of cross-rack traffic is reduce traffic
• Tasks in a spot with a slow network run slower
• Tasks compete for the network among themselves
• A reduce reads from every map
• A reduce is put into any spare slot

Causes of outliers: bad and busy machines
• 50% of recomputes happen on 5% of the machines
• Recomputes increase resource usage
• Outliers cluster in time
– Resource contention might be the cause
• Recomputes cluster by machine
– Data loss may cause multiple recomputes

Why bother? Cost of outliers
• What-if analysis: replay logs in a trace-driven simulator
• At the median, jobs are slowed down by 35% due to outliers

Mantri Design
High-level idea
• Cause-aware and resource-aware
• Runtime = f(input, network, machine, dataToProcess, …)
• Fix each problem with a different strategy

Resource-aware restarts
• Duplicate or kill long-running outliers

When to restart
• Every ∆ seconds, tasks report progress
• Estimate trem (time for the running copy to finish) and tnew (predicted runtime of a fresh copy)
• Kill and restart a task at most γ = 3 times
• Schedule a duplicate only if the total running time, summed over all copies, is expected to shrink: P(c · trem > (c + 1) · tnew) > δ, where c is the number of copies currently running
• When there are available slots, restart if the expected saving exceeds the restart cost: E(trem − tnew) > ρ · ∆

Network-aware placement
• Compute the rack location of each task
• Find the placement that minimizes the maximum data-transfer time
• If rack i holds di of the map output and has bandwidths ui, vi available on its uplink and downlink, place a fraction ai of the reduces in rack i so as to

    minimize over {ai}: maxi max( di (1 − ai) / ui , (D − di) ai / vi ), where D = Σj dj

Avoid recomputation
• Replicate task output
– Restart a task early if its output is lost
– Preferentially replicate the most costly-to-recompute output

Data-aware task ordering
• Outliers also arise from large inputs
• Schedule tasks in descending order of dataToProcess
• At most 33% worse than optimal scheduling

Estimation of trem
• d: input data size
• dread: the amount read so far
• Assume the read rate observed so far continues: trem ≈ telapsed · (d − dread) / dread

Estimation of tnew
• processRate: per-byte rate estimated from all tasks in the phase
• locationFactor: relative performance of the machine
• d: input size
• tnew = processRate × locationFactor × d
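A minimal Python sketch of the two estimators, under the assumptions stated on the slides: the read rate observed so far persists, and a fresh copy's runtime scales linearly with input size. The function and parameter names are illustrative, not Mantri's actual interfaces.

    def estimate_t_rem(t_elapsed, d, d_read):
        # Remaining time of a running task: assume the rate observed so far
        # (d_read bytes in t_elapsed seconds) holds for the rest of the input.
        if d_read == 0:
            return float("inf")   # no progress yet, nothing to extrapolate
        rate = d_read / t_elapsed
        return (d - d_read) / rate

    def estimate_t_new(process_rate, location_factor, d):
        # Runtime of a fresh copy: the phase-wide per-byte rate, scaled by
        # the candidate machine's relative speed and the task's input size.
        return process_rate * location_factor * d

    # A task has read 2 GB of an 8 GB input in 100 s; a fresh copy would run
    # at 40 ns/byte on a machine 1.2x slower than average.
    print(estimate_t_rem(t_elapsed=100.0, d=8e9, d_read=2e9))              # 300.0
    print(estimate_t_new(process_rate=40e-9, location_factor=1.2, d=8e9))  # 384.0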
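Building on those estimates, a hedged sketch of the restart and duplicate conditions above. The empirical-sample probability model and the default values for δ, ρ, and ∆ are stand-ins, not values taken from the slides.

    def should_duplicate(t_rem, t_new_samples, copies, delta=0.25):
        # Duplicate only if total resource use is likely to drop, i.e.
        # P(c * t_rem > (c + 1) * t_new) > delta, with c copies running.
        # The distribution of t_new is approximated by an empirical sample.
        c = copies
        hits = sum(1 for t_new in t_new_samples if c * t_rem > (c + 1) * t_new)
        return hits / len(t_new_samples) > delta

    def should_restart(t_rem, t_new_expected, rho=2.0, report_period=10.0):
        # With a slot available, kill-and-restart pays off when the expected
        # saving beats the restart overhead: E(t_rem - t_new) > rho * Delta.
        return t_rem - t_new_expected > rho * report_period

    samples = [120.0, 150.0, 200.0, 90.0]   # fresh-copy runtimes in this phase
    print(should_duplicate(t_rem=400.0, t_new_samples=samples, copies=1))  # True
    print(should_restart(t_rem=400.0, t_new_expected=140.0))               # True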
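Finally, a sketch of the network-aware placement computation. It bisects on the achievable transfer time t, using the fact that rack i can meet time t only if ai ≥ 1 − t·ui/di (uplink) and ai ≤ t·vi/(D − di) (downlink). The bisection solver is an illustration; the lecture specifies only the min-max objective. It assumes every rack holds some map output (di > 0).

    def min_max_placement(d, u, v, iters=60):
        # Choose fractions a[i] of reduce tasks per rack to minimize
        #   max_i max( d[i] * (1 - a[i]) / u[i], (D - d[i]) * a[i] / v[i] ),
        # the worse of rack i's uplink time (its own map output leaving)
        # and downlink time (all other racks' map output arriving).
        D = sum(d)

        def bounds(t):
            # Feasible interval for each a[i] if every transfer must finish by t.
            lb = [max(0.0, 1.0 - t * ui / di) for di, ui in zip(d, u)]
            ub = [min(1.0, t * vi / (D - di)) for di, vi in zip(d, v)]
            return lb, ub

        def feasible(t):
            lb, ub = bounds(t)
            return (all(l <= h for l, h in zip(lb, ub))
                    and sum(lb) <= 1.0 <= sum(ub))

        lo = 0.0
        hi = max(max(di / ui, (D - di) / vi) for di, ui, vi in zip(d, u, v))
        for _ in range(iters):          # bisect on the achievable time
            mid = (lo + hi) / 2.0
            if feasible(mid):
                hi = mid
            else:
                lo = mid
        lb, ub = bounds(hi)
        a, spare = list(lb), 1.0 - sum(lb)
        for i in range(len(a)):         # top fractions up until they sum to 1
            bump = min(ub[i] - a[i], spare)
            a[i] += bump
            spare -= bump
        return a, hi

    # Rack 0 holds most of the map output but has a thin uplink, so most of
    # the reduces are placed on it to keep that data off the network.
    a, t = min_max_placement(d=[8.0, 2.0, 2.0], u=[1.0, 2.0, 2.0], v=[2.0, 2.0, 2.0])
    print([round(x, 2) for x in a], round(t, 2))   # [0.8, 0.2, 0.0] 1.6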
Results
• Deployed in production Cosmos clusters
– Prototype in Jan '10; baking on pre-production clusters; released May '10
• Trace-driven simulations
– Thousands of jobs
– Mimic workflow, task runtime, data skew, failure probability
– Compare against existing schemes and idealized oracles

Evaluation Methodology
• Mantri run on production clusters
• Baseline is results from Dryad without Mantri
• Trace-driven simulations compare against other systems

Comparing jobs in the wild
• 340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release)
• That is, with and without Mantri over one month of jobs in the Bing production cluster

In production, restarts…
• …improve on native Cosmos by 25% while using fewer resources
[Figure: CDFs of the reduction in completion time and of % cluster resources used.]

In trace-replay simulations, restarts…
• …are dealt with much better in a cause-aware, resource-aware manner
• Each job repeated thrice

Network-aware Placement
• Equal: assumes all links have the same bandwidth
• Start: uses the bandwidths available at the start
• Ideal: uses the bandwidth actually available at run time

Protecting against recomputes
[Figure: CDF of % cluster resources used.]

Summary
a) Reduce recomputation: preferentially replicate the output of costly-to-recompute tasks
b) Poor network: each job locally avoids network hot-spots
c) Bad machines: quarantine persistently faulty machines
d) DataToProcess: schedule tasks in descending order of data size
e) Others: restart or duplicate tasks, cognizant of resource cost, and prune (kill) copies that lag behind

Conclusion
• Outliers in map-reduce clusters are a significant problem
• They happen due to many causes
– An interplay among storage, network, and map-reduce computation
• Cause-aware, resource-aware mitigation improves on prior art