Slide 1: Combating Outliers in map-reduce
Srikanth Kandula
Ganesh Ananthanarayanan, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Ed Harris

Slide 2: map-reduce
[Figure: log(size of cluster) vs. log(size of dataset), from GB through TB, PB and EB; map-reduce clusters sit at larger cluster and dataset sizes than HPC and parallel databases; example datasets: the Internet, click logs, bio/genomic data]
• decouples operations on data (user-code) from mechanisms to scale
• is widely used
  • Cosmos (based on SVC's Dryad) + Scope @ Bing
  • MapReduce @ Google
  • Hadoop inside Yahoo! and on Amazon's Cloud (AWS)

Slide 3: An Example
What the user says:
Goal: Find frequent search queries to Bing
  SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
How it works:
[Figure: file blocks 0-3 are read by map tasks; a job manager assigns work and gets progress; map tasks write their output locally; reduce tasks read the map output and write task output blocks 0-1]

Slide 4: We find that outliers slow down map-reduce jobs
[Figure: phases of an example job reading from the file system: Map.Read (22K), Map.Move (15K), Map (13K), Barrier, Reduce (51K)]
Goals
• speeding up jobs improves productivity
• predictability supports SLAs
• … while using resources efficiently

Slide 5: This talk…
Identify fundamental causes of outliers
– concurrency leads to contention for resources
– heterogeneity (e.g., disk loss rate)
– map-reduce artifacts
Current schemes duplicate long-running tasks.
Mantri: a cause-, resource-aware mitigation scheme
• takes distinct actions based on cause
• considers the resource cost of actions
Results from a production deployment

Slide 6: Why bother? Frequency of outliers
stragglers = tasks that take at least 1.5 times as long as the median task in that phase
recomputes = tasks that are re-run because their output was lost
[Figure: example job timelines with stragglers and an outlier marked]
• the median phase has 10% stragglers and no recomputes
• 10% of the stragglers take >10X longer

Slide 7: Why bother? Cost of outliers
(what-if analysis that replays logs in a trace-driven simulator)
At the median, jobs are slowed down by 35% due to outliers.

Slide 8: Why outliers? runtime = f(input, …)
Problem: Due to unavailable input, tasks have to be recomputed.
[Figure: map, sort, reduce pipeline]
Delay due to a recompute readily cascades.

Slide 9: Why outliers? runtime = f(input, …)
Problem: Due to unavailable input, tasks have to be recomputed.
(simple) Idea: Replicate intermediate data; use the copy if the original is unavailable.
Challenge(s): What data to replicate? Where? What if we still miss data?
Insights:
• 50% of the recomputes are on 5% of machines

Slide 10: Why outliers? runtime = f(input, …)
Problem: Due to unavailable input, tasks have to be recomputed.
(simple) Idea: Replicate intermediate data; use the copy if the original is unavailable.
Challenge(s): What data to replicate? Where? What if we still miss data?
Insights:
• 50% of the recomputes are on 5% of machines
• cost to recompute vs. cost to replicate
For a task on machine M2 whose input was produced on machine M1:
  t_redo = r_2 · (t_2 + t_redo,1)
  t = predicted runtime of a task
  r = predicted probability of a recompute at a machine
  t_redo,1 = the same recompute cost for the upstream task on M1
  t_rep = cost to copy the data over within the rack
Mantri preferentially acts on the more costly recomputes (a sketch of this comparison follows).
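To make the cost comparison above concrete, here is a minimal Python sketch. It implements the recursion read off slide 10; the Task structure, the function names, and the replicate-iff-cheaper rule (t_rep < t_redo) are assumptions for illustration, since the slide only says that Mantri preferentially acts on the more costly recomputes.

```python
# Minimal sketch of slide 10's cost model; not the production Mantri code.
# The Task structure, names, and the replicate-iff-cheaper rule are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    t: float                  # predicted runtime of the task (slide 10's t)
    r: float                  # predicted probability of a recompute at its machine (r)
    deps: List["Task"] = field(default_factory=list)  # upstream tasks whose output it reads

def t_redo(task: Task) -> float:
    """Expected cost to regenerate this task's output if it is lost.

    Matches the slide's two-machine form t_redo = r_2 (t_2 + t_redo,1):
    with probability r the output must be recomputed, which costs the task's
    own runtime plus regenerating any upstream inputs (the cascade of slide 8).
    """
    return task.r * (task.t + sum(t_redo(d) for d in task.deps))

def should_replicate(task: Task, t_rep: float) -> bool:
    """Replicate the output within the rack iff copying is cheaper than the
    expected recompute cost."""
    return t_rep < t_redo(task)

# Example: a map task on a flaky machine (M1) feeding a task on M2.
m1 = Task(t=100.0, r=0.05)
m2 = Task(t=400.0, r=0.10, deps=[m1])
print(t_redo(m2), should_replicate(m2, t_rep=30.0))   # 40.5 True
```

In this reading, replicating within the rack pays off exactly for the tasks whose outputs are expensive to regenerate, which is what acting on the more costly recomputes means.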
Slide 11: Why outliers? runtime = f(input, network, …)
Problem: Tasks reading input over the network experience variable congestion.
[Figure: reduce tasks reading map output placed unevenly across racks]
Uneven placement is typical in production
• reduce tasks are placed at the first available slot

Slide 12: Why outliers? runtime = f(input, network, …)
Problem: Tasks reading input over the network experience variable congestion.
Idea: Avoid hot-spots; keep traffic on a link proportional to its bandwidth.
Challenge(s): Global co-ordination across jobs? Where is the congestion?
Insights:
• local control is a good approximation (each job balances its own traffic)
• link utilizations average out on the long term and are steady on the short term
If rack i has d_i map output and u_i, v_i bandwidths available on its uplink and downlink, place a fraction a_i of the reduces on rack i such that:
  {a_i} = arg min max_i (T_i_up, T_i_down)
(see the back-up slide "Network-aware Placement" for a sketch)

Slide 13: Why outliers? runtime = f(input, network, machine, …)
Persistently slow machines rarely cause outliers.
Cluster software (Autopilot) quarantines persistently faulty machines.

Slide 14: Why outliers? runtime = f(input, network, machine, dataToProcess, …)
Problem: About 25% of outliers occur due to more dataToProcess.
Solution: Ignoring these is better than the state of the art (duplicating)!
In an ideal world, we could divide work evenly…
We schedule tasks in descending order of dataToProcess.
Theorem [due to Graham, 1969]: doing so is no more than 33% worse than the optimal.

Slide 15: Why outliers? runtime = f(input, network, machine, dataToProcess, …)
Problem: 25% of outliers remain, likely due to contention at the machine.
Idea: Restart tasks elsewhere in the cluster.
Challenge(s): The earlier the better, but restart the outlier or start a pending task?
[Figure: timeline at "now" of a running task with remaining time t_rem and a potential restart with predicted time t_new; cases (a), (b), (c)]
A restart saves both time and resources iff P(t_new < (c / (c + 1)) · t_rem) is high, where c copies of the task are currently running.
• If the predicted time is much better, kill the original and restart elsewhere.
• Else, if other tasks are pending, duplicate iff it saves both time and resources.
• Else (no pending work), duplicate iff the expected savings are high.
Continuously observe and kill wasteful copies.

Slide 16: Summary
runtime = f(input, network, machine, dataToProcess, …)
(a) preferentially replicate costly-to-recompute tasks
(b) each job locally avoids network hot-spots
(c) quarantine persistently faulty machines
(d) schedule in descending order of data size
(e) restart or duplicate tasks, cognizant of resource cost; prune wasteful copies
Theme: cause-, resource-aware action
Explicit attempt to decouple solutions; partial success

Slide 17: Results
Deployed in production Cosmos clusters
• prototype in Jan '10; baking on pre-production clusters; released in May '10
Trace-driven simulations
• thousands of jobs
• mimic workflow, task runtime, data skew, failure probability
• compare with existing schemes and idealized oracles

Slide 18: In production, restarts… improve on native Cosmos by 25% while using fewer resources

Slide 19: Comparing jobs in the wild
[Figure: CDFs over % cluster resources]
340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release)

Slide 20: In trace-replay simulations, restarts… are much better dealt with in a cause-, resource-aware manner
[Figure: CDFs over % cluster resources]

Slide 21: Protecting against recomputes
[Figure: CDF over % cluster resources]

Slide 22: Outliers in map-reduce clusters
• are a significant problem
• happen due to many causes – the interplay between storage, network and map-reduce
• cause-, resource-aware mitigation improves on prior art

Slide 23: Back-up

Slide 24: Network-aware Placement
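As back-up, here is a minimal Python sketch of the rack-level reduce placement from slide 12. The slide only states the objective, choosing fractions a_i to minimize the largest of T_i_up and T_i_down; the concrete cost expressions below (uplink time d_i·(1 − a_i)/u_i, downlink time a_i·(D − d_i)/v_i), the binary-search solver, and all names are assumptions for illustration, not the Cosmos implementation.

```python
# Minimal sketch of slide 12's placement objective; the T_up/T_down expressions,
# the solver, and all names are illustrative assumptions.
from typing import List, Tuple

def place_reduces(d: List[float], u: List[float], v: List[float],
                  iters: int = 60) -> Tuple[List[float], float]:
    """Choose fractions a[i] of the reduce tasks for each rack i so that the
    largest per-rack transfer time max(T_up_i, T_down_i) is (nearly) minimized.

    d[i] = map output already on rack i; u[i], v[i] = available uplink/downlink
    bandwidth of rack i (data units per second)."""
    n, D = len(d), sum(d)

    def bounds(c: float):
        # For a target transfer time c, rack i must keep enough reduces locally
        # (uplink: d[i]*(1-a[i])/u[i] <= c) and not host too many
        # (downlink: a[i]*(D-d[i])/v[i] <= c).
        lo = [max(0.0, 1.0 - c * u[i] / d[i]) if d[i] > 0 else 0.0 for i in range(n)]
        hi = [min(1.0, c * v[i] / (D - d[i])) if D - d[i] > 0 else 1.0 for i in range(n)]
        return lo, hi

    def feasible(c: float) -> bool:
        lo, hi = bounds(c)
        return all(l <= h for l, h in zip(lo, hi)) and sum(lo) <= 1.0 <= sum(hi)

    # Binary search for the smallest achievable transfer time c.
    c_lo, c_hi = 0.0, max(max(d[i] / u[i] for i in range(n)), D / min(v))
    for _ in range(iters):
        mid = (c_lo + c_hi) / 2
        c_lo, c_hi = (c_lo, mid) if feasible(mid) else (mid, c_hi)

    # Build a concrete assignment: start from each rack's lower bound and
    # spread the remaining fraction without exceeding any upper bound.
    lo, hi = bounds(c_hi)
    a, rest = lo[:], 1.0 - sum(lo)
    for i in range(n):
        take = min(rest, hi[i] - a[i])
        a[i], rest = a[i] + take, rest - take
    return a, c_hi

# Example: rack 1 holds most of the map output, rack 3 has a weak downlink.
fractions, finish = place_reduces(d=[60.0, 30.0, 10.0], u=[1.0, 1.0, 1.0], v=[1.0, 1.0, 0.5])
print([round(x, 3) for x in fractions], round(finish, 2))
```

Each job can run such a computation over its own map output, which is the "local control is a good approximation" insight on slide 12.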