Reining in the Outliers in MapReduce Jobs using Mantri
Ganesh Ananthanarayanan†, Srikanth Kandula*, Albert Greenberg*, Ion Stoica†, Yi Lu*, Bikas Saha*, Ed Harris*
† UC Berkeley   * Microsoft

MapReduce Jobs
Basis of analytics in modern Internet services
◦ E.g., Dryad, Hadoop
Job {Phase} {Task}
Graph flow consists of pipelines as well as strict blocks

Example Dryad Job Graph
[Figure: job graph that reads from and writes to the Distr. File System, with phases EXTRACT, AGGREGATE_PARTITION, FULL_AGGREGATE, PROCESS and COMBINE, annotated Map.1, Reduce.1, Map.2, Reduce.2 and Join. Legend: Phase; Pipeline; Blocked until input is done]

Log Analysis from Production
Logs from production cluster with thousands of machines, sampled over six months
10,000+ jobs, 80PB of data, 4PB network transfers
◦ Task-level details
◦ Production and experimental jobs

Outliers hurt!
Tasks that run longer than the rest in the phase
Median phase has 10% outliers, running for >10x longer
Slow down jobs by 35% at median
Operational Inefficiency
◦ Unpredictability in completion times affects SLAs
◦ Hurts development productivity
◦ Wastes compute cycles

Why do outliers occur?
Read Input
◦ Input Unavailable
◦ Network Congestion
Execute
◦ Local Contention
◦ Workload Imbalance
Mantri: A system that mitigates outliers based on root-cause analysis

Mantri’s Outlier Mitigation
Avoid Recomputation
Network-aware Task Placement
Duplicate Outliers
Cognizant of Workload Imbalance

Recomputes: Illustration
[Figure: (a) Barrier phases; (b) Cascading Recomputes. Each panel compares Actual with Ideal and marks the Inflation. Legend: Normal task; Recompute task]

What causes recomputes?
[1] Faulty machines
◦ Bad disks, non-persistent hardware quirks
Set of faulty machines varies with time, not constant (4%)

What causes recomputes?
[2] Transient machine load
◦ Recomputes correlate with machine load
◦ Requests for data access dropped

Replicate costly outputs
MR: Recompute probability of a machine
[Figure: Task 1 feeds Task 2 (on a machine with MR2), which feeds Task 3 (on a machine with MR3); option to Replicate (TRep)]
TRecomp = MR3 * (1 - MR2) * T3 + (MR3 * MR2) * (T3 + T2)
◦ First term: recompute only Task 3
◦ Second term: recompute both Task 3 as well as Task 2
If TRep < TRecomp: REPLICATE

Transient Failure Causes
Recomputes manifest in clutches
Machine prone to cause recomputes till the problem is fixed
◦ Load abates, critical process restart etc.
Clue: At least r recomputes within t time window on a machine

Speculative Recomputes
Anticipatorily recompute tasks whose outputs are unread
[Figure: Task; Input Data (Read Fail); Speculative Recompute; Unread Data]

Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
Duplicate Outliers
Cognizant of Workload Imbalance

Reduce Tasks
Tasks access output of tasks from previous phases
Reduce phase (74% of total traffic)
[Figure: Distr. File System; Local; Map; Network; Reduce; Outlier!]

Variable Congestion
[Figure: Reduce task; Map output; Rack]
Smart placement smoothens hotspots

Traffic-based Allotment
Goal: Minimize phase completion time
For every rack:
◦ d : data
◦ u : available uplink bandwidth
◦ v : available downlink bandwidth
Solve for task allocation fractions, ai

Local Control is a good approx.
Goal: Minimize phase completion time
For every rack:
◦ d : data, D: data over all racks
◦ u : available uplink bandwidth
◦ v : available downlink bandwidth
Link utilizations average out in long term, are steady on the short term
Let rack i have ai fraction of tasks
◦ Time uploading, Tu = di (1 - ai) / ui
◦ Time downloading, Td = (D - di) ai / vi
Timei = max {Tu, Td}

Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
Duplicate Outliers
Cognizant of Workload Imbalance

Contentions cause outliers
Tasks contend for local resources
◦ Processor, memory etc.
Duplicate tasks elsewhere in the cluster
◦ Current schemes duplicate towards end of the phase (e.g., LATE [OSDI 2008])
Duplicate outlier or schedule pending task?

Resource-Aware Restart
[Figure: timeline (axis: time, marker: now) showing the running task with remaining time trem and a potential restart taking tnew]
Save time and resources: P(c tnew < (c + 1) trem)
Continuously observe and kill wasteful copies

Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
Duplicate Outliers
◦ Resource-Aware Restart
Cognizant of Workload Imbalance

Workload Imbalance
A quarter of the outlier tasks have more data to process
◦ Unequal key partitions for reduce tasks
Ignoring these is better than duplication
Schedule tasks in descending order of data to process
◦ Time ∝ (Data to Process)
◦ [Graham ‘69] At worst, within 33% of optimal

Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
Duplicate Outliers
◦ Resource-Aware Restart
Cognizant of Workload Imbalance
◦ Schedule in descending order of size
Callouts: Reactive / Proactive; Predict to act early; Be resource-aware; Act based on the cause

Results
Deployed in production Bing clusters
Trace-driven simulations
◦ Mimic workflow, failures, data skew
◦ Compare with existing and idealized schemes

Jobs in the Wild
Jobs faster by 32% at median, consuming fewer resources
Act Early: Duplicates issued when task 42% done (77% for Dryad)
Light: Issues fewer copies (.47X as many as Dryad)
Accurate: 2.8x higher success rate of copies

Recomputation Avoidance
Eliminates most recomputes with minimal extra resources
(Replication + Speculation) work well in tandem

Network-Aware Placement
Bandwidth approximations
Mantri well approximates the ideal

Summary
From measurements in a production cluster,
◦ Outliers are a significant problem
◦ Are due to an interplay between storage, network and map-reduce
Mantri, a cause-, resource-aware mitigation
Deployment shows encouraging results
“Reining in the Outliers in MapReduce Clusters using Mantri”, USENIX OSDI 2010
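
Sketch: replicate-or-recompute rule ("Replicate costly outputs" slide)
The comparison TRep < TRecomp can be written out as a small calculation. This is a minimal sketch, not Mantri's implementation: the function and parameter names are hypothetical, and it covers only the two-task chain shown on the slide.

```python
def expected_recompute_time(mr3, mr2, t3, t2):
    """Expected recompute cost for Task 3 whose input comes from Task 2.

    mr3, mr2: recompute probabilities of the machines holding the outputs
              of Task 3 and Task 2 (MR3, MR2 on the slide).
    t3, t2:   run times of Task 3 and Task 2 (T3, T2 on the slide).
    """
    only_task3 = mr3 * (1 - mr2) * t3       # Task 3's output lost, Task 2's intact
    both_tasks = (mr3 * mr2) * (t3 + t2)    # both outputs lost: rerun Task 2, then Task 3
    return only_task3 + both_tasks


def should_replicate(t_rep, mr3, mr2, t3, t2):
    """Replicate the output only when copying is cheaper than the expected recompute."""
    return t_rep < expected_recompute_time(mr3, mr2, t3, t2)
```

With MR3 = 0.1, MR2 = 0.2, T3 = 100 and T2 = 50, the expected recompute time is about 11 time units, so a replication that takes 5 units is worthwhile and one that takes 20 is not.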
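
Sketch: spotting recompute bursts ("Transient Failure Causes" slide)
The clue "at least r recomputes within t time window on a machine" maps naturally onto a per-machine sliding window. The class below is an illustrative sketch under assumptions: the class name, the method name, and the requirement that timestamps arrive in non-decreasing order per machine are not from the talk.

```python
from collections import deque


class RecomputeBurstDetector:
    """Flags a machine once it has caused at least r recomputes within t time units."""

    def __init__(self, r, t):
        self.r = r
        self.t = t
        self.events = {}  # machine id -> deque of recent recompute timestamps

    def record(self, machine, timestamp):
        """Record one recompute; return True if the machine now looks suspect."""
        window = self.events.setdefault(machine, deque())
        window.append(timestamp)
        # Drop recomputes that are older than the window of length t.
        while window and timestamp - window[0] > self.t:
            window.popleft()
        return len(window) >= self.r
```

A scheduler could use the returned flag to trigger speculative recomputes of unread outputs on that machine; how the flag is acted upon is outside this sketch.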
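
Sketch: per-rack task fraction ("Local Control is a good approx." slide)
The slide defines Tu = di (1 - ai) / ui, Td = (D - di) ai / vi and Timei = max {Tu, Td}. Since Tu falls and Td rises as ai grows, the maximum is smallest where the two are equal. The code below solves for that balance point for one rack in isolation. Treating each rack independently in this way is an assumption made for illustration, the names are hypothetical, and the resulting fractions are not normalized across racks.

```python
def rack_time(d_i, total_d, u_i, v_i, a_i):
    """Time_i = max{Tu, Td} for a rack that runs fraction a_i of the tasks."""
    t_up = d_i * (1 - a_i) / u_i            # upload the data consumed on other racks
    t_down = (total_d - d_i) * a_i / v_i    # download the data produced on other racks
    return max(t_up, t_down)


def balanced_fraction(d_i, total_d, u_i, v_i):
    """Fraction a_i at which Tu equals Td, which minimizes max{Tu, Td}."""
    up_cost = d_i / u_i
    down_cost = (total_d - d_i) / v_i
    if up_cost + down_cost == 0:
        return 0.0  # no data anywhere: any fraction gives zero transfer time
    return up_cost / (up_cost + down_cost)
```

For a rack holding 100 of 400 data units with uplink and downlink bandwidth of 10 each, the balance point is ai = 0.25, where upload and download both take 7.5 time units.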
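
Sketch: largest-first scheduling ("Workload Imbalance" slide)
Scheduling tasks in descending order of data to process is the classic longest-processing-time-first rule that the Graham reference analyzes. The sketch below assumes, as the slide does, that run time is proportional to data size, and it assigns each task to the currently least-loaded slot. The function name and the notion of a fixed number of identical slots are assumptions for illustration.

```python
import heapq


def schedule_largest_first(task_sizes, num_slots):
    """Assign tasks to slots, largest first; return (makespan, assignment)."""
    slots = [(0, s) for s in range(num_slots)]   # (current load, slot id)
    heapq.heapify(slots)
    assignment = {s: [] for s in range(num_slots)}
    for size in sorted(task_sizes, reverse=True):
        load, s = heapq.heappop(slots)           # least-loaded slot so far
        assignment[s].append(size)
        heapq.heappush(slots, (load + size, s))
    makespan = max(load for load, _ in slots)
    return makespan, assignment
```

For sizes 2, 2, 3, 4, 5, 7 on two slots, largest-first finishes at time 12, whereas feeding the same tasks in ascending order to the same least-loaded rule finishes at 13, because the largest task starts last.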