Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery
Florin Dinu, T. S. Eugene Ng
Rice University

Hadoop is Widely Used
• Protein sequencing
• Image processing
• Web indexing
• Machine learning
• Advertising analytics
• Log storage and analysis
• Recent research work
Source: http://wiki.apache.org/hadoop/PoweredBy (2010)

Compute-Node Failures Are Common
"... typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours" – Jeff Dean, Google I/O 2008
"5.0 average worker deaths per job" – Jeff Dean, Keynote I, PACT 2006
Failures hurt revenue, reputation, and user experience.

Hadoop is widely used vs. compute-node failures are common and damaging.
How does Hadoop behave under compute-node failures?
• Inflated, variable, and unpredictable job running times
• Sluggish failure detection
What are the design decisions responsible? Answered in this work.

Focus of This Work
(Figure: Hadoop cluster with a Job Tracker, a Name Node, and per-node Task Trackers running mappers and reducers over Data Nodes)
Task Tracker failures
• Loss of intermediate data
• Loss of running tasks
• Data Nodes are not failed
Types of failures
• Task Tracker process fail-stop failures
• Task Tracker node fail-stop failures
Single failures
• Expose mechanisms and their interactions
• Findings also apply to multiple failures

Declaring a Task Tracker Dead
• Heartbeats are sent from the Task Tracker to the Job Tracker, usually every 3s.
• Every 200s, the Job Tracker checks whether a Task Tracker has sent no heartbeats for at least 600s; if so, it is declared dead.
• The Job Tracker then restarts the dead Task Tracker's running tasks and its completed maps.
• Conservative design.
(Figure: timeline of Job Tracker checks at 200s intervals)
Depending on where the failure falls relative to the 200s checks, detection takes roughly 600s to 800s: variable failure detection time.

Declaring Map Output Lost
• Uses notifications from running reducers to the Job Tracker.
• A notification is a message that a specific map output is unavailable.
• Map M is restarted to re-compute its lost output when:
  #notif(M) > 0.5 * #running reducers  and  #notif(M) > 3
• Conservative design. Static parameters.

Reducer Notifications
A notification signals that a specific map output is unavailable.
On a connection error (reducer R1):
• re-attempt the connection
• send a notification when the number of attempts is a multiple of 10 (nr of attempts % 10 == 0)
• wait exponentially between attempts: wait = 10*(1.3)^(nr_failed_attempts)
• usually ~416s are needed for 10 attempts
On a read error (reducer R2):
• send a notification immediately
Conservative design. Static parameters.
A sketch of these two rules follows below.
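To make the two rules above concrete, here is a minimal, self-contained sketch (not Hadoop's actual code; class and method names are made up for illustration). It reproduces the notification backoff, wait = 10*(1.3)^k seconds between connection attempts, which sums to roughly 416s before the 10th attempt, and the threshold the Job Tracker uses before declaring a map output lost. The exact indexing of the exponent is an assumption chosen to reproduce the ~416s figure from the talk.

// Sketch of the notification timing and the "map output lost" rule.
// Hypothetical names; the constants (10 attempts, base 1.3, 0.5 / 3 thresholds)
// are the values given in the talk.
public class MapOutputLossSketch {

    // Seconds a reducer waits after its k-th failed connection attempt.
    static double backoffSeconds(int failedAttempts) {
        return 10.0 * Math.pow(1.3, failedAttempts);
    }

    // Time until a reducer sends its first notification: it notifies the
    // Job Tracker only after every 10th failed attempt.
    static double secondsUntilFirstNotification() {
        double t = 0.0;
        for (int attempt = 1; attempt < 10; attempt++) {
            t += backoffSeconds(attempt); // waits between the 10 attempts
        }
        return t; // ~416s with these constants
    }

    // Job Tracker rule: restart map M only when enough notifications arrive.
    static boolean mapOutputLost(int notificationsForM, int runningReducers) {
        return notificationsForM > 0.5 * runningReducers && notificationsForM > 3;
    }

    public static void main(String[] args) {
        System.out.printf("First notification after ~%.0f s%n", secondsUntilFirstNotification());
        System.out.println("4 notifications, 14 reducers -> lost? "
                + mapOutputLost(4, 14));   // false: 4 <= 0.5 * 14
        System.out.println("8 notifications, 14 reducers -> lost? "
                + mapOutputLost(8, 14));   // true
    }
}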
Declaring a Reducer Faulty
A reducer is declared faulty if (simplified version):
  #shuffles failed > 0.5 * #shuffles attempted
  and #shuffles succeeded < 0.5 * #shuffles necessary,
  or the reducer has stalled for too long.
• Static parameters.
• Ignores the cause of the failed shuffles.

Experiment: Methodology
• 15-node, 4-rack testbed in the OpenCirrus* cluster
• 14 compute nodes, 1 node reserved for the Job Tracker and Name Node
• Sort job, 10GB input, 160 maps, 14 reducers, 200 runs per experiment
• The job takes 220s in the absence of failures
• A single Task Tracker process failure is injected at a random time between 0 and 220s
* https://opencirrus.org/ – the HP/Intel/Yahoo! Open Cloud Computing Research Testbed

Experiment: Results
(Figure: CDF of job running times under a single Task Tracker failure; the runs cluster into groups G1–G7)
Large variability in job running times.

Group G1 – few reducers impacted
• M1 was copied by all reducers before the failure. After the failure, R1_1 cannot access M1.
• R1_1 needs to send 3 notifications, which takes ~1250s; the Task Tracker is declared dead after 600-800s.
(Figure: reducers R1–R3; R1_1 sends a notification for M1 to the Job Tracker)
Slow recovery when few reducers are impacted.

Group G2 – timing of failure
• 200s difference in running time between G1 and G2.
(Figure: G1 and G2 timelines with the failure at 170s and the 600s expiry, relative to the Job Tracker's checks)
The timing of the failure relative to the Job Tracker's checks impacts job running time.

Group G3 – early notifications
• In G1, notifications are sent after 416s.
• In G3, early notifications cause map outputs to be declared lost.
• Causes: code-level race conditions; the timing of a reducer's shuffle attempts.
(Figure: regular notification after 416s vs. early notification before 416s)
Early notifications increase job running time variability.

Groups G4 & G5 – many reducers impacted
• G4: many reducers send notifications after 416s; the map output is declared lost before the Task Tracker is declared dead.
• G5: same as G4, but early notifications are sent.
(Figure: R1_1 sends notifications for M1–M5 to the Job Tracker)
Job running time under failure varies with the number of reducers impacted.

Induced Reducer Death
A reducer is declared faulty if (simplified version):
  #shuffles failed / #shuffles attempted > 0.5
  and #shuffles succeeded / #shuffles necessary < 0.5,
  or the reducer has stalled for too long.
• If the failed Task Tracker is among the first Task Trackers a reducer contacts, the reducer dies.
• If the failed Task Tracker is attempted too many times, the reducer dies.
A failure can induce further failures in healthy reducers. CPU time and network bandwidth are unnecessarily wasted.
A sketch of this faultiness rule follows below.
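As a rough illustration of how a healthy reducer can be dragged down by another node's failure, here is a minimal sketch of the simplified faultiness rule above (made-up names; the real Hadoop logic has more cases, and the "stalled for too long" clause is omitted). A reducer that happens to try the failed Task Tracker early sees a high failure ratio over few attempts and kills itself, even though the fetch failures say nothing about its own health.

// Sketch of the simplified "reducer faulty" rule from the talk.
// ShuffleState and the example numbers are illustrative, not Hadoop's actual classes.
public class ReducerFaultSketch {

    static final class ShuffleState {
        int attempted;   // fetch attempts made so far
        int failed;      // attempts that failed
        int succeeded;   // map outputs fetched successfully
        int necessary;   // total map outputs this reducer must fetch
        ShuffleState(int attempted, int failed, int succeeded, int necessary) {
            this.attempted = attempted; this.failed = failed;
            this.succeeded = succeeded; this.necessary = necessary;
        }
    }

    // Simplified rule: too many of the attempted shuffles failed AND
    // too few of the necessary shuffles have succeeded.
    static boolean reducerFaulty(ShuffleState s) {
        boolean mostAttemptsFailed = s.failed > 0.5 * s.attempted;
        boolean littleProgress     = s.succeeded < 0.5 * s.necessary;
        return mostAttemptsFailed && littleProgress;
    }

    public static void main(String[] args) {
        // Reducer that contacted the failed Task Tracker among its first targets:
        // 3 of 4 attempts failed, only 1 of 160 map outputs fetched -> declared faulty.
        System.out.println(reducerFaulty(new ShuffleState(4, 3, 1, 160)));     // true

        // Reducer that hits the same failed Task Tracker late in the shuffle:
        // plenty of successes already, so it survives.
        System.out.println(reducerFaulty(new ShuffleState(150, 3, 147, 160))); // false
    }
}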
56 vs 14 Reducers
(Figure: CDF of job running times with 56 reducers vs. 14 reducers)
With more reducers, job running times are spread out even more: there is an increased chance of induced reducer death or of early notifications.

Simulating Node Failure
(Figure: CDF of job running times when the node, rather than only the Task Tracker process, fails)
Without RST packets, all affected tasks wait for the Task Tracker to be declared dead.

Lack of Adaptivity
Recall:
• A notification is sent only after 10 connection attempts.
Inefficiency:
• A static, one-size-fits-all policy cannot handle all situations.
• Its efficiency varies with the number of reducers.
A way forward:
• Use more detailed information about the current job state.

Conservative Design
Recall:
• A Task Tracker is declared dead after at least 600s.
• A notification is sent after 10 attempts and ~416 seconds.
Inefficiency:
• The design assumes most problems are transient.
• Sluggish response to permanent compute-node failures.
A way forward:
• Leverage additional information: network state information, historical information about compute-node behavior [OSDI '10].

Simplistic Failure Semantics
• Lack of TCP connectivity is treated as a problem with the tasks.
Inefficiency:
• Cannot distinguish between the possible causes for the lack of connectivity: transient congestion vs. compute-node failure.
A way forward:
• Decouple failure recovery from overload recovery.
• Use AQM/ECN to provide extra congestion information.
• Allow direct communication between the application and the infrastructure.

Thank you
Company and product logos are from the companies' websites. Conference logos are from the conference websites.
Links to images:
http://t0.gstatic.com/images?q=tbn:ANd9GcTQRDXdzM6pqTpcOil-k2d37JdHnU4HKue8AKqtKCVL5LpLPV-2
http://www.scanex.ru/imgs/data-processing-sample1.jpg
http://t3.gstatic.com/images?q=tbn:ANd9GcQSVkFAbm-scasUkz4lQ-XlPNkbDX9SVD-PXF4KlGwDBME4ugxc
http://criticalmas.com/wp-content/uploads/2009/07/the-borg.jpg
http://www.brightermindspublishing.com/wp-content/uploads/2010/02/advertising-billboard.jpg
http://www.planetware.com/i/photo/logs-stacked-in-port-st-lucie-fla513.jpg

Task Tracker Failure-Related Mechanisms
• Declaring a Task Tracker dead
• Declaring a map output lost
• Declaring a reducer faulty
A sketch of the heartbeat-based dead declaration, and of why its detection time varies, follows below.
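To illustrate the first mechanism in this list, and the 600-800s detection-time variability discussed earlier, here is a minimal sketch of a heartbeat-expiry check (hypothetical names and structure; the 3s heartbeat, 200s check period, and 600s expiry threshold are the values from the talk). Because the Job Tracker only checks every 200s, a Task Tracker that fails just after a check is not declared dead until up to ~800s later.

// Sketch: why heartbeat-based detection takes between 600s and 800s.
// Names are illustrative; only the constants come from the talk.
public class HeartbeatExpirySketch {

    static final double CHECK_PERIOD_S = 200.0; // Job Tracker scans trackers every 200s
    static final double EXPIRY_S = 600.0;       // dead if no heartbeat for at least 600s

    // Time (since the failure) at which the Task Tracker is declared dead,
    // given when the failure occurred relative to the periodic checks.
    // Heartbeats every 3s are ignored here: the last heartbeat is assumed
    // to coincide with the failure instant.
    static double detectionDelay(double failureTime) {
        // First periodic check that happens at or after failureTime + EXPIRY_S.
        double earliest = failureTime + EXPIRY_S;
        double checkTime = Math.ceil(earliest / CHECK_PERIOD_S) * CHECK_PERIOD_S;
        return checkTime - failureTime;
    }

    public static void main(String[] args) {
        // Failure just before a check boundary: detected after ~600s.
        System.out.printf("fail at t=199.9s -> dead after %.1fs%n", detectionDelay(199.9));
        // Failure just after a check boundary: detected only after ~800s.
        System.out.printf("fail at t=200.1s -> dead after %.1fs%n", detectionDelay(200.1));
    }
}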