
Hadoop's Overload Tolerant Design Exacerbates Failure Detection and Recovery
Florin Dinu
T. S. Eugene Ng
Rice University
Hadoop is Widely Used*
[Logos of example uses: protein sequencing, image processing, web indexing, machine learning, advertising, analytics, recent research work, log storage and analysis]
* Source: http://wiki.apache.org/hadoop/PoweredBy
Compute-Node Failures Are Common
“ ... typical that 1,000 individual machine failures will occur; thousands
of hard drive failures will occur; one power distribution unit will fail,
bringing down 500 to 1,000 machines for about 6 hours”
Jeff Dean – Google I/O 2008
“ 5.0 average worker deaths per job”
Jeff Dean – Keynote I – PACT 2006
At stake: revenue, reputation, user experience.
Hadoop is widely used
vs.
Compute-node failures are common and damaging
How does Hadoop behave under compute-node failures?
Inflated, variable, and unpredictable job running times.
Sluggish failure detection.
What are the design decisions responsible?
Answered in this work.
Focus of This Work
[Diagram: Job Tracker, Name Node, and a compute node running a Task Tracker (with Mapper and Reducer tasks) and a Data Node]
Task Tracker failures
• Loss of intermediate data
• Loss of running tasks
• Data Nodes not failed
Types of failures
• Task Tracker process fail-stop failures
• Task Tracker node fail-stop failures
Single failures
• Expose mechanisms and their interactions
• Findings also apply to multiple failures
Declaring a Task Tracker Dead
Heartbeats from the Task Tracker to the Job Tracker, usually every 3s.
The Job Tracker periodically (every 200s) checks whether heartbeats have not been sent for at least 600s.
[Timeline: checks at 200s intervals; the Task Tracker is declared dead at the first check after 600s of silence]
When a Task Tracker is declared dead: restart its running tasks and restart its completed maps.
Conservative design.
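A minimal sketch of the expiry logic described above, with assumed names (TrackerExpiry, onHeartbeat, expiryCheck); this is illustrative, not the actual Hadoop Job Tracker code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: the Job Tracker remembers the last heartbeat time per
// Task Tracker and periodically declares silent trackers dead.
public class TrackerExpiry {
    static final long EXPIRY_MS = 600_000;          // dead after >= 600s of silence
    static final long CHECK_INTERVAL_MS = 200_000;  // expiry check runs every 200s

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called on every heartbeat (normally every ~3s per Task Tracker).
    void onHeartbeat(String trackerId) {
        lastHeartbeat.put(trackerId, System.currentTimeMillis());
    }

    // Called by a timer every CHECK_INTERVAL_MS.
    void expiryCheck() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() >= EXPIRY_MS) {
                declareDead(e.getKey());
            }
        }
    }

    private void declareDead(String trackerId) {
        // Here the real system would restart the tracker's running tasks
        // and its completed maps.
        lastHeartbeat.remove(trackerId);
    }

    public static void main(String[] args) {
        TrackerExpiry expiry = new TrackerExpiry();
        expiry.onHeartbeat("tracker-1");
        expiry.expiryCheck(); // nothing declared dead: tracker-1 heartbeated recently
    }
}
```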
Declaring a Task Tracker Dead
[Two timelines: heartbeats stop at different points relative to the Job Tracker's 200s checks; detection time is ~800s in one case and ~600s in the other]
Variable failure detection time
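A back-of-envelope model of this variability (an illustrative sketch under the assumptions above, not Hadoop code): the 600s silence threshold is only evaluated at checks spaced 200s apart, so the failure's position within a check interval adds up to 200s to the detection time.

```java
// Illustrative model: a 600s expiry threshold evaluated only at 200s-spaced checks.
public class DetectionDelay {
    static final int CHECK_INTERVAL = 200; // seconds between Job Tracker checks
    static final int EXPIRY = 600;         // seconds of silence before "dead"

    // Time from failure until the first check that sees >= EXPIRY of silence.
    static int detectionTime(int failureTime) {
        int check = CHECK_INTERVAL;
        while (check - failureTime < EXPIRY) {
            check += CHECK_INTERVAL;
        }
        return check - failureTime;
    }

    public static void main(String[] args) {
        // Failing just after a check gives ~800s; just before a check, ~600s.
        for (int f : new int[]{1, 100, 199}) {
            System.out.println("fail at t=" + f + "s -> detected after " + detectionTime(f) + "s");
        }
    }
}
```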
Declaring Map Output Lost
• Uses notifications from running reducers to the Job Tracker
• A notification is a message that a specific map output is unavailable
• Restart map M to re-compute its lost output when (sketch below):
  #notif(M) > 0.5 * #running reducers
  and
  #notif(M) > 3
[Diagram: a Task Tracker fails; over time, the Job Tracker collects notifications from reducers]
Conservative design. Static parameters.
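A hypothetical rendering of the lost-output test above (illustrative names, not the actual Hadoop implementation):

```java
// Map M's output is declared lost (and M restarted) only when more than half
// of the running reducers have notified AND more than 3 notifications exist.
public class MapOutputLossCheck {
    static boolean outputLost(int notificationsForM, int runningReducers) {
        return notificationsForM > 0.5 * runningReducers
            && notificationsForM > 3;
    }

    public static void main(String[] args) {
        // With 14 reducers (as in the experiments later), 7 notifications are
        // not enough; 8 are.
        System.out.println(outputLost(7, 14)); // false
        System.out.println(outputLost(8, 14)); // true
        // With very few reducers, the "> 3" clause dominates.
        System.out.println(outputLost(3, 4));  // false
        System.out.println(outputLost(4, 4));  // true
    }
}
```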
Reducer Notifications
A notification signals that a specific map output is unavailable.
On connection error (reducer R1):
• re-attempt the connection
• send a notification when nr of attempts % 10 = 0
• exponential wait between attempts: wait = 10 * (1.3)^(nr_failed_attempts)
• usually 416s needed for 10 attempts (see the arithmetic sketch below)
On read error (reducer R2):
• send a notification immediately
[Diagram: R1 hits connection errors against the failed Task Tracker holding map output M5; R2 hits a read error on it]
Conservative design. Static parameters.
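A quick arithmetic check of the 416s figure (illustrative code, assuming the back-off formula above and no wait before the first attempt):

```java
// Sum of exponential waits before a reducer's 10th failed connection attempt:
// wait = 10 * 1.3^nr_failed_attempts seconds after each failure.
public class NotificationDelay {
    public static void main(String[] args) {
        double total = 0;
        for (int failed = 1; failed <= 9; failed++) { // waits after failures 1..9
            total += 10 * Math.pow(1.3, failed);
        }
        // Prints ~416s: the earliest a connection-error notification is normally sent.
        System.out.printf("time until 10th attempt: ~%.0fs%n", total);
    }
}
```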
Declaring a Reducer Faulty
• A reducer is declared faulty if (simplified version, sketched below):
  #shuffles failed > 0.5 * #shuffles attempted
  and
  #shuffles succeeded < 0.5 * #shuffles necessary
  or
  the reducer has stalled for too long
Static parameters. Ignores the cause of failed shuffles.
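A simplified, hypothetical rendering of this health test (names are illustrative, not Hadoop's):

```java
public class ReducerFaultCheck {
    // Simplified version of the faulty-reducer condition described above.
    static boolean reducerFaulty(int shufflesFailed, int shufflesAttempted,
                                 int shufflesSucceeded, int shufflesNecessary,
                                 boolean stalledTooLong) {
        boolean tooManyFailures   = shufflesFailed > 0.5 * shufflesAttempted;
        boolean tooLittleProgress = shufflesSucceeded < 0.5 * shufflesNecessary;
        return (tooManyFailures && tooLittleProgress) || stalledTooLong;
    }

    public static void main(String[] args) {
        // Early in the shuffle, a healthy reducer that happens to contact the
        // failed Task Tracker first can trip the check (see "Induced Reducer Death").
        System.out.println(reducerFaulty(6, 10, 4, 160, false));   // true
        // Late in the shuffle, the same number of failed shuffles is harmless.
        System.out.println(reducerFaulty(6, 100, 94, 160, false)); // false
    }
}
```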
Experiment: Methodology
• 15-node, 4-rack testbed in the OpenCirrus* cluster
• 14 compute nodes, 1 reserved for Job Tracker and Name Node
• Sort job, 10GB input, 160 maps, 14 reducers, 200 runs/experiment
• Job takes 220s in the absence of failures
• Inject a single Task Tracker process failure at a time chosen randomly between 0 and 220s
* https://opencirrus.org/ the HP/Intel/Yahoo! Open Cloud Computing Research Testbed
Experiment: Results
Large variability in job running times
Experiment: Results
[CDF of job running times, showing distinct groups G1–G7]
Large variability in job running times
Group G1 – few reducers impacted
M1 copied by all reducers before failure.
After failure R1_1 cannot access M1.
R1_1 needs to send 3 notifications ~ 1250s
Task Tracker declared dead after 600-800s
[Diagram: maps M1–M3, reducers R1–R3; the Task Tracker running M1 fails; reducer attempt R1_1 sends Notif(M1) to the Job Tracker]
Slow recovery when few reducers impacted
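Rough arithmetic behind the numbers above (illustrative, assuming the ~416s-per-notification delay from the reducer back-off): a single impacted reducer cannot trigger the lost-output path anywhere near as fast as the dead-tracker path.

```java
public class GroupG1Timing {
    public static void main(String[] args) {
        double perNotification = 416;                 // seconds per notification (10 back-off attempts)
        double lostOutputPath  = 3 * perNotification; // ~1250s for R1_1's 3 notifications
        System.out.printf("lost-output path:  ~%.0fs%n", lostOutputPath);
        System.out.println("dead-tracker path: 600-800s");
        // Recovery in G1 therefore waits on the Task Tracker being declared dead.
    }
}
```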
Group G2 – timing of failure
200s difference between G1 and G2.
[Timelines for G1 and G2: in both, the failure occurs at ~170s and the 600s expiry applies, but the job end differs by 200s depending on where the failure falls relative to the Job Tracker's checks]
Timing of failure relative to Job Tracker checks impacts job running time
Group G3 – early notifications
• G1 notifications sent after 416s
• G3 early notifications => map outputs declared lost
Causes:
• Code-level race conditions
• Timing of a reducer’s shuffle attempts
[Diagram: shuffle-attempt timelines (attempts 0–6) for a regular notification (416s) and an early notification (<416s)]
Early notifications increase job running time variability
Group G4 & G5 – many reducers impacted
G4 - Many reducers send notifications after 416s
- Map output is declared lost before the Task Tracker is declared dead
G5 - Same as G4 but early notifications are sent
[Diagram: the Task Tracker running M1–M3 fails; many reducers (R1–R3, R1_1) are impacted and Notif(M1,M2,M3,M4,M5) reaches the Job Tracker]
Job running time under failure varies with nr of reducers impacted
Induced Reducer Death
A reducer is declared faulty if (simplified version):
#shuffles failed > 0.5 * #shuffles attempted
and
#shuffles succeeded < 0.5 * #shuffles necessary
or
the reducer has stalled for too long
• If the failed Task Tracker is contacted among the first Task Trackers => the reducer dies
• If the failed Task Tracker is attempted too many times => the reducer dies
A failure can induce other failures in healthy reducers.
CPU time and network bandwidth are unnecessarily wasted.
56 vs 14 Reducers
[CDF of job running times with 56 reducers vs. 14]
Job running times are spread out even more
Increased chance for induced reducer death or early notifications
Simulating Node Failure
[CDF of job running times under simulated node failure]
Without RST packets, all affected tasks wait for the Task Tracker to be declared dead.
Lack of Adaptivity
Recall:
• Notification sent after 10 attempts
Inefficiency:
• A static, one-size-fits-all solution cannot handle all situations
• Efficiency varies with the number of reducers
A way forward:
• Use more detailed information about current job state
Conservative Design
Recall:
• Declare a Task Tracker dead after at least 600s
• Send a notification after 10 attempts and 416 seconds
Inefficiency:
• Assumes most problems are transient
• Sluggish response to permanent compute-node failure
A way forward:
• Additional information should be leveraged
• Network state information
• Historical information of compute-node behavior [OSDI ‘10]
Simplistic Failure Semantics
• Lack of TCP connectivity is treated as a problem with the tasks
Inefficiency:
• Cannot distinguish between multiple causes for lack of connectivity
• Transient congestion
• Compute-node failure
A way forward:
• Decouple failure recovery from overload recovery
• Use AQM/ECN to provide extra congestion information
• Allow direct communication between application and infrastructure
Thank you
Company and product logos are from the companies' websites.
Conference logos from the conference websites.
Links to images:
http://t0.gstatic.com/images?q=tbn:ANd9GcTQRDXdzM6pqTpcOil-k2d37JdHnU4HKue8AKqtKCVL5LpLPV-2
http://www.scanex.ru/imgs/data-processing-sample1.jpg
http://t3.gstatic.com/images?q=tbn:ANd9GcQSVkFAbm-scasUkz4lQ-XlPNkbDX9SVD-PXF4KlGwDBME4ugxc
http://criticalmas.com/wp-content/uploads/2009/07/the-borg.jpg
http://www.brightermindspublishing.com/wp-content/uploads/2010/02/advertising-billboard.jpg
http://www.planetware.com/i/photo/logs-stacked-in-port-st-lucie-fla513.jpg
Group G3 – early notifications
• G1 notifications sent after 416s
• G3 early notifications => map outputs declared lost
Causes:
• Code-level race conditions
• Timing of a reducer’s shuffle attempts
[Diagram: two interleavings of reducer R2's shuffle attempts (M5-1…M5-4, M6-1…M6-5) for map outputs M5 and M6 on the failed Task Tracker; one timing produces a regular notification, the other an early one]
Early notifications increase job running time variability
Task Tracker Failure-Related Mechanisms
• Declaring a Task Tracker dead
• Declaring a map output lost
• Declaring a reducer faulty