Understanding the Effects and Implications of Compute Node Failures in Hadoop
Florin Dinu
T. S. Eugene Ng
Computing in the Big Data Era
[Figure: data set sizes in the big-data era – 15PB, 20PB, 100PB, 120PB]
• Big Data – Challenging for previous systems
• Big Data Frameworks
– MapReduce @ Google
– Dryad @ Microsoft
– Hadoop @ Yahoo & Facebook
2
Hadoop Is Widely Used
• Protein Sequencing
• Image Processing
• Web Indexing
• Machine Learning
• Advertising Analytics
• Log Storage and Analysis
… and many more
3
Building Around Hadoop
[Figure: research building around Hadoop, e.g. SIGMOD 2010]
4
Building On Top Of Hadoop
Building on core Hadoop functionality
5
The Danger of Compute-Node Failures
“ In each cluster’s first year,
it’s typical that 1,000 individual machine failures will occur;
thousands of hard drive failures will occur”
Jeff Dean – Google I/O 2008
“ Average worker deaths per job: 5.0 ”
Jeff Dean – Keynote I – PACT 2006
Causes:
• large scale
• use of commodity components
6
The Danger of Compute-Node Failures
Amazon, SOSP 2009
In the cloud, compute node failures
are the norm, NOT the exception
7
Failures From Hadoop’s Point of View
Situations indistinguishable from compute node failures:
• Switch failures
• Longer-term disconnectivity
• Unplanned reboots
• Maintenance work (upgrades)
• Quota limits
Challenging environments:
• Spot markets (price-driven availability)
• Volunteer computing systems
• Virtualized environments
It is important to understand the effect of compute-node failures on Hadoop
8
The Problem
• Hadoop is widely used
• Compute node failures are common
Hadoop needs to be failure resilient in an efficient way:
• Minimize impact on job running times
• Minimize resources needed
9
Contribution
• First in-depth analysis of the impact of
failures on Hadoop
– Uncover several inefficiencies
• Potential for future work
– Immediate practical relevance
– Basis for realistic modeling of Hadoop
10
Quick Hadoop
Background
11
Background – the Tasks
[Figure: Hadoop architecture – a Master (JobTracker + NameNode) hands out work to worker nodes, each running a TaskTracker and a DataNode; M = map task, R = reducer task; the example job runs 2 waves of maps and 2 waves of reducers]
12
Background – Data Flow
[Figure: data flow – input read from HDFS → Map Tasks → Shuffle → Reducer Tasks → output written to HDFS]
13
Background – Speculative Execution
Ideal case: tasks have similar progress rates
0 <= Progress Score <= 1
Progress Rate = Progress Score / time (e.g. 0.05/sec)
14
Background – Speculative Execution (SE)
Goal of SE:
• Detect underperforming nodes
• Duplicate the computation
Reality:
Varying progress rates
Reasons for underperforming tasks
Node overload, network congestion, etc.
Underperforming tasks (outliers) in Hadoop:
> 1 STD slower than mean progress rate
15
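A minimal sketch of this outlier rule, in Java (not Hadoop's actual code; class and method names are hypothetical):

```java
import java.util.List;

// Hypothetical sketch of the rule above: Progress Rate = Progress Score / time,
// and a task is an outlier if its rate is more than 1 STD below the mean rate.
public class OutlierRuleSketch {

    // progressScore is in [0, 1]; elapsedSeconds is time since the task started
    static double progressRate(double progressScore, double elapsedSeconds) {
        return progressScore / elapsedSeconds;   // e.g. 0.05/sec
    }

    static boolean isOutlier(double taskRate, List<Double> allRates) {
        double mean = allRates.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = allRates.stream()
                .mapToDouble(r -> (r - mean) * (r - mean))
                .average().orElse(0.0);
        return taskRate < mean - Math.sqrt(var); // > 1 STD slower than the mean
    }
}
```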
How does Hadoop
detect failures?
16
Failures of the Distributed Processes
[Figure: the Master exchanges heartbeats with the TaskTracker and DataNode processes on each worker node]
Timeouts, Heartbeats & Periodic Checks
17
Timeouts, Heartbeats & Periodic Checks
• A failure interrupts the heartbeat stream
• The Master periodically checks for changes
• Failure is declared only after a number of such checks
Conservative approach – last line of defense
18
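A minimal sketch of this heartbeat and periodic-check idea (hypothetical class; the ~600s expiry value is only taken from the numbers reported later in this deck):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the master records the last heartbeat per worker; a
// periodic check declares a worker failed only after its heartbeat stream has
// been silent longer than a conservative expiry interval (last line of defense).
public class HeartbeatMonitorSketch {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long expiryMillis;   // e.g. ~600_000 ms, per the experiments later in the deck

    public HeartbeatMonitorSketch(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    public void onHeartbeat(String workerId) {
        lastHeartbeat.put(workerId, System.currentTimeMillis());
    }

    // Called periodically: a failed worker stops sending heartbeats, so its
    // entry becomes stale and it is eventually declared dead after several checks.
    public void periodicCheck() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((worker, last) -> {
            if (now - last > expiryMillis) {
                System.out.println("Declaring " + worker + " failed");
            }
        });
    }
}
```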
Failures of the Individual Tasks (Maps)
[Figure: reducers repeatedly try to fetch map outputs; after waiting Δt without success a reducer notifies the Master]
• Map failures are inferred from reducer notifications
• Conservative – does not react to temporary failures
19
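A minimal sketch of the notification-based inference (hypothetical names; the threshold of 3 notifications is taken from the backup slides on group G1):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: reducers that repeatedly fail to fetch a map output
// notify the master after waiting Δt; only when enough notifications accumulate
// is the map output declared lost and the map re-executed. This is deliberately
// conservative so that temporary fetch failures do not cause re-execution.
public class MapFailureInferenceSketch {
    private static final int NOTIFICATIONS_TO_DECLARE_LOST = 3;  // per the backup slides
    private final Map<String, Integer> notificationsPerMap = new HashMap<>();

    public void onFetchFailureNotification(String mapTaskId) {
        int count = notificationsPerMap.merge(mapTaskId, 1, Integer::sum);
        if (count >= NOTIFICATIONS_TO_DECLARE_LOST) {
            System.out.println("Map output " + mapTaskId + " declared lost; re-execute the map");
        }
    }
}
```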
Failures of the Individual Tasks (Reducers)
[Figure: reducer R's fetch requests go unanswered because the failed map does not respond]
A reducer is suspected faulty when:
• R complains too much (high ratio of failed to successful fetch attempts)
• R is stalled for too long (no new successful fetch attempts)
Notifications also help infer reducer failures
20
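A minimal sketch of the two reducer-health checks above (thresholds are illustrative, not Hadoop's exact values):

```java
// Hypothetical sketch of the two checks used to suspect a reducer:
// it either complains too much (too many failed fetch attempts) or it has
// stalled for too long (no new successful fetch attempts).
public class ReducerHealthSketch {

    static boolean complainsTooMuch(int failedAttempts, int successfulAttempts) {
        int total = failedAttempts + successfulAttempts;
        return total > 0 && (double) failedAttempts / total > 0.5;   // illustrative ratio
    }

    static boolean stalledTooLong(long lastSuccessfulFetchMillis, long stallLimitMillis) {
        return System.currentTimeMillis() - lastSuccessfulFetchMillis > stallLimitMillis;
    }
}
```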
Do these mechanisms
work well?
21
Methodology
• Focus on failures of distributed
components
(TaskTracker and DataNode)
• Inject these failures separately (into the TaskTracker or the DataNode)
• Single failures
– Enough to catch many shortcomings
– Identified mechanisms responsible
– Relevant to multiple failures too
22
Mechanisms Under Task Tracker Failure?
• OpenCirrus
• Sort 10GB
• 15 nodes
• 14 reducers
220s running time without failures
Findings also relevant to larger jobs
• Inject failure at a random time
LARGE, VARIABLE, UNPREDICTABLE job running times
Poor performance under failure
23
Clustering Results Based on Cause
• Few reducers impacted → notification mechanism ineffective → timeouts fire
• Some runs: failure has no impact (not due to notifications)
70% of cases: notification mechanism ineffective
24
Clustering Results Based on Cause
• More reducers impacted → notification mechanism detects the failure → timeouts do not fire
The notification mechanism detects the failure only in:
• Few cases
• A specific moment in the job
25
Side Effects: Induced Reducer Death
Failures propagate to healthy tasks
Negative Effects:
• Time and resource waste for re-execution
• Job failure - a small number of runs fail completely
[Figure: reducer R's fetch requests to the failed node all fail, so R reports too many failed attempts]
• R complains too much? (failed / total attempts, e.g. 3 out of 3 failed)
Unlucky reducers die early
26
Side Effects: Induced Reducer Death
• R stalled for too long? (no new successful attempts)
[Figure: the failed map never answers R's fetch requests, so R makes no progress]
All reducers may eventually die
Fundamental problem:
• Inferring task failures from connection failures
• Connection failures have many possible causes
• Hadoop has no way to distinguish the cause (src? dst?)
27
More Reducers: 4/Node = 56 Total
[Figure: CDF of job running times]
Job running time spread out even more
More reducers = more chances for the effects explained above
28
Effect of DataNode Failures
[Figure: the failure is now injected into the DataNode process]
29
Timeouts When Writing Data
• Write Timeout (WTO) – fires while writing data to the failed DataNode
30
Timeouts When Writing Data
• Connect Timeout (CTO) – fires while connecting to the failed DataNode
31
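The two timeouts correspond to the two phases of talking to a DataNode. A minimal sketch in plain java.net terms (illustrative values; the HDFS client uses its own configured timeouts and stream wrappers):

```java
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch: the Connect Timeout (CTO) bounds connection setup,
// while the Write Timeout (WTO) bounds a write that blocks on an
// unresponsive DataNode (plain sockets have no write timeout, which is why
// HDFS enforces its own; this sketch only illustrates the two phases).
public class DataNodeTimeoutsSketch {
    public static void writeBlock(String host, int port, byte[] data) throws Exception {
        try (Socket socket = new Socket()) {
            // Phase 1: connecting – a CTO fires if the DataNode is unreachable.
            socket.connect(new InetSocketAddress(host, port), 60_000);   // illustrative 60s

            // Phase 2: writing – if the DataNode stops responding mid-write,
            // the sender stalls here until the WTO (enforced by HDFS) fires.
            OutputStream out = socket.getOutputStream();
            out.write(data);
            out.flush();
        }
    }
}
```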
Effect on Speculative Execution
Outliers in Hadoop:
>1 STD slower than mean progress rate
[Figure: progress-rate distributions – outliers are tasks below AVG − 1*STD; a task with a very high progress rate raises AVG and STD, lowering the threshold]
32
Delayed Speculative Execution
[Figure: timeline of reducers R9 and R11 and the Avg(PR) − STD(PR) outlier threshold]
• ~50s: reducers wait for mappers
• ~100s: map outputs read
• ~150s: reducers write output
33
Delayed Speculative Execution
[Figure: after the failure both R9 and R11 are stuck in a Write Timeout (WTO); the speculated copy of R9 advances very fast and the threshold only becomes low enough for R11 much later]
• ~200s: failure occurs, the reducers hit the WTO, R9 is speculatively executed
• >200s: the new R9 skews the progress-rate statistics
• ~400s: R11 is finally speculatively executed
34
Delayed Speculative Execution
• Hadoop's assumptions about progress rates are invalidated
• Statistics skewed by the very fast speculated task
• Significant impact on job running time
35
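A small worked example of the skew, with purely illustrative progress rates (not measured values): adding the very fast speculated copy of R9 drags the AVG − STD threshold below the rate of the stuck R11, so R11 stops being flagged as an outlier.

```java
// Illustrative numbers only: healthy reducers progress at ~0.005/s, the stuck
// R11 at ~0.001/s, and the freshly speculated R9 at ~0.05/s. The fast copy
// inflates both AVG and STD, so AVG - STD drops below R11's rate and R11 is
// no longer detected as an outlier until much later.
public class DelayedSESkewExample {

    static double threshold(double[] rates) {        // AVG - STD
        double mean = 0, var = 0;
        for (double r : rates) mean += r;
        mean /= rates.length;
        for (double r : rates) var += (r - mean) * (r - mean);
        return mean - Math.sqrt(var / rates.length);
    }

    public static void main(String[] args) {
        double stuckR11 = 0.001;
        double before = threshold(new double[] { 0.005, 0.005, 0.005, stuckR11 });
        double after  = threshold(new double[] { 0.005, 0.005, 0.005, stuckR11, 0.05 });
        System.out.printf("before speculation: threshold=%.4f, R11 outlier=%b%n", before, stuckR11 < before);
        System.out.printf("after speculation : threshold=%.4f, R11 outlier=%b%n", after,  stuckR11 < after);
    }
}
```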
52 reducers – 1 Wave
[Figure: task timeline – reducers stuck in WTO, delayed speculative execution, CTO after WTO, reconnecting to the failed DataNode]
36
Delayed SE – A General Problem
• Failures and timeouts are not the only cause
• To suffer from delayed SE, a job needs:
• Slow tasks that would benefit from SE
• Shown here: tasks stuck in a WTO
• Other causes: slow or heterogeneous nodes,
slow transfers (heterogeneous networks)
• Fast-advancing tasks
• Shown here: varying data input availability
• Other causes: varying task input size,
varying network speed
Statistical SE algorithms need to be used carefully
37
Conclusion - Inefficiencies Under Failures
• Task Tracker failures
– Large, variable and unpredictable job running times
– Variable efficiency depending on the number of reducers
– Failures propagate to healthy tasks
– Success of TCP connections not enough
• Data Node failures
– Delayed speculative execution
– No sharing of potential failure information
(details in paper)
38
Ways Forward
• Provide dynamic info about infrastructure to
applications (at least in the private DCs)
• Make speculative execution cause aware
– Why is a task slow at runtime?
– Move beyond statistical SE algorithms
– Estimate the PR of tasks (using environment and data characteristics)
• Share some information between tasks
– In Hadoop tasks rediscover failures individually
– Lots of work on SE decisions (when, where to SE)
– These decisions can be invalidated by such runtime
inefficiencies
39
Thank you
40
Backup slides
41
Experiment: Results
[Figure: CDF of job running times, clustered into groups G1–G7]
Large variability in job running times
42
Group G1 – few reducers impacted
M1 copied by all reducers before failure.
After failure R1_1 cannot access M1.
R1_1 needs to send 3 notifications ~ 1250s
Task Tracker declared dead after 600-800s
[Figure: map M1's node fails (X); the new attempt R1_1 of reducer R1 cannot fetch M1 and sends notifications Notif(M1) to the Job Tracker]
Slow recovery when few reducers impacted
43
Group G2 – timing of failure
200s difference between G1 and G2.
[Figure: G1 and G2 timelines (markers at 170s, 200s, 600s) – the timing of the failure relative to the Job Tracker's periodic checks shifts when the timeout expires relative to the job's end]
Timing of failure relative to Job Tracker checks impacts job running time
44
Group G3 – early notifications
• G1 notifications sent after 416s
• G3 early notifications => map outputs declared lost
Causes:
• Code-level race conditions
• Timing of a reducer’s shuffle attempts
[Figure: timelines of reducer R2's shuffle attempts on map outputs M5 and M6 (attempts M5-1 … M6-5); their timing can trigger early notifications]
Early notifications increase job running time variability
45
Group G4 & G5 – many reducers impacted
G4 - Many reducers send notifications after 416s
- Map output is declared lost before the Task
Tracker is declared dead
G5 - Same as G4 but early notifications are sent
[Figure: after the failure (X), many reducers send notifications Notif(M1,M2,M3,M4,M5) to the Job Tracker]
Job running time under failure varies with the number of reducers impacted
46
Task Tracker Failures
• Few reducers impacted → not enough notifications → timeouts fire
• Many reducers impacted → enough notifications sent → timeouts do not fire
LARGE, VARIABLE, UNPREDICTABLE job running times
Efficiency varies with number of affected reducers
47
Node Failures: No RST Packets
[Figure: CDF of job running times]
No RST -> No Notifications -> Timeouts always fire
48
Not Sharing Failure Information
• Different SE algorithm (OSDI 08): tasks are speculatively executed even before the failure, so delayed SE is not the cause
• Both the initial task and the speculative task connect to the failed node
• No sharing of potential failure information
49
Delayed Speculative Execution
A task t is an outlier if: avg(PR(all)) − std(PR(all)) > PR(t)
[Figure: progress rates of R9 and R11, both stuck in a WTO, compared against the outlier limit]
Stats skewed by very fast speculative tasks.
Hadoop’s assumptions about prog. rates invalidated
50
Delayed Speculative Execution
Timeline:
• ~50s reducers wait for map outputs
• ~100s reducers get map outputs
• ~200s failure => reducers timeout
• ~200s R9 speculatively executed
huge progress rate
statistics skewed
• ~400s R11 finally speculatively executed
Stats skewed by very fast speculative tasks.
Hadoop’s assumptions about prog. rates invalidated
51