Combating Outliers in map-reduce
Srikanth Kandula
Ganesh Ananthanarayanan, Albert Greenberg,
Ion Stoica, Yi Lu, Bikas Saha, Ed Harris


[Figure: log-log plot of cluster size (10¹ to 10⁵ machines) vs. dataset size (GB = 10⁹ to EB = 10¹⁸ bytes). Map-reduce clusters occupy the large-cluster, large-dataset corner beyond HPC and parallel databases, processing e.g. the Internet, click logs, and bio/genomic data.]
map-reduce
• decouples operations on data (user-code) from mechanisms to scale
• is widely used
• Cosmos (based on MSR SVC’s Dryad) + SCOPE @ Bing
• MapReduce @ Google
• Hadoop inside Yahoo! and on Amazon’s Cloud (AWS)
An Example
What the user says:
Goal: find frequent search queries to Bing

SELECT Query, COUNT(*) AS Freq
FROM QueryTable
GROUP BY Query
HAVING Freq > X

How it works:
[Figure: a job manager assigns work to tasks and tracks their progress. Map tasks read file blocks 0–3 and write intermediate results locally; reduce tasks read that output over the network and write output blocks 0 and 1.]
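The same computation can be sketched as a map/reduce in plain Python (a toy illustration; the helper name and example data are not from the slide, only `QueryTable`'s role and the threshold X are):

```python
from collections import Counter

def frequent_queries(query_table, x):
    """Map: emit (query, 1) for each record; shuffle groups by query;
    reduce: sum the counts and keep queries with Freq > x."""
    counts = Counter(query_table)          # map + shuffle + reduce, collapsed
    return {q: c for q, c in counts.items() if c > x}

print(frequent_queries(["a", "b", "a", "a", "b", "c"], 1))
# -> {'a': 3, 'b': 2}
```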
We find that:
Outliers slow down map-reduce jobs
[Figure: phases of a production job and their task counts: Map.Read 22K, Map.Move 15K, Map 13K, a barrier, then Reduce 51K; maps read from and write to the file system.]
Goals
• speeding up jobs improves productivity
• predictability supports SLAs
• … while using resources efficiently
This talk…
Identify fundamental causes of outliers
– concurrency leads to contention for resources
– heterogeneity (e.g., disk loss rate)
– map-reduce artifacts
Current schemes duplicate long-running tasks
Mantri: A cause-, resource-aware mitigation scheme
• takes distinct actions based on cause
• considers resource cost of actions
Results from a production deployment
Why bother? Frequency of Outliers
stragglers = tasks that take ≥ 1.5 times the median task in that phase
recomputes = tasks that are re-run because their output was lost

[Figure: task timeline of a phase, marking stragglers and an outlier.]

• The median phase has 10% stragglers and no recomputes
• 10% of the stragglers take >10X longer
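The straggler definition above translates directly into code; a minimal sketch (the function name is illustrative):

```python
from statistics import median

def find_stragglers(durations, threshold=1.5):
    """Return indices of tasks whose runtime is at least `threshold`
    times the median task runtime of the phase."""
    med = median(durations)
    return [i for i, d in enumerate(durations) if d >= threshold * med]

# One task runs far longer than its peers in this phase.
print(find_stragglers([10, 11, 9, 12, 40]))  # -> [4]
```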
Why bother? Cost of outliers
(what-if analysis, replays logs in a trace driven simulator)
At the median, jobs are slowed down by 35% due to outliers
Why outliers?
runtime = f(input, …)
Problem: Due to unavailable input, tasks have to be recomputed
[Figure: a job with map, sort, and reduce phases; the delay due to a recompute readily cascades to downstream phases.]
(simple) Idea: Replicate intermediate data, use copy if original is unavailable
Challenge(s) What data to replicate? Where? What if we still miss data?
Insights:
• 50% of the recomputes are on 5% of machines
• cost to recompute vs. cost to replicate
[Figure: a task on machine M2 reads the output of a task on machine M1.]

t_redo = r₂ · (t₂ + t_redo,1)

t = predicted runtime of a task
r = predicted probability of recompute at a machine
t_redo,1 = (recursive) recompute cost of the upstream task on M1
t_rep = cost to copy data over within the rack

Mantri replicates output when t_rep < t_redo, preferentially acting on the more costly recomputes
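Under these definitions, the replicate-or-not decision can be sketched as follows (the function names and the example numbers are illustrative, not from the slide):

```python
def t_redo(r, t, upstream_t_redo=0.0):
    """Expected cost of recomputing a task: with probability r its output
    is lost, costing the task's own runtime t plus, recursively, the
    recompute cost of the upstream task that produced its input."""
    return r * (t + upstream_t_redo)

def should_replicate(t_rep, r, t, upstream_t_redo=0.0):
    """Replicate intermediate output iff copying within the rack is
    cheaper than the expected cascaded recompute."""
    return t_rep < t_redo(r, t, upstream_t_redo)

# Task on M2 (r2 = 0.25, t2 = 100s) reads from a task on M1 (r1 = 0.5, t1 = 8s).
redo_1 = t_redo(0.5, 8)                         # 4.0
print(t_redo(0.25, 100, redo_1))                # -> 26.0
print(should_replicate(10, 0.25, 100, redo_1))  # -> True: a 10s copy beats 26s
```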
Why outliers?
runtime = f(input, network, …)

Problem: Tasks reading input over the network experience variable congestion
[Figure: reduce tasks reading map output from multiple racks over the network.]

Uneven placement is typical in production:
• reduce tasks are placed at the first available slot
Idea: Avoid hot-spots, keep traffic on a link proportional to bandwidth
Challenge(s) Global co-ordination across jobs? Where is the congestion?
Insights:
• local control is a good approximation (each job balances its traffic)
• link utilizations average out on the long term and are steady on the short term
If rack i has dᵢ map output and uᵢ, vᵢ bandwidths available on its uplink and downlink, place an aᵢ fraction of the reduces in rack i such that:

{aᵢ} = arg min maxᵢ (Tᵢᵘᵖ, Tᵢᵈᵒʷⁿ)

where Tᵢᵘᵖ and Tᵢᵈᵒʷⁿ are the times to move map output out of and into rack i.
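A two-rack sketch of this placement, assuming rack i's uplink carries its map output headed elsewhere, dᵢ(1−aᵢ), and its downlink carries the rest of the output headed to it, aᵢ(D−dᵢ); the exact objective terms are an assumption, and the brute-force search stands in for the real optimizer:

```python
def placement_times(a, d, u, v):
    """Worst per-rack transfer time for a candidate placement.
    a[i]: fraction of reduces in rack i, d[i]: map output in rack i,
    u[i]/v[i]: available uplink/downlink bandwidth of rack i."""
    D = sum(d)
    return [max(d[i] * (1 - a[i]) / u[i],       # output leaving rack i
                a[i] * (D - d[i]) / v[i])       # output arriving at rack i
            for i in range(len(d))]

def best_placement(d, u, v, steps=1000):
    """Brute-force the two-rack split that minimizes the slowest rack."""
    best_a, best_t = None, float("inf")
    for s in range(steps + 1):
        a = [s / steps, 1 - s / steps]
        t = max(placement_times(a, d, u, v))
        if t < best_t:
            best_a, best_t = a, t
    return best_a, best_t

a, t = best_placement(d=[80, 20], u=[1, 1], v=[1, 1])
print(round(a[0], 3), round(t, 3))  # -> 0.8 16.0
```

With 80% of the map output in rack 0 and equal bandwidths, roughly 80% of the reduces land in rack 0, equalizing the two links.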
Why outliers?
runtime = f(input, network, machine, …)
Persistently slow machines rarely cause outliers
Cluster Software (Autopilot) quarantines persistently faulty machines
Why outliers?
runtime = f(input, network, machine, dataToProcess, …)

Problem: About 25% of outliers occur due to more dataToProcess

Solution: Ignoring these is better than the state of the art (duplicating them)!

In an ideal world, we could divide work evenly… Instead, we schedule tasks in descending order of dataToProcess.

Theorem [due to Graham, 1969]: doing so is no more than 33% worse than the optimal schedule.
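The descending-order rule is classic LPT (longest processing time first) scheduling, which Graham's bound covers; a sketch, with illustrative task sizes:

```python
import heapq

def lpt_schedule(sizes, n_machines):
    """Assign tasks in descending order of dataToProcess to the currently
    least-loaded machine; Graham (1969) bounds the resulting makespan at
    4/3 of optimal, i.e., no more than 33% worse."""
    loads = [(0.0, m) for m in range(n_machines)]  # min-heap of (load, machine)
    heapq.heapify(loads)
    assignment = {m: [] for m in range(n_machines)}
    for size in sorted(sizes, reverse=True):
        load, m = heapq.heappop(loads)
        assignment[m].append(size)
        heapq.heappush(loads, (load + size, m))
    return assignment, max(load for load, _ in loads)

assign, makespan = lpt_schedule([7, 5, 4, 3, 1], 2)
print(makespan)  # -> 10.0 ({7, 3} vs. {5, 4, 1}; optimal here, since total is 20)
```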
Why outliers?
runtime = f(input, network, machine, dataToProcess, …)

Problem: ~25% of outliers remain, likely due to contention at the machine

Idea: Restart tasks elsewhere in the cluster

Challenge(s): The earlier the better, but should we restart an outlier or start a pending task?
[Figure: timeline of a running task with remaining time t_rem and a potential restart that would take t_new, evaluated now.]

A restart saves time and resources iff ℙ( t_new < (c / (c+1)) · t_rem ) is high, where c is the number of copies of the task already running.
If the predicted time is much better, kill the original and restart elsewhere
Else, if other tasks are pending, duplicate iff it saves both time and resources
Else (no pending work), duplicate iff the expected savings are high
Continuously observe, and kill wasteful copies
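The duplicate test above can be sketched by estimating ℙ(t_new < c/(c+1) · t_rem) from the runtimes of completed tasks in the phase (the 0.5 probability threshold and the sample-based estimate are illustrative assumptions, not the deck's exact policy):

```python
def worth_duplicating(t_new_samples, t_rem, copies=1):
    """Duplicate a running task iff a fresh copy would likely finish in
    under c/(c+1) of the task's remaining time (c = copies already
    running), so the duplicate saves both time and resources in expectation."""
    cutoff = copies / (copies + 1) * t_rem
    p = sum(1 for t in t_new_samples if t < cutoff) / len(t_new_samples)
    return p > 0.5  # illustrative confidence threshold

# 100s remain; fresh tasks in this phase have taken 30-60s, so duplicate.
print(worth_duplicating([30, 40, 45, 60], t_rem=100))  # -> True
# If fresh tasks typically take ~90s or more, keep waiting instead.
print(worth_duplicating([90, 95, 120], t_rem=100))     # -> False
```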
Summary
runtime = f(input, network, machine, dataToProcess, …)
a) preferentially replicate costly-to-recompute tasks
b) each job locally avoids network hot-spots
c) quarantine persistently faulty machines
d) schedule in descending order of data size
e) restart or duplicate tasks, cognizant of resource cost; prune wasteful copies
Theme: cause-, resource-aware action
An explicit attempt to decouple the solutions, with partial success
Results
Deployed in production cosmos clusters
• Prototype Jan ’10, baking on pre-production clusters → released May ’10
Trace driven simulations
• thousands of jobs
• mimic workflow, task runtime, data skew, failure prob.
• compare with existing schemes and idealized oracles
In production, restarts…
improve on native Cosmos by 25%
while using fewer resources
Comparing jobs in the wild
[Figure: CDFs of % cluster resources used, pre-release vs. post-release.]

340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release)
In trace-replay simulations, restarts…
are much better dealt with in a cause-, resource-aware manner

[Figure: CDFs of % cluster resources used under different mitigation schemes.]
Protecting against recomputes

[Figure: CDF of % cluster resources used.]
Outliers in map-reduce clusters
• are a significant problem
• happen due to many causes
– interplay between storage, network and map-reduce
• cause-, resource-aware mitigation improves on prior art
Back-up
Network-aware Placement