Reining in Outliers

Lecture 14: Combating Outliers in MapReduce Clusters
Xiaowei Yang
• References:
  – Reining in the Outliers in Map-Reduce Clusters using Mantri, by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris
  – http://research.microsoft.com/enus/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx
[Figure: log(size of cluster) vs. log(size of dataset). MapReduce operates at roughly 10^3–10^5 machines on datasets from GB (10^9) through TB (10^12) and PB (10^15) to EB (10^18), e.g., the Internet, click logs, bio/genomic data; HPC and databases sit at the smaller end of both axes.]
MapReduce
• Decouples customized data operations from mechanisms to scale
• Is widely used
  – Cosmos (based on SVC's Dryad) + Scope @ Bing
  – MapReduce @ Google
  – Hadoop inside Yahoo! and on Amazon's Cloud (AWS)
An Example
Goal: Find frequent search queries to Bing
What the user says:
SELECT Query, COUNT(*) AS Freq
FROM QueryTable
GROUP BY Query
HAVING Freq > X
How it Works:
[Figure: a job manager assigns work to tasks and collects progress. Map tasks read file blocks 0–3, run the Map function, and write their output locally; Reduce tasks read the map outputs (output blocks 0 and 1) and produce the final result.]
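To make the flow concrete, here is a minimal Python sketch (not Scope/Cosmos code) of the same query-frequency job: map tasks emit (query, 1) pairs from their file blocks, and a reduce step sums the counts and applies the Freq > X filter. The block contents and the threshold X are made-up illustrative values.

from collections import defaultdict
from itertools import chain

X = 2  # hypothetical frequency threshold

def map_task(block):
    # one map task: emit (query, 1) for every query in its file block
    return [(query, 1) for query in block]

def reduce_task(pairs):
    # one reduce step: sum counts per query, keep those with Freq > X
    counts = defaultdict(int)
    for query, n in pairs:
        counts[query] += n
    return {q: c for q, c in counts.items() if c > X}

blocks = [["bing", "maps"], ["bing", "news"], ["bing"]]
map_outputs = [map_task(b) for b in blocks]           # run in parallel on slots
print(reduce_task(chain.from_iterable(map_outputs)))  # {'bing': 3}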
We find that:
Outliers slow down map-reduce jobs
[Figure: a job's phases and the file system: Map.Read (22K), Map.Move (15K), Map (13K), barrier, Reduce (51K).]
Goals
• Speeding up jobs improves productivity
• Predictability supports SLAs
• … while using resources efficiently
What is an outlier?
• A phase (map or reduce) has n tasks and s slots (available compute resources)
• If every task takes T seconds to run, ideal run time = ceiling(n/s) * T
• In practice, t_i = f(datasize, code, machine, network), so task times vary
• A naïve scheduler finishes in about (Σ_{i=1..n} t_i)/s only when task times are uniform; outliers stretch the phase well beyond this
• Goal is to be closer to the lower bound max(max_i t_i, (Σ_{i=1..n} t_i)/s)
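A small Python sketch of these bounds, under the added assumption that a naïve scheduler runs tasks in waves of s and each wave waits for its slowest task:

import math

def ideal_uniform(n, s, T):
    # every task takes T seconds: ceiling(n/s) waves of T each
    return math.ceil(n / s) * T

def lower_bound(task_times, s):
    # no schedule beats the longest task or the total work spread over s slots
    return max(max(task_times), sum(task_times) / s)

def naive_waves(task_times, s):
    # assumed naïve scheduler: each wave of s tasks is gated by its slowest member
    waves = [task_times[i:i + s] for i in range(0, len(task_times), s)]
    return sum(max(w) for w in waves)

times = [10, 10, 10, 10, 10, 10, 10, 95]  # one outlier
print(lower_bound(times, s=4))            # 95.0: dominated by the outlier
print(naive_waves(times, s=4))            # 105: the outlier stretches its wave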
From a phase to a job
• A job may have many phases
• An outlier in an early phase has a cumulative
effect
• Data loss may cause multi-phase recomputes → outliers
Why outliers?
Problem: due to unavailable input, tasks have to be recomputed
[Figure: a job timeline across map, sort, and reduce phases; the delay due to a recompute readily cascades into later phases.]
Previous work
• The original MapReduce paper observed the problem
  – But didn't deal with it in depth
  – Its solution was to duplicate slow tasks
• Drawbacks
  – Some duplicates may be unnecessary
  – Duplicates use extra resources
  – When placement is the problem, a duplicate does not help
Quantifying the Outlier Problem
• Approach:
  – Understand the problem before proposing solutions
  – Understanding often leads to solutions
1. Prevalence of outliers
2. Causes of outliers
3. Impact of outliers
Why bother? Frequency of outliers
• Stragglers = tasks that take ≥ 1.5 times the median task in that phase
• Recomputes = tasks that are re-run because their output was lost
[Figure: prevalence of stragglers and recomputes across phases.]
• 50% of phases have 10% stragglers and no recomputes
• 10% of the stragglers take >10x longer
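The straggler definition above is directly computable; a minimal sketch, with made-up durations:

from statistics import median

def stragglers(task_durations, factor=1.5):
    # tasks taking >= factor x the phase median are flagged as stragglers
    m = median(task_durations)
    return [t for t in task_durations if t >= factor * m]

phase = [8, 9, 10, 10, 11, 12, 30, 110]
print(stragglers(phase))  # [30, 110]; 110 is >10x the median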
Causes of outliers: data skew
• In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to a large amount of data over the network
• Duplicating such tasks will not help: a duplicate must read the same large input
Non-outliers can be improved as well
• 20% of them run 55% longer than the median
• Problem: tasks reading input over the network experience variable congestion
[Figure: a reduce task reading map output across the network.]
• Uneven placement is typical in production: reduce tasks are placed at the first available slot
Causes of outliers: cross-rack traffic
• 50% of phases take 62% longer to finish than under ideal placement
• 70% of cross-rack traffic is reduce traffic
• Tasks in a spot with a slow network run slower
• Tasks compete for network bandwidth among themselves
• Reduce reads from every map
• Reduce is put into any spare slot
Causes of outliers: bad and busy machines
• 50% of recomputes happen on 5% of the machines
• Recomputes increase resource usage
• Outliers cluster by time
  – Resource contention might be the cause
• Recomputes cluster by machine
  – Data loss may cause multiple recomputes
Why bother? Cost of outliers
(a what-if analysis that replays logs in a trace-driven simulator)
At the median, jobs are slowed down by 35% due to outliers
Mantri Design
High-level idea
• Cause-aware and resource-aware
• Runtime = f(input, network, machine, dataToProcess, …)
• Fix each problem with a different strategy
Resource-aware restarts
• Duplicate or kill long-running outliers
When to restart
• Every ∆ seconds, tasks report progress
• Estimate t_rem (time for the running copy to finish) and t_new (time for a fresh copy)
• γ = 3
• Schedule a duplicate if the total running time is expected to be smaller: with c copies running, duplicate when P(c · t_rem > (c+1) · t_new) > δ
• When there are available slots, restart if the expected reduction in completion time exceeds the restart cost: E(t_rem − t_new) > ρ · ∆
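A hedged Python sketch of the two tests above; estimating the probability from paired samples of the two estimators, and the illustrative parameter values, are assumptions of this sketch, not Mantri's implementation:

def should_duplicate(t_rem_samples, t_new_samples, c, delta):
    # with c copies running, add one more only if it likely lowers the
    # total running time: P(c * t_rem > (c + 1) * t_new) > delta
    hits = sum(1 for tr, tn in zip(t_rem_samples, t_new_samples)
               if c * tr > (c + 1) * tn)
    return hits / len(t_rem_samples) > delta

def should_restart(e_t_rem, e_t_new, rho, report_interval):
    # with a free slot, restart if the expected saving beats the overhead:
    # E(t_rem - t_new) > rho * delta (the progress-report interval)
    return (e_t_rem - e_t_new) > rho * report_interval

# illustrative call: one running copy (c=1), samples from the estimators
print(should_duplicate([100, 120, 90], [40, 50, 45], c=1, delta=0.25))      # True
print(should_restart(e_t_rem=120, e_t_new=45, rho=2, report_interval=10))  # True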
Network-aware placement
• Compute the rack location for each task
• Find the placement that minimizes the maximum data-transfer time
• If rack i has d_i map output and u_i, v_i bandwidths available on uplink and downlink, place an a_i fraction of reduces in each rack so as to minimize the maximum, over racks, of max(d_i·(1−a_i)/u_i, (Σ_j d_j − d_i)·a_i/v_i)
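A sketch of one way to pick the fractions: equalizing each rack's uplink and downlink transfer times minimizes that rack's own max term. A fuller solver would also normalize the a_i to sum to 1; the d, u, v values here are made up.

def reduce_fractions(d, u, v):
    # d[i]: map output on rack i; u[i]/v[i]: uplink/downlink bandwidth
    D = sum(d)
    a = []
    for d_i, u_i, v_i in zip(d, u, v):
        up = d_i / u_i          # time factor to ship rack i's data out
        down = (D - d_i) / v_i  # time factor to pull other racks' data in
        a.append(up / (up + down) if up + down > 0 else 0.0)
    return a

d = [60, 30, 10]      # GB of map output per rack (made-up)
u = [1.0, 1.0, 0.5]   # uplink GB/s
v = [1.0, 1.0, 0.5]   # downlink GB/s
print(reduce_fractions(d, u, v))  # [0.6, 0.3, 0.1]: more reduces where the data is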
Avoid recomputation
• Replicate task output
  – Restart a task early if its data are lost
  – Replicate the output that is most costly to recompute (see the sketch below)
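A rough sketch of the implied cost-benefit test, with assumed inputs for the loss probability and the times to recompute and to replicate:

def should_replicate(p_data_loss, t_redo, t_rep):
    # replicate output when the expected recompute cost exceeds the
    # one-time replication cost
    return p_data_loss * t_redo > t_rep

print(should_replicate(p_data_loss=0.05, t_redo=600, t_rep=20))  # True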
Data-aware task ordering
• Outliers also arise from tasks with large inputs
• Schedule tasks in descending order of dataToProcess (see the sketch below)
• At most 33% worse than optimal scheduling
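Descending-order scheduling is the classic longest-processing-time-first (LPT) heuristic, whose makespan is within 4/3 of optimal, which matches the 33% bound above. A sketch, assuming task time is proportional to data size:

import heapq

def schedule_descending(data_to_process, s):
    # greedily place each task, largest input first, on the least-loaded slot
    slots = [0.0] * s  # finish time per slot (min-heap)
    heapq.heapify(slots)
    for size in sorted(data_to_process, reverse=True):
        heapq.heappush(slots, heapq.heappop(slots) + size)
    return max(slots)  # the phase finishes when the busiest slot does

print(schedule_descending([95, 10, 10, 10, 10, 10, 10, 10], s=4))  # 95.0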
Estimation of t_rem and t_new
• d: input data size
• d_read: the amount read so far
• Per the Mantri paper, t_rem ≈ t_elapsed · (d / d_read) + t_wrapup: extrapolate elapsed time over the fraction of input read, plus time to finish up
Estimation of t_new
• processRate: estimated from all tasks in the phase
• locationFactor: relative machine performance
• d: input size
• Per the paper, t_new ≈ processRate · locationFactor · d + schedLag
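A sketch of the two estimators as reconstructed above; the wrap-up time, scheduling lag, and all numeric values are illustrative assumptions:

def estimate_t_rem(t_elapsed, d, d_read, t_wrapup=0.0):
    # extrapolate elapsed time over the fraction of input read, plus wrap-up
    return t_elapsed * (d / max(d_read, 1)) + t_wrapup

def estimate_t_new(process_rate, location_factor, d, sched_lag=0.0):
    # a fresh copy: phase-wide per-byte rate, penalized for machine/rack
    return process_rate * location_factor * d + sched_lag

# illustrative values: a task has read 1/4 of a 1 GB input in 30 s
print(estimate_t_rem(t_elapsed=30, d=1 << 30, d_read=1 << 28))            # 120.0
print(estimate_t_new(process_rate=1e-7, location_factor=1.2, d=1 << 30))  # ~128.8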
Results
Deployed in production Cosmos clusters
• Prototype Jan '10, baking on pre-production clusters → released May '10
Trace-driven simulations
• Thousands of jobs
• Mimic workflow, task runtime, data skew, failure probability
• Compare with existing schemes and idealized oracles
Evaluation Methodology
• Mantri was run on production clusters
• The baseline is results from Dryad
• Use trace-driven simulations to compare with
other systems
Comparing jobs in the wild
• 340 jobs that each repeated at least five times during May 25–28 (release) vs. Apr 1–30 (pre-release)
• i.e., with and without Mantri for one month of jobs in the Bing production cluster
In production, restarts…
improve on native Cosmos by 25% while using fewer resources
[Figure: CDFs over % cluster resources.]
In trace-replay simulations, restarts…
are dealt with much better in a cause- and resource-aware manner
(each job repeated thrice)
Network-aware placement
• Equal: assumes all links have the same bandwidth
• Start: uses the bandwidths available at the start of the phase
• Ideal: uses the bandwidths actually available at run time
[Figure: CDF over % cluster resources.]
Protecting against recomputes
Summary
a) Reduce recomputation: preferentially replicate the output of costly-to-recompute tasks
b) Poor network: each job locally avoids network hot-spots
c) Bad machines: quarantine persistently faulty machines
d) DataToProcess: schedule in descending order of data size
e) Others: restart or duplicate tasks, cognizant of resource cost
Conclusion
• Outliers in map-reduce clusters are a significant problem
• They happen due to many causes
  – The interplay between storage, network, and map-reduce
• Cause- and resource-aware mitigation improves on prior art