Reining in the Outliers in MapReduce Jobs using Mantri

Ganesh Ananthanarayanan†, Srikanth Kandula*, Albert Greenberg*, Ion Stoica†, Yi Lu*, Bikas Saha*, Ed Harris*
† UC Berkeley   * Microsoft
MapReduce Jobs

Basis of analytics in modern Internet services
◦ E.g., Dryad, Hadoop

Job → {Phase} → {Task}

Graph flow consists of pipelines as well as strict blocks
Example Dryad Job Graph
[Figure: example Dryad job graph. Data flows from the distributed file system through EXTRACT, AGGREGATE_PARTITION, FULL_AGGREGATE, PROCESS and COMBINE phases (map, reduce and join steps) and back to the distributed file system. Some phases pipeline their input; others, such as the join, are blocked until input is done.]
Log Analysis from Production

Logs from production cluster with thousands of machines, sampled over six months

10,000+ jobs, 80PB of data, 4PB network transfers
◦ Task-level details
◦ Production and experimental jobs
Outliers hurt!

Tasks that run longer than the rest in the phase
◦ Median phase has 10% outliers, running >10x longer
◦ Slow down jobs by 35% at median

Operational Inefficiency
◦ Unpredictability in completion times affects SLAs
◦ Hurts development productivity
◦ Wastes compute-cycles
Why do outliers occur?
Read Input: input unavailable, network congestion
Execute: local contention, workload imbalance

Mantri: a system that mitigates outliers based on root-cause analysis
Mantri’s Outlier Mitigation

Avoid Recomputation

Network-aware Task Placement

Duplicate Outliers

Cognizant of Workload Imbalance
Recomputes: Illustration
[Figure: (a) in barrier phases and (b) with cascading recomputes, recompute tasks inflate the actual completion time well over the ideal.]
What causes recomputes? [1]

Faulty machines
◦ Bad disks, non-persistent hardware quirks

Set of faulty machines varies with time, not constant (4%)
What causes recomputes? [2]

Transient machine load
◦ Recomputes correlate with machine load
◦ Requests for data access dropped
Replicate costly outputs
MRi: recompute probability of machine i

Task 1 → Task 2 → Task 3 (on machines with recompute probabilities MR2, MR3)

Replicating Task 2's output costs TRep; losing it costs
TRecomp = MR3 (1 − MR2) T3 + MR3 MR2 (T3 + T2)
(recompute only Task 3, or both Task 3 and Task 2)

REPLICATE iff TRep < TRecomp
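The replication test above can be sketched as follows; the function and variable names are illustrative, not Mantri's actual code.

```python
# Sketch of the cost-based replication test: replicate a task's output
# when replication is cheaper than the expected recomputation cost.

def should_replicate(t_rep, mr2, mr3, t2, t3):
    """Return True if replicating Task 2's output pays off.

    mr2, mr3 -- recompute (data-loss) probabilities of the machines
                holding the outputs of Task 2 and Task 3
    t2, t3   -- running times of Task 2 and Task 3
    t_rep    -- time to replicate Task 2's output
    """
    # Either only Task 3's output is lost (recompute Task 3 alone),
    # or Task 2's output is also lost (recompute both).
    t_recomp = mr3 * (1 - mr2) * t3 + mr3 * mr2 * (t3 + t2)
    return t_rep < t_recomp
```

With high machine-loss probabilities the expected recompute chain is long, so even a costly replication wins; with near-zero probabilities it never does.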
Transient Failure Causes

Recomputes manifest in clutches
◦ Machine is prone to cause recomputes till the problem is fixed (load abates, critical process restarts, etc.)

Clue: at least r recomputes within a time window t on a machine
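A minimal sketch of the clutch heuristic, assuming a per-machine sliding window of recompute timestamps; the class name and default thresholds are illustrative.

```python
from collections import defaultdict, deque

# Flag a machine as recompute-prone once it has caused at least r
# recomputes within a time window of t seconds.

class RecomputeDetector:
    def __init__(self, r=3, t=300.0):
        self.r = r
        self.t = t
        self.events = defaultdict(deque)  # machine -> recompute timestamps

    def record(self, machine, now):
        """Record a recompute on `machine` at time `now`; return True
        if the machine should now be treated as recompute-prone."""
        q = self.events[machine]
        q.append(now)
        # Drop events that fell out of the time window.
        while q and now - q[0] > self.t:
            q.popleft()
        return len(q) >= self.r
```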
Speculative Recomputes

Anticipatorily recompute tasks whose outputs are unread

[Figure: when a read of a task's input data fails, Mantri speculatively recomputes the unread upstream data before it is requested.]
Mantri’s Outlier Mitigation

Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.

Network-aware Task Placement

Duplicate Outliers

Cognizant of Workload Imbalance
Reduce Tasks
Tasks access output of tasks from previous phases
◦ Reduce phase (74% of total traffic)

[Figure: map outputs are read locally or over the network from the distributed file system; a reduce task behind a congested link becomes an outlier.]
Variable Congestion
[Figure: reduce tasks reading map outputs across racks see widely varying cross-rack congestion.]

Smart placement smoothens hotspots
Traffic-based Allotment
Goal: minimize phase completion time

For every rack i:
◦ di : data
◦ ui : available uplink bandwidth
◦ vi : available downlink bandwidth

Solve for task allocation fractions ai
Local Control is a good approx.
Goal: minimize phase completion time

For every rack i:
◦ di : data; D : data over all racks
◦ ui : available uplink bandwidth
◦ vi : available downlink bandwidth

(Link utilizations average out in the long term, and are steady in the short term)

Let rack i have ai fraction of tasks
◦ Time uploading, Tu = di (1 − ai) / ui
◦ Time downloading, Td = (D − di) ai / vi

Timei = max {Tu, Td}
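One way to solve for the fractions ai is a binary search on the phase completion time T: rack i can finish within T iff 1 − T·ui/di ≤ ai ≤ T·vi/(D − di), and T is feasible iff those bounds admit fractions summing to 1. This is an illustrative formulation (assuming at least two racks, each with some data), not necessarily Mantri's exact solver.

```python
# Binary-search sketch for the network-aware allotment: find the
# smallest completion time T with a feasible assignment of task
# fractions a_i, then distribute the fractions within the bounds.

def allot(d, u, v, iters=60):
    D = sum(d)
    n = len(d)

    def bounds(T):
        # Rack i finishes within T iff lo[i] <= a_i <= hi[i].
        lo = [max(0.0, 1.0 - T * u[i] / d[i]) for i in range(n)]
        hi = [min(1.0, T * v[i] / (D - d[i])) for i in range(n)]
        return lo, hi

    t_lo, t_hi = 0.0, max(d[i] / u[i] for i in range(n)) + D / min(v)
    for _ in range(iters):
        T = (t_lo + t_hi) / 2
        lo, hi = bounds(T)
        if sum(lo) <= 1.0 <= sum(hi) and all(l <= h for l, h in zip(lo, hi)):
            t_hi = T  # feasible: try a smaller completion time
        else:
            t_lo = T
    lo, hi = bounds(t_hi)
    # Start at the lower bounds, then hand out the remaining fraction
    # greedily without exceeding any rack's upper bound.
    a = lo[:]
    rem = 1.0 - sum(lo)
    for i in range(n):
        take = min(hi[i] - lo[i], rem)
        a[i] += take
        rem -= take
    return a
```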
Mantri’s Outlier Mitigation

Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.

Network-aware Task Placement
◦ Traffic on link proportional to bandwidth

Duplicate Outliers

Cognizant of Workload Imbalance
Contentions cause outliers

Tasks contend for local resources
◦ Processor, memory, etc.

Duplicate tasks elsewhere in the cluster
◦ Current schemes duplicate towards the end of the phase (e.g., LATE [OSDI 2008])

Duplicate the outlier or schedule a pending task?
Resource-Aware Restart
[Figure: timeline — at time `now`, the running task has trem time remaining; a potential restart would take tnew.]

Save time and resources: with c copies running, duplicate iff P((c + 1) tnew < c trem) is high

Continuously observe and kill wasteful copies
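The duplicate test can be sketched as below, assuming sampled estimates of trem and tnew and an illustrative probability threshold delta; the function name and sampling scheme are assumptions, not Mantri's implementation.

```python
# Resource-aware duplicate test: with c copies of a task running,
# launch one more only if it is likely to save both time and compute.

def should_duplicate(trem_samples, tnew_samples, c, delta=0.25):
    """trem_samples -- estimates of the running task's remaining time
    tnew_samples -- estimates of a fresh copy's running time
    c            -- number of copies currently running
    Duplicate iff P((c + 1) * tnew < c * trem) > delta."""
    wins = sum(1 for trem in trem_samples for tnew in tnew_samples
               if (c + 1) * tnew < c * trem)
    total = len(trem_samples) * len(tnew_samples)
    return wins / total > delta
```

Note how the bar rises with c: the more copies already running, the more a restart must promise before it is worth the extra slot.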
Mantri’s Outlier Mitigation

Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.

Network-aware Task Placement
◦ Traffic on link proportional to bandwidth

Duplicate Outliers
◦ Resource-Aware Restart

Cognizant of Workload Imbalance
Workload Imbalance

A quarter of the outlier tasks have more data to process
◦ Unequal key partitions for reduce tasks

Ignoring these is better than duplicating them

Schedule tasks in descending order of data to process
◦ Time ∝ (Data to Process)
◦ [Graham '69] At worst, 33% over optimal
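Scheduling in descending order of data to process is longest-processing-time-first scheduling, which Graham ('69) showed finishes within 4/3 of the optimal makespan. A sketch of the greedy rule, using data size as a stand-in for running time:

```python
import heapq

# Longest-processing-time-first: assign each task, largest first,
# to the currently least-loaded slot.

def lpt_schedule(task_sizes, n_slots):
    """Return (makespan, assignment), where assignment[i] is the slot
    index given to task i."""
    heap = [(0.0, slot) for slot in range(n_slots)]  # (load, slot)
    heapq.heapify(heap)
    assignment = [None] * len(task_sizes)
    order = sorted(range(len(task_sizes)),
                   key=lambda i: task_sizes[i], reverse=True)
    for i in order:
        load, slot = heapq.heappop(heap)
        assignment[i] = slot
        heapq.heappush(heap, (load + task_sizes[i], slot))
    return max(load for load, _ in heap), assignment
```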
Mantri’s Outlier Mitigation

Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.

Network-aware Task Placement
◦ Traffic on link proportional to bandwidth

Duplicate Outliers
◦ Resource-Aware Restart

Cognizant of Workload Imbalance
◦ Schedule in descending order of size

Principles: act based on the cause; predict to act early; be both reactive and proactive; be resource-aware
Results

Deployed in production Bing clusters

Trace-driven simulations
◦ Mimic workflow, failures, data skew
◦ Compare with existing and idealized schemes
Jobs in the Wild
Jobs faster by 32% at median, consuming fewer resources
◦ Act early: duplicates issued when task is 42% done (77% for Dryad)
◦ Light: issues fewer copies (0.47x as many as Dryad)
◦ Accurate: 2.8x higher success rate of copies
Recomputation Avoidance
Eliminates most recomputes with minimal extra resources

(Replication + Speculation) work well in tandem
Network-Aware Placement
With only approximate bandwidth estimates, Mantri well-approximates the ideal placement
Summary

From measurements in a production cluster:
◦ Outliers are a significant problem
◦ Are due to an interplay between storage, network and map-reduce

Mantri: a cause- and resource-aware mitigation

Deployment shows encouraging results

"Reining in the Outliers in MapReduce Clusters using Mantri", USENIX OSDI 2010