The Case for Tiny Tasks in Compute Clusters

Kay Ousterhout*, Aurojit Panda*, Joshua Rosen*, Shivaram Venkataraman*, Reynold Xin*, Sylvia Ratnasamy*, Scott Shenker*+, Ion Stoica*
* UC Berkeley, + ICSI
Setting

[Diagram: a job is split into tasks by a framework (MapReduce / Spark / Dryad); the tasks occupy slots 0-3 on cluster machines over time]

Use smaller tasks!

[Diagram: the same slots filled by today's tasks vs. tiny tasks]
Why? How? Where?
Problem: Skew and Stragglers

[Diagram: slots 0-3 over time; one task runs far longer than the rest. Contended machine? Data skew?]
Benefit: Handling of Skew and Stragglers

[Diagram: the same job under today's tasks vs. tiny tasks; tiny tasks spread the straggler's work evenly across slots 0-3]

As much as a 5.2x reduction in job completion time!
Problem: Batch and Interactive Sharing

Clusters are forced to trade off utilization and responsiveness!

[Diagram: slots 0-3 over time; a high-priority interactive job waits behind long low-priority batch tasks]
Benefit: Improved Sharing

[Diagram: today's tasks vs. tiny tasks; with tiny tasks, the high-priority interactive job starts almost immediately]

High-priority tasks are not subject to long wait times!
Benefits: Recap

(1) Straggler mitigation: Mantri (OSDI '10), Scarlett (EuroSys '11), SkewTune (SIGMOD '12), Dolly (NSDI '13), …

(2) Improved sharing: Quincy (SOSP '09), Amoeba (SOCC '12), …

[Diagram: today's tasks vs. tiny tasks, slots 0-3 over time, for each benefit]
Why? How? Where?
Schedule task

Scheduling requirements:
• High throughput (millions of tasks per second)
• Low latency (milliseconds)

→ Distributed scheduling (e.g., Sparrow)
Schedule task → Launch task

Use an existing thread pool to launch tasks + cache task binaries: task launch = RPC time (<1ms).
Schedule task → Launch task → Read input data

Smallest efficient file block size: 8MB. Distribute metadata (à la Flat Datacenter Storage, OSDI '12).
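A minimal sketch of the distributed-metadata idea, assuming FDS-style deterministic placement (the LocatorTable class and hashing scheme are illustrative, not the FDS API): every client holds a small, rarely-changing table of servers and computes a block's location locally, so no central metadata server sits on the read path of each 8MB block.

    import hashlib

    class LocatorTable:
        """Illustrative client-side table: block location is computed, not looked up."""

        def __init__(self, servers):
            self.servers = servers  # distributed to all clients up front

        def locate(self, file_id: str, block_index: int) -> str:
            # Hash (file, block) to a server deterministically; any client
            # computes the same answer with zero RPCs.
            key = f"{file_id}:{block_index}".encode()
            digest = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
            return self.servers[digest % len(self.servers)]

    table = LocatorTable([f"server-{i}" for i in range(100)])
    print(table.locate("job42/input.dat", 7))  # purely local computation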
Schedule task → Launch task → … → Read input data → Execute task + read data for next task

Tons of tiny transfers! The framework manages reads (enables optimizations, e.g., …).
How low can you go?

Schedule task → Launch task → Read input data → Execute task + read data for next task

With the smallest efficient disk block (8MB), reading input data and executing the task takes 100s of milliseconds.
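A minimal sketch of pipelined task execution, assuming hypothetical read_block and run_task hooks: while task i computes, a single I/O thread is already fetching task i+1's input, so the 8MB reads hide behind computation instead of adding to task latency.

    from concurrent.futures import ThreadPoolExecutor

    def run_pipelined(block_ids, read_block, run_task):
        # One background thread prefetches the next block while the
        # current task executes.
        with ThreadPoolExecutor(max_workers=1) as io:
            next_read = io.submit(read_block, block_ids[0])
            for i in range(len(block_ids)):
                data = next_read.result()  # wait for the current task's input
                if i + 1 < len(block_ids):
                    next_read = io.submit(read_block, block_ids[i + 1])  # prefetch
                run_task(data)  # computation overlaps the prefetch

    # Usage with toy stand-ins for disk reads and task execution:
    run_pipelined([0, 1, 2],
                  read_block=lambda b: list(range(b, b + 3)),
                  run_task=lambda d: print(sum(d)))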
Why? How? Where?
Original Job vs. Tiny Tasks Job

[Diagram: map tasks 1..N each emit values for keys K1..Kn; Reduce Task 1 gathers every value for its keys from all map tasks. In the tiny-tasks job, both the map and reduce work are cut into many more, smaller tasks.]

Original Reduce Phase vs. Tiny Tasks

[Diagram: all of key K1's values, previously handled by a single Reduce Task 1, are partitioned across several tiny tasks.]
Splitting Large Tasks

• Aggregation trees (see the sketch below)
  – Work for functions that are associative and commutative
• Framework-managed temporary state store
• Ultimately, need to allow a small number of large tasks
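A minimal sketch of an aggregation tree (the tree_aggregate helper is hypothetical, not from the talk): partial results from tiny tasks are combined in groups, level by level, until one result remains. As the slide notes, this is only correct when the combine function is associative and commutative, e.g. sum, max, or count.

    def tree_aggregate(values, combine, fanout=2):
        # Level 0: each tiny task holds one partial result.
        level = list(values)
        while len(level) > 1:
            # Each group of `fanout` partials becomes one task at the next level.
            level = [
                _combine_all(level[i:i + fanout], combine)
                for i in range(0, len(level), fanout)
            ]
        return level[0]

    def _combine_all(partials, combine):
        result = partials[0]
        for p in partials[1:]:
            result = combine(result, p)
        return result

    # Summing key K1's values: [1,2,3,4,5] -> [3,7,5] -> [10,5] -> 15
    print(tree_aggregate([1, 2, 3, 4, 5], combine=lambda a, b: a + b))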
Conclusion

Tiny tasks mitigate stragglers + improve sharing.

[Diagram: today's tasks vs. tiny tasks, slots 0-3 over time, for both benefits]

How: distributed scheduling; launch tasks in an existing thread pool; distributed file metadata; pipelined task execution.

Questions? Find me or Shivaram.
Backup Slides
Benefit of Eliminating Stragglers

[Diagram: a straggling job, slots 0-3 over time, under today's tasks vs. tiny tasks]

[CDF, based on a Facebook trace: improvement in job completion time (1x-10x) vs. cumulative fraction of jobs (0-1), broken down by original job size: 1-9 tasks, 10-99 tasks, 100+ tasks]

5.2x at the 95th percentile!
Why Not Preemption?

• Preemption only handles sharing (not stragglers)
• Task migration is time consuming
• Tiny tasks improve fault tolerance
Dremel/Drill/Impala

• Similar goals and challenges (supporting short tasks)
• Dremel statically assigns tablets to machines; it rebalances if the query dispatcher notices that a machine is processing a tablet slowly → standard straggler mitigation
• Most jobs expected to be interactive (no …)
Scheduling Throughput

10,000 machines × 16 cores/machine, with 100-millisecond tasks → over 1 million task scheduling decisions per second
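For completeness, the arithmetic behind the "over 1 million" figure (a direct computation from the slide's numbers):

\[
\frac{10{,}000 \text{ machines} \times 16 \text{ cores/machine}}{0.1 \text{ s per task}} = 1.6 \times 10^{6} \text{ scheduling decisions per second}
\]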
Sparrow: Technique

Place m tasks on the least loaded of d·m slaves.

[Diagram: a job of m = 2 tasks; a scheduler sends 4 probes (d = 2) to randomly chosen slaves and places the tasks on the 2 least loaded]

More at tinyurl.com/sparrow-scheduler
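A minimal sketch of this sampling step, assuming instantaneous load probes (the place_tasks function is illustrative, not Sparrow's implementation): probe d·m randomly chosen workers and place the m tasks on the m shortest queues.

    import random

    def place_tasks(m, worker_loads, d=2):
        # Probe d*m random workers for their queue lengths...
        probed = random.sample(list(worker_loads), d * m)
        # ...and place the m tasks on the m least loaded of them.
        probed.sort(key=lambda w: worker_loads[w])
        return probed[:m]

    # Usage matching the slide's example: m = 2 tasks, 4 probes (d = 2).
    loads = {f"slave-{i}": random.randint(0, 10) for i in range(6)}
    print(place_tasks(2, loads))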
Sparrow: Performance on TPC-H Workload

Within 12% of offline optimal; median queuing delay of 8ms.

More at tinyurl.com/sparrow-scheduler