Tetris: Multi-Resource Packing for Cluster Schedulers
Robert Grandl, Ganesh Ananthanarayanan,
Srikanth Kandula, Sriram Rao, Aditya Akella
Performance of cluster schedulers
We observe that:
 Resources are fragmented, i.e., machines run below capacity
 Even at 100% usage, goodput is much smaller due to over-allocation
 Even Pareto-efficient multi-resource fair schemes result in much lower performance
Tetris: up to 40% improvement in makespan¹ and job completion time, with near-perfect fairness
¹Makespan: time to finish a set of jobs
Findings from analysis of Bing and Facebook traces
Diversity in multi-resource requirements:
 Tasks need varying amounts of each resource
 Demands for resources are weakly correlated
Multiple resources become tight. This matters because there is no single bottleneck resource:
 e.g., there is enough cross-rack network bandwidth to use all CPU cores
Upper bound on potential gains:
 reduce makespan by up to 49%
 reduce avg. job completion time by up to 46%
Why so bad? #1
Production schedulers neither pack tasks nor consider all their relevant resource demands:
#1 Resource Fragmentation
#2 Over-Allocation
Resource Fragmentation (RF)
Current schedulers allocate resources in terms of slots.
Example: machines A and B each have 4 GB of memory; tasks T1 (2 GB), T2 (2 GB), and T3 (4 GB) are waiting.
 Current schedulers: T1 and T2 land on different machines, leaving 2 GB free on each; these free resources cannot be assigned to T3. Avg. task compl. time = 1.33 t
 “Packer” scheduler: places T1 and T2 on the same machine, so the other machine’s 4 GB stays free for T3. Avg. task compl. time = 1 t (see the check below)
RF increases with the number of resources being allocated!
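The averages above follow directly from the task finish times; a quick check of the arithmetic (assuming each task runs for one time unit t once placed):

# Slot-based scheduler: T1 and T2 finish at t; T3 cannot fit until one of them
# frees a machine, so it runs from t to 2t.
print((1 + 1 + 2) / 3)   # ~1.33 t
# Packer: T1 and T2 share one machine, T3 gets the other; all finish at t.
print((1 + 1 + 1) / 3)   # 1.0 t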
Over-Allocation
Not all task resource demands are explicitly allocated: disk and network are over-allocated.
Example: Machine A has 4 GB of memory and 20 MB/s of network; tasks T1 (2 GB memory, 20 MB/s network), T2 (2 GB memory, 20 MB/s network), and T3 (2 GB memory) are waiting.
 Current schedulers: only memory is explicitly allocated, so T1 and T2 run together and over-allocate the network (40 MB/s demanded vs. 20 MB/s available). Avg. task compl. time = 2.33 t
 “Packer” scheduler: accounts for the network demand and pairs T3 with a network-heavy task instead. Avg. task compl. time = 1.33 t (see the check below)
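Again, the averages follow from the finish times; a sketch of the arithmetic, assuming the 20 MB/s link is shared equally when over-allocated, so each network-bound task runs at half speed:

# Current scheduler: T1 and T2 share the 20 MB/s link and each takes 2t;
# T3 then runs from 2t to 3t.
print((2 + 2 + 3) / 3)   # ~2.33 t
# Packer: T1 and T3 run together and finish at t; T2 runs next and finishes at 2t.
print((1 + 1 + 2) / 3)   # ~1.33 t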
Why so bad? #2
Multi-resource fairness schemes do not help either.
Packer scheduler vs. DRF:
 Avg. Job Compl. Time: 50% better
 Makespan: 33% better
Work conserving != no fragmentation or over-allocation
Pareto¹-efficient != performant
 They treat the cluster as one big bag of resources
 This hides the impact of resource fragmentation
 They assume a job has a fixed resource profile
 But different tasks in the same job have different demands
 The schedule impacts a job’s current resource profile
 A scheduler can exploit this to create complementary profiles
¹No job can increase its share without decreasing the share of another
Current schedulers suffer from:
1. Resource Fragmentation
2. Over-Allocation
3. Fair allocations that sacrifice performance
Competing objectives: cluster efficiency vs. job completion time vs. fairness
#1 Pack tasks along multiple resources to improve cluster efficiency and reduce makespan
Theory
Multi-resource packing of tasks is similar to Multi-Dimensional Bin Packing, which is APX-Hard¹ (balls ≈ tasks; bins ≈ machines, time).
Avoiding fragmentation looks like tight bin packing: reducing the number of bins used reduces makespan.
Practice
Existing heuristics do not directly apply here: they assume balls of a fixed size that are known a priori. Real tasks, in contrast,
 vary with time and with the machine they are placed on
 are elastic
 arrive online, with dependencies and other cluster activity to cope with
¹APX-Hard is a strict subset of NP-hard
#1 Packing heuristic: Alignment score (A)
Fit: a task’s resource demand vector must be ≤ the machine’s available resource vector.
“A” works because:
1. The check for fit ensures no over-allocation ( Over-Allocation)
2. Bigger balls get bigger scores
3. Abundant resources are used first ( 2 and 3 counter Resource Fragmentation)
4. Load can be spread across machines
A sketch of the fit check and alignment score follows.
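A minimal sketch of this step, assuming the dot-product form of the alignment score described in the Tetris paper; the resource vectors and units below are illustrative, and the normalization of resources by machine capacity is omitted for brevity:

def fits(demand, available):
    # A task fits only if every resource demand is within what is free;
    # this check is what prevents over-allocation.
    return all(d <= a for d, a in zip(demand, available))

def alignment_score(demand, available):
    # Dot product of the task's demand vector and the machine's free resources:
    # bigger tasks get bigger scores, and machines whose abundant resources
    # match the task's large demands are preferred.
    return sum(d * a for d, a in zip(demand, available))

def best_task_for_machine(tasks, available):
    # Among the tasks that fit, pick the one with the highest alignment score.
    feasible = [t for t in tasks if fits(t["demand"], available)]
    return max(feasible, key=lambda t: alignment_score(t["demand"], available),
               default=None)

# Example: a machine with 4 cores, 8 GB memory, 100 MB/s disk, 50 MB/s network free.
machine_free = (4, 8, 100, 50)
tasks = [
    {"id": "t1", "demand": (2, 3, 10, 40)},
    {"id": "t2", "demand": (1, 6, 80, 5)},
    {"id": "t3", "demand": (4, 9, 10, 10)},   # does not fit: 9 GB > 8 GB free
]
print(best_task_for_machine(tasks, machine_free)["id"])   # "t2"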
#2 Faster average job completion time
#2 Job completion time heuristic
Shortest Remaining Time First¹ (SRTF) schedules jobs in ascending order of their remaining time.
Challenge: what is the shortest “remaining time”? A job’s “remaining work” depends on the remaining number of tasks, the tasks’ durations, and the tasks’ resource demands.
The heuristic gives a score P to every job; it extends SRTF to incorporate multiple resources (a sketch follows).
¹SRTF – M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS’99]
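One plausible reading of this score, sketched under the assumption that a job’s remaining work is the sum over its unfinished tasks of duration times aggregate resource demand (the exact scoring used by Tetris is defined in the paper):

# Multi-resource "remaining work": unlike plain SRTF, it accounts for task
# durations and resource demands, not just the number of remaining tasks.

def remaining_work(job):
    return sum(t["duration"] * sum(t["demand"]) for t in job["remaining_tasks"])

def srtf_order(jobs):
    # Jobs with the least remaining multi-resource work come first.
    return sorted(jobs, key=remaining_work)

jobs = [
    {"name": "j1", "remaining_tasks": [{"duration": 10, "demand": (2, 4)}] * 3},
    {"name": "j2", "remaining_tasks": [{"duration": 5, "demand": (1, 1)}] * 8},
]
print([j["name"] for j in srtf_order(jobs)])   # ['j2', 'j1']: j2 has less remaining work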
#2 Job completion time heuristic
Challenge: packing efficiency vs. completion time.
 Scheduling by A alone can delay job completion time
 Scheduling by P alone can lose packing efficiency
Combine the A and P scores!
1: among J runnable jobs
2:   for each job j, consider the task t in j with max A such that demand(t) ≤ R (resources free)
3:   score(j) = A(t, R) + P(j)
4: pick j*, t* = argmax score(j)
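A sketch of that selection loop, reusing the hypothetical fits, alignment_score, and remaining_work helpers from the earlier sketches; Tetris weighs the two scores against each other, while this sketch simply adds them as on the slide, treating P as the negated remaining work so that shorter jobs score higher:

def pick_next(runnable_jobs, free_resources):
    # Returns the (job, task) pair with the highest combined score, or (None, None).
    best = None   # (score, job, task)
    for job in runnable_jobs:
        candidates = [t for t in job["remaining_tasks"]
                      if fits(t["demand"], free_resources)]
        if not candidates:
            continue
        # Best-packing task of this job on the currently free resources.
        task = max(candidates,
                   key=lambda t: alignment_score(t["demand"], free_resources))
        score = alignment_score(task["demand"], free_resources) - remaining_work(job)
        if best is None or score > best[0]:
            best = (score, job, task)
    return (best[1], best[2]) if best else (None, None)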
#3 Achieve performance and fairness
#3 Fairness heuristic
Performance and fairness do not mix well in general. But … we can get “perfect fairness” and much better performance.
 A says: “task i should go here to improve packing efficiency”
 P says: “schedule job j next to improve job completion time”
 Fairness says: “this set of jobs should be scheduled next”
There is typically a feasible solution that satisfies all of them.
Fairness is not a tight constraint:
 Lose a bit of fairness for a lot of gains in performance
 Aim for long-term fairness, not short-term fairness
Heuristic: fairness knob F ∈ [0, 1)
 F = 0: most efficient scheduling
 F → 1: close to perfect fairness
Pick the best-for-performance task from among the (1 - F) fraction of jobs furthest from their fair share (see the sketch below).
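A minimal sketch of the knob, assuming each job carries hypothetical current_share and fair_share fields whose difference measures how far it is from its fair allocation:

def eligible_jobs(runnable_jobs, F):
    # Jobs furthest below their fair share come first.
    by_deficit = sorted(runnable_jobs,
                        key=lambda j: j["current_share"] - j["fair_share"])
    # Keep the (1 - F) fraction of jobs that are most deprived (at least one).
    keep = max(1, int(round((1.0 - F) * len(by_deficit))))
    return by_deficit[:keep]

# F = 0  -> all runnable jobs are eligible: most efficient scheduling.
# F -> 1 -> only the most deprived job(s) are eligible: close to perfect fairness.
# The packing/SRTF choice then runs over this restricted set, e.g.:
#   job, task = pick_next(eligible_jobs(jobs, F), free_resources)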
Putting it all together
We saw:
 Packing efficiency
 Prefer jobs with less remaining work
 Fairness knob
Other things in the paper:
 Estimate task demands
 Deal with inaccuracies, barriers
 Ingestion / evacuation
[YARN architecture with the changes that add Tetris shown in orange: Job Managers send multi-resource asks and barrier hints; Node Managers track resource usage, enforce allocations, and send resource availability reports; the cluster-wide Resource Manager gains new logic to match tasks to machines (+packing, +SRTF, +fairness) and exchanges asks, offers, and allocations with them.]
Evaluation
 Pluggable scheduler in Yarn 2.4
 250-machine cluster deployment
 Replay of Bing and Facebook traces
Efficiency
[Figure: cluster-wide utilization (%) over time of CPU, memory, network-in, and storage, including over-allocation, for Tetris and for the Capacity Scheduler; a lower value indicates higher resource fragmentation.]

Tetris vs.             Makespan    Avg. Job Compl. Time
Capacity Scheduler     29%         30%
DRF                    28%         35%

Gains come from:
 avoiding fragmentation
 avoiding over-allocation
Fairness
The fairness knob quantifies the extent to which Tetris adheres to fair allocation.

                          Makespan    Job Compl. Time    Avg. Slowdown [over impacted jobs]
No Fairness (F = 0)       50%         40%                25%
Full Fairness (F → 1)     10%         23%                2%
F = 0.25                  25%         35%                5%
Pack efficiently along multiple resources. Prefer jobs with less “remaining work”. Incorporate fairness.
 Combine heuristics that improve packing efficiency with those that lower average job completion time
 Achieving desired amounts of fairness can coexist with improving cluster performance
 Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results
We are working towards a YARN check-in.
http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
Backup slides
Estimating Resource Demands
Estimating resource requirements:
 peak usage demand estimates
 obtained from:
o input size / location of tasks
o finished tasks in the same phase
o statistics collected from recurring jobs
Placement impacts network/disk requirements.
[Figure: Resource Tracker view of Machine1 inbound network over time, in MBytes/s: network-in used vs. free, a peak demand of 850 out of 1024 MBytes/s, and periods of under-utilization.]
Resource Tracker:
o reports unused resources
o is aware of other cluster activities: ingestion and evacuation
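A rough sketch of the kind of estimator these bullets suggest, assuming (hypothetically) that the per-resource peak usage of tasks that already finished in the same phase predicts the next task of that phase:

def estimate_peak_demand(finished_task_peaks, default_demand):
    # Fall back to a default demand when no task of the phase has finished yet.
    if not finished_task_peaks:
        return default_demand
    # Otherwise take the observed peak in each resource dimension.
    return tuple(max(peak[i] for peak in finished_task_peaks)
                 for i in range(len(default_demand)))

print(estimate_peak_demand([(2, 3, 10, 40), (2, 4, 12, 35)], (1, 1, 1, 1)))
# (2, 4, 12, 40)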
Packer Scheduler vs. DRF
Cluster: [18 Cores, 36 GB Memory]
Jobs: [Task Profile], # tasks
A: [1 Core, 2 GB], 18 tasks
B: [3 Cores, 1 GB], 6 tasks
C: [3 Cores, 1 GB], 6 tasks

Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users:
max (qA, qB, qC)                (maximize allocations)
qA + 3qB + 3qC ≤ 18             (CPU constraint)
2qA + 1qB + 1qC ≤ 36            (memory constraint)
qA/18 = qB/6 = qC/6             (equalize DS)
which yields 6 A, 2 B, and 2 C tasks per time unit, i.e., DS = 1/3 for every job.

DRF schedule: (6 A, 2 B, 2 C) tasks in each of the time slots t, 2t, 3t; resources used per slot: 18 cores, 16 GB. Durations: A: 3t, B: 3t, C: 3t.
Packer schedule: all 18 A tasks at t (18 cores, 36 GB), then 6 B tasks (18 cores, 6 GB), then 6 C tasks (18 cores, 6 GB). Durations: A: t, B: 2t, C: 3t.
33% improvement in average job completion time.
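A quick check of the DRF allocation above (writing qA = 3k and qB = qC = k, as the equalize-DS condition requires, and growing k while both capacity constraints hold):

def drf_allocation(cores=18, memory=36):
    k = 0
    # CPU use: qA + 3qB + 3qC = 9k; memory use: 2qA + qB + qC = 8k.
    while 9 * (k + 1) <= cores and 8 * (k + 1) <= memory:
        k += 1
    return 3 * k, k, k   # concurrent tasks of A, B, C

print(drf_allocation())   # (6, 2, 2)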
Packing efficiency does not achieve everything
Achieving packing efficiency does not necessarily improve job completion time.
Machines 1, 2: [2 Cores, 4 GB] each
Jobs: [Task Profile], # tasks
A: [2 Cores, 3 GB], 6 tasks
B: [1 Core, 2 GB], 2 tasks

Pack: 2 A tasks per slot at t, 2t, 3t (4 cores, 6 GB per slot), then the 2 B tasks at 4t (2 cores, 4 GB). Durations: A: 3t, B: 4t.
No Pack: the 2 B tasks at t (2 cores, 4 GB), then 2 A tasks per slot at 2t, 3t, 4t (4 cores, 6 GB per slot). Durations: A: 4t, B: t.
29% improvement in average job completion time from the non-packing schedule.
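The 29% figure follows from the average of the two jobs’ completion times in each schedule:

pack    = (3 + 4) / 2        # A finishes at 3t, B at 4t  -> 3.5 t
no_pack = (4 + 1) / 2        # A finishes at 4t, B at t   -> 2.5 t
print(1 - no_pack / pack)    # ~0.29, i.e., 29% lower average completion time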
Ingestion / evacuation
Other cluster activities produce background traffic:
 ingestion = storing incoming data for later analytics
   e.g., some clusters report volumes of up to 10 TB per hour
 evacuation = data evacuated and re-replicated before maintenance operations
   e.g., rack decommission for machine re-imaging
Tetris uses the Resource Tracker’s reports to avoid contention between its tasks and these activities.
Workload analysis
Alternative Packing Heuristics
Fairness vs. Efficiency
Virtual Machine Packing != Tetris
Virtual machine packing consolidates VMs, with multi-dimensional resource requirements, onto the fewest number of servers.
But it focuses on different challenges, not on task packing:
 balance load across servers
 ensure VM availability in spite of failures
 allow for quick software and hardware updates
 there is NO entity corresponding to a job, so job completion time is inexpressible
 explicit resource requirements (e.g., a “small” VM) make VM packing simpler
Barrier knob, b ∈ [0, 1)
Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a b fraction of that stage’s tasks have finished.
 b = 1: no tasks are preferentially treated
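A minimal sketch of this gate, assuming each stage exposes hypothetical finished, total, and precedes_barrier fields:

def barrier_preference(stage, b):
    # A stage that precedes a barrier gets preferential treatment only once a
    # b fraction of its tasks have finished.
    return stage["precedes_barrier"] and stage["finished"] >= b * stage["total"]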
Starvation Prevention
Could it take a long time to accommodate large tasks?
But …
1. most tasks have demands within one order of magnitude of one another
2. machines report resource availability to the scheduler periodically
 the scheduler learns about all the resources freed by tasks that finished in the preceding period at once => it can make reservations for large tasks
Cluster load vs. Tetris performance