Multi-Resource Packing for Cluster Schedulers
Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, Aditya Akella

Performance of cluster schedulers
We observe that:
- Resources are fragmented, i.e., machines run below capacity.
- Even at 100% usage, goodput is much lower because resources are over-allocated.
- Even Pareto-efficient multi-resource fair schemes leave much lower performance.
Tetris: up to 40% improvement in makespan (the time to finish a set of jobs) and job completion time, with near-perfect fairness.

Findings from analysis of Bing and Facebook traces
Diversity in multi-resource requirements:
- Tasks need varying amounts of each resource.
- Demands for different resources are only weakly correlated.
- Multiple resources become tight.
This matters because there is no single bottleneck resource: for example, there is enough cross-rack network bandwidth to use all CPU cores.
Upper bound on potential gains:
- reduce makespan by up to 49%
- reduce average job completion time by up to 46%

Why so bad, #1
Production schedulers neither pack tasks nor consider all of their relevant resource demands.
1. Resource fragmentation
2. Over-allocation

Resource fragmentation (RF)
Example: two machines, A and B, each with 4 GB of memory; tasks T1 (2 GB), T2 (2 GB), T3 (4 GB).
- Current schedulers allocate resources in terms of slots: T1 and T2 occupy the two machines and T3 must wait, even though 4 GB is free in total, because the free resources are split across machines and cannot be assigned to it. Average task completion time = 1.33 t.
- A "packer" scheduler places T1 and T2 together on machine A and T3 on machine B. Average task completion time = 1 t.
Resource fragmentation increases with the number of resources being allocated.

Over-allocation
Example: machine A with 4 GB of memory and 20 MB/s of network; tasks T1 (2 GB, 20 MB/s), T2 (2 GB, 20 MB/s), T3 (2 GB).
- In current schedulers, not all task resource demands are explicitly allocated: disk and network are over-allocated, so T1 and T2 contend for the 20 MB/s link when co-scheduled. Average task completion time = 2.33 t.
- A "packer" scheduler that accounts for the network demand co-schedules T1 with T3 and then runs T2. Average task completion time = 1.33 t.

Why so bad, #2
Multi-resource fairness schemes do not help either. A packer scheduler vs. DRF: average job completion time improves by 50%, makespan by 33%.
- Work conserving != no fragmentation or over-allocation.
- Pareto efficient != performant (Pareto: no job can increase its share without decreasing the share of another).
- They treat the cluster as one big bag of resources, which hides the impact of resource fragmentation.
- They assume a job has a fixed resource profile, but different tasks in the same job have different demands, and the schedule itself shapes a job's current resource profile; the scheduler can create complementary profiles.

Current schedulers
1. Resource fragmentation
2. Over-allocation
3. Fair allocations sacrifice performance
Competing objectives: cluster efficiency vs. job completion time vs. fairness.

#1 Pack tasks along multiple resources to improve cluster efficiency and reduce makespan

Theory
Multi-resource packing of tasks is similar to multi-dimensional bin packing, which is APX-hard (APX-hard is a strict subset of NP-hard). Balls correspond to tasks; bins correspond to machines over time. Avoiding fragmentation looks like tight bin packing: reducing the number of bins used reduces makespan.
Practice
Existing heuristics do not directly apply here:
- They assume balls of a fixed size, but task demands vary with time and with the machine they are placed on, and are elastic.
- They assume balls are known a priori, but the scheduler must cope with online arrival of jobs, dependencies, and other cluster activity.

#1 Packing heuristic
Fit: a task's resource demand vector must fit within the machine's free resource vector.
Alignment score (A): score each (task, machine) pair and place the tasks with the highest alignment. "A" works because:
1. The fit check ensures no over-allocation.
2. Bigger balls get bigger scores.
3. Abundant resources are used first, which reduces resource fragmentation.
4. Load can still be spread across machines.
A code sketch of the fit check and alignment score follows.
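To make the fit check and alignment score concrete, here is a minimal sketch (not the paper's implementation). It assumes, as in the Tetris paper, that A is the dot product of the task's peak-demand vector and the machine's free-resource vector; the function names, resource keys, and example values are illustrative.

```python
from typing import Dict, List, Optional

Resources = Dict[str, float]  # e.g. {"cpu": cores, "mem": GB, "net": MB/s}

def fits(demand: Resources, free: Resources) -> bool:
    """Fit check: place a task only if every demanded resource is available
    on the machine; this is what rules out over-allocation."""
    return all(demand.get(r, 0.0) <= free.get(r, 0.0) for r in demand)

def alignment_score(demand: Resources, free: Resources) -> float:
    """Alignment score A: dot product of the task's demand vector and the
    machine's free-resource vector (in practice both would be normalized,
    e.g. as fractions of machine capacity, so units are comparable)."""
    return sum(v * free.get(r, 0.0) for r, v in demand.items())

def best_task_for_machine(tasks: List[Resources], free: Resources) -> Optional[int]:
    """Among the tasks that fit, return the index of the highest-scoring one."""
    best_idx, best_score = None, float("-inf")
    for i, demand in enumerate(tasks):
        if not fits(demand, free):
            continue
        score = alignment_score(demand, free)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Illustrative use: the 16 GB task does not fit; of the rest, the task whose
# demands line up with the machine's abundant free resources wins.
free = {"cpu": 4.0, "mem": 8.0, "net": 20.0}
tasks = [{"cpu": 2.0, "mem": 2.0, "net": 20.0},
         {"cpu": 1.0, "mem": 6.0},
         {"cpu": 4.0, "mem": 16.0}]
print(best_task_for_machine(tasks, free))  # -> 0
```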
#2 Faster average job completion time

A job completion time heuristic
Challenge: what is a job's shortest "remaining time"? Remaining work depends on the remaining number of tasks, the tasks' durations, and the tasks' resource demands.
The heuristic gives a score P to every job. It extends Shortest Remaining Time First (SRTF), which schedules jobs in ascending order of their remaining time, to incorporate multiple resources. (SRTF: M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS'99].)

Packing efficiency vs. completion time
Challenge: the two goals conflict. A alone can delay job completion time; P alone loses packing efficiency. So combine the A and P scores:
1: among J runnable jobs
2:   for each job j, let t be the task in j with the highest alignment among those with demand(t) <= R (the resources free)
3:   score(j) = A(t, R) + P(j)
4: pick j*, t* = argmax score(j)

#3 Achieve performance and fairness

Fairness heuristic
Performance and fairness do not mix well in general. But we can get "perfect fairness" and much better performance.
- A says: "task i should go here to improve packing efficiency."
- P says: "schedule job j next to improve job completion time."
- Fairness says: "this set of jobs should be scheduled next."
There is typically a feasible solution that satisfies all of them.

Fairness knob
Fairness is not a tight constraint: lose a bit of fairness for a lot of gain in performance, and aim for long-term rather than short-term fairness.
Fairness knob F in [0, 1):
- F = 0: most efficient scheduling.
- F -> 1: close to perfect fairness.
Pick the best-for-performance task from among the (1 - F) fraction of jobs furthest from their fair share. A sketch of this combined loop follows.
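Below is a minimal sketch of the combined loop, under simplifying assumptions: P is approximated as negative total remaining work (so jobs with less remaining work score higher), "distance from fair share" is a precomputed per-job deficit, and A and P are added with equal weight. All class and function names (Job, schedule_next, fair_share_deficit, ...) are illustrative, not Tetris APIs.

```python
from typing import Dict, List, Optional, Tuple

Resources = Dict[str, float]

def fits(demand: Resources, free: Resources) -> bool:
    return all(demand.get(r, 0.0) <= free.get(r, 0.0) for r in demand)

def alignment(demand: Resources, free: Resources) -> float:
    return sum(v * free.get(r, 0.0) for r, v in demand.items())

class Job:
    def __init__(self, name: str, pending: List[Resources],
                 durations: List[float], fair_share_deficit: float):
        self.name = name
        self.pending = pending              # demand vectors of runnable tasks
        self.durations = durations          # estimated durations of those tasks
        self.deficit = fair_share_deficit   # how far the job is below its fair share

    def remaining_work(self) -> float:
        # Input to the P-style score: total (duration x demand) still to run.
        return sum(d * sum(dem.values()) for d, dem in zip(self.durations, self.pending))

def schedule_next(jobs: List[Job], free: Resources,
                  fairness_knob: float) -> Optional[Tuple[Job, int]]:
    """Pick the next (job, task index) to launch on a machine with `free` resources.
    Fairness knob F in [0, 1): consider only the (1 - F) fraction of jobs
    furthest from their fair share, then pick the best-for-performance choice."""
    by_deficit = sorted(jobs, key=lambda j: j.deficit, reverse=True)
    k = max(1, int(len(jobs) * (1.0 - fairness_knob)))
    candidates = by_deficit[:k]

    best, best_score = None, float("-inf")
    for job in candidates:
        p = -job.remaining_work()           # prefer jobs with less remaining work (SRTF-like)
        for i, demand in enumerate(job.pending):
            if not fits(demand, free):      # fit check prevents over-allocation
                continue
            score = alignment(demand, free) + p   # combined A + P score
            if score > best_score:
                best, best_score = (job, i), score
    return best
```

With fairness_knob = 0 every runnable job is a candidate (the most efficient schedule); as it approaches 1, only the jobs furthest below their fair share are considered, approximating perfect fairness over the long term.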
Putting it all together
We saw: packing efficiency, preferring jobs with small remaining work, and a fairness knob.
YARN architecture, with the changes that add Tetris (shown in orange on the slide):
- Job Manager: multi-resource asks and a barrier hint; asks, offers, and allocations flow between it and the resource manager.
- Node Manager: tracks resource usage, enforces allocations, and sends resource availability reports.
- Cluster-wide Resource Manager: new logic to match tasks to machines (+packing, +SRTF, +fairness).
Other things in the paper: estimating task demands, dealing with inaccuracies and barriers, ingestion/evacuation.

Evaluation
Implemented as a pluggable scheduler in YARN 2.4. Deployed on a 250-machine cluster; replayed Bing and Facebook traces.

Efficiency
[Figure: CPU, memory, network-in, and storage utilization over time, plus over-allocation, for Tetris vs. the Capacity Scheduler; lower utilization for the same work indicates higher resource fragmentation.]
Tetris vs.             Makespan    Avg. Job Compl. Time
Capacity Scheduler     29%         30%
DRF                    28%         35%
The gains come from avoiding fragmentation and avoiding over-allocation.

Fairness
The fairness knob quantifies the extent to which Tetris adheres to fair allocation.
                          Makespan    Job Compl. Time    Avg. Slowdown [over impacted jobs]
No fairness (F = 0)       50%         40%                25%
F = 0.25                  25%         35%                5%
Full fairness (F -> 1)    10%         23%                2%

Summary
- Pack efficiently along multiple resources.
- Prefer jobs with less "remaining work".
- Incorporate fairness: combining heuristics that improve packing efficiency with those that lower average job completion time, while achieving the desired amount of fairness, can coexist with improving cluster performance.
Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results. We are working towards a YARN check-in.
http://research.microsoft.com/en-us/UM/redmond/projects/tetris/

Backup slides

Estimating resource demands
Peak usage demands are estimated from:
- the size and location of a task's inputs
- finished tasks in the same phase
- statistics collected from recurring jobs
Placement impacts network and disk requirements.
[Figure: Resource Tracker view of one machine's inbound network usage over time, out of a 1024 MB/s capacity, with peak demand near 850 MB/s; used vs. free bandwidth is tracked.]
Under-utilization: the Resource Tracker reports unused resources and is aware of other cluster activities such as ingestion and evacuation.

Packer scheduler vs. DRF
Cluster: [18 cores, 36 GB memory]. Jobs (task profile, number of tasks):
- A: [1 core, 2 GB], 18 tasks
- B: [3 cores, 1 GB], 6 tasks
- C: [3 cores, 1 GB], 6 tasks
DRF schedule: in each of the intervals ending at t, 2t, and 3t it runs 6 A tasks, 2 B tasks, and 2 C tasks (18 cores, 16 GB used). Durations: A: 3t, B: 3t, C: 3t.
Packer schedule: all 18 A tasks in the first interval (18 cores, 36 GB), then 6 B tasks, then 6 C tasks (18 cores, 6 GB each). Durations: A: t, B: 2t, C: 3t -- a 33% improvement in average job completion time.
Dominant Resource Fairness (DRF) computes the dominant share (DS) of every user and seeks to maximize the minimum DS across all users:
  maximize the allocations (q_A, q_B, q_C)
  subject to  q_A + 3 q_B + 3 q_C <= 18   (CPU constraint)
              2 q_A + q_B + q_C <= 36     (memory constraint)
              q_A / 18 = q_B / 6 = q_C / 6  (equalize DS)
which gives q_A = 6, q_B = q_C = 2 and a dominant share DS = 1/3 for every job. A worked check of this allocation follows.
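As a sanity check on the numbers above, here is a small illustrative script that derives the DRF allocation by equalizing dominant shares and compares the average job completion times of the two schedules.

```python
# Cluster capacity and per-task demands from the example above.
CORES, MEM = 18, 36
A = {"cpu": 1, "mem": 2}   # 18 tasks
B = {"cpu": 3, "mem": 1}   # 6 tasks
C = {"cpu": 3, "mem": 1}   # 6 tasks

# DRF: equalize dominant shares.  A's dominant resource is memory (2/36 = 1/18
# per task); B's and C's is CPU (3/18 = 1/6 per task).  Equal shares mean
# qA/18 = qB/6 = qC/6, i.e. qA = 3*qB = 3*qC.  CPU is the binding constraint:
# qA + 3*qB + 3*qC = 9*qB <= 18  ->  qB = 2.
qB = qC = 2
qA = 3 * qB
assert qA * A["cpu"] + qB * B["cpu"] + qC * C["cpu"] <= CORES   # 6 + 6 + 6 = 18
assert qA * A["mem"] + qB * B["mem"] + qC * C["mem"] <= MEM     # 12 + 2 + 2 = 16
dominant_share = qA * A["mem"] / MEM                             # = 1/3
print(f"DRF runs {qA} A, {qB} B, {qC} C tasks per interval; DS = {dominant_share:.2f}")

# Completion times (in units of t).
drf_avg = (3 + 3 + 3) / 3       # A, B, C each need 3 intervals under DRF
packer_avg = (1 + 2 + 3) / 3    # packer: all of A, then B, then C
print(f"Avg completion: DRF = {drf_avg}t, packer = {packer_avg}t, "
      f"improvement = {(drf_avg - packer_avg) / drf_avg:.0%}")   # ~33%
```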
Packing efficiency does not achieve everything
Achieving packing efficiency does not necessarily improve job completion time.
Example: machines 1 and 2, each [2 cores, 4 GB]. Jobs (task profile, number of tasks):
- A: [2 cores, 3 GB], 6 tasks
- B: [1 core, 2 GB], 2 tasks
Packing schedule: 2 A tasks at a time for the first three intervals, then B's 2 tasks. Durations: A: 3t, B: 4t.
No-packing schedule: B's 2 tasks first, then 2 A tasks at a time. Durations: A: 4t, B: t -- a 29% improvement in average job completion time.

Ingestion / evacuation
Other cluster activities produce background traffic:
- Ingestion = storing incoming data for later analytics; some clusters report volumes of up to 10 TB per hour.
- Evacuation = data evacuated and re-replicated before maintenance operations, e.g. decommissioning a rack to re-image its machines.
Tetris uses the Resource Tracker's reports to avoid contention between its tasks and these activities.

Workload analysis

Alternative packing heuristics

Fairness vs. efficiency

Fairness vs. efficiency

Virtual machine packing != Tetris
Virtual machine packing consolidates VMs, with multi-dimensional resource requirements, onto the fewest number of servers. But it focuses on different challenges, not task packing:
- balancing load across servers
- ensuring VM availability in spite of failures
- allowing for quick software and hardware updates
There is no entity corresponding to a job, so job completion time is inexpressible, and explicit resource requirements (e.g. a "small" VM) make VM packing simpler.

Barrier knob, b in [0, 1)
Tetris gives preference to the last tasks in a stage preceding a barrier: it offers resources to tasks in such a stage once a b fraction of the stage's tasks have finished. With b = 1, no tasks are preferentially treated.

Starvation prevention
Could it take a long time to accommodate large tasks? In practice, no:
1. Most tasks have demands within one order of magnitude of one another.
2. Machines report resource availability to the scheduler periodically, so the scheduler learns about all the resources freed by tasks that finished in the preceding period together; this amounts to a reservation for large tasks. A sketch of this idea follows at the end.

Cluster load vs. Tetris performance
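The starvation-prevention slide only hints at the mechanism, so the following is a speculative illustration of why periodic, batched availability reports help large tasks: resources freed within one reporting period are seen by the scheduler together, which effectively reserves a larger chunk for a pending large task. The class and function names are made up for the example.

```python
# Illustration only (not Tetris code): a node accumulates resources freed by
# finishing tasks and reports them in one periodic heartbeat, so a large
# pending task can claim the combined amount instead of being repeatedly
# beaten to small slivers by small tasks.
from typing import Dict

Resources = Dict[str, float]

class Node:
    def __init__(self) -> None:
        self.freed_since_last_report: Resources = {}

    def task_finished(self, released: Resources) -> None:
        # Freed resources accumulate locally until the next heartbeat.
        for r, v in released.items():
            self.freed_since_last_report[r] = self.freed_since_last_report.get(r, 0.0) + v

    def report(self) -> Resources:
        # Periodic report: everything freed in the preceding period, together.
        snapshot, self.freed_since_last_report = self.freed_since_last_report, {}
        return snapshot

def can_place(large_task: Resources, reported_free: Resources) -> bool:
    return all(reported_free.get(r, 0.0) >= v for r, v in large_task.items())

node = Node()
node.task_finished({"cpu": 2, "mem": 4})
node.task_finished({"cpu": 2, "mem": 4})
node.task_finished({"cpu": 4, "mem": 8})
# One heartbeat later the scheduler sees 8 cores / 16 GB at once, enough for a
# large task that no single small release would have accommodated.
print(can_place({"cpu": 8, "mem": 16}, node.report()))  # True
```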