The Case for Tiny Tasks in Compute Clusters

Kay Ousterhout*, Aurojit Panda*, Joshua Rosen*, Shivaram Venkataraman*, Reynold Xin*, Sylvia Ratnasamy*, Scott Shenker*+, Ion Stoica*
* UC Berkeley, + ICSI

Setting
• Frameworks (MapReduce, Spark, Dryad, …) break each job into tasks
• Tasks run in slots on cluster machines
[Figure: a job's tasks scheduled across 4 slots over time]

Use smaller tasks!
[Figure: today's long tasks vs. tiny tasks on 4 slots over time]

Outline: Why? How? Where?

Problem: Skew and Stragglers
• A contended machine? Data skew?
[Figure: timeline on 4 slots; one straggling task delays job completion]

Benefit: Handling of Skew and Stragglers
• Tiny tasks spread skewed work evenly across slots
• As much as a 5.2x reduction in job completion time!
[Figure: today's tasks vs. tiny tasks on 4 slots over time]

Problem: Batch and Interactive Sharing
• Clusters are forced to trade off utilization and responsiveness!
[Figure: a high-priority interactive job waits behind low-priority batch tasks on 4 slots]

Benefit: Improved Sharing
• High-priority tasks are not subject to long wait times!
[Figure: today's tasks vs. tiny tasks on 4 slots over time]

Benefits: Recap
(1) Straggler mitigation — cf. Mantri (OSDI '10), Scarlett (EuroSys '11), SkewTune (SIGMOD '12), Dolly (NSDI '13), …
(2) Improved sharing — cf. Quincy (SOSP '09), Amoeba (SOCC '12), …

How?

Schedule task
• Scheduling requirements: high throughput (millions of decisions per second), low latency (milliseconds)
• Solution: distributed scheduling (e.g., Sparrow)

Launch task
• Use an existing thread pool to launch tasks
• Cache task binaries
• Task launch = RPC time (<1 ms)

Read input data
• Smallest efficient file block size: 8 MB
• Distribute metadata (à la Flat Datacenter Storage, OSDI '12)

Execute task (+ read data for the next task)
• Tons of tiny transfers!
• Framework enables optimizations, e.g., …

How low can you go?
• 8 MB disk block: read input data + execute task ≈ 100's of milliseconds

Where?
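The distributed-scheduling approach named above (Sparrow) places a job's tasks by probing a small random sample of workers, as the backup slides describe: place m tasks on the least loaded of d·m probed slaves. A minimal sketch of that batch-sampling idea, in Python; the in-memory queue model and function names are illustrative assumptions, not Sparrow's actual implementation:

```python
import random

def schedule_job(num_tasks, worker_queues, d=2):
    """Sketch of Sparrow-style batch sampling: probe d * m workers,
    then place the job's m tasks on the m least-loaded probed workers."""
    # Probe d * m distinct workers chosen uniformly at random
    # (capped at the cluster size).
    num_probes = min(d * num_tasks, len(worker_queues))
    probed = random.sample(range(len(worker_queues)), num_probes)
    # Sort probed workers by current queue length (the probe response).
    probed.sort(key=lambda w: worker_queues[w])
    # Place one task on each of the m least-loaded probed workers.
    placements = probed[:num_tasks]
    for w in placements:
        worker_queues[w] += 1
    return placements

# Example: 8 workers with pre-existing queue lengths; a 2-task job
# sends 4 probes (d = 2) and lands on the two shortest probed queues.
queues = [3, 0, 5, 1, 4, 0, 2, 6]
print(schedule_job(2, queues))
```

Probing only d·m workers (rather than querying every machine) is what lets many independent schedulers sustain millions of placement decisions per second while keeping per-job latency in the milliseconds.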
Splitting Jobs into Tiny Tasks
[Figure: an original job with map tasks 1..N feeding reduce tasks for keys K1..Kn, alongside its tiny-tasks equivalent where the reduce phase for a single key K1 is split across many tiny tasks]

Splitting Large Tasks
• Aggregation trees – work for functions that are associative and commutative
• Framework-managed temporary state store
• Ultimately, need to allow a small number of large tasks

Conclusion
• Tiny tasks mitigate stragglers + improve sharing
• Techniques: distributed scheduling, launching tasks in an existing thread pool, distributed file metadata, pipelined task execution
• Questions? Find me or Shivaram.

Backup Slides

Benefit of Eliminating Stragglers
• Based on a Facebook trace
• 5.2x at the 95th percentile!
[Figure: CDF of improvement in job completion time (1x–10x), broken down by original job size: 1-9 tasks, 10-99 tasks, 100+ tasks]

Why Not Preemption?
• Preemption only handles sharing (not stragglers)
• Task migration is time consuming
• Tiny tasks improve fault tolerance

Dremel/Drill/Impala
• Similar goals and challenges (supporting short tasks)
• Dremel statically assigns tablets to machines; rebalances if the query dispatcher notices that a machine is processing a tablet slowly (no standard straggler mitigation)
• Most jobs expected to be interactive

Scheduling Throughput
• 10,000 machines × 16 cores/machine, 100-millisecond tasks
• Over 1 million task scheduling decisions per second

Sparrow: Technique
• Place m tasks on the least loaded of d·m slaves
• Example: a job of m = 2 tasks sends 4 probes (d = 2) from its scheduler to randomly chosen slaves
• More at tinyurl.com/sparrow-scheduler

Sparrow: Performance on TPC-H Workload
• Within 12% of offline optimal; median queuing delay of 8 ms
• More at tinyurl.com/sparrow-scheduler
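The aggregation-tree approach from "Splitting Large Tasks" can be sketched as follows: a large reduce is replaced by levels of tiny tasks, each combining a small group of partial results. This is only correct when the combine function is associative and commutative, since the tree changes the grouping and order of operations. Function names and the fanout parameter are illustrative assumptions:

```python
from functools import reduce
import operator

def tree_aggregate(values, combine, fanout=4):
    """Aggregate values with a tree of tiny tasks instead of one
    large reduce task. Correct only when `combine` is associative
    and commutative, since the tree regroups and reorders inputs."""
    level = list(values)
    while len(level) > 1:
        # Each group of `fanout` partial results becomes one tiny
        # reduce task; the next level aggregates their outputs.
        level = [reduce(combine, level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# Example: summation is associative and commutative, so the tree
# produces the same result as a single large reduce task.
print(tree_aggregate(range(100), operator.add))  # → 4950
```

For a non-commutative combine (e.g., string concatenation with arbitrary regrouping across skewed partitions), the slide's other options apply: a framework-managed temporary state store, or simply permitting a small number of large tasks.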