Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew Ferguson, Peter Bodík, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca

DATA PARALLEL CLUSTERS
Users of data parallel clusters increasingly demand predictability: recurring jobs feed pipelines that must finish by a deadline.

VARIABLE LATENCY
Across repeated runs on a production cluster, the latency of the same job varied by as much as 4.3x.
Why does latency vary?
1. Pipeline complexity
2. Noisy execution environment

Cosmos: MICROSOFT'S DATA PARALLEL CLUSTERS
• CosmosStore – distributed storage
• Dryad – DAG execution engine
• SCOPE – SQL-like query language

DRYAD'S DAG WORKFLOW
Pipelines of jobs run on the Cosmos cluster, with deadlines at each hand-off between jobs. Each job is a DAG of stages, and each stage is a set of parallel tasks.

EXPRESSING PERFORMANCE TARGETS
• Priorities? Not expressive enough.
• Weights? Difficult for users to set.
• Utility curves? Capture both the deadline and the penalty for missing it.

OUR GOAL
Maximize utility while minimizing resources, by dynamically adjusting the allocation.

Jockey
Jockey's setting:
• Large clusters
• Many users
• Prior executions of the same job are available

JOCKEY – MODEL
f(job state, allocation) → remaining run time

JOCKEY – CONTROL LOOP
A control loop periodically re-evaluates the model against the job's deadline and adjusts the job's resource allocation.

JOCKEY – MODEL
The full job state is too detailed to model directly, so Jockey collapses it into a scalar progress indicator:
f(progress, allocation) → remaining run time

JOCKEY – PROGRESS INDICATOR
The indicator is built from simple per-stage statistics:
• the total time the stage's tasks have spent running,
• plus the total time they have spent queuing,
• weighted by the fraction of the stage's tasks that have completed.
Summing these per-stage values (Stage 1 + Stage 2 + Stage 3 + …) yields a single number describing how far the job has progressed.
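A minimal sketch of an indicator of this shape, in Python. The deck does not spell out the exact normalization, so the StageStats fields and the weighting below are assumptions rather than the authors' implementation:

```python
# Sketch of a Jockey-style progress indicator (assumed form).
from dataclasses import dataclass

@dataclass
class StageStats:
    tasks_total: int        # number of tasks in the stage
    tasks_complete: int     # tasks that have finished
    running_time: float     # total time the stage's tasks have spent running
    queuing_time: float     # total time the stage's tasks have spent queued

def job_progress(stages: list[StageStats]) -> float:
    """Scalar job progress: per-stage (run + queue) time, weighted by
    the fraction of the stage's tasks that have completed."""
    total = 0.0
    for s in stages:
        frac_complete = s.tasks_complete / s.tasks_total
        total += frac_complete * (s.running_time + s.queuing_time)
    return total
```

Everything the indicator needs is already tracked by the Job Manager, so it is cheap to recompute on every iteration of the control loop.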
JOCKEY – CONTROL LOOP
Offline, the model is precomputed into a table mapping (progress, allocation) to predicted remaining run time:

Job progress    10 nodes      20 nodes      30 nodes
1% complete     60 minutes    40 minutes    25 minutes
2% complete     59 minutes    39 minutes    24 minutes
3% complete     58 minutes    37 minutes    22 minutes
4% complete     56 minutes    36 minutes    21 minutes
5% complete     54 minutes    34 minutes    20 minutes

At run time, the control loop reads the row for the job's current completion and picks the smallest allocation predicted to meet the deadline:
• Completion: 1%, deadline: 50 min. → 20 nodes (40 minutes predicted)
• Completion: 3%, deadline: 40 min. → 20 nodes (37 minutes predicted)
• Completion: 5%, deadline: 30 min. → 30 nodes (20 minutes predicted)
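As a sketch, the lookup can be a dictionary keyed by progress. The "smallest feasible allocation" policy below is inferred from the three examples above, and the data layout is illustrative rather than Jockey's actual representation:

```python
# Precomputed model: {progress_percent: {allocation: remaining_minutes}}.
MODEL = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}

def pick_allocation(progress_pct: int, minutes_to_deadline: float) -> int:
    """Smallest allocation whose predicted remaining run time fits within
    the deadline; fall back to the largest allocation otherwise."""
    row = MODEL[progress_pct]
    feasible = [a for a, t in sorted(row.items()) if t <= minutes_to_deadline]
    return feasible[0] if feasible else max(row)

assert pick_allocation(1, 50) == 20   # 40 min <= 50
assert pick_allocation(3, 40) == 20   # 37 min <= 40
assert pick_allocation(5, 30) == 30   # 20 min <= 30
```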
JOCKEY – MODEL
How should f(progress, allocation) → remaining run time be realized? An analytic model? Machine learning? Jockey uses a simulator.

JOCKEY
Problem                Solution
Pipeline complexity    Use a simulator
Noisy environment      Dynamic control

Jockey in Action
• Real job
• Production cluster
• CPU load: ~80%
[Figure: Jockey's allocation over time. Under the initial deadline of 140 minutes the allocation stays modest; when the deadline tightens to 70 minutes, Jockey raises the allocation, then releases resources as its early pessimism proves unnecessary. An "oracle" line marks the total allocation-hours an ideal allocation would use; Jockey sits above the oracle while the job's available parallelism is less than its allocation.]

Evaluation
• Production cluster
• 21 jobs
• Did jobs meet their SLOs?
• What was the impact on the cluster?

Findings:
• Jockey missed only 1 of 94 deadlines.
• The simulator made good predictions: 80% of runs finished before the deadline, at the cost of sometimes allocating too many resources.
• The control loop is stable and successful.
[Figure: CDF of job latency relative to the deadline for jobs which met the SLO; annotation: 1.4x.]

Conclusion
Data parallel jobs are complex, yet users demand deadlines. Jobs run in shared, noisy clusters, making simple models inaccurate. Jockey's answer: a simulator to handle pipeline complexity, and a control loop to handle the noise, together meeting each pipeline's deadlines.

Questions?
Andrew Ferguson – adf@cs.brown.edu

Coauthors
• Peter Bodík (Microsoft Research)
• Srikanth Kandula (Microsoft Research)
• Eric Boutin (Microsoft)
• Rodrigo Fonseca (Brown)

Backup Slides

Utility Curves
• For single jobs, the scale of the utility curve doesn't matter; only the deadline does.
• For multiple jobs, use financial penalties so the curves are comparable across jobs.
[Figure: step utility curve, utility vs. run time, dropping at the deadline.]
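A minimal sketch of such a step utility curve. The shape (flat reward, then a drop at the deadline) follows the slides; the specific reward and penalty values are made up for illustration:

```python
def step_utility(run_time_min: float, deadline_min: float,
                 reward: float = 100.0, penalty: float = -50.0) -> float:
    """Utility of finishing at `run_time_min` against `deadline_min`:
    full reward on time, a financial penalty otherwise."""
    return reward if run_time_min <= deadline_min else penalty
```

Because only the deadline matters for a single job, rescaling this curve changes nothing; the absolute reward and penalty values only matter when trading off multiple jobs against each other.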
Jockey: Resource allocation control loop
[Figure: predicted utility vs. run time around the deadline, illustrating three stabilizing techniques.]
1. Slack
2. Hysteresis
3. Dead zone

Resource sharing in Cosmos
• Resources are allocated with a form of fair sharing across business groups and their jobs (like the Hadoop Fair Scheduler or Capacity Scheduler).
• Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token, which is released when the task completes.
• A token is a guaranteed share of CPU and memory.

Jockey: Progress indicator
• Many features of a job could be used to build a progress indicator.
• Earlier work (ParaTimer) concentrated on the fraction of tasks completed.
• Our indicator is very simple, but we found it performs best for Jockey's needs; it combines the fraction of completed vertices, the total vertex initialization time, and the total vertex run time.

Comparison with ARIA
• ARIA uses analytic models, designed for three stages: Map, Shuffle, Reduce.
• Jockey's control loop is robust thanks to its control-theory improvements.
• ARIA was tested on a small (66-node) cluster without a network bottleneck.
• We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.

Jockey: Latency prediction, C(p, a)
• Event-based simulator
  – Same scheduling logic as the actual Job Manager
  – Captures important features of job progress
  – Does not model input size variation or speculative re-execution of stragglers
  – Inputs: job algebra, distributions of task timings, probabilities of failures, allocation
• Analytic model
  – Inspired by Amdahl's Law: T = S + P/N, where S is the remaining work on the critical path, P is all remaining work, and N is the number of machines

Jockey: Resource allocation control loop
• Executes in Dryad's Job Manager
• Inputs: fraction of completed tasks in each stage, time the job has spent running, the utility function, and precomputed values (for speedup)
• Output: number of tokens to allocate
• Improved with techniques from control theory (the slack, hysteresis, and dead zone above)

Jockey: Architecture
• Offline: a job profile from prior runs drives the simulator, which produces latency predictions.
• During job runtime: the running job reports job stats; the resource allocation control loop combines those stats, the latency predictions, and the utility function to set the job's allocation.
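Putting the pieces together, here is a minimal, self-contained sketch of one iteration of such a control loop. C stands in for the precomputed simulator predictions C(p, a); the loop structure, the dead-zone threshold, and all constants are illustrative assumptions, not the authors' implementation:

```python
# One illustrative iteration of a Jockey-style control loop (assumed form).
def control_step(C, progress: float, current_alloc: int,
                 elapsed_min: float, deadline_min: float,
                 candidates: list[int], dead_zone_min: float = 2.0) -> int:
    """Return the token allocation for the next control interval."""
    budget = deadline_min - elapsed_min   # time left before the deadline
    # Smallest candidate allocation predicted to finish within the budget;
    # if none fits, escalate to the largest available.
    feasible = [a for a in sorted(candidates) if C(progress, a) <= budget]
    target = feasible[0] if feasible else max(candidates)
    # Dead zone (stability): if the current allocation is also feasible and
    # switching buys only a tiny predicted improvement, don't churn.
    if current_alloc in feasible and \
            abs(C(progress, current_alloc) - C(progress, target)) < dead_zone_min:
        return current_alloc
    return target

# Toy prediction function: remaining minutes shrink with progress and
# allocation (illustrative only).
C = lambda p, a: (60 - 100 * p) * 10 / a
print(control_step(C, progress=0.03, current_alloc=10,
                   elapsed_min=10, deadline_min=50,
                   candidates=[10, 20, 30]))   # -> 20
```

The slack, hysteresis, and dead-zone techniques from the backup slides all serve the same purpose: keeping this bare policy from oscillating on noisy predictions.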