Priority Scheduling: An Application for the Permutahedron

Ethan Bolker
UMass-Boston
BMC Software
AMS Toronto meeting
September 24, 2000

Plan
• Brief introduction to queueing theory
• Priority scheduling
• Conservation laws and the permutahedron
• Specifying CPU shares
interesting pictures and open questions

References: www.cs.umb.edu/~eb/goalmode
Acknowledgements: Jeff Buzen, Yiping Ding, Dan Keefe, Oliver Chen, Aaron Ball, Tom Larard

Queueing theory
• Workload: stream of jobs visiting a server (ATM, time shared CPU, printer, …)
• Jobs queue when server is busy
• Input:
  – Arrival rate: λ job/sec
  – Service demand: s sec/job
• Performance metrics:
  – Utilization: u = λs (must be ≤ 1)
  – Response time: r = ???
  – Degradation: d = r/s
  – Queue length: q = λr (Little's law)

Response time computations
• r, d, q measure queueing delay
  r ≥ s (d ≥ 1), unless parallel processing possible
• Randomness really matters
  r = s (d = 1) if arrivals scheduled (best case, no waiting)
  r >> s for bulk arrivals (worst case, maximum delays)
• Theorem. d = 1/(1-u) if arrivals are Poisson and service is exponentially distributed (M/M/1).
  r = s/(1-u) (think virtual server with speed 1-u)
  q = u/(1-u) (convention: job in service is on queue)

M/M/1
• Essential nonlinearity often counterintuitive
  – at u = 90% average queue length is 0.9/(1-0.9) = 9,
  – average response time is s/(1-0.9) = 10s,
  – but 1 customer in 10 has no wait at all (10% idle time)
• A useful guide even when hypotheses fail
  – accurate enough (± 20%) for real computer systems
  – d depends only on u: many small jobs have same impact as few large jobs
  – faster system ⇒ smaller s ⇒ smaller u ⇒ r = s/(1-u) double win: less service, less wait
  – waiting costly, server cheap (telephones): want u near 0
  – server costly (doctors): want u near 1, but scheduled

Multiple Job Streams
• Multiple workloads, utilizations u1, u2, …
• U = Σ ui < 1
  All degradations equal: di = 1/(1-U)
• Suppose priority scheduling possible
  Study degradation vector V = (d1, d2, …)

Priority Scheduling
• Priority state: order workloads by priority (ties OK)
  – two workloads, 3 states: 12, 21, [12]
  – three workloads, 13 states:
      123    (6 = 3! of these ordered states),
      [12]3  (3 of these),
      1[23]  (3 of these),
      [123]  (1 state with no priorities)
  – n wkls, f(n) states, n! ordered (simplex lock combos)
• p(s) = prob(state = s) = fraction of time in state s
• V(s) = degradation vector when state = s (measure this, or compute it using queueing theory)
• V = Σ_s p(s) V(s) (time avg is convex combination)
• Achievable region is convex hull of vectors V(s)

Two workloads
[Figure: axes d1 and d2 with the line d1 = d2; points V(12) (wkl 1 high prio), V([12]) (no priorities) and V(21); the achievable region is marked. Note: u1 < u2, so wkl 2 effect on wkl 1 large.]

Conservation
• No Free Lunch Theorem.
  Weighted average degradation is constant, independent of priority scheduling scheme:
  Σ_i (ui/U) di = 1/(1-U)
• Provable from some hypotheses
• Observable in some real systems
• Sometimes false: shortest job first minimizes average response time (printer queues, supermarket express checkout lines)

Conservation
• For any proper set A of workloads
  Imagine giving those workloads top priority. Then can pretend other wkls don't exist. In that case
  Σ_{i∈A} (ui/U(A)) di = 1/(1-U(A))
  When wkls in A have lower priorities they have higher degradations, so in general
  Σ_{i∈A} (ui/U(A)) di ≥ 1/(1-U(A))
• These 2^n - 2 linear inequalities determine the convex achievable region R
• R is a permutahedron: only n! vertices

Two workload permutahedron
[Figure: axes d1 and d2; the line u1d1 + u2d2 = U/(1-U); the constraints d1 ≥ 1/(1-u1) and d2 ≥ 1/(1-u2); points V(12) and V(21) with the achievable region between them.]

Three workload permutahedron
[Figure: axes d1, d2, d3; the plane u1d1 + u2d2 + u3d3 = U/(1-U); vertices V(213) and V(123) labeled.]

Experimental evidence
[Figure]

Four workload permutahedron
4! = 24 vertices (ordered states)
2^4 - 2 = 14 facets (proper subsets) (conservation constraints)
74 faces (states)
Simplicial geometry and transportation polytopes, Trans. Amer. Math. Soc. 217 (1976) 138.
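The vertex and facet counts can be checked numerically. The sketch below is not from the talk: it assumes that at the vertex for an ordered state the conservation law holds with equality for the top k workloads, for every k, and solves those equations one workload at a time. The names `vertex`, `conservation_slack` and `proper_subsets`, and the utilizations, are made up for illustration.

```python
from itertools import combinations, permutations


def vertex(order, u):
    """Degradation vector V(s) for an ordered state.

    `order` lists workload indices from highest to lowest priority and
    `u` holds the utilizations. Assumption (not written as a formula on
    the slides): at V(s) the conservation law holds with equality for the
    top k workloads, for every k. Solving those equations in turn gives
    d = 1 / ((1 - U_before) * (1 - U_after)).
    """
    d = [0.0] * len(u)
    used = 0.0  # utilization of all strictly higher priority workloads
    for w in order:
        before, used = used, used + u[w]
        d[w] = 1.0 / ((1.0 - before) * (1.0 - used))
    return d


def conservation_slack(d, u, subset):
    """sum_{i in A} u_i d_i - U(A)/(1 - U(A)); nonnegative inside the region."""
    ua = sum(u[i] for i in subset)
    return sum(u[i] * d[i] for i in subset) - ua / (1.0 - ua)


def proper_subsets(n):
    """All nonempty proper subsets of {0, ..., n-1}: 2**n - 2 of them."""
    return [a for k in range(1, n) for a in combinations(range(n), k)]


u = [0.1, 0.2, 0.3, 0.2]  # hypothetical utilizations, U = 0.8
vertices = [vertex(p, u) for p in permutations(range(len(u)))]
facets = proper_subsets(len(u))
print(len(vertices), "vertices,", len(facets), "facet constraints")
```

For four workloads this enumerates 24 ordered states and 14 proper subsets, matching the counts on the slide. Calling `conservation_slack` on each vertex is one way to confirm that every vertex lies on the conservation hyperplane and violates none of the subset inequalities.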
Scheduling for performance
• Administrator specifies performance goals
  – desired degradations (IBM OS/390) (not today)
  – CPU shares (UNIX offerings from HP, IBM, Sun)
• Operating system dispatches jobs in an attempt to meet goals
• Model predicts degradations by constructing map
  workload performance goals → permutahedron

Specifying CPU shares
• Administrator specifies workload CPU shares
• Share f (0 < f < 1) means workload guaranteed fraction f of CPU when at least one of its jobs is queued for service, can get more if some competition is absent
• share ≠ utilization
• share ≠ cap
• share should be renamed guarantee

Map shares to degradations - two workloads
• Suppose f1 and f2 > 0, f1 + f2 = 1
• Model: System operates in state
  – 12 with probability f1
  – 21 with probability f2
  (independent of who is on queue)
• Average degradation vector: V = f1 V(12) + f2 V(21)

Model validation
[Figures: model validation, two slides]

Map shares to degradations - three (n) workloads
prob(123) = f1 f2 f3 / [(f1 + f2 + f3) (f2 + f3) (f3)]
• Theorem: These n! probabilities sum to 1
  – interesting identity generalizing adding fractions
  – prove by induction, or by coupon collecting
• V = Σ_{ordered states s} prob(s) V(s)
• O(n!), Ω(n!), good enough for n ≤ 9 (12)
• Searching for fast (approximate) algorithm ...
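The share-to-probability map is short enough to write out. The following is a minimal sketch of the formula above for general n, with a brute-force sum over all n! ordered states; `state_probability` and `mean_degradation` are illustrative names, and the function `V` that supplies each state's degradation vector is assumed to come from measurement or from a queueing model.

```python
from itertools import permutations


def state_probability(order, f):
    """Probability of the ordered state `order` (highest priority first).

    Generalizes prob(123) = f1 f2 f3 / ((f1+f2+f3)(f2+f3)(f3)): the k-th
    factor is the share of the k-th workload divided by the total share
    of the workloads not yet placed.
    """
    p = 1.0
    for k, w in enumerate(order):
        p *= f[w] / sum(f[v] for v in order[k:])
    return p


def mean_degradation(f, V):
    """V = sum over ordered states s of prob(s) * V(s).

    `V` is a caller-supplied function (hypothetical here) returning the
    degradation vector of an ordered state.
    """
    n = len(f)
    total = [0.0] * n
    for order in permutations(range(n)):
        p = state_probability(order, f)
        for i, d in enumerate(V(order)):
            total[i] += p * d
    return total


shares = [0.5, 0.3, 0.2]  # hypothetical shares summing to 1
probs = {s: state_probability(s, shares) for s in permutations(range(3))}
print(sum(probs.values()))  # close to 1, up to rounding
```

With two workloads the formula reduces to prob(12) = f1 and prob(21) = f2, the two-workload model. The loop in `mean_degradation` visits all n! states, which is why this direct approach is only practical for small n.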
Model validation
[Figures: model validation, two slides]

Map shares to degradations (geometry)
• Interpret shares as barycentric coordinates in the n-1 simplex
• Study the geometry of the map from the simplex to the n-1 dimensional permutahedron
• Easy when n=2: each is a line segment and map is linear

Mapping a triangle to a hexagon
[Figure, three slides: the share triangle, labeled f1 = 1, f1 = 0 and f3 = 1, and the map M to the hexagon with vertices 132, 312, 321, 123, 213, 231; annotations "wkl 1 high priority" and "wkl 1 low priority"; the second slide labels f1 = 0, f1 = 1 and {23}.]

Implementing fair share scheduling
• Actual Sun/Solaris implementation is subtle
• HP and IBM are black boxes (for me)
• Stochastic solution: randomly choose queued job to dispatch (implement the model rather than model an implementation)
• May require prior computation of priodist(w, p) = prob(wkl w runs at prio p)
• workload priority probabilities, not state probabilities

Priority distributions
• Given degradations, compute a priodist
• A priodist is an n×n matrix with row sums 1
• {priodists} = cartesian product of n (n-1)-simplices
  priodist space (dim n(n-1)) → permutahedron (dim n-1)
• Map is surjective, not injective
• Look for a well behaved inverse image

Three workload permutahedron
[Figure: hexagon drawn against axes d1 and d2, with the lines d1 = d2, d2 = d3 and d1 = d3; labels 312, 132, 123, 213, 231, 321 for the ordered states, [13]2, 1[23], [12]3, 2[13], [23]1, 3[12] for the states with one tie, and [123].]

… dissected into 3! quadrilaterals
[Figure: the quadrilateral with corners 123, 1[23], [123] and [12]3, with axes d1, d2 and the lines d1 = d2 and d2 = d3.]

… each mapped to from a skew quadrilateral of priodists

  P123       P1[23]        P[12]3        P[123]
  1 0 0      1  0  0       .5 .5 0       .33 .33 .33
  0 1 0      0 .5 .5       .5 .5 0       .33 .33 .33
  0 0 1      0 .5 .5        0  0 1       .33 .33 .33

(x,y) ↦ xy P123 + x(1-y) P1[23] + (1-x)y P[12]3 + (1-x)(1-y) P[123]
      ↦ degradation vector in this corner of permutahedron

Skew quadrilaterals
• Given 4 points P00, P01, P10, P11 ∈ R^m, map unit square:
  (x,y) ↦ xy P00 + x(1-y) P01 + (1-x)y P10 + (1-x)(1-y) P11
• Easy to generalize to 2^k points
• Analogous to convex hull, which maps barycentric coordinates on a simplex
• Reference for this construction?
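The skew quadrilateral map is a weighted average of four corner points with weights that sum to 1. The sketch below applies it to the four priodists for the 123 corner; the function name `skew_quad` and the sample point (0.25, 0.5) are made up for illustration, and 1/3 is used where the slide shows the rounded .33 so that row sums come out to 1.

```python
def skew_quad(x, y, p00, p01, p10, p11):
    """Map (x, y) in the unit square to
    x*y*P00 + x*(1-y)*P01 + (1-x)*y*P10 + (1-x)*(1-y)*P11,
    where each corner is a matrix given as a list of rows."""
    weights = (x * y, x * (1 - y), (1 - x) * y, (1 - x) * (1 - y))
    corners = (p00, p01, p10, p11)
    return [
        [sum(w * p[r][c] for w, p in zip(weights, corners))
         for c in range(len(p00[0]))]
        for r in range(len(p00))
    ]


third = 1.0 / 3.0  # the slide rounds this to .33
P123 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
P1_23 = [[1, 0, 0], [0, 0.5, 0.5], [0, 0.5, 0.5]]
P12_3 = [[0.5, 0.5, 0], [0.5, 0.5, 0], [0, 0, 1]]
P_123 = [[third] * 3 for _ in range(3)]

priodist = skew_quad(0.25, 0.5, P123, P1_23, P12_3, P_123)
for row in priodist:
    print([round(v, 4) for v in row], "row sum:", round(sum(row), 4))
```

Because the four weights are nonnegative and add up to 1, every image point is a convex combination of the corner priodists, so each row still sums to 1. The corners of the unit square map to the four corner matrices themselves.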
Inversion
Try to locate * = (d1, d2) on coordinate grid
[Figure: axes d1, d2]

Sequential bisection
[Figure, five slides: axes d1, d2]

… may fail to converge
[Figure: axes d1, d2]

Tempered sequential bisection
[Figure, three slides: axes d1, d2, successive points marked o]
prove that this converges...