Fair Share Scheduling

advertisement
Fair Share Scheduling
Ethan Bolker
Mathematics & Computer Science
UMass Boston
eb@cs.umb.edu
www.cs.umb.edu/~eb
Queen’s University
March 23, 2001
References
• www.bmc.com/patrol/fairshare
• www/cs.umb.edu/~eb/goalmode
Acknowledgements
•
•
•
•
•
Yiping Ding
Jeff Buzen
Dan Keefe
Oliver Chen
Chris Thornley
•
•
•
•
Aaron Ball
Tom Larard
Anatoliy Rikun
Liying Song
2
Coming Attractions
•
•
•
•
Queueing theory primer
Fair share semantics
Priority scheduling; conservation laws
Predicting response times from shares
– analytic formula
– experimental validation
– applet simulation
• Implementation geometry
3
Transaction Workload
• Stream of jobs visiting a server
(ATM, time shared CPU, printer, …)
• Jobs queue when server is busy
• Input:
– Arrival rate:
– Service demand:
 job/sec
s sec/job
• Performance metrics:
– server utilization:
– response time:
– degradation:
u = s (must be  1)
r = ??? sec/job (average)
d = r/s
4
Response time computations
• r, d measure queueing delay
r  s (d  1), unless parallel processing possible
• Randomness really matters
r = s (d = 1) if arrivals scheduled (best case, no waiting)
r >> s for bulk arrivals (worst case, maximum delays)
• Theorem. If arrivals are Poisson and service is
exponentially distributed (M/M/1) then
d = 1/(1- u)
r = s/(1- u)
• Think: virtual server with speed 1-u
5
M/M/1
• Essential nonlinearity often counterintuitive
– at u = 95% average degradation is 1/(1-0.95) = 20,
– but 1 customer in 20 has no wait at all (5% idle time)
• A useful guide even when hypotheses fail
– accurate enough ( 30%) for real computer systems
– d depends only on u: many small jobs have same
impact as few large jobs
– faster system  smaller s  smaller u
r = s/(1-u)  double win: less service, less wait
– waiting costly, server cheap (telephones): want u  0
– server costly (doctors): want u  1 but scheduled
6
Scheduling for Performance
• Customers want good response times
• Decreasing u is expensive
• High end Unix offerings from HP, IBM, Sun
offer fair share scheduling packages that allow
an administrator to allocate scarce resources
(CPU, processes, bandwidth) among workloads
• How do these packages behave?
• Model as a black box, independent of internals
• Limit study to CPU shares on a uniprocessor
7
Multiple Job Streams
• Multiple workloads, utilizations u1, u2, …
• U =  ui < 1
• If
no workload prioritization
then
all degradations are equal:
di = 1/(1-U)
• Share allocations are de facto prioritizations
• Study degradation vector V = (d1, d2, …)
8
Share Semantics
• Suppose workload w has CPU share fw
• Normalize shares so that w fw = 1
• w gets fraction fw of CPU time slices when at
least one of its jobs is ready for service
• Can it use more if competing workloads idle?
No : think
share = cap
Yes : think
share = guarantee
9
Shares As Caps
•
•
•
•
Good for accounting (sell fraction of web server)
Available now from IBM, HP, soon from Sun
Straightforward (boring) - workloads are isolated
Each runs on a virtual processor with speed *= f
share f
dedicated system
utilization
u
u/f need f > u !
response time
r
r(1  u)/(f  u)
10
Shares As Guarantees
• Good for performance + economy
(use otherwise idle resources)
• Shares make a difference only when there are
multiple workloads
• Large share resembles high priority:
share may be less than utilization
• Workload interaction is subtle, often
unintuitive, hard to explain
11
Modeling
OS
complex
scheduling
software
Performance
Goals
report
fast
computation
update
analytic
algorithms
query
Model
response time
measure
frequently
workload
12
Modeling
• Real system
– Complex, dynamic, frequent state changes
– Hard to tease out cause and effect
• Model
– Static snapshot, deals in averages and probabilities
– Fast enlightening answers to “what if ” questions
• Abstraction helps you understand real system
• Start with a study of priority scheduling
13
Priority Scheduling
• Priority state: order workloads by priority (ties OK)
– two workloads, 3 states: 12, 21, [12]
– three workloads, 13 states:
•
•
•
•
123
[12]3
1[23]
[123]
(6 = 3! of these ordered states),
(3 of these),
(3 of these),
(1 state with no priorities)
– n wkls, f(n) states, n! ordered (simplex lock combos)
• p(s) = prob( state = s ) = fraction of time in state s
• V(s) = degradation vector when state = s
(measure this, or compute it using queueing theory)
• V = s p(s)V(s) (time avg is convex combination)
• Achievable region is convex hull of vectors V(s)
Two workloads
d1 = d2
d2
V(12) (wkl 1 high prio)

 V([12]) (no priorities)
achievable region
V(21)

d1
15
Two workloads
d1 = d2
d2
V(12) (wkl 1 high prio)

 V([12]) (no priorities)

V(21)

d1
16
Two workloads
d1 = d2
d2
V(12) (wkl 1 high prio)

 V([12]) (no priorities)
note: u1 < u2  wkl 2 effect on wkl 1 large
V(21)

d1
17
Conservation
• No Free Lunch Theorem. Weighted average
degradation is constant, independent of priority
scheduling scheme:
i (ui /U) di = 1/(1-U)
• Provable from some hypotheses
• Observable in some real systems
• Sometimes false: shortest job first minimizes
average response time (printer queues,
supermarket express checkout lines)
18
Conservation
• For any proper set A of workloads
Imagine giving those workloads top priority.
Then can pretend other wkls don’t exist. In that case
i  A (ui /U(A)) di = 1/(1-U(A))
When wkls in A have lower priorities they have
higher degradations, so in general
i  A (ui /U(A)) di  1/(1-U(A))
• These 2n -2 linear inequalities determine the
convex achievable region R
• R is a permutahedron: only n! vertices
19
d2 : workload 2 degradation
Two Workloads
conservation law:
(d1 , d2 ) lies on the line
u 1d1 + u 2d2 = 1/(1-U)
d1 : workload 1 degradation
20
d2 : workload 2 degradation
Two Workloads
constraint resulting
from workload 1
d 1  1/(1- u1 )
d1 : workload 1 degradation
21
d2 : workload 2 degradation
Two Workloads
Workload 1 runs at high priority:
 V(1,2) = (1 /(1- u1 ), 1 /(1- u1 )(1-U) )
constraint resulting
from workload 1
d1  1 /(1- u1 )
d1 : workload 1 degradation
22
d2 : workload 2 degradation
Two Workloads

u 1d1 + u 2d2 = 1/(1-U)

V(2,1)
d1 : workload 1 degradation
d2  1 /(1- u2 )
23
d2 : workload 2 degradation
Two Workloads
 V(1,2) = (1 /(1- u1 ), 1 /(1- u1 )(1-U) )
u 1d1 + u 2d2 = 1/(1-U)
d1  1 /(1- u1 )

V(2,1)
d1 : workload 1 degradation
d2  1 /(1- u2 )
24
Three Workloads
• Degradation vector (d1,d2, d3) lies on plane
u1 d1 + u2 d2 + u3dr3 = C
• We know a constraint for each workload w:
uw dw  Cw
• Conservation applies to each pair of wkls as well:
u1 d1 + u2 d2  C12
• Achievable region has one vertex for each priority
ordering of workloads: 3! = 6 in all
• Hence its name: the permutahedron
25
Three Workload Permutahedron
3! = 6 vertices (priority orders)
23 - 2 = 6 edges
(conservation constraints)
V(2,1,3)
d3
u1 r1 + u2 d2 + u3 d3 = C
V(1,2,3)
 
d2
d1
26
Experimental evidence
27
Four workload permutahedron
4! = 24 vertices (ordered states)
24 - 2 = 14 facets (proper subsets)
(conservation constraints)
74 faces (states)
Simplicial geometry and transportation polytopes,
Trans. Amer. Math. Soc. 217 (1976) 138.
28
Map shares to degradations
- two workloads • Suppose f1 and f2 > 0 , f1 + f2 = 1
• Model: System operates in state
– 12 with probability f1
– 21 with probability f2
(independent of who is on queue)
• Average degradation vector:
V = f1 V(12) + f2 V(21)
29
Predict Degradations From Shares
(Two Workloads)
• Reasonable modeling assumption: f1 = 1, f2 = 0
means workload 1 runs at high priority
• For arbitrary shares: workload priority order is
(1,2) with probability f1
(2,1) with probability f2
(probability = fraction of time)
• Compute average workload degradation:
d1 = f1  (wkl 1 degradation at high priority)
+ f2  (wkl 1 degradation at low priority )
Dec 13, 2000
Fair Share Scheduling
30
Model validation
31
Model validation
32
Map shares to degradations
- three (n) workloads prob(123) =
f1
f2
f3
-----------------------------(f1 + f2 + f3) (f2 + f3) (f3)
• Theorem: These n! probabilities sum to 1
– interesting identity generalizing adding fractions
– prove by induction, or by coupon collecting
• V = ordered states s prob(s) V(s)
• O(n!), (n!), good enough for n  9 (12)
33
Model validation
34
Model validation
35
The Fair Share Applet
• Screen captures on next slides are from
www.bmc.com/patrol/fairshare
• Experiment with “what if” fair share modeling
• Watch a simulation
• Random virtual job generator for the
simulation is the same one used to generate
random real jobs for our benchmark studies
36
Three Transaction Workloads
1
???
2
???
3
???
• Three workloads, each with utilization
0.32 jobs/second  1.0 seconds/job = 0.32 = 32%
• CPU 96% busy, so average (conserved) response time is
1.0/(10.96) = 25 seconds
• Individual workload average response times depend on
shares
37
Three Transaction Workloads
1
32.0
2
48.0
3
20.0
sum 80.0
• Normalized f3 = 0.20 means 20% of the time workload 3
(development) would be dispatched at highest priority
• During that time, workload priority order is (3,1,2) for
32/80 of the time, (3,2,1) for 48/80
• Probability( priority order is 312 ) = 0.20(32/80) = 0.08
38
Three Transaction Workloads
• Formulas on previous slide
• Average predicted response time weighted by
throughput 25 seconds (as expected)
• Hard to understand intuitively
• Software helps
39
note change from 32%
Three
Transaction
Workloads
40
jobs currently on run queue
Simulation
41
When the Model Fails
• Real CPU uses round robin scheduling to
deliver time slices
• Short jobs never wait for long jobs to complete
• That resembles shortest job first, so response
time conservation law fails
• At high utilization, simulation shows smaller
response times than predicted by model
• Response time conservation law yields
conservative predictions
42
Scaling Degradation Predictions
V = ordered states s prob(s) V(s)
Each s is a permutation of (1,2, … , n)
Think of it as a vector in n-space
Those n! vectors lie on of a sphere
For n large they are pretty densely packed
Think of prob(s) as a discrete
approximation to a probability distribution
on the sphere
• V is an integral
•
•
•
•
•
•
43
Monte Carlo
• loop sampleSize times
choose a permutation s at random from the
distribution determined by the shares
compute degradation vector V(s)
accumulate V += prob(s)V(s)
• sampleSize = 40000 works well
independent of n!
44
Map shares to degradations
(geometry)
• Interpret shares as barycentric coordinates in
the n-1 simplex
• Study the geometry of the map from the
simplex to the n-1 dimensional permutahedron
• Easy when n=2: each is a line segment and map
is linear
45
Mapping a triangle to a hexagon
f1 = 1

f1 = 0
f3 = 1
132
312
M
321
123
wkl 1 high priority
wkl 1 low priority
213
231
46
f1 = 0
Mapping a triangle to a hexagon
f1 = 1

{23}
47
Mapping a triangle to a hexagon
48
What This Means
• Add a strong statement that summarizes
how you feel or think about this topic
• Summarize key points you want your
audience to remember
49
Download