Server Farms
Q: Why would someone choose M/M/k over M/M/1, then? M/M/1 (a single fast server) is better under low load and just as good under high load.
A: Server farms (and "clouds")!
• k slow servers of speed μ are much cheaper than one fast server of speed kμ.
• But this means many more servers (100s or 1000s).

Thrasyvoulos Spyropoulos / spyropou@eurecom.fr, Eurecom, Sophia-Antipolis

Capacity Provisioning for Server Farms
Problem: How many servers do I need?
• Fact: the more servers, the better the response time.
• Fact: a server that sits idle still consumes 60% of the power.
• Power for running server farms is among the biggest costs and concerns of a company.
• "Greening" the Internet is a "hot" research topic!
New problem: what is the minimum number of servers that guarantees a low E[T] or a low P_Q?
• We can get this from the M/M/k equations (but not easily).

Building up Intuition
M/M/1 rule of thumb: utilization ρ should stay below 0.8.
• ρ = 0.8 gives E[N] = 4
• ρ = 0.95 gives E[N] = 19 (delays explode!)
M/M/k:
$E[T_Q]^{M/M/k} = \frac{1}{\lambda} \cdot P_Q \cdot \frac{\rho}{1-\rho}$
• Not as clear to tell (it depends on both ρ and P_Q).
Q: How about E[T_Q]/P_Q?
A: It is the expected waiting time of the delayed customers only:
$E[T_Q \mid \text{delayed}] = \frac{E[T_Q]}{P_Q} = \frac{1}{\lambda} \cdot \frac{\rho}{1-\rho} = \frac{1}{k\mu(1-\rho)}$
Q: What does this equation imply for high ρ (e.g. ρ = 0.95)?
A: High ρ does not imply high delay: just add more servers!
• 5 servers: delay = 4/μ
• 100 servers: delay = 1/(5μ)
Q: Why?
A: Even if every server has high ρ, the probability that all of them are busy at the same time is lower.

M/M/∞
Goal: find the minimum k such that P_Q < X% (e.g. 20%).
• Bounding P_Q is equivalent to bounding E[T_Q], etc.
• It is easier to consider M/M/∞ first.
Q: What are the local balance equations?
$\pi_i = \left(\frac{\lambda}{\mu}\right)^{i} \frac{1}{i!} \, \pi_0$
Q: What is this?
A: The number of customers in an M/M/∞ system is Poisson(λ/μ).
Q: What are E[N] and E[T]? Do they confirm your intuition?
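The provisioning goal stated above (the minimum k with P_Q < 20%) and the delayed-customer waiting time can be explored numerically. The following is a minimal sketch in Python, assuming the standard Erlang-C formula for P_Q in an M/M/k queue. The function names and the numerical values (μ = 1, λ = 100) are illustrative choices, not taken from the slides.

```python
import math

def erlang_c(k, offered_load):
    """P_Q (probability an arrival must queue) in an M/M/k with R = lam/mu."""
    rho = offered_load / k
    if rho >= 1.0:
        raise ValueError("need offered_load < k for a stable queue")
    term = 1.0          # R^i / i!, starting at i = 0
    partial_sum = 1.0   # running sum of R^i / i! for i < k
    for i in range(1, k):
        term *= offered_load / i
        partial_sum += term
    tail = term * offered_load / k / (1.0 - rho)   # (R^k / k!) / (1 - rho)
    return tail / (partial_sum + tail)

def delayed_wait(k, mu, rho):
    """E[T_Q | delayed] = 1 / (k * mu * (1 - rho))."""
    return 1.0 / (k * mu * (1.0 - rho))

def min_servers(lam, mu, target=0.20):
    """Smallest k with P_Q < target, found by linear search from a stable k."""
    offered_load = lam / mu
    k = math.floor(offered_load) + 1
    while erlang_c(k, offered_load) >= target:
        k += 1
    return k

mu = 1.0
for k in (5, 100):
    print(k, "servers at rho = 0.95: E[T_Q | delayed] =", delayed_wait(k, mu, 0.95))

lam = 100.0
k_min = min_servers(lam, mu)
print("R =", lam / mu, "-> minimum k with P_Q < 20%:", k_min)
```

The linear search is adequate here because P_Q decreases as servers are added for a fixed offered load; for very large R a bisection over k would be a natural refinement.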
Square-Root Staffing Rule
Def: R = λ/μ (assume R is large).
Main result: with only $k = R + \sqrt{R}$ servers, P_Q < 20%.
Q: What is the probability of having more than $R + \sqrt{R}$ jobs in an M/M/∞?
A1: $P\{\text{Poisson}(R) > R + \sqrt{R}\}$
A2: For large R, Poisson(R) ≈ Normal(R, R), i.e. mean R and variance R, so the standard deviation is $\sqrt{R}$.
⇒ Final answer: P{Normal exceeds its mean by more than 1 standard deviation} = 16%.
Q: Is this probability higher or lower for M/M/k?
A: Higher! M/M/∞ has more resources (servers) to "clear" extra work.
M/M/k: it turns out that $R + \sqrt{R}$ servers are enough for P_Q < 20%.
See Ch. 16 (Th. 48) for a more detailed rule and proof.

Bulk Arrival Systems: M[K]/M/2/6
Consider:
• A two-server system (memoryless)
• Service rate μ (for each server)
• System size = 6 (2 in service and 4 in the queue)
Batch/bulk arrivals:
• Batches of jobs arrive as a Poisson(λ) process.
• Batch size distribution: each batch contains 1, 2, or 3 jobs, with
  Pr{X=1} = 0.5, Pr{X=2} = 0.3, Pr{X=3} = 0.2
• Average batch size: E[X] = 1(0.5) + 2(0.3) + 3(0.2) = 1.7
Q: How do we solve this queueing system?
Q: Can we still use a CTMC to solve it?
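The system described above has only seven states (0 to 6 jobs), so its stationary distribution can be computed numerically. The following is a minimal sketch in pure Python under assumptions that are not in the slides: λ = μ = 1, jobs of a batch that do not fit into the system are dropped (partial batch acceptance), and the balance equations are solved by uniformizing the CTMC and iterating π ← πP. All function names are illustrative.

```python
def bulk_arrival_rates(lam, mu, batch_pmf, servers=2, capacity=6):
    """Off-diagonal CTMC rates q[i][j]; state i = number of jobs in the system."""
    n = capacity + 1
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for size, prob in batch_pmf.items():
            j = min(i + size, capacity)   # ASSUMPTION: overflow jobs are dropped
            if j != i:
                q[i][j] += lam * prob
        if i > 0:
            # jobs leave one at a time; min(i, servers) servers are busy
            q[i][i - 1] += min(i, servers) * mu
    return q

def stationary(q, iterations=20000):
    """Uniformize the chain and repeat pi <- pi P until (numerically) stationary."""
    n = len(q)
    out_rate = [sum(row) for row in q]
    unif = 1.1 * max(out_rate)            # uniformization constant > every exit rate
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # probability mass that stays put (self-loop of the uniformized chain)
        new = [pi[j] * (1.0 - out_rate[j] / unif) for j in range(n)]
        for i in range(n):
            for j in range(n):
                if q[i][j]:
                    new[j] += pi[i] * q[i][j] / unif
        pi = new
    return pi

lam, mu = 1.0, 1.0                        # illustrative values
batch_pmf = {1: 0.5, 2: 0.3, 3: 0.2}      # batch-size distribution from the slide
pi = stationary(bulk_arrival_rates(lam, mu, batch_pmf))
for state, prob in enumerate(pi):
    print("pi_%d = %.4f" % (state, prob))
print("E[N] =", sum(state * prob for state, prob in enumerate(pi)))
```

Changing `servers`, `capacity`, or `batch_pmf` gives other bulk-arrival variants. A different overflow policy (for example, rejecting a whole batch that does not fit) would only change the line marked as an assumption.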
Solving the Bulk Arrival System
• Local balance on a cut
• Global balance equations (state i = number of jobs in the system; two servers, so the departure rate is μ in state 1 and 2μ in states 2 and above):
$\pi_0 (0.5\lambda + 0.3\lambda + 0.2\lambda) = \pi_1 \mu \;\Rightarrow\; \pi_1 = \frac{\lambda}{\mu} \pi_0$
$\pi_1 (0.5\lambda + 0.3\lambda + 0.2\lambda + \mu) = 2\mu \, \pi_2 + 0.5\lambda \, \pi_0 \;\Rightarrow\; \pi_2 = \frac{\lambda(\lambda+\mu)}{2\mu^2} \pi_0 - \frac{0.5\lambda}{2\mu} \pi_0$
$\pi_2 (\lambda + 2\mu) = 2\mu \, \pi_3 + 0.3\lambda \, \pi_0 + 0.5\lambda \, \pi_1$

Bulk/Batch Departures: M/M[K]/n
• Jobs are served in batches.
• The time between batches is exponential(μ).
• Fixed batch size k.
Example with batch size k = 3:
$\lambda \pi_0 = \mu \pi_1 + \mu \pi_2 + \mu \pi_3$
$(\lambda + \mu) \pi_1 = \lambda \pi_0 + \mu \pi_4$
$(\lambda + \mu) \pi_2 = \lambda \pi_1 + \mu \pi_5$
$\vdots$
$(\lambda + \mu) \pi_n = \lambda \pi_{n-1} + \mu \pi_{n+k}$

Solving Bulk Arrival/Departure Systems
• Define the transition matrix P.
• Solve $\pi (I - P) = 0$
• together with $\sum_i \pi_i = 1$.

Examples of Batch/Bulk Arrivals/Service
Batch arrivals:
• Customers arrive at a server in buses. Bus arrivals are Poisson, and the number of customers in each bus is a random variable X.
• The number of files requested at a web/file server is random, and requests arrive as a Poisson process.
Batch departures:
• Multicasting popular files: some of the nodes in the queue might be asking for the same file.
• If the file requested by the node at the head of the queue was also requested by k other nodes in the queue, all of these nodes are served with a single broadcast message (batch size k).
• The batch size k depends on the popularity of the requested file and on the number of customers in the queue.

PASTA Property
Assume you are simulating or observing a queueing system.
• p_n: (limiting) probability of being in state n (n jobs in the system)
• Ergodicity ⇒ p_n = long-run fraction of time spent in state n
• a_n: probability that an arrival finds n jobs
• d_n: probability that a departure leaves n jobs behind
Goal: we are interested in measuring p_n (the percentage of time in state n).
Q: How?
Method 1: let the system run for a long (ideally infinite) time, measure the state, repeat many times, and take the average.
Method 2: measure the number of jobs at the times of arrivals.
Q: Does this work? Is a_n = p_n?

PASTA Property (2)
Q: Is a_n = d_n?
A: Yes, if arrivals and departures happen one at a time (no batches).
Q: Is a_n = p_n?
A: No, not necessarily!
Example: consider a single-queue system.
• Interarrival times: uniform in (1, 2)
• Service times: deterministic, with a duration of 1
Q: What is a_0?
A: a_0 = 1, because every customer completes service before the next arrival.
Q: What is p_0?
A: p_0 ≠ 1. The server is busy for 1 time unit out of every 1.5 on average, so p_0 = 1/3.

PASTA: (P)oisson (A)rrivals (S)ee (T)ime (A)verages
Theorem: If arrivals are Poisson, then a_n = p_n = d_n.
Proof:
$p_n = \lim_{t \to \infty} P(N(t) = n)$
$a_n = \lim_{t \to \infty} P(N(t) = n \mid \text{an arrival occurred just after time } t)$
Define A(t, t+δ): the event that an arrival occurred just after t, i.e. in (t, t+δ).
For Poisson arrivals, A(t, t+δ) is independent of the past and hence of N(t), so
$a_n = \lim_{t \to \infty} \lim_{\delta \to 0} P(N(t) = n \mid A(t, t+\delta)) = \lim_{t \to \infty} P(N(t) = n) = p_n$

PASTA Property in Simulations/Experiments
[Figure: a system moving between states S1 to S5, with the measurement instants marked]
• Assume you are simulating a system or observing a real system (e.g. a backbone router).
• The system moves randomly from state to state (e.g. Markov chain, queueing system).
• PASTA ⇒ we can sample the system state at exponentially distributed times, e.g. send measurement ("probe") packets with exponential inter-packet times.
Q: Do the sampling intervals need to be exponentially distributed?
A: No! The sampling times just need to be independent of N(t).

Queueing Networks: A Simple Tandem Queue
Normal approach: define a CTMC.
• Finite CTMC: can be solved in Matlab (for numerical rates).
• Infinite CTMC in multiple (2) dimensions: VERY hard!

Time-reversibility of Markov Chains
Forward chain: … → 3 → 5 → 1 → 2 → 1 → 3 → 4 → 1 → …
Reverse chain: … ← 3 ← 5 ← 1 ← 2 ← 1 ← 3 ← 4 ← 1 ← … (the same path traversed backwards in time)
Q: Is the reverse chain a CTMC?
A: Yes!
VIEW 1 of a CTMC can be shown: the time spent in state i is exponentially distributed.
Q: What is the probability p*_ik (going from i to k in the reverse chain)?
A: It is the probability that the forward chain came to i from k (and not from another state).
Q: What is Σ_k p*_ik?
A: Σ_k p*_ik = Σ_k P(reverse chain moves from state i to k) = 1

Time-Reversibility (2)
Q: How does π*_i relate to π_i?
A: π*_i = π_i
Time-reversibility Theorem: If $\pi_i q_{ik} = \pi_k q_{ki}$ for all i, k, then the forward chain and the reverse chain are statistically identical!
Example: consider an M/M/1 system. The theorem says that the rate of transitions from n to n+1 in the forward chain (queue increases from n) equals the rate of transitions from n to n+1 in the reverse chain, which are the forward chain's transitions from n+1 to n (queue decreases from n+1).

Time-reversibility examples
[Figure: birth-death chain on states 0, 1, 2, … with birth rates λ0, λ1, λ2, … and death rates μ1, μ2, μ3, …]
Q: Are birth-death processes time-reversible?
A: Yes, they are: after a transition n → n+1, the chain can only return to n through a transition n+1 → n.
Q: What about batch arrival systems?
A: No, not necessarily!

Burke's Theorem
Setting: an M/M/1 queue with Poisson(λ) arrivals (e.g. 3 jobs/sec) and exp(μ) service (e.g. 5 jobs/sec). Burke's Theorem also holds for M/M/k.
Q1: What is the departure process from an M/M/1?
A: It is Poisson(λ).
Q2: How does N(t) (the number of jobs in the system at time t) depend on the sequence of departure times prior to t?
A: It does not!
Proof:
• Q1: Departures in the M/M/1 are arrivals in the reverse process, but the reverse process is an identical M/M/1, whose arrivals are Poisson(λ).
• Q2: The sequence of departures prior to t is the sequence of arrivals (in the reverse chain) after t, which is clearly independent of N(t).

Tandem Queue: Solution Using Burke's Theorem
1st queue: M/M/1, so $P(n_1 \text{ jobs}) = \rho_1^{n_1} (1 - \rho_1)$
Q: What about the 2nd queue?
A: It seems to be an M/M/1 as well, so $P(n_2 \text{ jobs}) = \rho_2^{n_2} (1 - \rho_2)$
Q: But isn't N_2(t) dependent on N_1(t)?
A: No:
• Departures from queue 1 (before t) are arrivals to queue 2 before t.
• Departures before t are independent of N_1(t) (Burke).
• Arrivals (to queue 2) before t, together with the service times at queue 2, completely define N_2(t).
⇒ N_1(t) and N_2(t) are independent:
$\pi_{n_1, n_2} = \rho_1^{n_1} (1 - \rho_1) \, \rho_2^{n_2} (1 - \rho_2)$

Example of Tandem Queues
Q: Which of the two systems has better performance?
A: Neither! They both have the same mean response time.
Q: How can you quickly prove it?
A: Use Little's Law.

An Acyclic Network with Probabilistic Routing
Q: How can we solve this queueing system?
Q: Can we still treat each individual queue as an M/M/1?
A: Yes. Use Burke's Theorem and Poisson splitting:
$\pi_{n_1, n_2, \ldots, n_k} = \rho_1^{n_1} (1 - \rho_1) \, \rho_2^{n_2} (1 - \rho_2) \cdots \rho_k^{n_k} (1 - \rho_k)$

Queueing Networks: Jackson Network
• Exponential servers
• FCFS queues
• Probabilistic routing
• Allows loops (cycles)
Q: Is each queue still an M/M/1?

Jackson Network: A Counter-example
Q: Is the total arrival process into the server (i.e. external arrivals plus feedback) a Poisson process?
Q: Does the arrival process into this server look Poisson?
A: No! The feedback process and the external process are dependent.

Jackson Network: Is a Product-Form Network
$P\{\text{network state is } (n_1, n_2, \ldots, n_k)\} = \prod_{i=1}^{k} P\{n_i \text{ jobs at server } i\} = \prod_{i=1}^{k} \rho_i^{n_i} (1 - \rho_i)$
• This is a very important result!
• It transforms an infinite k-dimensional Markov chain into a simple closed form.
• We can still treat each queue independently!

Jackson Network Example: a Web Server
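Once the utilization ρ_i of every queue is known, the product form can be evaluated directly. The following is a minimal sketch in Python for a two-queue tandem system. The rates (λ = 3, μ = 5 and μ = 4 jobs/sec) and the function names are illustrative assumptions. The sketch also applies Little's Law to get the mean response time and checks that swapping the order of the two servers leaves it unchanged, which is one assumed reading of the comparison in "Example of Tandem Queues".

```python
def product_form_prob(state, rhos):
    """P{state = (n_1, ..., n_k)} = prod_i rho_i^n_i * (1 - rho_i)."""
    prob = 1.0
    for n_i, rho_i in zip(state, rhos):
        prob *= (rho_i ** n_i) * (1.0 - rho_i)
    return prob

def mean_jobs(rhos):
    """E[N] = sum_i rho_i / (1 - rho_i): each queue behaves like an M/M/1."""
    return sum(rho / (1.0 - rho) for rho in rhos)

def mean_response_time(rhos, external_rate):
    """Little's Law: E[T] = E[N] / lambda, with lambda the external arrival rate."""
    return mean_jobs(rhos) / external_rate

lam = 3.0                          # external Poisson arrival rate (illustrative)
mu_first, mu_second = 5.0, 4.0     # service rates of the two queues (illustrative)

rhos = [lam / mu_first, lam / mu_second]
swapped = [lam / mu_second, lam / mu_first]   # same servers, opposite order

print("P{(0, 0)} =", product_form_prob((0, 0), rhos))
print("P{(2, 1)} =", product_form_prob((2, 1), rhos))
print("E[N] =", mean_jobs(rhos))
print("E[T] original order =", mean_response_time(rhos, lam))
print("E[T] swapped order  =", mean_response_time(swapped, lam))
```

For a network with routing or feedback, the per-queue arrival rates would first have to be obtained from the network's flow-balance equations; the product-form evaluation itself stays the same.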