The Buffer - Bandwidth Tradeoff in Lossless Networks Alex Shpiner, Eitan Zahavi, Ori Rottenstreich April 2014 Background: Network Incast Congestion 100% 300% 100% 100% Source: http://theithollow.com/2013/05/flow-control-explained/ © 2014 Mellanox Technologies - Mellanox Technologies- 2 Background: Pause Frame Flow Control Buffer Source: http://theithollow.com/2013/05/flow-control-explained/ © 2014 Mellanox Technologies - Mellanox Technologies- 3 Background: Incast with Pause Frame Flow Control 33% 33% 100% 33% Source: http://theithollow.com/2013/05/flow-control-explained/ © 2014 Mellanox Technologies - Mellanox Technologies- 4 Background: Congestion Spreading Problem 33% instead of 66% This flow is also paused, since the pause control does not distinguish between flows. Effective link bandwidth= Link bandwidth * %unpaused Small buffers Link pauses Congestion spreading Effective link bandwidth decrease To deal with Incast we can: • Increase buffers • Increase link bandwidth Trade-off Source: http://theithollow.com/2013/05/flow-control-explained/ © 2014 Mellanox Technologies - Mellanox Technologies- 5 Buffer-bandwidth Tradeoff Higher bandwidth allows: • Faster buffer draining • Pausing link for longer periods, while achieving same effective bandwidth. reduced buffering demand. Effective bandwidth = Link bandwidth * %unpaused Aim: evaluate the buffer-bandwidth tradeoff. Assumptions: • Lossless network • Congestion spreading avoidance is desired © 2014 Mellanox Technologies - Mellanox Technologies- 6 Evaluation Flow Traffic Injector 1). Assume network with links of bandwidth C and buffers of size B 0 C B C 1 C/N B C/N Traffic Receiver C/N B C C Switch 2). Define the most challenging workload the network can handle without congestion spreading Data Rate N-1 Time 3). Increase links bandwidth by α and reduce buffer size by β Transmit the workload at the same rate, while keeping the same effective link bandwidth © 2014 Mellanox Technologies 4). Evaluate the relation between α and β that able to handle the workload from step 2. - Mellanox Technologies- 𝐶𝑛𝑒𝑤 = 𝛼𝐶, 𝐵𝑛𝑒𝑤 = 𝛽𝐵, 𝛼>1 𝛽<1 β = 𝑓(𝛼) 7 Network Model The most challenging workload: Data Rate Injected Traffic Traffic Injector 0 C B Time 1 C C/N B C/N Injected Traffic Data Rate Traffic Receiver C/N B C C Switch N-1 Time © 2014 Mellanox Technologies - Mellanox Technologies- 8 Determining the Workload Parameters Arrival rate λin C t T/N T Assumption: output link is 100% utilized. • Burst length = 𝑇/𝑁 Q 𝑇 =? Queue size • It takes 𝑡 = 𝑇 N senders to fill buffer of size 𝐵 at arrival rate 𝐶 and departure rate 𝐶/𝑁: B 𝑁 𝑇 𝐵 = 𝑁 𝐶 − 𝐶 𝑁 𝑁2𝐵 𝑇= 𝐶(𝑁 − 1) t © 2014 Mellanox Technologies - Mellanox Technologies- 9 Effect of Buffer Size Reduction with Link Acceleration λin C Arriving traffic t T/N T/N TT Input Queue Size Q B 𝛽𝐵 𝑡1 = 𝐶(1 − α 𝑁) 𝛽𝐵 𝑡2 = 𝑡3 − 𝛼𝐶 𝑁 𝑇 𝑡3 = 𝛼 𝑄𝑚𝑎𝑥 = 𝐵 t Q Input Queue Size βB 𝑄𝑚𝑎𝑥 = 𝛽𝐵, 𝛽 < 1 𝐶𝑛𝑒𝑤 = 𝛼𝐶, 𝛼 > 1 © 2014 Mellanox Technologies t1 t2 t3 Conclusions: • We can push more traffic. • When the buffer is full, the link is in paused mode: 𝐶𝑒𝑓𝑓 = 𝛼𝐶 𝑁 (congestion spreading) t - Mellanox Technologies- 10 Effect of Buffer Size Reduction with Link Acceleration– cont. Q When buffer is full, the link is in paused mode: • Paused mode 𝐶𝑒𝑓𝑓𝑒𝑐𝑡𝑖𝑣𝑒 = 𝛼𝐶 𝑁 βB congestion spreading t1 %𝑝𝑎𝑢𝑠𝑒𝑑 = 𝑡2 −𝑡1 𝑇 =⋯= 𝑁−𝛼−𝛽 𝑁−1 𝛼 𝑁−𝛼 𝑡1 = (1) 𝛽𝐵 ;𝑡 𝐶(1−α 𝑁) 2 t2 = 𝑡3 − t3 t T 𝛽𝐵 ;𝑡 𝛼𝐶 𝑁 3 = 𝑇 𝛼 By how much the buffer can be reduced (β) to avoid congestion spreading (%𝑝𝑎𝑢𝑠𝑒𝑑 = 0) ? • For %𝑝𝑎𝑢𝑠𝑒𝑑 = 0, 𝛼 = - 𝛽 = 0.6 © 2014 Mellanox Technologies = 1.4, 𝑁 = 2 (incast 2 56 40 = 1.4, 𝑁 = 10 (incast 10 1): 40% of buffer saving!!! • For %𝑝𝑎𝑢𝑠𝑒𝑑 = 0, 𝛼 = - 𝛽 = 0.95 56 40 1): only 5% of buffer saving - Mellanox Technologies- 11 Effect of Buffer Size Reduction with Link Acceleration– cont. Q Observation: we are allowed to pause the link, since we increased the link capacity. • 𝐶 = 𝛼𝐶 1 − %𝑝𝑎𝑢𝑠𝑒𝑑 + 𝛼𝐶 𝑁 ∙ %𝑝𝑎𝑢𝑠𝑒𝑑 βB %𝑝𝑎𝑢𝑠𝑒𝑑 = 𝛼−1 𝛼 𝛼−𝑁 - For 𝛼 = 1.4, 𝑁 = 10 : %𝑝𝑎𝑢𝑠𝑒𝑑 = 32% • Using (1) and (2) 56 𝛽= - For 𝑁 = 10, 𝛼 = 40 = 1.4 2 t1 𝑡1 = 𝛽𝐵 ;𝑡 𝐶(1−α 𝑁) 2 t2 = 𝑡3 − t3 t T 𝛽𝐵 ;𝑡 𝛼𝐶 𝑁 3 = (𝑁−𝛼)(2𝑁−𝛼𝑁−1) . (𝑁−1)2 𝛽 = 0.53!!! We can save 47% of buffer size with 40% of link rate increase, to get the same performance! And we can also push more data in total (e.g. 56Gbps vs 40Gbps) • with the congestion spreading cost © 2014 Mellanox Technologies - Mellanox Technologies- 12 𝑇 𝛼 Asymptotic Analysis For 𝜶 ≥ 𝟐 no buffering is required in the switches! By increasing link bandwidth by X% the buffer saving is at least X% for any incast load! Buffer Size Saving (1-β) Buffer Size Saving (1-β) 100% 90% α 80% 1.1 80% 70% 1.2 70% 60% 1.3 60% 50% 1.4 50% 40% 1.5 40% 30% 1.6 30% 20% 1.7 20% 10% 1.8 10% 0% 1.9 0% 100% 0 5 10 15 20 25 30 35 40 45 90% 50 2 4 8 16 32 64 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2 α N (Incast ratio N to 1) © 2014 Mellanox Technologies N - Mellanox Technologies- 13 Multiple Incast Cascade Analysis Rank 2 Rank 1 Rank n Traffic Injectors Last rank defines the workload parameters: C 𝐵 ∙ 𝑘𝑛 𝑇𝑛 = ; 𝐶 𝑘𝑛 − 1 𝑇 𝑇𝑛−1 = 𝑛 𝑘 ; 𝑛−1 𝑇𝑛 𝑇1 = 𝑛−1 𝑖=1 𝑘𝑖 λin C B B k1 B Tn C Q C Switch B B k2 Traffic Receiver C B k1 C B Switch C B B t T C kn B B B B Switch Switch t C Rank n, α>1, β<1: C αC The analysis is similar to a single-rank case, but now the traffic arrives at rate 𝛼𝐶 λin Tn/α T t 𝜷 = 𝟏 − 𝜶 ∗ %𝒑𝒂𝒖𝒔𝒆𝒅 Q βB 𝛽 𝛼 = 1.4, %𝑝𝑎𝑢𝑠𝑒𝑑 = 32% = 0.55 t © 2014 Mellanox Technologies - Mellanox Technologies- 14 Conclusions We can reduce switch buffer size, while still pushing the same traffic. • But, we pay with: - Congestion spreading (pause frames) - Buffers at the traffic sources (NICs) By increasing the links bandwidth, we can reduce the congestion spreading • And push more traffic. We can save X% of buffer size with X% of link rate increase (for any incast). When increasing the links bandwidth by a factor of at least 2 (𝜶 ≥ 𝟐) no buffering is required at the switches. The results hold also for the multiple incast cascade. © 2014 Mellanox Technologies - Mellanox Technologies- 15 Thank You Thank You © 2014 Mellanox Technologies - Mellanox Technologies- 16