Backward Congestion Notification Version 2.0
Davide Bergamasco (davide@cisco.com)
Rong Pan (ropan@cisco.com)
Cisco Systems, Inc.
IEEE 802.1 Interim Meeting
Garden Grove, CA (USA)
September 22, 2005

Credits
• Valentina Alaria (Cisco)
• Andrea Baldini (Cisco)
• Flavio Bonomi (Cisco)
• Manoj K. Wadekar (Intel)

BCN v2.0
• Desire from Mick to see an analytical study of BCN stability
• BCN v2.0 improvements
  • Linear control loop allows analysis of stability
  • Simplified detection mechanism
  • Reduced signaling rate
• Original BCN framework remains the same

BCN Background
[Figure: data center network. End Nodes A, B, C attach to Edge Switches A, B, C, which connect to a Core Switch over 10 Gbps links. Traffic causes congestion in the network, and BCN messages are sent back toward the traffic sources.]

Detection & Signaling
[Figure: congestion point queue, drawn between EMPTY QUEUE and FULL QUEUE with thresholds Qeq and Qsc. Incoming frames are sampled with probability P. The decision steps "Sampled Frame?" and "RL Tagged Frame?" select the message to generate: No Message / NOP, BCN (0,0), or BCN (Qoff, Qdelta).]
• Qoff = Qeq - Qlen, range [-Qeq, +Qeq]
• Qdelta = #pktEnq - #pktDeq, range [-2Qeq, +2Qeq]

Reaction
• Feedback: Fb = (Qoff - W * Qdelta)
• Additive Increase (Fb > 0): R = R + Gi * Fb * ru
• Multiplicative Decrease (Fb < 0): R = R * (1 - Gd * |Fb|)
• Parameters
  • W = derivative weight
  • Gi = increase gain
  • Gd = decrease gain
  • ru = rate unit
[Figure: edge node. Data IN passes through filters F1, F2, ..., Fn into rate limiters R1, R2, ..., Rn; frames with no match bypass the rate limiters. Data OUT toward the network core carries packets marked with RATE_LIMITED_TAG. Control IN receives BCN messages from the congested point.]

Suggested BCN Message Format
 0                            15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA = SA of sampled frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+    SA = MAC Address of CP     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   IEEE 802.1Q Tag or S-Tag                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = BCN        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Qoff              |            Qdelta             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
|        First N bytes of sampled frame starting from DA        |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Suggested RLT Tag Format
 0     3       7              15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA of rate-limited frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+   SA of rate-limited frame    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        IEEE 802.1Q Tag or S-Tag of rate-limited frame         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = RLT        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |EtherType of rate limited frame|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                 Payload of rate-limited frame                 +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Simulation Environment (1)
[Figure: simulation topology. A Core Switch connects edge switches ES1 through ES6. TCP bulk sources ST1-ST4 and UDP on/off sources SU1-SU4 send to destinations DT and DU; the nodes SJ, SR1, SR2, DR1, and DR2 are also shown. The congestion point is marked in the diagram.]

Simulation Environment (2)
• Short Range, High Speed DC Network
  • Link Capacity = 10 Gbps
  • Switch latency = 1 µs
  • Link Length = 100 m (0.5 µs propagation delay)
• Control loop
  • Delay ~ 3 µs
• Parameters
  • W = 2
  • Gi = 4
  • Gd = 1/64
  • Ru = 8 Mbps
• Workload
  • ST1-ST4: 10 parallel TCP connections transferring 1 MB each continuously
  • SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms

BCNv1.0
[Figure: simulation results for BCN v1.0, short-range scenario]

BCNv2.0
[Figure: simulation results for BCN v2.0, short-range scenario]
• Faster Transient Response
• Higher Stability @ Steady State

Simulation Environment (3)
• Long Range, High Speed DC Network
  • Link Capacity = 10 Gbps
  • Switch latency = 1 µs
  • Link Length = 20000 m (100 µs propagation delay)
• Control loop
  • Delay ~ 200 µs
• Parameters
  • W = 2
  • Gi = 4
  • Gd = 1/64
  • Ru = 8 Mbps
• Workload
  • ST1-ST4: 10 parallel TCP connections transferring 1 MB each continuously
  • SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms

BCNv1.0
[Figure: simulation results for BCN v1.0, long-range scenario]

BCNv2.0
[Figure: simulation results for BCN v2.0, long-range scenario]
• Much higher stability @ steady state with larger loop delays

Summary
• BCN v2 has a number of advantages …
  • Can be studied analytically
  • Better protection of TCP flows in mixed TCP and UDP traffic scenarios
  • Detection algorithm independent of Switch implementation
  • Better Performance
    • Lower signaling frequency (from 10% to 1%)
    • Better stability
    • Increased tolerance to loop delays
• … and one disadvantage
  • Slower convergence to fairness

A Control-Theoretic Approach to BCN Design and Analysis

Notation
N: Number of Flows
C: Link Capacity
τ: Round Trip Delay
w: Weight of the Derivative
Pm: Sampling Probability
Gi: Additive Increase Gain
Gd: Multiplicative Decrease Gain

Block Diagram of BCN Congestion Control
[Figure: control loop. The aggregate rate N * R minus the capacity C drives the queue q. The feedback Fb is computed from q, sampled with probability Pm, and returned through a Time Delay block to the sources, where the gains Gi and Gd produce the rate change ∆R.]
\[ Fb(T) = \bigl(q_{eq} - q(T)\bigr) - w\,\bigl(q(T) - q(T-1)\bigr) \]

Non-linear Differential Equations
Link Control
\[ \frac{dq(t)}{dt} = N\,R(t) - C \]
\[ Fb(t) = \bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt} \]
Source Control
If Fb(t-τ) > 0:
\[ \frac{dR(t)}{dt} = G_i\,Fb(t-\tau)\,R(t)\,P_m \]
If Fb(t-τ) < 0:
\[ \frac{dR(t)}{dt} = G_d\,Fb(t-\tau)\,R(t)\,R(t)\,P_m \]

Linearization Around Operating Point
• Using feedback control to analyze local stability
• Operating point: R = C/N; q' = qeq - q = 0
• Linearization difficulty: the system responds differently depending on sgn(Fb(t-d))
  • Fortunately, the response is a piecewise-linear function
• Details are in the appendix

Block Diagram of BCN Feedback Control
[Figure: linearized feedback loop with the following blocks.]
• Queue integrator N / s (loses 90° of phase margin)
• Loop delay \( e^{-s\tau} \)
• Feedback block \( Fb(s) = -\left(1 + \frac{w\,s}{C\,P_m}\right) \) (adds a lead zero to compensate)
• Multiplicative Decrease:
\[ \frac{G_d\,P_m\,C^2 / N^2}{s + w\,G_d\,C / N} \]
• Additive Increase:
\[ \frac{G_i\,P_m\,C / N}{s + G_i\,w} \]

The Effect Of Zero From Time Domain's Eyes
[Figure: time-domain curves of R and q, with the contribution of the zero (dq/dt) highlighted]

Choosing Parameters: an example
• Network conditions (10G link)
  • N = 50
  • τ = 200 µs
• Choose parameters such that the feedback loop is stable with a 35° margin
  • w = 4
  • Gi = 2 Mbps
  • Gd = 1/128
  • Pm = 0.01

Stability Result
[Figure: stability plot annotated "lost 90° margin"]
1. With N = 50 and delay = 200 µs, the system is stable
2. The phase margin translates into tolerating extreme network conditions of N -> 1000 flows or τ -> 1 ms before oscillation

Simulation Result Shows A Stable System for N = 50; Delay = 200 µs
[Figure: simulation plot]

Simulation Result Shows System is stable, but on the verge of oscillation: N = 50, Delay = 1 ms
[Figure: simulation plot]

Change W = 4 -> 1
1. With w = 1, a system with N = 50 and delay = 200 µs already runs out of margin and is on the verge of oscillation
2. With w = 1, the effect of the zero diminishes. The system can't cope with a wide range of network conditions

Indeed System is stable, but on the verge of oscillation even for N = 50, Delay = 200 µs when w = 1.0
[Figure: simulation plot]

Requests to 802.1
• Start a Task Force on Congestion Management
• Use BCN as a Baseline Proposal

Appendix

Linearizing…
\[ \delta\dot{q}(t) = N\,\delta R(t) \quad\Rightarrow\quad \delta q(s) = \frac{N\,\delta R(s)}{s} \]
\[ Fb(t) = -G\left(\delta q(t) + \frac{w\,\delta\dot{q}(t)}{C\,P_m}\right) \quad\Rightarrow\quad Fb(s) = -G\left(1 + \frac{w\,s}{C\,P_m}\right)\delta q(s) \]

Linearizing Additive Increase Function
\[ f:\quad \frac{dR(t)}{dt} = G_i\,R(t)\,P_m\,Fb(t) \]
\[ \frac{\partial f}{\partial Fb(t)} = G_i\,R(t)\,P_m = G_i\,\frac{C\,P_m}{N} \]
\[ \frac{\partial f}{\partial R(t)} = \frac{\partial}{\partial R(t)}\left[G_i\,R(t)\,P_m\,G\left(q_{eq} - q(t) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)\right] \]
Applying the product rule, the first term vanishes at the operating point (q = qeq, dq/dt = 0); the second term is evaluated with dq/dt = N R(t) - C and R = C/N:
\[ \frac{\partial f}{\partial R(t)} = G_i\,P_m\,G\left(q_{eq} - q(t) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right) + G\,G_i\,R(t)\,P_m\,\frac{\partial}{\partial R(t)}\left(q_{eq} - q(t) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right) \]
\[ = G\,G_i\,R(t)\,P_m\,\frac{\partial}{\partial R(t)}\left(-\bigl(N\,R(t) - C\bigr)\frac{w}{C\,P_m}\right) = -\,G\,G_i\,R(t)\,P_m\,\frac{N\,w}{C\,P_m} = -\,G\,G_i\,w \]
\[ \delta\dot{R} = G_i\,P_m\,\frac{C}{N}\,\delta Fb - G\,G_i\,w\,\delta R \]
\[ s\,\delta R = G_i\,P_m\,\frac{C}{N}\,\delta Fb - G\,G_i\,w\,\delta R \quad\Rightarrow\quad \frac{\delta R}{\delta Fb} = \frac{G_i\,P_m\,C / N}{s + G\,G_i\,w} \]

Linearizing Multiplicative Decrease Function
\[ g:\quad \frac{dR(t)}{dt} = G_d\,R(t)\,R(t)\,P_m\,Fb(t) \]
\[ \frac{\partial g}{\partial Fb(t)} = G_d\,R(t)\,R(t)\,P_m = \frac{G_d\,P_m\,C^2}{N^2} \]
\[ \frac{\partial g}{\partial R(t)} = \frac{\partial}{\partial R(t)}\left[G_d\,R(t)\,R(t)\,P_m\,G\left(q_{eq} - q(t) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)\right] \]
As for the additive case, the first term of the product rule vanishes at the operating point:
\[ \frac{\partial g}{\partial R(t)} = 2\,G_d\,R(t)\,P_m\,G\left(q_{eq} - q(t) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right) + G_d\,R^2(t)\,P_m\,G\,\frac{\partial}{\partial R(t)}\left(q_{eq} - q(t) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right) \]
\[ = G_d\,R^2(t)\,P_m\,G\,\frac{\partial}{\partial R(t)}\left(-\bigl(N\,R(t) - C\bigr)\frac{w}{C\,P_m}\right) = -\,G_d\,R^2(t)\,P_m\,G\,\frac{N\,w}{C\,P_m} = -\,G_d\,R(t)\,G\,w = -\,G_d\,\frac{C}{N}\,G\,w \]
\[ \delta\dot{R} = \frac{G_d\,P_m\,C^2}{N^2}\,\delta Fb - G\,w\,G_d\,\frac{C}{N}\,\delta R \]
\[ s\,\delta R = \frac{G_d\,P_m\,C^2}{N^2}\,\delta Fb - G\,w\,G_d\,\frac{C}{N}\,\delta R \quad\Rightarrow\quad \frac{\delta R}{\delta Fb} = \frac{G_d\,P_m\,C^2 / N^2}{s + G\,w\,G_d\,C / N} \]

Issue #1: Non-linearity
[Figure: queue length Q versus time t, oscillating around Qeq, with regions marked + and - and the annotation "Stop Generation of BCN Messages"]
• ISSUE: Overshoots and undershoots accumulate over time
• SOLUTION: Signal only when
  • Q > Qeq && dQ/dt > 0
  • Q < Qeq && dQ/dt < 0
• Easy to implement in hardware: just an Up/Down counter
  • Increment @ every enqueue
  • Decrement @ every dequeue
• Reduces signaling rate by 50%!!

Issue #2: Specific Detection Mechanism
[Figure: detection flowchart. The queue between EMPTY QUEUE and FULL QUEUE is divided around EQUILIBRIUM into levels T-4 through T+4. Incoming frames are sampled with probability P. The decision steps "Sampled Frame?", "RL Tagged Frame?", "RL Tag && Solicit Bit Set?", "dQ/dt < 0?", and "dQ/dt > 0" select the message to generate: No Message / NOP, BCN 0, BCN-4 through BCN-1, or BCN+1 through BCN+4.]
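
As an illustration of the rules stated on the Detection & Signaling and Reaction slides, here is a minimal Python sketch of one congestion point and one rate limiter. It only shows how Qoff, Qdelta, and Fb could be computed and applied. The class and method names, the reset of the enqueue/dequeue counters at each sample, the use of Mbps as the rate unit, and the clamping of the rate to a floor and to the line rate are all assumptions made for this example, not part of the slides. The decision of when a BCN message is actually sent is deliberately left out.

```python
from dataclasses import dataclass


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class CongestionPoint:
    """Queue-side bookkeeping (hypothetical helper, counts in packets)."""
    q_eq: int        # equilibrium queue length Qeq
    q_len: int = 0   # current queue length Qlen
    enq: int = 0     # packets enqueued since the last sample
    deq: int = 0     # packets dequeued since the last sample

    def enqueue(self):
        self.q_len += 1
        self.enq += 1

    def dequeue(self):
        if self.q_len > 0:
            self.q_len -= 1
            self.deq += 1

    def sample(self):
        """Return (Qoff, Qdelta) with the ranges given on the slide."""
        q_off = clamp(self.q_eq - self.q_len, -self.q_eq, self.q_eq)
        q_delta = clamp(self.enq - self.deq, -2 * self.q_eq, 2 * self.q_eq)
        # Assumption: the counters restart after every sample.
        self.enq = 0
        self.deq = 0
        return q_off, q_delta


@dataclass
class RateLimiter:
    """Source-side reaction; defaults follow the simulation parameters."""
    rate: float                 # current rate R, in Mbps
    w: float = 2.0              # derivative weight W
    gi: float = 4.0             # increase gain Gi
    gd: float = 1 / 64          # decrease gain Gd
    ru: float = 8.0             # rate unit, in Mbps
    min_rate: float = 1.0       # assumed floor, in Mbps
    line_rate: float = 10_000.0 # assumed ceiling (10 Gbps link), in Mbps

    def on_bcn(self, q_off, q_delta):
        """Apply one BCN (Qoff, Qdelta) message and return Fb."""
        fb = q_off - self.w * q_delta
        if fb > 0:
            # Additive increase: R = R + Gi * Fb * ru
            self.rate = min(self.line_rate, self.rate + self.gi * fb * self.ru)
        elif fb < 0:
            # Multiplicative decrease: R = R * (1 - Gd * |Fb|)
            self.rate = max(self.min_rate, self.rate * (1 - self.gd * abs(fb)))
        return fb
```

With Qeq = 16, a queue that grew to 18 packets after 20 enqueues and 2 dequeues yields Qoff = -2 and Qdelta = 18, so Fb = -2 - 2 * 18 = -38 and a limiter at 1000 Mbps is cut to 1000 * (1 - 38/64) = 406.25 Mbps.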