Backward Congestion Notification Version 2.0

Davide Bergamasco (davide@cisco.com)
Rong Pan (ropan@cisco.com)
Cisco Systems, Inc.
IEEE 802.1 Interim Meeting
Garden Grove, CA (USA)
September 22, 2005
1
Credits
• Valentina Alaria (Cisco)
• Andrea Baldini (Cisco)
• Flavio Bonomi (Cisco)
• Manoj K. Wadekar (Intel)
2
BCN v2.0
• Desire from Mick to see an analytical study
of BCN stability
• BCN v2.0 improvements
• Linear control loop allows analysis of stability
• Simplified detection mechanism
• Reduced signaling rate
• Original BCN framework remains the same
3
BCN Background
[Figure: Data Center Network. End Nodes A, B, and C attach to Edge Switches A, B, and C, which connect to a Core Switch; all links are 10 Gbps. Traffic flows across the network, a congestion point is marked, and BCN Messages travel back from the congestion point.]
4
Detection & Signaling
[Figure: congestion-point queue between EMPTY QUEUE and FULL QUEUE, with thresholds Qeq and Qsc. Frames entering (IN) are sampled with probability P. A "MESSAGE TO GENERATE" decision chooses among BCN (0,0), BCN (Qoff, Qdelta), and no message (NOP), depending on whether the frame is sampled and whether it carries an RL tag.]
• Qoff = Qeq - Qlen, in [-Qeq, +Qeq]
• Qdelta = #pktEnq - #pktDeq, in [-2Qeq, +2Qeq]
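The two quantities carried in a BCN message can be sketched as a small Python model. This is only an illustration under stated assumptions: the class and method names are hypothetical, Qdelta is assumed to count enqueues and dequeues since the previous sample, and the Qsc threshold, RL-tag handling, and BCN (0,0) case from the slide are not modelled.

```python
import random


def clamp(value, low, high):
    """Saturate value into the closed interval [low, high]."""
    return max(low, min(high, value))


class CongestionPoint:
    """Toy congestion point tracking Qoff and Qdelta (hypothetical names)."""

    def __init__(self, q_eq, sample_prob, rng=None):
        self.q_eq = q_eq                # equilibrium queue length Qeq
        self.sample_prob = sample_prob  # sampling probability P
        self.q_len = 0                  # current queue length Qlen
        self.enq = 0                    # enqueues since the last sample (assumption)
        self.deq = 0                    # dequeues since the last sample (assumption)
        self.rng = rng or random.Random(0)

    def enqueue(self):
        """Enqueue one frame; return (Qoff, Qdelta) if it is sampled, else None."""
        self.q_len += 1
        self.enq += 1
        if self.rng.random() >= self.sample_prob:
            return None
        q_off = clamp(self.q_eq - self.q_len, -self.q_eq, self.q_eq)
        q_delta = clamp(self.enq - self.deq, -2 * self.q_eq, 2 * self.q_eq)
        self.enq = 0
        self.deq = 0
        return q_off, q_delta

    def dequeue(self):
        """Dequeue one frame if the queue is not empty."""
        if self.q_len > 0:
            self.q_len -= 1
            self.deq += 1
```

A negative Qoff means the queue is above Qeq, and a positive Qdelta means it grew since the last sample; both push the feedback toward a rate decrease.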
5
Reaction
• Feedback
Fb = (Qoff - W * Qdelta)
• Additive Increase (Fb > 0)
R = R + Gi * Fb * ru
• Multiplicative Decrease (Fb < 0)
R = R * ( 1 - Gd * |Fb| )
• Parameters
W = derivative weight
Gi = increase gain
Gd = decrease gain
ru = rate unit
[Figure: edge node. Data IN is matched against filters F1, F2, ... Fn, each feeding a rate limiter R1, R2, ... Rn, with a "No Match" path for unmatched frames. Packets sent through a rate limiter (Data OUT) are marked with RATE_LIMITED_TAG. BCN Messages from the congested point in the network core arrive on Control IN.]
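The reaction rules above can be written as a short Python sketch. The function names and the optional rate bounds are assumptions added for illustration; the formulas themselves are the ones on the slide.

```python
def feedback(q_off, q_delta, w):
    """Fb = Qoff - W * Qdelta."""
    return q_off - w * q_delta


def update_rate(rate, fb, gi, gd, ru, min_rate=0.0, max_rate=float("inf")):
    """Apply one BCN reaction step to a rate limiter.

    fb > 0: additive increase,        R = R + Gi * Fb * ru
    fb < 0: multiplicative decrease,  R = R * (1 - Gd * |Fb|)
    min_rate and max_rate are hypothetical safety bounds, not from the slides.
    """
    if fb > 0:
        rate = rate + gi * fb * ru
    elif fb < 0:
        rate = rate * (1 - gd * abs(fb))
    return max(min_rate, min(max_rate, rate))


if __name__ == "__main__":
    # Parameters from the simulation slides: W = 2, Gi = 4, Gd = 1/64, Ru = 8 Mbps.
    fb = feedback(q_off=-4, q_delta=8, w=2)
    print(fb, update_rate(10e9, fb, gi=4, gd=1 / 64, ru=8e6))
```

With these values the feedback is negative, so the limiter takes the multiplicative-decrease branch; the lower bound keeps the rate from going negative when Gd * |Fb| exceeds 1.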
6
Suggested BCN Message Format
 0                            15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA = SA of sampled frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+    SA = MAC Address of CP     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   IEEE 802.1Q Tag or S-Tag                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = BCN        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Qoff              |            Qdelta             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
|        First N bytes of sampled frame starting from DA        |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
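As a rough illustration of the suggested layout, the fields can be serialized with Python's `struct` module. Several things here are assumptions rather than part of the proposal: the field widths are read off the diagram (16-bit signed Qoff and Qdelta, 16-bit Timestamp, 64-bit CPID), the EtherType value is a placeholder taken from the IEEE 802 local experimental range because the slides assign none, the default version value is arbitrary, and the FCS is left to the MAC layer.

```python
import struct

# Placeholder only: an IEEE 802 local experimental EtherType, NOT an assigned BCN value.
BCN_ETHERTYPE = 0x88B5

# DA(6) SA(6) 802.1Q tag(4) EtherType(2) Version+Reserved(2) CPID(8) Qoff(2) Qdelta(2) Timestamp(2)
_BCN_HEADER = struct.Struct("!6s6sIHHQhhH")


def pack_bcn(da, sa, vlan_tag, cpid, q_off, q_delta, timestamp, sampled, version=0):
    """Build a BCN message body (without FCS) following the suggested format.

    da      -- source MAC of the sampled frame (becomes the BCN destination)
    sa      -- MAC address of the congestion point
    sampled -- first N bytes of the sampled frame, starting from its DA
    """
    if len(da) != 6 or len(sa) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    version_reserved = (version & 0xF) << 12  # 4-bit Version, 12 reserved bits set to 0
    header = _BCN_HEADER.pack(
        da, sa, vlan_tag, BCN_ETHERTYPE, version_reserved,
        cpid, q_off, q_delta, timestamp,
    )
    return header + bytes(sampled)
```

Network byte order (`!`) is used throughout, which is the usual convention for Ethernet headers.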
7
Suggested RLT Tag Format
 0     3       7              15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA of rate-limited frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+   SA of rate-limited frame    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        IEEE 802.1Q Tag or S-Tag of rate-limited frame         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = RLT        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |EtherType of rate limited frame|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                 Payload of rate-limited frame                 +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
8
Simulation Environment (1)
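[Figure: simulation topology. Edge switches ES1-ES6 connect to a Core Switch, and a congestion point is marked. End nodes are labelled ST1-ST4 and SU1-SU4 (the TCP and UDP sources of the workload), SR1, SR2, SJ, DT, DU, DR1, and DR2. Traffic types shown: TCP Bulk and UDP On/Off.]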
9
Simulation Environment (2)
• Short Range, High Speed DC Network
• Link Capacity = 10 Gbps
• Switch latency = 1 µs
• Link Length = 100 m (0.5 µs propagation delay)
• Control loop
• Delay ~ 3 µs
• Parameters
• W=2
• Gi = 4
• Gd = 1/64
• Ru = 8 Mbps
• Workload
• ST1-ST4: 10 parallel TCP connections transferring
1 MB each continuously
• SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms
10
BCNv1.0
11
BCNv2.0
• Faster transient response
• Higher stability at steady state
12
Simulation Environment (3)
• Long Range, High Speed DC Network
• Link Capacity = 10 Gbps
• Switch latency = 1 µs
• Link Length = 20000 m (100 µs propagation delay)
• Control loop
• Delay ~ 200 µs
• Parameters
• W=2
• Gi = 4
• Gd = 1/64
• Ru = 8 Mbps
• Workload
• ST1-ST4: 10 parallel TCP connections transferring
1 MB each continuously
• SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms
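The propagation delays quoted for the short-range and long-range setups follow from the link length. A minimal check in Python, assuming a signal speed of 2 x 10^8 m/s (about 5 ns per metre, the value implied by the slide's own figures); the function name is hypothetical:

```python
PROPAGATION_SPEED_M_PER_S = 2.0e8  # assumption: roughly two thirds of the speed of light


def propagation_delay_us(link_length_m):
    """One-way propagation delay of a link, in microseconds."""
    return link_length_m * 1e6 / PROPAGATION_SPEED_M_PER_S


if __name__ == "__main__":
    for length_m in (100, 20000):
        one_way = propagation_delay_us(length_m)
        print(f"{length_m} m: one-way {one_way} us, round trip {2 * one_way} us")
```

The round-trip propagation time alone accounts for the long-range loop delay of about 200 µs; in the short-range case switch latency contributes a comparable share of the loop delay.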
13
BCNv1.0
14
BCNv2.0
• Much higher stability at steady state with larger loop delays
15
Summary
• BCN v2 has a number of advantages …
• Can be studied analytically
• Better protection of TCP flows in mixed TCP and UDP traffic
scenarios
• Detection algorithm independent of Switch implementation
• Better Performance
• Lower signaling frequency (from 10% to 1%)
• Better stability
• Increased tolerance to loop delays
• … and one disadvantage
• Slower convergence to fairness
16
A Control-Theoretic Approach to BCN
Design and Analysis
17
Notation
N: Number of Flows
C: Link Capacity
τ: Round Trip Delay
w: Weight of the Derivative
Pm: Sampling Probability
Gi: Additive Increase Gain
Gd: Multiplicative Decrease Gain
18
Block Diagram of BCN Congestion Control
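[Figure: block diagram of the BCN control loop. The aggregate rate N * R minus the link capacity C drives the queue q. The congestion point samples with probability Pm and computes the feedback below; after a time delay, the feedback changes the source rate by ∆R through the gains Gi and Gd.]
Fb(T) = (qeq - q(T)) - w * (q(T) - q(T - 1))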
19
Non-linear Differential Equations
Link Control
\[ \frac{dq(t)}{dt} = N R(t) - C \]
\[ F_b(t) = (q_{eq} - q(t)) - \frac{w}{C P_m} \frac{dq(t)}{dt} \]
Source Control
If \( F_b(t-\tau) > 0 \):
\[ \frac{dR(t)}{dt} = G_i \, F_b(t-\tau) \, R(t-\tau) \, P_m \]
If \( F_b(t-\tau) < 0 \):
\[ \frac{dR(t)}{dt} = R(t) \, G_d \, F_b(t-\tau) \, R(t-\tau) \, P_m \]
20
Linearization Around Operating Point
• Using feedback control to analyze local stability
• Operating point:
• R = C/N
• q' = qeq - q = 0
• Linearization
• Difficulty: depending on sgn(Fb(t-τ)), the system responses are different
– Luckily, a piecewise-linear function
• Details are in the appendix
21
Block Diagram of BCN Feedback Control
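[Figure: linearized BCN feedback loop. The source rate R drives the queue q through the integrator N/s (annotated "lose 90° margin"). The feedback path contains the delay and the feedback block, annotated "add lead zero to compensate":
\[ e^{-s\tau}, \qquad F_b(s):\ 1 + \frac{w s}{C P_m} \]
The source block is one of:
\[ \text{Multiplicative Decrease: } \frac{G_d P_m C^2 / N^2}{s + w G_d C / N} \qquad \text{Additive Increase: } \frac{G_i P_m C / N}{s + G_i w} \]
]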
22
The Effect Of Zero From Time Domain’s Eyes
[Figure: time-domain traces of R and q, with the contribution of the zero (dq/dt) marked.]
23
Choosing Parameters – an example
• Network conditions (10G link)
• N = 50
• τ = 200 µs
• Choose parameters such that the feedback loop is stable with a 35° margin
• w = 4
• Gi = 2 Mbps
• Gd = 1/128
• Pm = 0.01
24
Stability Result (plot annotation: "lost 90° margin"):
1. With N = 50, delay = 200 µs, the system is stable
2. Phase margin translates into allowing extreme network conditions of N -> 1000 flows or τ -> 1 ms before oscillation
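The stability statement above can be explored numerically with a small phase-margin calculation. This is a sketch under several assumptions, not the authors' analysis: it uses only the multiplicative-decrease branch of the linearized loop, takes the gain G as 1, and expresses C in packets per second assuming 1500-byte frames. All function and variable names are hypothetical, and the resulting margin depends on these choices, so it need not match the 35° quoted on the slide.

```python
import cmath
import math


def loop_gain(omega, n, c, tau, w, gd, p_m):
    """Open-loop gain L(j*omega) of the linearized multiplicative-decrease loop."""
    s = 1j * omega
    queue = n / s                                  # integrator: q(s) = N * R(s) / s
    lead = 1 + w * s / (c * p_m)                   # feedback with derivative (lead zero)
    delay = cmath.exp(-s * tau)                    # round trip delay
    source = (gd * p_m * c ** 2 / n ** 2) / (s + w * gd * c / n)
    return queue * lead * delay * source


def phase_margin_deg(n, c, tau, w, gd, p_m, lo=1.0, hi=1e8):
    """Return (crossover frequency in rad/s, phase margin in degrees).

    Bisects in log-frequency for |L| = 1, assuming |L| > 1 at lo and |L| < 1 at hi.
    """
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if abs(loop_gain(mid, n, c, tau, w, gd, p_m)) > 1.0:
            lo = mid
        else:
            hi = mid
    wc = math.sqrt(lo * hi)
    # Sum the phase contributions analytically to avoid angle wrap-around.
    phase = (-math.pi / 2
             + math.atan(w * wc / (c * p_m))
             - math.atan(wc * n / (w * gd * c))
             - wc * tau)
    return wc, math.degrees(math.pi + phase)


# Assumption: 10 Gbps link carrying 1500-byte frames, expressed in packets per second.
C_PKTS_PER_S = 10e9 / (1500 * 8)

if __name__ == "__main__":
    for w, tau in ((4, 200e-6), (4, 1e-3), (1, 200e-6)):
        wc, margin = phase_margin_deg(50, C_PKTS_PER_S, tau, w, 1 / 128, 0.01)
        print(f"w={w}, tau={tau * 1e6:.0f} us: crossover {wc:.0f} rad/s, margin {margin:.1f} deg")
```

Under these assumptions the margin shrinks when the delay grows or when w is reduced, which is the qualitative trend the slides describe.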
25
Simulation Result Shows A Stable System for N
= 50; Delay = 200us
26
Simulation Result Shows System is stable, but on the
verge of oscillation: N = 50, Delay = 1ms
27
Change W = 4 -> 1
1. When w = 1, a system with N = 50, delay
= 200us already runs out of margin, on
the verge of oscillation
2. w = 1, diminishing zero effect. System
can’t cope with wide range of network
conditions
28
Indeed System is stable, but on the verge of oscillation
even for N = 50, Delay = 200us when w = 1.0
29
Requests to 802.1
• Start a Task Force on Congestion Management
• Use BCN as a Baseline Proposal
30
Appendix
31
Linearizing…
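\[ \delta\dot{q}(t) = N \, \delta R(t) \;\Rightarrow\; \delta q(s) = \frac{N \, \delta R(s)}{s} \]
\[ \delta F_b(t) = -G \left( \delta q(t) + \frac{w \, \delta\dot{q}(t)}{C P_m} \right) \;\Rightarrow\; \delta F_b(s) = -G \left( 1 + \frac{w s}{C P_m} \right) \delta q(s) \]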
32
Linearizing Additive Increase Function
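\[ f:\quad \frac{dR(t)}{dt} = G_i \, R(t) \, P_m \, F_b(t-\tau) \]
\[ \frac{\partial f}{\partial F_b(t-\tau)} = G_i \, R(t) \, P_m = G_i \frac{C P_m}{N} \]
\[ \frac{\partial f}{\partial R(t)} = \frac{\partial}{\partial R(t)} \left( G_i \, R(t) \, P_m \, G \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) \right) \]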
33
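Linearizing Additive Increase Function
\[ \frac{\partial f}{\partial R(t)} = \frac{\partial}{\partial R(t)} \left( G_i \, R(t) \, P_m \, G \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) \right) \]
\[ = G_i P_m G \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) + G \, G_i \, R(t) \, P_m \frac{\partial}{\partial R(t)} \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) \]
At the operating point (R = C/N, q = qeq):
\[ = G \, G_i \, R(t) \, P_m \frac{\partial}{\partial R(t)} \left( -\left( N R(t-\tau) - C \right) \frac{w}{C P_m} \right) \]
\[ \approx G \, G_i \, R(t) \, P_m \frac{\partial}{\partial R(t)} \left( -\left( N R(t) - C \right) \frac{w}{C P_m} \right) \]
\[ = -G \, G_i \, R(t) \, P_m \, N \frac{w}{C P_m} = -G \, G_i \, w \]
\[ \Rightarrow\; s \, \delta R = G_i P_m \frac{C}{N} \, \delta F_b - G \, G_i \, w \, \delta R \]
\[ \Rightarrow\; \delta R = \frac{G_i P_m \frac{C}{N}}{s + G \, G_i \, w} \, \delta F_b \]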
34
Linearizing Multiplicative Decrease Function
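\[ g:\quad \frac{dR(t)}{dt} = G_d \, R(t) \, R(t-\tau) \, P_m \, F_b(t-\tau) \]
\[ \frac{\partial g}{\partial F_b(t-\tau)} = G_d \, R(t) \, R(t-\tau) \, P_m = \frac{G_d P_m C^2}{N^2} \]
\[ \frac{\partial g}{\partial R(t)} = \frac{\partial}{\partial R(t)} \left( G_d \, R(t) \, R(t-\tau) \, P_m \, G \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) \right) \]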
35
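Linearizing Multiplicative Decrease Function
\[ \frac{\partial g}{\partial R(t)} = \frac{\partial}{\partial R(t)} \left( G_d \, R(t) \, R(t-\tau) \, P_m \, G \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) \right) \]
\[ = 2 \, G_d \, R(t) \, P_m \, G \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) + G_d \, R^2(t) \, P_m \, G \frac{\partial}{\partial R(t)} \left( q_{eq} - q(t-\tau) - \frac{w}{C P_m} \frac{dq(t-\tau)}{dt} \right) \]
At the operating point (R = C/N, q = qeq):
\[ = G_d \, R^2(t) \, P_m \, G \frac{\partial}{\partial R(t)} \left( -\left( N R(t-\tau) - C \right) \frac{w}{C P_m} \right) \]
\[ \approx G_d \, R^2(t) \, P_m \, G \frac{\partial}{\partial R(t)} \left( -\left( N R(t) - C \right) \frac{w}{C P_m} \right) \]
\[ = -G_d \, R^2(t) \, P_m \, G \, N \frac{w}{C P_m} = -G_d \, R(t) \, G \, w = -G_d \frac{C}{N} G \, w \]
\[ \Rightarrow\; s \, \delta R = \frac{G_d P_m C^2}{N^2} \, \delta F_b - G \, w \frac{G_d C}{N} \, \delta R \]
\[ \Rightarrow\; \delta R = \frac{\frac{G_d P_m C^2}{N^2}}{s + G \, w \frac{G_d C}{N}} \, \delta F_b \]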
36
Issue #1: Non-linearity
• ISSUE: Overshoots and undershoots accumulate over time
• SOLUTION: Signal only when
• Q > Qeq && dQ/dt > 0
• Q < Qeq && dQ/dt < 0
• Easy to implement in hardware: just an Up/Down counter
• Increment @ every enqueue
• Decrement @ every dequeue
• Reduces signaling rate by 50%!!
[Figure: queue length Q versus time t, oscillating around Qeq, with "+" and "-" regions marked and the regions where generation of BCN messages stops.]
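The signalling rule and the up/down counter can be sketched in a few lines of Python. The class and method names are hypothetical, and resetting the counter at each sampling instant is an assumption about how the sign of dQ/dt would be estimated.

```python
class UpDownDetector:
    """Signal only when the queue is moving away from Qeq (illustrative sketch)."""

    def __init__(self, q_eq):
        self.q_eq = q_eq
        self.q_len = 0
        self.counter = 0  # up/down counter; its sign approximates the sign of dQ/dt

    def on_enqueue(self):
        self.q_len += 1
        self.counter += 1   # increment at every enqueue

    def on_dequeue(self):
        if self.q_len > 0:
            self.q_len -= 1
            self.counter -= 1  # decrement at every dequeue

    def should_signal(self):
        """Evaluate the rule at a sampling instant, then reset the counter (assumption)."""
        growing = self.counter > 0
        shrinking = self.counter < 0
        signal = (self.q_len > self.q_eq and growing) or \
                 (self.q_len < self.q_eq and shrinking)
        self.counter = 0
        return signal
```

When the queue is above Qeq but already draining, or below Qeq but already filling, no message is sent, which is where the reduction in signalling comes from.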
37
Issue #2: Specific Detection Mechanism
[Figure: threshold-based detection flow. The queue between EMPTY QUEUE and FULL QUEUE is divided by thresholds T-4 ... T+4 around EQUILIBRIUM, and frames entering (IN) are sampled with probability P. For a sampled frame, the threshold region, the sign of dQ/dt, and whether the frame carries an RL Tag with the Solicit Bit set select the message to generate: one of BCN-4 ... BCN-1, BCN 0, BCN+1 ... BCN+4, or no message (NOP).]