Tutorial Survey of LL-FC Methods for Datacenter Ethernet
101 Flow Control

M. Gusat
Contributors: Ton Engbersen, Cyriel Minkenberg, Ronald Luijten and Clark Jeffries
26 Sept. 2006
IBM Zurich Research Lab
Outline
• Part I
  – Requirements of datacenter link-level flow control (LL-FC)
  – Brief survey of the top 3 LL-FC methods
    • PAUSE, a.k.a. On/Off grants
    • Credit
    • Rate
  – Baseline performance evaluation
• Part II
  – Selectivity and scope of LL-FC
  – Per-what?: LL-FC's resolution
Req'ts of 802.3x': Next Generation of Ethernet Flow Control for Datacenters
1. Lossless operation
  – No-drop expectation of datacenter apps (storage, IPC)
  – Low latency
2. Selective
  – Discrimination granularity: link, prio/VL, VLAN, VC, flow...?
  – Scope: backpressure upstream one hop, k hops, e2e...?
3. Simple...
  – PAUSE-compatible!
Generic LL-FC System
[Figure: one link between a TX (SRC) and an RX (DST) buffer, with the round trip time (RTT) marked on the link.]
• One link with 2 adjacent buffers: TX (SRC) and RX (DST)
• Round trip time (RTT) per link is the system's time constant
• LL-FC issues:
  – link traversal (channel BW allocation)
  – RX buffer allocation
  – pairwise communication between the channel's terminations
    • signaling overhead (PAUSE, credit, rate commands)
    • backpressure (BP):
      – increase / decrease injections
      – stop and restart protocol
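As a quick illustration of why the RTT is the natural time constant (a back-of-the-envelope sketch with assumed numbers, not values from the talk): once the RX signals backpressure, roughly one RTT worth of line-rate data is still in flight and must be absorbed by the RX buffer.

    # Sketch: data still in flight after the RX signals backpressure.
    # Line rate and RTT are assumed example values, not from the slides.
    line_rate_bps = 10e9          # assumed 10 Gb/s link
    rtt_s = 2e-6                  # assumed 2 us round-trip time
    in_flight_bytes = line_rate_bps * rtt_s / 8
    print(f"RX headroom needed: {in_flight_bytes / 1024:.1f} KiB")   # ~2.4 KiB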
FC-Basics: PAUSE (On/Off Grants)
[Figure: TX queues feed the data link into the RX buffer and its OQ, which drains through the crossbar (Xbar) to the down-stream links; when RX occupancy crosses the threshold ("over-run"), a STOP is sent on the FC return path, followed by a GO once it drains again.]
• PAUSE BP semantics: STOP / GO / STOP...
* Note: Selectivity and granularity of FC domains are not considered here.
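To make the STOP/GO mechanics concrete, here is a minimal discrete-time sketch in Python (buffer size, threshold, drain rate and the slot-level timing are all illustrative assumptions; this is not the simulator behind the results later in the deck):

    # Minimal on/off (PAUSE) LL-FC on one link, in discrete time slots.
    from collections import deque

    RTT = 4                            # link round-trip time, in slots
    M = 2 * RTT + 1                    # RX buffer size (see the sizing slides)
    STOP_THRESHOLD = M - RTT           # keep ~RTT of headroom for in-flight data

    data_pipe = deque([None] * (RTT // 2), maxlen=RTT // 2)    # TX -> RX delay
    fc_pipe = deque(["GO"] * (RTT // 2), maxlen=RTT // 2)      # RX -> TX delay

    rx_occupancy = 0
    sent = drained = dropped = 0

    for slot in range(1000):
        # TX acts on the FC command that has finished the return trip.
        tx_state = fc_pipe[0]
        pkt = "PKT" if tx_state == "GO" else None
        if pkt:
            sent += 1
        arriving = data_pipe[0]
        data_pipe.append(pkt)

        # RX accepts the arrival (or drops on overflow) and drains at 50% rate.
        if arriving:
            if rx_occupancy < M:
                rx_occupancy += 1
            else:
                dropped += 1
        if slot % 2 == 0 and rx_occupancy > 0:
            rx_occupancy -= 1
            drained += 1

        # RX issues STOP above the threshold, GO below it.
        fc_pipe.append("STOP" if rx_occupancy >= STOP_THRESHOLD else "GO")

    print(f"sent={sent} drained={drained} dropped={dropped}")
    # With the threshold leaving ~RTT of headroom, no drops occur in this toy model.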
FC-Basics: Credits
[Figure: the same TX/RX link and crossbar (Xbar), now flow-controlled by credits instead of STOP/GO.]
* Note: Selectivity and granularity of FC domains are not considered here.
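A matching sketch for credit-based FC (again with assumed slot-level timing; the credit semantics, 1 credit = 1 RX memory location, follow the next slide):

    # Minimal credit-based LL-FC on the same one-link model.
    from collections import deque

    RTT = 4
    CREDITS = RTT + 1                  # one credit per RX buffer location

    data_pipe = deque([None] * (RTT // 2), maxlen=RTT // 2)    # TX -> RX delay
    credit_pipe = deque([0] * (RTT // 2), maxlen=RTT // 2)     # RX -> TX delay

    tx_credits = CREDITS
    rx_occupancy = 0
    sent = drained = 0

    for slot in range(1000):
        # Credits that finished the return trip become usable again.
        tx_credits += credit_pipe[0]

        # TX sends one packet per slot if it holds a credit; otherwise it waits.
        pkt = None
        if tx_credits > 0:
            tx_credits -= 1
            pkt = "PKT"
            sent += 1
        arriving = data_pipe[0]
        data_pipe.append(pkt)

        # RX accepts the arrival (a credit guarantees a free location), drains
        # at full rate, and returns one credit per drained packet.
        if arriving:
            rx_occupancy += 1
        returned = 0
        if rx_occupancy > 0:
            rx_occupancy -= 1
            drained += 1
            returned = 1
        credit_pipe.append(returned)

    print(f"sent={sent} drained={drained} utilisation={drained / 1000:.2f}")
    # Lossless by construction; with CREDITS = 1 the TX can send only one packet
    # per credit round trip, i.e. roughly 1/RTT of the link (see the next slide).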
Correctness: Minimum Memory for "No Drop"
• "Minimum": to operate lossless => O(RTT_link)
  – Credit: 1 credit = 1 memory location
  – Grant: 5 (= RTT + 1) memory locations
• Credits
  – Under full load the single credit is constantly looping between RX and TX
  – RTT = 4 => max. performance = f(up-link utilisation) = 1/RTT = 25%
• Grants
  – Determined by slow restart: if the last packet has left the RX queue, it takes an RTT until the next packet arrives
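A quick numeric check of that 25% figure (assuming, as above, that a credit needs one full RTT to loop back, so at most C packets can be in flight per RTT):

    # Credit-limited utilisation: at most `credits` packets per RTT in flight.
    RTT = 4
    for credits in (1, 2, 4, 5):
        print(f"credits={credits}: max utilisation = {min(1.0, credits / RTT):.0%}")
    # credits=1 -> 25% (the case on this slide); credits >= RTT -> 100%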
PAUSE vs. Credit @ M = RTT + 1
• "Equivalent" = 'fair' comparison
  1. Credit scheme: 5 credits = 5 memory locations
  2. Grant scheme: 5 (= RTT + 1) memory locations
• Performance loss for PAUSE/grants is due to the lack of underflow protection: if M < 2*RTT the link is not work-conserving (pipeline bubbles on restart).
• For performance equivalent to credits, M = 9 is required for PAUSE.
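One way to reproduce the M = 9 figure (a sketch of the usual sizing argument with RTT = 4; the decomposition below is an assumption, only the 2*RTT condition and the end result appear on the slide):

    # Assumed decomposition of the PAUSE buffer for a work-conserving link:
    # headroom above the STOP threshold to absorb in-flight data, plus backlog
    # below the GO threshold to keep draining while GO propagates back.
    RTT = 4
    overflow_headroom = RTT        # data still arriving after STOP is sent
    underflow_backlog = RTT        # data to drain until restarted traffic arrives
    M_pause = overflow_headroom + underflow_backlog + 1
    print(M_pause)                 # -> 9, vs. M = RTT + 1 = 5 for credits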
FC-Basics: Rate
• RX queue Qi = 1 (full capacity).
• Max. flow (input arrivals) during one timestep (Δt = 1) is 1/8.
• Goal: update the TX probability Ti from any sending node during the time interval [t, t+1) to obtain the new Ti applied during the time interval [t+1, t+2).
• Algorithm for obtaining Ti(t+1) from Ti(t) ... => (a generic sketch follows below)
• Initially the offered rate from source0 was set to .100, and from source1 to .025. All other processing rates were .125. Hence all queues show low occupancy.
• At timestep 20, the flow rate to the sink was reduced to .050, causing a congestion level in Queue2 of .125/.050 = 2.5 times processing capacity.
• Results: the average queue occupancies are .23 to .25, except Q3 = .13. The source flows are treated about equally and their long-term sum is about .050 (optimal).
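The actual Ti(t) -> Ti(t+1) rule is not reproduced in this text, so the following is only a generic rate-control sketch in its spirit: additively increase the TX probability while the RX queue is below an occupancy setpoint, multiplicatively decrease it otherwise. All constants and the single-source toy run are assumptions.

    # Generic rate-based LL-FC update (assumed AIMD rule, not the algorithm
    # from the talk), driven by the RX queue occupancy.
    Q_CAP = 1.0                    # RX queue capacity (normalised, as on the slide)
    MAX_FLOW = 1.0 / 8             # max arrivals per timestep, as on the slide
    TARGET = 0.25                  # assumed occupancy setpoint
    INC, DEC = 0.01, 0.5           # assumed increase step / decrease factor

    def update_rate(t_i: float, occupancy: float) -> float:
        """Return Ti(t+1) given Ti(t) and the current queue occupancy."""
        if occupancy < TARGET:
            return min(MAX_FLOW, t_i + INC * MAX_FLOW)   # gentle additive increase
        return t_i * DEC                                  # back off when congested

    # Toy fluid run with one source: the service rate drops at timestep 20,
    # loosely mirroring the slowed-down sink described on the slide.
    t_i, occupancy = 0.100, 0.0
    for step in range(200):
        service = 0.125 if step < 20 else 0.050
        occupancy = min(Q_CAP, max(0.0, occupancy + t_i - service))
        t_i = update_rate(t_i, occupancy)
    print(f"after 200 steps: Ti={t_i:.3f}, occupancy={occupancy:.2f}")
    # Ti oscillates in a sawtooth around the new service rate, as AIMD controllers do.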
Conclusion Part I: Which Scheme is "Better"?
• PAUSE
  + simple
  + scalable (lower signalling overhead)
  - 2x M size required
• Credits (absolute or incremental)
  + always lossless, independent of the RTT and memory size
  + adopted by virtually all modern ICTNs (IBA, PCIe, FC, HT, ...)
  - not trivial for buffer-sharing
  - protocol reliability
  - scalability
• At equal M = RTT, credits show 30+% higher Tput vs. PAUSE
  * Note: Stability of both was formally proven here
• Rate: in-between PAUSE and credits
  + adopted in adapters
  + potentially a good match for BCN (e2e CM)
  - complexity (cheap fast bridges)
Part II: Selectivity and Scope of LL-FC
"Per-Prio/VL PAUSE"
• The FC-ed 'link' could be a
  – physical channel (e.g. 802.3x)
  – virtual lane (VL, e.g. IBA 2-16 VLs)
  – virtual channel (VC, larger figure)
  – ...
• Per-Prio/VL PAUSE is the often-proposed PAUSE v2.0 ...
• Yet, is it good enough for the next decade of datacenter Ethernet?
• Evaluation of IBA vs. PCIe/AS vs. NextGen-Bridge (Prizma CI)
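For concreteness, here is a minimal sketch of what per-prio/VL PAUSE means on the TX side: one STOP/GO state per traffic class instead of one per link. The class count and the strict-priority scheduler are illustrative assumptions, not a reproduction of any particular standard.

    # Per-prio/VL PAUSE at the TX: one STOP/GO bit per class, not per link.
    from collections import deque

    N_CLASSES = 8                                    # e.g. 802.1p priorities
    paused = [False] * N_CLASSES                     # per-class STOP/GO state
    tx_queues = [deque() for _ in range(N_CLASSES)]  # one TX queue per class

    def on_fc_frame(class_id: int, stop: bool) -> None:
        """Apply a per-class STOP/GO command from the FC return path."""
        paused[class_id] = stop

    def pick_next_packet():
        """Strict-priority pick among classes that are not paused."""
        for cls in range(N_CLASSES - 1, -1, -1):     # highest class first
            if not paused[cls] and tx_queues[cls]:
                return cls, tx_queues[cls].popleft()
        return None                                  # everything paused or empty

    # Pausing class 3 leaves the other classes unaffected:
    tx_queues[3].append("pkt-A")
    tx_queues[1].append("pkt-B")
    on_fc_frame(3, stop=True)
    print(pick_next_packet())                        # -> (1, 'pkt-B')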
Already Implemented in IBA (and other ICTNs...)
• IBA has 15 FC-ed VLs for QoS
  – SL-to-VL mapping is performed per hop, according to capabilities
• However, IBA doesn't have VOQ-selective LL-FC
  – "selective" = per switch (virtual) output port
• So what?
  – Hogging, a.k.a. buffer monopolization, HOL1-blocking, output queue lockup, single-stage congestion, saturation tree (k=0)
• How can we prove that hogging really occurs in IBA?
  – A. Back-of-the-envelope reasoning
  – B. Analytical modeling of stability and work-conservation
  – C. Comparative simulations: IBA, PCI-AS etc. (next slides)
  – (papers available)
IBA SE Hogging Scenario
• Simulation: parallel backup to a RAID across an IBA switch
  – TX / SRC
    • 16 independent IBA sources, e.g. 16 "producer" CPUs/threads
    • SRC behavior: greedy, using any communication model (UD)
    • SL: BE service discipline on a single VL
      – (the other VLs suffer from their own)
  – Fabrics (single stage)
    • 16x16 IBA generic SE
    • 16x16 PCI-AS switch
    • 16x16 Prizma CI switch
  – RX / DST
    • 16 HDD "consumers"
    • t0: initially each HDD sinks data at full 1x (100%)
    • tsim: during simulation HDD[0] enters thermal recalibration or sector remapping; consequently
      » HDD[0] progressively slows down its incoming link throughput: 90, 80, ..., 10%
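To see the qualitative effect, here is a deliberately simplified toy model of hogging behind a non-selective LL-FC: one shared RX buffer in front of a 16x16 switch, uniform traffic, and one slow sink. The buffer size, rates and traffic pattern are assumptions; this illustrates the mechanism only and is not the simulator used for the results that follow.

    # Toy model of buffer hogging (HOL1-blocking) behind non-selective LL-FC.
    import random

    N = 16                         # ports
    B = 64                         # shared RX buffer size in packets (assumed)
    SLOW_SINK, SLOW_RATE = 0, 0.1  # HDD[0] drains at 10% of line rate
    SLOTS = 20_000

    buffer_per_dst = [0] * N       # shared-buffer occupancy, split by destination
    delivered = [0] * N

    for _ in range(SLOTS):
        # Each sink drains at most one packet per slot (the slow one only rarely).
        for dst in range(N):
            rate = SLOW_RATE if dst == SLOW_SINK else 1.0
            if buffer_per_dst[dst] > 0 and random.random() < rate:
                buffer_per_dst[dst] -= 1
                delivered[dst] += 1

        # Non-selective LL-FC: the upstream link may inject new packets only
        # while the shared buffer has free space, regardless of destination.
        free = B - sum(buffer_per_dst)
        for _ in range(min(N, free)):
            buffer_per_dst[random.randrange(N)] += 1

    slow_tput = delivered[SLOW_SINK] / SLOTS
    fast_tput = sum(delivered[1:]) / (SLOTS * (N - 1))
    print(f"slow sink throughput ~ {slow_tput:.2f}")
    print(f"fast sinks throughput ~ {fast_tput:.2f} each (vs. ~1.0 if the "
          f"buffer were not monopolized by packets for the slow sink)")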
First: Friendly Bernoulli Traffic
• 2 sources (A, B) sending @ (12x + 4x) to 16*1x end nodes (C..R) (fig. from the IBA spec)
[Plot: aggregate throughput vs. link 0 throughput reduction; the gap between the achievable performance and the achieved aggregate throughput is the throughput loss.]
Myths and Fallacies about Hogging
• Isn't IBA's static rate control sufficient?
  – No, because it is STATIC
• IBA's VLs are sufficient...?!
  – No.
    • VLs and ports are orthogonal dimensions of LL-FC
      1. VLs are for SL and QoS => VLs are assigned to prios, not ports!
      2. Max. no. of VLs = 15 << max (SE_degree x SL) = 4K
• Can the SE buffer partitioning solve hogging, blocking and sat_trees, at least in single-SE systems?
  – No.
    1. Partitioning makes sense only w/ status-based FC (per bridge output port - see PCIe/AS SBFC)
       • IBA doesn't have a native status-based FC
    2. Sizing becomes the issue => we need dedication per I and O ports
       • M = O(SL * max{RTT, MTU} * N^2): a very large number!
       • Academic papers and theoretical dissertations prove stability and work-conservation, but the amounts of required M are large
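A back-of-the-envelope instance of that sizing bound (only the formula comes from the slide; every parameter value below is an assumed example):

    # Example evaluation of M = O(SL * max(RTT, MTU) * N^2), assumed parameters.
    SL = 16                  # service levels / priorities (assumed)
    RTT_BYTES = 5_000        # one RTT expressed in bytes at line rate (assumed)
    MTU = 2_048              # bytes (assumed)
    N = 64                   # switch ports (assumed)
    M = SL * max(RTT_BYTES, MTU) * N ** 2
    print(f"M ~ {M / 2**20:.1f} MiB of buffering per switch")   # 312.5 MiB: very large for a cheap bridge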
Conclusion Part II: Selectivity and Scope of LL-FC
• Despite 16 VLs, IBA/DCE is exposed to the "transistor effect": any single flow can modulate the aggregate Tput of all the others
• Hogging (HOL1-blocking) requires a solution even for the smallest IBA/DCE system (single hop)
• Prios/VL and VOQ/VC are 2 orthogonal dimensions of LL-FC
  – Q: QoS violation as the price of 'non-blocking' LL-FC?
• Possible granularities of LL-FC queuing domains:
  – A. CM can serve in single-hop fabrics also as LL-FC
  – B. Introduce VOQ-FC: an intermediate, coarser grain
    • no. VCs = max{VOQ} * max{VL} = 64..4096 x 2..16 <= 64K VCs
    • Alternative: 802.1p (map prios to 8 VLs) + .1q (map VLANs to 4K VCs)?
    • Was proposed in 802.3ar...
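A quick check of the VC-count range quoted above (the endpoint values are the ones on the slide):

    # Number of flow-controlled VCs if LL-FC is applied per (VOQ, VL) pair.
    for voqs, vls in ((64, 2), (4096, 16)):
        print(f"{voqs} VOQs x {vls} VLs = {voqs * vls} VCs")
    # -> 128 VCs at the low end, 65536 (= 64K) VCs at the high end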
Backup
LL-FC Between Two Bridges
[Figure: in Switch[k], TX Port[k,j] holds VOQ[1]..VOQ[n], a TX scheduler and an LL-FC TX unit; it sends packets ("send packet") to RX Port[k+1,i] in Switch[k+1], which contains the RX buffer, an RX management unit (buffer allocation) and the LL-FC reception logic; the LL-FC token travels back on the return path.]