E2CM updates
IEEE 802.1 Interim @ Geneva
Cyriel Minkenberg & Mitch Gusat
IBM Research GmbH, Zurich
May 29, 2007
Outline
• Summary of E2CM proposal
– How it works
– What has changed
• New E2CM performance results
– Managing across a non-CM domain
– Performance in fat tree topology
– Mixed link speeds (1G/10G)
Refresher: E2CM Operation
[Figure: E2CM operation across src, switches 1-3, and dst.
At the congested switch: 1. Qeq exceeded; 2. BCN sent to source.
At the source (BCN path): 1. BCN arrives at source; 2. rate limiter installed; 3. probe with timestamp injected.
At the destination: 1. probe arrives at dst; 2. timestamp inserted; 3. probe returned to source.
At the source (probe path): 1. probe arrives at source; 2. path occupancy computed; 3. AIMD control applied using the same rate limiter.]
• Probing is triggered by BCN frames; only rate-limited flows are probed
• Per flow, BCN and probes employ the same rate limiter
– Insert one probe every X KB of data sent per flow, e.g. X = 75 KB
– Probes traverse the network in-band: the objective is to observe the real current queuing delay
– Variant: continuous probing (used here)
– Control per-flow (probe) as well as per-queue (BCN) occupancy
– CPID of probes = destination MAC
– Rate limiter is associated with the CPID from which the last negative feedback was received
– Increment only on probes from the associated CPID
– Parameters relating to probes may be set differently (in particular Qeq,flow, Qmax,flow, Gd,flow, Gi,flow)
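To make the per-flow mechanics concrete, here is a minimal sketch of the source-side probe pacing and rate-limiter association rules above (Python; the class, method, and rate-limiter names are illustrative, not part of the E2CM proposal):

```python
# Illustrative sketch only: probe pacing and the CPID association rule.
PROBE_INTERVAL_BYTES = 75_000  # X = 75 KB, as in the example above

class FlowState:
    def __init__(self, dst_mac):
        self.dst_mac = dst_mac         # CPID of probes = destination MAC
        self.bytes_since_probe = 0
        self.associated_cpid = None    # CP of the last negative feedback

    def on_data_sent(self, nbytes, now):
        """Insert one probe every PROBE_INTERVAL_BYTES of data sent."""
        self.bytes_since_probe += nbytes
        if self.bytes_since_probe >= PROBE_INTERVAL_BYTES:
            self.bytes_since_probe = 0
            return {"type": "probe", "cpid": self.dst_mac, "ts": now}
        return None

    def on_feedback(self, cpid, positive, rate_limiter):
        """Apply BCN or probe feedback to the flow's single rate limiter."""
        if not positive:
            # Associate the rate limiter with the CPID that sent the
            # last negative feedback.
            self.associated_cpid = cpid
            rate_limiter.decrease()    # AIMD multiplicative decrease
        elif cpid == self.associated_cpid:
            # Increase only on feedback from the associated CPID.
            rate_limiter.increase()    # AIMD additive increase
```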
Synergies
• “Added value” of E2CM
– Fair and stable rate allocation
• Fine granularity owing to per-flow end-to-end probing
– Improved initial response and queue convergence speeds
– Transparent to network
• Purely end-to-end, no (additional) burden on bridges
• “Added value” of ECM
– Fast initial response
• Feedback travels straight back to source
– Capped aggregate queue length for large-degree hotspots
• Controls sum of per-flow queue occupancies
Modifications since March proposal
• Calculate per-flow load at source (reason: implementation concern, destination burdened with per-flow rate calculation)
– Source measures the amount of data D injected between probes
– Destination records the time T elapsed between probes and includes T in the reverse probe
– Source computes throughput estimate = D/T
– Does not account for dropped frames
– Clock synchronization is not an issue, as both timestamps are recorded at the destination
• Use source clock to determine forward latency (reason: implementation concern, global clock synchronization needed for forward latency measurement)
– Source includes timestamp T in the probe
– Upon return, source computes round-trip latency L = now - T
– Source keeps track of the minimum round-trip latency L0 = min_n(L_n)
– Source computes the effective forward latency as L - L0
• Expedite probes on reverse path
– Use top-priority traffic class
– Switches automatically preempt other traffic for probes
• New rate limiter CP association rule (reason: fix)
– Upon negative probe feedback, associate the RL with the destination MAC; only increase upon positive probe feedback when the association matches
• Perform continuous probing (reason: performance enhancement)
– Probe all flows, even those not currently being rate limited; improves initial response speed at the cost of increased overhead
• Accelerate rate recovery (reason: performance enhancement)
– Gi,flow is increased linearly with the number of consecutive positive feedbacks received
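A sketch of the resulting source-side estimators (Python; illustrative names, simplified relative to the proposal):

```python
import time

class SourceEstimators:
    """Illustrative only: throughput and forward-latency estimation."""

    def __init__(self):
        self.bytes_since_probe = 0      # D, measured at the source
        self.min_rtt = float("inf")     # L0 = min_n(L_n)

    def on_probe_returned(self, sent_ts, dest_elapsed_t):
        # dest_elapsed_t is T, the inter-probe time recorded at the
        # destination, so no clock synchronization is required.
        throughput = self.bytes_since_probe / dest_elapsed_t  # = D / T
        self.bytes_since_probe = 0

        # Round-trip latency from the source's own clock: L = now - sent_ts.
        rtt = time.monotonic() - sent_ts
        self.min_rtt = min(self.min_rtt, rtt)
        forward_latency = rtt - self.min_rtt  # effective forward latency L - L0
        return throughput, forward_latency
```

Expediting probes on the reverse path keeps the reverse leg short, which is presumably what lets L - L0 track the forward latency.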
See also au-sim-ZRL-E2CM-src-based-r1.2.pdf
Coexistence of CM and non-CM domains
• A concern has been raised that an end-to-end scheme requires global deployment
– We consider the case where a non-CM switch exists in
the path of the congesting flows
• CM messages terminated at edge of domain
– Cannot relay notifications across non-CM domain
– Cannot control congestion inside non-CM domain
• Non-CM (legacy) bridge behavior
– Does not generate or interpret any CM notifications
– Can relay CM notifications as regular frames?
• May depend on bridge implementation
• Next results make this assumption
Managing across a non-CM domain
[Figure: nodes 1-5 each inject at 100%; switches 1, 2, 3, and 5 lie in CM domains, while switch 4 sits in the non-CM domain on the path toward nodes 6 and 7.]
• Switches 1, 2, 3 & 5 are in congestion-managed domains; switch 4 is in a non-congestion-managed domain
• Four hot flows of 10 Gb/s each from nodes 1, 2, 3, 4 to node 6 (hotspot)
• One cold (lukewarm) flow of 10 Gb/s from node 5 to node 7
• Max-min fair allocation provides 2.0 Gb/s to each flow
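As a check on the 2.0 Gb/s figure, a generic progressive-filling computation (Python; treating all five flows as sharing one 10 Gb/s bottleneck is an assumption read off the figure):

```python
def max_min_shares(demands, capacity):
    """Generic progressive-filling max-min allocation for one bottleneck."""
    shares = {f: 0.0 for f in demands}
    active = set(demands)
    while active:
        # Capacity left after flows already capped at their demand.
        leftover = capacity - sum(shares[f] for f in demands if f not in active)
        fair = leftover / len(active)
        capped = {f for f in active if demands[f] <= fair}
        if not capped:
            for f in active:          # nobody demands less than the fair share
                shares[f] = fair
            break
        for f in capped:              # satisfy small demands, then re-divide
            shares[f] = demands[f]
        active -= capped
    return shares

flows = {f"flow{i}": 10.0 for i in range(1, 6)}  # four hot + one cold, 10 Gb/s each
print(max_min_shares(flows, capacity=10.0))      # -> 2.0 Gb/s per flow
```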
Simulation Setup & Parameters
• Traffic
– Mean flow size = [1'500, 60'000] B
– Geometric flow size distribution
– Source stops sending at T = 1.0 s
– Simulation runs to completion (no frames left in the system)
• Scenario
– See previous slide
• Switch
– Radix N = 2, 3, 4
– M = 150 KB/port
– Link time of flight = 1 us
– Partitioned memory per input, shared among all outputs
– No limit on per-output memory usage
– PAUSE enabled or disabled
• Applied on a per-input basis using local high/low watermarks
• watermark_high = 141.5 KB, watermark_low = 131.5 KB
• If disabled, frames are dropped when the input partition is full
• Adapter
– Per-node virtual output queuing, round-robin scheduling
– No limit on number of rate limiters
– Ingress buffer size = unlimited, round-robin VOQ service
– Egress buffer size = 150 KB
– PAUSE enabled
• ECM
– W = 2.0
– Qeq = 37.5 KB (= M/4)
– Gd = 0.5 / ((2W+1) * Qeq)
– Gi0 = (Rlink / Runit) / ((2W+1) * Qeq)
– Gi = 0.1 * Gi0
– Psample = 2% (on average 1 sample every 75 KB)
– Runit = Rmin = 1 Mb/s
– BCN_MAX enabled, threshold = 150 KB
– BCN(0,0) disabled
– Drift enabled (1 Mb/s every 10 ms)
• E2CM (per-flow)
– Continuous probing
– Wflow = 2.0
– Qeq,flow = 7.5 KB
– Gd,flow = 0.5 / ((2W+1) * Qeq,flow)
– Gi,flow = 0.01 * (Rlink / Runit) / ((2W+1) * Qeq,flow)
– Psample = 2% (on average 1 sample every 75 KB)
– Runit = Rmin = 1 Mb/s
– BCN_MAXflow enabled, threshold = 30 KB
– BCN(0,0)flow disabled
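For concreteness, the ECM gains above evaluate as follows (a sketch; Rlink = 10 Gb/s is assumed from the scenario, and the 0.5 in Gd is commonly chosen so that the maximum negative feedback roughly halves the rate):

```python
# Sketch: evaluating the ECM gain formulas with the parameter values above.
W      = 2.0
Qeq    = 37.5e3   # bytes (= M/4, with M = 150 KB)
R_link = 10e9     # b/s, assumed link speed
R_unit = 1e6      # b/s (Runit = Rmin = 1 Mb/s)

Gd  = 0.5 / ((2 * W + 1) * Qeq)                # decrease gain
Gi0 = (R_link / R_unit) / ((2 * W + 1) * Qeq)  # base increase gain
Gi  = 0.1 * Gi0

print(f"Gd = {Gd:.3e}, Gi = {Gi:.3e}")  # -> Gd = 2.667e-06, Gi = 5.333e-03
```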
E2CM: Per-flow throughput
[Charts: per-flow throughput over time under Bernoulli and bursty traffic, with PAUSE disabled (top row) and PAUSE enabled (bottom row); max-min fair rates shown for reference.]
E2CM: Per-node throughput
[Charts: per-node throughput over time under Bernoulli and bursty traffic, with PAUSE disabled (top row) and PAUSE enabled (bottom row); max-min fair rates shown for reference.]
E2CM: Switch queue length
[Charts: switch queue length over time under Bernoulli and bursty traffic, with PAUSE disabled (top row) and PAUSE enabled (bottom row); annotation marks the stable OQ level.]
Frame drops, flow completions, FCT
[Charts: counted frame drops, PAUSE frames, and completed flows (log scale), and mean FCT in seconds for all/cold/hot flows, each with and without PAUSE, under Bernoulli and bursty traffic. Annotation: mean FCT is longer w/ PAUSE.]
• Absence of PAUSE heavily skews results
– All flows accounted for (w/o PAUSE not all flows completed)
– In particular for hot flows → much longer FCT w/ PAUSE
– Load compression: flows wait for a long time in the adapter before being injected
– FCT dominated by adapter latency
• Cold flow FCT independent of burst size!
– Cold traffic also traverses the hotspot, therefore suffers from compression
Fat tree network
• Fat trees enable scaling to arbitrarily large
networks with constant (full) bisection
bandwidth
• We use static, destination-based, shortest-path
routing
• For more details on construction and routing see:
au-sim-ZRL-fat-tree-build-and-route-r1.0.pdf
Fat tree network
[Figure: folded representation (levels 0-2, spine on top, up/down routing) and unfolded Benes-equivalent representation (stages 0-4) of a 16-node, 3-level fat tree with N = 4; nodes 0-7 attach at the left edge, nodes 8-15 at the right.]
Conventions:
– N = no. of bidirectional ports per switch
– L = no. of levels (folded)
– S = no. of stages = 2L-1 (unfolded)
– M = no. of end nodes = N * (N/2)^(L-1)
– Number of switches per stage = (N/2)^(L-1)
– Total number of switches = (2L-1) * (N/2)^(L-1)
– Switches are labeled (stageID, switchID), with stageID ∈ [0, S-1] and switchID ∈ [0, (N/2)^(L-1) - 1]
– Nodes are connected at the left and right edges; left nodes are numbered 0 through M/2-1, right nodes M/2 through M-1
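A small sketch of these conventions (Python; the function name is illustrative):

```python
# Sketch: fat-tree dimensions per the conventions above.
def fat_tree_dims(N, L):
    """N = bidirectional ports per switch, L = number of levels (folded)."""
    per_stage = (N // 2) ** (L - 1)   # switches per stage
    return {
        "end_nodes":      N * (N // 2) ** (L - 1),  # M
        "stages":         2 * L - 1,                # S (unfolded)
        "per_stage":      per_stage,
        "total_switches": (2 * L - 1) * per_stage,
    }

# The two networks simulated next:
print(fat_tree_dims(4, 3))  # 16 end nodes, 5 stages, 4 switches/stage, 20 switches
print(fat_tree_dims(4, 4))  # 32 end nodes, 7 stages, 8 switches/stage, 56 switches
```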
Simulation Setup & Parameters
• Traffic
– Mean flow size = [1'500, 60'000] B
– Geometric flow size distribution
– Uniform destination distribution (except self)
– Mean load = 50%
– Source stops sending at T = 1.0 s
– Simulation runs to completion
• Scenario
1. 16-node (3-level) fat tree network
2. 32-node (4-level) fat tree network
– Output-generated hotspot (rate reduction to 10% of link rate) on port 1 from 0.1 to 0.5 s
• Switch
– Radix N = 4
– M = 150 KB/port
– Link time of flight = 1 us
– Partitioned memory per input, shared among all outputs
– No limit on per-output memory usage
– PAUSE enabled or disabled
• Applied on a per-input basis using local high/low watermarks
• watermark_high = 141.5 KB, watermark_low = 131.5 KB
• If disabled, frames are dropped when the input partition is full
• Adapter
– Per-node virtual output queuing, round-robin scheduling
– No limit on number of rate limiters
– Ingress buffer size = unlimited, round-robin VOQ service
– Egress buffer size = 150 KB
– PAUSE enabled
• ECM
– W = 2.0
– Qeq = 37.5 KB (= M/4)
– Gd = 0.5 / ((2W+1) * Qeq)
– Gi0 = (Rlink / Runit) / ((2W+1) * Qeq)
– Gi = 0.1 * Gi0
– Psample = 2% (on average 1 sample every 75 KB)
– Runit = Rmin = 1 Mb/s
– BCN_MAX enabled, threshold = 150 KB
– BCN(0,0) en-/disabled, threshold = 300 KB
– Drift enabled (1 Mb/s every 10 ms)
• E2CM (per-flow)
– Continuous probing
– Wflow = 2.0
– Qeq,flow = 7.5 KB
– Gd,flow = 0.5 / ((2W+1) * Qeq,flow)
– Gi,flow = 0.01 * (Rlink / Runit) / ((2W+1) * Qeq,flow)
– Psample = 2% (on average 1 sample every 75 KB)
– Runit = Rmin = 1 Mb/s
– BCN_MAXflow enabled, threshold = 30 KB
– BCN(0,0)flow en-/disabled, threshold = 60 KB
E2CM fat tree results: 16 nodes, 3 levels
[Charts: aggregate throughput (top) and hot queue length (bottom) over time under Bernoulli and bursty traffic.]
E2CM fat tree results: 32 nodes, 4 levels
[Charts: aggregate throughput (top) and hot queue length (bottom) over time under Bernoulli and bursty traffic.]
Frame drops, completed flows, FCT
[Charts: counted frame drops, PAUSE frames, and completed flows (log scale) for the 16-node and 32-node networks, and FCT in seconds for all/cold/hot flows, each with and without PAUSE; series shown for Bernoulli and bursty traffic, with and without BCN(0,0).]
Mixed link speeds
Output-generated hotspot
[Figure: nodes 1-10 inject at 50% over 1G links into switch 1, which connects via 10G to switch 2; node 11 hangs off switch 2 via 10G, with its service rate reduced to 10%.]
• Nodes 1-10 are connected via 1G adapters and links
• Switch 1 has ten 1G ports and one 10G port to switch 2, which has two 10G ports
• Ten hot flows of 0.5 Gb/s each from nodes 1-10 to node 11 (hotspot)
• Node 11 sends uniformly at 5 Gb/s (cold)
• Max-min fair shares: 12.5 MB/s for [1-10] → 11
– Shared-memory switches → create more serious congestion
Input-generated hotspot
[Figure: same topology; node 11 injects at 50% over 10G, nodes 1-10 at 50% over 1G.]
• Same topology as above
• One hot flow of 5.0 Gb/s from node 11 to node 1 (hotspot)
• Nodes 1-10 send uniformly at 0.5 Gb/s (cold)
• Max-min fair shares: 62.5 MB/s for 11 → 1 and 6.25 MB/s for [2-10] → 1
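A quick unit check of the output-generated shares (Python; that the 10% service rate applies at node 11's port is an assumption read off the figure):

```python
# Sketch: unit check for the output-generated hotspot shares.
def gbps_to_MBps(x):
    return x * 1000 / 8          # 1 Gb/s = 125 MB/s

bottleneck = 0.10 * 10.0         # Gb/s available at node 11 (10% of 10G)
per_flow = bottleneck / 10       # ten equal-demand hot flows
print(gbps_to_MBps(per_flow))    # -> 12.5 MB/s, matching "[1-10] -> 11"
```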
E2CM mixed speed: output-generated HS
[Charts: per-node (left) and per-flow (right) throughput, with PAUSE disabled (top) and enabled (bottom).]
E2CM mixed speed: input-generated HS
[Charts: per-node (left) and per-flow (right) throughput, with PAUSE disabled (top) and enabled (bottom).]
Probing mixed speed: output-generated HS
[Charts: per-node (left) and per-flow (right) throughput, with PAUSE disabled (top) and enabled (bottom); annotation marks perfect bandwidth sharing.]
Probing mixed speed: input-generated HS
[Charts: per-node (left) and per-flow (right) throughput, with PAUSE disabled (top) and enabled (bottom).]
Conclusions
• FCT dominated by adapter latency for rate-limited flows
• E2CM can manage across non-CM domains
– Even a hotspot within a non-CM domain can be controlled
– Need to ensure that CM notifications can traverse non-CM
domains
• They have to look like valid frames to non-CM bridges
• E2CM works excellently in multi-level fat tree topologies
• E2CM also copes well with mixed speed networks
• Continuous probing improves E2CM’s overall performance
• In low-degree hotspot scenarios, probing alone appears sufficient to control congestion