E2CM updates
IEEE 802.1 Interim @ Geneva
Cyriel Minkenberg & Mitch Gusat
IBM Research GmbH, Zurich
May 29, 2007

Outline
• Summary of E2CM proposal
  – How it works
  – What has changed
• New E2CM performance results
  – Managing across a non-CM domain
  – Performance in fat tree topology
  – Mixed link speeds (1G/10G)

Refresher: E2CM Operation
[Diagram: src - Switch 1 - Switch 2 - Switch 3 - dst; a BCN is returned from the congested switch, while the probe traverses the full path to the destination and back]
• At the congested switch: 1. Qeq exceeded; 2. Send BCN to source
• At the source, on BCN arrival: 1. BCN arrives at source; 2. Install rate limiter; 3. Inject probe w/ timestamp
• At the destination: 1. Probe arrives at dst; 2. Insert timestamp; 3. Return probe to source
• At the source, on probe return: 1. Probe arrives at source; 2. Path occupancy computed; 3. AIMD control applied using the same rate limiter
• Probing is triggered by BCN frames; only rate-limited flows are probed
• Per flow, BCN and probes employ the same rate limiter
  – Insert one probe every X KB of data sent per flow, e.g. X = 75 KB
  – Probes traverse the network in-band: the objective is to observe the real, current queuing delay
  – Variant: continuous probing (used here)
  – Control per-flow (probe) as well as per-queue (BCN) occupancy
  – CPID of probes = destination MAC
  – Rate limiter is associated with the CPID from which the last negative feedback was received
  – Increment only on probes from the associated CPID
  – Parameters relating to probes may be set differently (in particular Qeq,flow, Qmax,flow, Gd,flow, Gi,flow)

Synergies
• "Added value" of E2CM
  – Fair and stable rate allocation: fine granularity owing to per-flow end-to-end probing
  – Improved initial response and queue convergence speeds
  – Transparent to the network: purely end-to-end, no (additional) burden on bridges
• "Added value" of ECM
  – Fast initial response: feedback travels straight back to the source
  – Capped aggregate queue length for large-degree hotspots: controls the sum of per-flow queue occupancies

Modifications since March proposal
• Calculate per-flow load at source (implementation concern: destination would otherwise be burdened with per-flow rate calculation)
  – Source measures the amount of data D injected between probes
  – Destination records the time T elapsed between probes and includes T in the reverse probe
  – Source computes throughput estimate = D / T
  – Does not account for dropped frames
  – Clock synchronization is not an issue, as both timestamps are recorded at the destination
• Use source clock to determine forward latency (implementation concern: global clock synchronization would otherwise be needed for forward latency measurement)
  – Source includes timestamp T in the probe
  – Upon return, source computes round-trip latency L = now - T
  – Source keeps track of the minimum round-trip latency L0 = min_n(L_n)
  – Source computes the effective forward latency as L - L0 (both computations are sketched in code after this list)
• Expedite probes on reverse path
  – Use top-priority traffic class; switches automatically preempt other traffic for probes
• New rate limiter CP association rule (fix)
  – Upon negative probe feedback, associate the rate limiter with the destination MAC; only increase upon positive probe feedback when the association matches
• Perform continuous probing (performance enhancement)
  – Probe all flows, even those not currently being rate limited; improves initial response speed at the cost of increased overhead
• Accelerate rate recovery (performance enhancement)
  – Gi,flow is increased linearly with the number of consecutive positive feedbacks received
• See also au-sim-ZRL-E2CM-src-based-r1.2.pdf
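A minimal Python sketch of the source-side bookkeeping behind the first two modifications. All names here are hypothetical (the proposal specifies the quantities, not an API), and real probes carry these values as frame fields rather than a dict:

```python
import time

class ProbeState:
    """Per-flow, source-side state for E2CM probes (illustrative sketch only)."""

    def __init__(self):
        self.bytes_since_probe = 0    # D: data injected since the last probe
        self.min_rtt = float('inf')   # L0 = min_n(L_n)

    def on_frame_sent(self, nbytes):
        self.bytes_since_probe += nbytes

    def make_probe(self):
        # The probe carries the source timestamp T and the injected byte count D.
        probe = {'t_src': time.monotonic(), 'bytes': self.bytes_since_probe}
        self.bytes_since_probe = 0
        return probe

    def on_probe_return(self, probe, t_dst_gap):
        # t_dst_gap: time T between probes, measured entirely at the destination,
        # so no source/destination clock synchronization is required.
        throughput = probe['bytes'] / t_dst_gap   # throughput estimate = D / T

        rtt = time.monotonic() - probe['t_src']   # L = now - T (source clock only)
        self.min_rtt = min(self.min_rtt, rtt)     # L0 = min_n(L_n)
        fwd_latency = rtt - self.min_rtt          # effective forward latency L - L0
        return throughput, fwd_latency
```

As noted above, the throughput estimate does not account for dropped frames, and since L0 only approximates the unloaded round trip, the forward latency is a relative measure of queuing delay, not an absolute one.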
Coexistence of CM and non-CM domains
• Concern has been raised that an end-to-end scheme requires global deployment
  – We consider the case where a non-CM switch exists in the path of the congesting flows
• CM messages are terminated at the edge of the domain
  – Cannot relay notifications across a non-CM domain
  – Cannot control congestion inside a non-CM domain
• Non-CM (legacy) bridge behavior
  – Does not generate or interpret any CM notifications
  – Can it relay CM notifications as regular frames? This may depend on the bridge implementation; the next results make this assumption

Managing across a non-CM domain
[Topology diagram: nodes 1-5 inject at 100% through CM switches 1-3; these feed non-CM switch 4, whose link to CM switch 5 carries all flows; nodes 6 and 7 are attached behind switch 5]
• Switches 1, 2, 3 & 5 are in congestion-managed domains; switch 4 is in a non-congestion-managed domain
• Four hot flows of 10 Gb/s each from nodes 1, 2, 3, 4 to node 6 (hotspot)
• One cold (lukewarm) flow of 10 Gb/s from node 5 to node 7
• Max-min fair allocation provides 2.0 Gb/s to each flow

Simulation Setup & Parameters
• Traffic
  – Mean flow size = [1'500, 60'000] B
  – Geometric flow size distribution
  – Source stops sending at T = 1.0 s
  – Simulation runs to completion (no frames left in the system)
• Scenario
  – See previous slide
• Switch
  – Radix N = 2, 3, 4
  – M = 150 KB/port
  – Link time of flight = 1 µs
  – Partitioned memory per input, shared among all outputs; no limit on per-output memory usage
  – PAUSE enabled or disabled
    • Applied on a per-input basis, based on local high/low watermarks: watermark_high = 141.5 KB, watermark_low = 131.5 KB
    • If disabled, frames are dropped when the input partition is full
• Adapter
  – Per-node virtual output queuing, round-robin scheduling
  – No limit on number of rate limiters
  – Ingress buffer size = unlimited, round-robin VOQ service
  – Egress buffer size = 150 KB
  – PAUSE enabled
• ECM
  – W = 2.0
  – Qeq = 37.5 KB (= M/4)
  – Gd = 0.5 / ((2*W+1)*Qeq)
  – Gi0 = (Rlink / Runit) / ((2*W+1)*Qeq); Gi = 0.1 * Gi0
  – Psample = 2% (on average 1 sample every 75 KB)
  – Runit = Rmin = 1 Mb/s
  – BCN_MAX enabled, threshold = 150 KB
  – BCN(0,0) disabled
  – Drift enabled (1 Mb/s every 10 ms)
• E2CM (per-flow)
  – Continuous probing
  – Wflow = 2.0
  – Qeq,flow = 7.5 KB
  – Gd,flow = 0.5 / ((2*W+1)*Qeq,flow)
  – Gi,flow = 0.01 * (Rlink / Runit) / ((2*W+1)*Qeq,flow)
  – Psample = 2% (on average 1 sample every 75 KB)
  – Runit = Rmin = 1 Mb/s
  – BCN_MAXflow enabled, threshold = 30 KB
  – BCN(0,0)flow disabled
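To make the gain settings above concrete, here is a Python sketch of the BCN-style AIMD law they plug into. The structure (Fb = Qoff - W*Qdelta, multiplicative decrease, additive increase) follows the standard ECM/BCN formulation rather than anything spelled out on this slide, and we take Gi, like the per-flow gain, to be inversely proportional to (2W+1)*Qeq; that normalization is what makes Gd's 0.5 numerator mean "halve the rate at worst":

```python
W = 2.0
M = 150e3                      # per-port memory [bytes]
Q_EQ = M / 4                   # equilibrium threshold, 37.5 KB
FB_MAX = (2 * W + 1) * Q_EQ    # largest |Fb| a sample can report

R_LINK = 10e9                  # 10 Gb/s link
R_UNIT = 1e6                   # Runit = Rmin = 1 Mb/s

G_D = 0.5 / FB_MAX                        # worst-case decrease halves the rate
G_I = 0.1 * (R_LINK / R_UNIT) / FB_MAX    # worst-case increase is 10% of link rate

def feedback(q_len, q_old):
    """Fb = Qoff - W*Qdelta: negative when the queue is above Qeq and/or growing."""
    q_off = Q_EQ - q_len       # offset from the equilibrium point
    q_delta = q_len - q_old    # queue growth since the previous sample
    return q_off - W * q_delta

def aimd(rate, fb):
    """Rate limiter update: multiplicative decrease, additive increase."""
    if fb < 0:
        return max(R_UNIT, rate * (1 + G_D * fb))   # fb < 0, so the factor is < 1
    return min(R_LINK, rate + G_I * fb * R_UNIT)    # clamp to link rate (assumption)
```

The per-flow (probe) limiter applies the same law with Qeq,flow = 7.5 KB and the 0.01 increase factor listed above.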
E2CM: Per-flow throughput
[Figure: per-flow throughput over time, Bernoulli and bursty traffic, PAUSE disabled and enabled; flows converge to the max-min fair rates]

E2CM: Per-node throughput
[Figure: per-node throughput over time, Bernoulli and bursty traffic, PAUSE disabled and enabled; nodes converge to the max-min fair rates]

E2CM: Switch queue length
[Figure: switch queue length over time, Bernoulli and bursty traffic, PAUSE disabled and enabled; the output queue settles at a stable OQ level]

Frame drops, flow completions, FCT
[Charts: counted frame drops, PAUSE frames, and completed flows (log scale), plus mean flow completion time (FCT) for all/cold/hot flows, each w/ and w/o PAUSE, for Bernoulli and bursty traffic]
• Absence of PAUSE heavily skews the results
  – With PAUSE all flows are accounted for; w/o PAUSE not all flows completed
• Mean FCT is longer w/ PAUSE
  – In particular, hot flows have a much longer FCT w/ PAUSE
  – Load compression: flows wait for a long time in the adapter before being injected
  – FCT is dominated by adapter latency
• Cold flow FCT is independent of burst size!
  – Cold traffic also traverses the hotspot and therefore suffers from compression

Fat tree network
• Fat trees enable scaling to arbitrarily large networks with constant (full) bisection bandwidth
• We use static, destination-based, shortest-path routing
• For more details on construction and routing see: au-sim-ZRL-fat-tree-build-and-route-r1.0.pdf

Fat tree network: conventions
[Figure: folded representation with levels 0-2 and 16 end nodes at the leaves, plus the equivalent unfolded (Benes) representation with stages 0-4, left nodes 0-7 and right nodes 8-15; switches carry (stageID, switchID) labels, with up/down links in the folded form]
• Switches are labeled (stageID, switchID), with stageID in [0, S-1] and switchID in [0, (N/2)^(L-1) - 1]
• N = number of bidirectional ports per switch
• L = number of levels (folded); S = number of stages = 2L-1 (unfolded)
• M = number of end nodes = N*(N/2)^(L-1)
• Number of switches per stage = (N/2)^(L-1); total number of switches = (2L-1)*(N/2)^(L-1)
• Nodes are connected at the left and right edges: left nodes are numbered 0 through M/2-1, right nodes M/2 through M-1
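These counting formulas are easy to sanity-check in a few lines of Python against the two configurations simulated next (function and key names are illustrative):

```python
def fat_tree_params(N, L):
    """Size a folded fat tree: N = bidirectional ports per switch, L = levels."""
    per_stage = (N // 2) ** (L - 1)   # switches per stage = (N/2)^(L-1)
    return {
        'end_nodes':      N * per_stage,         # M = N * (N/2)^(L-1)
        'stages':         2 * L - 1,             # S, unfolded (Benes) form
        'per_stage':      per_stage,
        'total_switches': (2 * L - 1) * per_stage,
    }

print(fat_tree_params(4, 3))  # 16 nodes, 5 stages, 4 switches/stage, 20 switches
print(fat_tree_params(4, 4))  # 32 nodes, 7 stages, 8 switches/stage, 56 switches
```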
Simulation Setup & Parameters
• Traffic
  – Mean flow size = [1'500, 60'000] B
  – Geometric flow size distribution
  – Uniform destination distribution (except self)
  – Mean load = 50%
  – Source stops sending at T = 1.0 s
  – Simulation runs to completion
• Scenario
  1. 16-node (3-level) and 32-node (4-level) fat tree networks
  2. Output-generated hotspot (rate reduction to 10% of link rate) on port 1 from 0.1 to 0.5 s
• Switch
  – Radix N = 4
  – M = 150 KB/port
  – Link time of flight = 1 µs
  – Partitioned memory per input, shared among all outputs; no limit on per-output memory usage
  – PAUSE enabled or disabled
    • Applied on a per-input basis, based on local high/low watermarks: watermark_high = 141.5 KB, watermark_low = 131.5 KB
    • If disabled, frames are dropped when the input partition is full
• Adapter
  – Per-node virtual output queuing, round-robin scheduling
  – No limit on number of rate limiters
  – Ingress buffer size = unlimited, round-robin VOQ service
  – Egress buffer size = 150 KB
  – PAUSE enabled
• ECM
  – W = 2.0
  – Qeq = 37.5 KB (= M/4)
  – Gd = 0.5 / ((2*W+1)*Qeq)
  – Gi0 = (Rlink / Runit) / ((2*W+1)*Qeq); Gi = 0.1 * Gi0
  – Psample = 2% (on average 1 sample every 75 KB)
  – Runit = Rmin = 1 Mb/s
  – BCN_MAX enabled, threshold = 150 KB
  – BCN(0,0) en-/disabled, threshold = 300 KB
  – Drift enabled (1 Mb/s every 10 ms)
• E2CM (per-flow)
  – Continuous probing
  – Wflow = 2.0
  – Qeq,flow = 7.5 KB
  – Gd,flow = 0.5 / ((2*W+1)*Qeq,flow)
  – Gi,flow = 0.01 * (Rlink / Runit) / ((2*W+1)*Qeq,flow)
  – Psample = 2% (on average 1 sample every 75 KB)
  – Runit = Rmin = 1 Mb/s
  – BCN_MAXflow enabled, threshold = 30 KB
  – BCN(0,0)flow en-/disabled, threshold = 60 KB

E2CM fat tree results: 16 nodes, 3 levels
[Figure: aggregate throughput and hot queue length over time, Bernoulli and bursty traffic]

E2CM fat tree results: 32 nodes, 4 levels
[Figure: aggregate throughput and hot queue length over time, Bernoulli and bursty traffic]

Frame drops, completed flows, FCT
[Charts for 16 and 32 nodes: frame drops, PAUSE frames, and completed flows (log scale), plus FCT for all/cold/hot flows, each w/ and w/o PAUSE; legends compare Bernoulli and bursty traffic, each with and without BCN(0,0)]

Mixed link speeds
• Output-generated hotspot
  [Topology: nodes 1-10 attach to switch 1 via 1G links and inject at 50%; switch 1 has a 10G uplink to switch 2; node 11 attaches to switch 2 via 10G, with its service rate reduced to 10%]
  – Nodes 1-10 are connected via 1G adapters and links
  – Switch 1 has ten 1G ports and one 10G port to switch 2, which has two 10G ports
  – Ten hot flows of 0.5 Gb/s each from nodes 1-10 to node 11 (hotspot)
  – Node 11 sends uniformly at 5 Gb/s (cold)
  – Max-min fair shares: 12.5 MB/s for [1-10] -> 11
  – Shared-memory switches create more serious congestion
• Input-generated hotspot
  – Same topology as above
  – One hot flow of 5.0 Gb/s from node 11 to node 1 (hotspot)
  – Nodes 1-10 send uniformly at 0.5 Gb/s (cold)
  – Max-min fair shares: 62.5 MB/s for 11 -> 1 and 6.25 MB/s for [2-10] -> 1

E2CM mixed speed: output-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled and enabled]

E2CM mixed speed: input-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled and enabled]

Probing mixed speed: output-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled and enabled; perfect bandwidth sharing]

Probing mixed speed: input-generated HS
[Figure: per-node and per-flow throughput, PAUSE disabled and enabled]

Conclusions
• FCT is dominated by adapter latency for rate-limited flows
• E2CM can manage across non-CM domains
  – Even a hotspot within a non-CM domain can be controlled
  – Need to ensure that CM notifications can traverse non-CM domains: they have to look like valid frames to non-CM bridges
• E2CM works excellently in multi-level fat tree topologies
• E2CM also copes well with mixed-speed networks
• Continuous probing improves E2CM's overall performance
• In low-degree hotspot scenarios, probing alone appears sufficient to control congestion