Ethernet Data Center Routing Challenges and 802.1aq/SPB new work
PETER ASHWOOD-SMITH
peterashwoodsmith@huawei.com

[Figure: two-level fabric; callout A) “Tweak Bridge Priorities Here”; callout B) at S1 … S16]

802.1aq’s 16 ECT algorithms can give a perfect spread across 16 uplinks going 2 hops. However:
A) The 2nd-layer switch priorities need to be tweaked to guarantee that all 16 are used.
B) At least 16 subnets (C-VLANs/S-VLANs) are needed, to assign one per 802.1aq B-VID.

Can we eliminate ‘tweaking’*?
• David Allan et al. have a presentation on this, so I won’t spend much time on it.
• In general, a network with N equal-cost paths from ‘some source’ to ‘some destination’ requires #ECT about 25-40% greater than N (to statistically capture them all).
• Therefore, when #ECT == N, some ‘tweaking’ is usually required (for a DC it is trivial to do, however).
• Dave et al. suggest non-independence between ECT algorithms as a way to address this (maximize diversity) …
*Tweaking = adjusting Bridge Priorities up/down from defaults.

“Example” 802.1aq switching cluster: assume 100GE NNI links/groups
Good numbers: “16” and “2” levels.

[Figure: A1 … A16 (32 x 100GE downlinks each) over B1 … B32 (16 x 100GE uplinks and 160 x 10GE server links each), serving S1,1 … S32,160 (5120 x 10GE in total); 16 x 32 x 100GE = 51.2T using 48 x 2T switches]

• 48-switch non-blocking 2-layer L2 fabric
• 16 at the “upper” layer, A1..A16
• 32 at the “lower” layer, B1..B32
• 16 uplinks per Bn, and 160 UNI links per Bn
• 32 downlinks per An
• (16 x 100GE per Bn) x 32 = 512 x 100GE = 51.2T
• 160 x 10GE server links (UNI) per Bn
• (32 x 160)/2 = 2560 servers @ 2 x 10GE per server
• uFIB = 16 x 48 B-MAC = 768 entries
• mFIB = 16 subnets x 48 src = 768 entries
• 1536 FIB entries/node

[Figure: tree for ECT-ALG #12 from Source Node (1), reaching S1 … S16]

• For a given ECT-ALGk, Aj is a member of every SPF-TREE(B*, ECT-ALGk).
• Properly tuned, no two ECT-ALGorithms will use the same Aj as a fork point.

Subnet Ni maps to I-SIDj and then to a unique A(j mod 16)

[Figure: A1, A2, A15, A16 over B1 … B4 and B29 … B32, with I-SIDi and I-SIDj attached at the B nodes]

So load spreading allows each Ai to transit a complete subnet.
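The sizing arithmetic and the I-SID mapping on these slides can be checked with a short Python sketch. All figures come from the slides; the constant names and the `transit_node` helper are hypothetical, and shifting `j mod 16` by one so that results land on A1..A16 is an assumption about how the slide's index is meant to be read.

```python
# Sizing sketch for the 2-level example fabric (numbers taken from the slide).
UPPER = 16            # upper-layer switches A1..A16
LOWER = 32            # lower-layer switches B1..B32
UPLINKS_PER_B = 16    # 100GE uplinks per Bn
UNI_PER_B = 160       # 10GE server (UNI) links per Bn
ECT = 16              # ECT algorithms, one B-VID each
SUBNETS = 16          # one subnet per B-VID

nodes = UPPER + LOWER                    # 48 switches
nni_links = UPLINKS_PER_B * LOWER        # 512 x 100GE
fabric_tbps = nni_links * 100 / 1000     # 51.2 Tb/s
uni_links = UNI_PER_B * LOWER            # 5120 x 10GE
servers = uni_links // 2                 # 2560 dual-homed servers
ufib = ECT * nodes                       # 768 unicast entries
mfib = SUBNETS * nodes                   # 768 multicast entries
total_fib = ufib + mfib                  # 1536 entries per node


def transit_node(isid: int, upper: int = UPPER) -> str:
    """Hypothetical helper: map I-SID j to A((j mod 16) + 1).

    The +1 is an assumption so that results fall in A1..A16.
    """
    return f"A{(isid % upper) + 1}"


if __name__ == "__main__":
    print(nodes, nni_links, fabric_tbps, uni_links, servers, total_fib)
    for isid in (0, 1, 15, 16, 17):
        print(isid, transit_node(isid))
```

Because the mapping depends only on the I-SID, every flow in a subnet transits the same A node, which is the granularity limit the following slides address.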
Problem #1: unable to spread further, such that Ai and Aj (i != j) each handle a subset of the flows in I-SIDj

This is an issue under failure of Aj.

[Figure: A1, A2, A15, A16 over B1 … B4 and B29 … B32, with I-SIDi and I-SIDj attached at the B nodes]

• Recovery will move the entire subnet's traffic to another Ai node.
• A preferable solution is to spread the affected load over the remaining A*.

Possible solution: head-end hashing (unicast only)

[Figure: the same fabric, with unicast and multicast (Mcast) paths for I-SIDi and I-SIDj marked separately]

• Allow unicast I-SIDi and I-SIDj traffic to be hashed, based on smaller flows, to different B-VIDs (ECT-ALGorithms).
• This breaks the symmetry and congruence rules but allows edge balancing at smaller granularity.
• No changes to multicast.
• Requires learning <C-DA, B-DA>, independent of B-VID.

Interconnection of fabrics creates more than 16 paths (exponential)

[Figure: two fabrics (each A1 … A16 over B1 … B32) joined by C1 and C2; path counts labelled O(16), O(16x2) and O(16x2x16)]

• The number of paths can grow exponentially with increasing levels.
• A constant number of paths is always << the number of paths available in many networks.
• Growing 802.1aq ECT to, say, 32 or even 100 ECMP causes larger unicast FIBs.

Horizontal growth: not too bad, but needs more ECT-ALGORITHMS

[Figure: A1, A2, A15, A16 plus a new A17, over B1 … B4 and B29 … B34]

• Horizontal growth by 1 just increases the number of ECT by 1.
• Not too big a problem, but we would need to define new ECT (via Opaque).

General Issue

[Figure: source S to destination D; the head end chooses a path from N x B-VID; fan-out O(degree), path length O(diameter)]

• #paths ~= O(degree^diameter)
• So head-end ECT in the worst case requires O(exp(# B-VIDs)).

A feasible solution …

[Figure: single B-VID from S to D; each hop chooses a path from N x next hop]

• Re-assign traffic to a path at each hop.
• Tandem “ECMP”, just like IP.
• Need to keep O(degree) number of next hops.
• Only need one B-VID, which removes O(diameter) from the state cost.
• The flip side is that you have no control; you just hope for a fine-scale statistical distribution.

What about loops in this mode?
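To make the state-cost comparison concrete, here is a minimal Python sketch. It reads the slide's path-count estimate as degree raised to the power of the diameter, which is an interpretation of the garbled formula, and it assumes the worst case in which head-end selection needs one B-VID per distinct path. Both function names are hypothetical.

```python
def head_end_paths(degree: int, diameter: int) -> int:
    """Worst-case number of distinct paths the head end must choose between.

    Assumption: every hop offers `degree` equal-cost choices, so the count
    multiplies at each of the `diameter` hops.
    """
    return degree ** diameter


def tandem_next_hops(degree: int) -> int:
    """Per-node state for hop-by-hop (tandem) selection: one entry per next hop."""
    return degree


if __name__ == "__main__":
    # Compare the two schemes for a degree of 16 and growing diameter.
    for diameter in (1, 2, 3, 4):
        print(diameter, head_end_paths(16, diameter), tandem_next_hops(16))
```

The head-end figure grows with every added level while the tandem figure stays at the node degree, which is the trade the slide describes: far less state, but no control over which path a flow takes.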
• The 802.1aq Ingress Check is very strong in the case of a single next hop, and hence a single possible ingress for an SA.
• The 802.1aq Ingress Check is weakened in the case of multiple next hops, and hence multiple possible ingresses for an SA.
• However, the 802.1aq Agreement Protocol functions correctly in the context of multiple possible next hops for the same B-VID (refer to Mick’s proof).
• But …

Agreement Protocol Concerns
• Is it too complex? It is clearly non-trivial; we need implementation/emulation experience.
• Is it overly Draconian? For example, the bounds on movement are what is required for a mathematical proof by induction. However, there are probably many cases where further movement would not loop. What is the degree of ‘overkill’?
• Is it marketable? This is unfortunately a legitimate concern!
• 802.1aq can be deployed without AP until we introduce hash-based forwarding, at which point we require a symmetric AP and/or an on-data-path loop detection/drop mechanism.
• We believe that an on-data-path loop detection mechanism is required for hash-based ECMP until we have more experience with AP.
• Recommend we standardize a TTL TAG, either stand-alone or as a new form of I-TAG.

View of New Work Requirements
R1) New ECT-ALGorithms with improved spreading properties.
R2) Allow optional head-end hash assignment of 802.1aq SPBM UNI known unicast traffic to one of multiple next-hop interfaces/B-VIDs. Very similar to link aggregation.
    Minimally HASH(seed, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO])
R3) Allow optional tandem hash assignment of 802.1aq SPBM B-VID NNI unicast traffic to one of multiple next-hop interfaces. Essentially a new SPBM ECT-ALG with its own B-VID (i.e. new ECT-ALGorithms, all usable at the same time).
    Minimally HASH(seed, B-VID, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO])
R4) Minor OA&M changes in support of R2 and R3, because symmetry/congruence is broken.
R5) More experience with AP (emulations, simulations, etc.), plus the addition of a TTL to a new I-TAG or a TTL-TAG.
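The tandem hash of R3 combined with the TTL of R5 could look roughly like the Python sketch below. It is illustrative only: the slides do not specify a hash function, so CRC32 is an assumption, as are the frame dictionary layout, the function names, and the TTL rule (drop at zero, decrement on each forward).

```python
import zlib
from typing import Optional, Sequence, Tuple


def flow_hash(seed: int, b_vid: int, c_sa: str, c_da: str, c_vid: int,
              ip: Tuple = ()) -> int:
    """HASH(seed, B-VID, C.SA, C.DA, C-VID, [IP fields]) using CRC32 (assumed)."""
    fields = (seed, b_vid, c_sa, c_da, c_vid) + tuple(ip)
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key)


def tandem_forward(frame: dict, next_hops: Sequence[str],
                   seed: int) -> Optional[Tuple[str, dict]]:
    """Pick one of several equal-cost next hops for a frame at this node.

    Returns (next_hop, forwarded_frame), or None when the TTL has run out,
    which is the on-data-path loop drop (TTL semantics are an assumption).
    """
    if frame["ttl"] <= 0:
        return None
    h = flow_hash(seed, frame["b_vid"], frame["c_sa"], frame["c_da"],
                  frame["c_vid"], frame.get("ip", ()))
    forwarded = dict(frame, ttl=frame["ttl"] - 1)  # copy with decremented TTL
    return next_hops[h % len(next_hops)], forwarded


if __name__ == "__main__":
    frame = {"b_vid": 100, "c_sa": "00:00:00:00:00:01",
             "c_da": "00:00:00:00:00:02", "c_vid": 10, "ttl": 3}
    print(tandem_forward(frame, ["A1", "A2", "A3", "A4"], seed=7))
```

Hashing on flow fields keeps every frame of a flow on the same next hop at a given node, so ordering within the flow is preserved, while the TTL bounds how long a frame can circulate if the hashed next hops ever form a loop.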