Ethernet Data Center Routing Challenges and 802.1aq/SPB new work PETER ASHWOOD-SMITH

peterashwoodsmith@huawei.com
[Figure: fabric diagram with callouts (A) “Tweak Bridge Priorities Here” and (B) subnets S1 … S16.]
802.1aq’s 16 ECTs can give a perfect spread across 16 uplinks when going 2 hops. However:
A) The 2nd-layer switch priorities need tweaking to guarantee that all 16 are used.
B) At least 16 subnets (C-VLANs/S-VLANs) are needed, so that one can be assigned per 802.1aq B-VID.
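Point A can be illustrated with a small sketch of ECT tie-breaking in the two-hop case. It assumes the tie-break XORs every byte of each candidate Bridge Identifier with a per-algorithm mask and picks the lowest result; the 16 mask values are quoted from memory of the 802.1aq ECT definitions and should be checked against the standard. The MAC addresses and the helper names `bridge_id` and `fork_points` are hypothetical. Because the source and destination B switches are common to every two-hop path, only the A switch identifiers are compared.

```python
# 16 XOR masks, one per ECT-ALGORITHM (assumed values; verify against 802.1aq).
ECT_MASKS = [0x00, 0xFF, 0x88, 0x77, 0x44, 0x33, 0xCC, 0xBB,
             0x22, 0x11, 0x66, 0x55, 0xAA, 0x99, 0xDD, 0xEE]


def bridge_id(priority_nibble: int, mac: int) -> bytes:
    """8-byte Bridge Identifier: 2-byte priority field followed by a 6-byte MAC."""
    return bytes([priority_nibble << 4, 0x00]) + mac.to_bytes(6, "big")


def fork_points(a_switch_ids: list[bytes]) -> list[int]:
    """For each mask, return the index of the A switch with the lowest masked ID."""
    winners = []
    for mask in ECT_MASKS:
        masked = [bytes(b ^ mask for b in bid) for bid in a_switch_ids]
        winners.append(masked.index(min(masked)))
    return winners


# Hypothetical MACs: 16 A switches whose addresses differ only in a few low bits.
macs = [0x00AABB000000 + (g << 8) + i for g in range(2) for i in range(8)]

default_ids = [bridge_id(0x8, mac) for mac in macs]              # same default priority
tweaked_ids = [bridge_id(n, mac) for n, mac in enumerate(macs)]  # one priority step per switch

default_used = len(set(fork_points(default_ids)))
tweaked_used = len(set(fork_points(tweaked_ids)))
print(f"A switches used with default priorities: {default_used} of 16")
print(f"A switches used with tweaked priorities: {tweaked_used} of 16")
```

With these particular addresses the default priorities reach only 8 of the 16 A switches, while giving each A switch its own priority step makes every mask select a different one. Other address plans will give other counts for the default case.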
Can we eliminate ‘tweaking’*?
• David Allan et al. have a presentation on this, so I won’t spend much time on it.
• In general, a network with N equal-cost paths from ‘some source’ to ‘some destination’ requires #ECT about 25-40% greater than N (to statistically capture them all).
• Therefore, when #ECT == N, some ‘tweaking’ is usually required (for a DC it is trivial to do, however).
• Dave et al. suggest non-independence between ECT algorithms as a way to address this (maximize diversity) …
*Tweaking = adjusting Bridge Priorities up/down from defaults.
“Example” 802.1aq switching cluster – assume 100GE NNI links/groups
[Figure: two-layer fabric. Upper layer A1, A2, … A15, A16 with 32 x 100GE each; lower layer B1 … B32 with 16 x 100GE and 160 x 10GE each; servers S1,1 … S32,160 on 5120 x 10GE. 16 x 32 x 100GE = 51.2T using 48 x 2T switches. Callout: good numbers “16” & “2” levels.]
48-switch non-blocking 2-layer L2 fabric
16 at “upper” layer A1..A16
32 at “lower” layer B1..B32
16 uplinks per Bn, and 160 UNI links per Bn
32 downlinks per An
(16 x 100GE per Bn) x 32 = 512 x 100GE = 51.2T
160 x 10GE server links (UNI) per Bn
(32 x 160)/2 = 2560 servers @ 2 x 10GE per server
uFIB = 16 x 48 B-MAC = 768 entries
mFIB = 16 subnets x 48 src = 768 entries
Total = 1536 FIB entries/node
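The sizing figures above can be re-derived with a few lines of arithmetic. This is only a check of the slide's numbers; the constant names are illustrative and not taken from any tool.

```python
UPPER_SWITCHES = 16       # A1..A16
LOWER_SWITCHES = 32       # B1..B32
UPLINKS_PER_LOWER = 16    # 100GE NNI links per Bn
UNI_PER_LOWER = 160       # 10GE server links per Bn
LINKS_PER_SERVER = 2      # each server is attached with 2 x 10GE
ECT_COUNT = 16            # one B-VID (and one subnet) per ECT-ALGorithm

switches = UPPER_SWITCHES + LOWER_SWITCHES           # B-MACs in the fabric
nni_links = LOWER_SWITCHES * UPLINKS_PER_LOWER       # 100GE links between layers
fabric_tbps = nni_links * 100 / 1000                 # total NNI capacity in Tb/s
uni_ports = LOWER_SWITCHES * UNI_PER_LOWER           # 10GE server-facing ports
servers = uni_ports // LINKS_PER_SERVER

# Non-blocking check per lower-layer switch: uplink capacity equals UNI capacity.
uplink_gbps = UPLINKS_PER_LOWER * 100
uni_gbps = UNI_PER_LOWER * 10

ufib = ECT_COUNT * switches    # unicast: one entry per (B-VID, B-MAC)
mfib = ECT_COUNT * switches    # multicast: one entry per (subnet, source)

print(f"{switches} switches, {nni_links} x 100GE = {fabric_tbps}T")
print(f"{uni_ports} x 10GE UNI ports, {servers} servers")
print(f"per Bn: {uplink_gbps}G up vs {uni_gbps}G down")
print(f"uFIB {ufib} + mFIB {mfib} = {ufib + mfib} entries per node")
```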
[Figure: tree for ECT-ALG #12 from source node (1); subnets S1 … S16.]
For a given ECT-ALGk, Aj is a member of every SPF-TREE(B*, ECT-ALGk).
Properly tuned, no two ECT-ALGorithms will use the same Aj as a fork point.
Subnet Ni maps to I-SIDj and then to a unique A(j mod 16).
[Figure: fabric A1 … A16 / B1 … B32 carrying I-SIDi and I-SIDj.]
So load spreading allows each Ai to transit a complete subnet.
Problem #1 - Unable to spread further such that Ai and Aj (i != j) each handle a subset of the flows in I-SIDj.
This is an issue under failure of Aj.
[Figure: fabric A1 … A16 / B1 … B32 carrying I-SIDi and I-SIDj.]
Recovery will move the entire subnet's traffic to another Ai node.
A preferable solution is to spread the affected load over the remaining A*.
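The difference between the two recovery behaviours can be sketched numerically. The model below is an assumption-laden toy: one I-SID per A switch, 1000 equal-sized flows per I-SID, a hypothetical failed switch index, and CRC32 standing in for whatever flow hash would really be used. The choice of which survivor inherits a whole I-SID is also arbitrary here; in 802.1aq it would follow from the recomputed trees.

```python
import zlib
from collections import Counter

A_COUNT = 16
FLOWS_PER_ISID = 1000     # hypothetical, equal-sized flows
FAILED = 3                # index of the failed A switch (hypothetical)

survivors = [a for a in range(A_COUNT) if a != FAILED]


def home_switch(isid: int) -> int:
    """I-SID j transits A(j mod 16)."""
    return isid % A_COUNT


def load_after_failure(spread_per_flow: bool) -> Counter:
    """Flows per surviving A switch after the failure."""
    load = Counter()
    for isid in range(A_COUNT):                  # one I-SID per A switch
        for flow in range(FLOWS_PER_ISID):
            a = home_switch(isid)
            if a == FAILED:
                if spread_per_flow:
                    key = f"{isid}:{flow}".encode()
                    a = survivors[zlib.crc32(key) % len(survivors)]
                else:
                    a = survivors[FAILED % len(survivors)]   # whole I-SID moves to one survivor
            load[a] += 1
    return load


whole = load_after_failure(spread_per_flow=False)
spread = load_after_failure(spread_per_flow=True)
print("max flows on one A, whole-subnet move:", max(whole.values()))
print("max flows on one A, per-flow spread:  ", max(spread.values()))
```

In the whole-subnet case one survivor carries double its normal load, while the per-flow case adds only a fraction of the displaced flows to each survivor.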
Possible solution – head end hashing (unicast only)
[Figure: fabric A1 … A16 / B1 … B32 carrying I-SIDi and I-SIDj, with unicast and multicast (Mcast) paths marked.]
Allow unicast I-SIDi and I-SIDj traffic to be hashed, per smaller flow, onto different B-VIDs (ECT-ALGorithms).
This breaks the symmetry and congruence rules but allows edge balancing at a smaller granularity. No changes to multicast.
Requires learning <C-DA, B-DA>, independent of B-VID.
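A minimal sketch of what such a head end might do, under stated assumptions: the `HeadEnd` class, the B-VID numbers 4001..4016, the host names and the use of CRC32 are all hypothetical. It shows the two points above: the learned table is keyed on the customer address only, and known unicast is hashed per flow onto a B-VID while everything else stays on the I-SID's assigned B-VID.

```python
import zlib
from collections import Counter
from typing import Optional

B_VIDS = list(range(4001, 4017))   # 16 hypothetical B-VIDs, one per ECT-ALGorithm


class HeadEnd:
    """Hypothetical ingress edge bridge doing head-end hashing for known unicast."""

    def __init__(self) -> None:
        # Learned <C-DA, B-DA>; the B-VID is deliberately not part of the key.
        self.c_to_b = {}

    def learn(self, c_sa: str, b_sa: str) -> None:
        """Learn from a received frame, whichever B-VID it arrived on."""
        self.c_to_b[c_sa] = b_sa

    def isid_b_vid(self, isid: int) -> int:
        """Per-I-SID assignment, still used for multicast and unknown unicast."""
        return B_VIDS[isid % len(B_VIDS)]

    def forward(self, isid: int, c_sa: str, c_da: str):
        """Return (B-DA, B-VID); B-DA is None when the destination is not known unicast."""
        b_da: Optional[str] = self.c_to_b.get(c_da)
        if b_da is None:
            return None, self.isid_b_vid(isid)
        flow = f"{c_sa}>{c_da}".encode()
        return b_da, B_VIDS[zlib.crc32(flow) % len(B_VIDS)]


edge = HeadEnd()
hosts = [f"host-{n}" for n in range(200)]
for h in hosts:
    edge.learn(h, "B7")            # all hosts sit behind the same remote bridge

used = Counter(edge.forward(5, "client-1", h)[1] for h in hosts)
print("B-VIDs used by unicast flows of one I-SID:", len(used))
print("multicast/unknown stays on B-VID:", edge.forward(5, "client-1", "unknown")[1])
```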
Interconnection of fabrics creates more than 16 paths (exponential)
[Figure: two fabrics, each A1 … A16 over B1 … B32, interconnected through C1 and C2, with path-count annotations O(16), O(16x2) and O(16x2x16).]
The number of paths can grow exponentially with increasing levels.
A constant number of ECT paths is always << the number of paths available in many networks.
Growing 802.1aq ECT to, say, 32 or even 100 ECMP causes larger unicast FIBs.
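A rough calculation makes the mismatch concrete. It takes the fan-out annotations from the figure (16, then x2, then x16) and, as an assumption, reuses the earlier single-fabric rule of one unicast FIB entry per (B-VID, B-MAC) with 48 B-MACs to show how FIB size scales with the number of ECTs.

```python
from math import prod

ECT_COUNT = 16
FANOUT_PER_LEVEL = [16, 2, 16]    # O(16), O(16x2), O(16x2x16) from the figure
B_MACS = 48                       # single-fabric example; an interconnected fabric has more

# Cumulative number of equal-cost paths after each level.
paths = [prod(FANOUT_PER_LEVEL[:n]) for n in range(1, len(FANOUT_PER_LEVEL) + 1)]
for level, count in enumerate(paths, start=1):
    print(f"level {level}: {count} paths, {ECT_COUNT} ECTs cover at most "
          f"{min(1.0, ECT_COUNT / count):.1%}")

# Unicast FIB growth if the number of ECTs is raised instead.
ufib = {ect: ect * B_MACS for ect in (16, 32, 100)}
for ect, entries in ufib.items():
    print(f"{ect} ECTs -> {entries} uFIB entries per node")
```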
Horizontal Growth – not too bad but need more ECT-ALGORITHMS.
[Figure: fabric grown horizontally, with upper layer A1 … A16 plus A17 and lower layer B1 … B32 plus B33 and B34.]
Horizontal growth by 1 just increases the number of ECTs by 1.
Not too big a problem, but we would need to define new ECTs (via Opaque).
General Issue
[Figure: source S to destination D across a network of O(degree) by O(diameter); the head end chooses a path from N x B-VID.]
#paths ~= O(degree^diameter)
So head end ECT in worst case requires O(exp(# B-VIDs))
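To get a feel for the growth, the sketch below counts shortest paths in an idealized network: S and D separated by fully meshed layers of equal width. That topology, the function name `count_paths` and the sample sizes are assumptions for illustration. Since a B-VID is a 12-bit VLAN ID, at most 4094 values are usable, which gives a hard ceiling on pinning one path per B-VID from the head end.

```python
USABLE_VIDS = 4094   # 12-bit VLAN ID space minus the reserved values 0 and 4095


def count_paths(width: int, hops: int) -> int:
    """Count S->D shortest paths through (hops - 1) fully meshed layers of `width` nodes."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    if hops == 1:
        return 1                       # S and D directly connected
    ways = [1] * width                 # paths from S to each node of the first layer
    for _ in range(hops - 2):          # advance one layer at a time
        ways = [sum(ways)] * width
    return sum(ways)                   # final hop into D


for width, hops in [(16, 2), (4, 4), (4, 7), (16, 4)]:
    paths = count_paths(width, hops)
    verdict = "fits in" if paths <= USABLE_VIDS else "exceeds"
    print(f"width {width}, {hops} hops: {paths} paths ({verdict} the B-VID space)")
```

In this model the count is width^(hops - 1), so it is exponential in the diameter, in line with the O(degree^diameter) estimate.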
A feasible solution …
[Figure: S to D over a single B-VID; each hop chooses a path from N x next hop.]
Re-assign traffic to a path at each hop.
Tandem “ECMP”, just like IP.
Need to keep O(degree) next hops.
Only need one B-VID, which removes O(diameter) from the state cost.
Flip side is you have no control: you just hope for a fine-scale statistical distribution.
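The hop-by-hop behaviour can be sketched on a toy topology. Everything specific here is an assumption: the four-layer network, the node names, CRC32 as the hash, and mixing the node name into the hash key as a per-hop seed. Each node keeps only its set of equal-cost next hops toward D and re-hashes the flow to choose among them.

```python
import zlib
from collections import deque

# Hypothetical topology: adjacent layers are fully meshed.
LAYERS = [["S"], ["M1", "M2", "M3"], ["N1", "N2", "N3"], ["D"]]
LINKS = {n: [] for layer in LAYERS for n in layer}
for lower, upper in zip(LAYERS, LAYERS[1:]):
    for a in lower:
        for b in upper:
            LINKS[a].append(b)
            LINKS[b].append(a)


def distances_to(dst: str) -> dict:
    """Hop count from every node to dst (breadth-first search)."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        for peer in LINKS[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def next_hops(node: str, dst: str) -> list:
    """Neighbours of node that are one hop closer to dst: the only state a node keeps."""
    dist = distances_to(dst)
    return sorted(p for p in LINKS[node] if dist[p] == dist[node] - 1)


def forward(src: str, dst: str, flow: bytes) -> list:
    """Walk a frame from src to dst, re-hashing the flow at every hop."""
    path = [src]
    while path[-1] != dst:
        choices = next_hops(path[-1], dst)
        key = flow + path[-1].encode()        # node name as a hypothetical per-hop seed
        path.append(choices[zlib.crc32(key) % len(choices)])
    return path


paths = {tuple(forward("S", "D", f"flow-{n}".encode())) for n in range(200)}
state = max(len(next_hops(n, "D")) for n in LINKS if n != "D")
print(f"{len(paths)} distinct paths used, at most {state} next hops held per node")
```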
What about loops in this mode?
802.1aq Ingress Check is very strong in the case of a single next hop, and hence a single possible ingress for an SA.
802.1aq Ingress Check is weakened in the case of multiple next hops, and hence multiple possible ingresses for an SA.
However, the 802.1aq Agreement Protocol functions correctly in the context of multiple possible next hops for the same B-VID (refer to Mick’s proof).
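The weakening can be seen in a few lines if the ingress check is modelled as a reverse-path check on the source B-MAC, which is an assumption of this sketch. Port names and the helper `ingress_ok` are hypothetical.

```python
def ingress_ok(expected_ports: dict, b_sa: str, in_port: str) -> bool:
    """Accept a frame only if it arrived on a port this node would use to reach B-SA."""
    return in_port in expected_ports.get(b_sa, set())


# Ports on this node that lie on a shortest path toward source bridge "B1".
single_next_hop = {"B1": {"p1"}}                # one next hop: one valid ingress
multi_next_hop = {"B1": {"p1", "p2", "p3"}}     # ECMP: three valid ingresses

# A frame from B1 turning up on p2, e.g. during a transient (hypothetical case).
stray = ("B1", "p2")

print("single next hop accepts stray frame:", ingress_ok(single_next_hop, *stray))
print("multiple next hops accept stray frame:", ingress_ok(multi_next_hop, *stray))
```

The same stray frame is discarded in the first case and accepted in the second, which is why the check alone gives less protection once multiple next hops are allowed.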
But …
Agreement Protocol Concerns
Is it too complex? It is clearly non-trivial; we need implementation/emulation experience.
Is it overly Draconian? For example, the bounds on movement are what is required for a mathematical proof by induction. However, there are probably many cases where further movement would not loop. What is the degree of ‘overkill’?
Is it marketable? This is unfortunately a legitimate concern!
802.1aq can be deployed without AP until we introduce hash-based forwarding, at which point we require a symmetric AP and/or an on-data-path loop detection/drop mechanism.
We believe that an on-data-path loop detection mechanism is required for hash-based ECMP until we have more experience with AP.
Recommend we standardize a TTL TAG either stand-alone or as a
new form of I-TAG.
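How a TTL bounds the damage of a transient loop can be sketched as follows. The tag format is not defined in this presentation, so the sketch models only the behaviour: decrement per hop, drop at zero. The node names, the FIB layout and the initial TTL of 8 are hypothetical.

```python
def send(fib: dict, start: str, b_da: str, ttl: int):
    """Forward hop by hop; return (outcome, hops taken)."""
    node, hops = start, 0
    while node != b_da:
        if ttl == 0:
            return "dropped", hops      # TTL exhausted: discard instead of looping
        ttl -= 1
        node = fib[node][b_da]          # next hop toward the destination B-MAC
        hops += 1
    return "delivered", hops


# Consistent forwarding state: X -> Y -> D.
healthy = {"X": {"D": "Y"}, "Y": {"D": "D"}}
# Transient inconsistency: X and Y point at each other for D.
looping = {"X": {"D": "Y"}, "Y": {"D": "X"}}

print(send(healthy, "X", "D", ttl=8))
print(send(looping, "X", "D", ttl=8))
```

Without the TTL the second frame would circulate until the forwarding state converged; with it, the frame is discarded after a bounded number of hops.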
View of New Work Requirements
R1) New ECT-ALGorithms with improved spreading properties.
R2) Allow optional head-end hash assignment of 802.1aq SPBM UNI known unicast traffic to one of multiple next-hop interfaces/B-VIDs. Very similar to Link Aggregation.
Minimally HASH(seed, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO])
R3) Allow optional tandem hash assignment of 802.1aq SPBM B-VID NNI unicast traffic to one of multiple next-hop interfaces. Essentially a new SPBM ECT-ALG with its own B-VID (i.e. new ECT-ALGorithms, all usable at the same time).
Minimally HASH(seed, B-VID, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO])
R4) Minor OA&M changes in support of R2 and R3, because symmetry/congruence is broken.
R5) More experience with AP (emulations, simulations, etc.), plus addition of a TTL to a new I-TAG or a TTL-TAG.
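The two HASH signatures in R2 and R3 could be realized along these lines. This is a sketch only: the presentation names the input fields but not the hash function, so CRC32, the field encoding, the function names `flow_hash` and `pick`, and the sample addresses are all assumptions. As with Link Aggregation, the point is that every frame of a flow maps to the same choice, so frame order within a flow is preserved.

```python
import zlib
from typing import Optional, Sequence, Tuple


def flow_hash(seed: int, c_sa: str, c_da: str, c_vid: int,
              ip: Optional[Tuple[str, str, int]] = None,
              b_vid: Optional[int] = None) -> int:
    """HASH(seed, [B-VID], C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO])."""
    fields = [seed, b_vid, c_sa, c_da, c_vid]
    if ip is not None:
        fields.extend(ip)                      # optional IP 3-tuple
    key = "|".join(str(f) for f in fields if f is not None)
    return zlib.crc32(key.encode())


def pick(choices: Sequence, **fields):
    """Map a flow onto one of the available choices."""
    return choices[flow_hash(**fields) % len(choices)]


flow = dict(seed=1, c_sa="00:00:5e:00:53:01", c_da="00:00:5e:00:53:02", c_vid=10,
            ip=("192.0.2.1", "192.0.2.2", 6))

# R2: the head end chooses among B-VIDs (hypothetical values).
b_vids = list(range(4001, 4017))
chosen_b_vid = pick(b_vids, **flow)

# R3: a tandem node chooses among next-hop interfaces; the B-VID joins the hash input.
next_hop_ports = ["p1", "p2", "p3", "p4"]
chosen_port = pick(next_hop_ports, b_vid=chosen_b_vid, **flow)

print("R2 head-end choice:", chosen_b_vid)
print("R3 tandem choice:  ", chosen_port)
```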