slides - Computer Science Department

1 PEREGRINE: AN ALL-LAYER-2 CONTAINER COMPUTER NETWORK Tzi-cker Chiueh , Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huan ∗Computer Science Department, Stony Brook University †Industrial Technology Research Institute, Taiwan 1 Outline 2  Motivation Layer 2 + Layer 3 design  Requirements for cloud-scale DC  Problems of Classic Ethernet to the Cloud   Solutions Related Solutions  Peregrine’s Solution   Implementation and Evaluation Software architecture  Performance Evaluation  L2 + L3 Architecture: Problems 3 Bandwidth Bottleneck Problem: Configuration: - Routing table in the routers - IP assignment - DHCP coordination - VLAN and STP Problem: forwarding table size Commodity switch 16-32k Virtual Machine Mobility Constrained to a Physical Location Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System Requirements for Cloud-Scale DC 4  Any-to-any connectivity with non-blocking fabric   Virtual machine mobility   Quick failure detection and recovery Support for Multi-tenancy   Large layer 2 domain Fast fail-over   Scale to more than 10,000 physical nodes Share resources between different customers Load balancing routing  Efficiently use all available links Solution: A Huge L2 Switch! Layer 2 Switch Single L2 Network Non-blocking Backplane Bandwidth However, Ethernet does not scale! Config-free, Plug-and play VM VM VM VM VM VM VM VM VM VM VM VM 5 Linear cost and power scaling ….. Scale to 1 million VMs 5 Revisit Ethernet: Spanning Tree Topology 6 N4 N3 N5 s2 s1 N6 N2 s3 N1 s4 N8 N7 Revisit Ethernet: Spanning Tree Topology 7 N4 N3 N5 D D s1 B s2 R N6 N2 s3 s4 Root N1 N8 N7 Revisit Ethernet: Broadcast and Source Learning 8 N4 N3 N5 D D s1 B s2 R N6 N2 s3 s4 Root N1 N8 N7 Benefit: plug-and-play Ethernet’s Scalability Issues  Limited forwarding table size   STP as a solution to loop prevention    Fail-over latency is high ( > 5 seconds) Broadcast overhead  9 Not all physical links are used No load-sensitive dynamic routing Slow fail-over   Commodity switch: 16k to 64k entries Typical Layer 2 consists of hundreds of hosts Related Works / Solution Strategies 10  Scalability:   Clos network / Fat-tree to scale out Alternative to STP Link aggregation, e.g. LACP, Layer 2 trunking  Routing protocols to layer 2 network   Limited forwarding table size   Packet header encapsulation or re-writing Load balancing  Randomness or traffic engineering approach 11 Design of the Peregrine Peregrine’s Solutions 12  Not all links are used   L2 Loop Prevention   Redirect broadcast and block flooding packet Source learning and forwarding   Disable Spanning Tree protocol calculate all routes for all node-pairs by Route Server Limited switch forwarding table size  Mac-in-Mac two stage forwarding by Dom0 kernel module ARP Intercept and Redirect 13 Directory Service Control flow Data flow DS Route Algorithm Server RAS sw3 1. DS-ARP 2. DS-Reply sw1 A sw4 3. Send data sw2 B Peregrine’s Solutions 14  Not all links are used   L2 Loop Prevention   Disable Spanning Tree protocol Redirect broadcast and block flooding packet Source learning and forwarding Calculate all routes for all node-pairs by Route Server  Fast fail-over: primary and backup routes for each pair   Limited switch forwarding table size  Mac-in-Mac two stage forwarding by Dom0 kernel module Mac-in-Mac Encapsulation 15 Control flow Data flow DS 2. B locates at sw4 sw3 3. sw4 A sw1 sw4 1. ARP redirect 4. Encap sw4 in source mac sw2 DA SA 5. Decap and restore original frame DA Encapsulation IDA sw 4 DA B B SA A SA Decapsulation IDA DA SA Fast Fail-Over 16  Goal: Fail-over latency < 100 msec Application agnostic  TCP timeout: 200ms   Strategy: Pre-compute a primary and backup route for each VM Each VM has two virtual MACs  When a link fails, notify hosts using affected primary routes that they should switch to corresponding backup routes  When a Network Link Fails 17 18 IMPLEMENTATION & Evaluation Software Architecture 19 Review All Components 20 DS ARP request rate How fast can DS handle request? RAS sw3 MIM module A How long can RAS process request? sw1 sw4 ARP redirect sw2 sw5 sw7 sw6 Backup route Performance of: MIM, DS, RAS, switch ? B Mac-In-Mac Performance cdf 21 Time spent for decap/encap/total: 1us / 5us / 7us (2.66GHz CPU) Around 2.66K / 13.3K / 18.6K cycles Aggregate Throughput for Multiple VMs 22 980 970 960 950 No MIM MIM 940 930 920 910 1VM 2VM 4VM 1. APR table size < 1k 2. Measure TCP throughput of 1VM, 2VM, 4VM communicating to each other. ARP Broadcast Rate in a Data Center 23  What’s the ARP traffic rate in real world?  From 2456 hosts, CMU CS department claims that there are 1150 ARP/sec at peak, 89 ARP/sec on average.  From 3800 hosts at university network, there are around 1000 ARP/sec at peak, < 100 ARP/sec on average.   To scale to 1M node, 20K-30K ARP/sec on average. Current optimal DS: 100K ARP/sec Fail-over time and its breakdown 24  Average fail-over time: 75ms  Switch: 25 ~ 45 ms   sending trap (soft unplug) RS: 25ms  400 350 300 250 total 200 RS 150 receiving trap and processing DS 100  DS: 2ms   receiving info from RS and inform DS The rests are network delay and dom0 processing time 50 0 1 2 3 4 5 6 7 8 9 10 11 Conclusion 25    A unified Layer-2-only network for LAN and SAN Centralized control plane and distributed data plane Use only Commodity Ethernet switches     Army of commodity switches vs. few high-port-density switches Requirements on switches: run fast and has programmable routing table Centralized load-balancing routing using real-time traffic matrix Fast fail-over using pre-computed primary/back routes 26 Questions? Thank you Review All Components: Result 27 DS 7us for Packet processing ARP request rate A 100K ARP/sec RS sw3 25ms per request sw1 Link down 35ms ARP redirect sw4 B sw2 sw5 sw7 sw6 Backup route 27 28 Backup slides Thank you OpenFlow Architecture 29    OpenFlow switch: A data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set OpenFlow controller: A control plane that sets up the flow rules in the flow tables of OpenFlow switches OpenFlow protocol: A secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches 30 OpenFlow Controller OpenFlow Protocol (SSL/TCP) Control Path OpenFlow Data Path (Hardware) Conclusion and Contribution 31   Using commodity switches to build a large scale layer 2 network Provide solutions to Ethernet’s scalability issues       Suppressing broadcast Load balancing route calculation Controlling MAC forwarding table Scale up to one million VMs by Mac-in-Mac two stage forwarding Fast fail-over Future work   High Availability of DS and RAS, mater-slave model Inter Comparisons 32  Scalable and available data center fabrics      IEEE 802.1aq: Shortest Path Bridging IETF TRILL Competitors: Cisco, Juniper, Brocade Differences: commodity switches, centralized load balancing routing and proactive backup route deployment Network virtualization    OpenStack Quantum API Competitors: Nicira, NEC Generality carries a steep performance price   Every virtual network link is a tunnel Differences: Simpler and more efficient because it runs on L2 switches directly Three Stage Clos Network (m,n,r) nxm n n n 1 rxr 1 mxn 1 2 2 2 . . . . . . . . . r m r n n n 33 Clos Network Theory 34  Clos(m, n, r) configuration: rn inputs, rn outputs    2r nxm + m rxr switches, less than rn x rn Each rxr switch can in turn be implemented as a 3-stage Clos network Clos(m,n,r) is rearrangeably non-blocking iff m >= n Clos(m,n,r) is stricly non-blocking iff m >= 2n-1 Link Aggregation 35 Logically a single switch, states are synchronized Stacking or Trunking ! " #$ ! " %$ ! " &$ !" ' $ Logically a single link to the upper layer STP views as a non-loop topology ! " #$ ! " &$ !" ' $ ECMP: Equal-Cost Multipath 36 $%# $&# $' # $( # '# Layer 3 ECMP routing Flow hashing to 4 uplinks !" $) # (# $*# $+# $, # %# ! %# … ! "# - %# … - "# . %# … . "# / %# … / "# Pros: multiple links are used, Cons: hash collision, re-converge downstream to a single link Example: Brocade Data Center 37 L3 ECMP Link aggregation Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers PortLand • Scale-out: Three-layer, multi-root topology • Hierarchical, encode location into MAC address • Local Discover Protocol to find shortest path, route by MAC • Fabric Manager maintains IP to MAC mapping • 60-80 ms failover, centrally control and notify 38 VL2: Virtual Layer 2 • Three layer, Clos network • Flat, IP-in-IP, Location address(LA) an Application Address (AA) IP-in-IP • Link-state routing to encapsulation disseminate LA • VLB + flow-based ECMP • Depend on ECMP to detect link failure • Packet interception at S IP-in-IP decapsulation VLB VL2 Directory Service 39 Monsoon • Three layer, multi-root topology • 802.1ah MAC-in-MAC encapsulation, source routing • centralized routing decision • VLB + MAC rotation • Depend on LSA to detect failures • Packet interception at S Monsoon Directory Service IP <-> (server MAC, ToR MAC) 40 TRILL and SPB TRILL • Transparent Interconnect of Lots of Links, IETF • IS-IS as a topology management protocol • Shortest path forwarding • New TRILL header • Transit hash to select nexthop SPB • Shortest Path Bridging, IEEE • IS-IS as a topology management protocol • Shortest path forwarding • 802.1ah MAC-in-MAC • Compute 16 source node based trees 41 TRILL Packet Forwarding Link-state routing TRILL header Ref: NIL Data Communications A-ID: nickname of A C-ID: nickname of C HopC: hop count 42 SPB Packet Forwarding Link-state routing 802.1ah Mac-in-Mac I-SID: Backbone Service Instance Identifier B-VID: backbone VLAN identifier Ref: NIL Data Communications 43 Re-arrangeable non-blocking Clos network Example: 1. Three-stage Clos network 2. Condition: k>=n 3. An unused input at ingress switch can always be connected to an unused output at egress switch 4. Existing calls may have to be rearranged nxk 2x2 (N/n)x(N/n) 3x3 kxn 2x2 input output 2x2 3x3 2x2 ingress 2x2 N=6 n=2 k=2 2x2 middle egress 44 Features of Peregrine network • Utilize all links • Load balancing routing algorithm • Scale up to 1 million VMs – Two stage dual mode forwarding • Fast fail over • Load balancing routing algorithm 45 45 Goal Primary S D Backup • Given a mesh network and traffic profile – Load balance the network resource utilization • Prevent congestion by balancing the network load to support as many traffic load as possible – Provide fast recovery from failure • Provide primary-backup route to minimize recovery time 46 Factors • Only hop count • Hop count and link residual capacity • Hop count, link residual capacity, and link expected load • Hop count, link residual capacity, link expected load and additional forwarding table entries required How to combine them into one number for a particular candidate route? 47 Route Selection: idea A B Leave C-D free S1 D1 Share with S2-D2 C S2 D S2-D2 shares link C-D D2 Which route is better from S1 to D1? Link C-D is more important! Idea: use it as sparsely as possible 48 Route Selection: hop count and Residual capacity Traffic Matrix: S1 -> D1: 1G S2 -> D2: 1G A B Leave C-D free S1 D1 Share with S2-D2 C D S2 D2 Using Hop count or residual capacity makes no difference! 49 Determine Criticality of A Link Determine the importance of a link fl ( s,d) = fraction of all (s, d) routes that pass through link l Expected load of a link at initial state fl = åfl ( s,d )B( s,d ) (s,d ) B( s,d ) = Bandwidth demand matrix for s and d 50 Criticality Example From B to C has four possible routes. 2/4 2/4 2/4 Case2: Calculate l s = B, d = C f ( s,d) 2/4 2/4 Case3: s = A, d = C is similar 2/4 0 A 4/4 B 4/4 C 51 Expected Load Assumption: load is equally distributed over each possible routes between S and D. 10 10 10 Consider bandwidth demand for B-C is 20. Expected Load: fl = åfl ( s,d )B( s,d ) 10 10 10 (s,d ) 0 A 20 20 B C 52 Cost Metrics Cost metric represents the expected load per unit of available capacity on the link 0.01 0.01 A 0.02 fl Rl f l= Expected Load 0.01 0.01 0 cos t(l) = 0.01 Rl = Residual Capacity 0.01 0.02 B C Idea: pick the link with minimum cost 53 Forwarding Table Metric Consider using commodity switch with 16-32k forwarding table size. 300 200 0 Fwd(n) = available forwarding table entries at node n INC_FWD = extra entries needed to route A-C 100 100 A B C Idea: minimize entry consumption, prevent forwarding table from being exhausted 54 Load Balanced Routing • Simulated network – 52 PMs with 4 NICs, total 384 links – Replay 17 multi-VDC 300-second traces • Compare – Random shortest path routing (RSPR) – Full Link Criticality-based routing (FLCR) • Metrics: congestion count – # of links with exceeded capacity • Low additional traffic induced by FLCR 55

slides - Computer Science Department

Related documents

Products

Support

slides - Computer Science Department

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib