1 PEREGRINE: AN ALL-LAYER-2 CONTAINER COMPUTER NETWORK Tzi-cker Chiueh , Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huan ∗Computer Science Department, Stony Brook University †Industrial Technology Research Institute, Taiwan 1 Outline 2 Motivation Layer 2 + Layer 3 design Requirements for cloud-scale DC Problems of Classic Ethernet to the Cloud Solutions Related Solutions Peregrine’s Solution Implementation and Evaluation Software architecture Performance Evaluation L2 + L3 Architecture: Problems 3 Bandwidth Bottleneck Problem: Configuration: - Routing table in the routers - IP assignment - DHCP coordination - VLAN and STP Problem: forwarding table size Commodity switch 16-32k Virtual Machine Mobility Constrained to a Physical Location Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System Requirements for Cloud-Scale DC 4 Any-to-any connectivity with non-blocking fabric Virtual machine mobility Quick failure detection and recovery Support for Multi-tenancy Large layer 2 domain Fast fail-over Scale to more than 10,000 physical nodes Share resources between different customers Load balancing routing Efficiently use all available links Solution: A Huge L2 Switch! Layer 2 Switch Single L2 Network Non-blocking Backplane Bandwidth However, Ethernet does not scale! Config-free, Plug-and play VM VM VM VM VM VM VM VM VM VM VM VM 5 Linear cost and power scaling ….. Scale to 1 million VMs 5 Revisit Ethernet: Spanning Tree Topology 6 N4 N3 N5 s2 s1 N6 N2 s3 N1 s4 N8 N7 Revisit Ethernet: Spanning Tree Topology 7 N4 N3 N5 D D s1 B s2 R N6 N2 s3 s4 Root N1 N8 N7 Revisit Ethernet: Broadcast and Source Learning 8 N4 N3 N5 D D s1 B s2 R N6 N2 s3 s4 Root N1 N8 N7 Benefit: plug-and-play Ethernet’s Scalability Issues Limited forwarding table size STP as a solution to loop prevention Fail-over latency is high ( > 5 seconds) Broadcast overhead 9 Not all physical links are used No load-sensitive dynamic routing Slow fail-over Commodity switch: 16k to 64k entries Typical Layer 2 consists of hundreds of hosts Related Works / Solution Strategies 10 Scalability: Clos network / Fat-tree to scale out Alternative to STP Link aggregation, e.g. LACP, Layer 2 trunking Routing protocols to layer 2 network Limited forwarding table size Packet header encapsulation or re-writing Load balancing Randomness or traffic engineering approach 11 Design of the Peregrine Peregrine’s Solutions 12 Not all links are used L2 Loop Prevention Redirect broadcast and block flooding packet Source learning and forwarding Disable Spanning Tree protocol calculate all routes for all node-pairs by Route Server Limited switch forwarding table size Mac-in-Mac two stage forwarding by Dom0 kernel module ARP Intercept and Redirect 13 Directory Service Control flow Data flow DS Route Algorithm Server RAS sw3 1. DS-ARP 2. DS-Reply sw1 A sw4 3. Send data sw2 B Peregrine’s Solutions 14 Not all links are used L2 Loop Prevention Disable Spanning Tree protocol Redirect broadcast and block flooding packet Source learning and forwarding Calculate all routes for all node-pairs by Route Server Fast fail-over: primary and backup routes for each pair Limited switch forwarding table size Mac-in-Mac two stage forwarding by Dom0 kernel module Mac-in-Mac Encapsulation 15 Control flow Data flow DS 2. B locates at sw4 sw3 3. sw4 A sw1 sw4 1. ARP redirect 4. Encap sw4 in source mac sw2 DA SA 5. Decap and restore original frame DA Encapsulation IDA sw 4 DA B B SA A SA Decapsulation IDA DA SA Fast Fail-Over 16 Goal: Fail-over latency < 100 msec Application agnostic TCP timeout: 200ms Strategy: Pre-compute a primary and backup route for each VM Each VM has two virtual MACs When a link fails, notify hosts using affected primary routes that they should switch to corresponding backup routes When a Network Link Fails 17 18 IMPLEMENTATION & Evaluation Software Architecture 19 Review All Components 20 DS ARP request rate How fast can DS handle request? RAS sw3 MIM module A How long can RAS process request? sw1 sw4 ARP redirect sw2 sw5 sw7 sw6 Backup route Performance of: MIM, DS, RAS, switch ? B Mac-In-Mac Performance cdf 21 Time spent for decap/encap/total: 1us / 5us / 7us (2.66GHz CPU) Around 2.66K / 13.3K / 18.6K cycles Aggregate Throughput for Multiple VMs 22 980 970 960 950 No MIM MIM 940 930 920 910 1VM 2VM 4VM 1. APR table size < 1k 2. Measure TCP throughput of 1VM, 2VM, 4VM communicating to each other. ARP Broadcast Rate in a Data Center 23 What’s the ARP traffic rate in real world? From 2456 hosts, CMU CS department claims that there are 1150 ARP/sec at peak, 89 ARP/sec on average. From 3800 hosts at university network, there are around 1000 ARP/sec at peak, < 100 ARP/sec on average. To scale to 1M node, 20K-30K ARP/sec on average. Current optimal DS: 100K ARP/sec Fail-over time and its breakdown 24 Average fail-over time: 75ms Switch: 25 ~ 45 ms sending trap (soft unplug) RS: 25ms 400 350 300 250 total 200 RS 150 receiving trap and processing DS 100 DS: 2ms receiving info from RS and inform DS The rests are network delay and dom0 processing time 50 0 1 2 3 4 5 6 7 8 9 10 11 Conclusion 25 A unified Layer-2-only network for LAN and SAN Centralized control plane and distributed data plane Use only Commodity Ethernet switches Army of commodity switches vs. few high-port-density switches Requirements on switches: run fast and has programmable routing table Centralized load-balancing routing using real-time traffic matrix Fast fail-over using pre-computed primary/back routes 26 Questions? Thank you Review All Components: Result 27 DS 7us for Packet processing ARP request rate A 100K ARP/sec RS sw3 25ms per request sw1 Link down 35ms ARP redirect sw4 B sw2 sw5 sw7 sw6 Backup route 27 28 Backup slides Thank you OpenFlow Architecture 29 OpenFlow switch: A data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set OpenFlow controller: A control plane that sets up the flow rules in the flow tables of OpenFlow switches OpenFlow protocol: A secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches 30 OpenFlow Controller OpenFlow Protocol (SSL/TCP) Control Path OpenFlow Data Path (Hardware) Conclusion and Contribution 31 Using commodity switches to build a large scale layer 2 network Provide solutions to Ethernet’s scalability issues Suppressing broadcast Load balancing route calculation Controlling MAC forwarding table Scale up to one million VMs by Mac-in-Mac two stage forwarding Fast fail-over Future work High Availability of DS and RAS, mater-slave model Inter Comparisons 32 Scalable and available data center fabrics IEEE 802.1aq: Shortest Path Bridging IETF TRILL Competitors: Cisco, Juniper, Brocade Differences: commodity switches, centralized load balancing routing and proactive backup route deployment Network virtualization OpenStack Quantum API Competitors: Nicira, NEC Generality carries a steep performance price Every virtual network link is a tunnel Differences: Simpler and more efficient because it runs on L2 switches directly Three Stage Clos Network (m,n,r) nxm n n n 1 rxr 1 mxn 1 2 2 2 . . . . . . . . . r m r n n n 33 Clos Network Theory 34 Clos(m, n, r) configuration: rn inputs, rn outputs 2r nxm + m rxr switches, less than rn x rn Each rxr switch can in turn be implemented as a 3-stage Clos network Clos(m,n,r) is rearrangeably non-blocking iff m >= n Clos(m,n,r) is stricly non-blocking iff m >= 2n-1 Link Aggregation 35 Logically a single switch, states are synchronized Stacking or Trunking ! " #$ ! " %$ ! " &$ !" ' $ Logically a single link to the upper layer STP views as a non-loop topology ! " #$ ! " &$ !" ' $ ECMP: Equal-Cost Multipath 36 $%# $&# $' # $( # '# Layer 3 ECMP routing Flow hashing to 4 uplinks !" $) # (# $*# $+# $, # %# ! %# … ! "# - %# … - "# . %# … . "# / %# … / "# Pros: multiple links are used, Cons: hash collision, re-converge downstream to a single link Example: Brocade Data Center 37 L3 ECMP Link aggregation Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers PortLand • Scale-out: Three-layer, multi-root topology • Hierarchical, encode location into MAC address • Local Discover Protocol to find shortest path, route by MAC • Fabric Manager maintains IP to MAC mapping • 60-80 ms failover, centrally control and notify 38 VL2: Virtual Layer 2 • Three layer, Clos network • Flat, IP-in-IP, Location address(LA) an Application Address (AA) IP-in-IP • Link-state routing to encapsulation disseminate LA • VLB + flow-based ECMP • Depend on ECMP to detect link failure • Packet interception at S IP-in-IP decapsulation VLB VL2 Directory Service 39 Monsoon • Three layer, multi-root topology • 802.1ah MAC-in-MAC encapsulation, source routing • centralized routing decision • VLB + MAC rotation • Depend on LSA to detect failures • Packet interception at S Monsoon Directory Service IP <-> (server MAC, ToR MAC) 40 TRILL and SPB TRILL • Transparent Interconnect of Lots of Links, IETF • IS-IS as a topology management protocol • Shortest path forwarding • New TRILL header • Transit hash to select nexthop SPB • Shortest Path Bridging, IEEE • IS-IS as a topology management protocol • Shortest path forwarding • 802.1ah MAC-in-MAC • Compute 16 source node based trees 41 TRILL Packet Forwarding Link-state routing TRILL header Ref: NIL Data Communications A-ID: nickname of A C-ID: nickname of C HopC: hop count 42 SPB Packet Forwarding Link-state routing 802.1ah Mac-in-Mac I-SID: Backbone Service Instance Identifier B-VID: backbone VLAN identifier Ref: NIL Data Communications 43 Re-arrangeable non-blocking Clos network Example: 1. Three-stage Clos network 2. Condition: k>=n 3. An unused input at ingress switch can always be connected to an unused output at egress switch 4. Existing calls may have to be rearranged nxk 2x2 (N/n)x(N/n) 3x3 kxn 2x2 input output 2x2 3x3 2x2 ingress 2x2 N=6 n=2 k=2 2x2 middle egress 44 Features of Peregrine network • Utilize all links • Load balancing routing algorithm • Scale up to 1 million VMs – Two stage dual mode forwarding • Fast fail over • Load balancing routing algorithm 45 45 Goal Primary S D Backup • Given a mesh network and traffic profile – Load balance the network resource utilization • Prevent congestion by balancing the network load to support as many traffic load as possible – Provide fast recovery from failure • Provide primary-backup route to minimize recovery time 46 Factors • Only hop count • Hop count and link residual capacity • Hop count, link residual capacity, and link expected load • Hop count, link residual capacity, link expected load and additional forwarding table entries required How to combine them into one number for a particular candidate route? 47 Route Selection: idea A B Leave C-D free S1 D1 Share with S2-D2 C S2 D S2-D2 shares link C-D D2 Which route is better from S1 to D1? Link C-D is more important! Idea: use it as sparsely as possible 48 Route Selection: hop count and Residual capacity Traffic Matrix: S1 -> D1: 1G S2 -> D2: 1G A B Leave C-D free S1 D1 Share with S2-D2 C D S2 D2 Using Hop count or residual capacity makes no difference! 49 Determine Criticality of A Link Determine the importance of a link fl ( s,d) = fraction of all (s, d) routes that pass through link l Expected load of a link at initial state fl = åfl ( s,d )B( s,d ) (s,d ) B( s,d ) = Bandwidth demand matrix for s and d 50 Criticality Example From B to C has four possible routes. 2/4 2/4 2/4 Case2: Calculate l s = B, d = C f ( s,d) 2/4 2/4 Case3: s = A, d = C is similar 2/4 0 A 4/4 B 4/4 C 51 Expected Load Assumption: load is equally distributed over each possible routes between S and D. 10 10 10 Consider bandwidth demand for B-C is 20. Expected Load: fl = åfl ( s,d )B( s,d ) 10 10 10 (s,d ) 0 A 20 20 B C 52 Cost Metrics Cost metric represents the expected load per unit of available capacity on the link 0.01 0.01 A 0.02 fl Rl f l= Expected Load 0.01 0.01 0 cos t(l) = 0.01 Rl = Residual Capacity 0.01 0.02 B C Idea: pick the link with minimum cost 53 Forwarding Table Metric Consider using commodity switch with 16-32k forwarding table size. 300 200 0 Fwd(n) = available forwarding table entries at node n INC_FWD = extra entries needed to route A-C 100 100 A B C Idea: minimize entry consumption, prevent forwarding table from being exhausted 54 Load Balanced Routing • Simulated network – 52 PMs with 4 NICs, total 384 links – Replay 17 multi-VDC 300-second traces • Compare – Random shortest path routing (RSPR) – Full Link Criticality-based routing (FLCR) • Metrics: congestion count – # of links with exceeded capacity • Low additional traffic induced by FLCR 55