Data Center Networking
CS 6250: Computer Networking, Fall 2011

Cloud Computing
• Elastic resources
  – Expand and contract resources
  – Pay-per-use
  – Infrastructure on demand
• Multi-tenancy
  – Multiple independent users
  – Security and resource isolation
  – Amortize the cost of the (shared) infrastructure
• Flexible service management
  – Resiliency: isolate failures of servers and storage
  – Workload movement: move work to other locations

Data Center Trends
(Sources: J. Nicholas Hoover, InformationWeek, June 17, 2008; http://gigaom.com/cloud/google-builds-megadata-center-in-finland/ – a 200 million Euro facility; http://www.datacenterknowledge.com)
• Data centers will keep getting larger in the cloud computing era to benefit from economies of scale, growing from tens of thousands to hundreds of thousands of servers.
• The most important concerns in data center management:
  – Economies of scale
  – High utilization of equipment
  – Maximizing revenue
  – Amortizing administration cost
  – Low power consumption (http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf)

Cloud Service Models
• Software as a Service
  – Provider licenses applications to users as a service
  – e.g., customer relationship management, email, …
  – Avoids costs of installation, maintenance, patches, …
• Platform as a Service
  – Provider offers a software platform for building applications
  – e.g., Google’s App Engine
  – Avoids worrying about the scalability of the platform
• Infrastructure as a Service
  – Provider offers raw computing, storage, and network
  – e.g., Amazon’s Elastic Compute Cloud (EC2)
  – Avoids buying servers and estimating resource needs

Multi-Tier Applications
• Applications consist of tasks
  – Many separate components
  – Running on different machines
• Commodity computers
  – Many general-purpose computers
  – Not one big mainframe
  – Easier scaling
[Figure: a front-end server feeding a tree of aggregators, which fan out to workers]

Enabling Technology: Virtualization
• Multiple virtual machines on one physical machine
• Applications run unmodified, as on a real machine
• A VM can migrate from one computer to another

Status Quo: Virtual Switch in Server

Top-of-Rack Architecture
• Rack of servers
  – Commodity servers
  – And a top-of-rack switch
• Modular design
  – Preconfigured racks
  – Power, network, and storage cabling
• Aggregate to the next level

Modularity
• Containers
• Many containers

Data Center Challenges
• Traffic load balancing
• Support for VM migration
• Achieving bisection bandwidth
• Power savings / cooling
• Network management (provisioning)
• Security (dealing with multiple tenants)

Data Center Costs (Monthly Costs)
• Servers: 45%
  – CPU, memory, disk
• Infrastructure: 25%
  – UPS, cooling, power distribution
• Power draw: 15%
  – Electrical utility costs
• Network: 15%
  – Switches, links, transit
(http://perspectives.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDataCenters.aspx)

Common Data Center Topology
[Figure: Internet at the top; a Core tier of Layer-3 routers; an Aggregation tier of Layer-2/3 switches; an Access tier of Layer-2 switches; servers at the bottom]

Data Center Network Topology
[Figure: Internet connected to Core Routers (CR), which connect to Access Routers (AR), which connect to Ethernet Switches (S) and racks of servers; roughly 1,000 servers per pod]
Key: CR = Core Router, AR = Access Router, S = Ethernet Switch, A = Rack of app. servers
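To make the capacity limits of this conventional tree concrete, the short sketch below (not from the slides; all port counts and link speeds are made-up examples) computes the oversubscription ratio at each tier, i.e., how much aggregate downstream bandwidth exceeds uplink bandwidth.

```python
# Illustrative sketch (not from the slides): estimate the oversubscription
# ratio at each tier of a conventional three-tier data center topology.
# All port counts and link speeds below are hypothetical examples.

def oversubscription(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Ratio of aggregate downstream capacity to upstream capacity at a switch."""
    return (downlink_count * downlink_gbps) / (uplink_count * uplink_gbps)

# Access (top-of-rack) switch: 40 servers at 1 Gbps, 2 x 10 Gbps uplinks.
tor = oversubscription(40, 1, 2, 10)             # 2:1

# Aggregation switch: 20 ToR uplinks at 10 Gbps, 4 x 10 Gbps uplinks to the core.
aggregation = oversubscription(20, 10, 4, 10)    # 5:1

# End-to-end oversubscription compounds across tiers.
print(f"ToR {tor:.0f}:1, aggregation {aggregation:.0f}:1, "
      f"combined {tor * aggregation:.0f}:1")
```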
Requirements for Future Data Centers
• To keep up with the trend toward mega data centers, data center network (DCN) technology should meet the following requirements:
  – High scalability
  – Transparent VM migration (high agility)
  – Easy deployment, requiring little human administration
  – Efficient communication
  – Loop-free forwarding
  – Fault tolerance
• Current DCN technology cannot meet these requirements.
  – Layer-3 protocols cannot support transparent VM migration.
  – Current layer-2 protocols are not scalable, due to the size of the forwarding table and the native broadcasting used for address resolution.

Problems with Common Topologies
• Single point of failure
• Oversubscription of links higher up in the topology
• Tradeoff between cost and provisioning

Capacity Mismatch
[Figure: the conventional topology annotated with typical oversubscription ratios – roughly 5:1 at the ToR, 40:1 at aggregation, and 200:1 at the core]

Data-Center Routing
[Figure: Internet connected to Core Routers and Access Routers (DC Layer 3), which connect to Ethernet Switches and racks of servers (DC Layer 2); roughly 1,000 servers per pod, which equals one IP subnet]
Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app. servers

Reminder: Layer 2 vs. Layer 3
• Ethernet switching (layer 2)
  – Cheaper switch equipment
  – Fixed addresses and auto-configuration
  – Seamless mobility, migration, and failover
• IP routing (layer 3)
  – Scalability through hierarchical addressing
  – Efficiency through shortest-path routing
  – Multipath routing through equal-cost multipath
• So, as in enterprises, data centers often connect layer-2 islands with IP routers

Need for Layer 2
• Certain monitoring apps require servers with the same role to be on the same VLAN
• Using the same IP on dual-homed servers
• Allows organic growth of server farms
• Migration is easier

Review of Layer 2 & Layer 3
• Layer 2
  – One spanning tree for the entire network
  – Prevents loops, but ignores alternate paths
• Layer 3
  – Shortest-path routing between source and destination
  – Best-effort delivery

Fat-Tree-Based Solution
• Connect end hosts together using a fat-tree topology
  – Infrastructure consists of cheap devices
  – Each port supports the same speed as an end host
  – All devices can transmit at line speed if packets are distributed along the existing paths
  – A k-port fat tree can support k³/4 hosts

Fat-Tree Topology
[Figure: a fat-tree built from identical commodity switches, with core, aggregation, and edge layers]

Problems with a Vanilla Fat-Tree
• Layer 3 will use only one of the existing equal-cost paths
• Packet reordering occurs if layer 3 blindly takes advantage of path diversity

Modified Fat Tree
• Enforce a special addressing scheme in the data center
  – Allows hosts attached to the same switch to route only through that switch
  – Allows intra-pod traffic to stay within the pod
  – Addresses of the form unused.PodNumber.SwitchNumber.EndHost
• Use two-level lookups to distribute traffic and maintain packet ordering

Two-Level Lookups (an illustrative sketch follows the fat-tree slides below)
• First level is a prefix lookup
  – Used to route down the topology to the end host
• Second level is a suffix lookup
  – Used to route up towards the core
  – Diffuses and spreads out traffic
  – Maintains packet ordering by using the same ports for the same end host

Diffusion Optimizations
• Flow classification
  – Eliminates local congestion
  – Assigns traffic to ports on a per-flow basis instead of a per-host basis
• Flow scheduling
  – Eliminates global congestion
  – Prevents long-lived flows from sharing the same links
  – Assigns long-lived flows to different links

Drawbacks
• No inherent support for VLAN traffic
• Data center is fixed in size
• Ignores connectivity to the Internet
• Wastes address space
  – Requires NAT at the border
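The following sketch illustrates the two-level lookup described above. It is only a simplified illustration: the prefix and suffix tables, addresses, and port numbers are assumed examples, not the exact tables from the fat-tree paper.

```python
# Illustrative sketch of a two-level routing table on a pod switch.
# Level 1 matches a destination prefix to route *down* toward a known subnet;
# if no prefix matches, level 2 matches a suffix on the host byte to pick an
# *uplink* port, so all packets to the same destination host take the same
# upward path (preserving packet ordering while spreading traffic).

PREFIX_TABLE = {            # destination /24 prefix -> downward output port
    "10.2.0.0/24": 0,       # hosts under edge switch 0 in pod 2
    "10.2.1.0/24": 1,       # hosts under edge switch 1 in pod 2
}

UPLINK_PORTS = [2, 3]       # core-facing ports on this switch

def lookup(dst_ip: str) -> int:
    octets = [int(x) for x in dst_ip.split(".")]
    # Level 1: match the destination's /24 subnet (route down the tree).
    subnet = f"{octets[0]}.{octets[1]}.{octets[2]}.0/24"
    if subnet in PREFIX_TABLE:
        return PREFIX_TABLE[subnet]
    # Level 2: suffix match on the host ID (route up the tree); the same
    # destination host always maps to the same uplink.
    return UPLINK_PORTS[octets[3] % len(UPLINK_PORTS)]

print(lookup("10.2.1.7"))   # prefix match  -> routed down on port 1
print(lookup("10.3.0.3"))   # no prefix hit -> suffix picks uplink port 3
```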
Data Center Traffic Engineering: Challenges and Opportunities

Wide-Area Network
[Figure: multiple data centers, each with servers behind a router, reached by clients over the Internet; a DNS server performs DNS-based site selection]

Wide-Area Network: Ingress Proxies
[Figure: the same setup, but client traffic enters through ingress proxies in front of the data centers]

Traffic Engineering Challenges
• Scale
  – Many switches, hosts, and virtual machines
• Churn
  – Large number of component failures
  – Virtual machine (VM) migration
• Traffic characteristics
  – High traffic volume and dense traffic matrix
  – Volatile, unpredictable traffic patterns
• Performance requirements
  – Delay-sensitive applications
  – Resource isolation between tenants

Traffic Engineering Opportunities
• Efficient network
  – Low propagation delay and high capacity
• Specialized topology
  – Fat tree, Clos network, etc.
  – Opportunities for hierarchical addressing
• Control over both network and hosts
  – Joint optimization of routing and server placement
  – Can move network functionality into the end host
• Flexible movement of workload
  – Services replicated at multiple servers and data centers
  – Virtual machine (VM) migration

PortLand: Main Idea
Key features:
• A layer-2 protocol based on a tree topology
• Pseudo MACs (PMACs) encode position information (a PMAC encoding sketch follows the discussion points below)
• Data forwarding proceeds based on PMACs
• The edge switch is responsible for the mapping between PMAC and AMAC (the host's actual MAC)
• The fabric manager is responsible for address resolution
• The edge switch makes PMACs invisible to end hosts
• Each switch can identify its own position by itself
• The fabric manager keeps information about the overall topology; when a fault occurs, it notifies the affected nodes
[Figures: adding a new host; transferring a packet]

Questions from the Discussion Board
• Questions about the fabric manager
  o How can the fabric manager be made robust?
  o How can a scalable fabric manager be built?
  o Redundant deployment, or a cluster of fabric managers, could be a solution.
• Question about the realism of the baseline tree topology
  o Is the tree topology common in the real world?
  o Yes, the multi-rooted tree has been a traditional data center topology. [A Scalable, Commodity Data Center Network Architecture, Mohammad Al-Fares et al., SIGCOMM '08]
• Discussion points
  o Is PortLand applicable to other topologies? The idea of centralized ARP management could be; to avoid forwarding loops, a TRILL-like header would be necessary.
  o What are the benefits of PMACs? Without hierarchical PMACs, switches would need much larger forwarding tables.
• Questions about the benefits of VM migration
  o VM migration helps reduce traffic going through the aggregation/core switches.
  o What about changes in user requirements?
  o What about power consumption?
• Etc.
  o Feasibility of mimicking PortLand with a layer-3 protocol
  o What about using pseudo IP addresses?
  o Delay to boot up the whole data center
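As a rough illustration of how a PMAC encodes position, the sketch below packs pod, position, port, and VM ID fields into a 48-bit pseudo MAC. The 16/8/8/16-bit split follows the encoding described in the PortLand paper; the example AMAC and mapping are made up.

```python
# Illustrative sketch of a PortLand-style PMAC: the 48-bit pseudo MAC encodes
# a host's location as pod.position.port.vmid. Edge switches rewrite between
# a host's actual MAC (AMAC) and its PMAC, so hosts never see PMACs.

def make_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

def parse_pmac(pmac: str) -> dict:
    value = int.from_bytes(bytes.fromhex(pmac.replace(":", "")), "big")
    return {
        "pod": value >> 32,
        "position": (value >> 24) & 0xFF,
        "port": (value >> 16) & 0xFF,
        "vmid": value & 0xFFFF,
    }

# Edge-switch state: AMAC <-> PMAC mapping, also reported to the fabric
# manager so it can answer ARP requests without broadcasting.
amac_to_pmac = {"00:19:b9:fa:88:e2": make_pmac(pod=2, position=1, port=0, vmid=1)}

pmac = amac_to_pmac["00:19:b9:fa:88:e2"]
print(pmac, parse_pmac(pmac))   # 00:02:01:00:00:01 -> pod 2, position 1, port 0, vmid 1
```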
Status Quo: Conventional DC Network
[Figure: Internet connected to Core Routers and Access Routers (DC Layer 3), which connect to Ethernet Switches and racks of servers (DC Layer 2); roughly 1,000 servers per pod, which equals one IP subnet]
Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app. servers
Reference: "Data Center: Load Balancing Data Center Services", Cisco, 2004

Conventional DC Network Problems
[Figure: the same topology annotated with oversubscription ratios of roughly 200:1 at the core, 40:1 at aggregation, and 5:1 at the ToR]
• Dependence on high-cost proprietary routers
• Extremely limited server-to-server capacity

And More Problems …
[Figure: the topology partitioned into IP subnets (VLANs) #1 and #2, with roughly 200:1 oversubscription between them; growing a service across subnets requires complicated manual L2/L3 re-configuration]
• Resource fragmentation, significantly lowering utilization (and cost-efficiency)
• Complicated manual L2/L3 re-configuration

All We Need Is Just a Huge L2 Switch, or an Abstraction of One
[Figure: the conventional core/aggregation/access hierarchy collapsed into the abstraction of one big layer-2 switch connecting all servers]

VL2 Approach
• Layer-2 based, using future commodity switches
• Two levels of switch hierarchy:
  – access switches (top of rack)
  – load-balancing switches
• Eliminate spanning tree
  – Flat routing
  – Allows the network to take advantage of path diversity
• Prevent MAC address learning
  – 4D architecture to distribute data-plane information
  – ToR switches only need to learn addresses of the intermediate switches
  – Core switches only need to learn addresses of the ToR switches
• Support efficient grouping of hosts (a VLAN replacement)

VL2
[Figure: the VL2 network architecture]

VL2 Components
• Top-of-rack (ToR) switch
  – Aggregates traffic from the 20 end hosts in a rack
  – Performs IP-to-MAC translation
• Intermediate switch
  – Disperses traffic and balances it among switches
  – Used for Valiant Load Balancing
• Decision element
  – Places routes in switches
  – Maintains a directory service of IP-to-MAC mappings
• End host
  – Performs IP-to-MAC lookups

Routing in VL2
• The end host checks its flow cache for the MAC of the flow
  – If not found, it asks the agent to resolve it
  – The agent returns a list of MACs for the server and MACs for the intermediate switches
• Traffic is sent to the top-of-rack switch
  – Traffic is triple-encapsulated
• Traffic is sent to an intermediate switch
• Traffic is sent to the top-of-rack switch of the destination

The Illusion of a Huge L2 Switch
1. L2 semantics
2. Uniform high capacity
3. Performance isolation

Objectives and Solutions
• Objective 1: Layer-2 semantics
  – Approach: employ flat addressing
  – Solution: name-location separation and a resolution service
• Objective 2: Uniform high capacity between servers
  – Approach: guarantee bandwidth for hose-model traffic
  – Solution: flow-based random traffic indirection (Valiant Load Balancing)
• Objective 3: Performance isolation
  – Approach: enforce the hose model using existing mechanisms only
  – Solution: TCP

Name/Location Separation
• Copes with host churn with very little overhead
• Switches run link-state routing and maintain only the switch-level topology
  – Allows the use of low-cost switches
  – Protects the network and hosts from host-state churn
  – Obviates host and switch reconfiguration
• Servers use flat names
[Figure: a directory service resolves flat server names to their current ToR switches (e.g., x → ToR2, y → ToR3, z → ToR4); senders look up the mapping and encapsulate the payload toward the destination ToR, and the mapping is updated when a VM migrates]

Clos Network Topology
• Offers huge aggregate capacity and multiple paths at modest cost
[Figure: intermediate switches, K aggregation switches with D 10G ports each, ToR switches, and 20 servers per ToR; the topology supports 20*(DK/4) servers]
  D (# of 10G ports)   Max DC size (# of servers)
  48                   11,520
  96                   46,080
  144                  103,680

Valiant Load Balancing: Indirection
• Copes with arbitrary traffic matrices (TMs) with very little overhead [ECMP + IP anycast]
  – Harness huge bisection bandwidth
  – Obviate esoteric traffic engineering or optimization
  – Ensure robustness to failures
  – Work with switch mechanisms available today
• Requirements: (1) must spread traffic, (2) must ensure destination independence
• Mechanism: Equal-Cost Multi-Path (ECMP) forwarding
[Figure: senders encapsulate packets to an anycast address (IANY) shared by all intermediate switches; packets bounce off a random intermediate switch on the up paths, then take the down paths to the destination ToR]
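The sketch below illustrates the flow-level indirection idea: each flow is hashed to one intermediate switch and the packet is encapsulated toward it first. In VL2 this is realized with a single anycast address plus ECMP in the switches; here the random choice is made explicit in host code, and all names and addresses are hypothetical.

```python
# Illustrative sketch of VL2-style Valiant Load Balancing at the sending host.
import hashlib

INTERMEDIATE_SWITCHES = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def pick_intermediate(flow_5tuple: tuple) -> str:
    """Hash the flow so every packet of the flow bounces off the same
    intermediate switch, avoiding packet reordering within a flow."""
    digest = hashlib.sha256(str(flow_5tuple).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(INTERMEDIATE_SWITCHES)
    return INTERMEDIATE_SWITCHES[index]

def encapsulate(payload: bytes, dst_tor: str, flow_5tuple: tuple) -> dict:
    """Outer header targets the intermediate switch; inner header targets the
    destination ToR, which decapsulates and delivers to the server."""
    return {
        "outer_dst": pick_intermediate(flow_5tuple),
        "inner_dst": dst_tor,
        "payload": payload,
    }

flow = ("10.1.2.3", "10.4.5.6", 6, 49152, 80)   # src, dst, proto, sport, dport
print(encapsulate(b"GET / HTTP/1.1", dst_tor="10.0.1.7", flow_5tuple=flow))
```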
Properties of Desired Solutions
• Backwards compatible with existing infrastructure
  – No changes to applications
  – Support for layer 2 (Ethernet)
• Cost-effective
  – Low power consumption and heat emission
  – Cheap infrastructure
• Allows host communication at line speed
• Zero configuration
• No loops
• Fault-tolerant

Research Questions
• What topology to use in data centers?
  – Reducing wiring complexity
  – Achieving high bisection bandwidth
  – Exploiting capabilities of optics and wireless
• Routing architecture?
  – Flat layer-2 network vs. hybrid switch/router
  – Flat vs. hierarchical addressing
• How to perform traffic engineering?
  – Over-engineering vs. adapting to load
  – Server selection, VM placement, or optimizing routing
• Virtualization of NICs, servers, switches, …

Research Questions (continued)
• Rethinking TCP congestion control?
  – Low propagation delay and high bandwidth
  – The "incast" problem, leading to bursty packet loss
• Division of labor for traffic engineering, access control, …
  – VM, hypervisor, ToR, and core switches/routers
• Reducing energy consumption
  – Better load balancing vs. selectively shutting equipment down
• Wide-area traffic engineering
  – Selecting the least-loaded or closest data center
• Security
  – Preventing information leakage and attacks