Symbiotic Routing in Future Data Centers
Hussam Abu-Libdeh (Cornell University), Paolo Costa, Antony Rowstron, Greg O’Shea, Austin Donnelly (Microsoft Research Cambridge)

Data center networking
• Network principles evolved from Internet systems
  • Multiple administrative domains
  • Heterogeneous environment
• But data centers are different
  • A single administrative domain
  • Total control over all operational aspects
• Re-examine the network in this new setting

Rethinking DC networks
• New proposals for data center network architectures
  • DCell, BCube, Fat-tree, VL2, PortLand, ...
• They target TCO, scalability, bandwidth, modular design, graceful degradation, fault tolerance, performance isolation, and commodity components
• But the network interface has not changed!

Challenge
• The network is a black box to applications
  • Must infer network properties: locality, congestion, failure, etc.
  • Little or no control over routing
• Applications are a black box to the network
  • Must infer flow properties, e.g. for traffic engineering (Hedera)
• In consequence
  • Today’s data centers and proposals use a single routing protocol
  • Routing trade-offs are made in an application-agnostic way (e.g. latency vs. throughput)

CamCube
• A new data center design
  • Nodes are commodity x86 servers with local storage
  • Container-based model: 1,500-2,500 servers
• Direct-connect 3D torus topology
  • Six Ethernet ports per server
  • Servers have (x, y, z) coordinates, e.g. (0,2,0), which define a coordinate space
• Simple 1-hop API
  • Send/receive packets to/from 1-hop neighbours
  • Not using TCP/IP
• Everything is a service
  • Runs on all servers
• Multi-hop routing is itself a service
  • Simple link-state protocol
  • Routes packets along shortest paths from source to destination (a coordinate/next-hop sketch follows this overview)

Development experience
• Built many data center services on CamCube, e.g.
  • High-throughput transport service
    • Desired property: high throughput
  • Large-file multicast service
    • Desired property: low link load
  • Aggregation service
    • Desired property: distribute computation load over servers
  • Distributed object cache service
    • Desired property: per-key caches, low path stretch

Per-service routing protocols
• Higher flexibility
• Services optimize for different objectives
  • High-throughput transport → disjoint paths (increases throughput)
  • File multicast → non-disjoint paths (decreases network load)
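The base routing service and the per-service protocols above all operate over the same (x, y, z) coordinate space. As a rough illustration only (not the CamCube API), the C# sketch below shows how a server could compute shortest-path hop counts on a 3D torus and enumerate the 1-hop neighbours that lie on some shortest path towards a destination; the Coord type, the dimension size N, and all other names are assumptions.

    // Hypothetical sketch (not the CamCube API): 3D torus coordinates and the
    // neighbours that lie on a shortest path towards a destination.
    using System;
    using System.Collections.Generic;

    struct Coord
    {
        public int X, Y, Z;
        public Coord(int x, int y, int z) { X = x; Y = y; Z = z; }
    }

    static class TorusRouting
    {
        const int N = 20;   // assumed servers per dimension (the simulated cube is 20x20x20, the testbed 3x3x3)

        // Signed displacement along one ring, taking the shorter way around.
        static int Delta(int from, int to)
        {
            int d = ((to - from) % N + N) % N;      // 0 .. N-1 going "up"
            return d <= N / 2 ? d : d - N;          // negative means go "down"
        }

        // Hop count of a shortest path between two coordinates.
        public static int HopDistance(Coord a, Coord b) =>
            Math.Abs(Delta(a.X, b.X)) + Math.Abs(Delta(a.Y, b.Y)) + Math.Abs(Delta(a.Z, b.Z));

        // 1-hop neighbours of 'here' that stay on some shortest path to 'dest'.
        // A base routing protocol could pick any of these; a custom protocol
        // could prefer particular ones.
        public static List<Coord> ShortestPathNextHops(Coord here, Coord dest)
        {
            var hops = new List<Coord>();
            int dx = Delta(here.X, dest.X), dy = Delta(here.Y, dest.Y), dz = Delta(here.Z, dest.Z);
            if (dx != 0) hops.Add(new Coord((here.X + Math.Sign(dx) + N) % N, here.Y, here.Z));
            if (dy != 0) hops.Add(new Coord(here.X, (here.Y + Math.Sign(dy) + N) % N, here.Z));
            if (dz != 0) hops.Add(new Coord(here.X, here.Y, (here.Z + Math.Sign(dz) + N) % N));
            return hops;
        }
    }

A custom protocol that wants disjoint paths could, for example, start different flows through different candidate neighbours returned by ShortestPathNextHops, while a multicast protocol could deliberately reuse the same ones.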
What is the benefit?
• Prototype testbed
  • 27 servers, 3x3x3 CamCube
  • Quad core, 4 GB RAM, six 1 Gbps Ethernet ports
• Large-scale packet-level discrete event simulator
  • 8,000 servers, 20x20x20 CamCube
  • 1 Gbps links
• Service code runs unmodified on the cluster and the simulator

Service-level benefits
• High-throughput transport service
  • 1 sender, 2,000 receivers, iterated sequentially
  • 10,000 packets/flow, 1,500 bytes/packet
• Metric: throughput
• Shown: custom/base throughput ratio
  [Figure: CDF over flows of the custom/base throughput ratio]

Service-level benefits
• Large-file multicast service
  • 8,000-server network
  • 1 multicast group; group size from 0% to 100% of servers
• Metric: number of links in the multicast tree
• Shown: custom/base ratio
  [Figure: links reduction vs. number of servers in the group (%)]

Service-level benefits
• Distributed object cache service
  • 8,000-server network
  • 8,000,000 key-object pairs, evenly distributed among servers
  • 1 primary + 8 replicas per key; replicas unpopulated initially
  • 800,000 lookups (100 per server), keys picked from a Zipf distribution
• Metric: path length to the nearest hit
  [Figure: CDF over lookups of path length, custom routing vs. base routing]

Network impact
• Ran all services simultaneously
• No correlation in link usage
  [Figure: fraction of links vs. number of services per link (0-4)]
• Reduction in link utilization
  [Figure: custom/base packet ratio for the key-value cache, multicast, fixed path, aggregation, and high-throughput transport services]
• Take-away: custom routing reduced network load and increased service-level performance

Symbiotic routing relations
• Multiple routing protocols running concurrently
  • Routing state shared with the base routing protocol
• Services
  • Use one or more routing protocols
  • Use the base protocol to simplify their custom protocols
• Network failures
  • Handled by the base protocol
  • Services route for the common case
  [Diagram: Services A, B, C with routing protocols 1, 2, 3, layered over the base routing protocol and the network]

Building a routing framework
• Simplify building custom routing protocols
• Routing:
  • Build routes from a set of intermediate points (coordinates in the coordinate space)
  • Services provide a forwarding function F: F(packet, local coordinate) → next coordinate(s)
  • The framework routes between intermediate points using the base routing service
  • The framework consistently remaps the coordinate space on node failure
• Queuing:
  • Services manage packet queues per link
  • Fair queuing between services per link

Example: cache service
• Distributed key-object caching
• Key space mapped onto the CamCube coordinate space
• Per-key caches
  • Evenly distributed across the coordinate space
  • Cache coordinates easily computable from the key (a hypothetical mapping is sketched below)

Cache service routing
• Routing
  • Source → nearest cache or primary
  • On cache miss: cache → primary
  • Populate cache: primary → cache
• F function computed at the source, the cache, and the primary
• Different packets can use different links
  • Accommodates network conditions, e.g. congestion
  [Diagram: source/querier, nearest cache, primary server]
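As a purely hypothetical illustration of "cache coordinates easily computable from the key", the sketch below hashes a key to a primary coordinate and derives per-key cache coordinates deterministically, so any server can compute them locally. The hashing scheme, class, and method names are assumptions, not the actual cache service.

    // Hypothetical sketch only: one way a key could be mapped onto the 3D
    // coordinate space, with per-key cache coordinates derived from the key.
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    static class KeyMapping
    {
        const int N = 20;            // assumed servers per dimension
        const int CachesPerKey = 8;  // e.g. 8 caches/replicas per key, plus the primary

        // Hash a key to its primary coordinate in the cube.
        public static (int x, int y, int z) PrimaryCoord(string key) => HashToCoord(key);

        // Derive the i-th cache coordinate deterministically from the key, so
        // every server can compute the cache locations without any lookup.
        public static (int x, int y, int z) CacheCoord(string key, int i) => HashToCoord(key + "#" + i);

        // All cache coordinates for a key.
        public static List<(int x, int y, int z)> AllCacheCoords(string key)
        {
            var coords = new List<(int x, int y, int z)>();
            for (int i = 0; i < CachesPerKey; i++) coords.Add(CacheCoord(key, i));
            return coords;
        }

        static (int x, int y, int z) HashToCoord(string s)
        {
            using (var sha = SHA256.Create())
            {
                // Byte % N is not perfectly uniform, but good enough for a sketch.
                byte[] h = sha.ComputeHash(Encoding.UTF8.GetBytes(s));
                return (h[0] % N, h[1] % N, h[2] % N);
            }
        }
    }

With a mapping like this, the source of a lookup can compute every candidate cache coordinate locally and pick the nearest one, which matches the property the cache service's routing relies on: cache coordinates are computable from the key alone.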
Handling failures
• On link failure
  • The base protocol routes around the failure
• On replica server failure
  • The key space is consistently remapped by the framework
  • The F function does not change
• The developer only targets the common case; the framework handles the corner cases
  [Diagram: source/querier, nearest cache, primary server]

Cache service F function

protected override List<ulong> F(int neighborIndex, ulong currentDestinationKey, Packet packet)
{
    List<ulong> nextKeys = new List<ulong>();

    // extract packet details
    ulong itemKey = LookupPacket.GetItemKey(packet);
    ulong sourceKey = LookupPacket.GetSourceKey(packet);

    if (currentDestinationKey == sourceKey) // am I the source?
    {
        // if at the source, route to the nearest cache or the primary

        // get the list of caches (using KeyValueStore static method)
        ulong[] cachesKey = ServiceKeyValueStore.GetCaches(itemKey);

        // iterate over all cache nodes and keep the closest ones
        int minDistance = int.MaxValue;
        foreach (ulong cacheKey in cachesKey)
        {
            int distance = node.nodeid.DistanceTo(LongKeyToKeyCoord(cacheKey));
            if (distance < minDistance)
            {
                nextKeys.Clear();
                nextKeys.Add(cacheKey);
                minDistance = distance;
            }
            else if (distance == minDistance)
                nextKeys.Add(cacheKey);
        }
    }
    else if (currentDestinationKey != itemKey) // am I the cache?
    {
        // on a cache miss, route to the primary
        nextKeys.Add(itemKey);
    }

    return nextKeys;
}
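To show how a forwarding function like the one above might be driven, the following is a minimal sketch of a framework-side forwarding loop. It only illustrates the idea that the framework obtains intermediate points from F and relies on the base routing service between them; RoutingFramework, IBaseRouting, KeyToCoord, and the stub Packet type are assumptions, not the actual CamCube framework code.

    // Hypothetical sketch of the framework-side loop that drives a service's
    // F function: ask F for the next intermediate key(s), pick one, and let
    // the base routing service carry the packet to that coordinate, where F
    // is evaluated again.
    using System;
    using System.Collections.Generic;

    class RoutingFramework
    {
        readonly Func<int, ulong, Packet, List<ulong>> serviceF;  // the service's F function
        readonly IBaseRouting baseRouting;                        // base shortest-path routing
        readonly ulong localKey;                                  // key currently mapped to this server

        public RoutingFramework(Func<int, ulong, Packet, List<ulong>> f, IBaseRouting br, ulong local)
        {
            serviceF = f; baseRouting = br; localKey = local;
        }

        // Called when a packet arrives at (or is sent from) this server.
        public void Forward(int neighborIndex, Packet packet)
        {
            // Ask the service for the next intermediate point(s).
            List<ulong> nextKeys = serviceF(neighborIndex, localKey, packet);

            if (nextKeys == null || nextKeys.Count == 0)
                return;  // no next point: deliver the packet to the local service

            // Pick one candidate; a real framework could load-balance across
            // the candidates or apply per-link queuing policy here.
            ulong nextKey = nextKeys[0];

            // Use the base routing service to move the packet towards the
            // coordinate that nextKey maps to.
            baseRouting.SendTowards(KeyToCoord(nextKey), packet);
        }

        static (int x, int y, int z) KeyToCoord(ulong key) =>
            ((int)(key % 20), (int)(key / 20 % 20), (int)(key / 400 % 20)); // assumed 20x20x20 mapping
    }

    interface IBaseRouting { void SendTowards((int x, int y, int z) coord, Packet packet); }
    class Packet { /* payload omitted in this sketch */ }

In this sketch, failure handling stays outside the service: the base routing protocol routes around failed links, and a consistent remapping of keys to coordinates would change only what KeyToCoord returns, not the service's F function.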
Framework overhead
• Benchmark performance
  • Single server in the testbed
  • Communicates with all six 1-hop neighbours (Tx + Rx)
• Sustained 11.8 Gbps throughput, out of an upper bound of 12 Gbps
• User-space routing overhead
  [Figure: CPU utilization (%), baseline vs. framework]

What have we done
• Services only specify a routing "skeleton"; the framework fills in the details
  • Control messages and failures are handled by the framework
  • Reduces routing complexity for services
• Opt-in basis
  • Services define custom protocols only if they need to

Network requirements
• Per-service routing is not limited to CamCube
• The network need only provide:
  • Path diversity: provides routing options
  • Topology awareness: exposes server locality and connectivity
  • Programmable components: allow per-service routing logic

Conclusions
• Data center networking from the developer’s perspective
  • Custom routing protocols to optimize for application-level performance requirements
• Presented a framework for custom routing protocols
  • Applications specify a forwarding function (F) and queuing hints
  • The framework manages network state, control messages, and remapping on failure
• Multiple routing protocols running concurrently
  • Increase application-level performance
  • Decrease network load

Thank You! Questions?
hussam@cs.cornell.edu

Cache service: insert throughput
  [Figure: insert throughput (Gbps) vs. concurrent insert requests; series F=3 and F=27, with and without disk; ingress bandwidth bounded with 3 front-ends, then disk I/O bounded]

Cache service: lookup requests/second
  [Figure: lookup rate (reqs/s) vs. concurrent lookup requests; series F=3 and F=27; ingress bandwidth bounded]

Cache service: CPU utilization on front-ends (FEs)
  [Figure: CPU utilization (%) vs. concurrent requests; series lookup (F=3), lookup (F=27), insert (F=3, no disk), insert (F=27, no disk); 3 front-ends vs. 27 front-ends]

CamCube link latency
  [Figure: round-trip time (microseconds) for 1,500-byte and 9,000-byte packets; UDP (x-cable), CamCube (1 hop), UDP (switch), TCP (x-cable), TCP (switch)]