2nd ACM Symposium on Cloud Computing, Oct 27, 2011

Small-World Datacenters
Ji-Yong Shin*, Bernard Wong+, and Emin Gün Sirer*
*Cornell University  +University of Waterloo

Motivation
• Conventional networks are hierarchical
  – Higher layers become bottlenecks
• Non-traditional datacenter networks are emerging
  – Fat Tree, VL2, DCell, and BCube
    • Highly structured or sophisticated regular connections
    • Require a redesign of network protocols
  – CamCube (3D torus)
    • High bandwidth and APIs exposing the network architecture
    • Large network hop counts

Small-World Datacenters
• Regular + random connections
  – Based on a simple underlying grid
  – Achieve a low network diameter
  – Enable content routing
• Characteristics
  – High bandwidth
  – Fault tolerant
  – Scalable

Small-World Networks
• Watts and Strogatz ['98]
  – Multiple connections to neighbors on a ring + random connections
• Kleinberg ['00]
  – Lattice + random links
  – Probability of connecting a random pair decreases with the d-th power of the distance between the pair in a d-dimensional network (P ∝ dist^-d)
  – Path length becomes O(log n)

Small-World Datacenter Design
• Possible topologies for 6 links per node
  – Small-World Ring (2 regular + 4 random)
  – Small-World 2D Torus (4 regular + 2 random)
  – Small-World 3D Hexagon Torus (5 regular + 1 random)
• Direct connections from server to server
  – No need for switches
  – Software routing approach

Routing in Small-World Datacenters
• Shortest path
  – Link-state protocol (OSPF)
  – Expensive due to TCAM cost
• Greedy geographic (an illustrative sketch appears in the extra slides)
  – Forward to the neighbor closest to the destination
  – Lattice coordinates are used as node IDs
  – Each node maintains information about neighbors up to 3 hops away
  – Path lengths comparable to shortest path for the 3D Hexagon Torus

Content Routing in Small-Worlds
• Content routing
  – Uses a logical coordinate space and network to map data
  – Logical and physical networks do not necessarily match
• Geographic identity + simple regular network in SWDC
  – The logical topology can be mapped directly onto the physical one
  – Random links serve only to accelerate routing
• SWDC can support DHTs and key-value stores directly
  – Similar to CamCube

Packaging and Scaling
• SWDCs can be constructed from preconfigured, reusable, scalable components
• Reusable racks
  – Regular links: only short cables necessary
  – Random links:
    • Predefined blueprint
    • Random number generator
    • Pre-cut wires based on the known link-length probability
• Ease of construction
  – Connect rack -> cluster (or container) -> datacenter
  – Switches, repeaters, or direct wires for inter-cluster connections

Evaluation Setup
• Simulation of 10,240 nodes in three settings:
  – Small-World Datacenters (SWDC): SW Ring, SW 2D Torus, SW 3D Hexagonal Torus
  – CamCube
  – Conventional hierarchical datacenters (CDC)
• SWDC and CamCube
  – 6 x 1 GigE links per server
  – Greedy routing
  – NetFPGA-based per-hop forwarding delay: 4.6 µs for 64 B packets, 15 µs for 1 KB packets
• CDC
  – 1 x 1 GigE link per server
  – 10 GigE links among switches
  – 3 switch layers (uniform delays of 6 µs, 3.2 µs, and 5 µs per layer)
  – Oversubscription: 5 and 10

Evaluation: Average Packet Latency
[Figure: Average packet latency, uniform random traffic: latency (µs) vs. packet size (64 B, 1 KB)]
• SWDCs always outperform CamCube
[Figure: Average packet latency, MapReduce traffic: latency (µs) vs. packet size (64 B, 1 KB)]
• SWDCs can outperform CDC for MapReduce
  – SWDC has multiple ports
• SWDC latencies are packet-size dependent
  – Limitations of software routers
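Back-of-envelope note (illustrative arithmetic, not a result from the paper): with software forwarding, end-to-end latency is roughly the number of greedy hops times the per-hop delay listed in the evaluation setup. For a hypothetical 10-hop path this is about 10 x 4.6 µs = 46 µs for 64 B packets and 10 x 15 µs = 150 µs for 1 KB packets, before queueing, which illustrates why SWDC latency grows with packet size under software routing.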
Evaluation: Aggregate Bandwidth
[Figure: Aggregate bandwidth, uniform random traffic: bandwidth (GBps) vs. packet size (64 B, 1 KB)]
• SWDCs outperform CamCube in general
  – 1.5x to 3x better
[Figure: Aggregate bandwidth, MapReduce traffic: bandwidth (GBps) vs. packet size (64 B, 1 KB)]
• SWDCs outperform CDCs in general
  – 1x to 16x better

Evaluation: Fault Tolerance
[Figure: Connectivity under random rack failure: percentage of preserved connectivity vs. percentage of failed racks (0-90%) for SW Ring, SW 2D Torus, SW 3D Hex Torus, and CamCube]
• No switches to fail, compared to CDCs
• Random links enable stronger connectivity

Related Concurrent Work
• Scafida and Jellyfish
  – Rely on random connections
  – Achieve high bandwidth
• Comparison to SWDC
  – SWDCs have more regular links
  – Routing can be simpler

Summary
• An unorthodox topology comprising a mix of regular and random links can yield:
  – High performance
  – Fault tolerance
  – Ease of construction and scalability
• Issues of cost at scale, routing around failures, multipath routing, etc. are discussed in the paper

Extra: Path Length Comparison
[Figure: Average path length (10,240 nodes, error bars = std. dev.): shortest vs. greedy path length for SW Ring, SW 2D Torus, SW 3D Hex Torus, and CamCube]

Extra: Load Balance

Extra: Need for Hardware

Extra: Path Length Under Failure
[Figure: Shortest path length under random node failure: average path length vs. percentage of failed nodes (0-70%) for SW Ring, SW 2D Torus, SW 3D Hex Torus, and CamCube]

Extra: Fault Tolerance (Node Failure)
[Figure: Connectivity among live nodes under random correlated failure: percentage of preserved paths vs. percentage of failed nodes (0-90%) for SW Ring, SW 2D Torus, SW 3D Hex Torus, and CamCube]
• No switches to fail, compared to CDCs
• Random links enable stronger connectivity
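Extra: Greedy Routing Sketch (Illustrative)
• The wiring and greedy geographic routing described earlier can be summarized in a short Python sketch. This is a minimal illustration only, not the authors' implementation: the 16x16 torus size, the helper names, and the local-minimum check are assumptions, and the sketch looks only one hop ahead, whereas SWDC greedy routing uses information about neighbors up to 3 hops away.

    # Illustrative sketch (not from the paper): build a small-world 2D torus
    # with Kleinberg-style random links and route greedily on lattice coordinates.
    import random
    from itertools import product

    N = 16              # hypothetical 16x16 torus (256 servers); an assumption
    RANDOM_LINKS = 2    # SW 2D Torus: 4 regular + 2 random links per node

    def torus_dist(a, b):
        """Manhattan distance on the wrap-around 2D lattice."""
        dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
        return min(dx, N - dx) + min(dy, N - dy)

    def build_topology(seed=0):
        rng = random.Random(seed)
        nodes = list(product(range(N), repeat=2))
        nbrs = {u: set() for u in nodes}
        for x, y in nodes:                       # 4 regular lattice links
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nbrs[(x, y)].add(((x + dx) % N, (y + dy) % N))
        for u in nodes:                          # Kleinberg-style random links
            others = [v for v in nodes if v != u and v not in nbrs[u]]
            weights = [torus_dist(u, v) ** -2 for v in others]  # P ~ dist^-d, d = 2
            for v in rng.choices(others, weights=weights, k=RANDOM_LINKS):
                nbrs[u].add(v)
                nbrs[v].add(u)                   # cables are bidirectional
        return nbrs

    def greedy_route(nbrs, src, dst):
        """Forward to the neighbor closest to dst; 1-hop lookahead only."""
        path = [src]
        while path[-1] != dst:
            here = path[-1]
            nxt = min(nbrs[here], key=lambda v: torus_dist(v, dst))
            if torus_dist(nxt, dst) >= torus_dist(here, dst):
                break                            # no closer neighbor; stop
            path.append(nxt)
        return path

    if __name__ == "__main__":
        nbrs = build_topology()
        path = greedy_route(nbrs, (0, 0), (8, 8))
        print("hops:", len(path) - 1)

• Running the script builds one randomized instance of the SW 2D Torus wiring and prints the greedy hop count for a single source/destination pair; averaging over many random pairs gives a rough sense of the path-length comparison shown in the extra slides, though the 1-hop lookahead here yields longer paths than the 3-hop variant used in the paper.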