A Scalable, Commodity Data Center Network Architecture
Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat
Presented by Gregory Peaker and Tyler Maclean

Overview
• Structure and properties of a data center
• Desired properties in a DC architecture
• Fat-tree based solution
• Evaluation of the fat-tree approach

Common data center topology

Problems with the common DC topology
• Single point of failure
• Oversubscription of links higher up in the topology
  – Typical oversubscription is between 2.5:1 and 8:1
  – Trade-off between cost and provisioning

Properties of solutions
• Compatible with Ethernet and TCP/IP
• Cost effective
  – Low power consumption & heat emission
  – Cheap infrastructure
  – Commodity hardware
• Allows host communication at line speed
  – Oversubscription of 1:1

Cost of maintaining switches

Review of Layer 2 & Layer 3
• Layer 2 – Data Link Layer
  – Ethernet, MAC addresses
  – One spanning tree for the entire network
    • Prevents looping
    • Ignores alternate paths
• Layer 3 – Network Layer
  – IP
  – Shortest-path routing between source and destination
  – Best-effort delivery

Fat-tree based solution
• Connect end hosts together using a fat-tree topology
  – Infrastructure consists of cheap commodity devices
    • Every port is the same speed
  – All devices can transmit at line speed if packets are distributed along existing paths
  – A k-port fat tree can support k³/4 hosts

Fat-Tree Topology

Problems with a vanilla fat-tree
• Layer 3 will only use one of the existing equal-cost paths
• Packet reordering occurs if layer 3 blindly takes advantage of path diversity
  – Creates overhead at the host, as TCP must reorder the packets

Fat-tree modified
• Enforce a special addressing scheme in the DC
  – Allows hosts attached to the same switch to route only through that switch
  – Allows intra-pod traffic to stay within the pod
  – Addresses have the form unused.PodNumber.SwitchNumber.EndHostID
• Use two-level look-ups to distribute traffic and maintain packet ordering

2-Level look-ups
• First level is a prefix lookup
  – Used to route down the topology to the end host
• Second level is a suffix lookup
  – Used to route up towards the core
  – Diffuses and spreads out traffic
  – Maintains packet ordering by using the same port for the same end host (see the first sketch below)

Diffusion optimizations
• Flow classification
  – Eliminates local congestion
  – Assigns traffic to ports on a per-flow basis instead of a per-host basis (see the second sketch below)
• Flow scheduling
  – Eliminates global congestion
  – Prevents long-lived flows from sharing the same links
  – Assigns long-lived flows to different links
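A minimal sketch of the two-level look-up described above, assuming addresses of the form unused.Pod.Switch.HostID written as dotted quads (e.g. 10.pod.switch.host, as in the paper). The class name, octet-granularity prefixes, and table layout are illustrative, not the authors' Click implementation:

```python
class TwoLevelTable:
    """Illustrative two-level routing table for a pod switch.
    Level 1: longest-prefix match (routes down toward end hosts).
    Level 2: suffix match on the host ID (routes up toward the core)."""

    def __init__(self):
        self.prefixes = []  # (octet_prefix, out_port), e.g. ([10, 2, 1], 1)
        self.suffixes = []  # (host_id, out_port), e.g. (2, 2)

    def add_prefix(self, octets, port):
        self.prefixes.append((octets, port))

    def add_suffix(self, host_id, port):
        self.suffixes.append((host_id, port))

    def lookup(self, dst):
        octets = [int(x) for x in dst.split(".")]
        # Level 1: longest matching prefix wins -> intra-pod traffic routes down.
        match = max(
            (p for p in self.prefixes if octets[: len(p[0])] == p[0]),
            key=lambda p: len(p[0]),
            default=None,
        )
        if match is not None:
            return match[1]
        # Level 2: no prefix matched, so this is inter-pod traffic; the suffix
        # (host ID) picks the uplink. All packets to one host share a port
        # (preserving order), while traffic to different hosts is spread out.
        for host_id, port in self.suffixes:
            if octets[3] == host_id:
                return port
        raise KeyError(f"no route for {dst}")
```

Used, say, on a hypothetical k=4 aggregation switch in pod 2 (host IDs 2 and 3, per the paper's convention):

```python
t = TwoLevelTable()
t.add_prefix([10, 2, 0], port=0)  # 10.2.0.0/24 -> down to edge switch 0
t.add_prefix([10, 2, 1], port=1)  # 10.2.1.0/24 -> down to edge switch 1
t.add_suffix(2, port=2)           # *.*.*.2     -> up to core via uplink 2
t.add_suffix(3, port=3)           # *.*.*.3     -> up to core via uplink 3

print(t.lookup("10.2.1.3"))  # 1: intra-pod, routed down
print(t.lookup("10.0.1.2"))  # 2: inter-pod, routed up
```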
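And a hedged sketch of the flow classification idea from the Diffusion Optimizations slide: assigning uplinks per flow rather than per destination host. Hashing the flow's 4-tuple, shown here, is one simple way to pin a flow to a single path; the paper's actual classifier also measures and periodically reassigns flows to rebalance load, which this sketch omits:

```python
import zlib

def classify_flow(src_ip, dst_ip, src_port, dst_port, num_uplinks):
    """Pin all packets of one flow to a single uplink (so TCP sees no
    reordering) while distinct flows spread across the available uplinks.
    Hash-based assignment is an illustrative choice, not the paper's scheme."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_uplinks

# Two flows to the SAME destination host can now take different uplinks,
# which the per-host suffix table alone would not allow:
print(classify_flow("10.0.1.2", "10.2.0.2", 5001, 80, num_uplinks=2))
print(classify_flow("10.0.1.3", "10.2.0.2", 5002, 80, num_uplinks=2))
```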
Results: Heat & Power Consumption

Implementation
• NetFPGA
  – 4 Gigabit ports, 36 Mb SRAM
  – 64 MB DDR2, 3 Gb/s SATA port
• Implemented routing elements in Click modular router software
  – Two-level table: initialized with preconfigured information
  – Flow classifier: distributes output evenly across local ports
  – Flow reporter + flow scheduler: communicate with a central scheduler

Evaluation
• Purpose: measure bisection bandwidth
• Fat-tree: 10 machines connected to a 48-port switch
• Hierarchical: 8 machines connected to a 48-port switch

Results

Related Work
• Myrinet – popular for cluster-based supercomputers
  – Benefit: low latency
  – Cost: proprietary; the host is responsible for load balancing
• InfiniBand – used in high-performance computing environments
  – Benefit: proven to scale, high bandwidth
  – Cost: imposes its own layer 1–4 protocols
  – Uses a fat tree
• Many massively parallel computers, such as Thinking Machines and SGI systems, use fat trees

Conclusion
• The Good: cost per gigabit and energy per gigabit are going down
• The Bad: data centers are growing faster than commodity Ethernet devices
• Our fat-tree solution
  – Is better: a 27k-host cluster is technically infeasible with a hierarchical 10 GigE design (~$690M); the fat tree builds it from commodity switches
  – Is faster: equal or better bandwidth in tests
  – Increases fault tolerance
  – Is cheaper: 20k hosts cost $37M hierarchical vs. $8.67M fat-tree (1 GigE)
  – KOs the competing data center designs
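As a sanity check on the conclusion's 27k figure: with the 48-port switches used in the paper's cost comparison, the k³/4 host count quoted earlier gives

```latex
\frac{k^3}{4} = \frac{48^3}{4} = \frac{110592}{4} = 27648 \ \text{hosts}
```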