[FAT-TREE]: A SCALABLE, COMMODITY DATA CENTER NETWORK ARCHITECTURE
Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat
Presented by: Bassma Aldahlan

Outline
• Cloud Computing
• Data Center Networks
• Problem Space
• Paper Goal
• Background
• Architecture
• Implementation
• Evaluation
• Results and Ongoing Work

Cloud Computing
• "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (for example, networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."

Data Center Networks
• Data centers have recently received significant attention as a cost-effective infrastructure for storing large volumes of data and hosting large-scale service applications.

Data Center Networks
• A data center (DC) is a facility consisting of servers (physical machines), storage and network devices (e.g., switches, routers, and cables), power distribution systems, and cooling systems.
• A data center network is the communication infrastructure used in a data center, described by its network topology, routing/switching equipment, and protocols (e.g., Ethernet and IP).

Cloud vs. Datacenter
Cloud
• An off-premise form of computing that stores data on the Internet
• Services are outsourced to third-party cloud providers
• Less secure
• Does not require time or capital to get up and running (less expensive)
Datacenter
• On-premise hardware that stores data within an organization's local network
• Run by an in-house IT department
• More secure
• More expensive

Problem Space
• Inter-node communication bandwidth is the principal bottleneck in data centers (and large-scale clusters).
• Two existing solutions:
• Specialized hardware and communication protocols (e.g., InfiniBand or Myrinet)
  • Pro: scale to clusters of thousands of nodes with high bandwidth
  • Cons: no commodity parts (more expensive); not compatible with TCP/IP applications
• Commodity Ethernet switches and routers to interconnect cluster machines
  • Cons: scale poorly; non-linear cost increase with cluster size

Design Goals
The goal of this paper is to design a data center communication architecture that meets the following goals:
• Scalable interconnection bandwidth: arbitrary host communication at full bandwidth
• Economies of scale: commodity switches
• Backward compatibility: compatible with hosts running Ethernet and IP

Background
• Common data center topology: a tree rooted at the Internet, with layer-3 routers at the core, layer-2/3 switches at the aggregation layer, layer-2 switches at the access layer, and servers at the leaves. [topology figure omitted]

Background (Cont.)
• Oversubscription: the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth of a particular communication topology.
• Oversubscription is introduced to lower the total cost of the design.
• Typical designs: factors of 2.5:1 (400 Mbps per host) to 8:1 (125 Mbps per host).
• An oversubscription of 1:1 indicates that all hosts may potentially communicate with arbitrary other hosts at the full bandwidth of their network interface.

Multi-rooted DC network
• Multiple core switches at the top level
• Equal-Cost Multi-Path (ECMP) routing

Background (Cont.)
• Cost [comparison charts omitted]

Architecture
Clos Networks / Fat-Trees
• Adopt a special instance of a Clos topology.
• Similar trends in telephone switching led to topologies that achieve high bandwidth by interconnecting smaller commodity switches.
• Why a fat-tree?
• A fat-tree has identical bandwidth at any bisection.
• Each layer has the same aggregate bandwidth.
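To make the oversubscription figures on the Background slide concrete, here is a small illustrative Python sketch (not from the paper; the 1 Gbps NIC speed is an assumption, while the ratios are the ones quoted above):

    # Worst-case per-host bandwidth under N:1 oversubscription.
    # Assumes 1 Gbps host NICs (assumption); the ratios are from the Background slide.
    def worst_case_host_bandwidth_mbps(nic_mbps: float, oversubscription: float) -> float:
        """Worst-case achievable bandwidth per host for an N:1 oversubscribed design."""
        return nic_mbps / oversubscription

    for ratio in (1.0, 2.5, 8.0):
        bw = worst_case_host_bandwidth_mbps(1000, ratio)
        print(f"{ratio}:1 oversubscription -> {bw:.0f} Mbps per host")
    # 1.0:1 -> 1000 Mbps, 2.5:1 -> 400 Mbps, 8.0:1 -> 125 Mbps,
    # matching the 400 Mbps and 125 Mbps figures quoted above.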
Architecture (Cont.)
Fat-Trees
• Interconnect racks (of servers) using a fat-tree topology.
• A k-ary fat-tree is a three-layer topology (edge, aggregation, and core) built from k-port switches:
• Each of the k pods contains (k/2)² servers and two layers of k/2 switches (k/2 edge and k/2 aggregation switches).
• Each edge switch connects to k/2 servers and k/2 aggregation switches.
• Each aggregation switch connects to k/2 edge switches and k/2 core switches.
• There are (k/2)² core switches; each connects to one switch in each of the k pods.

Architecture (Cont.)
• Addressing
• IP addresses are allocated within the private 10.0.0.0/8 block.
• Pod switches: 10.pod.switch.1, where pod denotes the pod number (in [0, k−1]) and switch denotes the position of that switch in the pod (in [0, k−1], starting from left to right, bottom to top).
• Core switches: 10.k.j.i, where j and i denote that switch's coordinates in the (k/2)² core switch grid (each in [1, k/2], starting from the top left).
• Hosts: 10.pod.switch.ID, where ID is the host's position in that subnet (in [2, k/2+1], starting from left to right).

Architecture (Cont.)
• Two-level routing structure
• Use two-level lookups to distribute traffic and maintain packet ordering:
• The first level is a prefix lookup, used to route down the topology toward servers.
• The second level is a suffix lookup, used to route up toward the core.
• Packet ordering is maintained by using the same upward port for packets destined to the same server.

Architecture (Cont.)
• To implement the two-level routing, switches use Content Addressable Memory (CAM):
• Faster at finding a match against a bit pattern.
• Performs parallel searches among all its entries in a single clock cycle.
• Lookup engines use a special kind of CAM called Ternary CAM (TCAM).

Architecture (Cont.)
• Routing algorithm [pseudocode figure omitted]

Architecture (Cont.)
• Routing example (k = 4): two-level tables of several pod and core switches [figure omitted].
• Example: aggregation switch 10.2.2.1
  Prefix        Port
  10.2.0.0/24   0
  10.2.1.0/24   1
  0.0.0.0/0     (fall through to the suffix table)
  Suffix        Port
  0.0.0.2/8     2
  0.0.0.3/8     3
• Example: core switch 10.4.1.1 routes on pod prefixes (10.0.0.0/16 → port 0, 10.1.0.0/16 → port 1, 10.2.0.0/16 → port 2, ...).

Architecture (Cont.)
• Two-level routing table
• Traffic diffusion occurs in the first half of a packet's journey.
• Subsequent packets to the same host follow the same path, which avoids packet reordering.
• A centralized protocol initializes the routing tables.
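As a rough illustration of the two-level lookup described above, the following Python sketch mimics the example table of aggregation switch 10.2.2.1 (a minimal sketch only; the real design uses a TCAM-based lookup implemented in Click, and the function and table names here are made up):

    import ipaddress

    # Two-level table of aggregation switch 10.2.2.1 (k = 4 example above).
    # First level: /24 prefixes route "down" to subnets inside this pod.
    # Second level: host-ID suffixes route "up" toward a core switch.
    PREFIX_TABLE = [
        ("10.2.0.0/24", 0),
        ("10.2.1.0/24", 1),
    ]
    SUFFIX_TABLE = {2: 2, 3: 3}   # last octet (host ID) -> upward output port

    def lookup(dst_ip: str) -> int:
        dst = ipaddress.ip_address(dst_ip)
        for prefix, port in PREFIX_TABLE:          # first level: prefix match
            if dst in ipaddress.ip_network(prefix):
                return port
        host_id = int(dst) & 0xFF                  # second level: suffix match
        return SUFFIX_TABLE[host_id]

    print(lookup("10.2.1.2"))   # intra-pod destination: matches 10.2.1.0/24 -> port 1
    print(lookup("10.3.0.3"))   # inter-pod destination: suffix .3 -> port 3 (up toward a core switch)

Because the upward port depends only on the destination host ID, all packets to the same host take the same path, which is what preserves packet ordering.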
Architecture (Cont.)
• Flow Classification
• Used with dynamic port reassignment in pod switches to avoid local congestion.
• A flow: a sequence of packets with the same source and destination IP addresses.
• In particular, pod switches:
• 1. Recognize subsequent packets of the same flow and forward them on the same outgoing port (to avoid packet reordering).
• 2. Periodically reassign a minimal number of flow output ports to minimize any disparity between the aggregate flow capacity of different ports (to ensure fair flow balancing).

Architecture (Cont.)
• Flow Scheduling
• Routing large flows plays the most important role in determining the achievable bisection bandwidth of a network.
• Schedule large flows so as to minimize their overlap with one another.

Architecture (Cont.)
• Edge switches:
• Assign a new flow to the least-loaded port initially.
• Detect any outgoing flow whose size grows above a threshold.
• Notify the central scheduler about the source and destination of all active large flows.
• Goals: prevent large flows from sharing the same link, assign long-lived flows to different links, assign each flow to an uncontended path, and avoid congestion.

Architecture (Cont.)
• Central Scheduler
• Possibly replicated; tracks all active large flows and assigns them non-conflicting paths.
• Maintains boolean state for all links, signifying their availability to carry large flows.
• For inter-pod traffic:
• There are (k/2)² possible paths, one corresponding to each core switch.
• The scheduler linearly searches the core switches to find one whose path components do not include a reserved link, then marks those links as reserved.
• It notifies the relevant lower- and upper-layer switches in the source pod of the correct outgoing port that corresponds to the flow's chosen path.

Architecture (Cont.)
• For intra-pod large flows, the assignment is analogous, within the pod.
• The scheduler garbage-collects flows whose last update is older than a given time, clearing their reservations.
• (The flow classifier, by contrast, re-balances its ports every few seconds.)

Implementation
• Two-level table
• Click is a modular software router architecture that supports implementation of experimental router designs.
• A Click router is a graph of packet-processing modules called elements that perform tasks such as routing table lookup.
• A new Click element was built to perform the two-level lookup.
• Flow Scheduler
• The element FlowReporter, which resides in all edge switches, detects outgoing flows whose size is larger than a given threshold and sends regular notifications to the central scheduler about these active large flows.
• The flow classification heuristic attempts to switch, if needed, the output port of at most three flows to minimize the difference between the aggregate flow capacity of its output ports:

  On each packet:
      hash <- Hash(src IP, dst IP)
      if hash not seen before:
          record the new flow f
          assign f to the least-loaded upward port x
      else:
          look up the previously assigned port x
      send the packet on port x

  Periodically, for every aggregation switch:
      find ports p_max and p_min with the largest and smallest aggregate outgoing traffic
      D <- traffic(p_max) - traffic(p_min)
      find the largest flow f assigned to p_max whose size is smaller than D
      if such a flow exists:
          switch the output port of flow f to p_min
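For illustration, here is a runnable but heavily simplified Python rendering of the two heuristics above (the class, method, and variable names are my own; the real implementation is a Click element and also caps reassignment at three flows per pass):

    from collections import defaultdict

    class FlowClassifier:
        """Sketch of the pod-switch flow classifier: remember each flow's upward
        port, and periodically move a flow from the most- to the least-loaded
        port to even out aggregate traffic (simplified from the slides)."""

        def __init__(self, upward_ports):
            self.port_of_flow = {}                     # (src, dst) -> assigned port
            self.load = {p: 0 for p in upward_ports}   # port -> bytes sent
            self.flow_bytes = defaultdict(int)         # (src, dst) -> bytes sent

        def forward(self, src, dst, nbytes):
            flow = (src, dst)
            if flow not in self.port_of_flow:          # new flow: least-loaded port
                self.port_of_flow[flow] = min(self.load, key=self.load.get)
            port = self.port_of_flow[flow]             # same flow -> same port
            self.load[port] += nbytes
            self.flow_bytes[flow] += nbytes
            return port                                # send the packet on this port

        def rebalance_once(self):
            """Move the largest flow that fits from the most- to the least-loaded port."""
            p_max = max(self.load, key=self.load.get)
            p_min = min(self.load, key=self.load.get)
            d = self.load[p_max] - self.load[p_min]
            candidates = [(b, f) for f, b in self.flow_bytes.items()
                          if self.port_of_flow[f] == p_max and b < d]
            if candidates:
                b, f = max(candidates)
                self.port_of_flow[f] = p_min
                self.load[p_max] -= b
                self.load[p_min] += b

    # Example usage (hypothetical values):
    clf = FlowClassifier(upward_ports=[2, 3])
    clf.forward("10.0.1.2", "10.2.0.3", nbytes=1500)   # new flow -> least-loaded port
    clf.rebalance_once()                               # periodic re-balancing pass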
Fault-Tolerance
• Each switch in the network maintains a BFD (Bidirectional Forwarding Detection) session with each of its neighbors to determine when a link or neighboring switch fails.
• Failure between an upper-layer (aggregation) switch and a core switch:
• Outgoing inter-pod traffic: the local routing table marks the affected link as unavailable and chooses another core switch.
• Incoming inter-pod traffic: the core switch broadcasts a tag to the upper-layer switches directly connected to it, signifying its inability to carry traffic to that entire pod; those upper-layer switches then avoid that core switch when assigning flows destined to that pod.

Fault-Tolerance
• Failure between a lower-layer (edge) switch and an upper-layer (aggregation) switch:
• Outgoing inter- and intra-pod traffic from the lower-layer switch: the local flow classifier sets the link's cost to infinity, does not assign it any new flows, and chooses another upper-layer switch.
• Intra-pod traffic using the upper-layer switch as an intermediary: the switch broadcasts a tag notifying all lower-layer switches; they check this tag when assigning new flows and avoid the failed link.
• Inter-pod traffic coming into the upper-layer switch: the switch sends a tag to all its core switches signifying its inability to carry traffic to that edge switch's subnet; the core switches mirror this tag to all upper-layer switches they connect to, which then avoid the affected core switch when assigning new flows.

Power and Heat Issues
Results: Heat & Power Consumption [charts omitted]

Evaluation
• Experiment description
• In a 4-port (k = 4) fat-tree there are 16 hosts, four pods (each with four switches), and four core switches: a total of 20 switches and 16 end hosts.
• These 36 elements are multiplexed onto ten physical machines, interconnected by a 48-port ProCurve 2900 switch with 1 Gigabit Ethernet links.
• Each pod of switches is hosted on one machine; each pod's hosts are hosted on one machine; and the two remaining machines run two core switches each.

Evaluation (Cont.)
• For comparison, a hierarchical tree: four machines running four hosts each, and four machines each running four pod switches with one additional uplink.
• The four pod switches are connected to a 4-port core switch running on a dedicated machine.
• This gives a 3.6:1 oversubscription on the uplinks from the pod switches to the core switch.
• These uplinks are bandwidth-limited to 106.67 Mbit/s, and all other links are limited to 96 Mbit/s.
• Each host generates a constant 96 Mbit/s of outgoing traffic.

Results: Network Utilization [charts omitted]

Packaging
• Increased wiring overhead is inherent to the fat-tree topology.
• With 48-port switches, each pod consists of 12 racks with 48 machines each, plus 48 individual 48-port GigE switches.
• Place the 48 switches in a centralized rack.
• Cables move in sets of 12 from pod to pod and in sets of 48 from racks to pod switches, which opens additional opportunities for packaging to reduce wiring complexity.
• Minimize total cable length by placing racks around the pod switch in two dimensions.

Packaging [figure omitted]

Conclusion
• Bandwidth is the scalability bottleneck in large-scale clusters.
• Existing solutions are expensive and limit cluster size.
• The fat-tree topology offers scalable routing and backward compatibility with TCP/IP and Ethernet.
• Large numbers of commodity switches have the potential to displace high-end switches in data centers, the same way clusters of commodity PCs have displaced supercomputers in high-end computing environments.

THANK YOU
Q&A
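Backup: a small sanity-check sketch (my own, derived from the k-ary fat-tree structure on the Architecture slides) that reproduces the element counts used in the evaluation and the packaging example:

    def fat_tree_counts(k: int) -> dict:
        """Element counts for a fat-tree built from k-port switches."""
        edge = k * (k // 2)                 # k pods x k/2 edge switches
        aggregation = k * (k // 2)          # k pods x k/2 aggregation switches
        core = (k // 2) ** 2
        hosts = k * (k // 2) * (k // 2)     # k pods x k/2 edge switches x k/2 hosts
        return {"switches": edge + aggregation + core, "hosts": hosts,
                "edge": edge, "aggregation": aggregation, "core": core}

    print(fat_tree_counts(4))    # 20 switches and 16 hosts, as in the testbed
    print(fat_tree_counts(48))   # 2880 switches and 27648 hosts with 48-port switches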