Computer Networks: Datacenter Networks
Lin Gu, lingu@ieee.org

Rack-Mounted Servers
(figure: Sun Fire x4150 1U server)

Scale Up vs. Scale Out
(figure: scaling up from Personal System to Departmental Server, Super Server, SMP, and MPP, versus scaling out to a Cluster of PCs)

Data center networks
• 10's to 100's of thousands of hosts, often closely coupled, in close proximity:
  – e-business (e.g., Amazon)
  – content servers (e.g., YouTube, Akamai, Apple, Microsoft)
  – search engines, data mining (e.g., Google)
• Challenges:
  – multiple applications, each serving massive numbers of clients
  – managing/balancing load; avoiding processing, networking, and data bottlenecks
(figure: inside a 40-ft Microsoft container, Chicago data center)

Data center networks
• Load balancer: application-layer routing
  – receives external client requests
  – directs workload within the data center
  – returns results to the external client (hiding data center internals from the client)
  – (a minimal load-balancer sketch appears at the end of this part, after the Growth Trends slide)
(figure: Internet, border router, access routers, load balancers, tier-1 switches, tier-2 switches, TOR switches, server racks 1-8)

Data center networks
• Rich interconnection among switches and racks:
  – increased throughput between racks (multiple routing paths possible)
  – increased reliability via redundancy
(figure: tier-1 switches, tier-2 switches, TOR switches, server racks 1-8)

Low Earth Orbit networks
• Wireless communication is convenient, and can be high-bandwidth.
• Satellites can be an effective solution.
• Some unsuccessful earlier attempts: Iridium, …
• Such systems may come back in the future.
• It does not have to be a satellite: Google Loon, …

Deep space communication
• How to communicate in the Solar System, in the Galaxy, or in deeper space?
• Extremely long latency
• What protocols work? How to build the transceivers?
• Befriend physicists

"The network is the computer" - Sun Microsystems

Appendix

Motivations for using Clusters over Specialized Parallel Computers
• Individual PCs are becoming increasingly powerful
• Communication bandwidth between PCs is increasing and latency is decreasing (Gigabit Ethernet, Myrinet)
• PC clusters are easier to integrate into existing networks
• Typical user utilization of PCs is low (<10%)
• Development tools for workstations and PCs are mature
• PC clusters are cheap and readily available
• Clusters can be easily grown

Cluster Architecture
(figure: sequential and parallel applications run on a parallel programming environment over cluster middleware providing a single system image and availability infrastructure; each PC/workstation runs communications software over network interface hardware, and all nodes attach to the cluster interconnection network/switch)

Datacenter Networking: Major Components of a Datacenter
• Computing hardware (equipment racks)
• Power supply and distribution hardware
• Cooling hardware and cooling fluid distribution hardware
• Network infrastructure
• IT personnel and office equipment

Datacenter Networking: Growth Trends in Datacenters
• Load on networks and servers continues to grow rapidly
  – A rough estimate of annual growth rate: enterprise datacenters ~35%, Internet datacenters 50%-100%
  – Information access anywhere, anytime, from many devices (desktops, laptops, PDAs and smartphones, sensor networks, proliferation of broadband)
• Mainstream servers are moving towards higher-speed links
  – 1-GbE to 10-GbE in 2008-2009
  – 10-GbE to 40-GbE in 2010-2012
• High-speed datacenter-to-MAN/WAN connectivity
  – High-speed datacenter syncing for disaster recovery
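The load-balancer slide earlier in this part describes application-layer routing: the balancer receives external client requests, directs the work to servers inside the data center, and returns results while hiding the internal structure. The sketch below illustrates that idea only; the backend names and the round-robin policy are illustrative assumptions, not a description of any particular product (real balancers also track health, load, and session affinity).

```python
# Minimal application-layer load-balancer sketch (illustrative assumptions only).
import itertools

class Server:
    """A backend server inside the data center (hypothetical)."""
    def __init__(self, name):
        self.name = name
    def process(self, request):
        return f"{self.name} served {request!r}"

class LoadBalancer:
    def __init__(self, backends):
        self.backends = backends              # internal servers, hidden from clients
        self._rr = itertools.cycle(backends)  # simple round-robin policy (assumption)

    def handle(self, client_request):
        server = next(self._rr)               # direct the workload within the data center
        result = server.process(client_request)
        return result                         # return the result; the client never sees 'server'

if __name__ == "__main__":
    lb = LoadBalancer([Server("rack1-host1"), Server("rack2-host1")])
    for q in ["GET /a", "GET /b", "GET /c"]:
        print(lb.handle(q))
```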
Datacenter Networking
• Networking is a large part of the total cost of the DC hardware
  – Large routers and high-bandwidth switches are very expensive
• Relatively unreliable: many components may fail
• Many major operators and companies design their own datacenter networking to save money and improve reliability/scalability/performance
  – The topology is often known
  – The number of nodes is limited
  – The protocols used in the DC are known
• Security is simpler inside the data center, but challenging at the border
• We can distribute applications to servers to distribute load and minimize hot spots

Datacenter Networking: Networking components (examples)
• High-performance, high-density switches and routers
  – Scaling to 512 10GbE ports per chassis
  – No need for proprietary protocols to scale
  – (figure: 64 10-GbE ports upstream, 768 1-GbE ports downstream)
• Highly scalable DC border routers
  – 3.2 Tbps capacity in a single chassis
  – 10 million routes, 1 million in hardware
  – 2,000 BGP peers
  – 2K L3 VPNs, 16K L2 VPNs
  – High port density for GE and 10GE application connectivity
  – Security

Datacenter Networking: Common data center topology
(figure: Internet at the top; core layer of layer-3 routers; aggregation layer of layer-2/3 switches; access layer of layer-2 switches; servers at the bottom)

Datacenter Networking: Data center network design goals
• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software; push complexity to the edge of the network
• Improve reliability
• Reduce capital and operating cost

Interconnect: Can we avoid using high-end switches?
(figure: "Avoid this… and simplify this…")
• Expensive high-end switches are needed to scale up
• Single point of failure and bandwidth bottleneck
  – Experiences from real systems
• One answer: DCell

Interconnect: DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing infrastructure
  – Servers have multiple ports and need to forward packets
• #3: Use recursion to scale, and build a complete graph to increase capacity

Data Center Networking: One approach: a switched network with a hypercube interconnect
• Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports
  – One switch per rack
  – Not replicated (if a switch fails, we lose one rack of capacity)
• Core switch: 10 10-Gbps ports
  – Core switches form a hypercube
• A hypercube is a high-dimensional cube

Interconnect: Hypercube properties
• Minimum hop count
• Even load distribution for all-to-all communication
• Can route around switch/link failures
• Simple routing (a small sketch follows the 64-switch hypercube slide below):
  – Outport = f(Dest xor NodeNum)
  – No routing tables

Interconnect: A 16-node (dimension-4) hypercube
(figure: nodes 0-15; each node has four links, labeled 0-3 with the dimension they cross)

Interconnect: 64-switch Hypercube
(figure: four 4x4 sub-cubes connected by 16 links each; core switch: 10 x 10-Gbps ports; 63 x 4 links to other containers)
• How many servers can be connected in this system?
  – One container: 81,920 servers with 1-Gbps bandwidth
  – Level 2: 2 10-port 10 Gb/sec switches (16 10 Gb/sec links)
  – Level 1: 8 10-port 10 Gb/sec switches (64 10 Gb/sec links)
  – Level 0: 32 40-port 1 Gb/sec switches (1,280 Gb/sec links)
  – Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports
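The hypercube slides above note that routing needs no tables: the output port is a function of (Dest xor NodeNum). Below is a small sketch of that rule, under the assumption (not stated in the slides) that port i of a node connects to the neighbor whose node number differs in bit i, and that we always correct the lowest differing bit first. Each differing bit is corrected exactly once, which is why the route has minimum hop count.

```python
# Hypercube routing sketch: the outport is the position of one differing bit
# between the current node number and the destination (Dest XOR NodeNum).

def next_outport(node, dest):
    diff = dest ^ node
    if diff == 0:
        return None                            # already at the destination
    return (diff & -diff).bit_length() - 1     # index of the lowest set bit

def route(src, dest):
    """Return the sequence of (node, outport) hops from src to dest."""
    hops, node = [], src
    while node != dest:
        port = next_outport(node, dest)
        hops.append((node, port))
        node ^= 1 << port                      # crossing port i flips bit i
    return hops

# Example on the 16-node (dimension-4) hypercube from the figure:
# at most 4 hops, one per differing bit.
print(route(5, 10))    # [(5, 0), (4, 1), (6, 2), (2, 3)]
```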
Data Center Networking: The Black Box
(figure: a containerized, "black box" data center)

Interconnect: Typical Layer 2 and Layer 3 in existing systems
• Layer 2
  – One spanning tree for the entire network: prevents looping, but ignores alternate paths
• Layer 3
  – Shortest-path routing between source and destination
  – Best-effort delivery

Interconnect: Problems with the common DC topology
• Single point of failure
• Oversubscription of links higher up in the topology
  – Trade-off between cost and provisioning
• Layer 3 will only use one of the existing equal-cost paths
• Packet reordering occurs if layer 3 blindly takes advantage of path diversity

Interconnect: Fat-tree-based solution
• Connect hosts together using a fat-tree topology
  – Infrastructure consists of cheap devices; each port supports the same speed as the end host
  – All devices can transmit at line speed if packets are distributed along existing paths
  – A k-ary fat-tree is composed of switches with k ports
    • How many switches? … 5k²/4
    • How many connected hosts? … k³/4

Interconnect: k-ary Fat-Tree (k=4)
(figure: k²/4 core switches; k pods, each with k/2 aggregation switches and k/2 edge switches; k/2 hosts per edge switch)
• Use the same type of switch in the core, aggregation, and edge layers, with each switch having k ports

Interconnect: Fat-tree, modified
• Enforce a special addressing scheme in the DC
  – Allows hosts attached to the same switch to route only through that switch
  – Allows intra-pod traffic to stay within the pod
  – Address format: unused.PodNumber.switchnumber.Endhost
• Use two-level lookups to distribute traffic and maintain packet ordering

Interconnect: Two-level lookups
• First level is a prefix lookup
  – Used to route down the topology to the end host
• Second level is a suffix lookup
  – Used to route up towards the core
  – Diffuses and spreads out traffic
  – Maintains packet ordering by using the same ports for the same end host

Interconnect: Comparison of several schemes
– Hypercube: high-degree interconnect for a large net; difficult to scale incrementally
– Butterfly and fat-tree: cannot scale as fast as DCell
– De Bruijn: cannot incrementally expand
– DCell: low bandwidth between two clusters (sub-DCells)

Distributed Systems

Cloud and Globalization of Computation
• Users are geographically distributed, and computation is globally optimized.
(figure: multiple datacenters coordinated by a DNS-based load-balancing system)

Load Balancing
• The load-balancing system regulates global data center traffic
• Incorporates site health, load, user proximity, and service response for user site selection
• Provides transparent site failover in case of disaster or service outage

Global Data Center Deployment
• Providing site selection for users
• Harnessing the benefits and intricacies of geo-distribution
• Leveraging both DNS and non-DNS methods for multi-site redundancy

Computing in an LSDS: Google's Search System
(figure: GWS clusters in San Jose, London, and Hong Kong; the browser issues a query to Google.com, a DNS lookup selects a site, a GWS handles the HTTP request, queries the backend, and returns the HTTP response)

Google's Cluster Architecture: Goals
• A high-performance distributed system for search
• Thousands of machines collaborate to handle the workload
• Price-performance ratio
• Scalability
• Energy efficiency and cooling
• High availability
Luiz Andre Barroso, Jeffrey Dean, Urs Holzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003.

How to compute in a network?
• Multiple computers on a perfect network may work like one larger traditional computer.
• However, computing becomes more complex when (as the sketch below illustrates):
  – messages can be lost, tampered with, or duplicated;
  – bandwidth is limited;
  – operations incur long latencies, non-uniform latencies, or both;
  – events are asynchronous.
• Hence, computation in an LSDS over imperfect networks may have to be organized in a different way from traditional computer systems.
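Once messages can be lost or duplicated, even a simple request/acknowledgment exchange behaves differently from a function call inside one computer. The toy simulation below is a sketch under made-up assumptions (the loss probability, message strings, and retry limit are all illustrative): a sender that retransmits until it sees an acknowledgment achieves at-least-once delivery, so the receiver must be prepared to handle duplicates.

```python
# Toy simulation of a lossy channel (illustrative only). A sender retransmits a
# request until it receives an ACK; because either direction can drop messages,
# the receiver may process the same request more than once, and if the sender
# gives up it cannot tell whether the request was processed at all.
import random

LOSS = 0.3                       # assumed per-message loss probability

def send(msg):
    """Deliver msg with probability 1 - LOSS, otherwise lose it."""
    return msg if random.random() > LOSS else None

def run(request="commit tx #42", max_tries=20):
    deliveries = 0
    for attempt in range(1, max_tries + 1):
        if send(request) is not None:         # the request may be lost
            deliveries += 1                   # receiver processes it (possibly again)
            if send("ACK") is not None:       # the ACK may be lost too
                print(f"sender done after {attempt} tries; "
                      f"receiver processed the request {deliveries} time(s)")
                return
    print("sender gave up; it cannot tell whether the request was processed")

run()
```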
How to correctly compute on an imperfect network?

Two Generals' Problem
• Two generals want to agree on a time to attack an enemy positioned between them. If the attack is synchronized, the generals can defeat the enemy; otherwise, they will be defeated by the enemy one by one.
• The generals can send messengers to each other, but a messenger may be caught by the enemy.
• Can the two generals reach an agreement (e.g., "Attack at 5am")?
  – Send a messenger, then expect the messenger to come back with one acknowledgment?
  – Send 100 messengers?
• How to prove it is possible or impossible to reach an agreement?

Three Generals' Problem in Paxos
• Paxos: reach global consensus in a distributed system with packet loss
• Who decides the attack time? When is the decision made and agreed on?
• What if one general betrays the others? That is the Byzantine Generals Problem.
• Further reading: [Lamport98] Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.

How to maintain state in a network? The Part-Time Parliament
• Priests (legislators) in a parliament communicate with each other using messengers. Both the priests and the messengers can leave the parliament Chamber at any time, and may never come back.
• Can the parliament pass laws (decrees) and ensure consistency (no ambiguity/discrepancy about the content of a decree)?
• Priests can leave the Chamber (a server crashes or is isolated from the system) and may never come back (permanent server failure).
• Messengers can leave the Chamber (delayed or out-of-order packets) and may never come back (packet loss).

Paxos Protocol 1:
• Suppose we know there are n priests. A priest constructs a decree (e.g., "Tax = 0") and sends it to the other n-1 priests, then collects their votes in support of the decree. A vote against the decree is equivalent to not voting. If there are n-1 votes for the decree, the decree is passed.
• Each priest keeps records of the passed decrees (and some additional information) on his/her ledger (non-volatile storage). Messengers deliver the candidate decree and the votes.
• Problem?

Paxos Protocol 2:
• …
• A decree is passed by a majority voting for it. Resilient to server failures and packet loss. The state (passed or not passed) of a decree is defined unambiguously.
• What is a "majority"? The proposing priest may contact a "quorum" consisting of a majority of the priests. (A quorum-intersection sketch follows Protocol 5 below.)
• Problem?

Paxos Protocol 3:
• …
• Inform all priests about the passing of a decree. Clients can query any priest, and the priest may know the decree.
• What if the particular priest queried does not know the decree?
• Problem?

Paxos Protocol 4:
• …
• Read from a majority set. If all replies agree, the decree is (perhaps) unambiguous.
• What if a priest in the majority set does not know (replies "Tax = ?")?
• Problem?

Paxos Protocol 5:
• …
• Read following a majority. Will there be a majority? Can the majority be wrong?
• Answers to a query should be consistent (identical, or, at least, compatible).
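Protocols 2 through 5 rest on one arithmetic fact: any two majorities of the same n priests share at least one member (|Q1| + |Q2| > n), so a decree accepted by one majority cannot be missed entirely by a later majority read. The short check below illustrates this; the priest names and the choice of n = 5 are arbitrary.

```python
# Why "majority" works: any two quorums that each contain more than half of the
# n priests must overlap in at least one priest.
from itertools import combinations

priests = ["A", "B", "C", "D", "E"]           # n = 5, so a majority is 3
majority = len(priests) // 2 + 1

overlaps = all(
    set(q1) & set(q2)                         # non-empty intersection?
    for q1 in combinations(priests, majority)
    for q2 in combinations(priests, majority)
)
print("every pair of majorities intersects:", overlaps)   # True

# Consequence for Protocols 4 and 5: if a decree was passed by one majority,
# any majority we later read from contains at least one priest who voted for
# it, so a majority read cannot be completely unaware of the decree.
```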
Paxos Protocol 6: Consider a single decree (e.g., Tax = 0)
1. One priest serves as the president and proposes the decree with a unique ballot number b. The president sends the ballot with the proposal to a set of priests.
2. A priest responds to the receipt of the proposal message by replying with its latest vote (LastVote) and a promise that it will not respond to any ballot whose ballot number is between the LastVote's ballot number and b. LastVote can be null.

Paxos Protocol 6 (continued):
3. After receiving promises from a majority set, the president selects the value for the decree based on the LastVotes of this set (the quorum), and sends the acceptance of the decree to the quorum.
4. The members of the quorum reply with a confirmation (vote) to the president; receiving confirmations (votes) from all the quorum members means the decree is passed.

Paxos Protocol 6 (continued):
5. After receiving votes from the whole quorum, the president records the decree in its ledger and sends a success message to all the priests.
6. Upon receiving the success message, a priest records the decree d in its ledger.
(A compact sketch of steps 1-4 appears at the end of these slides.)

How do we know the protocol works correctly?

Paxos
• The protocol leads to a system where every passed decree is the same as the first passed one.

Paxos
• All passed decrees are identical.

Three Generals' Problem in Paxos
• Can we use Paxos to solve the Three Generals' problem ("Attack at 5am")?
• Who decides the attack time? How do the generals agree?
• What if one general betrays the others? That is the Byzantine Generals Problem.

Beyond Single-Decree Paxos
• Can we use Paxos to pass more than one decree?
  – Multiple Paxos instances
  – A sequence of instances
• Further reading: [Lamport98] Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.
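To make Protocol 6 concrete, here is a compact single-decree sketch in the spirit of the slides: a "president" (proposer) runs the ballot/promise exchange of steps 1-2 and the accept/vote exchange of steps 3-4 against in-memory "priests" (acceptors). It is a teaching sketch under simplifying assumptions, not a faithful implementation of Lamport's protocol: function calls stand in for messengers, there are no crashes or retries, and the ledgers are plain dictionaries rather than stable storage.

```python
# Single-decree Paxos sketch following Protocol 6 (simplified as noted above).

class Priest:
    """An acceptor. Remembers the highest ballot promised and its last vote."""
    def __init__(self):
        self.promised = -1           # highest ballot number promised so far
        self.last_vote = None        # (ballot, value) of the latest accepted decree

    def receive_ballot(self, b):                 # step 2: promise + LastVote
        if b > self.promised:
            self.promised = b
            return self.last_vote                # may be None (null LastVote)
        return "nack"

    def receive_decree(self, b, value):          # step 4: vote for the decree
        if b >= self.promised:
            self.promised = b
            self.last_vote = (b, value)
            return "vote"
        return "nack"

def president_propose(priests, ballot, my_value):
    quorum = priests[: len(priests) // 2 + 1]    # a majority set
    # Steps 1-2: send the ballot, collect promises / LastVotes.
    replies = [p.receive_ballot(ballot) for p in quorum]
    if "nack" in replies:
        return None                              # would retry with a higher ballot
    # Step 3: adopt the value of the highest LastVote, else use our own value.
    prior = [r for r in replies if r is not None]
    value = max(prior)[1] if prior else my_value
    # Step 4: send the decree; it passes only if the whole quorum votes.
    if all(p.receive_decree(ballot, value) == "vote" for p in quorum):
        return value                             # steps 5-6: record and announce
    return None

priests = [Priest() for _ in range(5)]
print(president_propose(priests, ballot=1, my_value="Tax = 0"))   # Tax = 0
```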