B4: Experience with a Globally-Deployed Software Defined WAN
Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jonathan Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat
Google, Inc.
Presented at SIGCOMM 2013
Slides by James Hunsucker

What we’ve seen so far…
• Controller design
• Programming languages
• Network virtualization
• Measurement
• But we haven’t seen any of these implemented on a large-scale network!

Background on Google
• Google provides search, video, cloud computing, and enterprise applications to users across the world.
• Google operates two distinct WANs:
– The user-facing network peers and exchanges traffic with other Internet domains, and delivers end-user requests to Google’s data centers.
– B4, the “behind the scenes” network, provides connectivity between data centers and carries over 90% of internal application traffic.

Why two different networks?
• The public-facing network must support a wide range of protocols to handle different interactions.
• B4 supports thousands of applications, divided into 3 classes:
1. User data copies (email, documents, audio/video)
2. Remote storage access for computation over inherently distributed data sources
3. Large-scale data push synchronizing state across multiple data centers
• The order is one of increasing volume, decreasing latency sensitivity, and decreasing overall priority.

Motivation for SDN
• Typical WANs are provisioned to 30-40% link utilization with high-end router gear.
• This provides good reliability, and link failures can go unnoticed by end users.
• It is too costly and inefficient for the needs of the B4 network.

Characteristics of B4
• Elastic bandwidth demands: most data center traffic involves synchronizing large data sets. These operations should make use of all available bandwidth, but defer to tasks that are more latency sensitive.
• Moderate number of sites: only about two dozen WAN sites.
• End application control: Google controls both the applications and the networks connected to B4.
• Cost sensitivity: the growth rates and capacity targets of B4 would make the traditional approach cost prohibitive.

Switch Design
• High bandwidth utilization meant Google could avoid deep buffers.
• The small number of sites meant smaller forwarding tables as well.
• No existing hardware could support an SDN deployment.
• So Google built its own switches from multiple merchant silicon switch chips.

Switches cont…
• Switches contain embedded processors running Linux, and initially all routing protocols ran directly on the switch.
• Even though the switch is internally complex, Google developed an OpenFlow Agent (OFA) that presents the abstraction of a single non-blocking switch with hundreds of 10 Gb/s ports.
• The OFA extends OpenFlow and handles the dirty work of the actual port mapping.

Controller
• Uses a modified version of Onix.
• There are multiple controllers per site, but only one is active at a time; the rest are replicas kept for backup.
• If a failure occurs, the new master controller is elected via Paxos.
• Once a new controller is picked, it syncs with the switches to recover the dynamic network state.

Traffic Engineering (TE)
• Goal: share bandwidth among competing applications, possibly using multiple paths.
• “Sharing bandwidth” is defined by Google as max-min fairness.
• Basics of max-min fairness:
– No source gets a resource share larger than its demand.
– Sources with unsatisfied demands get an equal share of the resource.
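To make the progressive-filling idea behind max-min fairness concrete, here is a minimal Python sketch (not code from the paper; the function name, tolerance, and example numbers are illustrative) that allocates one link’s capacity among competing demands.

# Minimal sketch: max-min fair allocation of a single resource via
# progressive filling. Illustrative only; not taken from B4.
def max_min_fair_share(capacity, demands):
    """No flow gets more than its demand; flows with unsatisfied
    demands all end up with the same (maximal) share."""
    allocation = [0.0] * len(demands)
    remaining = capacity
    unsatisfied = set(range(len(demands)))
    while unsatisfied and remaining > 1e-9:
        # Split what is left equally among still-unsatisfied flows...
        equal_share = remaining / len(unsatisfied)
        for i in list(unsatisfied):
            # ...but never give a flow more than it asked for.
            grant = min(equal_share, demands[i] - allocation[i])
            allocation[i] += grant
            remaining -= grant
            if allocation[i] >= demands[i] - 1e-9:
                unsatisfied.remove(i)
    return allocation

# Example: a 10 Gb/s link with demands of 2, 4, and 8 Gb/s
# yields roughly [2.0, 4.0, 4.0].
print(max_min_fair_share(10.0, [2.0, 4.0, 8.0]))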
Google’s TE terms
• Flow Group (FG): TE cannot operate at the granularity of individual applications, so applications are aggregated into a {source site, destination site, QoS} tuple.
• Tunnel: represents a site-level path in the network, i.e., a sequence of sites (A->B->C).
• Tunnel Group (TG): maps FGs to a set of tunnels and corresponding weights. The weight specifies the fraction of FG traffic to be forwarded along each tunnel.
• TE views the network topology as a graph: sites are the vertices and site-to-site connectivity forms the edges.

Bandwidth Functions
• A bandwidth function is associated with every application; it specifies the bandwidth allocated to that application given the flow’s relative priority (fair share).
• The functions are derived from administrator-provided static weights.
• The bandwidth function of an FG is the linear additive composition of its per-application functions.

Bandwidth functions (figure omitted)

TE Optimization Algorithm
• The traditional solution for allocating fair shares of resources is expensive and does not scale well.
• Google’s version of the algorithm is 25x faster than the traditional approach and achieves at least 99% of the bandwidth utilization.
• 2 main components:
1. Tunnel Group Generation: allocates bandwidth to FGs, using bandwidth functions to prioritize at bottlenecks.
2. Tunnel Group Quantization: changes the split ratios in each TG to match the granularity supported by the switch hardware.

Tunnel Group Generation
• Goal: allocate bandwidth to FGs based on demand and priority such that all FGs on an edge either receive an equal fair share or fully satisfy their demand.

Tunnel Group Generation
1. Iterate by finding the bottleneck edge when filling all FGs together, increasing their fair share along their preferred tunnel.
2. Once found, a bottleneck edge is no longer used for Tunnel Group Generation, and the algorithm continues with the next bottleneck edge.
• A preferred tunnel for an FG is the minimum-cost path that does not include a bottleneck edge.

A concrete example
(Topology figure omitted.) The cost of each edge is not shown, but typically represents latency. Here the cost of every edge is 1, except A->D, which is 10.

Worked example
• Start by filling FG1 and FG2 on their preferred tunnels, A->B and A->C respectively.
• Allocate bandwidth by giving an equal fair share of 0.9 to each: FG1 gets 10 Gbps and FG2 gets 0.45 Gbps (calculated from their bandwidth functions).
• Edge A->B becomes full (bottlenecked).

Worked example
• The next preferred tunnel for FG1 is A->C->B.
• At a fair share of 3.33, FG1 gets 8.33 Gbps and FG2 gets 1.22 Gbps.
• Now A->C is a bottleneck.

Worked example
• The third preferred tunnel for FG1 is A->D->C->B.
• The second preferred tunnel for FG2 is A->D->C.
• FG1 receives 1.67 Gbps and is fully satisfied.
• FG2 receives the remaining 3.33 Gbps.

TE State and OpenFlow
• B4 switches operate in 3 roles:
1. An encapsulating switch initiates tunnels and splits traffic between them.
2. A transit switch forwards packets based on their outer header.
3. A decapsulating switch terminates tunnels, then forwards packets using regular routes.

TE State and OpenFlow
• A switch maps a packet to an FG when its destination IP matches one of the prefixes associated with that FG.
• Packets matched to an FG are forwarded via the corresponding Tunnel Group.
• Each packet hashes to one of the tunnels associated with the Tunnel Group, in the ratio calculated by the TE optimization algorithm (a short sketch follows this slide).
• Therefore, installing a tunnel requires configuring switches at multiple sites.
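As a rough illustration of the hashing step just described (a sketch under stated assumptions, not B4’s actual switch pipeline; the hash choice, names, and the quantized weights are hypothetical), the Python below maps a packet’s flow key to one of a Tunnel Group’s tunnels in proportion to the TE-computed split weights.

# Illustrative sketch only: select a tunnel for a flow in proportion to the
# Tunnel Group's split weights, deterministically per flow.
import hashlib

def pick_tunnel(tunnels, weights, flow_key):
    """Hash the flow key onto [0, sum(weights)) and walk the cumulative
    weight distribution, so each tunnel carries roughly its configured
    share while packets of one flow always take the same tunnel."""
    digest = hashlib.sha256(flow_key).digest()
    # Treat 8 bytes of the digest as a uniform value in [0, 1).
    point = (int.from_bytes(digest[:8], "big") / 2**64) * sum(weights)
    cumulative = 0.0
    for tunnel, weight in zip(tunnels, weights):
        cumulative += weight
        if point < cumulative:
            return tunnel
    return tunnels[-1]  # guard against floating-point rounding

# Hypothetical quantized split for FG1 from the worked example above:
# 0.5 on A->B, 0.4 on A->C->B, 0.1 on A->D->C->B.
flow = b"10.0.0.1|10.1.0.1|udp|4739|2055"
print(pick_tunnel(["A->B", "A->C->B", "A->D->C->B"], [0.5, 0.4, 0.1], flow))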
Composing Routing and TE
• B4 supports both shortest-path routing and TE.
• Packet lookups are done in parallel in both the Longest Prefix Match table (shortest-path routing) and the Access Control List table (Traffic Engineering).
• But Access Control List entries take precedence over LPM entries.

Coordinating TE across sites
• The TE optimization output is translated, for each site, into a per-site Traffic Engineering Database (TED).
• The old database state and the new state are compared, and each difference is translated into a single TE op (add/modify/delete).
• The OFC converts each op into flow-programming instructions and sends them to the switches.

Dependencies and Failures
• Site-specific sequence IDs are attached to TE ops to enforce correct ordering.
• A TE op can fail because of a remote procedure call failure, so a dirty/clean bit is kept for each TED entry. An entry is only marked clean when an acknowledgement is received from the OFC.
• The op is resent if a timeout occurs before the acknowledgement is received.

Evaluation: Performance of TE ops
• The TE algorithm is run after every topology change and periodically to account for demand changes.

Failures
• Google conducted experiments to test the recovery time from different types of failures. (Results table omitted.)

TE algorithm evaluation (figures omitted)

Link Utilization (figures omitted)

Experience with an outage
• During a move of switches from one physical location to another, two switches were manually configured with the same ID.
• As a result, the network state never converged to the actual topology.
• The system recovered only after all traffic was stopped, buffers were emptied, and the OFCs were restarted from scratch.

Lessons
• Scalability and latency of the OFA-to-OFC communication are critical for evolving OpenFlow and improving the implementation.
• The OFA should be asynchronous and multi-threaded for more parallelism.
• Most failures occur due to human error; SDN has the ability to simplify the management of large, complex systems.

Questions?