Software-defined networking: Change is hard
Ratul Mahajan, with Chi-Yao Hong, Rohan Gandhi, Xin Jin, Harry Liu, Vijay Gill, Srikanth Kandula, Mohan Nanduri, Roger Wattenhofer, and Ming Zhang

Inter-DC WAN: A critical, expensive resource
[Figure: map of inter-datacenter WAN sites, including Dublin, Seattle, New York, Seoul, Barcelona, Hong Kong, Los Angeles, and Miami]
But it is highly inefficient.

One cause of inefficiency: lack of coordination

Another cause of inefficiency: local, greedy resource allocation
[Figure: the same demands among nodes A-H under local, greedy allocation vs. globally optimal allocation]
[Latency inflation with MPLS-based traffic engineering, IMC 2011]

SWAN: Software-driven WAN
Goals:
- Highly efficient WAN
- Flexible sharing policies
Key design elements:
- Coordinate across services
- Centralize resource allocation
[Achieving high utilization with software-driven WAN, SIGCOMM 2013]

SWAN overview
[Figure: SWAN architecture. Service brokers send traffic demand to the SWAN controller and receive bandwidth allocations, which they enforce by rate-limiting service hosts; network agents report topology and traffic to the controller and push its network configuration into the WAN.]

Key design challenges
- Scalably computing BW allocations
- Working with limited switch memory
- Avoiding congestion during network updates

Congestion during network updates
[Figure: an update sequence that transiently congests a link]

Congestion-free network updates
[Figure: a multi-step update sequence that avoids congestion throughout]

Computing a congestion-free update plan
- Leave scratch capacity $s$ on each link
  - Ensures a plan with at most $\lceil 1/s \rceil - 1$ steps
- Find a plan with the minimal number of steps using an LP
  - Search for a feasible plan with 1, 2, ..., max steps (sketched below)
- Use scratch capacity for background traffic
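The search loop can be made concrete with a small sketch. This is not SWAN's implementation: SWAN solves an LP for the intermediate configurations, whereas this sketch (with hypothetical names such as plan_update and transition_safe) simply interpolates flow allocations linearly between the current and target configurations and checks each transition for congestion.

```python
# Simplified sketch of congestion-free update planning (assumption: flows can
# be moved in fractional steps, and intermediate configurations linearly
# interpolate between current and target; SWAN instead finds them with an LP).

def transition_safe(cfg_a, cfg_b, capacity):
    """During an asynchronous switch from cfg_a to cfg_b, each flow may be on
    its old or new allocation, so the worst-case load on a link is the sum of
    per-flow maxima. The transition is safe if that load fits the capacity."""
    worst = {}
    for flow_link in set(cfg_a) | set(cfg_b):
        _, link = flow_link
        load = max(cfg_a.get(flow_link, 0.0), cfg_b.get(flow_link, 0.0))
        worst[link] = worst.get(link, 0.0) + load
    return all(worst[l] <= capacity[l] for l in worst)

def plan_update(current, target, capacity, max_steps=8):
    """Search for the smallest step count whose interpolated intermediate
    configurations are congestion-free throughout the update."""
    keys = set(current) | set(target)
    for steps in range(1, max_steps + 1):
        cfgs = [{k: (1 - i / steps) * current.get(k, 0.0)
                    + (i / steps) * target.get(k, 0.0) for k in keys}
                for i in range(steps + 1)]
        if all(transition_safe(a, b, capacity) for a, b in zip(cfgs, cfgs[1:])):
            return cfgs  # congestion-free plan found
    return None  # none within max_steps; a real LP may still find one

# Toy example: two flows swap their shares of a 10-unit link (A, B).
cap = {("A", "B"): 10.0}
cur = {("f1", ("A", "B")): 8.0, ("f2", ("A", "B")): 0.0}
tgt = {("f1", ("A", "B")): 0.0, ("f2", ("A", "B")): 8.0}
plan = plan_update(cur, tgt, cap)
print(f"{len(plan) - 1}-step plan found" if plan else "no plan")
```

On this toy input the search settles on 4 steps, matching the bound above: the link has 2 spare units, i.e., scratch capacity $s = 0.2$, and $\lceil 1/0.2 \rceil - 1 = 4$.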
SWAN provides congestion-free updates
[Figure: complementary CDF of link oversubscription ratio and of extra traffic (MB); SWAN's updates stay congestion-free]

SWAN comes close to optimal
[Figure: throughput relative to optimal for SWAN, SWAN without rate control, and MPLS TE]

Deploying SWAN
[Figure: partial deployment and full deployment across the WAN and data centers]

The challenge of data plane updates in SDN
Not just about congestion:
- Blackholes, loops, packet coherence, ...
The real world is even messier.
[Figure: CDFs of update latency (seconds) from our controlled experiments and from Google's B4]

Many resulting questions of interest
Fundamental:
- What consistency properties can be maintained, and how?
- Are property strength and ease of maintenance related?
Practical:
- How to quickly and safely update the data plane?
  - Impacts failure recovery time, network utilization, and flow response time

Minimal dependencies for a consistency property

                       None                Self                Downstream subset       Downstream all        Global
Eventual consistency   Always guaranteed
Blackhole freedom      Impossible          Add before remove
Loop freedom           Impossible          Impossible          Rule dependency forest  Rule dependency tree
Packet coherence       Impossible          Impossible          Impossible              Flow version numbers  Global version numbers
Congestion freedom     Impossible          Impossible          Impossible              Impossible            Staged partial moves

[On consistent updates in software-defined networks, HotNets 2013]

Fast, consistent network updates
[Figure: pipeline. The consistency property, routing policy, and current network state feed the desired state generator (forward fault correction, which computes states that are robust to common faults); the resulting target network state feeds the update planner (Dionysus, which dynamically schedules network updates) to produce the update plan.]

Overview of forward fault correction (FFC)
- Control and data plane faults cause congestion
  - Today, reactive data plane updates are needed to remove congestion
- FFC handles faults proactively
  - Guarantees the absence of congestion for up to k faults
- Main challenge: too many possible faults
  - Constraint reduction technique based on sorting networks
[Traffic engineering with forward fault correction, SIGCOMM 2014 (to appear)]

Congestion due to control plane faults
[Figure: current state vs. target state]

FFC for control plane faults
[Figure: a vulnerable target state vs. robust target states for k=1 and k=2]

Congestion due to data plane faults
[Figure: pre-failure vs. post-failure traffic distribution]

FFC for data plane faults
[Figure: a vulnerable traffic distribution vs. a robust traffic distribution for k=1]

The FFC guarantee needs too many constraints
For every fault set $f \in F$, where $F = \{f \mid f \text{ is a set of up to } k \text{ faulty switches}\}$:
$\sum_{s \in f} \lambda_e(s) \le C_e$
- $C_e$: spare capacity of link $e$ in the absence of faults
- $\lambda_e(s)$: additional traffic on link $e$ when switch $s$ is faulty
The number of constraints per link is $\binom{n}{1} + \dots + \binom{n}{k}$.

Efficient solution using sorting networks
$\forall f \in F: \sum_{s \in f} \lambda_e(s) \le C_e \iff \sum_{m=1}^{k} \lambda_e^m \le C_e$
where $\lambda_e^m$ is the $m$-th largest variable in the array $[\lambda_e(s)]$.
Use a bubble sort network to compute linear expressions for the k largest variables:
- O(nk) constraints (sketched below)
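A minimal sketch of the reduction, with made-up numbers: k bubble passes over the n values move the k largest to the front using O(nk) compare-exchange operations. In the actual LP the values are variables, so each comparator would be encoded with linear constraints (an assumption about the encoding, e.g. hi >= a, hi >= b, hi + lo = a + b); the sketch below, with the hypothetical name k_largest_sum, just runs the network on concrete numbers to show its structure and cost.

```python
# Sketch of FFC's sorting-network constraint reduction. Enforcing
# sum_{s in f} lambda_e(s) <= C_e for every fault set f of up to k switches
# is equivalent to bounding the sum of the k largest lambda_e(s).

def k_largest_sum(values, k):
    """Run k bubble-sort passes; return (sum of k largest, comparator count)."""
    vals = list(values)
    n = len(vals)
    comparators = 0
    for m in range(k):                 # pass m settles the (m+1)-th largest
        for i in range(n - 1, m, -1):  # bubble larger values toward the front
            comparators += 1
            if vals[i] > vals[i - 1]:
                vals[i], vals[i - 1] = vals[i - 1], vals[i]
    return sum(vals[:k]), comparators

lam = [3.0, 9.0, 1.0, 7.0, 5.0]  # made-up lambda_e(s) for n=5 switches
k, C_e = 2, 17.0
worst, n_comp = k_largest_sum(lam, k)
print(f"worst-case extra load = {worst}, safe = {worst <= C_e}, "
      f"comparators = {n_comp}")
# worst-case extra load = 16.0, safe = True, comparators = 7
# vs. C(5,1) + C(5,2) = 15 enumerated fault sets for this one link alone.
```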
FFC performance in practice
- Single-priority traffic: $(k_c, k_e) = (2, 1)$
- Multi-priority traffic: $(k_c^h, k_e^h) = (3, 3)$; $(k_c^m, k_e^m) = (2, 1)$; $(k_c^l, k_e^l) = (0, 0)$
[Figure: FFC performance under these settings]

Fast, consistent network updates
[Figure: pipeline recap, now highlighting the update planner (Dionysus, which dynamically schedules network updates)]

Overview of dynamic update scheduling
- Current schedulers pre-compute a static update schedule
  - Can get unlucky with switch delays
- Dynamic scheduling adapts to actual conditions
- Main challenge: tractably exploring "safe" schedules
[Dionysus: Dynamic scheduling of network updates, SIGCOMM 2014 (to appear)]

Downside of static schedules
[Figure: an example update of flows F1-F4 (5 or 10 units each) across switches S1-S5. Depending on actual switch delays, static plan A or static plan B can take much longer than necessary, while a dynamic plan achieves low update time regardless of latency variability.]

Challenge in dynamic scheduling
Tractably explore valid orderings:
- Exponential number of orderings
- Cannot completely avoid planning
[Figure: current and target states for flows F1-F5 across switches S1-S5]

Dionysus pipeline
[Figure: the consistency property, current network state, and target network state feed the dependency graph generator; the resulting dependency graph drives the update scheduler.]

Dionysus dependency graph
- Nodes: updates and resources
- Edges: dependencies among nodes
[Figure: dependency graph for the example above]

Dionysus scheduling
NP-complete problem with capacity and memory constraints. Approach:
- Critical-path scheduling
- Treat strongly connected components as virtual nodes and favor them
- Rate-limit flows to resolve deadlocks

Dionysus leads to faster updates
Median improvement over static scheduling (SWAN): 60-80%

Dionysus reduces congestion due to failures
99th-percentile improvement over static scheduling (SWAN): 40%

Fast, consistent network updates
[Figure: pipeline recap: desired state generator (forward fault correction) followed by update planner (Dionysus)]

Summary
- SDN enables new network operating points, such as high utilization
- But it also poses a new challenge: fast, consistent data plane updates
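As a closing illustration of the Dionysus-style scheduling described above, here is a minimal sketch. It is illustrative only: the model, names (schedule, needs/frees), and the priority heuristic (favor updates that release the most capacity, a crude stand-in for Dionysus's critical-path scheduling on its dependency graph) are my simplifications, and updates are assumed to complete synchronously within a round.

```python
# Sketch of dependency-aware dynamic update scheduling. Each update consumes
# capacity on its flow's new path and, once applied, releases capacity on the
# old path; the scheduler issues whatever is feasible *now* rather than
# following a pre-computed static order.

def schedule(updates, free_cap):
    """updates: {name: (needs, frees)}, each a {link: units} dict.
    Returns the list of rounds in which updates were applied."""
    pending, rounds = dict(updates), []
    while pending:
        ready = [u for u, (needs, _) in pending.items()
                 if all(free_cap.get(l, 0) >= c for l, c in needs.items())]
        if not ready:
            # Dionysus breaks such deadlocks by rate-limiting flows.
            raise RuntimeError("deadlock: no update is currently feasible")
        # Favor updates that free the most capacity (critical-path stand-in).
        ready.sort(key=lambda u: -sum(pending[u][1].values()))
        issued = []
        for u in ready:
            needs, _ = pending[u]
            if all(free_cap.get(l, 0) >= c for l, c in needs.items()):
                for l, c in needs.items():
                    free_cap[l] -= c
                issued.append(u)
        for u in issued:  # completed updates release old-path capacity
            for l, c in pending[u][1].items():
                free_cap[l] = free_cap.get(l, 0) + c
            del pending[u]
        rounds.append(issued)
    return rounds

# Toy instance: F2 can move onto link L1 only after F1 vacates it.
updates = {
    "move-F1": ({"L2": 5}, {"L1": 5}),  # F1: L1 -> L2
    "move-F2": ({"L1": 5}, {"L3": 5}),  # F2: L3 -> L1
}
print(schedule(updates, {"L1": 0, "L2": 5, "L3": 0}))
# [['move-F1'], ['move-F2']]
```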