Traffic Engineering with Forward Fault Correction (FFC) Hongqiang “Harry” Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University) 1 Cloud services require large network capacity Cloud Services Growing traffic Cloud Networks Expensive (e.g. cost of WAN: $100M/year) 2 TE is critical to effectively utilizing networks Traffic Engineering • Microsoft SWAN • Google B4 • …… WAN Network • Devoflow • MicroTE • …… Datacenter Network 3 But, TE is also vulnerable to faults TE controller Network view Network configuration Control-plane faults Frequent updates for high utilization Data-plane faults Network 4 Control plan faults Failures or long delays to configure a network device TE Controller Switch TE configurations RPC failure Firmware bugs Overloaded CPU Memory shortage 5 Congestion due to control plane faults s2 Link Capacity: 10 s4 (10) s2 New Flows (traffic demands): s1 s2 (10) 10 s1 s3 (10) s1 s4 (10) 7 3 s1 s4 s1 s2 10 10 s4 3 s3 s3 7 10 s4 (10) 10 s3 10 Configuration s2 failure 7 3 s1 s4 Congestion 10 10 s3 10 6 Data plane faults Link and switch failures Rescaling: Sending traffic proportionally to residual paths s2 s4 (10) s2 s2 7 failure link 10 3 s1 s4 3 s3 s4 (10) s3 link failure 7 Link Capacity: 10 s1 congestion s4 3 s3 7 7 Control and data plane faults in practice In production networks: • Faults are common. • Faults cause severe congestion. Control plane: fault rate = 0.1% -- 1% per TE update. Data plane: fault rate = 25% per 5 minutes. 8 State of the art for handling faults •Heavy over-provisioning Big loss in throughput •Reactive handling of faults • Control plane faults: retry • Data plane faults: re-compute TE and update networks Cannot prevent congestion Slow (seconds -- minutes) Blocked by control plane faults 9 How about handling congestion proactively? 10 Forward fault correction (FFC) in TE • [Bad News] Individual faults are unpredictable. • [Good News] Simultaneous #faults is small. FEC guarantees no information loss under up to k arbitrary packet drops. Packet loss with careful data encoding FFC guarantees no congestion under up to k arbitrary faults. Network faults with careful traffic distribution 11 Example: FFC for control plane faults Link Capacity: 10 s2 7 10 3 s1 s4 s2 10 10 s1 s4 3 s3 7 10 s3 10 Non-FFC 10 Configuration s2 failure 7 3 s1 s4 Congestion 10 10 s3 10 12 Example: FFC for control plane faults s2 Link Capacity: 10 7 s2 10 3 10 7 s1 s4 s1 s4 3 10 7 s3 s3 10 Control Plane FFC (k=1) 10 Configuration s2 failure 7 10 3 s4 s1 s4 3 7 s3 10 7 s1 10 s2 10 10 Configuration failure s3 7 13 Trade-off: network utilization vs. robustness 10 s2 10 10 s3 10 10 s4 10 Non-FFC Throughput: 50 s1 s4 10 s3 s2 10 4 7 10 s1 10 s2 10 K=1 (Control Plane FFC) Throughput: 47 s1 s4 10 s3 10 K=2 (Control Plane FFC) Throughput: 44 14 Systematically realizing FFC in TE Formulation: How to merge FFC into existing TE framework? Computation: How to find FFC-TE efficiently? 15 Basic TE linear programming formulations Sizes of flows LP formulations 𝑏𝑓 TE decisions: 𝑙𝑓,𝑡 Traffic on paths TE objective: Maximizing throughput Deliver all granted flows Basic TE constraints: FFC constraints: No overloaded link up to No overloaded link … ∀𝑓 𝑏𝑓 max. s.t. ∀𝑓: ∀𝑒: ∀𝑡 𝑙𝑓,𝑡 ∀𝑓 ≥ 𝑏𝑓 ∀𝑡∋𝑒 𝑙𝑓,𝑡 … 𝑘𝑐 control plane faults 𝑘𝑒 link failures 𝑘𝑣 switch failures 16 ≤ 𝑐𝑒 Formulating control plane FFC 𝑓1 𝑓2 𝑓3 𝑓1 ’s load in old TE s1 s2 Total load under faults? 𝑓2 ’s load in new TE Fault on 𝑓1 : 𝑙1𝑜𝑙𝑑 + 𝑙2𝑛𝑒𝑤 + 𝑙3𝑛𝑒𝑤 ≤ link cap Fault on 𝑓2 : 𝑙1𝑛𝑒𝑤 + 𝑙2𝑜𝑙𝑑 + 𝑙3𝑛𝑒𝑤 ≤ link cap Fault on 𝑓3 : 𝑙1𝑛𝑒𝑤 + 𝑙2𝑛𝑒𝑤 + 𝑙3𝑜𝑙𝑑 ≤ link cap 𝟑 𝟏 Challenge: too many constraints With n flows and FFC protection k: #constraints = 𝒏𝟏 + … + 𝒏𝒌 for each link. 17 An efficient and precise solution to FFC Our approach: A lossless compression from O( 𝑛𝑘 ) constraints to O(kn) constraints. Total load under faults ≤ link capacity Total additional load due to faults ≤ link spare capacity 𝑥𝑖 : additional load due to fault-i Given 𝑋 = {𝑥1 , 𝑥2 , … , 𝑥𝑛 }, FFC requires that O( the sum of arbitrary k elements in 𝑋 is ≤ link spare capacity Define 𝑦𝑚 as the mth largest element in 𝑋: 𝑘 𝑚=1 𝑦𝑚 ≤ link spare capacity Expressing 𝑦𝑚 with 𝑋? O(1) 19 𝑛 𝑘 ) Sorting network 𝑥1 𝑥2 𝑥3 𝑥4 𝑧2 𝑧1 1st round 𝑧4 𝑧3 𝑧5 𝑧7 𝑧6 2nd round 𝑧8 𝑦1 (1st largest) 𝑦2 (2nd largest) A comparison 𝑥1 𝑧1 =max{𝑥1 , 𝑥2 } 𝑥2 𝑧2 =min{𝑥1 , 𝑥2 } 𝑦1 + 𝑦2 ≤ link spare capacity • Complexity: O(kn) additional variables and constraints. • Throughput: optimal in control-plane and data plane if paths are disjoint. 19 FFC extensions • Differential protection for different traffic priorities • Minimizing congestion risks without rate limiters • Control plane faults on rate limiters • Uncertainty in current TE • Different TE objectives (e.g. max-min fairness) •… 20 Evaluation overview • Testbed experiment • FFC can be implemented in commodity switches • FFC has no data loss due to congestion under faults • Large-scale simulation A WAN network with O(100) switches and O(1000) links Injecting faults based on real failure reports Single priority traffic in a well-provisioned network Multiple priority traffic in a well-utilized network 21 FFC prevents congestion with negligible throughput loss 160% Single priority High priority (High FFC protection) Medium priority (Low FFC protection) Low priority (No FFC protection) 100 Ratio (%) 80 60 40 20 0 <0.01% FFC Data-loss / Non-FFC Data-loss FFC Throughput / Optimal Throughput 22 Conclusions • Centralized TE is critical to high network utilization but is vulnerable to control and data plane faults. • FFC proactively handle these faults. • Guarantee: no congestion when #faults ≤ k. • Efficiently computable with low throughput overhead in practice. FFC High risk of congestion Heavy network over-provisioning 23