Scalable Management of Enterprise and Data Center Networks
Minlan Yu (minlanyu@cs.princeton.edu), Princeton University

Edge Networks
[Figure: enterprise networks (corporate and campus), data centers (cloud), and home networks sitting at the edge of the Internet]

Redesign Networks for Management
• Management is important, yet underexplored
  – Takes 80% of the IT budget
  – Responsible for 62% of outages
• Making management easier
  – The network should be truly transparent
• Goal: redesign the networks to make them easier and cheaper to manage

Main Challenges
• Flexible policies (routing, security, measurement)
• Large networks (hosts, switches, apps)
• Simple switches (cost, energy)

Large Enterprise Networks
• Hosts (10K - 100K)
• Switches (1K - 5K)
• Applications (100 - 1K)

Large Data Center Networks
• Switches (1K - 10K)
• Servers and virtual machines (100K - 1M)
• Applications (100 - 1K)

Flexible Policies
• Considerations: performance, security, mobility, energy saving, cost reduction, ...
• Measurement and diagnosis: debugging, maintenance, ...
• Example: customized routing and access control for a single user (Alice)

Switch Constraints
• Increasing link speed (10 Gbps and more)
• Small, on-chip memory (expensive, power-hungry)
• Yet switches must store lots of state
  – Forwarding rules for many hosts/switches
  – Access control and QoS for many apps/users
  – Monitoring counters for specific flows

Edge Network Management
• The management system specifies policies, configures devices, and collects measurements
• On switches
  – BUFFALO [CONEXT’09]: scaling packet forwarding
  – DIFANE [SIGCOMM’10]: scaling flexible policies
• On hosts
  – SNAP [NSDI’11]: scaling diagnosis

Research Approach (new algorithms & data structures; systems prototyping; evaluation & deployment)
• BUFFALO: effective use of switch memory; prototype on Click; evaluation on real topologies and traces
• DIFANE: effective use of switch memory; prototype on OpenFlow; evaluation on AT&T data
• SNAP: efficient data collection and analysis; prototype on Windows/Linux OS; deployment in Microsoft

BUFFALO [CONEXT’09]: Scaling Packet Forwarding on Switches

Packet Forwarding in Edge Networks
• Hash table in SRAM to store the forwarding table
  – Maps MAC addresses to next hops
  – Suffers hash collisions (e.g., 00:11:22:33:44:55, 00:11:22:33:44:66, ..., aa:11:22:33:44:77)
• Overprovision to avoid running out of memory
  – Performs poorly when out of memory
  – Difficult and expensive to upgrade memory

Bloom Filters
• Bloom filters in SRAM
  – A compact data structure for a set of elements
  – Calculate s hash functions to store element x
  – Easy to check membership
  – Reduce memory at the expense of false positives
[Figure: element x is hashed by h1(x), h2(x), h3(x), ..., hs(x) into an m-bit array V0 ... Vm-1]

BUFFALO: Bloom Filter Forwarding
• One Bloom filter (BF) per next hop
  – Stores all addresses forwarded to that next hop
[Figure: a packet’s destination is queried against the Bloom filters for next hops 1 through T; the filter that hits gives the next hop]

Comparing with Hash Table
• Saves 65% memory with 0.1% false positives
• More benefits over a hash table
  – Performance degrades gracefully as tables grow
  – Handles worst-case workloads well
[Figure: fast memory size (MB) vs. number of forwarding table entries (K), comparing a hash table with Bloom filters at false-positive rates of 0.01%, 0.1%, and 1%]

False Positive Detection
• Multiple matches in the Bloom filters
  – One of the matches is correct
  – The others are caused by false positives
[Figure: a destination query that hits the Bloom filters of several next hops]

Handle False Positives
• Design goals
  – Should not modify the packet
  – Never go to slow memory
  – Ensure timely packet delivery
• When a packet has multiple matches (handled as in the sketch below)
  – Exclude the incoming interface: avoids loops in the “one false positive” case
  – Random selection from the matching next hops: guarantees reachability with multiple false positives
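The following is a minimal sketch of the lookup path described above: one Bloom filter per next hop, with the incoming interface excluded and a random choice among the remaining matches when false positives occur. The class names, bit-array size, and hash construction are illustrative assumptions, not the BUFFALO prototype (which runs the filters in parallel in SRAM inside kernel-level Click).

    import hashlib
    import random

    class BloomFilter:
        """Compact set membership using s hash functions over an m-bit array."""
        def __init__(self, m_bits=1 << 16, s_hashes=4):
            self.m, self.s = m_bits, s_hashes
            self.bits = bytearray(m_bits)  # one byte per bit, for simplicity

        def _positions(self, key):
            for i in range(self.s):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p] = 1

        def may_contain(self, key):
            # False positives are possible; false negatives are not.
            return all(self.bits[p] for p in self._positions(key))

    class BuffaloForwarder:
        """One Bloom filter per next hop; false positives stay in the data plane."""
        def __init__(self, next_hops):
            self.filters = {nh: BloomFilter() for nh in next_hops}

        def learn(self, dst_mac, next_hop):
            self.filters[next_hop].add(dst_mac)

        def forward(self, dst_mac, in_port):
            # Query every per-next-hop filter (done in parallel in hardware).
            matches = [nh for nh, bf in self.filters.items() if bf.may_contain(dst_mac)]
            # Prefer not to send the packet back out the incoming interface
            # (avoids loops in the one-false-positive case) unless it is the only match.
            candidates = [nh for nh in matches if nh != in_port] or matches
            # Random selection keeps the packet moving and guarantees reachability
            # even when several filters report false positives.
            return random.choice(candidates) if candidates else None

    # Hypothetical usage: two output ports, one learned address.
    fwd = BuffaloForwarder(next_hops=[1, 2])
    fwd.learn("00:11:22:33:44:55", next_hop=2)
    print(fwd.forward("00:11:22:33:44:55", in_port=1))  # 2, barring false positives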
One False Positive
• Most common case: one false positive
  – When there are multiple matching next hops
  – Avoid sending to the incoming interface
• Provably at most a two-hop loop
  – Stretch <= Latency(AB) + Latency(BA)
[Figure: a packet to dst bounces once between neighboring switches A and B]

Stretch Bound
• Provable expected stretch bound
  – With k false positives, the expected stretch is at most O(3^(k/3))
  – Proved using random walk theories
• However, the stretch is actually not bad in practice
  – False positives are independent
  – The probability of k false positives drops exponentially in k
• Tighter bounds in special topologies
  – For a tree, the expected stretch is 2(k-1)^2 (k > 1)

BUFFALO Switch Architecture

Prototype Evaluation
• Environment
  – Prototype implemented in kernel-level Click
  – 3.0 GHz 64-bit Intel Xeon
  – 2 MB L2 data cache, used as the SRAM of size M
• Forwarding table
  – 10 next hops, 200K entries
• Peak forwarding rate
  – 365 Kpps, 1.9 μs per packet
  – 10% faster than hash-based EtherSwitch

BUFFALO Conclusion
• Indirection for scalability
  – Send false-positive packets to a random port
  – Stretch increases gracefully as the forwarding table grows
• Bloom filter forwarding architecture
  – Small, bounded memory requirement
  – One Bloom filter per next hop
  – Optimization of Bloom filter sizes
  – Dynamic updates using counting Bloom filters

DIFANE [SIGCOMM’10]: Scaling Flexible Policies on Switches

Traditional Network
• Management plane: offline, sometimes manual
• Control plane: hard to manage
• Data plane: limited policies
• New trends: flow-based switches and logically centralized control

Data Plane: Flow-based Switches
• Perform simple actions based on rules
  – Rules: match on bits in the packet header
  – Actions: drop, forward, count
  – Rules are stored in high-speed memory (TCAM, Ternary Content Addressable Memory)
• Example rules over a flow space with source X and destination Y:
  1. X:*  Y:1  drop
  2. X:5  Y:3  drop
  3. X:1  Y:*  count
  4. X:*  Y:*  forward

Control Plane: Logically Centralized
• Software defined networking: RCP [NSDI’05], 4D [CCR’05], Ethane [SIGCOMM’07], NOX [CCR’08], Onix [OSDI’10]
• DIFANE: a scalable way to apply fine-grained policies

Pre-install Rules in Switches
• The controller pre-installs the rules; packets hit the rules and are forwarded directly
• Problems: limited TCAM space in switches
  – No host mobility support
  – Switches do not have enough memory

Install Rules on Demand (Ethane)
• The first packet misses the rules; the switch buffers it and sends the packet header to the controller, which installs rules so the packet can be forwarded
• Problems: limited resources in the controller
  – Delay of going through the controller
  – Switch complexity
  – Misbehaving hosts

Design Goals of DIFANE
• Scale with network growth
  – Limited TCAM at switches
  – Limited resources at the controller
• Improve per-packet performance
  – Always keep packets in the data plane
• Minimal modifications in switches
  – No changes to data-plane hardware
• Approach: combine proactive and reactive approaches for better scalability

DIFANE: Doing it Fast and Easy (two stages)

Stage 1
• The controller proactively generates the rules and distributes them to authority switches.

Partition and Distribute the Flow Rules
• The controller partitions the flow space among authority switches and distributes the partition information to all switches
[Figure: flow space split among authority switches A, B, and C; packets travel from the ingress switch to the egress switch, with accept/reject decided by the distributed rules]

Stage 2
• The authority switches keep packets always in the data plane and reactively cache rules at the ingress switches (a sketch follows).
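The following is a minimal sketch of the two stages, under simplifying assumptions: rules match rectangles of a small (src, dst) flow space, and redirection, caching, and forwarding are modeled as ordinary function calls rather than TCAM actions. The Rule, AuthoritySwitch, and IngressSwitch names are illustrative, not from the DIFANE prototype, and the sketch caches the matching rule directly, ignoring the overlapping-wildcard issue addressed on the “Caching Wildcard Rules” slides below.

    from dataclasses import dataclass

    @dataclass
    class Rule:
        # A rule covers a rectangle of the (src, dst) flow space and carries an action.
        src_range: tuple        # (lo, hi), inclusive
        dst_range: tuple
        action: str             # e.g. "accept" or "reject"
        priority: int = 0

        def matches(self, src, dst):
            return (self.src_range[0] <= src <= self.src_range[1]
                    and self.dst_range[0] <= dst <= self.dst_range[1])

    class AuthoritySwitch:
        """Holds the authority rules for one partition of the flow space."""
        def __init__(self, rules):
            self.rules = sorted(rules, key=lambda r: -r.priority)

        def handle(self, src, dst, ingress):
            # Stage 2: stay in the data plane, apply the matching authority rule,
            # and reactively install it as a cache rule at the ingress switch.
            for rule in self.rules:
                if rule.matches(src, dst):
                    ingress.cache(rule)
                    return rule.action
            return "drop"       # no authority rule covers this packet

    class IngressSwitch:
        """Holds reactively cached rules plus proactively installed partition rules."""
        def __init__(self, partitions):
            # partitions: list of (region Rule, AuthoritySwitch); the region's
            # action field is unused and only names the owner for readability.
            self.partitions = partitions
            self.cached = []

        def cache(self, rule):
            self.cached.append(rule)

        def process(self, src, dst):
            # Cached rules first: following packets hit here and are forwarded directly.
            for rule in sorted(self.cached, key=lambda r: -r.priority):
                if rule.matches(src, dst):
                    return rule.action
            # Cache miss: a partition rule redirects the packet, still in the
            # data plane, to the authority switch that owns this region.
            for region, authority in self.partitions:
                if region.matches(src, dst):
                    return authority.handle(src, dst, ingress=self)
            return "drop"

    # Stage 1 (hypothetical partition, matching the "Locate Authority Switches"
    # slide below): the controller installs authority rules and partition rules.
    auth_a = AuthoritySwitch([Rule((0, 1), (0, 3), "accept", priority=1)])
    auth_b = AuthoritySwitch([Rule((2, 5), (0, 1), "reject", priority=1)])
    ingress = IngressSwitch([(Rule((0, 1), (0, 3), "owned by A"), auth_a),
                             (Rule((2, 5), (0, 1), "owned by B"), auth_b)])
    print(ingress.process(src=3, dst=1))  # first packet: redirected via authority switch B
    print(ingress.process(src=3, dst=1))  # following packets: hit the cached rule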
Packet Redirection and Rule Caching
• The first packet of a flow is redirected from the ingress switch to the authority switch, which forwards it on toward the egress switch and caches rules at the ingress switch
• Following packets hit the cached rules and are forwarded directly
• A slightly longer path in the data plane is faster than going through the control plane

Locate Authority Switches
• Partition information in ingress switches
  – Using a small set of coarse-grained wildcard rules
  – ... to locate the authority switch for each packet
• A distributed directory service of rules
  – Hashing does not work for wildcards
• Example partition: X:0-1 Y:0-3 handled by A; X:2-5 Y:0-1 by B; X:2-5 Y:2-3 by C

Packet Redirection and Rule Caching (with the three rule sets)
• The first packet matches a partition rule at the ingress switch and is redirected to the authority switch, which matches its authority rules, forwards the packet toward the egress switch, and installs cache rules at the ingress switch
• Following packets hit the cached rules and are forwarded directly

Three Sets of Rules in TCAM
• Cache rules: in ingress switches, reactively installed by authority switches
  – Priority 1: 00** 111*, forward to switch B, timeout 10 sec
  – Priority 2: 1110 11**, drop, timeout 10 sec
  – ...
• Authority rules: in authority switches, proactively installed by the controller
  – Priority 14: 00** 001*, forward and trigger the cache manager, timeout infinity
  – Priority 15: 0001 0***, drop and trigger the cache manager
  – ...
• Partition rules: in every switch, proactively installed by the controller
  – Priority 109: 0*** 000*, redirect to the authority switch
  – Priority 110: ...

DIFANE Switch Prototype
• Built with an OpenFlow switch; just a software modification for authority switches
• Control plane: a cache manager (only in authority switches) that receives notifications and sends/receives cache updates
• Data plane: cache rules, authority rules (only in authority switches), and partition rules

Caching Wildcard Rules
• Overlapping wildcard rules
  – Cannot simply cache matching rules
[Figure: overlapping rules R1-R4 over the (src, dst) flow space, with priority R1 > R2 > R3 > R4]

Caching Wildcard Rules (continued)
• Multiple authority switches
  – Contain independent sets of rules
  – Avoid cache conflicts in the ingress switch
[Figure: the rule set split between authority switch 1 and authority switch 2]

Partition Wildcard Rules
• Partition rules
  – Minimize the TCAM entries in switches
  – Decision-tree based rule partition algorithm
[Figure: two candidate cuts of the flow space; cut B is better than cut A]

Testbed for Throughput Comparison
• Testbed with around 40 computers
• Ethane setup: traffic generators and ingress switches connected to the controller
• DIFANE setup: traffic generators and ingress switches connected to an authority switch and the controller

Peak Throughput
• One authority switch; measuring the first packet of each flow
• DIFANE is self-scaling: higher throughput with more authority switches
[Figure: throughput vs. sending rate (flows/sec) for 1 to 4 ingress switches; DIFANE peaks around 800K flows/sec at the ingress-switch bottleneck, while NOX/Ethane is capped at roughly 20K-50K flows/sec by the controller bottleneck]

Scaling with Many Rules
• Analyzed rules from campus and AT&T networks
  – Collected configuration data on switches
  – Retrieved network-wide rules
  – E.g., 5M rules and 3K switches in an IPTV network
• Distributed the rules among authority switches
  – Only 0.3% - 3% of switches need to be authority switches
  – Depending on network size, TCAM size, and the number of rules

Summary: DIFANE in the Sweet Spot
• Traditional networks are distributed but hard to manage
• OpenFlow/Ethane is logically centralized but not scalable
• DIFANE: scalable management
  – The controller is still in charge
  – Switches host a distributed directory of the rules

SNAP [NSDI’11]: Scaling Performance Diagnosis for Data Centers (Scalable Net-App Profiler)

Applications inside Data Centers
[Figure: requests fan out from a front end through aggregators to many worker servers]

Challenges of Datacenter Diagnosis
• Large complex applications
  – Hundreds of application components
  – Tens of thousands of servers
• New performance problems
  – Code is updated to add features or fix bugs
  – Components change while the application is still in operation
• Old performance problems (human factors)
  – Developers may not understand the network well
  – Nagle’s algorithm, delayed ACK, etc.

Diagnosis in Today’s Data Center
• App logs (#reqs/sec, response time, e.g., 1% of requests see >200 ms delay): application-specific
• Packet traces from sniffers (filtered for long-delay requests): too expensive
• Switch logs (#bytes/pkts per minute): too coarse-grained
• SNAP diagnoses net-app interactions at the host (app and OS): generic, fine-grained, and lightweight

SNAP: A Scalable Net-App Profiler that runs everywhere, all the time

SNAP Architecture
• At each host, for every connection: online, lightweight processing and diagnosis
  – Collect data: adaptively poll per-socket statistics in the OS
    - Snapshots (e.g., #bytes in the send buffer)
    - Cumulative counters (e.g., #FastRetrans)
  – Performance classifier: classify based on the stage of data transfer (sender app, send buffer, network, receiver)
• Offline, cross-connection diagnosis
  – Cross-connection correlation, using topology, routing, and connection-to-process/app mappings from the management system
  – Output: the offending application, host, link, or switch
(A sketch of the per-connection collection and classification step follows.)
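The following is a minimal sketch of that per-connection step. It assumes two caller-supplied callbacks, read_stats (returning a snapshot of per-socket statistics, for example derived from TCP_INFO on Linux) and report; the SocketStats fields, thresholds, and classification heuristic are illustrative placeholders, not SNAP’s actual counter set or classifier.

    import time
    from dataclasses import dataclass

    @dataclass
    class SocketStats:
        send_buffer_bytes: int      # snapshot: bytes waiting in the send buffer
        send_buffer_limit: int      # configured send buffer size
        fast_retrans: int           # cumulative counter
        timeouts: int               # cumulative counter
        rwin_limited_ms: int        # cumulative time limited by the receiver window
        delayed_ack_stalls: int     # cumulative count of ~200 ms ACK stalls

    def classify_interval(prev: SocketStats, cur: SocketStats) -> str:
        """Attribute one polling interval to a stage of the data transfer:
        sender app, send buffer, network, or receiver."""
        if cur.fast_retrans > prev.fast_retrans or cur.timeouts > prev.timeouts:
            return "network"                    # losses: fast retransmission / timeout
        if cur.send_buffer_bytes >= cur.send_buffer_limit:
            return "send buffer"                # send buffer not large enough
        if cur.rwin_limited_ms > prev.rwin_limited_ms:
            return "receiver (not reading fast enough)"
        if cur.delayed_ack_stalls > prev.delayed_ack_stalls:
            return "receiver (delayed ACK)"
        return "sender app"                     # nothing else limited this interval

    def poll_loop(read_stats, report, base_interval=0.5, max_interval=5.0):
        """Adaptive polling: poll problematic connections often and healthy
        ones rarely, to keep the collection overhead low."""
        interval, prev = base_interval, read_stats()
        while True:
            time.sleep(interval)
            cur = read_stats()
            stage = classify_interval(prev, cur)
            report(stage)
            # Back off while the connection looks healthy, poll faster otherwise.
            interval = (min(interval * 2, max_interval)
                        if stage == "sender app" else base_interval)
            prev = cur

The classification buckets deliberately mirror the breakdown on the “Characterizing Performance Limitations” slide below: send buffer, network, receiver not reading fast enough, and receiver delayed ACK, with everything else attributed to the sending application.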
SNAP in the Real World
• Deployed in a production data center
  – 8K machines, 700 applications
  – Ran SNAP for a week, collecting terabytes of data
• Diagnosis results
  – Identified 15 major performance problems
  – 21% of applications have network performance problems

Characterizing Performance Limitations (number of apps limited for >50% of the time)
• Send buffer: 1 app
  – Send buffer not large enough
• Network: 6 apps
  – Fast retransmission
  – Timeout
• Receiver: 8 apps
  – Not reading fast enough (CPU, disk, etc.)
• Receiver: 144 apps
  – Not ACKing fast enough (delayed ACK)

Delayed ACK Problem
• Delayed ACK affected many delay-sensitive apps
  – Even number of packets per record: 1,000 records/sec
  – Odd number of packets per record: 5 records/sec
  – Delayed ACK was used to reduce bandwidth usage and server interrupts
• The receiver ACKs every other packet and otherwise delays the ACK by up to about 200 ms
• Proposed solution: delayed ACK should be disabled in data centers

Diagnosing Delayed ACK with SNAP
• Monitor at the right place
  – Scalable, lightweight data collection at all hosts
• Algorithms to identify performance problems
  – Identify delayed ACK with OS information
• Correlate problems across connections
  – Identify the apps with significant delayed ACK issues
• Fix the problem with operators and developers
  – Disable delayed ACK in data centers

Edge Network Management (recap)
• The management system specifies policies, configures devices, and collects measurements
• On switches
  – BUFFALO [CONEXT’09]: scaling packet forwarding
  – DIFANE [SIGCOMM’10]: scaling flexible policies
• On hosts
  – SNAP [NSDI’11]: scaling diagnosis

Thanks!
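Backup: a back-of-the-envelope model of the record rates on the “Delayed ACK Problem” slide above. It assumes the sender must wait for the ACK of a record’s last packet before sending the next record, that an odd-length record’s last packet waits for the full ~200 ms delayed-ACK timer, and that a record otherwise takes about 1 ms; the 1 ms figure is an assumed illustrative value, not a number from the talk.

    # Rough model of delayed ACK's impact on a request/response workload.
    DELAYED_ACK_TIMER = 0.200   # seconds: typical delayed-ACK timeout
    BASE_RECORD_TIME = 0.001    # seconds: assumed per-record time without any stall

    def records_per_second(pkts_per_record: int) -> float:
        # With an even number of packets the receiver ACKs every other packet,
        # so the last packet is ACKed immediately; with an odd number the last
        # packet sits until the delayed-ACK timer fires.
        stall = 0.0 if pkts_per_record % 2 == 0 else DELAYED_ACK_TIMER
        return 1.0 / (BASE_RECORD_TIME + stall)

    print(round(records_per_second(2)))  # ~1000 records/sec for an even packet count
    print(round(records_per_second(3)))  # ~5 records/sec for an odd packet count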