Profiling Network Performance for Multi-Tier Data Center Applications
Offense by: Balasaheb Bagul, Rumou Duan

Polling Interval – How much is it?
• Errors appear when the interval is either smaller or greater than 500 ms.

SNAP Configuration – I
• 8K hosts and just 700 applications!
• In Section 3.2, only discrete socket-call logs are collected (“99.8% of connections has low throughput less than 1 MB/s”)
– 1 GB of data per host per day and 1 TB per week!
• Continuous TCP logs are completely ignored
– With a polling interval averaging 500 ms
– 120 bytes per connection per poll?

SNAP Configuration – II
• Where is the collected data analyzed? On the hosts or on a centralized server?
– If centralized: 8000 × 1 GB per day of socket logs alone
– How and when is this data sent to the central server?

SNAP Configuration – III
• Socket-to-process mapping
– Done when the sockets are opened
– Processes can create new sockets and close old ones dynamically
– So the mapping has to be done within that short time frame, and continuously.

CPU Overhead – I (At each host)
• Polling TCP stats + reading TCP table = 5% + 5% < 10%
• Collecting socket logs: 1.6%
• What about the TCP performance classifier?

Fine-grained profiling? TCP Incast Problem
• In the paper: “For example, the TCP incast problem [3], caused by micro bursts of traffic at the timescale of tens of milliseconds, is not even visible in SNMP data.”
• However, based on Figure 8, the CPU overhead is really large.

CPU Overhead – II (At Server)
• Cross-connection correlation is centralized
• How will it scale?
– No mention of it!
• How does it work?
• “SNAP has full knowledge of network topology, the network-stack configuration, and mappings of applications to servers.”

SNAP Validation
• Test beds include only 36 hosts!
• Extremely small amount of data collected
• ACC (average correlation coefficient) = 0.4
– Why?
– Are all the connections with ACC just above 0.4 facing problems?

Advice to DC Operators – Seriously!
1. “Operators should schedule backup jobs more carefully to avoid triggering network congestion”
– 2 am to 4 am is the most idle time for bulk transfers, so why change it?
2. “Operators should disable delayed ACK or reduce it significantly”
– What about time-critical applications?

Advice to Developers – Again, Seriously!
• Claim: “Developers can use these logs to quickly find the root cause of performance problems.”
• The problems that SNAP detected took several days or weeks to solve!
– Do developers have weeks to spare?
– Does this mean that SNAP’s data is not effective for developers?
• “There should be better scheduling of traffic across applications…”
– How to do it?

Conclusion
• Not scalable due to the centralized server
• Huge amount of data collected per host per day, continuously
• Get it to work with more applications!

Thank you!
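
Backup – Data Volume Arithmetic
The scaling questions raised on the SNAP Configuration slides can be checked with back-of-envelope arithmetic. The sketch below is only an illustration: the inputs are the figures quoted on those slides, while the decimal units (1 TB = 1000 GB), a full 24 hours of polling per day, and the function names are our own assumptions, not anything stated in the paper.

```python
# Back-of-envelope check of the data volumes questioned in the slides.
# Inputs come from the SNAP Configuration slides. Decimal units
# (1 TB = 1000 GB) and round-the-clock polling are assumptions.

HOSTS = 8000                          # "8K hosts"
SOCKET_LOG_GB_PER_HOST_PER_DAY = 1    # "1 GB of data per host per day"
POLL_INTERVAL_S = 0.5                 # average polling interval of 500 ms
BYTES_PER_CONN_PER_POLL = 120         # "120 bytes per connection per poll"
SECONDS_PER_DAY = 24 * 60 * 60


def central_socket_log_tb(days):
    """Socket-log volume (TB) reaching a central server if every host ships its logs."""
    total_gb = HOSTS * SOCKET_LOG_GB_PER_HOST_PER_DAY * days
    return total_gb / 1000


def tcp_poll_mb_per_conn_per_day():
    """TCP-statistics volume (MB) produced by one connection polled all day."""
    polls_per_day = SECONDS_PER_DAY / POLL_INTERVAL_S
    return polls_per_day * BYTES_PER_CONN_PER_POLL / 1e6


if __name__ == "__main__":
    print(f"Central socket logs per day:  {central_socket_log_tb(1):.1f} TB")
    print(f"Central socket logs per week: {central_socket_log_tb(7):.1f} TB")
    print(f"TCP stats per connection/day: {tcp_poll_mb_per_conn_per_day():.3f} MB")
```

Under these assumptions a central server would receive 8 TB of socket logs per day and 56 TB per week, and continuous TCP polling would add roughly 20.7 MB per long-lived connection per day. These totals are what lie behind the "How will it scale?" question.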