Passive inference: Troubleshooting the Cloud with Tstat
Alessandro Finamore <alessandro.finamore@polito.it>
TMA: Traffic Monitoring and Analysis
4th TMA PhD School, London, Apr 16th, 2014

Active vs. passive inference
  Active inference: study cause/effect relationships, i.e., inject some traffic into the network to observe a reaction
    PRO: world-wide scale (e.g., PlanetLab)
    CON: synthetic benchmarks suffer from a lack of generality
  Passive inference: study traffic properties just by observing the traffic, without interfering with it
    PRO: studies traffic generated by actual Internet users
    CON: limited number of vantage points

The network monitoring playground
  [Diagram: Supervisor, Repository, passive probe, data]
  Deploy some vantage points
  Collect some measurements
  Extract analytics
    What is the performance of a cache?
    What is the performance of YouTube video streaming?
  Challenges?
    Automation
    Flexibility/Openness

Pushing the paradigm further with mPlane
  FP7 European project about the design and implementation of a measurement plane for the Internet
  Large scale
    Vantage points deployed on a worldwide scale
  Flexible
    Offers APIs for integrating existing measurement frameworks
    Not strictly bound to specific "use cases"
  Intelligent
    Automate/simplify the process of "cooking" raw data
    Identify anomalies and unexpected events
    Provide root-cause-analysis capabilities

mPlane consortium
  FP7 IP, 3 years long, 11 Meuro
  16 partners
    3 operators
    6 research centers
    5 universities
    2 small enterprises
  Coordinator: Marco Mellia (POLITO)
  Work packages: WP1 to WP7
  People
    Saverio Nicolini (NEC)
    Ernst Biersack (Eurecom)
    Dina Papagiannaki (Telefonica)
    Brian Trammell (ETH)
    Tivadar Szemethy (NetVisor)
    Andrea Fregosi (Fastweb)
    Dario Rossi (ENST)
    Guy Leduc (Univ. Liege)
    Pietro Michiardi (Eurecom)
    Fabrizio Invernizzi (Telecom Italia)
    Pedro Casas (FTW)

Pushing the paradigm further with mPlane
  [Diagram: Supervisor, Repository, active probe, passive probe, data, control]
  Integration with existing monitoring frameworks
  Active and passive analysis for iterative root-cause analysis

What else besides mPlane?
  "From global measurements to local management"
    Specific Targeted Research Project (STReP)
    3 years (2 left), 10 partners, 3.8 Meuro
    … is a sort of "mPlane use case"
    Builds a measurement framework out of probes
  IETF, Large-Scale Measurement of Broadband Performance (LMAP)
    Standardization effort on how to do broadband measurements
      Defining the components, protocols, rules, etc.
    Strong similarities in the architecture core
    It does not specifically target adding "a brain" to the system

The network monitoring trinity
  [Diagram: Raw measurements, Repository, Post-processing, Supervisor]
  Focus on raw measurements
    How to process network traffic?
    How to scale at 10Gbps?
  Try not to focus on just one aspect but rather on "mastering the trinity"

Tstat: http://tstat.polito.it
  Tstat is the passive sniffer developed at Polito over the last 10 years
  [Diagram: private network, border router, rest of the world; Tstat at the border router produces traffic stats]
  Question: Which are the most popular services accessed?
  Question: How are CDNs/datacenters composed?

Tstat: http://tstat.polito.it
  Per-flow stats including
    Several L3/L4 metrics (e.g., #pkts, #bytes, RTT, TTL, etc.)
    Traffic classification
      Deep Packet Inspection (DPI)
      Statistical methods (Skype, obfuscated P2P)
    Different output formats (logs, RRDs, histograms, pcap)
  Runs on off-the-shelf HW
    Up to 2Gb/s with a standard NIC
  Currently adopted in real network scenarios (campus and ISP)

Research/technology challenge
  Challenge: Is it possible to build a "full-fledged" passive probe that copes with >10Gbps?
  Ad-hoc NICs are too expensive (>10 keuro)
  Software solutions built on top of common Intel NICs
    ntop DNA: [IMC'10] High Speed Network Traffic Analysis with Commodity Multi-core Systems
    netmap: [ACM Queue] Revisiting Network I/O APIs: The netmap Framework
    PFQ: [PAM'12] PFQ: a Novel Engine for Multi-Gigabit Packet Capturing With Multi-Core Commodity Hardware
  By offering direct access to the NIC (i.e., bypassing the kernel stack) the libraries can count packets at wire speed
  …but what about doing real processing?

Possible system architecture
  [Diagram: Read pkts -> Dispatch / Scheduling -> consumer1 … consumerN -> out1 … outN -> merge]
  Read pkts
    One or more processes for reading? Depends…
  Dispatch / Scheduling
    Per-flow packet scheduling is the simplest option, but
      What about correlating multiple flows (e.g., DNS/TCP)?
      What about scheduling per traffic class?
  Consumers
    N identical consumer instances?
    Within each consumer, a single execution flow?
    How to organize the workflow of the analysis modules?
  Output
    If needed, design "mergeable" output
  A solution based on libDNA is under testing
  [Figure: Tstat + libDNA (synth. traffic); x axis: wire speed [Gbps], 1 to 10; y axis: % pkts drop, 0 to 20; annotation: "Margin to improve"]

Other traffic classification tools?
  WAND (Shane Alcock) - http://research.wand.net.nz
    Libprotoident, traffic classification using 4 bytes of payload
    Libtrace, rebuilds TCP/UDP, and other tools for processing pcaps
  ntop (Luca Deri) - http://www.ntop.org/products/ndpi
    nDPI, a superset of OpenDPI
  l7filter, but it is known to be inaccurate
  A fancy classifier does not matter if you do not have proper flow characterization
  The literature is full of statistical/behavioral traffic classification methodologies [1,2] but AFAIK
    no real deployment
    no open source tool released
  [1] "A survey of techniques for internet traffic classification using machine learning", IEEE Communications Surveys & Tutorials, 2009
  [2] "Reviewing Traffic Classification", LNCS Vol. 7754, 2013

Measurement frameworks
  RIPE Atlas - http://ripe.atlas.net
    World-wide deployment of inexpensive active probes
    User Defined Measurement (UDM), credit based
      Ping, traceroute/traceroute6, DNS, HTTP
  Google M-Lab
    Network Diagnostic Test (NDT) - http://mlab-live.appspot.com/tools/ndt
      Connectivity and bandwidth speed
    Publicly available data … but IMO not straightforward to use

Recent research activities
  [Diagram: Raw measurements, Repository, Post-processing, Supervisor]
  Focus on raw measurements
    How to process network traffic?
    How to scale at 10Gbps?
  Focus on the repository
    How to export/consolidate data continuously?
    What about BigData?

(Big)Data export frameworks
  Overcrowded scenario
    https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation
  All general purpose frameworks
    Data center scale
    Emphasis on throughput and/or real-time and/or consistency, etc.
    Typically designed/optimized for HDFS
  log_sync, "ad-hoc" solution @ POLITO
    Designed to manage a few passive probes
    Emphasis on throughput and data consistency

Data management @ POLITO
  Probes (probe1 … probeN, ISP/Campus)
    log_sync (server)
  Gateway
    log_sync (client)
    pre-processing (dual 4-core, 3TB disk, 16GB RAM)
  NAS
    ~40TB (3TB x 12) = 1 year of data
  Cluster
    11 nodes = 9 data nodes + 2 namenodes
    Single 6-core per node = 66 cores (x2 with HT)
    416GB RAM = 32GB x 9 + 64GB x 2
    ~32TB HDFS
    Debian 6 + CDH 4.5.0

BigData = Hadoop?
  Almost true, but there are other NoSQL solutions
    MongoDB, Redis, Cassandra, Spark, Neo4j, etc.
    http://nosql-database.org
  How to choose? Not so easy to say, but
    Avoid BigData frameworks if you have just a few GB of data
    Sooner or later you are going to do some coding, so pick something that seems "comfortable"
  Fun fact: MapReduce is a NoSQL paradigm but people are used to SQL queries
    Rise of Pig, Hive, Impala, Shark, etc., which allow SQL-like queries on top of MapReduce

Recent research activities
  [Diagram: Raw measurements, Repository, Post-processing, Supervisor]
  Focus on raw measurements
    How to process network traffic?
    How to scale at 10Gbps?
  Focus on the repository
    How to export/consolidate data continuously?
    What about BigData?
  Focus on post-processing
    Case study of an Akamai "cache" performance
  "DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring", TRAC'14

Monitoring an Akamai cache
  Focusing on a vantage point of ~20k ADSL customers
  1 week of HTTP logs (May 2012)
  Content served by the Akamai CDN
  The ISP hosts an Akamai "preferred cache" (a specific /25 subnet)

Reasoning about the problem
  Q1: Is this affecting specific FQDNs accessed?
  Q2: Are the variations due to "faulty" servers?
  Q3: Was this triggered by CDN performance issues?
  Etc…
  How to automate/simplify this reasoning?
  DBStream (FTW)
    Continuous big data analytics
    Flexible processing language
    Full SQL processing capabilities
    Processing in small batches
    Storage for post-mortem analysis

Q1: Is this affecting a specific FQDN?
  Select the top 500 Fully Qualified Domain Names (FQDN) served by Akamai
  Check if they are served by the preferred /25 subnet
  Repeat every 5 min
  [Figure: top-500 FQDNs over time (Mon 06:00 to Wed 00:00), split between the preferred /25 subnet and other subnets; legend: Akamai Preferred, Akamai Others]
    Some FQDNs are not served by the preferred cache
    Some FQDNs are hosted by the preferred cache, except during the anomaly
  The two sets have "services" in common
  Same results when extending to more than 500 FQDNs

Q2: Are the variations due to "faulty" servers?
  Compute the traffic volume per IP address
  Check the behavior during the disruption
  Repeat every 5 min
  [Figure: Akamai preferred IPs (/25 subnet), traffic volume per IP over time (Mon 06:00 to Wed 00:00)]

Q3: Was this triggered by performance issues?
  Compute the distribution of the server query elaboration time
    It is the time between the TCP ACK of the HTTP GET and the reception of the first byte of the reply
    [Diagram: client, passive probe, server; query, processing time, DATA]
  Focus on the traffic of the preferred /25 subnet
  Compare the quartiles of the server elaboration time every 5 min
  [Figure: elaboration time over time (Mon 06:00 to Wed 00:00), log-scale y axis; curves: 5th, 25th, 50th, 75th percentiles]
  Performance decreases right before the anomaly @6pm

Reasoning about the problem
  Q1: Is this affecting only specific services?
  Q2: Are the variations due to "faulty" servers?
  Q3: Was this triggered by CDN performance issues?
  What else?
    Do other vantage points report the same problem? YES!
    What about extending the time period?
      The anomaly is present along the whole period we considered
      Ongoing extension of the analysis to more recent data sets (possibly also exposing other effects/anomalies)
    Routing? TODO: Route Views
    DNS mapping? TODO: RIPE Atlas + ISP active probing infrastructure
  Other suggestions are welcome

…ok, but what are the final takeaways?
  Try to automate your analysis
  Think about what you measure and be creative, especially for visualization
  Enlarge your perspective
    multiple vantage points
    multiple data sources
    analysis on large time windows
  Don't be afraid to ask for opinions

?? || ##
<alessandro.finamore@polito.it>
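
The Q3 check (quartiles of the server elaboration time, every 5 min, restricted to the preferred /25 subnet) can be sketched in a few lines of Python. This is only an illustration and not the DBStream job used for the talk. The record layout (timestamp in seconds, server IP, elaboration time in ms), the function name and the 192.0.2.0/25 prefix are assumptions, since the slides give neither the log format nor the actual subnet.

```python
import ipaddress
from collections import defaultdict
from statistics import quantiles

BIN_SECONDS = 300  # 5-minute bins, as in the slides

# Hypothetical prefix (documentation range): the slides only say "a specific /25 subnet".
PREFERRED_SUBNET = ipaddress.ip_network("192.0.2.0/25")


def elaboration_quartiles(records, subnet=PREFERRED_SUBNET, bin_seconds=BIN_SECONDS):
    """Return {bin_start: (q25, q50, q75)} of the elaboration time per time bin.

    records: iterable of (timestamp_s, server_ip, elaboration_ms) tuples.
    This layout is an assumption, not the actual Tstat log format.
    """
    bins = defaultdict(list)
    for ts, server_ip, elab_ms in records:
        # Keep only flows served by the preferred cache.
        if ipaddress.ip_address(server_ip) in subnet:
            bin_start = int(ts // bin_seconds) * bin_seconds
            bins[bin_start].append(elab_ms)

    result = {}
    for bin_start, values in sorted(bins.items()):
        if len(values) < 2:
            continue  # statistics.quantiles needs at least two samples
        q25, q50, q75 = quantiles(values, n=4, method="inclusive")
        result[bin_start] = (q25, q50, q75)
    return result


if __name__ == "__main__":
    sample = [
        (0, "192.0.2.1", 10),
        (10, "192.0.2.2", 20),
        (20, "192.0.2.3", 30),
        (30, "192.0.2.200", 1000),  # outside the /25, ignored
        (310, "192.0.2.1", 5),      # lone sample in its bin, skipped
    ]
    for start, qs in elaboration_quartiles(sample).items():
        print(start, qs)
```

Plotting the three values per bin gives a curve of the same kind as the one on the Q3 slide. A sudden rise of the median in the bins before 6pm would be the sign to look for.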