Internet traffic measurement: from packets to insight
George Varghese (based on Cristi Estan's work)
University of California, San Diego
May 2011

Research motivation
• The Internet in 1969: the problems were flexibility, speed, and scalability; for measurement and control, ad-hoc solutions sufficed.
• The Internet today: the problems are overloads, attacks, and failures; engineered measurement and control solutions are needed.
• Research direction: towards a theoretical foundation for systems doing engineered measurement of the Internet.

Current solutions
[Diagram: raw data from a fast link is reduced in router memory (must be concise) and sent to an analysis server, which produces traffic reports for the network operator (must be accurate).]
• State of the art: simple counters (SNMP), time-series plots of traffic (MRTG), sampled packet headers (NetFlow), top-k reports.

Measurement challenges
• Data reduction – performance constraints
  – Memory (terabytes of data each hour)
  – Link speeds (40 Gbps links)
  – Processing (8 ns to process a packet)
• Data analysis – unpredictability
  – Unconstrained service model (e.g. Napster, Kazaa)
  – Unscrupulous agents (e.g. the Slammer worm)
  – Uncontrolled growth (e.g. user growth)

Main contributions
• Data reduction: algorithmic solutions for measurement building blocks
  – Identifying heavy hitters (part 1 of talk)
  – Counting flows or distinct addresses
• Data analysis: traffic cluster analysis automatically finds the dominant modes of network usage (part 2 of talk)
  – AutoFocus, a traffic analysis system used by hundreds of network administrators

Part 1: Identifying heavy hitters with multistage filters

Why are heavy hitters important?
• Network monitoring: current tools report the top applications and the top senders/receivers of traffic
• Security: malicious activities such as worms and flooding DoS attacks generate much traffic
• Capacity planning: the largest elements of the traffic matrix determine network growth trends
• Accounting: usage-based billing matters most for the most active customers

Problem definition
• Identify and measure all streams whose traffic exceeds a threshold (e.g. 0.1% of link capacity) over a certain time interval (e.g. 1 minute)
  – Streams defined by header fields (e.g. destination IP)
  – Single pass over the packets
  – Small worst-case per-packet processing
  – Small memory usage
  – Few false positives / false negatives

Measuring the heavy hitters
• Unscalable solution: keep a hash table with a counter for each stream and report the largest entries
• Inaccurate solution: count only sampled packets and compensate in the analysis
• Ideal solution: count all packets, but only for the heavy hitters
• Our solution: identify the heavy hitters on the fly
• Fundamental advantage over sampling: error bounds proportional to 1/M instead of 1/√M (M is the available memory)

Why is sample & hold better?
• Once one of a stream's packets is sampled, the stream is held and all of its subsequent packets are counted exactly, so the uncertainty is confined to the traffic seen before the entry was created; with ordinary sampling the uncertainty extends over the whole measurement interval.

How do multistage filters work?
• An array of counters; each packet is hashed to a counter (e.g. Hash(Pink)), which is incremented by the packet's size
• Collisions are OK: small streams sharing a bucket rarely push it over the threshold
• When a counter reaches the threshold, the corresponding stream is inserted into stream memory (stream1: 1, stream2: 1) and counted exactly from then on
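The sample-and-hold idea can be sketched in Python. This is a minimal illustration, not the talk's hardware implementation; the class name and parameters are invented for the sketch:

```python
import random

class SampleAndHold:
    """Sketch of sample and hold: once any packet of a stream is
    sampled, the stream is 'held' and all of its later packets are
    counted exactly."""

    def __init__(self, sampling_prob):
        self.p = sampling_prob        # per-byte sampling probability
        self.counts = {}              # held stream id -> bytes counted

    def packet(self, stream_id, size):
        if stream_id in self.counts:
            self.counts[stream_id] += size     # held: exact count
        elif random.random() < self.p * size:  # byte sampling favors big packets
            self.counts[stream_id] = size      # create an entry, start counting

# Toy trace: one elephant stream among many mice.
random.seed(0)
sh = SampleAndHold(sampling_prob=0.01)
for _ in range(10_000):
    sh.packet("elephant", 1)
for i in range(100):
    sh.packet(f"mouse{i}", 1)
heavy = [s for s, c in sh.counts.items() if c > 1_000]
```

The elephant is almost surely held early, so its count is nearly exact; each mouse is held only with probability roughly p per byte, so stream memory stays small.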
• With several stages (Stage 1, Stage 2, …), each using an independent hash function, a stream is inserted into stream memory only when its counters in all stages reach the threshold

Conservative update
• Gray = all prior packets. Incrementing counters that are already above the stream's minimum counter is redundant: only the minimum gates entry into stream memory.
• Conservative update: raise only the stream's smallest counter(s) to the new minimum. This never causes false negatives, but greatly reduces false positives.

Multistage filter analysis
• Question: find the probability that a small stream (0.1% of traffic) passes a filter with d = 4 stages, b = 1,000 counters per stage, and threshold T = 1%
• Analysis (holds for any stream distribution and packet order):
  – The stream can pass a stage only if the other streams hashing to its bucket carry ≥ 0.9% of the traffic
  – There are at most 100% / 0.9% ≈ 111 such buckets in a stage
  – Probability of passing one stage ≤ 111 / 1,000 = 11.1%
  – Probability of passing all 4 stages ≤ 0.111⁴ ≈ 0.015%
  – The result is tight

Multistage filter analysis results
• Notation: d – filter stages; T – threshold; h = C/T (C is the link capacity); k = b/h (b is the number of buckets per stage); n – number of streams; M – total memory
• The analysis gives bounds, in terms of k and d, on the probability that a small stream passes the filter, on the number of streams passing, and on the resulting relative error

Bounds versus actual filtering
[Figure: average probability of passing the filter for small streams (log scale, 1 down to 0.00001) versus number of stages (1–4), comparing the worst-case bound, the Zipf bound, actual filtering, and conservative update. Actual filtering beats both bounds, and conservative update performs best.]

Comparing to the current solution
• Trace: 2.6 Gbps link, 43,000 streams in 5 seconds
• Multistage filters: 1 Mbit of SRAM (4,096 entries)
• Sampling: p = 1/16, unlimited DRAM
• Average absolute error / average stream size:

  Stream size            Multistage filters   Sampling
  s > 0.1%               0.01%                5.72%
  0.1% ≥ s > 0.01%       0.95%                20.8%
  0.01% ≥ s > 0.001%     39.9%                46.6%

Summary for heavy hitters
• Heavy hitters are important to many measurement processes
• More accurate results than random sampling: error bounds proportional to 1/M instead of 1/√M
• Multistage filters with conservative update outperform the theoretical bounds
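A minimal Python sketch of a multistage filter with conservative update. Parameters are illustrative, and salted MD5 stands in for the independent hardware hash functions:

```python
import hashlib

class MultistageFilter:
    def __init__(self, stages=4, buckets=1000, threshold=100):
        self.d, self.b, self.T = stages, buckets, threshold
        self.counters = [[0] * buckets for _ in range(stages)]
        self.stream_memory = {}       # streams that passed all stages

    def _bucket(self, stage, stream_id):
        # One independent hash per stage (sketch: MD5 salted with the stage).
        h = hashlib.md5(f"{stage}:{stream_id}".encode()).digest()
        return int.from_bytes(h[:4], "big") % self.b

    def packet(self, stream_id, size):
        if stream_id in self.stream_memory:        # identified: count exactly
            self.stream_memory[stream_id] += size
            return
        idx = [(s, self._bucket(s, stream_id)) for s in range(self.d)]
        new_min = min(self.counters[s][i] for s, i in idx) + size
        for s, i in idx:
            # Conservative update: only raise counters below the new minimum.
            if self.counters[s][i] < new_min:
                self.counters[s][i] = new_min
        if new_min >= self.T:                      # passed every stage
            self.stream_memory[stream_id] = size

# One heavy stream among many small ones.
mf = MultistageFilter()
for _ in range(200):
    mf.packet("heavy", 1)
for i in range(500):
    mf.packet(f"small{i}", 1)
```

The heavy stream crosses the threshold in every stage and is then counted exactly; a small stream slips through only if it collides with heavy traffic in all d stages at once.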
• Prototype implemented at 10 Gbps

Building block 2: counting streams
• Core idea
  – Hash streams to a bitmap and count the bits set
  – Sample the bitmap to save memory and scale
  – Multiple scaling factors to cover wide ranges
• Result: counts up to 100 million streams with an average error of 1% using 2 Kbytes of memory

Bitmap counting
• Hash each packet to a bit position based on its flow identifier; estimate the flow count from the number of bits set
  – Does not work if there are too many flows
• Fix 1: increase the bitmap size, but then the bitmap takes too much memory
• Fix 2: store only a sample of the bitmap and extrapolate, but this is too inaccurate if there are few flows
• Fix 3: use multiple bitmaps, each accurate over a different range (e.g. 0–7, 8–15, 16–32 flows), but then every packet must update multiple bitmaps
• Multiresolution bitmap: a single bitmap whose regions cover different ranges (0–32 in the example), accurate over the whole range with one update per packet

Future work

Part 2: Describing traffic with traffic cluster analysis
(Part 1 covered identifying heavy hitters and counting streams at the router; part 2 covers the analysis server's traffic reports for the network operator.)

Finding heavy hitters is not enough

  Rank  Source IP        Traffic      Rank  Destination IP       Traffic
  1     forms.irs.gov    13.4%        1     jeff.dorm.bigU.edu   11.9%
  2     ftp.debian.org   5.78%        2     lisa.dorm.bigU.edu   3.12%
  3     www.cnn.com      3.25%        3     risc.cs.bigU.edu     2.83%

  Rank  Source network   Traffic      Rank  Dest. network        Traffic
  1     att.com          25.4%        1     library.bigU.edu     27.5%
  2     yahoo.com        15.8%        2     cs.bigU.edu          18.1%
  3     badU.edu         12.2%        3     dorm.bigU.edu        17.8%

  Rank  Application      Traffic
  1     web              42.1%
  2     ICMP             12.5%
  3     kazaa            11.5%

• Such reports raise more questions than they answer: what apps are used? Where does the traffic come from? Which network uses web and which one kazaa? Most traffic goes to the dorms…
• Aggregating on individual fields is useful, but:
  – Traffic reports are not at the right granularity
  – They cannot show aggregates over multiple fields
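The direct-bitmap building block above can be sketched as follows (illustrative sizes; the multiresolution scheme layers several such bitmaps at different sampling resolutions). The estimator is the standard linear-counting formula b·ln(b/z), where z is the number of zero bits:

```python
import hashlib
import math

class DirectBitmap:
    """Sketch of bitmap counting: hash each flow to one bit and
    estimate the number of distinct flows from the zero bits."""

    def __init__(self, bits=1024):
        self.b = bits
        self.bitmap = [0] * bits

    def packet(self, flow_id):
        h = int(hashlib.md5(str(flow_id).encode()).hexdigest(), 16)
        self.bitmap[h % self.b] = 1   # duplicate packets hit the same bit

    def estimate(self):
        zeros = self.bitmap.count(0)
        if zeros == 0:
            return float("inf")       # saturated: bitmap too small
        return self.b * math.log(self.b / zeros)

bm = DirectBitmap(bits=1024)
for i in range(300):                  # 300 distinct flows,
    for _ in range(5):                # 5 packets each: repeats are free
        bm.packet(i)
est = bm.estimate()                   # estimate of the distinct-flow count
```

Because collisions are corrected for statistically, the estimate stays close to the true count until the bitmap nears saturation, which is exactly the regime the multiresolution construction is designed to avoid.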
• A traffic analysis tool should automatically find aggregates over the right fields at the right granularity

Ideal traffic report

  Traffic aggregate                                     Traffic
  Web traffic                                           42.1%   (web is the dominant application)
  Web traffic to library.bigU.edu                       26.7%   (the library is a heavy user of the web)
  Web traffic from forms.irs.gov                        13.4%   (that's a big flash crowd!)
  ICMP from sloppynet.badU.edu to jeff.dorm.bigU.edu    11.9%   (this is a Denial of Service attack!)

• Traffic cluster reports try to give insights into the structure of the traffic mix

Definition
• A traffic report gives the size of all traffic clusters above a threshold T and is:
  – Multidimensional: clusters are defined by ranges from a natural hierarchy for each field
  – Compressed: it omits clusters whose traffic is within error T of more specific clusters in the report
  – Prioritized: clusters carry unexpectedness labels

Unidimensional report example (threshold = 100)
[Figure: a prefix hierarchy over the CS Dept network 10.0.0.0/28, total traffic 500. Its children are 10.0.0.0/29 (120) and 10.0.0.8/29 (380, the 2nd floor); below them, 10.0.0.0/30 (50), 10.0.0.4/30 (70), 10.0.0.8/30 (305), and 10.0.0.12/30 (75). The AI Lab is 10.0.0.8/31 (270), whose hosts are 10.0.0.8 (160) and 10.0.0.9 (110); the remaining /31s and hosts (10.0.0.2–10.0.0.5, 10.0.0.10, 10.0.0.14) all carry small amounts.]

Unidimensional report example: compression
• Rule: omit clusters with traffic within error T of more specific clusters in the report
  – 10.0.0.8/29 (380) is kept: 380 − 270 ≥ 100
  – 10.0.0.8/30 (305) is omitted: 305 − 270 < 100
• Compressed report:

  Source IP      Traffic
  10.0.0.0/29    120
  10.0.0.8/29    380
  10.0.0.8       160
  10.0.0.9       110

Multidimensional structure
• Each cluster combines one node from each field's hierarchy, e.g. source network (All traffic → US, EU → CA, NY, FR, RU) crossed with application (All traffic → Mail, Web), yielding clusters such as RU Mail and RU Web

AutoFocus: system structure
• Pipeline: traffic parser → cluster miner → grapher → web-based GUI
• Input: packet header traces / NetFlow data, plus names and user-defined traffic categories
• Traffic reports for weeks, days, three-hour intervals, and half-hour intervals
• Colors mark user-defined traffic categories; separate reports for each category

Analysis of unusual events
• Sapphire/SQL Slammer worm
  – Found the worm's port and protocol automatically
  – Identified the infected hosts

Related work
• Databases [FS+98]: iceberg queries; limited analysis, no conservative update
• Theory [GM98, CCF02]: synopses, sketches; less accurate than multistage filters
• Data mining [AIS93]: association rules; no/limited hierarchy, no compression
• Databases [GCB+97]: data cube; no automatic generation of "interesting" clusters
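The report compression rule can be sketched in Python. Run on the unidimensional example from the slides (subtrees that fall entirely below the threshold are omitted from the input for brevity), it reproduces the four-cluster compressed report:

```python
def compress(traffic, children, root, threshold):
    """Report clusters above `threshold` whose traffic exceeds the
    traffic of their reported descendants by at least `threshold`."""
    report = {}

    def covered(node):
        # Traffic of this subtree already explained by reported clusters.
        total = sum(covered(c) for c in children.get(node, []))
        if traffic[node] >= threshold and traffic[node] - total >= threshold:
            report[node] = traffic[node]
            return traffic[node]
        return total

    covered(root)
    return report

# The prefix hierarchy from the slides (threshold = 100).
traffic = {
    "10.0.0.0/28": 500, "10.0.0.0/29": 120, "10.0.0.8/29": 380,
    "10.0.0.0/30": 50, "10.0.0.4/30": 70, "10.0.0.8/30": 305,
    "10.0.0.12/30": 75, "10.0.0.8/31": 270, "10.0.0.8": 160, "10.0.0.9": 110,
}
children = {
    "10.0.0.0/28": ["10.0.0.0/29", "10.0.0.8/29"],
    "10.0.0.0/29": ["10.0.0.0/30", "10.0.0.4/30"],
    "10.0.0.8/29": ["10.0.0.8/30", "10.0.0.12/30"],
    "10.0.0.8/30": ["10.0.0.8/31"],
    "10.0.0.8/31": ["10.0.0.8", "10.0.0.9"],
}
report = compress(traffic, children, "10.0.0.0/28", 100)
# 10.0.0.8/30 (305) is dropped because 305 - 270 < 100, while
# 10.0.0.8/29 (380) stays because 380 - 270 >= 100.
```

The post-order traversal ensures that when a cluster is considered, the traffic of its more specific reported clusters is already known, which is exactly what the compression rule needs.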