Internet2 Netflow data analysis
Malathi Veeraraghavan & Zhenzhen Yan
University of Virginia

Outline
• Problem statement
• Software architecture
• Solution approach
• Findings

Problem statement
• Can long flows be identified automatically at PE routers and redirected to dynamic circuits across the SDN?
• Implement prototype and demonstrate on DOE ANI
• Prototype: Hybrid Network Engineering Software (HYNES)

Big picture: vision for how HYNES could be used (if it all works)
[Figure: ESnet, Provider Edge (PE) router, dynamic circuit (long flows)]

Outline
• Problem statement
• Software architecture
• Solution approach
• Findings

Hybrid Network Engineering Software (HYNES)
• MFDB: Monitored Flow Data Base
• Some components can be centralized and rest distributed
  – Centralized: OFAT, IDC interface module, user-interface module, initialization module

Components
• Offline Analysis Tool (OFAT): statistical R programs to identify which flows are long
  – Most challenging component
  – Leverage “human knowledge” about large file transfer servers and applications
  – Populate MFDB
• Monitored flow data base (MFDB), fields per entry:
  – Of monitored flow (not all fields are required for each flow): source IP address, destination IP address, protocol, source port, destination port
  – Status: Monitored, Redirected, Disabled
  – Bypass circuit endpoints: IP addresses and VLAN IDs
  – Remote HYNES address: IP address
  – Circuit rate

Components contd.
• Packet header processing module
  – receives packets for flows in MFDB
  – initiates the reservation and provisioning of a circuit
  – initiates circuit release when packet flow “ends”
• IDC interface
  – interfaces with ESnet’s Inter-Domain Controller (IDC)
• User-interface module
  – supports human and programmatic interface to MFDB
• Router control interface module
  – set PBR for MFDB flows to mirror packets to HYNES server
  – set PBR route to redirect packets from default IP-routed path to newly established circuits, and reset when done

Coming back to the key problem
• Can long flows be identified automatically at PE routers and redirected to dynamic circuits across the SDN?

Outline
• Problem statement
• Software architecture
• Solution approach
• Findings

Solution methodology
• I. Netflow data (analysis with R programs)
  – Long flows separated by apps
• II. Network requirements workshop reports (“human knowledge” mining)
  – IP addresses for scientific computing (data transfers) servers
• III. Understanding applications (with tcpdump, talking to developers: SCP, SFTP, GridFTP, BBCP)
• Goal: Identify suitable candidate flows for the MFDB

Track I: Netflow data analysis
• Methodology:
  – Download Netflow data from Internet2
  – Use flow-export tools to get ASCII file
  – Shows 5-tuple, bytes, timestamps of first and last packet in flow
  – Statistical package R programs:
    • Find flow lengths and isolate out flows of length 59 sec from each 5-min file
    • Concatenate flows from all 5-minute files in one day (one week):
      – gaps (1-in-100 sampling): 5-minute gaps acceptable
      – “definition” of “long flow”: >= 10 minutes
  – Output: all flows longer than 10 minutes

Methodology contd.
• Sort long flows by protocol number and only save TCP, GRE, ESP, AH, IP-in-IP flows (removed ICMP and UDP) and print statistics
• Sort on IP protocol field and src and dst ports, and separate out flows for different applications into different files

Next steps
• Check if the n-tuple (n <= 5) used to identify a long flow occurs in a short flow; if it does often, then cannot place this flow descriptor in the MFDB
• Check for repeated occurrences of a flow on different days
• Sensitivity analysis to acceptable gap parameter (now: 5 minutes)
• Look for temporal patterns on flow-arrival times for forecasting

Track II: Mining “human knowledge”
• Methodology:
  – For each report (NP, BES, BER, FES, ASCR)
    • For each project,
      – Determine if users use file transfer applications to move data or just take home the data collected from instruments on DVDs
      – For each participating institution (ESnet sites and major universities), access the high-performance computing web site
      – Look for servers dedicated to data transfers
      – Identify applications run (scp, sftp, bbcp, GridFTP, RFT, etc.)
      – Use ping to find IP addresses
      – Use arin.net to find IP address space allocation for participating universities

Track III: Understanding apps
• SCP
  – learned about HPN-SSH patch
  – receive buffer resizing
  – can disable payload encryption
• GridFTP
  – Data flows: port numbers in the 50000-51000 range
  – obtained a tcpdump of a GridFTP session
  – obtained a globus-url-copy (GridFTP client) with debug enabled output

Outline
• Problem statement
• Software architecture
• Solution approach
• Findings

Track I: Netflow analysis
• CHIC and LOSA routers of Internet2
• One-day data analysis
• 5-day (Mon-Fri) analysis

Unidata one-day (HYNES)
srcIP          dstIP     srcPort  dstPort  protocol  firstunix   lastunix    flowlength
128.117.136.0  35.8.8.0  388      38650    6         1246923494  1246923553  59.00099993
128.117.136.0  35.8.8.0  388      38650    6         1246923734  1246923794  59.53600001
128.117.136.0  35.8.8.0  388      38650    6         1246923794  1246923854  59.56599998
128.117.136.0  35.8.8.0  388      38650    6         1246923854  1246923914  59.69000006
128.117.136.0  35.8.8.0  388      38650    6         1246923914  1246923973  59.398
128.117.136.0  35.8.8.0  388      38650    6         1246923974  1246924034  59.648
128.117.136.0  35.8.8.0  388      38650    6         1246924154  1246924214  59.80200005
128.117.136.0  35.8.8.0  388      38650    6         1246924214  1246924274  59.68599987
128.117.136.0  35.8.8.0  388      38650    6         1246924274  1246924334  59.61099982
128.117.136.0  35.8.8.0  388      38650    6         1246924334  1246924394  59.45600009
128.117.136.0  35.8.8.0  388      38650    6         1246924394  1246924454  59.48799992
128.117.136.0  35.8.8.0  388      38650    6         1246924455  1246924514  59.08899999
128.117.136.0  35.8.8.0  388      38650    6         1246924514  1246924573  59.00699997
128.117.136.0  35.8.8.0  388      38650    6         1246924574  1246924633  59.36100006
• 14 minutes
• Between NCAR and Michigan State University

Top ten fat flows in one day
bytes      srcIP          dstIP          srcport  dstport  protocol  firstunix   lastunix    flowlength
542582720  198.108.24.0   204.228.64.0   0        0        50        1246879655  1246891060  11405.383
210132404  131.225.192.0  129.114.48.0   45677    22       6         1246879655  1246890520  10865.413
186519604  128.135.64.0   131.142.152.0  22       58942    6         1246891541  1246893460  1919.008
165567747  198.32.8.0     198.32.8.0     0        0        47        1246874013  1246874133  119.8869998
146416660  208.100.88.0   141.142.24.0   43094    22       6         1246861049  1246868372  7322.738
127799228  208.100.88.0   141.142.24.0   43094    22       6         1246882716  1246890100  7384.865
113470332  128.117.136.0  128.255.24.0   388      42707    6         1246861049  1246879356  18306.468
106577448  198.108.24.0   204.228.64.0   0        0        50        1246921790  1246923291  1500.479
101287624  131.225.192.0  129.114.48.0   45677    22       6         1246873833  1246878456  4622.681
91662040   152.46.0.0     128.255.56.0   873      1934     6         1246861049  1246869092  8042.826
• encapsulated: 3; ssh: 5; Unidata: 1; rsync: 1
• Two long ssh flows go to the Texas Advanced Computing Center at the University of Texas at Austin (129.114.48.0) from Fermilab (131.225.192.0).
• 141.142.24.0 corresponds to NCSA (National Center for Supercomputing Applications) for two other ssh flows.
• The Unidata LDM flow is from NCAR (National Center for Atmospheric Research) with address 128.117.136.0.

Per-day data for a five-weekday period (July 6-10, 2009)
Router  Metric   July 06                    July 07                    July 08                   July 09                   July 10
CHIC    Longest  18306 (113470332 bytes)    25326 (112768756 bytes)    21307 (80492044 bytes)    15364 (30196708 bytes)    23825 (1544080 bytes)
CHIC    Fattest  542582720 (11405 seconds)  867174480 (10443 seconds)  185912032 (4562 seconds)  241363080 (3360 seconds)  310908448 (779 seconds)
LOSA    Longest  26882 (58425675 bytes)     23220 (216722816 bytes)    18004 (26475856 bytes)    19986 (40811386 bytes)    15964 (142172096 bytes)
LOSA    Fattest  187357504 (20402 seconds)  216722816 (23220 seconds)  349049492 (5223 seconds)  385387604 (7201 seconds)  164962944 (2041 seconds)
• Fattest data: remember it is 1-in-100 sampled data
• 26882 sec = 7.5 hours

Data for a five-weekday period (July 6-10, 2009)
Router                               CHIC                     LOSA
Number of flows                      841062272                268933244
Number (%) of long flows             35632 (0.00424%)         32660 (0.012%)
Total number of bytes                3.11946E+12              1.17578E+12
Number and % of bytes in long flows  1.111436E+11 (3.563%)    101783460037 (8.66%)
Fattest flow (bytes)                 867174480 (10443 seconds)  1779002612 (7381 seconds)
Longest flow (seconds)               25326 (112768756 bytes)  26882 (58425675 bytes)
• Number of long flows of different types
  – CHIC: 33241 (TCP), 0 (IP), 260 (GRE), 211 (ESP), 0 (AH)
  – LOSA: 26618 (TCP), 67 (IP), 206 (GRE), 320 (ESP), 7 (AH)
• Number of long flows of different applications (without counting long ACK flows)
  – CHIC: 447 (20: ftp), 3023 (22: ssh), 4 (25: smtp), 3078 (80: http), 63 (119: nntp), 1 (143: imap), 1690 (388: unidata), 142 (443: https), 25 (554: rtsp), 560 (873: rsync), 2545 (unassigned), 1134 (dynamic-and-private), SUM = 12712
  – LOSA: 1202 (20: ftp), 3109 (22: ssh), 8 (25: smtp), 3528 (80: http), 1 (119: nntp), 0 (143: imap), 426 (388: unidata), 307 (443: https), 15 (554: rtsp), 570 (873: rsync), 1068 (unassigned), 574 (dynamic-and-private), SUM = 10808

Repeat customers: ssh long flows
Number of ssh long flows that occurred on multiple days in the one five-day period (candidates for MFDB)
Router  2 days             3 days          4 days      5 days
LOSA    59+60+55+55 = 229  38+42+37 = 117  32+30 = 62  28
CHIC    62+67+54+53 = 236  41+32+33 = 106  20+23 = 43  17

A GridFTP Example (from CHIC)
dpkts  doctets  srcaddr       dstaddr        srcport  dstport  prot  firstunix   lastunix    flowlength
1344   70852    129.93.232.0  130.246.176.0  44560    50002    6     1247027444  1247028164  719.635
792    41436    129.93.232.0  130.246.176.0  53225    50001    6     1247027444  1247028343  899.1270001
995    51944    129.93.232.0  130.246.176.0  54831    50011    6     1247027444  1247028164  719.6719999
1172   61700    129.93.232.0  130.246.176.0  54870    50013    6     1247027444  1247028164  719.608
1251   65700    129.93.232.0  130.246.176.0  44560    50003    6     1247027444  1247028164  719.4919999
1270   66616    129.93.232.0  130.246.176.0  35846    50000    6     1247027444  1247028164  719.6600001
863    45300    129.93.232.0  130.246.176.0  54870    50012    6     1247027444  1247028163  718.99
• Between University of Nebraska-Lincoln and Rutherford Appleton Laboratory
• Appears to be a parallel transfer (though cannot be sure that it is not striped because of anonymized addresses)

Correlation between SSH flows and ESP flows
• Looked for correlation between long SSH flows and ESP flows (IP protocol number = 50)
  – Found none
  – Hypothesis: If HPN-SSH is used, then SSH flows (port 22) will likely be short and port 22 long flows could be long scp transfers
  – If regular ssh is used, the long flow from an scp transfer will be an ESP flow, but the SSH flow will be short

Findings from Track II
• Teragrid servers: 15 sites (server names and IP addresses found)
• ESG data grid servers found
• So far: NP and BES reports studied
• Number of servers found so far: 51 (BES: single) + 15 (BES: ranges) + 39 (NP)
• Some IP address ranges used for participating institutions

“Match” rate between Track I and Track II
Report  CHIC            LOSA
NP      10060 (29.84%)  1128 (4.14%)
BES     3706 (9.12%)    1254 (4.6%)
• Percent of flows for which the src or dst IP address matches one of the server addresses found from the Track II study of science projects
• Number of long flows in CHIC is 33717
• Number of long flows in LOSA is 27279

Summary
• Ready to provide ESnet the R programs and shell scripts
• Track II: FES and BES report analysis
• Track III application pattern recognition (scp, sftp, bbcp, GridFTP)
• Run Internet2 Netflow analysis for more days and at all routers

Thoughts for ESnet Netflow data analysis
• qsub problem: large jobs submitted in batch mode on front-end. Actual servers involved are not advertised on science project web sites.
  – Ask sites for GridFTP/scp/bbcp/sftp server addresses (cluster nodes)
• UVA provides ESnet R programs to just extract these long flows, and ESnet runs them, and only provides UVA with long flow data for matched server addresses

Contd.
• Anonymizing problem: will make MFDB candidacy determination difficult. Need to know that the flow concatenation process was not accidentally merging two different flows, e.g. the GridFTP problem
• Need to estimate the returns on investment: percent of bytes that are candidates for handling on SDN
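
The flow concatenation step of Track I, and the concern about accidentally merging two different flows, can be illustrated with a minimal Python sketch. This is not the R programs used for the analysis: the function names (`concatenate_flows`, `long_flows`) and the record layout are assumptions made for illustration. Only the 5-minute acceptable gap, the 10-minute long-flow threshold, and the sample timestamps (the 14 Unidata one-day records) are taken from the slides.

```python
from collections import defaultdict

MAX_GAP_SECONDS = 5 * 60     # acceptable gap between records (value from the slides)
LONG_FLOW_SECONDS = 10 * 60  # "long flow" threshold (value from the slides)


def concatenate_flows(records, max_gap=MAX_GAP_SECONDS):
    """Merge records sharing a 5-tuple when the gap between them is <= max_gap.

    records: iterable of (five_tuple, first_unix, last_unix); hypothetical layout.
    Returns a list of merged (five_tuple, first_unix, last_unix) flows.
    """
    spans_by_flow = defaultdict(list)
    for five_tuple, first, last in records:
        spans_by_flow[five_tuple].append((first, last))

    merged = []
    for five_tuple, spans in spans_by_flow.items():
        spans.sort()  # order records by their first-packet timestamp
        cur_first, cur_last = spans[0]
        for first, last in spans[1:]:
            if first - cur_last <= max_gap:
                cur_last = max(cur_last, last)  # extend the current flow
            else:
                merged.append((five_tuple, cur_first, cur_last))
                cur_first, cur_last = first, last  # gap too large: new flow
        merged.append((five_tuple, cur_first, cur_last))
    return merged


def long_flows(flows, min_duration=LONG_FLOW_SECONDS):
    """Keep only merged flows lasting at least min_duration seconds."""
    return [f for f in flows if f[2] - f[1] >= min_duration]


# Sample input: the 14 Unidata records (NCAR to Michigan State) from the slides.
UNIDATA_KEY = ("128.117.136.0", "35.8.8.0", 388, 38650, 6)
UNIDATA_SPANS = [
    (1246923494, 1246923553), (1246923734, 1246923794),
    (1246923794, 1246923854), (1246923854, 1246923914),
    (1246923914, 1246923973), (1246923974, 1246924034),
    (1246924154, 1246924214), (1246924214, 1246924274),
    (1246924274, 1246924334), (1246924334, 1246924394),
    (1246924394, 1246924454), (1246924455, 1246924514),
    (1246924514, 1246924573), (1246924574, 1246924633),
]
UNIDATA_RECORDS = [(UNIDATA_KEY, first, last) for first, last in UNIDATA_SPANS]

if __name__ == "__main__":
    for gap in (MAX_GAP_SECONDS, 60):
        flows = concatenate_flows(UNIDATA_RECORDS, max_gap=gap)
        durations = [last - first for _, first, last in flows]
        print(gap, durations, len(long_flows(flows)))
```

With the 5-minute gap, the 14 records merge into one flow spanning 1139 seconds from first to last packet (gaps included), which qualifies as long. Tightening the gap to 60 seconds splits the same records into three flows of 59, 300 and 479 seconds, none of them long, which shows how sensitive the result is to the gap parameter. Because the key is the full 5-tuple, two distinct transfers that reuse the same anonymized addresses and ports within the gap would still be merged.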