Internet2 Netflow Data Analysis - Electrical and Computer Engineering

Internet2 Netflow data analysis
Malathi Veeraraghavan & Zhenzhen Yan
University of Virginia
1
Outline
• Problem statement
• Software architecture
• Solution approach
• Findings
2
Problem statement
• Can long flows be identified automatically at PE routers and redirected to dynamic circuits across the SDN?
• Implement prototype and demonstrate on DOE ANI Prototype
– Hybrid Network Engineering Software (HYNES)
3
Big picture: vision for how HYNES could be used (if it all works)
[Figure: ESnet Provider Edge (PE) router; dynamic circuit (long flows)]
4
Outline
• Problem statement
• Software architecture
• Solution approach
• Findings
5
Hybrid Network Engineering Software (HYNES)
• MFDB: Monitored Flow Data Base
• Some components can be centralized and the rest distributed
– Centralized: OFAT, IDC interface module, user-interface module, initialization module
6
Components
• Offline Analysis Tool (OFAT): statistical R programs to identify which flows are long
– Most challenging component
– Leverage “human knowledge” about large file transfer servers and applications
– Populate MFDB
• Monitored flow data base (MFDB): one record per monitored flow, with fields
– Flow identifier: source IP address, destination IP address, protocol, source port, destination port (not all fields are required for each flow)
– Status: Monitored, Redirected, or Disabled
– Bypass circuit endpoints: IP addresses and VLAN IDs
– Remote HYNES address: IP address
– Circuit rate
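A minimal sketch of how one MFDB record could be represented, assuming unspecified identifier fields act as wildcards. The class names, field names, and the Mbps unit for the circuit rate are hypothetical and are not taken from the HYNES design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowDescriptor:
    """n-tuple (n <= 5) identifying a monitored flow; None means 'not specified'."""
    src_ip: Optional[str] = None
    dst_ip: Optional[str] = None
    protocol: Optional[int] = None
    src_port: Optional[int] = None
    dst_port: Optional[int] = None

    def matches(self, src_ip, dst_ip, protocol, src_port, dst_port) -> bool:
        """True if every specified field equals the packet's value."""
        wanted = (self.src_ip, self.dst_ip, self.protocol,
                  self.src_port, self.dst_port)
        packet = (src_ip, dst_ip, protocol, src_port, dst_port)
        return all(w is None or w == p for w, p in zip(wanted, packet))


@dataclass
class MfdbEntry:
    """One MFDB record (illustrative layout only)."""
    descriptor: FlowDescriptor
    status: str = "Monitored"                 # Monitored, Redirected, or Disabled
    circuit_endpoints: tuple = ()             # IP addresses and VLAN IDs
    remote_hynes_ip: Optional[str] = None
    circuit_rate_mbps: Optional[float] = None  # unit is an assumption


# Example: monitor TCP flows sourced from port 388 at one address.
entry = MfdbEntry(FlowDescriptor(src_ip="128.117.136.0", protocol=6, src_port=388))
print(entry.descriptor.matches("128.117.136.0", "35.8.8.0", 6, 388, 38650))
```

Leaving fields as `None` mirrors the note that not all fields are required for each flow.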
7
Components contd.
• Packet header processing module
– receives packets for flows in MFDB
– initiates the reservation and provisioning of a circuit
– initiates circuit release when packet flow “ends”
• IDC interface
– interfaces with ESnet’s Inter-Domain Controller (IDC)
• User-interface module
– supports human and programmatic interface to MFDB
• Router control interface module
– set PBR for MFDB flows to mirror packets to HYNES server
– set PBR route to redirect packets from default IP-routed path to newly established circuits, and reset when done
8
Coming back to the key problem
• Can long flows be identified automatically at PE routers and redirected to dynamic circuits across the SDN?
9
Outline
• Problem statement
• Software architecture
• Solution approach
• Findings
10
Solution methodology
• I. Netflow data (analysis with R programs)
– Long flows separated by apps
• II. Network requirements workshop reports (“human knowledge” mining)
– IP addresses for scientific computing (data transfers) servers
• III. Understanding applications (with tcpdump, talking to developers: SCP, SFTP, GridFTP, BBCP)
• Goal: Identify suitable candidate flows for the MFDB
11
Track I: Netflow data analysis
• Methodology:
– Download Netflow data from Internet2
– Use flow-export tools to get ASCII file
– Shows 5-tuple, bytes, timestamps of first and last packet in flow
– Statistical package R programs:
• Compute flow lengths and isolate flows of length 59 sec from each 5-minute file
• Concatenate flows from all 5-minute files in one day (one week):
– gaps (1-in-100 sampling): 5-minute gaps acceptable
– “definition” of “long flow”: >= 10 minutes
– Output: all flows longer than 10 minutes
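A minimal sketch of the concatenation step, written in Python rather than the R programs the slides refer to. It assumes each exported record has already been reduced to a (5-tuple, firstunix, lastunix, bytes) tuple and that the 59-second pre-filter was applied upstream. The function names are hypothetical, and taking the length of a concatenated flow as last minus first timestamp (gaps included) is an assumption.

```python
from collections import defaultdict

MAX_GAP_SEC = 5 * 60      # acceptable gap between pieces of the same flow
LONG_FLOW_SEC = 10 * 60   # working definition of a "long flow"


def concatenate_flows(records, max_gap=MAX_GAP_SEC):
    """Merge records sharing a 5-tuple when the gap between them is small.

    records: iterable of (five_tuple, first_unix, last_unix, nbytes).
    Returns a list of merged (five_tuple, first_unix, last_unix, nbytes).
    """
    by_key = defaultdict(list)
    for key, first, last, nbytes in records:
        by_key[key].append((first, last, nbytes))

    merged = []
    for key, pieces in by_key.items():
        pieces.sort()  # order by first-packet timestamp
        cur_first, cur_last, cur_bytes = pieces[0]
        for first, last, nbytes in pieces[1:]:
            if first - cur_last <= max_gap:
                # Same flow continues: extend the end time, add the bytes.
                cur_last = max(cur_last, last)
                cur_bytes += nbytes
            else:
                # Gap too large: close the current flow and start a new one.
                merged.append((key, cur_first, cur_last, cur_bytes))
                cur_first, cur_last, cur_bytes = first, last, nbytes
        merged.append((key, cur_first, cur_last, cur_bytes))
    return merged


def long_flows(records, min_length=LONG_FLOW_SEC, max_gap=MAX_GAP_SEC):
    """Keep only concatenated flows lasting at least min_length seconds."""
    return [f for f in concatenate_flows(records, max_gap)
            if f[2] - f[1] >= min_length]
```

Keeping `max_gap` as a parameter makes it easy to rerun the extraction with other gap values.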
12
Methodology contd.
• Sort long flows by protocol number, save only TCP, GRE, ESP, AH, and IP-in-IP flows (ICMP and UDP removed), and print statistics
• Sort on IP protocol field and src and dst ports, and separate out flows for different applications into different files
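The filtering and separation steps can be sketched as follows. The protocol numbers are the standard IANA values, and the port labels are the ones that appear in the application counts reported later in these slides. The function names and the rule used to split "unassigned" from "dynamic-and-private" (both ports at or above 49152) are assumptions, not the authors' method.

```python
from collections import defaultdict

# IANA IP protocol numbers for the protocols that are kept.
KEPT_PROTOCOLS = {4: "IP-in-IP", 6: "tcp", 47: "GRE", 50: "ESP", 51: "AH"}

# Port labels as they appear in the per-application counts in the slides.
PORT_LABELS = {20: "ftp", 22: "ssh", 25: "smtp", 80: "http", 119: "nntp",
               143: "imap", 388: "unidata", 443: "https", 554: "rtsp",
               873: "rsync"}

DYNAMIC_PORT_START = 49152  # start of the IANA dynamic/private port range


def classify(protocol, src_port, dst_port):
    """Return an application label, or None if the flow is dropped."""
    if protocol not in KEPT_PROTOCOLS:
        return None  # e.g. ICMP (1) and UDP (17) are removed
    if protocol != 6:
        return KEPT_PROTOCOLS[protocol]
    for port in (src_port, dst_port):
        if port in PORT_LABELS:
            return PORT_LABELS[port]
    # Assumed rule: both ports in the dynamic range means dynamic-and-private.
    if min(src_port, dst_port) >= DYNAMIC_PORT_START:
        return "dynamic-and-private"
    return "unassigned"


def group_by_application(flows):
    """flows: iterable of (src_ip, dst_ip, protocol, src_port, dst_port)."""
    groups = defaultdict(list)
    for flow in flows:
        label = classify(flow[2], flow[3], flow[4])
        if label is not None:
            groups[label].append(flow)
    return dict(groups)
```

Each key of the returned dictionary corresponds to one per-application output file.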
13
Next steps
• Check if the n-tuple (n <= 5) used to identify a long flow occurs in a short flow; if it does often, then cannot place this flow descriptor in the MFDB
• Check for repeated occurrences of a flow on different days
• Sensitivity analysis to acceptable gap parameter (now: 5 minutes)
• Look for temporal patterns on flow-arrival times for forecasting
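The first check could be sketched as below: for each candidate n-tuple, measure how often it also shows up in short flows. The function names and the choice of which 5-tuple positions form the descriptor are hypothetical; what fraction counts as "often" is left to the caller.

```python
def descriptor_of(five_tuple, fields):
    """Project a 5-tuple onto the chosen field positions (0..4)."""
    return tuple(five_tuple[i] for i in fields)


def short_flow_fraction(flows, fields, long_threshold=600):
    """Fraction of flows per descriptor that are short.

    flows: iterable of (five_tuple, length_sec).
    Only descriptors seen in at least one long flow are returned,
    since those are the candidates for the MFDB.
    """
    counts = {}  # descriptor -> (short_count, long_count)
    for five_tuple, length in flows:
        d = descriptor_of(five_tuple, fields)
        short, long_ = counts.get(d, (0, 0))
        if length < long_threshold:
            counts[d] = (short + 1, long_)
        else:
            counts[d] = (short, long_ + 1)
    return {d: s / (s + l) for d, (s, l) in counts.items() if l > 0}


# Example: descriptor = (src IP, dst IP, protocol, dst port).
flows = [
    (("a", "b", 6, 1000, 22), 700),
    (("a", "b", 6, 1001, 22), 30),
    (("a", "b", 6, 1002, 22), 20),
    (("c", "d", 6, 5, 80), 10),
]
print(short_flow_fraction(flows, fields=(0, 1, 2, 4)))
```

A descriptor whose short-flow fraction is high would be a poor MFDB entry, because most packets it matches would not belong to long flows.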
14
Track II: Mining “human knowledge”
• Methodology:
– For each report (NP, BES, BER, FES, ASCR)
• For each project,
– Determine if users use file transfer applications to move data or just take home the data collected from instruments on DVDs
– For each participating institution (ESnet sites and major universities), access the high-performance computing web site
– Look for servers dedicated to data transfers
– Identify applications run (scp, sftp, bbcp, GridFTP, RFT, etc.)
– Use ping to find IP addresses
– Use arin.net to find IP address space allocation for participating universities
15
Track III: Understanding apps
• SCP – learned about HPN-SSH patch
– receive buffer resizing
– can disable payload encryption
• GridFTP
– Data flows: port numbers in the 50000-51000 range
– obtained a tcpdump of a GridFTP session
– obtained globus-url-copy (GridFTP client) output with debug enabled
Outline
• Problem statement
• Software architecture
• Solution approach
• Findings
17
Track I: Netflow analysis
• CHIC and LOSA routers of Internet2
• One-day data analysis
• 5-day (Mon-Fri) analysis
18
Unidata one-day (HYNES)
srcIP          dstIP     srcPort  dstPort  protocol  firstunix   lastunix    flowlength
128.117.136.0  35.8.8.0  388      38650    6         1246923494  1246923553  59.00099993
128.117.136.0  35.8.8.0  388      38650    6         1246923734  1246923794  59.53600001
128.117.136.0  35.8.8.0  388      38650    6         1246923794  1246923854  59.56599998
128.117.136.0  35.8.8.0  388      38650    6         1246923854  1246923914  59.69000006
128.117.136.0  35.8.8.0  388      38650    6         1246923914  1246923973  59.398
128.117.136.0  35.8.8.0  388      38650    6         1246923974  1246924034  59.648
128.117.136.0  35.8.8.0  388      38650    6         1246924154  1246924214  59.80200005
128.117.136.0  35.8.8.0  388      38650    6         1246924214  1246924274  59.68599987
128.117.136.0  35.8.8.0  388      38650    6         1246924274  1246924334  59.61099982
128.117.136.0  35.8.8.0  388      38650    6         1246924334  1246924394  59.45600009
128.117.136.0  35.8.8.0  388      38650    6         1246924394  1246924454  59.48799992
128.117.136.0  35.8.8.0  388      38650    6         1246924455  1246924514  59.08899999
128.117.136.0  35.8.8.0  388      38650    6         1246924514  1246924573  59.00699997
128.117.136.0  35.8.8.0  388      38650    6         1246924574  1246924633  59.36100006
14 minutes
Between NCAR and Michigan State University
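As a worked check on the Unidata example, the sketch below adds up the per-record durations and compares them with the first-to-last span. The timestamps are copied from the table; reading the "14 minutes" annotation as the summed active time of the 14 records is an interpretation, not something the slides state.

```python
# (firstunix, lastunix) pairs copied from the Unidata table above.
records = [
    (1246923494, 1246923553), (1246923734, 1246923794),
    (1246923794, 1246923854), (1246923854, 1246923914),
    (1246923914, 1246923973), (1246923974, 1246924034),
    (1246924154, 1246924214), (1246924214, 1246924274),
    (1246924274, 1246924334), (1246924334, 1246924394),
    (1246924394, 1246924454), (1246924455, 1246924514),
    (1246924514, 1246924573), (1246924574, 1246924633),
]

# Time covered by the records themselves (whole seconds).
active_sec = sum(last - first for first, last in records)
# Time from the first packet of the first record to the last packet of the last.
span_sec = records[-1][1] - records[0][0]
# Idle time between each record and the next one.
gaps = [nxt[0] - cur[1] for cur, nxt in zip(records, records[1:])]

print(f"records: {len(records)}")
print(f"active time: {active_sec} s ({active_sec / 60:.1f} min)")
print(f"first-to-last span: {span_sec} s ({span_sec / 60:.1f} min)")
print(f"largest gap: {max(gaps)} s")
```

The largest gap between consecutive records is well under the 5-minute limit, so all 14 records concatenate into a single flow.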
19
Top ten fat flows in one day
bytes      srcIP          dstIP          srcport  dstport  protocol  firstunix   lastunix    flow length
542582720  198.108.24.0   204.228.64.0   0        0        50        1246879655  1246891060  11405.383
210132404  131.225.192.0  129.114.48.0   45677    22       6         1246879655  1246890520  10865.413
186519604  128.135.64.0   131.142.152.0  22       58942    6         1246891541  1246893460  1919.008
165567747  198.32.8.0     198.32.8.0     0        0        47        1246874013  1246874133  119.8869998
146416660  208.100.88.0   141.142.24.0   43094    22       6         1246861049  1246868372  7322.738
127799228  208.100.88.0   141.142.24.0   43094    22       6         1246882716  1246890100  7384.865
113470332  128.117.136.0  128.255.24.0   388      42707    6         1246861049  1246879356  18306.468
106577448  198.108.24.0   204.228.64.0   0        0        50        1246921790  1246923291  1500.479
101287624  131.225.192.0  129.114.48.0   45677    22       6         1246873833  1246878456  4622.681
91662040   152.46.0.0     128.255.56.0   873      1934     6         1246861049  1246869092  8042.826
• encapsulated: 3; ssh: 5; Unidata: 1; rsync: 1
• Two long ssh flows go from Fermilab (131.225.192.0) to the Texas Advanced Computing Center at the University of Texas at Austin (129.114.48.0).
• 141.142.24.0 corresponds to NCSA (National Center for Supercomputing Applications), the destination of two other ssh flows.
• The Unidata LDM flow is from NCAR (National Center for Atmospheric Research), with address 128.117.136.0.
20
Per-day data for a five-weekday period (July 6-10, 2009)

CHIC
date     longest: seconds (bytes)  fattest: bytes (seconds)
July 06  18306 (113470332)         542582720 (11405)
July 07  25326 (112768756)         867174480 (10443)
July 08  21307 (80492044)          185912032 (4562)
July 09  15364 (30196708)          241363080 (3360)
July 10  23825 (1544080)           310908448 (779)

LOSA
date     longest: seconds (bytes)  fattest: bytes (seconds)
July 06  26882 (58425675)          187357504 (20402)
July 07  23220 (216722816)         216722816 (23220)
July 08  18004 (26475856)          349049492 (5223)
July 09  19986 (40811386)          385387604 (7201)
July 10  15964 (142172096)         164962944 (2041)

• Fattest data: remember it is 1-in-100 sampled data
• 26882 sec = 7.5 hours
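Because the byte counts come from 1-in-100 sampled data, a rough estimate of the actual volume and average rate of a flow can be obtained by scaling. The sketch below assumes the sampled bytes scale linearly with the sampling ratio, which is only an approximation. The function name is hypothetical.

```python
SAMPLING_RATIO = 100  # 1-in-100 packet sampling


def estimated_rate_mbps(sampled_bytes, duration_sec, sampling=SAMPLING_RATIO):
    """Rough average rate in Mbps, scaling sampled bytes by the sampling ratio."""
    estimated_bytes = sampled_bytes * sampling
    return estimated_bytes * 8 / duration_sec / 1e6


# CHIC fattest flow on July 06: 542582720 sampled bytes over 11405 seconds.
rate = estimated_rate_mbps(542582720, 11405)
print(f"estimated average rate: {rate:.1f} Mbps")

# Longest LOSA flow on July 06, converted from seconds to hours.
print(f"26882 s = {26882 / 3600:.1f} hours")
```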
21
Data for a five-weekday period (July 6-10, 2009)
• Number of flows
– CHIC: 841062272
– LOSA: 268933244
• Number (%) of long flows
– CHIC: 35632 (0.00424%)
– LOSA: 32660 (0.012%)
• Total number of bytes
– CHIC: 3.11946E+12
– LOSA: 1.17578E+12
• Number and % of bytes in long flows
– CHIC: 1.111436E+11 (3.563%)
– LOSA: 101783460037 (8.66%)
• Number of long flows of different types
– CHIC: 33241 (TCP), 0 (IP), 260 (GRE), 211 (ESP), 0 (AH)
– LOSA: 26618 (TCP), 67 (IP), 206 (GRE), 320 (ESP), 7 (AH)
• Number of long flows of different applications (without counting long ACK flows)
– CHIC: 447 (20: ftp), 3023 (22: ssh), 4 (25: smtp), 3078 (80: http), 63 (119: nntp), 1 (143: imap), 1690 (388: unidata), 142 (443: https), 25 (554: rtsp), 560 (873: rsync), 2545 (unassigned), 1134 (dynamic-and-private), SUM = 12712
– LOSA: 1202 (20: ftp), 3109 (22: ssh), 8 (25: smtp), 3528 (80: http), 1 (119: nntp), 0 (143: imap), 426 (388: unidata), 307 (443: https), 15 (554: rtsp), 570 (873: rsync), 1068 (unassigned), 574 (dynamic-and-private), SUM = 10808
• Fattest flow (bytes)
– CHIC: 867174480 (10443 seconds)
– LOSA: 1779002612 (7381 seconds)
• Longest flow (seconds)
– CHIC: 25326 (112768756 bytes)
– LOSA: 26882 (58425675 bytes)
22
Repeat customers: ssh long flows
• Number of ssh long flows that occurred on multiple days in that one 5-day period (candidates for MFDB)

Router  2 days             3 days          4 days      5 days
LOSA    59+60+55+55 = 229  38+42+37 = 117  32+30 = 62  28
CHIC    62+67+54+53 = 236  41+32+33 = 106  20+23 = 43  17
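One way such repeat counts could be produced is sketched below. The number of terms in each sum (4, 3, 2, 1) is consistent with counting windows of consecutive days within the five-day period, although the slides do not say so; the code adopts that reading as an assumption. The function name is hypothetical, and what identifies "the same" flow across days (for example, source address, destination address, and port 22, leaving out the ephemeral port) is left to the caller.

```python
def window_counts(flows_by_day, days):
    """Count flow identifiers present on every day of each consecutive window.

    flows_by_day: dict mapping day label -> iterable of flow identifiers.
    days: day labels in chronological order.
    Returns {window_size: [count for each window, in order]}.
    """
    sets = [set(flows_by_day[d]) for d in days]
    counts = {}
    for size in range(2, len(days) + 1):
        counts[size] = [
            len(set.intersection(*sets[start:start + size]))
            for start in range(len(days) - size + 1)
        ]
    return counts


# Toy example with three days and flow identifiers "a", "b", "c".
example = {"Mon": {"a", "b"}, "Tue": {"a", "b", "c"}, "Wed": {"a", "c"}}
print(window_counts(example, ["Mon", "Tue", "Wed"]))
```

Summing each list gives a per-window-size total of the kind shown in the table.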
23
A GridFTP Example (from CHIC)
dpkts  doctets  srcaddr       dstaddr        srcport  dstport  prot  firstunix   lastunix    flowlength
1344   70852    129.93.232.0  130.246.176.0  44560    50002    6     1247027444  1247028164  719.635
792    41436    129.93.232.0  130.246.176.0  53225    50001    6     1247027444  1247028343  899.1270001
995    51944    129.93.232.0  130.246.176.0  54831    50011    6     1247027444  1247028164  719.6719999
1172   61700    129.93.232.0  130.246.176.0  54870    50013    6     1247027444  1247028164  719.608
1251   65700    129.93.232.0  130.246.176.0  44560    50003    6     1247027444  1247028164  719.4919999
1270   66616    129.93.232.0  130.246.176.0  35846    50000    6     1247027444  1247028164  719.6600001
863    45300    129.93.232.0  130.246.176.0  54870    50012    6     1247027444  1247028163  718.99
• Between University of Nebraska-Lincoln and Rutherford Appleton Laboratory
• Appears to be a parallel transfer (though cannot be sure that it is not striped because of anonymized addresses)
24
Correlation between SSH flows and ESP flows
• Looked for correlation between long SSH flows and ESP flows (IP protocol number = 50)
– Found none
– Hypothesis: If HPN-SSH is used, then SSH flows (port 22) will likely be short and port 22 long flows could be long scp transfers
– If regular ssh is used, the long flow from an scp transfer will be an ESP flow, but the SSH flow will be short
25
Findings from Track II
• Teragrid servers: 15 sites (server names and IP addresses found)
• ESG data grid servers found
• So far: NP and BES reports studied
• Number of servers found so far: 51 (BES: single) + 15 (BES: ranges) + 39 (NP)
• Some IP address ranges used for participating institutions
26
“Match” rate between Track I and Track II

Report  CHIC            LOSA
NP      10060 (29.84%)  1128 (4.14%)
BES     3706 (9.12%)    1254 (4.6%)

• Percent of flows for which the src or dst IP address matches one of the server addresses found from the Track II study of science projects
• Number of long flows in CHIC is 33717
• Number of long flows in LOSA is 27279
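A minimal sketch of how a match rate of this kind could be computed with Python's standard `ipaddress` module. The flow addresses in the tables all end in .0 with a third octet that is a multiple of 8, which is consistent with the low 11 bits being zeroed; the /21 prefix below is an assumption made on that basis. The function names are hypothetical, and server ranges are assumed to be given in CIDR notation.

```python
import ipaddress

ANON_PREFIX_LEN = 21  # assumption: anonymization zeroes the low 11 bits


def anonymize(addr, prefix_len=ANON_PREFIX_LEN):
    """Zero the host bits of addr so it is comparable to anonymized flow data."""
    net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
    return str(net.network_address)


def match_rate(flows, server_addrs, server_ranges=(), prefix_len=ANON_PREFIX_LEN):
    """Count flows whose src or dst matches a known server address or range.

    flows: list of (src_ip, dst_ip) taken from the long-flow records.
    Returns (number matched, percent matched).
    """
    if not flows:
        return 0, 0.0
    masked = {anonymize(a, prefix_len) for a in server_addrs}
    ranges = [ipaddress.ip_network(r) for r in server_ranges]

    def hit(addr):
        if anonymize(addr, prefix_len) in masked:
            return True
        # An anonymized address stands for a whole block; test for overlap.
        block = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        return any(block.overlaps(r) for r in ranges)

    matched = sum(1 for src, dst in flows if hit(src) or hit(dst))
    return matched, 100.0 * matched / len(flows)
```

With anonymized data a match only shows that the flow endpoint lies in the same block as a known server, so the resulting percentages are upper bounds on true matches.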
27
Summary
• Ready to provide ESnet the R programs and shell scripts
• Track II: FES and BES report analysis
• Track III application pattern recognition (scp, sftp, bbcp, GridFTP)
• Run Internet2 Netflow analysis for more days and at all routers
28
Thoughts for ESnet Netflow data analysis
• qsub problem: large jobs submitted in batch mode on front-end. Actual servers involved are not advertised on science project web sites.
– Ask sites for GridFTP/scp/bbcp/sftp server addresses (cluster nodes)
• UVA provides ESnet R programs to just extract these long flows, and ESnet runs them, and only provides UVA with long flow data for matched server addresses
29
Contd.
• Anonymizing problem – will make MFDB candidacy determination difficult. Need to know that the flow concatenation process was not accidentally merging two different flows – e.g. the GridFTP problem
• Need to estimate the returns on investment: percent of bytes that are candidates for handling on SDN
30