Uncovering Functional Networks in Internet Traffic Mark Meiss September 25, 2006

Who am I?
Mark Meiss
• Ph.D. candidate in Computer Science
  – Committee: Filippo Menczer, Alessandro Vespignani, Katy Börner, Minaxi Gupta, Kay Connelly
• Researcher at the Advanced Network Management Laboratory (ANML)
  – http://anml.iu.edu/

What’s the agenda?
The subject of today’s story:
• Finding a way to improve security without compromising user privacy
• A case study in applied network science
This work is done with Filippo Menczer and Alessandro Vespignani.

What do people do online?
There’s what we imagine…
• surfing
• sending email
• playing games

What do people do online?
And there’s what is actually happening…
• file sharing
• worms & viruses
• porn

Not just a value judgment
These applications all affect the health of a data network. There are legal problems, yes; but also…
• Crowding out other applications.
  – (Napster was once over 70% of all IUB traffic)
• Compromised computers are used to launch further attacks.
• “Common nuisances” are on the ’Net as well.

The bottom line
Network administrators need to be able to identify what applications are being used on the network.
…but this can be very difficult.

A crash course in data networks
We’ll use a running example:
• Buddy Bradley wants to read a web page about his favorite band at Vulgar Entertainment, Inc.

Quick summary
• Each network conversation is identified by four pieces of information
  – Client address and port number
  – Server address and port number
• The server uses a well-known port number
• The client uses an ephemeral port number

So why is it hard to identify applications?
• Well-known ports are a convention, not a rule
  – Web, e-mail, etc. do have ports assigned by the IANA
  – BitTorrent, Gnutella, Napster, etc. do not
• Client and server ports share the same namespace
• In practice…
  – Any application can use any pair of port numbers
• Our focus: discovering what application is running on a port with no assigned use.

The conventional solution
Let’s look inside all of those packets!

Another problem
• Packet inspection doesn’t scale
  – Modern high-speed networks run at 10 gigabits per second or faster (that’s one full DVD every few seconds)
  – General-purpose computers can’t even copy that data in real time

Introducing the “flow”
• We can summarize Buddy’s Web surfing as two flows:
  – 192.168.65.33:13029 to 10.99.205.122:80 (456 bytes)
  – 10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)

Where do flows come from?
• Architectural features of Internet routers allow them to export flow data
• Routers can’t summarize all the data
  – Packets are sampled to construct the flows
  – Typical sampling rate is around 1:100

What can you do with a flow?
• Usual answer:
  – Treat a flow as a record in a relational database
  – Who talked to port 1337?
  – What proportion of our traffic is on port 80?
  – Who is scanning for vulnerable systems?
  – Which hosts are infected with this worm?
• These are useful and valid questions.

What can you do with a flow?
• Our approach:
  – Treat a flow as a directed, weighted edge
  – The resulting network describes user behavior
• Hold that thought for now…

The Internet2/Abilene network
• TCP/IP network connecting research and educational institutions in the U.S.
  – Over 200 universities and corporate research labs
• Also provides transit service between Pacific Rim and European networks

Why study Abilene?
• Wide-area network that includes both domestic and international traffic
• Heterogeneous user base including hundreds of thousands of undergraduates
• High capacity network (10-Gbps fiber-optic links) that has never been congested
• Research partnership gives access to (anonymized) traffic data unavailable from commercial networks

Flow collection
Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk.

Data dimensions
• Observed Abilene on April 14, 2005
  – About 200 terabytes of data exchanged
  – This is roughly 25,000 DVDs of information
• 600 million flow records
  – Almost 28 gigabytes on disk
  – 15 million unique hosts involved

Weighted bipartite digraph
  $s_{\mathrm{in}} = \sum_{i=1}^{M} w_{i,C}$
  $s_{\mathrm{out}} = \sum_{j=1}^{N} w_{C,j}$

Multiple digraphs
• Port 80 (Web)
• Port 6346 (Gnutella)
• Port 25 (Mail)
• Port 19101 (???)

Application correlation
• Consider the out-strength of a client in the networks for ports p and q:
  $s_i^p = \sum_j w_{ij}^p$
  $s_i^q = \sum_j w_{ij}^q$

Application correlation
• Build a pair of vectors from the distribution of strength values:
  $\vec{p} = (s_1^p, \ldots, s_{|C|}^p)$
  $\vec{q} = (s_1^q, \ldots, s_{|C|}^q)$

Application correlation
• Examine the cosine similarity of the vectors:
  $\sigma(p, q) = \frac{\vec{p} \cdot \vec{q}}{\|\vec{p}\| \, \|\vec{q}\|}$
• When σ = 0, applications p and q are never used together.
• When σ = 1, applications p and q are always used together, and to the same extent.
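
The strength and similarity definitions above can be tried out on a handful of toy flows. The following is a minimal sketch, not the code used in the study: the record layout `(client, server, server_port, bytes)`, the use of byte counts as edge weights, and the names `FLOWS`, `out_strengths`, and `cosine_similarity` are all assumptions made for illustration, and the data is invented.

```python
import math
from collections import defaultdict

# Hypothetical, invented flow records: (client, server, server_port, bytes).
# Assumes client/server roles and the application port are already known,
# and that byte counts serve as the edge weights w_ij.
FLOWS = [
    ("c1", "s1", 80, 400),
    ("c1", "s2", 80, 600),
    ("c2", "s1", 80, 500),
    ("c1", "s3", 25, 100),
    ("c2", "s3", 25, 50),
    ("c3", "s4", 6346, 900),
]


def out_strengths(flows):
    """Return {port: {client: strength}}, summing edge weights over servers."""
    strengths = defaultdict(lambda: defaultdict(float))
    for client, _server, port, nbytes in flows:
        strengths[port][client] += nbytes
    return strengths


def cosine_similarity(sp, sq):
    """Cosine similarity of two sparse strength vectors {client: strength}."""
    dot = sum(v * sq.get(c, 0.0) for c, v in sp.items())
    norm_p = math.sqrt(sum(v * v for v in sp.values()))
    norm_q = math.sqrt(sum(v * v for v in sq.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an empty network shares no usage with anything
    return dot / (norm_p * norm_q)


strengths = out_strengths(FLOWS)
sigma_web_mail = cosine_similarity(strengths[80], strengths[25])
sigma_web_gnutella = cosine_similarity(strengths[80], strengths[6346])
print(round(sigma_web_mail, 3), sigma_web_gnutella)
```

In this toy data the same clients use ports 80 and 25 in the same proportions, so their similarity is close to 1, while ports 80 and 6346 share no clients and score 0. Storing each vector as a sparse dictionary is a design choice for the sketch; with millions of hosts most clients never touch a given port.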
Clustering applications
• We now have σ(p, q) for every pair of ports
• Convert these similarities into distances:
  $d(p, q) = \frac{1}{\sigma(p, q) + \epsilon} - 1$
• If σ = 0, then d is large; if σ = 1, then d = 0
• Now apply Ward’s hierarchical clustering algorithm

Classifying unknown applications
• To classify an unknown application, see what known applications it clusters with
• Our classification experiment
  – Take 16 unknown ports
  – Guess function based on similarity data
  – Validate or invalidate guesses based on external evidence

Example #1
• Port 388 is coupled with FTP and Hotline
  – FTP is a file transfer application
  – Hotline is an early file-sharing application
  – Our guess: traditional file transfer application
• Actual identity: Unidata/LDM
  – Used for moving large meteorological data sets

Example #2
• Port 19101 is coupled with instant messaging and P2P applications
  – Our guess: a P2P application that relies on individual contact for file transfers
• Actual identity: Clubbox
  – Korean file-sharing program
  – Users trade large files on virtual hard drives

Overall results
• For our 16 guesses:
  – 8 were unambiguously correct
  – 6 were partially correct
    • These turned out to be trojans and malware
    • We learned that IRC + P2P = evil afoot
  – 2 could not be confirmed or disproven
    • Ports were in transient use during data collection

Implications
• We can identify the type of an application without examining a single packet!
  – Scalable
  – Preserves user privacy
  – Difficult to do with relational view of flow data

Broader application
• Generic view of the situation:
  – Weighted network of entities derived from activity with labeled classes of interaction
  – Find the sub-network for each labeled class
  – Use the network distributions to calculate similarity scores for the classes
  – Use the similarity scores to cluster the classes
  – Classify unknown classes using these clusters

Thank you!
• Questions and comments…
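
For readers who want to experiment with the clustering and classification steps described in the talk, here is a minimal sketch under stated assumptions. It assumes the distance has the form d = 1/(σ + ε) − 1 with a small ε whose value (`EPSILON`) is chosen arbitrarily here, and it implements Ward's criterion through the standard Lance-Williams update, which may differ from the implementation used in the study. The similarity values in `SIGMA`, the labels, and the function names are invented for illustration.

```python
import itertools

EPSILON = 1e-3  # assumed value; the talk does not state the constant

# Hypothetical similarity scores between four applications (invented values).
SIGMA = {
    ("web", "mail"): 0.9,
    ("web", "p2p"): 0.1,
    ("web", "unknown"): 0.2,
    ("mail", "p2p"): 0.05,
    ("mail", "unknown"): 0.1,
    ("p2p", "unknown"): 0.8,
}


def distance(sigma, eps=EPSILON):
    """Convert similarity to distance; clamped at 0 (a choice made here)."""
    return max(1.0 / (sigma + eps) - 1.0, 0.0)


def ward_cluster(labels, dist):
    """Agglomerative clustering with Ward's Lance-Williams update.

    dist maps frozenset({a, b}) to a distance. Returns the merges in order
    as (members_a, members_b, merge_distance).
    """
    clusters = {i: (label,) for i, label in enumerate(labels)}
    d = {}
    for (i, a), (j, b) in itertools.combinations(enumerate(labels), 2):
        d[frozenset((i, j))] = dist[frozenset((a, b))]
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        i, j = min(itertools.combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        d_ij = d[frozenset((i, j))]
        n_i, n_j = len(clusters[i]), len(clusters[j])
        # Update distances from the merged cluster to every other cluster.
        for k in clusters:
            if k in (i, j):
                continue
            n_k = len(clusters[k])
            d_ik = d[frozenset((i, k))]
            d_jk = d[frozenset((j, k))]
            new_sq = ((n_i + n_k) * d_ik ** 2 + (n_j + n_k) * d_jk ** 2
                      - n_k * d_ij ** 2) / (n_i + n_j + n_k)
            d[frozenset((next_id, k))] = max(new_sq, 0.0) ** 0.5
        merges.append((clusters[i], clusters[j], d_ij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


labels = ["web", "mail", "p2p", "unknown"]
dist = {frozenset(pair): distance(s) for pair, s in SIGMA.items()}
merges = ward_cluster(labels, dist)

# The unknown application is guessed to resemble whatever it merges with first.
first_partner = next(a + b for a, b, _ in merges if "unknown" in a + b)
print(merges[0][:2], first_partner)
```

With these invented scores, "web" and "mail" merge first and "unknown" joins "p2p", which would suggest a P2P-like function for the unknown port. The sketch scans all pairs at every step, which is fine for a few dozen ports but not tuned for large inputs.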
