Uncovering Functional Networks in Internet Traffic
Mark Meiss
September 25, 2006

Who am I?
Mark Meiss
• Ph.D. candidate in Computer Science
  – Committee: Filippo Menczer, Alessandro Vespignani, Katy Börner, Minaxi Gupta, Kay Connelly
• Researcher at the Advanced Network Management Laboratory (ANML)
  – http://anml.iu.edu/

What’s the agenda?
The subject of today’s story:
• Finding a way to improve security without compromising user privacy
• A case study in applied network science
This work is done with Filippo Menczer and Alessandro Vespignani.

What do people do online?
There’s what we imagine…
• surfing
• sending email
• playing games

What do people do online?
And there’s what is actually happening…
• file sharing
• worms & viruses
• porn

Not just a value judgment
These applications all affect the health of a data network. There are legal problems, yes; but also…
• Crowding out other applications.
  – (Napster was once over 70% of all IUB traffic)
• Compromised computers are used to launch further attacks.
• “Common nuisances” are on the ’Net as well.

The bottom line
Network administrators need to be able to identify what applications are being used on the network.
…but this can be very difficult.

A crash course in data networks
We’ll use a running example:
• Buddy Bradley wants to read a web page about his favorite band at Vulgar Entertainment, Inc.

Quick summary
• Each network conversation is identified by four pieces of information
  – Client address and port number
  – Server address and port number
• The server uses a well-known port number
• The client uses an ephemeral port number

So why is it hard to identify applications?
• Well-known ports are a convention, not a rule
  – Web, e-mail, etc. do have ports assigned by the IANA
  – BitTorrent, Gnutella, Napster, etc. do not
• Client and server ports share the same namespace
• In practice…
  – Any application can use any pair of port numbers
• Our focus: discovering what application is running on a port with no assigned use.

The conventional solution
Let’s look inside all of those packets!

Another problem
• Packet inspection doesn’t scale
  – Modern high-speed networks run at 10 gigabits per second or faster (that’s one full DVD every few seconds)
  – General-purpose computers can’t even copy that data in real time

Introducing the “flow”
• We can summarize Buddy’s Web surfing as two flows:
  – 192.168.65.33:13029 to 10.99.205.122:80 (456 bytes)
  – 10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)

Where do flows come from?
• Architectural features of Internet routers allow them to export flow data
• Routers can’t summarize all the data
  – Packets are sampled to construct the flows
  – Typical sampling rate is around 1:100

What can you do with a flow?
• Usual answer:
  – Treat a flow as a record in a relational database
  – Who talked to port 1337?
  – What proportion of our traffic is on port 80?
  – Who is scanning for vulnerable systems?
  – Which hosts are infected with this worm?
• These are useful and valid questions.

What can you do with a flow?
• Our approach:
  – Treat a flow as a directed, weighted edge
  – The resulting network describes user behavior
• Hold that thought for now…

The Internet2/Abilene network
• TCP/IP network connecting research and educational institutions in the U.S.
  – Over 200 universities and corporate research labs
• Also provides transit service between Pacific Rim and European networks

Why study Abilene?
• Wide-area network that includes both domestic and international traffic
• Heterogeneous user base including hundreds of thousands of undergraduates
• High capacity network (10-Gbps fiber-optic links) that has never been congested
• Research partnership gives access to (anonymized) traffic data unavailable from commercial networks

Flow collection
Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk.

Data dimensions
• Observed Abilene on April 14, 2005
  – About 200 terabytes of data exchanged
  – This is roughly 25,000 DVDs of information
• 600 million flow records
  – Almost 28 gigabytes on disk
  – 15 million unique hosts involved

Weighted bipartite digraph
$s_{in} = \sum_{i=1}^{M} w_{i,C}$
$s_{out} = \sum_{j=1}^{N} w_{C,j}$

Multiple digraphs
• Port 80 (Web)
• Port 6346 (Gnutella)
• Port 25 (Mail)
• Port 19101 (???)

Application correlation
• Consider the out-strength of a client in the networks for ports p and q:
$s_i^p = \sum_j w_{ij}^p \qquad s_i^q = \sum_j w_{ij}^q$

Application correlation
• Build a pair of vectors from the distribution of strength values:
$\vec{p} = (s_1^p, \ldots, s_{|C|}^p) \qquad \vec{q} = (s_1^q, \ldots, s_{|C|}^q)$

Application correlation
• Examine the cosine similarity of the vectors:
$\sigma(p, q) = \frac{\vec{p} \cdot \vec{q}}{\lVert \vec{p} \rVert \, \lVert \vec{q} \rVert}$
• When σ = 0, applications p and q are never used together.
• When σ = 1, applications p and q are always used together, and to the same extent.
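
The application-correlation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the flow records are hypothetical tuples of (client, server, service port, bytes), already oriented from client to server, and the names `out_strengths` and `cosine_similarity` are invented for this example.

```python
from collections import defaultdict
from math import sqrt


def out_strengths(flows):
    """Return {port: {client: s_i^p}}, the total bytes each client sent on each port.

    Assumes each flow is (client, server, port, nbytes); this layout is hypothetical.
    Summing over all flows of a client on a port equals summing w_ij^p over servers j.
    """
    strengths = defaultdict(lambda: defaultdict(float))
    for client, _server, port, nbytes in flows:
        strengths[port][client] += nbytes
    return strengths


def cosine_similarity(sp, sq):
    """Cosine similarity of two sparse strength vectors keyed by client."""
    dot = sum(w * sq.get(client, 0.0) for client, w in sp.items())
    norm_p = sqrt(sum(w * w for w in sp.values()))
    norm_q = sqrt(sum(w * w for w in sq.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # a port with no traffic is treated as unrelated (assumption)
    return dot / (norm_p * norm_q)


# Toy data, invented for illustration only.
flows = [
    ("a", "s1", 80, 100), ("a", "s2", 25, 50),
    ("b", "s1", 80, 200), ("b", "s2", 25, 100),
    ("c", "s3", 6346, 300),
]
s = out_strengths(flows)
sigma_web_mail = cosine_similarity(s[80], s[25])    # same clients, same proportions
sigma_web_p2p = cosine_similarity(s[80], s[6346])   # disjoint sets of clients
```

In the toy data, clients a and b use ports 80 and 25 in the same proportions, so that pair sits at the σ = 1 end of the scale, while ports 80 and 6346 share no clients and sit at σ = 0.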
Clustering applications
• We now have σ(p, q) for every pair of ports
• Convert these similarities into distances:
$d(p, q) = \frac{1}{\sigma(p, q)} - 1$
• If σ = 0, then d is large; if σ = 1, then d = 0
• Now apply Ward’s hierarchical clustering algorithm

Classifying unknown applications
• To classify an unknown application, see what known applications it clusters with
• Our classification experiment
  – Take 16 unknown ports
  – Guess function based on similarity data
  – Validate or invalidate guesses based on external evidence

Example #1
• Port 388 is coupled with FTP and Hotline
  – FTP is a file transfer application
  – Hotline is an early file-sharing application
  – Our guess: traditional file transfer application
• Actual identity: Unidata/LDM
  – Used for moving large meteorological data sets

Example #2
• Port 19101 is coupled with instant messaging and P2P applications
  – Our guess: a P2P application that relies on individual contact for file transfers
• Actual identity: Clubbox
  – Korean file-sharing program
  – Users trade large files on virtual hard drives

Overall results
• For our 16 guesses:
  – 8 were unambiguously correct
  – 6 were partially correct
    • These turned out to be trojans and malware
    • We learned that IRC + P2P = evil afoot
  – 2 could not be confirmed or disproven
    • Ports were in transient use during data collection

Implications
• We can identify the type of an application without examining a single packet!
  – Scalable
  – Preserves user privacy
  – Difficult to do with relational view of flow data

Broader application
• Generic view of the situation:
  – Weighted network of entities derived from activity with labeled classes of interaction
  – Find the sub-network for each labeled class
  – Use the network distributions to calculate similarity scores for the classes
  – Use the similarity scores to cluster the classes
  – Classify unknown classes using these clusters

Thank you!
• Questions and comments…
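
As a supplementary sketch of the step described under "Clustering applications", the following Python converts similarities to distances with d = 1/σ − 1 and merges ports using the Lance-Williams update for Ward's method. Several things here are assumptions rather than details from the talk: the floor `eps` that keeps d finite when σ = 0, the function names, the toy similarity values, and applying the Ward update directly to these non-Euclidean distances.

```python
from math import sqrt


def similarity_to_distance(sigma, eps=1e-6):
    """d = 1/sigma - 1; sigma is floored at eps (an assumption) so d stays finite."""
    return 1.0 / max(sigma, eps) - 1.0


def ward_cluster(labels, dist):
    """Agglomerative clustering with the Lance-Williams update for Ward's method.

    labels: list of item names (here, port numbers)
    dist:   dict mapping frozenset({a, b}) -> distance between items a and b
    Returns the merges in order as (cluster_a, cluster_b, merge_distance),
    where each cluster is a tuple of the original labels.
    """
    clusters = {i: (lab,) for i, lab in enumerate(labels)}
    d = {}
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            d[(i, j)] = dist[frozenset((labels[i], labels[j]))]
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dki = d[(min(k, i), max(k, i))]
            dkj = d[(min(k, j), max(k, j))]
            # Ward update on squared distances, then back to a distance.
            num = (ni + nk) * dki ** 2 + (nj + nk) * dkj ** 2 - nk * dij ** 2
            updated[k] = sqrt(num / (ni + nj + nk))
        merges.append((clusters[i], clusters[j], dij))
        merged = clusters[i] + clusters[j]
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        del clusters[i], clusters[j]
        for k, v in updated.items():
            d[(k, next_id)] = v  # k is always smaller than the new id
        clusters[next_id] = merged
        next_id += 1
    return merges


# Toy similarities, invented for illustration only.
ports = [80, 25, 6346, 19101]
sigma = {frozenset((80, 25)): 0.9, frozenset((6346, 19101)): 0.8}
dist = {}
for a in ports:
    for b in ports:
        if a < b:
            pair = frozenset((a, b))
            dist[pair] = similarity_to_distance(sigma.get(pair, 0.1))
merges = ward_cluster(ports, dist)
```

With these toy values the two strongly correlated pairs merge first, and the two resulting groups join only at the final step. An unknown port would then be read off by the known ports it shares a cluster with.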