Mapping the Gnutella Network
Presented By: Tony Young, M.Math Candidate
October 7th, 2004

Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
  • Summary and Improvements
  • Paper Review

Introduction
  • Peer-to-peer systems have recently exploded onto the internet scene
  • Two main contributing factors:
      • Low cost and high availability of resources (computing and storage)
      • Increased network connectivity (proliferation of “always on” connections)

Introduction
  • Peer systems build a virtual topology (overlay) with its own routing mechanisms
  • The topology of the overlay and the routing protocols directly affect:
      • Performance: number of physical hops needed to send a message through the virtual overlay
      • Reliability: will a message actually reach the other end?
      • Scalability: can other nodes be added while keeping performance good?
      • Anonymity: can we protect the identity of nodes in the network?

Introduction
  • Gnutella is studied in depth, and analysis is performed to determine how the overlay affects the four characteristics previously mentioned
      • Started by capturing the network topology and behaviour
      • Performed a macroscopic analysis of the network to evaluate costs and benefits
      • Investigated possible improvements

Introduction
  • Two questions drive the analysis:
      • What is the connectivity structure of Gnutella?
      • How well does the Gnutella overlay map to the actual network topology?

Introduction
  • Connectivity Structure
      • Many diverse networks, natural ones included, have a few well connected nodes and many poorly connected nodes
          • i.e. power law networks
      • We will see that Gnutella is not a pure power law network, but it still has good fault tolerance and is less vulnerable to DoS attacks than a pure power law network

Introduction
  • Overlay Topology
      • Important for ISPs: overlays that don't map closely to the physical topology add stress on the infrastructure and cost ISPs more money
      • Scalability is directly linked to efficient use of network resources

Gnutella in Depth
  • Gnutella is an open protocol
  • It is decentralized and unstructured
  • Allows group membership and searching of available files for download
  • Gnutella should operate in a dynamic environment where hosts can join or leave at any time
  • Gnutella should offer good performance and scalability
  • External attacks should not cause data loss or performance degradation
  • Users seeking or providing unpopular material should stay anonymous

Gnutella in Depth
  • Gnutella nodes are called “servents” (SERVer-cliENTS)
      • Provide a client-side interface to allow searching of the file base
      • Provide server-side storage, routing and response to network messages and requests

Gnutella in Depth
  • To connect, a node contacts an “always on” host (e.g. gnutella.com) and sends a PING
  • The contacted host replies with a PONG and forwards the PING to other nodes in the network, which reply with PONG messages and forward the PING in turn
  • The PING stops after TTL hops

Gnutella in Depth
  • To find files, users submit QUERY messages to other nodes
  • Messages are broadcast to all neighbours, who forward them on to their own neighbours, and so on, for TTL hops
  • QUERY RESPONSE messages are returned to the querying node

Gnutella in Depth
  • To download a file, nodes send GET and PUSH messages to the individual hosts holding the file
      • i.e. transfer requests and transfers are routed directly between the communicating hosts, and not back-propagated

Gnutella in Depth
  • The messaging protocol has three important features:
      • TTL and “hops passed” fields are attached to each message
      • A randomly generated message ID is attached to each message
      • Each node keeps track of recently routed messages to prevent re-broadcasting and to implement back-propagation

Gnutella in Depth
  • PING message contains the host address and name, number of files and size of data store
  • PONG message contains the same information from the host that received the PING

Gnutella in Depth
  • PING messages propagate until the TTL has expired
      • Hop count is incremented at each servent receiving the PING
      • Message propagates until hop count = TTL
  • PONG messages are back-propagated (i.e. sent along the reverse of the path that the original message followed) to the host that initiated the PING

Gnutella in Depth
  • QUERY messages are sent the same way as PING messages
  • Nodes check the requested search string against the names of their locally stored files
  • QUERY RESPONSE messages are back-propagated to the querying node and include the information necessary to download the file

The Crawler
  • In order to conduct the network tests, a crawler was developed to gather information about the virtual topology
  • The crawler starts with a list of active nodes and sends a PING message to each of them
  • PONG messages are received, and the IP, port, number of stored files and size of archive are stored in a table
  • The PING propagates to other nodes, and their PONGs are back-propagated to the crawler

The Crawler
  • A sequential version of the crawler was initially developed
      • i.e. send a PING with an empirically determined optimal TTL to a set of nodes, resend to the nodes where the PING stops, and so on
  • Proved to be very slow: 50 hours to collect data from a 4 000 node network
  • Slowness means two things:
      • Not scalable: crawling will get slower as more nodes are added
      • Does not give an accurate network snapshot: the network changes drastically over 50 hours!

The Crawler
  • A distributed crawler was developed next
      • Client-server architecture
      • Server maintains the node list and creates a network graph
      • Clients receive a list of nodes to contact and discover the neighbours of those nodes
  • Decided to use only 50 clients at once
      • Reduces the invasiveness of the search and the consumption of network resources
  • Reduced crawling time to a couple of hours for a large initial list and a network of 30 000 nodes

The Crawler
  • Network membership is defined as follows:
      • A node is a member of the network if the crawler is able to connect to it
      • A node might be excluded from network membership if it was reported as active by a server or another node, but the crawler could not contact it
          • This might happen if nodes go offline before the crawler can contact them

Outline
  • Introduction
  • Gnutella in Depth
  • The Crawler
  • Analysis of Network
      • Growth Trends
      • Traffic Estimates
      • Connectivity and Reliability
      • Overlay vs. Topology
  • Summary and Improvements
  • Paper Review

Analysis of Network
  • Data was collected over a 6 month period
  • Data shows:
      • Overhead traffic is decreasing
      • Traffic volume is a significant barrier to growth

Growth Trends
  • Size of the network is growing rapidly
      • Largest connected component in November 2000 had 2 063 nodes
      • Largest connected component in May 2001 had 48 195 nodes!
      • The largest connected component has grown more than 23 times!

Growth Trends
  [Figure]
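
The growth figures above are sizes of the largest connected component of the crawled overlay. The slides do not say how the component was computed, so the following is only a minimal sketch, assuming the crawl is available as a list of undirected (node, node) pairs. The function name `largest_component_size` and the toy edge list are hypothetical.

```python
from collections import deque


def largest_component_size(edges):
    """Return the number of nodes in the largest connected component.

    `edges` is an iterable of (node, node) pairs describing an
    undirected overlay graph (an assumed input format).
    """
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search over one component, counting its nodes.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


# Hypothetical toy overlay: a triangle A-B-C plus a separate pair D-E.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]
print(largest_component_size(toy_edges))  # 3
```

Running the same function on successive crawl snapshots would give one component size per crawl, which is the quantity compared between November 2000 and May 2001.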
Growth Trends
  • Despite the explosive growth, most nodes are not connected for long
  • Successive crawls of the network found:
      • 40% of nodes leave the network in less than 4 hours
      • 25% of nodes are alive for more than 24 hours

Traffic Estimates
  • A modified version of the crawler recorded the traffic generated across one randomly chosen link
      • 36% of total traffic (in bytes) is user-generated QUERY messages
      • 55% is group membership (PING/PONG) messages
      • 9% is non-standard or malformed messages
  • N.B. File transfer traffic is excluded

Traffic Estimates
  • After June 2001 (when a new Gnutella implementation was released):
      • 92% of total traffic (in bytes) is QUERY messages
      • 8% is group membership (PING/PONG) messages
  • N.B. File transfer traffic is excluded

Traffic Estimates
  • 95% of all nodes are reachable within 7 hops. Thus, each message typically uses a TTL of 7
  • Most links are expected to support similar amounts of traffic for these reasons
  • As verified empirically, the total Gnutella-generated traffic is proportional to the number of connections in the network
  • However, the average number of connections per node stays relatively constant as the network grows

Traffic Estimates
  [Figure]

Traffic Estimates
  [Figure]

Traffic Estimates
  • The total traffic estimate for the Gnutella network is 1 Gbps
      • i.e. 170 000 connections for a 50 000 node network times 6 kbps per connection
  • This is approximately 330 TB/month!
      • Excluding file transfers!
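
The arithmetic behind the 1 Gbps and 330 TB/month figures can be checked with a short sketch. The connection count and per-connection rate come from the slide. The 30-day month and decimal units (1 kbps = 1 000 bit/s, 1 TB = 10^12 bytes) are assumptions, and the function names are hypothetical.

```python
CONNECTIONS = 170_000        # connections in a 50 000 node network (from the slide)
KBPS_PER_CONNECTION = 6      # average traffic per connection (from the slide)
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def total_traffic_bps(connections, kbps_per_connection):
    """Aggregate traffic in bits per second, assuming 1 kbps = 1 000 bit/s."""
    return connections * kbps_per_connection * 1_000


def monthly_volume_tb(bps, seconds=SECONDS_PER_MONTH):
    """Bytes moved in one month, in terabytes (assuming 1 TB = 10**12 bytes)."""
    return bps / 8 * seconds / 1e12


bps = total_traffic_bps(CONNECTIONS, KBPS_PER_CONNECTION)
print(f"{bps / 1e9:.2f} Gbps")                   # 1.02 Gbps
print(f"{monthly_volume_tb(bps):.0f} TB/month")  # 330 TB/month
```

Under these assumptions the product is about 1.02 Gbps, which rounds to the quoted 1 Gbps and gives roughly 330 TB per month.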
Traffic Estimates
  • This total is 1.7% of the total internet traffic in US backbones in December 2000
  • This volume of traffic is believed to be an obstacle to further growth
  • The underlying network topology must be used more efficiently to allow scaling and wider deployment

Connectivity and Reliability
  • Note: nodes decide locally:
      • How many connections to support
      • When to add or drop a connection
  • Recent research shows that many natural systems organize themselves into “power law networks”
      • i.e. networks where a few nodes are well connected and most nodes have very few connections

Connectivity and Reliability
  • Power law networks:
      • Number of nodes with L links (connections) is proportional to L^(-k), where k is system-dependent
      • Resilient to losing many poorly connected nodes
      • Fall apart quickly if only a few well connected nodes are lost
      • Extremely robust to random failures, but vulnerable to targeted attacks

Connectivity and Reliability
  • Power law networks appear as a straight line on a log-log plot
  • Data for December 2000 shows that early Gnutella networks were power law
  • Data for March 2001 shows that later Gnutella networks are a mixture
      • The number of nodes is roughly constant across link counts below 10
      • Above 10 links, nodes follow a power law structure

Connectivity and Reliability
  [Figure]

Connectivity and Reliability
  [Figure]

Connectivity and Reliability
  • Why did the distribution change?
  • Two possible reasons:
      • About 20% of Gnutella users have modem connections - DSL and faster links can support more connections
      • Gnutella users run as many connections as their network can support - the perception is that more connections = better query results

Connectivity and Reliability
  • Does the change in distribution affect reliability?
  • Yes!
      • Preserves resilience to random failures
      • Makes the network less dependent on well connected nodes and hence less prone to DoS attacks

Overlay vs. Topology
  • Peer systems change the way bandwidth is used on the internet
      • Servers are at the edge of the network now, and peers are constantly downloading
  • Most ISPs use flat-rate billing
      • Peer systems may break this model!

Overlay vs. Topology
  • Due to the amount of traffic peer systems generate, efficient use of resources is important
  • The greater the mismatch between the overlay and the physical network topology, the more messages need to be transmitted to route information from A to B
      • This means more stress on the network resources

Overlay vs. Topology
  • Communication from A to all other nodes requires one message over the D - E link
  [Figure]

Overlay vs. Topology
  • Communication from A to all other nodes requires six messages over the D - E link
  [Figure]

Overlay vs. Topology
  • How well does Gnutella map to the topology?
      • Assume that domain names are roughly indicative of the hierarchy of the internet
      • Check how well the generated traffic maps to the clusters of domain names found by the crawler

Overlay vs. Topology
  • After analysis of 10 overlays, it was found that Gnutella nodes often connect to peers outside of their respective domains
  • Thus, it appears that Gnutella does not make efficient use of the underlying topology

Summary and Improvements
  • Gnutella has a multimodal connectivity distribution that is partially constant and partially power law
      • Network is resilient to random failures
      • Network is harder for malicious parties to attack, but not immune to DoS attacks
  • Gnutella makes little effort to ward off attackers
      • e.g. topology, connectivity and traffic information is easy to obtain and can be used to plan attacks

Summary and Improvements
  • Gnutella's traffic volume is a significant fraction of all internet traffic
      • Makes the future growth of the network reliant on efficient use of the topology
  • Gnutella's overlay does not match the network topology very well
      • This substantially increases the number of messages and the amount of network traffic generated

Summary and Improvements
  • Necessary improvements:
      • Make efforts to hide overlay and connectivity information (encryption?)
      • Match the overlay more closely with the topology
  • Limits to growth must be addressed first, and quickly, given the rate at which Gnutella is growing

Summary and Improvements
  • Suggested improvements:
      • Exploit locality of files and query distribution (e.g. caching and localized queries)
      • Replace the query flooding strategy with something more efficient (e.g. superpeer routing and group communication)

Paper Review
  • Organization
      • Some discussions of the Gnutella architecture and protocols were scattered throughout the paper
      • Should have combined everything in a more logical order inside the protocol section
  • Writing Style
      • Generally very good. Some missing words and poor grammar

Paper Review
  • Novel Ideas
      • Presented a qualitative and quantitative analysis of the Gnutella network, and some important points for P2P as a whole
  • Content
      • Some background information was missing
      • Some claims were made without supporting evidence, or by just referring the reader to another paper

Questions?