Gnutella Crawling Lessons Learned
David Deschenes and Scott Weber
Department of Computer Science & Engineering
Lehigh University
April 30, 2003

Outline
- Gnutella Protocol Changes
- Crawling Concepts
- Crawler Architecture
- Comparison of Protocol Versions
- Characterizing Valuable Hosts
- Exploring Ping Depths
- Looking at Time Effects
- Summary
- Future Work

Gnutella Protocol Changes
- Two versions: 0.4 and 0.6
- Version 0.6 introduced in late 2001
- Version 0.6 supports protocol extensions by allowing exchange of header information
- The X-Try header may have implications for future Gnutella crawlers

Crawling Concepts
- Similar to WWW document crawling
- Must make use of host discovery mechanisms (ping/pong messages, X-Try headers)
- Pong messages encode network links
- Can never know whether the entire network has been mapped

Crawler Architecture
[Diagram: components labeled HostList, Listener, Incoming, Crawler, Outgoing]

Comparison of Protocol Versions
- Simultaneous 0.4 and 0.6 crawls run eight times daily, April 10-16
- Each crawl used an identical start set of about 50,000 hosts
- Goals
  - Determine extent of deployment and impact of version 0.6
  - Determine usefulness of each protocol version in mapping the network

Host Identifications Per Visit
[Chart: average number of hosts identified (0.0000 to 0.5000) vs. time of day (hour, 0 to 21); series: Version 0.4 Total, Version 0.6 by Pong Only, Version 0.6 Total]

Measured Network Sizes
[Chart: average number of hosts (0 to 50,000) vs. time of day (hour, 0 to 21); series: Version 0.4, Version 0.6]

Characterizing Valuable Hosts
- Analysis of data obtained from the protocol version comparison experiment
- Goals
  - Determine how to identify the hosts that will give us the best topology data
  - Determine how to order those hosts

Sessions Breakdown
[Charts: labels Discovered, Known, Identified, Unidentified; values 24.42%, 39.27%, 60.73%, 75.58%]

Value of Identification Types
[Charts: % of hosts providing sessions (axis 0.0000 to 0.2750) by identification type: Pongs Only, X-Trys Only, Pongs & X-Trys; shares shown: 9.84%, 22.48%, 67.68%]

Exploring Ping Depths
- Ten crawls run daily, April 9-23
- Each crawl used a different TTL for the ping message sent
- Goals
  - Determine which ping depth produces the best topology data
  - Examine change in efficiency of crawl as ping depth varies

Pongs by Hops
[Chart: average number of pongs (0 to 6000) vs. pong hops (0 to 11); one series per ping depth, 2 through 11]

Ping Depth Efficiency
[Chart: average number of hosts (0 to 70,000) and a second axis (0.0000 to 0.1000) vs. ping depth (2 to 11); series: Discovered, Discovered with Sessions]

Looking at Time Effects
- Analysis of data obtained from the protocol version comparison experiment
- Goals
  - Identify diurnal patterns in network activity
  - Contrast weekday with weekend network activity

Fluctuation in Network Activity
[Chart: % deviation from average (-15% to 15%) vs. time of day (hour, 0 to 21); series: Discoveries, Pongs, Measured Size, Session Length, Unique Pongs, Sessions, TCP Connections]

Weekend Gains
[Chart: % gain over weekday (0% to 26%) for Discoveries, Pongs, Unique Pongs, Sessions, TCP Conns, Measured Size, Session Length]

Summary
- Version 0.4 is still active, but cannot be used alone for an effective crawl
- Proper ordering of hosts may yield more efficient crawls
- Choice of ping depth depends on goal of crawl
  - Depth 7 yields a good balance between efficiency and quality
- The network is not stable on long or short timescales
  - Must crawl often to remove statistical effects

Future Work
- Modify crawler to take into account information gathered in this experiment
- Study how the Gnutella network changes over time
- Compare complete network measurements to those of LimeWire

Questions?
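
Sketch: reading X-Try hosts from a 0.6 handshake

The X-Try header named under Gnutella Protocol Changes lists alternative hosts as comma-separated host:port pairs in a version 0.6 handshake response. The following is a minimal sketch of how a crawler might extract them, not the crawler described in these slides. The function name `parse_x_try` and the sample response are illustrative assumptions.

```python
def parse_x_try(response: str) -> list[tuple[str, int]]:
    """Return (host, port) pairs found in X-Try headers of a handshake response.

    Assumption: headers are CRLF-separated "Name: value" lines and X-Try
    values are comma-separated host:port entries. Entries without a numeric
    port are skipped rather than given a default.
    """
    hosts = []
    for line in response.split("\r\n"):
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "x-try":
            continue  # status line, blank line, or some other header
        for entry in value.split(","):
            host, colon, port = entry.strip().rpartition(":")
            if colon and host and port.isdigit():
                hosts.append((host, int(port)))
    return hosts


# Hypothetical response from a host that refuses the connection
# but still advertises other hosts to try.
sample = (
    "GNUTELLA/0.6 503 Service Unavailable\r\n"
    "X-Try: 10.0.0.1:6346, 10.0.0.2:6347\r\n"
    "\r\n"
)
hosts = parse_x_try(sample)
```

A host that rejects a connection can still contribute new hosts this way, which is one reason the header matters for crawlers.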
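
Sketch: crawl frontier

The Crawling Concepts slide compares Gnutella crawling to WWW document crawling, with pongs and X-Try headers playing the role of hyperlinks. The sketch below shows that idea as a breadth-first frontier over hosts. It is an illustration under stated assumptions, not the architecture on the Crawler Architecture slide: `discover` is a hypothetical callback standing in for "connect, ping, and collect the hosts named in pongs or X-Try headers", and `max_visits` is an assumed budget, since the crawler can never know whether the whole network has been mapped.

```python
from collections import deque
from typing import Callable, Iterable

Host = tuple[str, int]  # (address, port)


def crawl(
    start: Iterable[Host],
    discover: Callable[[Host], Iterable[Host]],
    max_visits: int,
) -> tuple[set[Host], set[tuple[Host, Host]]]:
    """Breadth-first crawl from a start set of hosts.

    Returns the set of hosts seen and the set of (visited, reported) pairs,
    which approximate network links when the reports come from pongs.
    """
    start_hosts = list(start)
    frontier = deque(start_hosts)   # hosts waiting to be visited
    seen = set(start_hosts)         # hosts already queued or visited
    edges = set()
    visits = 0
    while frontier and visits < max_visits:
        host = frontier.popleft()
        visits += 1
        for neighbor in discover(host):
            edges.add((host, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen, edges


# Toy network used only to exercise the loop; addresses are made up.
a, b, c = ("10.0.0.1", 6346), ("10.0.0.2", 6346), ("10.0.0.3", 6346)
toy_links = {a: [b, c], b: [c], c: [a]}
seen, edges = crawl([a], lambda h: toy_links.get(h, []), max_visits=10)
```

Replacing the FIFO queue with a priority queue is one way to act on the Summary point that proper ordering of hosts may yield more efficient crawls.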
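
Sketch: deviation and gain metrics

The Fluctuation in Network Activity and Weekend Gains charts report "% deviation from average" and "% gain over weekday". The slides do not state how these were computed, so the definitions below are assumptions: deviation of each value from the mean of its series, and relative difference of a weekend average from a weekday average. The input numbers are placeholders, not measurements from the experiment.

```python
def percent_deviation(values: list[float]) -> list[float]:
    """Percent deviation of each value from the mean of the series (assumed definition)."""
    mean = sum(values) / len(values)
    return [100.0 * (v - mean) / mean for v in values]


def weekend_gain(weekday_avg: float, weekend_avg: float) -> float:
    """Percent gain of the weekend average over the weekday average (assumed definition)."""
    return 100.0 * (weekend_avg - weekday_avg) / weekday_avg


# Placeholder inputs chosen for easy arithmetic.
deviations = percent_deviation([90.0, 100.0, 110.0])
gain = weekend_gain(weekday_avg=200.0, weekend_avg=250.0)
```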