Characterizing Unstructured Overlay Topologies in Modern P2P File-Sharing Systems Daniel Stutzbach – University of Oregon Reza Rejaie – University of Oregon Subhabrata Sen – AT&T Labs Internet Measurement Conference Berkeley, CA, USA October 19th, 2005 Motivation P2P file-sharing systems are very popular in practice. Most use an unstructured overlay Understanding overlay properties is important: Several million simultaneous users collectively. 60% of all Internet traffic [CacheLogic Research 2005] Understanding how existing P2P systems function Developing and evaluating new systems Unstructured overlays are not well-understood. We studied overlay properties in Gnutella. Size: one of the largest P2P systems; more than 1 million users Mature: In use for several years; older studies for comparisons Open: No reverse-engineering needed October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 2/18 Defining the Problem Ultrapeer Gnutella uses a two-tier overlay. Improves scalability. Ultrapeers form an unstructured mesh. Leaf peers connect to the ultrapeers. eDonkey, FastTrack are similar. Studying the overlay requires snapshots. Top-level overlay Snapshots capture the overlay as a graph. Individual snapshots reveal graph properties. Consecutive snapshots reveal dynamics. Leaf However, capturing accurate snapshots is difficult. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 3/18 Challenges in Capturing Accurate Snapshots Snapshots are captured iteratively by a crawler. An ideal snapshot is instantaneous. But the overlay is large and rapidly changing. Therefore, captured snapshots are distorted. Sampling: Partial snapshots are less distorted, but may be unrepresentative For some types of analysis, the whole graph is needed. Previous studies capture either: Complete snapshots slowly, or Partial snapshots. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 4/18 Cruiser: a Fast Gnutella Crawler Features: Cruiser is orders of magnitude faster. Distributed, highly parallelized implementation Dynamic adaptation to bandwidth and CPU constraints Captures one million nodes in around 7 minutes 140,000 peers/min, compared to 2,500 peers/min [Saroiu 02] We investigated the effects of speed on distortion. Daniel Stutzbach and Reza Rejaie, “Capturing Accurate Snapshots of the Gnutella Network”, the Global Internet Symposium, March, 2005. 4% node distortion 15% edge distortion October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 5/18 Data Set More than 80,000 snapshots, over the past year. To examine static properties, we focus on four: Date Total Nodes Leaves Ultrapeers Top-level Edges 9/27/04 725,120 614,912 110,208 1,212,772 10/11/04 779,535 662,568 116,967 1,244,219 10/18/04 806,948 686,719 120,229 1,331,745 2/2/05 1,031,471 873,130 158,345 1,964,121 To examine dynamic properties, we use slices: Each slice is 2 days of ~500 back-to-back snapshots Captured starting 10/14/04, 10/21/04, 11/25/04, 12/21/04, and 12/27/04 October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 6/18 Summary of Characterizations Graph Properties Implementation heterogeneity Degree Distribution: Reachability: Top-level degree distribution Ultrapeer-leaf connectivity Degree-distance correlation Dynamic Properties Existence of stable core: Uptime distribution Biased connectivity Properties of stable core: Largest connected component Path lengths Clustering coefficient Path lengths Eccentricity Small world properties Resiliency October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 7/18 Top-level Degree Max 30 in most clients Max 75 in some clients Custom This is the degree distribution among ultrapeers. There are obvious peaks at 30 and 70 neighbors. A substantial number of ultrapeers have fewer than 30. What happened to the power-law seen in prior studies? October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 8/18 What happened to power-law? [Ripeanu 02 ICJ] When a crawl is slow, many short-lived peers report long-lived peers as neighbors. However, those neighbors are not all present at the same time. Degree distribution from a slow crawl resembles prior results. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 9/18 Shortest-Path Distances Distribution of distances among ultrapeers and among all peers In the top-level, 70% of distances are exactly 4 hops. Across all peers, most distances are 5 or 6 hops. Shows the effect of the two-tier with multiple parents Despite large size, distances are short. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 10/18 Is Gnutella a Small World? Small worlds arise naturally in many places. Movies actors, power grid, co-authors of papers They have short distances, but significant clustering, compared to a similar random graph. Mean Clustering Distance Coefficient Gnutella 4.2 0.018 Random 3.8 0.00038 Conclusion: Gnutella is a small world. Very high clustering adversely affects flooding queries But Gnutella isn’t clustered enough to affect performance. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 11/18 Resiliency to Node Failure Random Highest degree first After removing nodes, this figure shows how many remain connected. The Gnutella topology is extremely resilient to random node failure. It’s resilient even when the highest-degree nodes are removed first. Complex algorithms are not necessary for ensuring resilience. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 12/18 What about Dynamic Properties? Prior work suggests many peers are short-lived while others are very long-lived. How do these nodes interact? Methodology: Capture a long series of back-to-back snapshots Annotate the last snapshot with the uptime of each peer Examine the properties of the annotated topology Group peers by uptime Present for 5 snapshots Present for 2 snapshots Departed peer Newly arrived peer Time October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 13/18 Stable Core > 20 h Most peers are recent arrivals. > 10 h Other peers have been around for a long time. We can select a set of peers based on a minimum uptime threshold. We call this the stable core. Does the longevity of a peer affect who its neighbors are? October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 14/18 Biased Connectivity Hypothesis: long-lived nodes tend to be more connected to other long-lived nodes Rationale: Once connected, they stay connected. The longer they’re around, the more opportunities they have to neighbor. Approach: Check for biased connectivity Randomize the edges to create a graph without biased connectivity Then compare Are there more edges in the observed stable core compared to random? October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 15/18 Stable Core Edges 20%—40% more edges in the stable core compared to random. There is an onion-like bias where long-lived peers are more likely to be connected to other long-lived peers. We examined other properties of the stable core. Despite high churn, there is a relatively stable “backbone”. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 16/18 Summary Characterizations of recent and accurate snapshots Graph properties: The degree distribution in Gnutella is not power law. Gnutella exhibits small world characteristics. Gnutella is resilient. Dynamic properties: There is a stable core within the topology Peer churn causes the stable core to have an onion-like shape. This effect is likely to occur in any unstructured system. October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 17/18 Future Work Examining long-term trends in Gnutella using many snapshots. Characterizing churn Characterizing properties of other widelydeployed P2P systems Kad (a DHT with more than 1 million users) BitTorrent Developing sampling techniques for P2P October 19th, 2005 http://mirage.cs.uoregon.edu/P2P IMC 2005 Slide 18/18