P2P Architecture Case Study: Gnutella Network
Matei Rîpeanu
The University of Chicago

Why analyze Gnutella network?
• Unprecedented scale: up to 100k nodes, 100TB data, 10M files today
• Self-organizing network
• Staggering growth: more than 50 times during first half of 2001
• Open architecture, simple and flexible protocol
• Interesting mix of social and technical issues

Overview
• Gnutella protocol
• Tools for exploring the network
• Network growth
• Structural graph analysis
  – Is Gnutella a power-law network?
• Generated (overhead) network traffic
  – Traffic estimates
  – Overlay network topology mapping

Gnutella protocol overview
• P2P file sharing application on top of an overlay network
  – Nodes maintain open TCP connections
  – Messages are broadcast (flooded) or back-propagated
• Protocol:

                          Membership    Query        File download
  Broadcast (flooding)    PING          QUERY
  Back-propagated         PONG          QUERY HIT
  Node to node                                       GET, PUSH

Gnutella search mechanism
[Figure: overlay of seven nodes (1 to 7), built up over successive frames; copies of file A and the replies A:5 and A:7 are marked on the graph, and the final frame shows the download of A]
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is back-propagated
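The search steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Gnutella implementation: the function name `flood_query`, the seven-node topology, and the placement of file A on nodes 5 and 7 are assumptions made for the example (the wiring of the graph in the slides is not recoverable from the text). Each node remembers the neighbor it first heard the query from, duplicate queries are dropped, and replies travel back hop by hop along that reverse path.

```python
from collections import deque

def flood_query(graph, files, origin, wanted, ttl=7):
    """Flood a query from `origin`; return {holder: reply path back to origin}."""
    came_from = {origin: None}   # neighbor each node first received the query from
    queue = deque([(origin, ttl)])
    hits = {}
    while queue:
        node, remaining = queue.popleft()
        if node != origin and wanted in files.get(node, ()):
            # Back-propagate the reply hop by hop along the reverse path.
            path, hop = [], node
            while hop is not None:
                path.append(hop)
                hop = came_from[hop]
            hits[node] = path
        if remaining == 0:
            continue                      # TTL exhausted: do not forward
        for neighbor in graph[node]:
            if neighbor not in came_from:  # duplicate queries are dropped
                came_from[neighbor] = node
                queue.append((neighbor, remaining - 1))
    return hits

# Hypothetical topology and file placement (assumptions for illustration only).
graph = {1: [2, 7], 2: [1, 3, 4], 3: [2, 5], 4: [2, 6],
         5: [3, 6], 6: [4, 5], 7: [1]}
files = {5: {"A"}, 7: {"A"}}

print(flood_query(graph, files, origin=2, wanted="A"))
```

Lowering `ttl` shrinks the search horizon: with `ttl=1` only the direct neighbors of node 2 see the query, and in this hypothetical graph no node holding A is reached.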
Gnutella search mechanism (final step)
• File download: node 2 downloads file A directly from a node that replied

Tools for network exploration
• Eavesdropper: modified nodes inserted into the network to eavesdrop on traffic
• Crawler: connects to all active nodes and uses the membership protocol to discover the graph topology
  – Client-server approach
• Graph analysis tools: high-volume offline computations

Network growth
[Figure: Gnutella network growth; number of nodes in the largest network component ('000, axis 10 to 50), sampled between 11/20/00 and 05/29/01]
• High user interest
  – Users tolerate high latency, low quality results
• Better resources
  – DSL and cable modem nodes grew from 24% to 41% over first 6 months. Today >50%.
• Open architecture / open-source environment
  – Competing implementations
  – Lower overhead network traffic, improved resource utilization, better structure

Growth invariants (1): avg. node connectivity
• 3.4 links per node on average
[Figure: number of links ('000, 0 to 200) versus number of nodes (0 to 50,000)]

Growth invariants (2): network diameter
• Node-to-node distance maintains a similar distribution
• Average node-to-node distance grew 25% while the network grew 50 times over 6 months
[Figure: percent of node pairs (0% to 50%) versus node-to-node shortest path (1 to 12 hops)]

Is Gnutella a power-law network?
• Power-law networks: the number of links per node follows a power-law distribution
• Examples: the Internet, in/out links to/from HTML pages, citation networks, the US power grid, social networks
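One rough way to check whether the number of links per node follows a power law is to build the degree histogram from a crawled edge list and fit a straight line on log-log axes. The sketch below assumes an undirected edge list; the function names are hypothetical, and an ordinary least-squares fit is only a quick diagnostic, not necessarily the method used in this study.

```python
import math
from collections import Counter

def degree_histogram(edges):
    """Map each degree to the number of nodes with that degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

def loglog_slope(histogram):
    """Least-squares slope of log(node count) against log(degree)."""
    xs = [math.log(d) for d in histogram]
    ys = [math.log(c) for c in histogram.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Tiny star graph: one hub with three leaves (illustrative data only).
star = [(0, 1), (0, 2), (0, 3)]
print(degree_histogram(star))

# Synthetic histogram where count is proportional to degree**-2.
synthetic = {1: 10000, 10: 100, 100: 1}
print(round(loglog_slope(synthetic), 3))
```

A roughly straight line on the log-log plot, with slope equal to minus the exponent, is what a power-law network would show; a clearly curved or two-humped histogram would not fit this model.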
[Figure: November 2000 snapshot; number of nodes (log scale) versus number of links (log scale)]
• Implications: high tolerance to random node failures, but low reliability when facing an 'intelligent' adversary

Is Gnutella a power-law network?
• Later, larger networks display a bimodal distribution
• Implications:
  – High tolerance to random node failures preserved
  – Increased reliability when facing an attack
[Figure: May 2001 snapshot; number of nodes (log scale) versus number of links (log scale)]

Overview
• Gnutella protocol
• Network growth
• Structural graph analysis
• Generated network traffic:
  – Traffic estimates
  – Does the Gnutella overlay network topology match the underlying resources?

Traffic analysis
[Figure: message frequency; messages per second (0 to 25) by minute (1 to 364), broken down into Ping, Push, Query, and Other]
• 6-8 kbps per link over all connections
• Traffic structure changed over time

Total generated traffic
• 1Gbps (or 330TB/month)!
  – Compare to 15,000TB/month in US Internet backbone (Dec. 2000)
  – Note that this estimate excludes actual file transfers
  – Q: Does it matter?
• Reasoning:
  – QUERY and PING messages are flooded. They form more than 90% of generated traffic
  – Predominant TTL=7
  – >95% of nodes are less than 7 hops away
  – Measured traffic at each link is about 6 kbps
  – Network with 50k nodes and 170k links
  – Hence 170k links × about 6 kbps per link ≈ 1 Gbps; sustained over a 30-day month, that is about 330 TB

Topology mismatch
• The overlay network topology doesn't match the underlying Internet infrastructure topology!
  – 40% of all nodes are in the 10 largest Autonomous Systems (AS)
  – Only 2-4% of all TCP connections link nodes within the same AS
  – Largely 'random wiring'
  – Entropy experiment gives similar results

Conclusions
• Gnutella: self-organizing, large-scale, P2P application based on overlay network. It works!
• Growth hindered by the volume of generated traffic and inefficient resource use.
• Discovered growth invariants specific to large-scale systems that:
  – Help predict resource usage
  – Give hints for better search and resource organization techniques

Thank you! Questions?

What's next?
• Organize the overlay network to match the underlying infrastructure topology.
• Investigate methods for reducing traffic (query routing/filtering, better information organization).
• Is the Gnutella network a small-world network? What are the implications?

Statistical laws of large-scale systems
• Zipf's law: the size of the r-th largest occurrence of the event is inversely proportional to its rank: $y \sim r^{-b}$, with $b$ close to unity.
• Power-law distributions: the probability distribution of event X is $P[X = x] \sim x^{-k}$
• Pareto distribution: the cumulative probability distribution is $P[X > x] \sim x^{-(k-1)}$
• Zipf, Pareto and power-law distributions are basically different ways to express the same phenomenon

Overview
• Gnutella protocol
• Network growth
• Statistical properties of large-scale systems
  – Power-law distributions
  – Power-law networks
• Generated (overhead) network traffic

Power-law distributions
• Probability distribution of event X is $P[X = x] \sim x^{-k}$
• Present all over WWW and Internet space: the number of HTML pages within a site, visits to a site, links to a page, cache document popularity, etc.

Power-law distributions in Gnutella
• Number of shared files per node
• Query popularity follows a power-law distribution [Kas01]
• Implications:
  – Caching is an effective solution to reduce traffic and query latency
  – New search and node organizing mechanisms!
[Figure: number of nodes (log scale) versus number of files shared (log scale)]
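To see why skewed query popularity makes caching attractive, the sketch below computes the fraction of requests answered by a cache that holds only the most popular queries. It assumes an idealized Zipf popularity with exponent `b` (rank r requested with weight r^-b); the function name, the exponent, and the query counts are illustrative assumptions, not measurements from this study or from [Kas01].

```python
def zipf_hit_rate(num_queries, cache_size, b=1.0):
    """Fraction of requests served by a cache holding the `cache_size` most
    popular of `num_queries` distinct queries, with popularity ~ rank**-b."""
    weights = [rank ** -b for rank in range(1, num_queries + 1)]
    return sum(weights[:cache_size]) / sum(weights)

# Caching the top 1% of 10,000 distinct queries (hypothetical numbers).
skewed = zipf_hit_rate(10_000, 100, b=1.0)   # Zipf-like popularity
uniform = zipf_hit_rate(10_000, 100, b=0.0)  # every query equally popular
print(f"Zipf-like: {skewed:.2f}, uniform: {uniform:.2f}")
```

Under these assumptions the same small cache serves a little over half of all requests when popularity is Zipf-like, against 1% when all queries are equally popular, which is the intuition behind the caching implication above.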