P2P Architecture Case Study: Gnutella Network

Matei Rîpeanu
The University of Chicago
Why analyze the Gnutella network?
• Unprecedented scale
– up to 100k nodes, 100TB data, 10M files today
• Self-organizing network
• Staggering growth
– grew more than 50-fold during the first half of 2001
• Open architecture, simple and flexible protocol
• Interesting mix of social and technical issues
Overview
• Gnutella protocol
• Tools for exploring the network
• Network growth
• Structural graph analysis
– Is Gnutella a power-law network?
• Generated (overhead) network traffic
– Traffic estimates
– Overlay network topology mapping
Gnutella protocol overview
• P2P file sharing application on top of an overlay network
– Nodes maintain open TCP connections
– Messages are broadcast (flooded) or back-propagated
• Protocol:

                 Broadcast (flooding)   Back-propagated   Node to node
Membership       PING                   PONG
Query            QUERY                  QUERY HIT
File download                                             GET, PUSH
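The message table can be written down as a small lookup structure. This is only a sketch of the slide's classification, not of the wire format; the names `Delivery`, `MESSAGES`, and `flooded_types` are illustrative and do not come from the protocol specification.

```python
from enum import Enum

class Delivery(Enum):
    # Delivery modes named on the slide (the descriptions are paraphrases).
    FLOODED = "broadcast to all neighbors"
    BACKPROPAGATED = "returned along the reverse path of the request"
    NODE_TO_NODE = "exchanged directly between two nodes"

# Message type -> (protocol function, delivery mode), as listed on the slide.
MESSAGES = {
    "PING": ("membership", Delivery.FLOODED),
    "PONG": ("membership", Delivery.BACKPROPAGATED),
    "QUERY": ("query", Delivery.FLOODED),
    "QUERY HIT": ("query", Delivery.BACKPROPAGATED),
    "GET": ("file download", Delivery.NODE_TO_NODE),
    "PUSH": ("file download", Delivery.NODE_TO_NODE),
}

def flooded_types():
    """Names of the message types that are flooded through the overlay."""
    return sorted(name for name, (_, mode) in MESSAGES.items()
                  if mode is Delivery.FLOODED)

print(flooded_types())
```

The flooded types are the ones that matter for the overhead traffic estimates later in the talk.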
Gnutella search mechanism
[Figure: animation over a seven-node overlay (nodes 1-7). Node 2 issues a query for file A; replies labeled A:5 and A:7 travel back toward node 2, which then downloads A.]
Steps:
• Node 2 initiates search for file A
• Sends message to all neighbors
• Neighbors forward message
• Nodes that have file A initiate a reply message
• Query reply message is backpropagated
• File download
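The search steps can be simulated in a few lines. This is a minimal sketch under stated assumptions: the slide text does not give the overlay's edges, so the topology below is hypothetical, and nodes 5 and 7 are taken to hold file A only because the reply labels read A:5 and A:7. Duplicate suppression is modeled by remembering the first neighbor each node heard the query from.

```python
from collections import deque

def flood_query(graph, files, origin, wanted, ttl=7):
    """Flood a query from `origin`; return {holder: reply path back to origin}."""
    came_from = {origin: None}   # first neighbor each node heard the query from
    queue = deque([(origin, ttl)])
    hits = {}
    while queue:
        node, remaining = queue.popleft()
        if node != origin and wanted in files.get(node, ()):
            # Back-propagate: the reply retraces the query's path hop by hop.
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            hits[node] = path
        if remaining == 0:
            continue             # TTL exhausted, do not forward further
        for neighbor in graph[node]:
            if neighbor not in came_from:   # drop duplicates of the same query
                came_from[neighbor] = node
                queue.append((neighbor, remaining - 1))
    return hits

# Hypothetical seven-node overlay (edges are an assumption, not from the slide).
graph = {1: [2, 7], 2: [1, 3, 4], 3: [2, 5], 4: [2, 6],
         5: [3, 6], 6: [4, 5], 7: [1]}
files = {5: {"A"}, 7: {"A"}}

hits = flood_query(graph, files, origin=2, wanted="A")
print(hits)
```

Once the reply paths reach node 2, the download itself happens directly between node 2 and one of the holders, outside the flooding mechanism.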
Tools for network exploration
• Eavesdropper: insert modified nodes into the network to eavesdrop on traffic.
• Crawler: connects to all active nodes and uses the membership protocol to discover graph topology.
– Client-server approach.
• Graph analysis tools
– High-volume offline computations.
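The crawler idea can be sketched as a breadth-first traversal. The real crawler's design is not described beyond the bullets above, so this is only an assumption-laden illustration: `ask_neighbors` is a hypothetical callback standing in for a membership (PING/PONG) exchange with one node, and here it is backed by an in-memory dictionary rather than a network.

```python
from collections import deque

def crawl(seed, ask_neighbors):
    """Breadth-first crawl from `seed`; returns (nodes seen, undirected links)."""
    seen = {seed}
    links = set()
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for peer in ask_neighbors(node):       # hypothetical membership query
            links.add(frozenset((node, peer)))  # store each link once
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen, links

# Toy overlay used in place of live nodes (purely illustrative).
toy_overlay = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
nodes, links = crawl("a", toy_overlay.__getitem__)
print(len(nodes), "nodes,", len(links), "links")
```

The resulting node and link sets are the input that offline graph analysis tools would work on.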
Network growth
[Figure: "Gnutella Network Growth". Y-axis: number of nodes in the largest network component ('000), up to 50. X-axis: measurement dates from 11/20/00 to 05/29/01.]
• High user interest
– Users tolerate high latency, low quality results
• Better resources
– DSL and cable modem nodes grew from 24% to 41% over first 6 months. Today >50%.
• Open architecture / open-source environment
– Competing implementations
– Lower overhead network traffic, improved resource utilization, better structure
Growth invariants (1): avg. node connectivity
• 3.4 links per node on average
[Figure: number of links ('000), 0 to 200, versus number of nodes, 0 to 50,000.]
Growth invariants (2): network diameter
• Node-to-node distance maintains similar distribution
• Average node-to-node distance grew 25% while the network grew 50 times over 6 months
[Figure: percent of node pairs (%), 0% to 50%, versus node-to-node shortest path (hops), 1 to 12.]
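A distribution like the one plotted can be computed from a crawled graph with one breadth-first search per node. This is a sketch on a toy graph, not the tooling used for the measurements; `hop_distribution` and `average_hops` are illustrative names, and only connected pairs are counted.

```python
from collections import Counter, deque

def hop_distribution(graph):
    """Return {hops: fraction of connected ordered node pairs at that distance}."""
    counts = Counter()
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:                       # plain BFS from `source`
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        counts.update(d for d in dist.values() if d > 0)
    total = sum(counts.values())
    return {hops: n / total for hops, n in sorted(counts.items())}

def average_hops(distribution):
    """Mean shortest-path length implied by a hop distribution."""
    return sum(hops * share for hops, share in distribution.items())

# Toy path graph 1-2-3-4 (illustrative only).
path_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
dist = hop_distribution(path_graph)
print(dist, average_hops(dist))
```

Running this on snapshots of different sizes is one way to check whether the distribution keeps its shape as the network grows.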
Is Gnutella a power-law network?
Power-law networks: the number of links per node follows a power-law distribution
Examples:
• the Internet,
• in/out links to/from HTML pages,
• citation network,
• US power grid,
• social networks.
[Figure: November 2000. Number of nodes (log scale, 1 to 10,000) versus number of links (log scale, 1 to 100).]
Implications: High tolerance to random node failure but low reliability when facing an ‘intelligent’ adversary
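One rough way to test for a power law is to check whether the degree histogram is close to a straight line on log-log axes. The sketch below fits that line by ordinary least squares, which is a crude estimator and not necessarily what was used for the plots; the histogram is synthetic and the function names are illustrative.

```python
import math
from collections import Counter

def degree_histogram(graph):
    """Map each degree to the number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in graph.values())

def loglog_slope(histogram):
    """Least-squares slope of log(node count) against log(degree)."""
    points = [(math.log(d), math.log(c))
              for d, c in histogram.items() if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Synthetic histogram built to follow count = 10000 * degree**-2 exactly.
toy_histogram = {1: 10000, 10: 100, 100: 1}
slope = loglog_slope(toy_histogram)
print(round(slope, 6))
```

A histogram that bends away from a single straight line, as in a bimodal distribution, would show up as a poor fit.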
Is Gnutella a power-law network?
• Later, larger networks display a bimodal distribution
• Implications:
– High tolerance to random node failures preserved
– Increased reliability when facing an attack.
[Figure: May 2001. Number of nodes (log scale, 1 to 10,000) versus number of links (log scale, 1 to 100).]
Overview
• Gnutella protocol
• Network growth
• Structural graph analysis
• Generated network traffic:
– Traffic estimates
– Does the Gnutella overlay network topology match the underlying resources?
Traffic analysis
[Figure: "Message Frequency". Messages per second (0 to 25) versus time in minutes (1 to 364), with series for Ping, Push, Query, and Other.]
• 6-8 kbps per link over all connections
• Traffic structure changed over time
Total generated traffic
1Gbps (or 330TB/month)!
– Compare to 15,000TB/month in US Internet backbone (Dec. 2000)
– Note that this estimate excludes actual file transfers
– Q: Does it matter?
Reasoning:
• QUERY and PING messages are flooded. They form more than 90% of generated traffic
• predominant TTL=7
• >95% of nodes are less than 7 hops away
• measured traffic at each link about 6 kbps
• network with 50k nodes and 170k links
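The headline figure follows from the last two numbers in the reasoning. The arithmetic below is a sketch assuming 1 kbps = 1,000 bit/s, a 30-day month, and decimal terabytes; those unit conventions are assumptions, not stated on the slide.

```python
LINKS = 170_000            # links in the measured network (from the slide)
KBPS_PER_LINK = 6          # measured overhead traffic per link (from the slide)
BACKBONE_TB_PER_MONTH = 15_000   # US backbone figure quoted on the slide

SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day month

total_bps = LINKS * KBPS_PER_LINK * 1000          # assumption: 1 kbps = 1000 bit/s
total_gbps = total_bps / 1e9
tb_per_month = total_bps / 8 * SECONDS_PER_MONTH / 1e12   # decimal terabytes
backbone_share = tb_per_month / BACKBONE_TB_PER_MONTH

print(f"{total_gbps:.2f} Gbps, {tb_per_month:.0f} TB/month, "
      f"{backbone_share:.1%} of the quoted backbone volume")
```

Under these assumptions the overhead traffic alone comes to roughly 2% of the quoted backbone volume, which is one way to approach the question of whether it matters.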
Topology mismatch
The overlay network topology doesn’t match the underlying Internet infrastructure topology!
• 40% of all nodes are in the 10 largest Autonomous Systems (AS)
• Only 2-4% of all TCP connections link nodes within the same AS
• Largely ‘random wiring’
• Entropy experiment gives similar results
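Both percentages above are simple ratios over a crawled graph once each node has been mapped to its AS. The sketch below shows the two computations on toy data; the node names, AS labels, and function names are hypothetical, and how nodes were actually mapped to ASes is not covered here.

```python
from collections import Counter

def intra_as_fraction(links, as_of):
    """Fraction of overlay links whose two endpoints sit in the same AS."""
    same = sum(1 for a, b in links if as_of[a] == as_of[b])
    return same / len(links)

def share_in_largest(as_of, top=10):
    """Fraction of nodes that belong to the `top` most populated ASes."""
    sizes = Counter(as_of.values())
    return sum(n for _, n in sizes.most_common(top)) / len(as_of)

# Toy data (illustrative only): four nodes in three ASes, four overlay links.
as_of = {"n1": "AS1", "n2": "AS1", "n3": "AS2", "n4": "AS3"}
links = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]

print(intra_as_fraction(links, as_of), share_in_largest(as_of, top=1))
```

A high concentration of nodes in a few ASes combined with a low intra-AS link fraction is what the slide summarizes as random wiring.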
Conclusions
• Gnutella: self-organizing, large-scale, P2P application based on overlay network. It works!
• Growth hindered by the volume of generated traffic and inefficient resource use.
• Discovered growth invariants specific to large-scale systems that:
– Help predict resource usage
– Give hints for better search and resource organization techniques.
Thank you!
Questions?
What’s next?
• Organize the overlay network to match the underlying infrastructure topology.
• Investigate methods for reducing traffic (query routing/filtering, better information organization).
• Is the Gnutella network a small-world network? What are the implications?
Statistical laws of large-scale systems
• Zipf’s law: the size of the r-th largest occurrence of the event is inversely proportional to its rank: $y \sim r^{-b}$, with $b$ close to unity.
• Power law distributions: the probability distribution of event X is $P[X = x] = x^{-k}$
• Pareto distribution: the cumulative probability distribution is $P[X > x] = x^{-(k-1)}$
Zipf, Pareto and power-law distributions are basically different ways to express the same phenomenon
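The link between the three forms can be made explicit with a short derivation. This is a sketch using a continuous approximation with a normalizing constant $C$ (an assumption; the slide omits constants), and it reads the rank $r$ of a value $y$ as proportional to the probability of exceeding $y$.

```latex
\begin{align*}
P[X > x] &= \int_x^{\infty} C\,t^{-k}\,dt = \frac{C}{k-1}\,x^{-(k-1)}, \qquad k > 1,\\
r &\propto P[X > y] \propto y^{-(k-1)}
  \;\Longrightarrow\; y \propto r^{-1/(k-1)}, \qquad b = \frac{1}{k-1}.
\end{align*}
```

Under this reading, a Zipf exponent $b$ close to unity corresponds to a power-law exponent $k$ close to 2.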
Overview
• Gnutella protocol
• Network growth
• Statistical properties of large-scale systems
– Power-law distributions.
– Power-law networks.
• Generated (overhead) network traffic.
Power-law distributions
Probability distribution of event X is $P[X = x] = x^{-k}$
Present all over WWW and Internet space: the number of HTML pages within a site, visits to a site, links to a page, cache document popularity, etc.
Power-law distributions in Gnutella
• Number of shared files per node
• Query popularity follows a power-law distribution [Kas01]
• Implications:
– Caching is an effective solution to reduce traffic and query latency
– New search and node organizing mechanisms!
[Figure: number of nodes (log scale, 1 to 1,000) versus number of files shared (log scale, 1 to 100,000).]
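The caching implication can be illustrated with a small simulation: when query popularity is heavily skewed, a cache holding a small fraction of the distinct queries still answers a large share of requests. The sketch below is an assumption-laden toy, not a measurement; the Zipf exponent, the LRU policy, and all sizes are chosen for illustration only.

```python
import random
from collections import OrderedDict
from itertools import accumulate

def zipf_queries(n_queries, n_distinct, exponent=1.0, seed=0):
    """Draw query ids whose popularity falls off as rank**-exponent."""
    rng = random.Random(seed)
    weights = [1 / rank ** exponent for rank in range(1, n_distinct + 1)]
    return rng.choices(range(n_distinct),
                       cum_weights=list(accumulate(weights)), k=n_queries)

def lru_hit_rate(queries, capacity):
    """Fraction of queries answered from an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for query in queries:
        if query in cache:
            hits += 1
            cache.move_to_end(query)        # mark as most recently used
        else:
            cache[query] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used
    return hits / len(queries)

# Illustrative sizes: 1,000 distinct queries, cache holds 10% of them.
stream = zipf_queries(n_queries=20_000, n_distinct=1_000)
rate = lru_hit_rate(stream, capacity=100)
print(f"hit rate with a 10% cache: {rate:.2f}")
```

With uniformly popular queries the same cache would be expected to hit only about 10% of the time, so the gap between the two figures is the benefit attributable to the skew.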