Mapping the Gneutella Network

advertisement
Mapping the Gnutella Network
Presented By:
Tony Young
M.Math Candidate
October 7th, 2004
Outline
Introduction
 Gnutella in Depth
 The Crawler
 Analysis of Network
 Summary and Improvements
 Paper Review

Outline
Introduction
 Gnutella in Depth
 The Crawler
 Analysis of Network
 Summary and Improvements
 Paper Review

Introduction
Peer to peer systems have recently
exploded onto the internet scene
 Two main contributing factors:

Low cost and high availability of resources
(computing and storage)
 Increased network connectivity (proliferation
of “always on” connections)

Introduction


Peer systems build a virtual topology (overlay)
with its own routing mechanisms
The topology of the overlay and routing
protocols directly affects




Performance: Number of physical hops to send a
message through virtual overlay
Reliability: Will a message actually reach the other
end
Scalability: Can other nodes be added while keeping
performance good
Anonymity: Can we protect the identity of nodes in
the network
Introduction




Gnutella is studied in depth and analysis is
performed to determine how the overlay affects
the four characteristics previously mentioned
Started by capturing the network topology and
behaviour
Performed a macroscopic analysis of the
network to evaluate costs and benefits
Investigated possible improvements
Introduction

Two questions drive analysis
What is the connectivity structure of
Gnutella?
 How well does the Gnutella overlay map to
the actual network topology?

Introduction

Connectivity Structure

Networks as diverse as natural networks
usually have a few well connected nodes and
many poorly connected nodes
• I.e. Power Law Networks

We will see Gnutella is not a pure power law
network, but still has good fault tolerance and
is less resistant to DoS attacks
Introduction

Overlay Topology
Important for ISP’s: overlays that don’t map
closely to the physical topology adds
additional stress on the infrastructure and
costs ISP’s more money
 Scalability is directly linked to efficient use of
network resources

Outline
Introduction
 Gnutella in Depth
 The Crawler
 Analysis of Network
 Summary and Improvements

Gnutella in Depth



Gnutella is an open protocol
It is decentralized and unstructured
Allows group membership and searching of
available files for download




Gnutella should operate in a dynamic environment
where hosts can join/leave at any time
Gnutella should experience good performance and
scalability
External attacks should not cause data loss or
performance degradation
Users seeking or providing unpopular material
should stay anonymous
Gnutella in Depth

Gnutella nodes are called “servents”
(SERVer-cliENTS)
Provide a client-side interface to allow
searching of file base
 Provide server-side storage, routing and
response to network messages and requests

Gnutella in Depth
To connect, a node contacts an “always
on” host (I.e. gnutella.com) and sends a
PING
 Node replies with a PONG and forwards
the PING on to other nodes in the network
who reply with PONG messages and
forward the PING on


PING stops after TTL hops
Gnutella in Depth

To find files, users submit QUERY
messages to other nodes
Messages are broadcast to all neighbours
who forward them on to other neighbours,
etc. for TTL hops
 QUERY RESPONSE messages are returned
to the querying node

Gnutella in Depth

To download a file, nodes send GET and
PUSH messages to individual hosts
holding a file

I.e. transfer requests and transfers are routed
directly between communicating hosts, and
not back-propagated
Gnutella in Depth

Messaging protocol has three important
features



TTL and “hops passed” fields are attached to each
message
Randomly generated message ID is attached to
each message
Each node keeps track of recently routed messages
to prevent re-broadcasting and to implement backpropagation
Gnutella in Depth
PING message contains the host address
and name, number of files and size of
data store
 PONG message contains the same
information from the host that received
the PING

Gnutella in Depth

PING messages propagate until TTL has
expired
Hop count incremented at each servent
receiving the PING
 Message propagates until hop count = TTL


PONG messages are back-propagated
(I.e. sent on the reverse path that the
original message followed) to the host
initiating the PING
Gnutella in Depth

QUERY messages are sent the same
way as a PING message


Nodes check the search string requested
against the names of their locally stored files
QUERY RESPONSE messages are backpropagated to the querying node and
include information necessary to
download the file
Outline
Introduction
 Gnutella in Depth
 The Crawler
 Analysis of Network
 Summary and Improvements
 Paper Review

The Crawler




In order to conduct the network tests, a crawler
was developed to gather information about the
virtual topology
Crawler starts with a list of active nodes and
sends a PING message to each of them
PONG messages are received and the IP, port,
number of stored files and size of archive are
stored in a table
PING propagates to other nodes and PONG
back propagates to crawler
The Crawler

A sequential version of the crawler was initially
developed



I.e. send a PING with an empirically determined
optimal TTL to a set of nodes; resend to the nodes
where the PING stops, etc.
Proved to be very slow: 50 hours to collect data
from a 4 000 node network
Slowness means two things:


Not scalable: Will get slower as we add more nodes
Does not give an accurate network snapshot:
network changes drastically over 50 hours!
The Crawler

A distributed crawler was developed next




Client-Server architecture
Server maintains node list and creates a network
graph
Clients receive a list of nodes to contact and
discover neighbours for
Decided to use only 50 clients at once


Reduces invasiveness of search and consumption of
network resources
Reduced crawling time to a couple of hours for a
large initial list and a network of 30 000 nodes
The Crawler

Network membership is defined as follows
A node is a member of the network if the
crawler is able to connect to it
 A node might be excluded from network
membership if it was reported as active by a
server or other node, but the crawler could
not contact it

• This might happen if nodes go offline before the
crawler can contact them
Outline




Introduction
Gnutella in Depth
The Crawler
Analysis of Network






Growth Trends
Traffic Estimates
Connectivity and Reliability
Overlay vs. Topology
Summary and Improvements
Paper Review
Analysis of Network
Data was collected over a 6 month period
 Data shows:

Overhead traffic is reducing
 Traffic volume is a significant barrier to
growth

Growth Trends

Size of network is growing rapidly
Largest connected component in November
2000 had 2 063 neighbours
 Largest connected component in May 2001
had 48 195 neighbours!
 Number of neighbours for the largest
connected component has grown 25 times!

Growth Trends
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Growth Trends
Despite the explosive growth, most nodes
are not connected long
 Successive crawls of the network found:

40% of nodes leave the network in less than
4 hours
 25% of nodes are alive for more than 24
hours

Traffic Estimates

A modified version of the crawler recorded
traffic generated across one randomly
chosen link
36% of total traffic (in bytes) is user
generated QUERY messages
 55% is group membership (PING/PONG)
messages
 9% is non-standard or malformed messages
 N.B. File transfer traffic is excluded

Traffic Estimates

After June 2001 (when new Gnutella
implementation was released)
92% of total traffic (in bytes) was QUERY
messages
 8% is group membership (PING/PONG)
messages
 N.B. File transfer traffic is excluded

Traffic Estimates

95% of all nodes are reachable within 7 hops.



Thus, each message typically uses a TTL = 7
Most links are expected to support similar amounts
of traffic for these reasons
As verified empirically, the total Gnutella
generated traffic is proportional to the number
of connections in the network

However, the average number of connections per
node stays relatively constant as the network grows
Traffic Estimates
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Traffic Estimates
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Traffic Estimates

The total traffic estimate for the Gnutella
network is 1 Gbps
I.e. 170 000 connections for a 50 000 node
network times 6 kbps per connection
 This is approximately 330 TB/month!
 Excluding file transfers!

Traffic Estimates
This total is 1.7% of the total internet
traffic in US backbones in December
2000
 This volume of traffic is believed to be an
obstacle to further growth
 The underlying network topology must be
used more efficiently to allow scaling and
wider deployment

Connectivity and Reliability

Note: Nodes decide locally:
How many connections to support
 When to add or drop a connection


Recent research shows that many natural
systems organize themselves into “power
law networks”

I.e. networks where a few nodes are well
connected and most nodes have very few
connections
Connectivity and Reliability

Power law networks:
Number of nodes with L links (connections) is
proportional to L-k where k is systemdependent
 Resilient to losing many poorly connected
nodes
 Falls apart quickly if only a few well
connected nodes are lost
 Extremely robust to random failures, but
vulnerable to targeted attacks

Connectivity and Reliability

Power law networks appear as a linear
system on a log-log plot
Data for December 2000 shows that early
Gnutella networks were power law
 Data for March 2001 shows that later
Gnutella networks are a mixture

• There are a constant number of nodes with fewer
than 10 links
• Above 10 links, nodes follow a power law
structure
Connectivity and Reliability
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Connectivity and Reliability
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Connectivity and Reliability
Why did the distribution change?
 Two possible reasons:

About 20% of Gnutella users have modem
connections - DSL and up can support more
connections
 Gnutella users run as many connections as
their network can support - perception is that
more connections = better query results

Connectivity and Reliability

Does the change in distribution affect
reliability? Yes!
Preserves resilience to random failures
 Makes network less dependent on well
connected nodes and hence less prone to
DoS attacks

Overlay vs. Topology

Peer systems change the way bandwidth
is used on the internet


Servers are at the edge of the network now,
and peers are constantly downloading
Most ISP’s use flat-rate billing

Peer systems may break this model!
Overlay vs. Topology

Due to the amount of traffic peer systems
generate, efficient use of resources is
important
The greater the mismatch between the
overlay and the physical network topology,
the more messages need to be transmitted to
route information from A to B
 This means more stress on the network
resources

Overlay vs. Topology

Communication from A to all other nodes
requires one message over the D - E link
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Overlay vs. Topology

Communication from A to all other nodes
requires six messages over the D - E link
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Overlay vs. Topology

How well does Gnutella map to the
topology?
Assume that domain names are roughly
evident of the hierarchy of the internet
 Check how well generated traffic maps to the
cluster of domain names found by the
crawler

Overlay vs. Topology
After analysis of 10 overlays, it was found
that Gnutella nodes often connect to
peers outside of their respective domains
 Thus, it appears that Gnutella does not
make efficient use of the underlying
topology

Outline
Introduction
 Gnutella in Depth
 The Crawler
 Analysis of Network
 Summary and Improvements
 Paper Review

Summary and Improvements

Gnutella has a multimodal connectivity
distribution that is partially constant and partially
power law



Network is resilient to random failures
Network is harder to attack by malicious parties, but
not immune to DoS attacks
Gnutella makes little effort to ward off attackers

E.g. topology, connectivity and traffic information is
easy to obtain and can be used to plan attacks
Summary and Improvements

Gnutella’s traffic volume is a significant
fraction of all internet traffic


Makes the future growth of the network
reliant on efficient use of the topology
Gnutella’s overlay does not match the
network topology very well

This increases quite substantially the number
of messages and the amount of network
traffic generated
Summary and Improvements

Necessary improvements
Make efforts to hide overlay and connectivity
information (encryption?)
 Match overlay more closely with topology
 Limits to growth must be solved first and fast
at the rate that Gnutella is growing

Summary and Improvements

Suggested Improvements
Exploit locality of files and query distribution
(I.e. caching and localized queries)
 Replace query flooding strategy with
something more efficient (I.e. superpeer
routing and group communication)

Outline
Introduction
 Gnutella in Depth
 The Crawler
 Analysis of Network
 Summary and Improvements
 Paper Review

Paper Review

Organization
Some discussions of the Gnutella
architecture and protocols were scattered
throughout the paper
 Should have combined everything into a
more logical order inside the protocol section


Writing Style

Generally very good. Some missing words
and poor grammar
Paper Review

Novel Ideas


Presented a qualitative and quantitative
analysis of the Gnutella network, and some
important points for P2P as a whole
Content
Some backing information was missing
 Some claims were made without supporting
evidence, or just referring the reader to
another paper

Questions?
Download