Approaches to Making Peer-to-Peer Networks Scalable
Miriam Gobel
Student # 152520
239 -103 Cumberland Avenue
(306) 955-6670
mig254@mail.usask.ca
David Ford
Student # 286942
308 McKinnon Avenue
(306) 668-2049
drf928@mail.usask.ca
Lawrence Cheng
Student # 312383
135 Roborecki Crescent
(306) 242-1868
lac387@mail.usask.ca
ABSTRACT
With the escalating number of peer-to-peer file sharing users, there is an ever-increasing need for a distributed algorithm that would allow peer-to-peer applications to scale to a large community of clients. The current problem in designing such an algorithm is that very little is known about the network configurations on which it would be implemented. Even seemingly simple protocols, such as Gnutella's decentralized approach, result in complex interactions that directly affect the system's overall performance. In this paper we discuss the various types of peer-to-peer applications and their corresponding protocols. We also discuss the advantages and disadvantages of each of these protocols and how they affect the scalability of the P2P network. Our goal is to learn more about the inner workings of these peer-to-peer networks in order to develop a distributed algorithm …
1. INTRODUCTION
This paper aims at finding optimal solutions for making peer-to-peer network applications scalable. Over the course of this paper we will frequently refer to the different types of peer-to-peer applications, so we thought it would be beneficial to include definitions of the five basic types of P2P protocols according to …[cite paper here]:

Centralised – this is what we might also classify as a hybrid peer-to-peer network. In this model, resources are spread out across the peers, but a central server maintains an index of those resources.

Decentralised, strongly structured – this is where there
is a lack of a central server, but resources are located at
peers in such a way that makes it easy to find them, and
peers themselves have some form of structure that
makes routing between them easy. See the section on
Pastry for the pros and cons of this sort of network.

Decentralised, weakly structured – again, there is a lack
of a central server, this time resources are placed
according to hints on the network. See the section on
Freenet.

Decentralised, unstructured – this is a P2P network in
which resources are randomly placed at nodes and
routing takes place randomly between them. Gnutella is
an example of this.

Decentralised, unstructured optimized – again, resources are randomly distributed at nodes, but this time some order is imposed on the routing. Generally, this will take the form of supernodes and the caching of data in the network. Various extensions to the Gnutella protocol form this sort of network.
Figure 1 - Centralised and Decentralised Protocols
The figure above shows diagrams of both centralized and
decentralized protocols.
2. OPTIMIZED GNUTELLA
The original Gnutella protocol was introduced as a decentralized, unstructured P2P file sharing network running over TCP [Jacklin, 9]; since then, however, the network topology has improved significantly [Jacklin, 11]. Its makeup is still decentralized, TCP-based and unstructured, but recent advances have reduced scalability problems by introducing supernodes, peers with greater-than-average bandwidth that take on extra responsibilities (such as caching search results) [Jacklin, 11], and by having Gnutella servents cache connection (PONG) messages [Blundell, 3]. These servents constantly generate QUERY messages, which account for 92% of all Gnutella traffic [Matei, 6], because Gnutella's current query scheme is implemented via message flooding [Rohrs].
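To make the flooding-based QUERY scheme concrete, the following sketch shows how a servent might forward an unseen query to all of its neighbours until a time-to-live (TTL) runs out. This is our own minimal illustration, not the actual Gnutella wire protocol; the Node class and its fields are invented for this example.

```python
# Minimal sketch of TTL-limited query flooding (illustrative only; not the
# real Gnutella wire format). Each node forwards an unseen QUERY to all of
# its neighbours until the TTL runs out.

class Node:
    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbours = []       # connected servents
        self.seen = set()          # message ids already handled

    def query(self, msg_id, keyword, ttl, hits):
        if msg_id in self.seen or ttl == 0:
            return
        self.seen.add(msg_id)
        if any(keyword in f for f in self.files):
            hits.append(self.name)              # would be a QUERYHIT message
        for peer in self.neighbours:
            peer.query(msg_id, keyword, ttl - 1, hits)

# Tiny example topology: a -- b -- c
a, b, c = Node("a"), Node("b", ["song.mp3"]), Node("c", ["other.mp3"])
a.neighbours, b.neighbours, c.neighbours = [b], [a, c], [b]

hits = []
a.query(msg_id=1, keyword="song", ttl=7, hits=hits)
print(hits)   # ['b']
```

Because every query fans out to every neighbour, total query traffic grows with the number of connections, which is exactly the scalability problem discussed in the remainder of this section.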
2.1 Environment
To say that human nature is unpredictable would not surprise anyone. Consequently, the unstructured nature of Gnutella, and in fact of any unstructured P2P network, is a practical fit when scalable systems are desired. Given the continuous increase in users, a structured overlay network would exhaust itself simply attempting to discover node failures and then replicating lost data or pointers [Chawathe, 408]. In contrast, Gnutella-like networks adapt effortlessly to unexpected disconnects because of their lack of state and the randomness of both node and resource placement [Jacklin, 4]. If a Gnutella servent is disconnected, it can easily rejoin the network and rebuild state such as its PONG cache.
2.2 PONG Caching
PONG caching is a significant improvement over the once overwhelming volume of PING/PONG messages traversing the network (55% of all Gnutella traffic [Blundell, 3]). However, a clarification should be considered. To maintain an up-to-date PONG cache, a servent periodically broadcasts PING messages to all of its peers [Blundell, 3]. It is then questionable whether servents can "feel" when the network is "bursty" and when it is "slow": for example, if the network is bursty, do periodic PONG cache updates adjust to a higher frequency? The implication is that servents would be required to take on additional responsibilities in what is otherwise a loosely managed and unstructured setting.
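As a rough illustration of the trade-off discussed above, the sketch below shows a PONG cache whose refresh interval is a single fixed parameter; whether and how that parameter should adapt to bursty or slow periods is exactly the open question raised in this subsection. The class and field names are ours, not part of the Gnutella specification.

```python
import time

# Sketch of a PONG cache (illustrative; names and parameters are ours, not
# from the Gnutella specification). The servent answers incoming PINGs from
# the cache and only re-broadcasts PINGs to its peers when the cache is stale.

class PongCache:
    def __init__(self, refresh_interval=30.0):
        self.refresh_interval = refresh_interval   # the knob discussed above
        self.entries = {}                          # peer address -> pong info
        self.last_refresh = 0.0

    def needs_refresh(self, now=None):
        now = time.time() if now is None else now
        return now - self.last_refresh > self.refresh_interval

    def refresh(self, peers, now=None):
        # In a real servent this would send PING messages; here we just
        # record whatever the peers report.
        self.entries = {p["address"]: p for p in peers}
        self.last_refresh = time.time() if now is None else now

    def answer_ping(self, max_pongs=10):
        return list(self.entries.values())[:max_pongs]

cache = PongCache(refresh_interval=30.0)
if cache.needs_refresh():
    cache.refresh([{"address": "10.0.0.2:6346", "shared_files": 120}])
print(cache.answer_ping())
```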
2.3 Supernodes
Could the notion of supernodes be a solution to the aforementioned question? Perhaps not. If it were, the entire Gnutella P2P file sharing network would consist of supernodes. This is not to imply that supernodes serve no purpose; in fact, they are quite useful, as demonstrated by KaZaA and Morpheus, where popular music is quicker and easier to find while the network persists for as long as a single peer exists [ ]. It is well known that the centralized and decentralized network models each have significant advantages; it is their respective disadvantages that fuel the debate over which is "better". The intuitive concept of the hybrid node (supernode) is a cross between the two systems that keeps their advantages while addressing a few of the disadvantages.
Decentralized systems make it hard to find desired files accurately without distributing queries widely [Lv, 1]. Since supernodes are responsible for caching popular search results, the number of queries that must be further distributed to their children is significantly reduced [Jacklin, 11]. Centralized systems, or focal points, are always highly vulnerable to attacks, whether random or targeted, as well as to self-failure. Supernodes are designed and implemented in such a manner that failure, regardless of the reason, does not severely disturb the vast majority of neighboring peers.
It is therefore clear that supernodes are advantageous in many respects over both purely centralized and purely decentralized systems; however, they too are constrained by a drawback. If we assume that most query messages are for "popular" files (most queries are for hay, not needles [Chawathe, 2]), then supernodes, although greater in performance and efficiency than "average" nodes, will eventually be overloaded with query messages. This is because the number of peers will keep increasing whereas bandwidth capacity will remain relatively the same (at this moment, the future of bandwidth is uncertain).
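The following sketch illustrates the result-caching role of a supernode described above: popular queries are answered from the supernode's cache, and only cache misses are fanned out to its leaf peers. It is our own simplification, not the KaZaA or Morpheus implementation.

```python
# Sketch of a supernode that caches popular search results and forwards
# only cache misses to its leaf peers (illustrative; not taken from any
# real supernode implementation).

class SuperNode:
    def __init__(self, children):
        self.children = children        # leaf peers: name -> set of files
        self.result_cache = {}          # keyword -> list of (peer, file)

    def search(self, keyword):
        if keyword in self.result_cache:             # popular query: no fan-out
            return self.result_cache[keyword]
        hits = [(peer, f)
                for peer, files in self.children.items()
                for f in files if keyword in f]
        self.result_cache[keyword] = hits             # cache for later queries
        return hits

sn = SuperNode({"leaf1": {"pop_song.mp3"}, "leaf2": {"rare_track.ogg"}})
print(sn.search("pop_song"))   # forwarded to the leaves, then cached
print(sn.search("pop_song"))   # answered from the cache
```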
2.4 QUERY Scheme
As a textbook theory, the message flooding technique seems quite sufficient: simply send messages to everyone and then wait patiently for a reply. However, bandwidth limitations have placed a burden on this technique, and on unstructured networks in general. Message flooding might be an acceptable solution if resources and peers were placed with some locality, given that 95% of node pairs are fewer than 7 hops apart [Matei]. Otherwise, with randomly placed peers, network traffic becomes increasingly unscalable.
It is apparent that sending QUERY messages is problematic and is proving to be more complex than simply caching search results [Blundell, 2]. As a result, many innovative file searching techniques have been introduced, such as query caching, expanding ring search, random walks, heterogeneity and flow control, query routing, and mapping the overlay efficiently onto the underlying Internet topology [Blundell]. Another example is implementing UDP instead of TCP: Gnutella2, a specification of Gnutella as a UDP protocol, had been specified but, as of early May 2003, had not yet been implemented [Jacklin].
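Of the techniques listed above, expanding ring search is the easiest to illustrate: instead of always flooding with the maximum TTL, a servent retries the same query with a gradually increasing TTL and stops as soon as it receives results. The sketch below is our own illustration under that assumption, not code from any Gnutella servent.

```python
# Sketch of expanding ring search (our illustration): retry the same query
# with an increasing TTL and stop as soon as results come back, instead of
# always flooding at the maximum TTL.

def flood(graph, files, start, keyword, ttl):
    """Breadth-first flood limited by ttl; returns peers holding matches."""
    frontier, seen, hits = [start], {start}, []
    for _ in range(ttl):
        next_frontier = []
        for node in frontier:
            if any(keyword in f for f in files.get(node, ())):
                hits.append(node)
            for peer in graph.get(node, ()):
                if peer not in seen:
                    seen.add(peer)
                    next_frontier.append(peer)
        frontier = next_frontier
    return hits

def expanding_ring(graph, files, start, keyword, ttls=(1, 2, 4, 7)):
    for ttl in ttls:                 # widen the ring until something is found
        hits = flood(graph, files, start, keyword, ttl)
        if hits:
            return ttl, hits
    return None, []

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
files = {"c": ["needle.mp3"]}
print(expanding_ring(graph, files, "a", "needle"))   # (4, ['c'])
```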
3. RDF- AND SCHEMA-BASED PEER-TO-PEER SYSTEMS
3.1 Approach
The approach of schema-based P2P networks (we will discuss mainly Edutella here) is to extend the representation and query functionalities offered by P2P networks. This is done by building data management infrastructures on top of traditional P2P networks as well as on distributed and heterogeneous database research. We are currently seeing first explorations toward true P2P data management infrastructures which will have all the characteristics of P2P systems: local control of data, dynamic addition and removal of peers, only local knowledge of available data and schemas, and self-organization and self-optimization.
The building blocks for any schema-based P2P system like Edutella are:

a schema language to define and use the various schemas which specify the kind of data available in the P2P network

a query language to retrieve the data stored in the P2P network

a network topology combined with an appropriate query routing algorithm to allow efficient queries

facilities to integrate heterogeneous information stored in the P2P network
Edutella itself uses designated super-peers (unlike, for example, Kazaa, whose super-peers are not explicitly designated) to route queries. The super-peers:

employ routing indices, which explicitly acknowledge the semantic heterogeneity of schema-based P2P networks, and therefore include schema information as well as other possible index information

are responsible for message routing and integration/mediation of metadata

The indices are updated when a peer connects to a super-peer, and contain all necessary information about connected peers. Entries are valid only for a certain time, and are deleted when the peer does not renew or update them regularly. Peers notify their super-peers when their content changes.

Edutella introduces super-peer/super-peer (SP/SP) routing indices to route queries among the super-peers. These SP/SP routing indices:

contain the same kind of information as the SP/P indices

refer only to the direct neighbors of a super-peer

are not replicated versions of a central index, but rather parts of a distributed index, similar to routing indices in TCP/IP networks

A small sketch of such a routing index follows below.
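The sketch below is our simplified picture of such a routing index: entries map schema-level information (here just property names) to the peers that can answer queries involving them, and entries expire unless the peer renews its registration, as described above. The class and property names are our own, not Edutella's.

```python
import time

# Simplified sketch of a super-peer routing index (our illustration of the
# SP/P and SP/SP indices described above). Entries map schema information,
# here just property names such as "dc:title", to peers that can answer
# queries using that property, and expire unless the peer renews them.

class RoutingIndex:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.entries = {}   # property name -> {peer id -> expiry time}

    def register(self, peer, properties, now=None):
        now = time.time() if now is None else now
        for prop in properties:                      # sent when a peer connects
            self.entries.setdefault(prop, {})[peer] = now + self.ttl

    def peers_for(self, prop, now=None):
        now = time.time() if now is None else now
        live = {p: exp for p, exp in self.entries.get(prop, {}).items() if exp > now}
        self.entries[prop] = live                    # drop entries not renewed
        return set(live)

index = RoutingIndex()
index.register("peer-17", ["dc:title", "dc:creator"])
print(index.peers_for("dc:title"))    # {'peer-17'}
```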
3.1.1 Searching in Edutella
Querying the network basically works in two routing steps:

1. Propagation of the query to those concept clusters that contain peers at which the query is aiming.

2. Within all of these concept clusters, a broadcast is carried out.

For the creation of the clusters, many concepts have been introduced, each of which seems to have advantages as well as disadvantages. One approach is to group peers with similar interests together, which is very intuitive. Another interesting way is to find peers who searched for similar things in the past and cluster those peers together. A minimal sketch of the two routing steps follows below.
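The following sketch illustrates the two routing steps above under the simplifying assumption that clusters are just named groups of peers; it is our own illustration, not Edutella code.

```python
# Minimal sketch of the two routing steps above (our illustration): the
# query is first propagated only to the concept clusters it targets, and
# then broadcast to every peer inside those clusters.

clusters = {
    "physics": ["peerA", "peerB"],
    "biology": ["peerC"],
    "history": ["peerD", "peerE"],
}

def route_query(query_concepts, clusters):
    results = []
    # Step 1: select only the clusters the query is aiming at.
    targets = [c for c in clusters if c in query_concepts]
    # Step 2: broadcast within each selected cluster.
    for cluster in targets:
        for peer in clusters[cluster]:
            results.append((cluster, peer))   # a real system would send the query here
    return results

print(route_query({"physics"}, clusters))
# [('physics', 'peerA'), ('physics', 'peerB')] -- only one cluster is contacted
```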
3.2 Critique
On the one hand, this approach seems to introduce a lot of overhead, and overhead normally slows a system down and thereby, in general, does not help scalability. In particular, the idea to "rely on local transformation mechanisms and rules", which is supposed to be an essential gain of this schema-based approach, does not convince us.
On the other hand, the way super-peers are introduced seems to be a good factor for increasing scalability. Searching, which is a major issue and an especially weak point in the original versions of Gnutella, is supported by rich concepts here. By introducing clusters built around common "interests", Edutella is one of the only concepts we found that has a way to deal with locality. This should increase the performance, and therefore the scalability, of the system tremendously. The structure, the metadata available for every peer, and the searching ideas built on top of them seem very powerful and a true innovation over the current state of P2P networks.
All in all, this large extra effort might pay off in the long run, if and only if it is implemented wisely. An especially big plus for this technology would be finding a way to actually make use of one of today's existing big P2P networks, such as Gnutella.
3.3 Future work
Future work will focus on identifying optimal granularity levels for different application scenarios. Another goal is to extend the restricted mediation capabilities to all peers. Currently the only sources allowed are RDF and RDFS; the next step would be to allow mediation with relational database sources. In addition, all of the current Edutella design approaches need further exploration and evaluation.
4. INS/TWINE
4.1 Approach
INS/Twine is a P2P network of resolvers. It works with XML, so it can be compared with Edutella, which uses RDF. This makes it possible to attach metadata descriptions to all available resources, and these descriptions can be represented as attribute-value trees (AVTrees). INS/Twine claims to be "a scalable resource discovery system" and facilitates intentional naming. Its approach to achieving scalability is "via hash-based partitioning of resource descriptions amongst a set of symmetric peer resolvers", that is, a hash-based mapping of resource descriptions to resolvers, and this combination is not limited to file sharing.
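As a rough illustration of this hash-based mapping, the sketch below flattens a resource description into attribute-value pairs and hashes each pair to one of a fixed set of resolvers, so that no single resolver stores the whole index. The splitting into pairs, the resolver list, and all names are our simplifications of the INS/Twine scheme, not its actual implementation.

```python
import hashlib

# Sketch of hash-based partitioning of resource descriptions among resolvers
# (our simplification). Each attribute-value pair of a description is hashed,
# and the hash decides which resolver stores a pointer to the resource.

RESOLVERS = ["resolver-0", "resolver-1", "resolver-2"]

def resolver_for(key):
    digest = hashlib.sha1(key.encode()).digest()
    return RESOLVERS[int.from_bytes(digest[:4], "big") % len(RESOLVERS)]

def publish(resource_id, description, tables):
    # description: attribute -> value, e.g. {"type": "printer", "building": "Arts"}
    for attr, value in description.items():
        key = f"{attr}={value}"
        tables.setdefault(resolver_for(key), {}).setdefault(key, set()).add(resource_id)

def lookup(attr, value, tables):
    key = f"{attr}={value}"
    return tables.get(resolver_for(key), {}).get(key, set())

tables = {}
publish("res-42", {"type": "printer", "building": "Arts"}, tables)
print(lookup("type", "printer", tables))   # {'res-42'}
```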
4.2 Goal
The goal is to reach a dynamic and even distribution of resource information, as well as of queries, among resolvers, and to avoid central bottlenecks. Another goal is that results be independent of the location from which the query was issued. A further aim is to address the lack of adequate querying capability for complex resource queries in most current P2P networks.
4.3 Critique
This approach seems to be well thought out. For one, the overhead does not seem to be unnecessarily high, which is already an advantage. The idea we probably like best is to apply much slower refresh rates in the core than at the edges of the network. This is an almost natural reflection of what is most likely to happen, and we did not see it in many other approaches.
Another thing we liked is the sense of, for example, geographical location. This can bring a huge speed-up in typical searches; speed-ups of about 60 percent are possible if clients mainly search for files (in the case of file sharing) that are "popular" in their area, which is a very likely case. For Canadian users who are mainly interested in files from Japan, this might be a disadvantage compared to systems that do not distinguish between geographical locations.
Another good point is the distinction between core and edges: by adding new resolvers, this feature makes the approach scale nicely. The even distribution of queries among the resolvers is also important for scalability.
A downside, however, seems to be the total unawareness of locality in practice, although locality is one of INS/Twine's stated goals.
5. NAPSTER
5.1 Approach
Napster allows nodes to be anonymous, so nodes do not need to have identifiers [Performance and Design of P2P Networks for Efficient File Sharing]. A disadvantage of this approach, discussed in Section 5.3 below, is the vulnerability of the system. The following description of the protocol is based on [A Measurement Study of Peer-to-Peer File Sharing Systems].
In the Napster protocol, a large cluster of dedicated central servers
maintain an index of all the files that are currently being shared by
active peers. Every peer maintains a connection to one of the
central servers through which the file location queries are sent.
The servers then work together to process the query and return a
list of matching files and locations to the client. When the client
receives the results, they may choose to initiate a file exchange
directly from another peer. In addition to maintaining an index of
shared files, the centralized servers also monitor the state of each
peer in the system, keeping track of metadata such as the peer’s
reported connection bandwidth and the duration that the peer has
remained connected to the system. The metadata is returned with
the results of a query, so that the peer that initiated the query has
some information to distinguish between possible download sites.
Because the index is maintained by dedicated servers rather than by arbitrary peers, malicious nodes are less likely to bring down the whole network.
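To contrast this with the flooding sketches in Section 2, the following is a minimal illustration of a centralized index of the kind described above: peers register their shared files and metadata with the server, queries are answered entirely at the server, and downloads then proceed directly between peers. The message format and names are ours, not Napster's actual protocol.

```python
# Sketch of a Napster-style centralized index (our illustration of the
# description above, not the real Napster protocol). Peers register their
# shared files and metadata with the server; queries never leave the server,
# and downloads then happen directly between peers.

class CentralIndex:
    def __init__(self):
        self.files = {}      # filename -> set of peer ids
        self.metadata = {}   # peer id -> reported bandwidth, uptime, ...

    def register(self, peer_id, shared_files, bandwidth_kbps):
        self.metadata[peer_id] = {"bandwidth_kbps": bandwidth_kbps}
        for f in shared_files:
            self.files.setdefault(f, set()).add(peer_id)

    def query(self, keyword):
        # Matching files are returned together with peer metadata so the
        # requester can choose a download site, as described above.
        return [(f, peer, self.metadata[peer])
                for f, peers in self.files.items() if keyword in f
                for peer in peers]

index = CentralIndex()
index.register("peer-1", ["song.mp3"], bandwidth_kbps=512)
print(index.query("song"))
```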
5.2 Traffic Flow
Gnutella floods queries over the entire system, leading to high processing costs and network congestion. In the Napster architecture, by contrast, the file names and machine addresses exchanged with the index contain only tens of bytes, while the files being exchanged typically contain megabytes of data. These large data transfers occur between machines chosen virtually at random, which tends to spread data traffic evenly throughout the Internet. In this respect the architecture is extremely efficient.
Both applications perform data lookups using user-supplied keywords, which presents difficulties: user-supplied keywords can be vague and incomplete, leading to failures and excessive delays in data lookups. Napster also uses a central index server, which represents a single point of failure; if this server goes down, users of Napster can no longer perform any lookups.
5.3 Centralization
Although Napster was considered to be an extremely well designed protocol, having a centralized index has its drawbacks. Napster's survivability was quite poor, as a single failure or a court order could stop the entire network by switching off the central index.
To increase the size of a client/server network, we simply increase the number of clients that can connect to the server by increasing the bandwidth and/or processing resources of the server. When the limits of upgrading a single server have been reached, or when the law of diminishing returns starts to apply to hardware upgrades, multiple servers can be clustered together. Napster's scalability was therefore completely dependent upon the scalability of the central server.
6. ICQ
6.1 Approach
Not unlike Napster, ICQ has a central server that indexes
resources (other clients of the system). With ICQ, communication
between clients is direct and does not have to be routed through
the server. This P2P system also uses an addressing system which
is in some ways independent of IP. Each client has a persistent
address (their ICQ identification number) which is mapped onto
an IP address whenever that client logs on.
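A minimal sketch of this addressing scheme is given below: a central directory maps the persistent ICQ number to whatever IP address the client reports at login, and clients then contact each other directly. The class and method names are our own illustration, not the actual ICQ protocol.

```python
# Sketch of ICQ-style addressing (our illustration): a central directory maps
# a persistent identifier (the ICQ number) to whatever IP address the client
# reports when it logs on; clients then talk to each other directly.

class Directory:
    def __init__(self):
        self.online = {}   # ICQ number -> current IP address

    def login(self, icq_number, ip_address):
        self.online[icq_number] = ip_address

    def logout(self, icq_number):
        self.online.pop(icq_number, None)

    def locate(self, icq_number):
        return self.online.get(icq_number)   # None means the user is offline

directory = Directory()
directory.login(152520, "142.3.10.7")
print(directory.locate(152520))   # the caller now connects to this address directly
```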
Similar once again to Napster, ICQ scales like a client/server model, with one significant difference: because communication between clients is direct, the central server only has to handle lookups, which addresses this scalability issue. (Find out how.)
7. PASTRY
7.1 Approach
Pastry is a generic peer-to-peer content location and routing system based on a self-organizing overlay network of nodes connected via the Internet. Pastry is completely decentralized, fault-resilient, scalable, and reliably routes a message to the live node whose identifier is numerically closest to the message's key.
Pastry can be described as a peer-to-peer application library providing a "self-organizing overlay network" for the Internet [12]. Pastry routes messages from one node to another, similar to Gnutella, but it does this in a much more organized way. Each node within the Pastry network is assigned a random numeric identifier. Messages on the network consist of a segment of data as well as a numeric key. When a node in the network is presented with a message, that node forwards the message to a node whose identifier is closer to the key than its own, if this is possible. If N is the number of nodes present within the network, then the number of routing steps for any message grows at a rate of O(log N).
Pastry also takes network locality into account. It seeks to minimize the distance messages travel, according to a scalar proximity metric such as the number of IP routing hops.
Pastry, like Tapestry, has a relatively complicated join protocol: a new node's routing table is populated with information from the nodes along the path taken by the join message, which leads to some latency.
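The routing rule described above can be sketched as follows: each hop forwards the message to a known node whose identifier is numerically closer to the key. Real Pastry achieves O(log N) hops with prefix-based routing tables and leaf sets, which we omit here; in this toy version of ours every node simply knows a handful of random peers.

```python
# Simplified sketch of numeric-closeness routing (our illustration, not the
# real Pastry algorithm): each hop forwards the message to a known node whose
# identifier is numerically closer to the key, stopping when no closer node
# is known.

import random

def route(key, start, known_peers, max_hops=32):
    """known_peers: node id -> list of node ids that node knows about."""
    current, path = start, [start]
    for _ in range(max_hops):
        candidates = known_peers.get(current, [])
        # pick the known node numerically closest to the key
        best = min(candidates + [current], key=lambda n: abs(n - key))
        if best == current:          # no closer node known: delivery point
            return path
        current = best
        path.append(current)
    return path

random.seed(1)
ids = sorted(random.sample(range(1000), 8))
known = {n: random.sample(ids, 4) for n in ids}     # each node knows 4 others
print(route(key=321, start=ids[0], known_peers=known))
```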
8. FREENET
8.1 Approach
Freenet takes a different approach than its predecessors. Freenet is not primarily concerned with bandwidth; rather, its goal is complete anonymity. Files are associated with a key and stored in the network. The Freenet peer-to-peer system is decentralized and symmetric and automatically adapts when clients leave and join. Freenet's lookups take the form of searches for cached copies, and no specific server is assigned responsibility for a document. This is what allows Freenet to provide anonymity, but at the same time it prevents Freenet from guaranteeing retrieval of existing documents or from providing low bounds on retrieval costs.
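The sketch below illustrates the general idea of key-based retrieval with caching on the return path, which is what lets copies spread through the network without assigning documents to specific servers. The routing heuristic used here (forward to the neighbour whose stored keys look closest) and all names are our simplification, not Freenet's actual algorithm.

```python
# Sketch of key-based retrieval with caching on the return path (our
# simplification, not Freenet's real routing). Each node forwards the request
# toward the neighbour whose stored keys are nearest to the requested key;
# when a copy is found, it is cached by every node on the way back.

def request(key, node, stores, neighbours, ttl=10, visited=None):
    visited = set() if visited is None else visited
    if key in stores[node]:
        return stores[node][key]
    if ttl == 0 or node in visited:
        return None
    visited.add(node)

    def closeness(n):
        keys = stores[n].keys()
        return min((abs(k - key) for k in keys), default=float("inf"))

    for peer in sorted(neighbours[node], key=closeness):
        data = request(key, peer, stores, neighbours, ttl - 1, visited)
        if data is not None:
            stores[node][key] = data      # cache along the return path
            return data
    return None

stores = {"a": {}, "b": {}, "c": {17: b"document"}}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(request(17, "a", stores, neighbours))   # b'document', now also cached at a and b
```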
9. CAN
9.1 Approach
CAN (Content-Addressable Network) is based on distributed hash tables. It scales well and is fault tolerant.
10. TAPESTRY
10.1 Approach
Tapestry is a peer-to-peer, wide-area, decentralized routing and location network infrastructure. Tapestry forms an overlay network that sits at the application layer (on top of an operating system). If Tapestry is installed on different network nodes, it allows any one node to route messages to any other node running Tapestry, given only a location- and network-independent name. Nodes in a Tapestry network can also advertise location information about the data they possess in a specific format understood by other nodes running Tapestry. This format allows other nodes to find and access that data easily and efficiently, given that they know the data's name. Tapestry thus gives nodes the ability to share data, thereby creating their own P2P system.
Tapestry's routing algorithm is aware of the network topology, so queries never travel more than the network distance required to reach their destination. The lookup time for the Tapestry protocol is O(log n), and Tapestry does not require a centralized server. However, Tapestry does not handle node joins and failures as well as Chord, since its mechanisms are more complicated; Chord is better suited to P2P systems with many nodes arriving and departing at random intervals.
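Tapestry-like systems resolve the destination identifier one digit at a time; the toy sketch below (ours, not Tapestry's actual data structures) moves the message at each hop to a known node matching a longer prefix of the destination, so with 4-digit identifiers a lookup takes at most 4 hops, i.e. O(log n) for n nodes.

```python
# Toy sketch of digit-by-digit overlay routing (our illustration): at each
# hop the message moves to a known node whose identifier matches a longer
# prefix of the destination identifier.

def shared_prefix(a, b):
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

def route(dest, current, routing_tables):
    """routing_tables: node id -> list of node ids that node knows about."""
    path = [current]
    while current != dest:
        known = routing_tables[current]
        nxt = max(known, key=lambda n: shared_prefix(n, dest))
        if shared_prefix(nxt, dest) <= shared_prefix(current, dest):
            break                 # no node matching a longer prefix is known
        path.append(nxt)
        current = nxt
    return path

tables = {
    "a1f0": ["7b00", "a900", "a1c2"],
    "7b00": ["a1f0", "a900"],
    "a900": ["a1c2", "a1f0"],
    "a1c2": ["a1c9", "a900"],
    "a1c9": ["a1c2"],
}
print(route("a1c9", "a1f0", tables))   # ['a1f0', 'a1c2', 'a1c9']
```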
11. CHORD
11.1 Approach
Chord provides fast distributed computation of a hash function, mapping keys to the nodes responsible for them [Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications]. Chord assigns keys to nodes using consistent hashing, which has several desirable properties: with high probability, the hash function balances load, so that all nodes receive roughly the same number of keys. The Chord protocol supports just one operation: given a key, it maps that key onto a node. Chord has no notion of locality; routes are based on the numerical difference between the message identifier and the node address.
11.2 Locality
Chord improves the scalability of consistent hashing by avoiding the requirement that every node know about every other node. A node in the Chord network needs only a small amount of "routing" information about other nodes. Since this information is distributed, a node resolves the hash function by communicating with other nodes. In a network with N nodes, each node maintains information about only O(log N) other nodes, and a lookup requires O(log N) messages.
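A small sketch of the consistent-hashing idea is given below: node names and keys are hashed onto the same identifier ring, a key is stored at its successor node, and each node's O(log N) routing entries ("fingers") point at the successors of exponentially spaced points on the ring. The parameters and names are our simplification of the protocol in [11], not its implementation.

```python
import hashlib

# Sketch of Chord-style consistent hashing (our simplification). Node and key
# identifiers are hashed onto a ring of 2**M points; a key belongs to the
# first node clockwise from it (its successor). Each node keeps only O(log N)
# "fingers", so a lookup needs only O(log N) messages.

M = 16   # identifier bits in this toy example (the real protocol uses 160-bit ids)

def ring_id(name):
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (2 ** M)

def successor(point, node_ids):
    """First node at or after `point`, wrapping around the ring."""
    return min(node_ids, key=lambda n: (n - point) % (2 ** M))

nodes = [ring_id(f"node-{i}") for i in range(8)]

key = ring_id("some-file.mp3")
print("key", key, "is stored at node", successor(key, nodes))

# Finger table of one node: successors of n + 2**i for i = 0..M-1.
n = nodes[0]
fingers = sorted({successor((n + 2 ** i) % (2 ** M), nodes) for i in range(M)})
print("node", n, "keeps", len(fingers), "distinct fingers out of", len(nodes), "nodes")
```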
12. ACKNOWLEDGMENTS
Our thanks to ACM SIGCHI for allowing us to modify templates
they had developed.
13. REFERENCES
[1] Blundell, N. and Mathy, L. An Overview of Gnutella Optimisation Techniques.
[2] Bratislav, M., Jelena, K., and Veljko, M. Survey of Peer-to-Peer Technologies.
[3] Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., and Shenker, S. Making Gnutella-like P2P Systems Scalable.
[4] Jacklin, A. Using UDP to Increase the Scalability of Peer-to-Peer Networks.
[5] Kapur, A., Brooks, R., and Rai, S. Performance and Design of P2P Networks for Efficient File Sharing.
[6]
[7] Lv, Q., Ratnasamy, S., and Shenker, S. Can Heterogeneity Make Gnutella Scalable? In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS '02), Cambridge, MA, Mar. 2002.
[8]
[9] Ripeanu, M., Foster, I., and Iamnitchi, A. Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Computing Journal 6, 1 (2002).
[10]
[11] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D. R., Kaashoek, M. F., Dabek, F., and Balakrishnan, H. Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications.
[12]
[13]
[14] Query Routing for the Gnutella Network. http://www.limewire.com/developer/query_routing/keyword%20routing.htm
[15] Gnutella Protocol Development. http://rfc-gnutella.sourceforge.net/developer/testing/pongCaching.html
[16] The 'Net After Napster. http://www.npr.org/programs/atc/features/2001/jul/afternapster/010726.after.napster.guide.html#supernode