Adriana Iamnitchi
University of South Florida anda@cse.usf.edu
http://www.cse.usf.edu/~anda
P2P Systems – 1
Why Peer-to-Peer Systems?
• Wide-spread user experience
• Large-scale distributed
application, unprecedented growth and popularity
– KaZaA – 389 million downloads
(1M/week): one of the most popular applications ever!
• Heavily researched in the last 8-
9 years with results in:
– User behavior characterization
– Scalability
– Novel problems (or aspects): reputation, trust, incentives for fairness
• Commercial impact
– Do you know of any examples?
eDonkey            3,108,909
FastTrack (Kazaa)  2,114,120
Gnutella           2,899,788
Cvernet              691,750
Filetopia              3,405
Number of users for file-sharing applications
(estimate from www.slyck.com, Sept ‘06)
P2P Systems – 2
Outline Today: Peer-to-peer Systems
• Background
• Some history
• Unstructured overlays
– Napster
– Gnutella (original and new)
– BitTorrent
– Exploiting user behavior in distributed file-sharing systems
• Structured overlays (“DHT”s)
– Basics
– Chord
– CAN
P2P Systems – 3
What Is a P2P System?
[Figure: several nodes connected to one another across the Internet]
• A distributed system architecture:
– No centralized control (debatable: Napster?)
– Nodes are symmetric in function (debatable: New Gnutella protocol?)
• Large number of unreliable nodes
• Initially identified with music file sharing
P2P Systems – 4
P2P Definition(s)
A number of definitions coexist:
• Def 1 : “A class of applications that takes advantage of resources — storage, cycles, content, human presence — available at the edges of the Internet.”
– Edges often turned off, without permanent IP addresses
• Def 2 : “A class of decentralized, self-organizing distributed systems, in which all or most communication is symmetric.”
Lots of other definitions that fit in between
P2P Systems – 5
The Promise of P2P Computing
• High capacity through parallelism:
– Many disks
– Many network connections
– Many CPUs
• Reliability:
– Many replicas:
• of data
• of network paths
– Geographic distribution
• Automatic configuration
• Useful in public and proprietary settings
P2P Systems – 6
History
• Decentralized, P2P solutions: USENET
• As a grass-root movement: started in 1999 with Napster
– Objective: (music) file sharing
P2P Systems – 7
Popularity since 2004
Britney Spears: 2.60
p2p: 1.00
Normalized and compared to the popularity of Britney Spears as shown by Google Trends
P2P Systems – 8
Napster: History
• Program for sharing files over the Internet
• History:
– 5/99: Shawn Fanning (freshman, Northeastern U.) founds the Napster online music service
– 12/99: first lawsuit
– 3/00: 25% of U. Wisconsin traffic is Napster
– 2000: est. 60M users
– 2/01: US Circuit Court of Appeals rules that
Napster knew its users were violating copyright laws
– 7/01: # simultaneous online users:
Napster 160K, Gnutella: 40K, Morpheus: 300K
P2P Systems – 9
Basic Primitives for File Sharing
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
P2P Systems – 10
Napster: How It Works
• Client-Server: Use central server to locate files
• Peer-to-Peer: Download files directly from peers
P2P Systems – 11
Napster
1. File list is uploaded to the napster.com server (Join and Publish)
[Figure: users upload their file lists to napster.com]
P2P Systems – 12
Napster
2. User requests a search at the server (Search); request and results pass through napster.com
[Figure: the user’s query goes to napster.com, which returns a list of matching peers]
P2P Systems – 13
Napster
3. User pings the hosts that apparently have the data, looking for the best transfer rate
[Figure: the user pings candidate peers directly]
P2P Systems – 14
Napster
4. User retrieves the file directly from a peer (Fetch)
[Figure: the file download bypasses napster.com]
P2P Systems – 15
Lessons Learned from Napster
• Strengths: Decentralization of Storage
– Every node “pays” its participation by providing access to its resources
• physical resources (disk, network), knowledge (annotations), ownership (files)
– Every participating node acts as both a client and a server (“servent”): P2P
– Decentralization of cost and administration = avoiding resource bottlenecks
• Weaknesses: Centralization of Data Access Structures (Index)
– Server is single point of failure
– Unique entity required for controlling the system = design bottleneck
– Copying copyrighted material made Napster the target of legal attack
[Figure: spectrum from centralized to decentralized systems, along an axis of increasing resource sharing and decentralization]
P2P Systems – 16
Gnutella: File-Sharing with No Central Server
P2P Systems – 17
Gnutella: History
• Developed as a 14-day “quick hack” by Nullsoft (Winamp)
– Originally intended for exchange of recipes
• Evolution of Gnutella
– Published under GNU General Public License on the Nullsoft web server
– Taken off after a couple of hours by AOL (owner of Nullsoft)
• Too late: this was enough to “infect” the Internet
– Gnutella protocol was reverse engineered from downloaded versions of the original
– Protocol published
– Third-party clients were published and Gnutella started to spread
– Many iterations to fix poor initial design
• High impact:
– Many versions implemented
– Many different designs
– Lots of research papers/ideas
P2P Systems – 18
Gnutella: Search in an Unstructured Overlay
[Figure: a node floods the query “Where is file A?”; nodes that have file A send replies back along the query path]
P2P Systems – 19
Gnutella: Overview
• Join: on startup, client contacts a few other nodes; these become its “neighbors”
– Initial list of contacts published at gnutellahosts.com:6346
– Outside the Gnutella protocol specification
– Default value for number of open connections (neighbors): C = 4
• Publish: no need
• Search:
– Flooding: ask neighbors, who ask their neighbors, and so on...
• Each forwarding of a request decreases a TTL; default: TTL = 7
• When/if found, reply to sender
• Drop the request when its TTL expires
• One request leads to up to 2 · Σ_{i=0..TTL} C·(C−1)^i = 26,240 messages (for C = 4, TTL = 7)
– Replies are back-propagated along the query path in case of success (Why?)
• Fetch: get the file directly from peer (HTTP)
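The message bound above can be checked with a short calculation; `flood_messages` is a hypothetical helper name, not part of any Gnutella implementation:

```python
# Worst-case message count for Gnutella flooding: the source contacts C
# neighbors, each of which forwards to C-1 others, TTL hops deep; the
# factor 2 accounts for back-propagated replies. C=4, TTL=7 are the
# protocol defaults quoted on this slide.
def flood_messages(c: int = 4, ttl: int = 7) -> int:
    return 2 * sum(c * (c - 1) ** i for i in range(ttl + 1))

print(flood_messages())  # 26240 with the default C=4, TTL=7
```

This confirms the 26,240 figure on the slide and makes it easy to see how quickly the cost grows with C or TTL.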
P2P Systems – 20
Gnutella: Protocol Message Types
Type     | Description                                           | Contained Information
Ping     | Announce availability and probe for other servents    | None
Pong     | Response to a ping                                    | IP address and port# of responding servent; number and total KB of files shared
Query    | Search request                                        | Minimum network bandwidth of responding servent; search criteria
QueryHit | Returned by servents that have the requested file     | IP address, port# and network bandwidth of responding servent; number of results and result set
Push     | File download request for servents behind a firewall  | Servent identifier; index of requested file; IP address and port to send the file to
P2P Systems – 21
P2P Systems – 22
Gnutella: Tools for Network Exploration
• Eavesdrop traffic - insert modified node into the network and log traffic.
• Crawler - connect to active nodes and use the membership protocol to discover membership and topology.
P2P Systems – 23
Gnutella: Heterogeneity
All Peers Equal? (1)
[Figure: an overlay mixing 56 kbps modem, 1.5 Mbps DSL, and 10 Mbps LAN peers]
P2P Systems – 24
Gnutella Network Structure: Improvement
Gnutella protocol 0.6: a two-tier architecture of ultrapeers and leaves
[Figure: control messages (search, join, etc.) travel among ultrapeers; data transfer (file download) happens directly between peers]
P2P Systems – 25
Déjà vu?
Gnutella protocol 0.6: a two-tier architecture of ultrapeers and leaves
[Figure: the same two-tier structure as on the previous slide]
P2P Systems – 26
Gnutella: Free Riding
All Peers Equal? (2)
More than 25% of Gnutella clients share no files; 75% share 100 files or less
Conclusion: Gnutella has a high percentage of free riders
• If only a few individuals contribute to the public good, these few peers effectively act as centralized servers.
• Outcome:
– Significant efforts in building incentive-based systems
– BitTorrent?
Adar and Huberman (Aug ’00)
P2P Systems – 27
Flooding in Gnutella: Loops?
[Figure: a node drops a query it has already seen, preventing loops]
P2P Systems – 28
Improvements of Message Flooding
• Expanding Ring
– start search with small TTL (e.g. TTL = 1)
– if no success iteratively increase TTL (e.g. TTL = TTL +2)
• k-Random Walkers
– forward query to one randomly chosen neighbor only, with large TTL
– start k random walkers
– random walker periodically checks with requester whether to continue
• Experiences (from simulation)
– adaptive TTL is useful
– message duplication should be avoided
– flooding should be controlled at fine granularity
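The expanding-ring idea above can be sketched in a few lines; `flood` is an assumed callback standing in for one TTL-bounded flood round, and the names are illustrative only:

```python
def expanding_ring_search(start, key, flood, max_ttl=7):
    """Sketch of expanding-ring search: retry a TTL-bounded flood with a
    growing TTL instead of flooding the whole radius at once.
    `flood(start, key, ttl)` is an assumed callback returning a (possibly
    empty) list of hits."""
    ttl = 1
    while ttl <= max_ttl:
        hits = flood(start, key, ttl)
        if hits:
            return hits          # stop early: no need to widen further
        ttl += 2                 # iteratively widen the ring, as on the slide
    return []
```

k-random walkers replace the flood with k independent forwarders; the retry-with-larger-scope structure stays the same.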
P2P Systems – 29
Gnutella Topology (Mis)match?
P2P Systems – 30
Gnutella: Network Size?
Explosive growth in 2001, slowly shrinking thereafter
• High user interest
– Users tolerate high latency, low quality results
• Better resources
– DSL and cable modem nodes grew from 24% to 41% over the first 6 months.
P2P Systems – 31
Is Gnutella a Power-Law Network?
Power-law networks: the number of nodes with L links follows a power-law distribution, N(L) ∝ L^(−k)
[Figure: Gnutella node-degree distribution on log-log axes (number of nodes vs. number of links), November 2000]
Examples of power-law networks:
– The Internet at the AS level
– In/out links to/from HTML pages
– Airports
– The US power grid
– Social networks
Implications: high tolerance to random node failure, but low reliability when facing an ‘intelligent’ adversary
P2P Systems – 32
Network Resilience
[Figure: partial topology after 30% of nodes die at random vs. after a targeted 4% die; from Saroiu et al., MMCN 2002]
P2P Systems – 33
Is Gnutella a Power-Law Network? (Later Data)
Later, larger networks display a bimodal distribution
Implications:
– High tolerance to random node failures is preserved
– Increased reliability when facing an attack
[Figure: node-degree distribution on log-log axes, May 2001; from Ripeanu, Iamnitchi, Foster, 2002]
P2P Systems – 34
Discussion Unstructured Networks
• Performance
– Search latency: low (graph properties)
– Message Bandwidth: high
• improvements through random walkers, but essentially the whole network needs to be explored
– Storage cost: low (only local neighborhood)
– Update cost: low (only local updates)
– Resilience to failures: good (multiple paths are explored and data is replicated)
• Qualitative Criteria
– search predicates: very flexible, any predicate is possible
– global knowledge: none required
– peer autonomy: high
P2P Systems – 35
P2P Systems – 36
BitTorrent Components
• Torrent File
– Metadata of file to be shared
– Address of tracker
– List of pieces and their checksums
• Tracker
– Lists peers interested in the distribution of the file
• Peers
– Clients interested in the distribution of the file
– Can be “seeds” or “leechers”
P2P Systems – 37
A BitTorrent Swarm
• A “seed” node has the file
• A “tracker” associated with the file
• A “.torrent” meta-file is built for the file: identifies the address of the tracker node
• The .torrent file is published on the web
• File is split into fixed-size segments (e.g., 256 KB)
P2P Systems – 38
Choking Algorithm
• Each connected peer is in one of two states
– Choked: Download requests by a choked peer are ignored
– Unchoked: Download requests by an unchoked peer are honored
– Choking occurs at the peer level
• Each peer has a certain number of unchoke slots
– 4 regular unchoke slots (per BitTorrent standard)
– 1 optimistic unchoke slot (per BitTorrent standard)
• Choking Algorithm
– Peers unchoke connected peers with best service rate
• Service rate = rolling 20 second average of its upload bandwidth
– Optimistically unchoking peers prevents a static set of unchoked peers
– The choking algorithm runs every 10 seconds
– Peers optimistically unchoked every 30 seconds
• New peers are 3 times more likely to be optimistically unchoked
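The regular-unchoke step described above can be sketched as a simple selection by service rate; `choose_unchoked` and its inputs are illustrative names, not BitTorrent client APIs, and the optimistic-unchoke slot is left out:

```python
def choose_unchoked(peers, rates, slots=4):
    """Sketch of the regular-unchoke step: keep the `slots` peers with the
    best measured service rate (the rolling 20-second upload average on
    the slide); everyone else stays choked. `rates` maps peer id to its
    measured rate and is an assumed input."""
    ranked = sorted(peers, key=lambda p: rates.get(p, 0.0), reverse=True)
    return set(ranked[:slots])
```

A real client re-runs this every 10 seconds and additionally rotates one optimistically unchoked peer every 30 seconds, which is what lets newcomers with no history earn a slot.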
P2P Systems – 39
Piece Selection
• Random First Piece
– Piece downloaded at random
– Algorithm used by new peers
• Rarest Piece First
– Ensures > 1 distributed copies of a piece
– Increases interest of connected peers
– Increases scalability
• Random Piece vs. Rarest Piece
– The rarest piece probabilistically has a higher download time
– New peers want to reduce download time, but also to quickly become interesting to connected peers
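Rarest-first can be sketched as counting, among the pieces we still need, how many connected peers hold each one; `rarest_first` is a hypothetical helper, not a client API:

```python
from collections import Counter

def rarest_first(needed, peer_bitfields):
    """Sketch of rarest-first piece selection: pick the still-needed piece
    held by the fewest connected peers. `peer_bitfields` maps each peer to
    the set of piece ids it advertises (an assumed input)."""
    counts = Counter()
    for pieces in peer_bitfields.values():
        counts.update(pieces & needed)          # only count pieces we need
    available = [p for p in needed if counts[p] > 0]
    if not available:
        return None                             # nobody has what we need
    return min(available, key=lambda p: counts[p])
```

Downloading the rarest piece first raises the number of distributed copies of every piece, which is exactly the scalability argument on the slide.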
P2P Systems – 40
BitTorrent: Overview
• Join: nothing
– Just find that there is a community ready to host your tracker
• Publish: Create tracker, upload .torrent metadata file
• Search:
– For file: nothing
• the community is supposed to provide search tools
– For segments: exchange segment IDs maps with other peers.
• Fetch: exchange segments with other peers (HTTP)
P2P Systems – 41
Gnutella vs. BitTorrent: Discussion
• Architecture
– Decentralization?
• System properties
– Reliability?
– Scalability?
– Fairness?
– Overheads?
– Quality of Service
• Search coverage for content?
• Ability to download content fast?
P2P Systems – 42
P2P Systems – 43
What Is a DHT?
• A building block used to locate key-based objects over millions of hosts on the Internet
• Inspired from traditional hash table:
– key = Hash(name)
– put(key, value)
– get(key) -> value
– Service: O(1) storage
• How to do this across millions of hosts on the Internet?
– Distributed Hash Tables
• What might be difficult?
– Decentralized: no central authority
– Scalable: low network traffic overhead
– Efficient: find items quickly (latency)
– Dynamic: nodes fail, new nodes join
– General-purpose: flexible naming
P2P Systems – 44
From Hash Tables to Distributed Hash Tables
Challenge: scalably distributing the index space:
– Scalability issue with hash tables: adding a new entry forces many items to move
– Solution: consistent hashing (Karger et al., 1997)
• Consistent hashing:
– Circular ID space with a distance metric
– Objects and nodes are mapped onto the same space
– A key is stored at its successor: the node with the next-higher ID
[Figure: circular ID space with object IDs (K5, K20, K80) and node IDs (N32, N90, N105); each key sits at its successor node]
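The successor rule above can be sketched on a toy 8-bit ring; the node IDs match the slide’s figure, while `h` and `successor` are illustrative helper names (real Chord uses SHA-1’s full 160-bit space):

```python
import hashlib

M = 2 ** 8   # toy identifier space, m = 8 bits

def h(name):
    """Hash a name (key or IP address) onto the circular ID space."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

def successor(key_id, node_ids):
    """A key lives on the first node clockwise from it on the ring."""
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return min(node_ids)          # wrap around the circle

nodes = [32, 90, 105]
# e.g. K20 maps to N32 and K80 maps to N90, as in the figure
```

Because only the arc between a node and its predecessor moves when a node joins or leaves, adding an entry no longer relocates many items.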
P2P Systems – 45
The Lookup Problem
[Figure: a publisher does Put(key=“title”, value=file data…) at one of nodes N1–N6 on the Internet; a client elsewhere issues Get(key=“title”) — which node holds the value?]
P2P Systems – 46
DHTs: Main Idea
[Figure: a publisher stores key = H(audio data), value = {artist, album title, track title} in an overlay of nodes N1–N9; a client retrieves it with Lookup(H(audio data))]
What Is a DHT?
• Distributed Hash Table:
– key = Hash(data)
– lookup(key) -> IP address
– send-RPC(IP address, PUT, key, value)
– send-RPC(IP address, GET, key) -> value
• API supports a wide range of applications
– DHT imposes no structure/meaning on keys
– Applications can thus build complex data structures on top
P2P Systems – 48
Approaches
• Different strategies
– Chord: constructing a distributed hash table
– CAN: Routing in a d-dimensional space
– Many more…
• Commonalities
– Each peer maintains a small part of the index information (routing table)
– Searches performed by directed message forwarding
• Differences
– Performance and qualitative criteria
P2P Systems – 49
Example 1: Distributed Hash Tables (Chord)
• Hashing of search keys AND peer addresses onto binary keys of length m
– Key identifier = SHA-1(key); node identifier = SHA-1(IP address)
– SHA-1 distributes both uniformly
– e.g., for m = 8: key(“yellow-submarine.mp3”) = 17, key(192.178.0.1) = 3
• Data with hashed identifier k is stored at successor(k): the node with the smallest ID larger than or equal to k
[Figure: a ring of 2^m identifiers; each key is stored at the first peer following it clockwise]
• Search possibilities?
1. Every peer knows every other: O(n) routing table size
2. Peers know only their successor: O(n) search cost
Routing Tables
• Every peer knows m peers, at exponentially increasing distances
• Each peer p stores a routing table of fingers:
– s_i = successor(p + 2^(i−1)) for i = 1, …, m
– We also write s_i = finger(i, p)
[Figure: peer p’s fingers at distances p+1, p+2, p+4, p+8, p+16 point to peers p2, p3, p4]
• Routing table size: O(log n)
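The finger-table rule can be sketched directly from the definition s_i = successor(p + 2^(i−1)); `finger_table` is an illustrative helper, and the test nodes below are the classic small-example ring from the Chord paper:

```python
def finger_table(p, node_ids, m):
    """Build peer p's fingers on a 2^m ring: finger i points to
    successor(p + 2^(i-1)), i.e. the first live node at or after that
    identifier, wrapping around the circle."""
    space = 2 ** m
    ring = sorted(node_ids)

    def succ(k):
        for n in ring:
            if n >= k:
                return n
        return ring[0]            # wrap around

    return [succ((p + 2 ** (i - 1)) % space) for i in range(1, m + 1)]
```

Because finger distances double, the table has only m = O(log n) entries yet always contains a node at least halfway to any target.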
P2P Systems – 51
Search
search(p, k):
  find in p’s routing table the largest s_i = p* whose ID lies in the interval [p, k]
  if such a p* exists, then search(p*, k)
  else return successor(p)   // found: k is stored at successor(p)
[Figure: a search for key k hops finger by finger, each hop at least halving the remaining distance]
• Search cost: O(log n)
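The search recursion above can be sketched as an iterative lookup; `chord_search` is an illustrative name, `fingers` is an assumed precomputed map from each node to its finger list, and distances are clockwise on a 2^m ring:

```python
def chord_search(p, k, fingers, m=6):
    """Sketch of the slide's search(p, k): hop to the farthest finger that
    does not overshoot k; when no finger helps, k is stored at
    successor(p), which is finger 1."""
    space = 2 ** m

    def dist(a, b):               # clockwise distance from a to b
        return (b - a) % space

    while True:
        best = None
        for s in fingers[p]:
            # candidate fingers lie on the clockwise arc from p to k
            if s != p and dist(p, s) <= dist(p, k):
                if best is None or dist(p, s) > dist(p, best):
                    best = s
        if best is None:
            return fingers[p][0]  # finger 1 is successor(p): found
        p = best
```

Each hop at least halves the clockwise distance to k, which is where the O(log n) bound comes from.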
P2P Systems – 52
Finger i Points to Successor of n + 2^i
[Figure: node N80’s fingers cover ½, ¼, 1/8, …, 1/128 of the ring, the farthest pointing near N120]
P2P Systems – 53
Lookups Take O(log N) Hops
[Figure: a ring with nodes N5, N10, N20, N32, N50, N60, N80, N99, N110; Lookup(K19) issued at N80 resolves to K19’s successor N20 in a few finger hops]
P2P Systems – 54
Node Insertion (Join)
• A new node q joins the network between p and p2
[Figure: the routing tables of p and q after the join; p’s closest fingers now point to q]
P2P Systems – 55
Load Balancing in Chord
[Figure: distribution of keys per node in a network of n = 10^4 nodes storing 5 · 10^5 keys]
P2P Systems – 56
Length of Search Paths
[Figure: distribution of search path lengths in a network of n = 2^12 nodes storing 100 · 2^12 keys; the average path length is ½ · log2(n)]
P2P Systems – 57
Chord Discussion
• Performance
– Search latency: O(log n) (with high probability, provable)
– Message Bandwidth: O(log n) (selective routing)
– Storage cost: O(log n) (routing table)
– Update cost: low (like search)
– Node join/leave cost: O(log² n)
– Resilience to failures: replication to successor nodes
• Qualitative Criteria
– search predicates: equality of keys only
– global knowledge: key hashing, network origin
– peer autonomy: nodes have by virtue of their address a specific role in the network
P2P Systems – 58
Example 2: Topological Routing (CAN)
• Based on hashing of keys into a d-dimensional space (a torus)
– Each peer is responsible for keys of a subvolume of the space (a zone)
– Each peer stores the addresses of peers responsible for the neighboring zones for routing
– Search requests are greedily forwarded to the peers in the closest zones
• Assignment of peers to zones depends on a random selection made by the peer
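The greedy forwarding described above can be sketched as follows; this is an illustration of the idea, not the exact CAN protocol, and `can_route`, `zones`, and `neighbors` are assumed names (zones as axis-aligned boxes on the unit d-dimensional torus):

```python
def can_route(start, point, zones, neighbors):
    """Greedy CAN forwarding sketch: hand the query to the neighbor whose
    zone centre is closest to the key's point, until the current node's
    zone contains the point. Returns the list of hops."""
    def torus_dist(a, b):
        return sum(min(abs(x - y), 1 - abs(x - y)) ** 2
                   for x, y in zip(a, b)) ** 0.5

    def contains(zone, pt):        # zone = (lows, highs), one per dimension
        lows, highs = zone
        return all(lo <= x < hi for lo, x, hi in zip(lows, pt, highs))

    def centre(node):
        lows, highs = zones[node]
        return [(lo + hi) / 2 for lo, hi in zip(lows, highs)]

    hops = [start]
    node = start
    while not contains(zones[node], point):
        node = min(neighbors[node],
                   key=lambda n: torus_dist(centre(n), point))
        hops.append(node)
    return hops
```

With each node keeping only its 2d neighbors, this greedy rule yields the O(d · n^(1/d)) path lengths discussed later.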
P2P Systems – 59
Network Search and Join
Node 7 joins the network by choosing a random point in the coordinate space; the point falls in node 1’s zone, which is then split between nodes 1 and 7
P2P Systems – 60
CAN Refinements
• Multiple Realities
– We can have r different coordinate spaces
– Nodes hold a zone in each of them
– Creates r replicas of the (key, value) pairs
– Increases robustness
– Reduces path length as search can be continued in the reality where the target is closest
• Overloading zones
– Different peers are responsible for the same zone
– Splits are only performed if a maximum occupancy (e.g. 4) is reached
– Nodes know all other nodes in the same zone
– But only one node per neighboring zone
P2P Systems – 61
CAN Path Length
P2P Systems – 62
Increasing Dimensions and Realities
P2P Systems – 63
CAN Discussion
• Performance
– Search latency: O(d · n^(1/d)), depends on the choice of d (with high probability, provable)
– Message bandwidth: O(d · n^(1/d)) (selective routing)
– Storage cost: O(d) (routing table)
– Update cost: low (like search)
– Node join/leave cost: O(d · n^(1/d))
– Resilience to failures: realities and overloading
• Qualitative Criteria
– search predicates: spatial distance of multidimensional keys
– global knowledge: key hashing, network origin
– peer autonomy: nodes can decide on their position in the key space
P2P Systems – 64
Comparison of (some) P2P Solutions

System   | Search paradigm               | Overlay maintenance cost | Search cost
Gnutella | Breadth-first on search graph | O(1)                     | 2 · Σ_{i=0..TTL} C·(C−1)^i
Chord    | Implicit binary search trees  | O(log n)                 | O(log n)
CAN      | d-dimensional space           | O(d)                     | O(d · n^(1/d))
P2P Systems – 65
DHT Applications
Not only for sharing music anymore…
– Global file systems [OceanStore, CFS, PAST, Pastiche, UsenetDHT]
– Naming services [Chord-DNS, Twine, SFR]
– DB query processing [PIER, Wisc]
– Internet-scale data structures [PHT, Cone, SkipGraphs]
– Communication services [i3, MCAN, Bayeux]
– Event notification [Scribe, Herald]
– File sharing [OverNet]
P2P Systems – 66
P2P Systems – 67
Research Trends: A Superficial History Based on
Articles in IPTPS
• In the early ‘00s (2002-2004):
– DHT-related applications, optimizations, reevaluations… (more than 50% of
IPTPS papers!)
– System characterization
– Anonymization
• 2005-…
– BitTorrent: improvements, alternatives, gaming it
– Security, incentives
• More recently:
– Live streaming
– P2P TV (IPTV)
– Games over P2P
P2P Systems – 68
What’s Missing?
• Very important lessons learned
– …but did we move beyond vertically-integrated applications?
• Can we distribute complex services on top of p2p overlays?
P2P Systems – 69
References
• Chord: A Scalable Peer-to-peer Lookup Service for Internet
Applications, Stoica et al., Sigcomm 2001
• A Scalable Content-Addressable Network, Ratnasamy et al., Sigcomm
2001
• Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer
Systems and Implications for System Design. Matei Ripeanu, Adriana
Iamnitchi and Ian Foster. IEEE Internet Computing, vol. 6(1), Feb 2002
• Interest-Aware Information Dissemination in Small-World Communities,
Adriana Iamnitchi and Ian Foster, HPDC 2005, Raleigh, NC, July 2005
• Small-World File-Sharing Communities. Adriana Iamnitchi, Matei
Ripeanu, Ian Foster, Infocom 2004, Hong Kong, March 2004
• IPTPS paper archive: http://www.iptps.org/papers.html
• Many materials available on the web, including lectures by Matei Ripeanu,
Karl Aberer, Brad Karp, and others.
P2P Systems – 70
P2P Systems – 71
Context and Motivation
• At the time we did this research, many p2p communities were large, active, and stable
• Characterization of p2p systems showed particular user behavior
• Our question: instead of building systems without user behavior in mind, could we (learn, observe, and) exploit it in system design?
• Follow-up questions:
– What user behavior should we focus on?
– How to exploit it?
– Is this pattern particular to one type of file-sharing community or more general?
P2P Systems – 72
New Metric: The Data-Sharing Graph
[Figure: users linked by requests for common files, e.g. “Yellow Submarine”, “Les Bonbons”, “Wood Is a Pleasant Thing to Think About”, “No 24 in B minor, BWV 869”]
G^m_T(V, E):
– V is the set of users active during interval T
– An edge in E connects users that asked for at least m common files within T
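The definition of G^m_T translates directly into code; `data_sharing_graph` is an illustrative name, and the input is assumed to already be restricted to requests made during interval T:

```python
from itertools import combinations

def data_sharing_graph(requests, m=1):
    """Build the data-sharing graph G^m_T: `requests` maps each user
    active during interval T to the set of files they asked for; an edge
    joins every pair of users sharing at least m requested files."""
    edges = set()
    for u, v in combinations(sorted(requests), 2):
        if len(requests[u] & requests[v]) >= m:
            edges.add((u, v))
    return edges
```

Varying m and T produces the families of graphs analyzed on the following slides.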
P2P Systems – 73
The DØ Collaboration
• 6 months of traces (January – June 2002)
• 300+ users, 2 million requests for 200K files
[Figure: average path length of the 7-day, 50-file data-sharing graph, D0 vs. random: small average path length]
[Figure: clustering coefficient of the 7-day, 50-file data-sharing graph, D0 vs. random: large clustering coefficient]
CCoef = (# existing edges) / (# possible edges)
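The CCoef formula above, applied per node and averaged, gives the usual clustering coefficient; `clustering_coefficient` is an illustrative helper on a plain adjacency-set representation:

```python
def clustering_coefficient(adj):
    """Average local clustering coefficient: for each node, the fraction
    of possible edges among its neighbours that actually exist (the
    CCoef formula on this slide), averaged over nodes with degree >= 2."""
    total, counted = 0.0, 0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue                    # coefficient undefined for degree < 2
        links = sum(1 for a in nbrs for b in nbrs
                    if a < b and b in adj.get(a, set()))
        total += links / (k * (k - 1) / 2)
        counted += 1
    return total / counted if counted else 0.0
```

A triangle scores 1.0 and a simple path scores 0.0, so values near the random-graph baseline signal no clustering, while the D0 graphs score far above it.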
P2P Systems – 74
Small-World Graphs
• Small path length, large clustering coefficient
– Typically compared against random graphs
• Think of:
– “It’s a small world!”
– “Six degrees of separation”
• Milgram’s experiments in the 60s
• Guare’s play “Six Degrees of Separation”
P2P Systems – 75
Other Small Worlds
[Figure: path length ratio vs. clustering coefficient ratio (log scales) for known small worlds: food web, power grid, LANL coauthors, the Web, film actors, the Internet, word co-occurrences]
D. J. Watts and S. H. Strogatz, Collective dynamics of small-world networks. Nature, 393:440–442, 1998
R. Albert and A.-L. Barabási, Statistical mechanics of complex networks. Reviews of Modern Physics, 74, 47 (2002)
Web Data-Sharing Graphs
[Figure: Web data-sharing graphs built with (300 s, 1 file), (1800 s, 10 files), (7200 s, 50 files), (1800 s, 100 files), and (3600 s, 50 files) fall in the same region as other small-world graphs (clustering coefficient ratio, log scale)]
Data-Sharing Relationships in the Web, Iamnitchi, Ripeanu, and Foster, WWW’03
P2P Systems – 77
DØ Data-Sharing Graphs
[Figure: D0 data-sharing graphs, e.g. (28 days, 1 file) and (7 days, 1 file), plotted alongside the Web data-sharing graphs and other small-world graphs (clustering coefficient ratio, log scale)]
P2P Systems – 78
KaZaA Data-Sharing Graphs
[Figure: Kazaa data-sharing graphs, e.g. (2 hours, 1 file), (4 h, 2 files), (12 h, 4 files), and (1 day, 2 files), plotted alongside the D0, Web, and other small-world graphs (clustering coefficient ratio, log scale)]
Small-World File-Sharing Communities, Iamnitchi, Ripeanu, and Foster, Infocom ‘04
P2P Systems – 79
Overview
• Small-world file sharing communities:
– The data-sharing graph
– Traces from 3 file-sharing communities:
• D0, Kazaa, Web
– It’s a small world!
• Exploiting small-world patterns:
– Overlay construction
– Cluster identification
– Information dissemination
P2P Systems – 80
Exploiting Small-World Patterns
• Exploit the small-world properties of the data-sharing graph:
– Large clustering coefficient
– (… and small average path length)
• Interest-aware information dissemination
– Objective: dynamically identify groups of users with proven common interests in data
– Direct relevant information to groups of interest
• Case study: File location
– Concrete problem
– Real traces
– Specific performance metrics
– Real, new requirements
• Other mechanisms:
– Reputation mechanisms
– Replica placement
– …
P2P Systems – 81
Three steps: graph construction → clustering → dissemination
Interest-Aware Information Dissemination in Small-World Communities,
Iamnitchi and Foster, HPDC’05
P2P Systems – 82
Step 1: Graph Construction
Objective: make nodes aware of their common interests, without central control
[Figure: nodes exchange records of who downloaded which file at what time, e.g. ⟨A, T1, F⟩]
Log accesses when a file is downloaded (not when its location is requested!)
P2P Systems – 83
Step 2: Clustering
• (Extra) challenge: no global knowledge of the graph
• Idea: label edges
– Each node labels its edges as ‘short’ or ‘long’ based only on local information
• Multiple ways to define ‘short’/‘long’. An edge is “short” if:
– it is a dead end, or
– it is part of a triad
and “long” otherwise
[Figure: cluster size distribution in the Web trace for 2-, 5-, 15-, and 30-minute intervals]
• Skewed cluster size distribution
• Similar results obtained with a centralized algorithm
• Need solutions to limit cluster size
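The local edge-labelling rule can be sketched as follows; `label_edges` and `neighbor_sets` are illustrative names, and only information the labelling node could plausibly see locally is used:

```python
def label_edges(node, neighbor_sets):
    """Sketch of the short/long labelling above: node marks its edge to v
    'short' if v is a dead end (node is v's only neighbor) or if v closes
    a triad with another of node's neighbors; otherwise 'long'.
    `neighbor_sets` maps each node to its neighbor set."""
    my_neighbors = neighbor_sets[node]
    labels = {}
    for v in my_neighbors:
        in_triad = any(u != v and v in neighbor_sets.get(u, set())
                       for u in my_neighbors)
        dead_end = neighbor_sets.get(v, set()) == {node}
        labels[v] = "short" if in_triad or dead_end else "long"
    return labels
```

Dropping the “long” edges then leaves the densely clustered components that the dissemination step targets.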
P2P Systems – 84
P2P Systems – 85
Step 3: Information Dissemination (1)
Hit rate due to previous information dissemination within clusters: up to 70% (compared to under 5% for random groups of the same size)
[Figure: total hit rate, and hit rate excluding the largest cluster, for Web graphs (2- to 30-minute intervals), D0 graphs (3- to 28-day intervals), and Kazaa graphs (1- to 8-hour intervals)]
P2P Systems – 86
Step 3: Information Dissemination (2)
[Figure: CDF of the percentage of each file collection found locally due to information dissemination, for graphs built with (3 days, 10 files), (3 days, 100 files), (21 days, 100 files), and (21 days, 500 files)]
P2P Systems – 87
Small Worlds!
• The D0 experiment
• Web traces
• The KaZaA network
P2P Systems – 88
Where Are We?
• We saw the major solutions in unstructured P2P systems:
– Napster
– Gnutella
– BitTorrent
• And a solution that starts from usage patterns to get inspiration for system design
– Exploiting small-world patterns in file-sharing
• There are many other ideas for unstructured p2p networks, but there are also the structured p2p networks!
P2P Systems – 89