1
Unstructured P2P Systems
COP 6390.all
02/21/2011
Anda Iamnitchi
2
Google Trends: grid computing (blue), peer-to-peer (red), cloud computing (orange)
3
Passed the test of deployment and usability
Interesting phenomena at large scale
Scale: users, resources, geographical
Wide-spread user experience
Large-scale distributed application, unprecedented growth and popularity
KaZaA: 389 million downloads (1M/week), one of the most popular applications ever!
Heavily researched in the last 10 years with results in:
User behavior characterization
Scalability
Novel problems: reputation, trust, incentives for fairness
Commercial impact
Number of users for file-sharing applications (estimate: www.slyck.com, Sept ’06)
Application          Users
eDonkey              3,108,909
FastTrack (Kazaa)    2,114,120
Gnutella             2,899,788
Overnet              691,750
Filetopia            3,405
[Figure: nodes connected to one another through the Internet]
5
A distributed system architecture:
No centralized control (debatable: Napster?)
Nodes are symmetric in function (debatable: New Gnutella protocol?)
Large number of unreliable nodes
Initially identified with music file sharing
A number of definitions coexist:
Def 1 : “A class of applications that takes advantage of resources — storage, cycles, content, human presence — available at the edges of the Internet.”
Edges often turned off, without permanent IP addresses
Def 2 : “A class of decentralized, self-organizing distributed systems, in which all or most communication is symmetric.”
Lots of other definitions that fit in between
High capacity through parallelism:
Many disks
Many network connections
Many CPUs
Reliability:
Many replicas: of data, of network paths
Geographic distribution
Automatic configuration
Useful in public and proprietary settings
7
[Figure: Google Trends popularity, normalized to that of “Britney Spears” (1.00); “p2p” scores 0.40]
Napster: program for sharing files over the Internet
History:
5/99: Shawn Fanning (freshman, Northeastern U.) founds Napster online music service
12/99: first lawsuit
3/00: 25% of UWisc traffic is Napster
2000: est. 60M users
2/01: US Circuit Court of Appeals: Napster knew users were violating copyright laws
7/01: # simultaneous online users: Napster 160K, Gnutella 40K, Morpheus 300K
10
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
11
• Client-Server: Use central server to locate files
• Peer-to-Peer: Download files directly from peers
12
1. File list is uploaded by users to napster.com (Join and Publish).
2. User requests search at server, napster.com; the server returns results (Search).
3. User pings hosts that apparently have data. Looks for best transfer rate.
4. User retrieves file directly from the chosen user (Fetch).
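The four steps above can be sketched as a toy central index. This is only an illustration of the client-server lookup plus peer-to-peer download split; the class name, method names, peer addresses, and transfer rates are all hypothetical and not taken from the Napster protocol.

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical stand-in for the napster.com directory server."""

    def __init__(self):
        # file name -> set of peer addresses that advertise it
        self._holders = defaultdict(set)

    def join_and_publish(self, peer, file_names):
        """Step 1: a peer uploads its file list when it joins."""
        for name in file_names:
            self._holders[name].add(peer)

    def leave(self, peer):
        """Remove a departing peer from every entry of the index."""
        for holders in self._holders.values():
            holders.discard(peer)

    def search(self, name):
        """Step 2: the server answers with peers that claim to hold the file."""
        return sorted(self._holders.get(name, set()))


def pick_fastest(candidates, measured_rate):
    """Step 3: choose the candidate with the best measured transfer rate."""
    return max(candidates, key=measured_rate)


index = CentralIndex()
index.join_and_publish("peer-a:6699", ["song.mp3", "talk.mp3"])
index.join_and_publish("peer-b:6699", ["song.mp3"])

hits = index.search("song.mp3")
rates = {"peer-a:6699": 56, "peer-b:6699": 1500}  # made-up rates in kbps
best = pick_fastest(hits, rates.get)
# Step 4 (Fetch) would now download directly from `best`, bypassing the server.
```

Only the index is centralized here; the file itself never passes through the server.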
Strengths: Decentralization of Storage
Every node “pays” its participation by providing access to its resources
physical resources (disk, network), knowledge (annotations), ownership (files)
Every participating node acts as both a client and a server (“servent”): P2P
Decentralization of cost and administration = avoiding resource bottlenecks
Weaknesses: Centralization of Data Access Structures (Index)
Server is single point of failure
Unique entity required for controlling the system = design bottleneck
Copying copyrighted material made Napster the target of legal attack
[Figure: spectrum from centralized system to decentralized system, with increasing degree of resource sharing and decentralization]
16
17
18
Developed in 14 days as a “quick hack” by Nullsoft (Winamp)
Originally intended for “exchange of cooking recipes”
Evolution of Gnutella
Published under GNU General Public License on the Nullsoft web server
Taken off after a couple of hours by AOL (owner of Nullsoft): too late
Gnutella protocol was reverse engineered from the original
Protocol published
Third-party clients were published and Gnutella started to spread
Many iterations to fix poor initial design
High impact:
Many versions implemented
Many different designs
Lots of research papers/ideas
19
[Figure: flooding. The query “Where is file A?” is flooded from neighbor to neighbor; nodes holding the file send back the reply “I have file A.”]
Join: on startup, client contacts a few other nodes; these become its “neighbors”
Initial list of contacts published as gnutellahosts.com:6346
Outside the Gnutella protocol specification
Default value for number of open connections (neighbors): C = 4
Publish: no need
Search:
Flooding: ask neighbors, who ask their neighbors, and so on...
Each forwarding of requests decreases a TTL. Default: TTL = 7
When/if found, reply to sender
Drop forwarding requests when TTL expires
One request leads to $2 \cdot \sum_{i=0}^{TTL} C \cdot (C-1)^i = 26{,}240$ messages (for C = 4, TTL = 7)
Back-propagation in case of success (Why?)
Fetch: get the file directly from peer (HTTP)
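As a quick check of the message count for the default parameters, the sum can be evaluated directly. The function name is made up for this sketch; the formula and the defaults C = 4 and TTL = 7 are the ones given on the slide.

```python
def flood_messages(c=4, ttl=7):
    """Upper bound from the slide: 2 * sum_{i=0}^{ttl} c * (c - 1)**i."""
    return 2 * sum(c * (c - 1) ** i for i in range(ttl + 1))


print(flood_messages())  # 26240

# The count grows geometrically with the TTL.
for ttl in (1, 3, 5, 7):
    print(ttl, flood_messages(ttl=ttl))
```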
Type | Description | Contained Information
Ping | Announce availability and probe for other servents | None
Pong | Response to a ping | IP address and port# of responding servent; number and total kB of files shared
Query | Search request | Minimum network bandwidth of responding servent; search criteria
QueryHit | Returned by servents that have the requested file | IP address, port# and network bandwidth of responding servent; number of results and result set
Push | File download requests for servents behind a firewall | Servent identifier; index of requested file; IP address and port to send file to
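A minimal sketch of how the five message types might be framed and relayed. The slides do not give a wire format: the 23-byte header layout and the type codes below follow my recollection of the Gnutella v0.4 specification and should be treated as assumptions, and the helper names are hypothetical.

```python
import os
import struct
from enum import IntEnum


class Descriptor(IntEnum):
    """Message type codes (assumed from the Gnutella v0.4 specification)."""
    PING = 0x00
    PONG = 0x01
    PUSH = 0x40
    QUERY = 0x80
    QUERY_HIT = 0x81


# Assumed layout: 16-byte message ID, type, TTL, hops, payload length
# (little-endian, no padding), 23 bytes in total.
HEADER = struct.Struct("<16sBBBI")


def pack_header(kind, ttl=7, hops=0, payload_len=0, msg_id=None):
    """Build a header; a random message ID lets servents detect duplicates."""
    msg_id = msg_id or os.urandom(16)
    return HEADER.pack(msg_id, kind, ttl, hops, payload_len)


def forward(header):
    """Return the header as relayed by a servent, or None once the TTL expires."""
    msg_id, kind, ttl, hops, length = HEADER.unpack(header)
    if ttl <= 1:
        return None  # drop instead of forwarding
    return HEADER.pack(msg_id, kind, ttl - 1, hops + 1, length)


query = pack_header(Descriptor.QUERY, ttl=2, msg_id=b"\x01" * 16)
relayed = forward(query)
```

Each hop decrements the TTL and increments the hop count, so their sum stays constant along a path.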
22
Eavesdrop traffic - insert modified node into the network and log traffic.
Crawler - connect to active nodes and use the membership protocol to discover membership and topology.
23
24
[Figure: overlay of nodes with heterogeneous access links: 56 kbps modem, 1.5 Mbps DSL, 10 Mbps LAN]
25
Gnutella Network Structure: Improvement
Gnutella Protocol 0.6
Two-tier architecture of ultrapeers and leaves
[Figure legend: ultrapeers, leaves; data transfer (file download) vs. control messages (search, join, etc.)]
26
Déjà vu?
27
More than 25% of Gnutella clients share no files; 75% share 100 files or less
Conclusion: Gnutella has a high percentage of free riders
If only a few individuals contribute to the public good, these few peers effectively act as centralized servers.
Outcome:
Significant efforts in building incentive-based systems
BitTorrent?
Adar and Huberman (Aug ’00)
28
Seen request already
29
Expanding Ring
start search with small TTL (e.g. TTL = 1)
if no success iteratively increase TTL (e.g. TTL = TTL + 2)
k-Random Walkers
forward query to one randomly chosen neighbor only, with large TTL
start k random walkers
random walker periodically checks with requester whether to continue
Experiences (from simulation)
adaptive TTL is useful
message duplication should be avoided
flooding should be controlled at fine granularity
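The two search strategies above can be sketched on a small adjacency-list graph. This is a simplified simulation, not protocol code: the function names and the example graph are made up, the walkers run one after another instead of in parallel, and the periodic check with the requester is reduced to stopping as soon as one walker succeeds.

```python
import random


def flood(graph, source, holders, ttl):
    """TTL-limited flood with duplicate suppression. Returns (hits, messages)."""
    seen = {source}
    frontier = [source]
    hits, messages = set(), 0
    for _ in range(ttl):
        next_frontier = []
        for node in frontier:
            for neighbor in graph[node]:
                messages += 1
                if neighbor in seen:
                    continue  # request already seen: drop the duplicate
                seen.add(neighbor)
                if neighbor in holders:
                    hits.add(neighbor)
                next_frontier.append(neighbor)
        frontier = next_frontier
    return hits, messages


def expanding_ring(graph, source, holders, start_ttl=1, step=2, max_ttl=7):
    """Re-flood with a growing TTL until a holder is found."""
    total, ttl = 0, start_ttl
    while ttl <= max_ttl:
        hits, messages = flood(graph, source, holders, ttl)
        total += messages
        if hits:
            return hits, total, ttl
        ttl += step
    return set(), total, None


def random_walkers(graph, source, holders, k=4, ttl=64, rng=None):
    """Send up to k walkers, each forwarded to one random neighbor per hop."""
    rng = rng or random.Random()
    messages = 0
    for _ in range(k):
        node = source
        for _ in range(ttl):
            node = rng.choice(graph[node])
            messages += 1
            if node in holders:
                return node, messages
    return None, messages


# A five-node line 0-1-2-3-4 where only node 3 holds the file.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ring_result = expanding_ring(line, 0, {3})
walk_result = random_walkers(line, 0, {3}, rng=random.Random(1))
```

On this graph the ring search fails with TTL = 1 and succeeds with TTL = 3, so the cost of the failed first attempt is added to the total.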
30
31
Explosive growth in 2001, slowly shrinking thereafter
• High user interest
– Users tolerate high latency, low quality results
• Better resources
– DSL and cable modem nodes grew from 24% to 41% over first 6 months.
Power-law networks: the number of links per node follows a power-law distribution $N = L^{-k}$
[Figure: distribution of the number of links per node, November 2000; x-axis: number of links (log scale)]
Examples of power-law networks:
– The Internet at AS level
– In/out links to/from HTML pages
– Airports
– US power grid
– Social networks
Implications: High tolerance to random node failure but low reliability when facing an ‘intelligent’ adversary
33
[Figure: partial topology; after a random 30% of nodes die; after a targeted 4% of nodes die. From Saroiu et al., MMCN 2002]
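The random-versus-targeted comparison can be tried on a synthetic graph. The sketch below does not use the measured Gnutella topology: it grows a graph by preferential attachment, which is my own stand-in for a heavy-tailed link distribution, and all function names, sizes, and seeds are arbitrary. The 30% and 4% fractions are taken from the figure caption.

```python
import random


def preferential_attachment(n, m=2, seed=0):
    """Grow a graph in which new nodes prefer already well-connected nodes."""
    rng = random.Random(seed)
    edges = {i: set() for i in range(n)}
    pool = list(range(m))  # node ids, repeated once per incident link
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for target in chosen:
            edges[new].add(target)
            edges[target].add(new)
            pool.extend([new, target])
    return edges


def largest_component(edges, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    alive = set(edges) - set(removed)
    best = 0
    while alive:
        stack, size = [alive.pop()], 1
        while stack:
            node = stack.pop()
            for neighbor in edges[node]:
                if neighbor in alive:
                    alive.remove(neighbor)
                    stack.append(neighbor)
                    size += 1
        best = max(best, size)
    return best


g = preferential_attachment(2000)
rng = random.Random(1)
random_dead = rng.sample(sorted(g), k=int(0.30 * len(g)))
by_degree = sorted(g, key=lambda v: len(g[v]), reverse=True)
targeted_dead = by_degree[: int(0.04 * len(g))]

print("random 30% removed:  ", largest_component(g, random_dead))
print("targeted 4% removed: ", largest_component(g, targeted_dead))
```

How strongly the targeted removal fragments the graph depends on the generator and its parameters, so the printed sizes are a prompt for experimentation and not a reproduction of the published result.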
Later, larger networks display a bimodal distribution
Implications :
– High tolerance to random node failures preserved
– Increased reliability when facing an attack.
[Figure: distribution of the number of links per node, May 2001; x-axis: number of links (log scale). From Ripeanu, Iamnitchi, Foster, 2002]
35
Performance
Search latency: low (graph properties)
Message Bandwidth: high
improvements through random walkers, but essentially the whole network needs to be explored
Storage cost: low (only local neighborhood)
Update cost: low (only local updates)
Resilience to failures: good (multiple paths are explored and data is replicated)
Qualitative Criteria
search predicates: very flexible, any predicate is possible
global knowledge: none required
peer autonomy: high
36
37
Torrent File
Metadata of file to be shared
Address of tracker
List of pieces and their checksums
Tracker
Lists peers interested in the distribution of the file
Peers
Clients interested in the distribution of the file
Can be “seeds” or “leechers”
38
• A “seed” node has the file
• A “tracker” associated with the file
• A “.torrent” meta-file is built for the file: identifies the address of the tracker node
• The .torrent file is published on web
• File is split into fixed-size segments (e.g., 256KB)
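A rough sketch of what building the metadata involves: split the file into fixed-size pieces and record one checksum per piece, so a downloader can verify each piece independently. BitTorrent uses SHA-1 for piece checksums; everything else here is simplified. The dictionary layout and key names are invented for illustration (a real .torrent file is bencoded), and the tracker URL is a placeholder.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # 256 KB, the example size from the slide


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file contents into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def build_metadata(name, data, tracker_url, piece_size=PIECE_SIZE):
    """Collect what the slide lists: file metadata, tracker address, checksums."""
    pieces = split_into_pieces(data, piece_size)
    return {
        "name": name,
        "length": len(data),
        "tracker": tracker_url,
        "piece_size": piece_size,
        "checksums": [hashlib.sha1(piece).hexdigest() for piece in pieces],
    }


def verify_piece(metadata, index, piece):
    """Check a downloaded piece against the published checksum."""
    return hashlib.sha1(piece).hexdigest() == metadata["checksums"][index]


data = bytes(600 * 1024)  # 600 KB of zero bytes standing in for a real file
meta = build_metadata("demo.bin", data, "http://tracker.example/announce")
```

Because every piece carries its own checksum, a peer can accept pieces from untrusted peers and discard only the ones that fail verification.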
39
Each connected peer is in one of two states
Choked: Download requests by a choked peer are ignored
Unchoked: Download requests by an unchoked peer are honored
Choking occurs at the peer level
Each peer has a certain number of unchoke slots
4 regular unchoke slots (per BitTorrent standard)
1 optimistic unchoke slot (per BitTorrent standard)
Choking Algorithm
Peers unchoke connected peers with best service rate
Service rate = rolling 20 second average of its upload bandwidth
Optimistically unchoking peers prevents a static set of unchoked peers
The choking algorithm runs every 10 seconds
Peers optimistically unchoked every 30 seconds
New peers are 3 times more likely to be optimistically unchoked
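One round of the choking decision might look like the sketch below. The slot counts and the 3x weight for new peers come from the slide; the function name, the input format, and the example rates are assumptions, and the scheduling (rerunning every 10 seconds, rotating the optimistic slot every 30 seconds) is left out.

```python
import random

REGULAR_SLOTS = 4      # regular unchoke slots
OPTIMISTIC_SLOTS = 1   # optimistic unchoke slot
NEW_PEER_WEIGHT = 3    # new peers are 3 times more likely to be picked


def choose_unchoked(service_rate, new_peers=(), rng=None):
    """Pick the peers to unchoke for one round.

    service_rate maps each connected peer to its rolling average rate.
    Returns (regular, optimistic) lists of peers.
    """
    rng = rng or random.Random()
    ranked = sorted(service_rate, key=service_rate.get, reverse=True)
    regular = ranked[:REGULAR_SLOTS]
    choked = ranked[REGULAR_SLOTS:]
    optimistic = []
    if choked:
        weights = [NEW_PEER_WEIGHT if p in new_peers else 1 for p in choked]
        optimistic = rng.choices(choked, weights=weights, k=OPTIMISTIC_SLOTS)
    return regular, optimistic


rates = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10, "f": 5}  # made-up rates
regular, optimistic = choose_unchoked(rates, new_peers={"f"}, rng=random.Random(0))
```

The optimistic slot is what lets a peer with nothing to offer yet, such as "f" here, receive its first pieces and start reciprocating.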
40
Random First Piece
Pieces are downloaded at random
Algorithm used by new peers
Rarest Piece First
Ensures more than one distributed copy of each piece
Increases interest of connected peers
Increases scalability
Random Piece vs. Rarest Piece
Rarest piece has a probabilistically high download time
New peers want to reduce download time but also increase their interest
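The two policies can be combined in a single selection function, sketched below. Treating a peer as "new" when it holds no pieces at all is a simplification of mine, and the function name and the set-based piece maps are hypothetical.

```python
import random
from collections import Counter


def next_piece(have, peer_pieces, rng=None):
    """Choose the next piece index to request, or None if nothing is needed.

    have: set of piece indices already downloaded.
    peer_pieces: maps each connected peer to the set of pieces it holds.
    """
    rng = rng or random.Random()
    availability = Counter()
    for pieces in peer_pieces.values():
        availability.update(pieces)  # one count per peer holding the piece
    wanted = sorted(p for p in availability if p not in have)
    if not wanted:
        return None
    if not have:
        return rng.choice(wanted)  # random first piece for a new peer
    rarest = min(availability[p] for p in wanted)
    return rng.choice([p for p in wanted if availability[p] == rarest])


peers = {"x": {0, 1, 2}, "y": {0, 1}, "z": {0}}
choice = next_piece({0}, peers)  # piece 2 is held by only one peer
```

A new peer skips the rarest-first rule so that it quickly obtains some piece it can trade, after which it helps spread the pieces that are least replicated.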
41
Join: nothing
Just find that there is a community ready to host your tracker
Publish: Create tracker, upload .torrent metadata file
Search:
For file: nothing
the community is supposed to provide search tools
For segments: exchange segment ID maps with other peers.
Fetch: exchange segments with other peers (HTTP)
42
Architecture
Decentralization?
System properties
Reliability?
Scalability?
Fairness?
Overheads?
Quality of Service
Search coverage for content?
Ability to download content fast?