Unstructured P2P

advertisement
Topics in Database Systems: Data Management in
Peer-to-Peer Systems
Search in Unstructured P2p
p2p, Fall 06
1
Topics in Database Systems: Data Management in
Peer-to-Peer Systems
D. Tsoumakos and N. Roussopoulos, “A Comparison of Peer-toPeer Search Methods”, WebDB03
p2p, Fall 06
2
Overview
 Centralized
Constantly-updated directory hosted at central locations (do
not scale well, updates, single points of failure)
 Decentralized but structured
The overlay topology is highly controlled and files (or
metadata/index) are not placed at random nodes but at
specified locations
Decentralized and Unstructured
peers connect in an ad-hoc fashion
the location of document/metadata is not controlled by the system
 No guarantee for the success of a search
 No bounds on search time
 No maintenance cost
 Any kind of query (not just single key or range queries)
p2p, Fall 06
3
Flooding on Overlays
xyz.mp3
xyz.mp3 ?
p2p, Fall 06
4
Flooding on Overlays
xyz.mp3
xyz.mp3 ?
Flooding
p2p, Fall 06
5
Flooding on Overlays
xyz.mp3
xyz.mp3 ?
Flooding
p2p, Fall 06
6
Flooding on Overlays
xyz.mp3
p2p, Fall 06
7
Search in Unstructured P2P
Must find a way to stop the search: Time-to-Leave (TTL)
Exponential Number of Messages
Cycles (?)
Note: cycles can be detected but not avoided
p2p, Fall 06
8
Search in Unstructured P2P
BFS vs DFS
BFS better response time, larger number of nodes
(message overhead per node and overall)
Note: search in BFS continues (if TTL is not reached), even
if the object has been located on a different path
Recursive vs Iterative
During search, whether the node issuing the query directly
contacts others, or recursively.
Does the result follows the same path?
p2p, Fall 06
9
Iterative vs. Recursive Routing
Iterative: Originator requests IP address of each hop
• Message transport is actually done via direct IP
Recursive: Message transferred hop-by-hop
K V
K V
K V
K V
K V
K V
K V
K V
K V
K V
K V
retrieve (K1)
p2p, Fall 06
10
Search in Unstructured P2P
Two general types of search in unstructured p2p:
Blind: try to propagate the query to a sufficient number of
nodes (example Gnutella)
Informed: utilize information about document locations (example
Routing Indexes)
Informed search increases the cost of join for
an improved search cost
p2p, Fall 06
11
Blind Search Methods
Gnutella:
Use flooding (BFS) to contact all accessible nodes within the
TTL value
Huge overhead to a large number of peers +
Overall network traffic
Hard to find unpopular items
Up to 60% bandwidth consumption of the total Internet
traffic
p2p, Fall 06
12
Free-riding on Gnutella [Adar00]
•
•
•
•
p2p, Fall 06
24 hour sampling period:
– 70% of Gnutella users share no files
– 50% of all responses are returned by top 1% of sharing
hosts
A social problem not a technical one
Problems:
– Degradation of system performance: collapse?
– Increase of system vulnerability
– “Centralized” (“backbone”) Gnutella  copyright issues?
Verified hypotheses:
– H1: A significant portion of Gnutella peers are free riders
– H2: Free riders are distributed evenly across domains
– H3: Often hosts share files nobody is interested in (are
not downloaded)
13
Free-riding Statistics - 1 [Adar00]
H1: Most Gnutella users are free riders
Of 33,335 hosts:
– 22,084 (66%) of the peers share no files
– 24,347 (73%) share ten or less files
– Top 1 percent (333) hosts share 37% (1,142,645) of total files shared
– Top 5 percent (1,667) hosts share 70% (1,142,645) of total files shared
– Top 10 percent (3,334) hosts share 87% (2,692,082) of total files shared
p2p, Fall 06
14
Free-riding Statistics - 2 [Adar00]
H3: Many servents share files nobody downloads
Of 11,585 sharing hosts:
– Top 1% of sites provide nearly 47% of all answers
– Top 25% of sites provide 98% of all answers
– 7,349 (63%) never provide a query response
p2p, Fall 06
15
Free Riders
• File sharing studies
– Lots of people download
– Few people serve files
• Is this bad?
– If there’s no incentive to serve, why do people do so?
– What if there are strong disincentives to being a major
server?
p2p, Fall 06
16
Simple Solution: Thresholds
• Many programs allow a threshold to be set
– Don’t upload a file to a peer unless it shares > k files
• Problems:
– What’s k?
– How to ensure the shared files are interesting?
p2p, Fall 06
17
Categories of Queries [Sripanidkulchai01]
Categorized top 20 queries
p2p, Fall 06
18
Popularity of Queries [Sripanidkulchai01]
•
•
•
Very popular documents are approximately equally popular
Less popular documents follow a Zipf-like distribution (i.e., the
probability of seeing a query for the ith most popular query is
proportional to 1/(ialpha))
Access frequency of web documents also follows Zipf-like distributions
 caching might also work for Gnutella
p2p, Fall 06
19
Caching in Gnutella [Sripanidkulchai01]
•
•
Average bandwidth consumption in tests: 3.5Mbps
Best case: trace 2 (73% hit rate = 3.7 times traffic reduction)
p2p, Fall 06
20
Topology of Gnutella [Jovanovic01]
• Power-law properties verified (“find everything close by”)
• Backbone + outskirts
Power-Law
(PLRG):
Random
Graph
The node degrees follow a
power law distribution:
if one ranks all nodes from the
most connected to the least
connected, then
the i’th most connected node
has ω/ia neighbors,
where w is a constant.
p2p, Fall 06
21
Gnutella Backbone [Jovanovic01]
p2p, Fall 06
22
Why does it work? It’s a small World! [Hong01]
•
•
Milgram: 42 out of 160 letters from Oregon to Boston (~ 6 hops)
Watts: between order and randomness
– short-distance clustering + long-distance shortcuts
Regular graph:
n nodes, k nearest neighbors
 path length ~ n/2k
4096/16 = 256
p2p, Fall 06
Rewired graph (1% of nodes):
path length ~ random graph
clustering ~ regular graph
Random graph:
path length ~ log (n)/log(k)
~4
23
Links in the small World [Hong01]
•
“Scale-free” link distribution
– Scale-free: independent of the total number of nodes
– Characteristic for small-world networks
– The proportion of nodes having a given number of links n is:
P(n) = 1 /n k
– Most nodes have only a few connections
– Some have a lot of links: important for binding disparate regions
together
p2p, Fall 06
24
Freenet: Links in the small World [Hong01]
P(n) ~ 1/n 1.5
p2p, Fall 06
25
Gnutella: “New” Measurements
[1] Stefan Saroiu, P. Krishna Gummadi, Steven D. Gribble:
A Measurement Study of Peer-to-Peer File Sharing Systems,
Proceedings of Multimedia Computing and Networking (MMCN)
2002, San Jose, CA, USA, January 2002.
[2] M. Ripeanu, I. Foster, and A. Iamnitchi.
Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for
system design.
IEEE Internet Computing Journal, 6(1), 2002
[3] Evangelos P. Markatos,
Tracing a large-scale Peer to Peer System: an hour in the life of Gnutella,
2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002.
[4] Y. HawatheAWATHE, S. Ratnasamy, L. Breslau, and S. Shenker.
Making Gnutella-like P2P Systems Scalable. In Proc. ACM SIGCOMM (Aug. 2003).
[5] Qin Lv, Pei Cao, Edith Cohen, Kai Li, Scott Shenker:
Search and replication in unstructured peer-to-peer networks. ICS 2002: 84-95
p2p, Fall 06
26
Gnutella: Bandwidth Barriers
•
•
Clip2 measured Gnutella over 1 month:
– typical query is 560 bits long (including TCP/IP headers)
– 25% of the traffic are queries, 50% pings, 25% other
– on average each peer seems to have 3 other peers actively connected
Clip2 found a scalability barrier with substantial performance degradation if
queries/sec > 10:
10 queries/sec
* 560 bits/query
* 4 (to account for the other 3 quarters of message traffic)
* 3 simultaneous connections
67,200 bps
 10 queries/sec maximum in the presence of many dialup users
 won’t improve (more bandwidth - larger files)
p2p, Fall 06
27
Gnutella: Summary
•
•
•
•
•
Completely decentralized
Hit rates are high
High fault tolerance
Adopts well and dynamically to changing peer populations
Protocol causes high network traffic (e.g., 3.5Mbps). For example:
– 4 connections C / peer, TTL = 7
TTL
i  26,240
2
*
C
*
(
C

1
)
– 1 ping packet can cause packets
i 0
No estimates on the duration of queries can be given
No probability for successful queries can be given
Topology is unknown  algorithms cannot exploit it
Free riding is a problem
Reputation of peers is not addressed
Simple, robust, and scalable (at the moment)

•
•
•
•
•
•
p2p, Fall 06
28
Lessons and Limitations
•
Client-Server performs well
– But not always feasible
• Ideal performance is often not the key issue!
•
Things that flood-based systems do well
– Organic scaling
– Decentralization of visibility and liability
– Finding popular stuff (e.g., caching)
– Fancy local queries
•
Things that flood-based systems do poorly
– Finding unpopular stuff [Loo, et al VLDB 04]
– Fancy distributed queries
– Vulnerabilities: data poisoning, tracking, etc.
– Guarantees about anything (answer quality, privacy, etc.)
p2p, Fall 06
29
Comparison
Gnutella
Expressivness
Comprehensivness
Autonomy
Efficiency
Robustness
Others?





Topology
pwr law
Data Placement
arbitrary
Message Routing
flooding
p2p, Fall 06
CAN
30
Comparison
Gnutella
CAN










Topology
pwr law
grid
Data Placement
arbitrary
hashing
Message Routing
flooding
directed
Expressivness
Comprehensivness
Autonomy
Efficiency
Robustness
p2p, Fall 06
Others?
31
Security & Privacy
• Issues:
– Anonymity
– Reputation
– Accountability
– Information Preservation
– Information Quality
– Trust
– Denial of service attacks
p2p, Fall 06
32
Authenticity
title: origin of species
author: charles darwin
?
date: 1859
body: In an island far,
far away ...
...
p2p, Fall 06
33
More than Just File Integrity
title: origin of species
author: charles darwin
?
date: 1859 00
body: In an island far,
far away ...
checksum
p2p, Fall 06
34
More than Fetching One File
T=origin
Y=?
A=darwin
B=?
T=origin
Y=1800
A=darwin
p2p, Fall 06
T=origin T=origin
Y=1859
Y=1859
A=darwin A=darwin
B=abcd
T=origin
Y=1859
A=darwin
35
Solutions
•
•
•
Authenticity Function A(doc): T or F
– at expert sites, at all sites?
– can use signature expert
sig(doc)
Voting Based
– authentic is what majority says
Time Based
– e.g., oldest version (available) is authentic
p2p, Fall 06
user
36
Issues
•
•
•
•
•
•
Trust computations in dynamic system
Overloading good nodes
Bad nodes can provide good content sometimes
Bad nodes can build up reputation
Bad nodes can form collectives
...
p2p, Fall 06
37
Back to searching
p2p, Fall 06
38
Blind Search Methods
Modified-BFS:
Choose only a ratio of the neighbors (some random subset)
Iterative Deepening:
Start BFS with a small TTL and repeat the BFS at
increasing depths if the first BFS fails
Works well when there is some stop condition and a
“small” flood will satisfy the query
Else even bigger loads than standard flooding
(more later …)
p2p, Fall 06
39
Random Walks:
Blind Search Methods
The node that poses the query sends out k query messages to an
equal number of randomly chosen neighbors
Each step follows each own path at each step randomly choosing
one neighbor to forward it
Each path – a walker
Two methods to terminate each walker:
 TTL-based or
 checking method (the walkers periodically check with the query source if the
stop condition has been met)
It reduces the number of messages to k x TTL in the worst case
Some kind of local load-balancing
p2p, Fall 06
40
Blind Search Methods
Random Walks:
In addition, the protocol bias its walks towards high-degree
nodes (choose the highest degree neighbor)
p2p, Fall 06
41
Blind Search Methods
Using Super-nodes:
Super (or ultra) peers are connected to each other
Each super-peer is also connected with a number of leaf nodes
Routing among the super-peers
The super-peers then contact their leaf nodes
p2p, Fall 06
42
Blind Search Methods
Using Super-nodes:
Gnutella2
When a super-peer (or hub) receives a query from a leaf, it
forwards it to its relevant leaves and to neighboring super-peers
The hubs process the query locally and forward it to their
relevant leaves
Neighboring super-peers regularly exchange local repository
tables to filter out traffic between them
p2p, Fall 06
43
Blind Search Methods
Ultrapeers can be installed (KaZaA) or self-promoted (Gnutella)
Interconnection between
the superpeers
p2p, Fall 06
44
Informed Search Methods
Local Index
Each node indexes all files stored at all nodes within a certain
radius r and can answer queries on behalf of them
Search process at steps of r, hop distance between two
consecutive searches 2r+1
Increased cost for join/leave
• Flood inside each r with TTL = r, when join/leave the network
p2p, Fall 06
45
Informed Search Methods
Intelligent BFS
query
...
?
Nodes store simple statistics on its neighbors:
(query, NeigborID) tuples for recently answered requests from or
through their neighbors
so they can rank them
For each query, a node finds similar ones and selects a direction
How?
p2p, Fall 06
46
Informed Search Methods
Intelligent or Directed BFS
query
•
...
?
Heuristics for Selecting Direction
>RES: Returned most results for previous queries
<TIME: Shortest satisfaction time
<HOPS: Min hops for results
>MSG: Forwarded the largest number of messages (all types),
suggests that the neighbor is stable
<QLEN: Shortest queue
<LAT: Shortest latency
>DEG: Highest degree
p2p, Fall 06
47
Informed Search Methods
Intelligent or Directed BFS
• No negative feedback
• Depends on the assumption that nodes specialize in certain
documents
p2p, Fall 06
48
Informed Search Methods
APS
Again, each node keeps a local index with one entry for each object it has
requested per neighbor –
this reflects the relative probability of the node to be chosen to forward
the query
k independent walkers and probabilistic forwarding
Each node forwards the query to one of its neighbor based on the local
index (for each object, choose a neighbor using the stored probability)
If a walker, succeeds the probability is increased, else is decreased –
Take the reverse path to the requestor and update the probability, after a
walker miss (optimistic update) or after a hit (pessimistic update)
p2p, Fall 06
49
Download