Scalable peer-to-peer substrates:
A new foundation for distributed
applications?
Peter Druschel, Rice University
Antony Rowstron,
Microsoft Research Cambridge, UK
Collaborators:
Miguel Castro, Anne-Marie Kermarrec, MSR Cambridge
Y. Charlie Hu, Sitaram Iyer, Animesh Nandi, Atul Singh,
Dan Wallach, Rice University
Outline
• Background
• Pastry
• Pastry proximity routing
• PAST
• SCRIBE
• Conclusions
Background
Peer-to-peer systems
• distribution
• decentralized control
• self-organization
• symmetry (communication, node roles)
Peer-to-peer applications
• Pioneers: Napster, Gnutella, FreeNet
• File sharing: CFS, PAST [SOSP’01]
• Network storage: FarSite [Sigmetrics’00], OceanStore [ASPLOS’00], PAST [SOSP’01]
• Web caching: Squirrel [PODC’02]
• Event notification/multicast: Herald [HotOS’01], Bayeux [NOSSDAV’01], CAN-multicast [NGC’01], SCRIBE [NGC’01], SplitStream [submitted]
• Anonymity: Crowds [CACM’99], Onion routing [JSAC’98]
• Censorship resistance: Tangler [CCS’02]
Common issues
• Organize, maintain overlay network
– node arrivals
– node failures
• Resource allocation/load balancing
• Resource location
• Network proximity routing
Idea: provide a generic p2p substrate
Architecture
[Diagram: p2p application layer (network storage, event notification, ...) on top of a p2p substrate (Pastry, a self-organizing overlay network), running over the Internet (TCP/IP).]
Structured p2p overlays
One primitive:
route(M, X): route message M to the live node with nodeId closest to key X
• nodeIds and keys are from a large, sparse id space
Distributed Hash Tables (DHT)
[Figure: key/value pairs (k1,v1) … (k6,v6) mapped onto the nodes of the p2p overlay network.]
Operations: insert(k,v), lookup(k)
• p2p overlay maps keys to nodes
• completely decentralized and self-organizing
• robust, scalable
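To make the mapping concrete, here is a hedged Java sketch of a DHT layered on the route(M, X) primitive; Overlay, DhtNode and the message records are illustrative names and signatures, not the actual Pastry/PAST API.

import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

// Minimal DHT sketch over a generic route(M, X) primitive.
// Overlay and the message records below are illustrative assumptions.
interface Overlay {
    void route(Object msg, BigInteger key);   // deliver msg at the live node closest to key
    BigInteger localId();
}

record Put(BigInteger key, byte[] value) {}
record Get(BigInteger key, BigInteger requester) {}
record Value(BigInteger key, byte[] value) {}

class DhtNode {
    private final Map<BigInteger, byte[]> store = new HashMap<>();
    private final Overlay overlay;

    DhtNode(Overlay overlay) { this.overlay = overlay; }

    // insert(k, v): route the value to the node whose nodeId is closest to k
    void insert(BigInteger key, byte[] value) { overlay.route(new Put(key, value), key); }

    // lookup(k): route a request to the responsible node, which replies to the requester
    void lookup(BigInteger key) { overlay.route(new Get(key, overlay.localId()), key); }

    // Invoked by the overlay when the local node is the one closest to the message key.
    void deliver(Object msg) {
        if (msg instanceof Put p) {
            store.put(p.key(), p.value());
        } else if (msg instanceof Get g) {
            overlay.route(new Value(g.key(), store.get(g.key())), g.requester());
        }
        // a Value message arriving here would be handed back to the application
    }
}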
Why structured p2p overlays?
• Leverage pooled resources (storage,
bandwidth, CPU)
• Leverage resource diversity (geographic,
ownership)
• Leverage existing shared infrastructure
• Scalability
• Robustness
• Self-organization
Outline
• Background
• Pastry
• Pastry proximity routing
• PAST
• SCRIBE
• Conclusions
Pastry: Related work
• Chord [Sigcomm’01]
• CAN [Sigcomm’01]
• Tapestry [TR UCB/CSD-01-1141]
• PNRP [unpub.]
• Viceroy [PODC’02]
• Kademlia [IPTPS’02]
• Small World [Kleinberg ’99, ‘00]
• Plaxton Trees [Plaxton et al. ’97]
Pastry: Object distribution
[Figure: 128-bit circular id space from 0 to 2^128-1, with nodeIds and objIds placed on the circle.]
• Consistent hashing [Karger et al. ‘97]
• 128-bit circular id space
• nodeIds (uniform random)
• objIds (uniform random)
• Invariant: the node with the numerically closest nodeId maintains the object
Pastry: Object insertion/lookup
[Figure: Route(X) on the circular id space.]
• A message with key X is routed to the live node with nodeId closest to X
• Problem: a complete routing table is not feasible
Pastry: Routing
Tradeoff
• O(log N) routing table size
• O(log N) message forwarding steps
Pastry: Routing table (node 65a1fcx)
Row 0:  0x   1x   2x   3x   4x   5x   --   7x   8x   9x   ax   bx   cx   dx   ex   fx
Row 1:  60x  61x  62x  63x  64x  --   66x  67x  68x  69x  6ax  6bx  6cx  6dx  6ex  6fx
Row 2:  650x 651x 652x 653x 654x 655x 656x 657x 658x 659x --   65bx 65cx 65dx 65ex 65fx
Row 3:  65a0x --   65a2x 65a3x 65a4x 65a5x 65a6x 65a7x 65a8x 65a9x 65aax 65abx 65acx 65adx 65aex 65afx
log16 N rows; an x stands for an arbitrary suffix, and "--" marks the column of the local nodeId's own digit, for which no entry is kept.
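A plausible in-memory layout for such a table, as a hedged Java sketch (not the prototype's actual classes): a two-dimensional array indexed by shared prefix length (row) and the next digit of the destination (column), with 128/b rows and 2^b columns for digit size b (here b = 4).

// Sketch of a Pastry-style routing table for b = 4 (hexadecimal digits).
// NodeHandle is an assumed placeholder for a (nodeId, address) pair.
class RoutingTable {
    static final int B = 4;                      // digit width in bits
    static final int COLS = 1 << B;              // 16 entries per row
    static final int ROWS = 128 / B;             // 128-bit ids -> 32 rows; only about
                                                 // log16 N rows are populated in practice

    private final NodeHandle[][] entries = new NodeHandle[ROWS][COLS];
    private final String localId;                // hex string, e.g. "65a1fc..."

    RoutingTable(String localId) { this.localId = localId; }

    // The entry at row l, column d refers to a node whose nodeId shares the first
    // l digits with the local nodeId and has d as its (l+1)-th digit.
    NodeHandle get(int sharedPrefixLen, int nextDigit) {
        return entries[sharedPrefixLen][nextDigit];
    }

    void put(int sharedPrefixLen, int nextDigit, NodeHandle candidate) {
        // The real protocol prefers the candidate closest in the proximity metric;
        // this sketch simply installs the entry.
        entries[sharedPrefixLen][nextDigit] = candidate;
    }

    // Number of leading hex digits the local nodeId shares with a key.
    int sharedPrefixLength(String key) {
        int l = 0;
        while (l < Math.min(localId.length(), key.length())
                && localId.charAt(l) == key.charAt(l)) l++;
        return l;
    }
}

record NodeHandle(String nodeId, String address) {}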
Pastry: Routing
[Figure: node 65a1fc routes a message with key d46a1c; successive hops such as d13da3, d4213f, d462ba, d467c4 share progressively longer prefixes (d, d4, d46, ...) with the key, and the message arrives at the live node with nodeId numerically closest to d46a1c.]
Properties
• log16 N steps
• O(log N) state
Pastry: Leaf sets
Each node maintains IP addresses of the
nodes with the L/2 numerically closest
larger and smaller nodeIds, respectively.
• routing efficiency/robustness
• fault detection (keep-alive)
• application-specific local coordination
Pastry: Routing procedure
if (destination D is within range of our leaf set)
    forward to the numerically closest leaf set member
else
    let l = length of the prefix shared between D and the local nodeId
    let d = value of the l-th digit of D
    if (the routing table entry R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
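The same decision logic rendered as a hedged Java method; it reuses the RoutingTable and NodeHandle types sketched with the routing table slide and assumes a small LeafSet helper, so it is a sketch rather than the prototype's code.

// Hedged Java rendering of the routing procedure above.
import java.math.BigInteger;

class Router {
    interface LeafSet {
        boolean covers(String key);                    // key lies within the leaf set range
        NodeHandle numericallyClosest(String key);
    }

    static NodeHandle nextHop(String key, String localId, LeafSet leafSet,
                              RoutingTable table, Iterable<NodeHandle> knownNodes) {
        if (leafSet.covers(key)) {
            return leafSet.numericallyClosest(key);    // final hop: closest leaf set member
        }
        int l = table.sharedPrefixLength(key);         // digits shared with the local nodeId
        int d = Character.digit(key.charAt(l), 16);    // next digit of the key
        NodeHandle entry = table.get(l, d);
        if (entry != null) {
            return entry;                              // usual case: match one more digit
        }
        // Rare case: use any known node that shares at least as long a prefix with the
        // key and is numerically closer to it than the local node.
        for (NodeHandle n : knownNodes) {
            if (sharedPrefix(n.nodeId(), key) >= l && closer(n.nodeId(), localId, key)) {
                return n;
            }
        }
        return null;                                   // the local node is the destination
    }

    static int sharedPrefix(String a, String b) {
        int l = 0;
        while (l < Math.min(a.length(), b.length()) && a.charAt(l) == b.charAt(l)) l++;
        return l;
    }

    static boolean closer(String candidate, String local, String key) {
        BigInteger k = new BigInteger(key, 16);
        return new BigInteger(candidate, 16).subtract(k).abs()
                .compareTo(new BigInteger(local, 16).subtract(k).abs()) < 0;
    }
}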
Pastry: Performance
Integrity of overlay / message delivery:
• guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
Number of routing hops:
• No failures: < log16 N expected, 128/b + 1 max
• During failure recovery:
– O(N) worst case, average case much better
Pastry: Self-organization
Initializing and maintaining routing tables and
leaf sets
• Node addition
• Node departure (failure)
Pastry: Node addition
[Figure: the new node d46a1c asks a nearby existing node (65a1fc) to route a join message with key d46a1c; the message is routed (via d13da3, d4213f, d462ba) towards d467c4 and d471f1, the existing nodes with nodeIds closest to the new node.]
Node departure (failure)
Leaf set members exchange keep-alive
messages
• Leaf set repair (eager): request set from
farthest live node in set
• Routing table repair (lazy): get table from
peers in the same row, then higher rows
Pastry: Experimental results
Prototype
• implemented in Java
• emulated network
• deployed testbed (currently ~25 sites
worldwide)
Pastry: Average # of hops
[Plot: average number of routing hops (0 to 4.5) vs. number of nodes (1,000 to 100,000); curves for Pastry and log(N). L=16, 100k random queries.]
Pastry: # of hops (100k nodes)
[Histogram: probability vs. number of routing hops (0-6) for 100,000 nodes; L=16, 100k random queries. The distribution peaks at 4 hops (probability 0.6449), and virtually no lookup takes more than 5 hops.]
Pastry: # routing hops (failures)
[Bar chart: average hops per lookup with 5k nodes, 500 failures, L=16, 100k random queries: 2.73 with no failures, 2.96 immediately after the failures, 2.74 after routing table repair.]
Outline
• Background
• Pastry
• Pastry proximity routing
• PAST
• SCRIBE
• Conclusions
Pastry: Proximity routing
Assumption: scalar proximity metric
• e.g. ping delay, # IP hops
• a node can probe distance to any other node
Proximity invariant:
Each routing table entry refers to a node close to
the local node (in the proximity space), among
all nodes with the appropriate nodeId prefix.
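Maintaining the invariant amounts to preferring, for every routing table slot, the qualifying node that is closest under the proximity metric. A hedged sketch follows, reusing the RoutingTable and NodeHandle types assumed earlier; ProximityMetric stands for the assumed distance probe (e.g. measured ping delay or IP hop count).

// Hedged sketch: keep the proximity-closest qualifying candidate in each slot.
class ProximityMaintenance {
    interface ProximityMetric {
        double distanceTo(NodeHandle n);      // measured from the local node
    }

    static void considerCandidate(int row, int col, NodeHandle candidate,
                                  RoutingTable table, ProximityMetric proximity) {
        NodeHandle current = table.get(row, col);
        if (current == null
                || proximity.distanceTo(candidate) < proximity.distanceTo(current)) {
            table.put(row, col, candidate);   // candidate is closer in the proximity space
        }
    }
}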
Pastry: Routes in proximity space
[Figure: Route(d46a1c) from node 65a1fc shown both in the proximity space and in the nodeId space (nodes d13da3, d4213f, d462ba, d467c4, d471f1).]
Pastry: Distance traveled
[Plot: relative distance traveled (0.8 to 1.4) vs. number of nodes (1,000 to 100,000), Pastry compared with a complete routing table; L=16, 100k random queries, Euclidean proximity space.]
Pastry: Locality properties
1) The expected distance traveled by a message in the proximity space is within a small constant of the minimum
2) Routes of messages sent by nearby nodes with the same key converge at a node near the source nodes
3) Among the k nodes with nodeIds closest to the key, a message is likely to reach the node closest to the source node first
Pastry: Node addition
[Figure: node addition shown in both the proximity space and the nodeId space; the new node d46a1c routes its join message with key d46a1c from the nearby node 65a1fc via d13da3, d4213f, d462ba towards d467c4 and d471f1.]
Pastry delay vs IP delay
[Scatter plot: distance traveled by the Pastry message vs. distance between source and destination (0 to 1400); mean = 1.59. GATech topology, 0.5M hosts, 60K nodes, 20K random messages.]
Pastry: API
• route(M, X): route message M to node with
nodeId numerically closest to X
• deliver(M): deliver message M to
application
• forwarding(M, X): message M is being
forwarded towards key X
• newLeaf(L): report change in leaf set L to
application
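As a hedged Java sketch, the API could be expressed as the pair of interfaces below; the method names follow the slide, while Message, NodeId and LeafSet are placeholder types, not the prototype's actual signatures.

// Hedged sketch of the Pastry API as Java interfaces; placeholder types only.
interface Message {}
interface NodeId {}
interface LeafSet {}

interface Pastry {
    void route(Message m, NodeId x);        // route m to the node with nodeId closest to x
}

interface PastryApplication {
    void deliver(Message m);                // m arrived at the node responsible for its key
    void forwarding(Message m, NodeId x);   // m is passing through en route to key x
    void newLeaf(LeafSet l);                // the local leaf set has changed
}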
Pastry: Security
• Secure nodeId assignment
• Secure node join protocols
• Randomized routing
• Byzantine fault-tolerant leaf set membership protocol
Pastry: Summary
• Generic p2p overlay network
• Scalable, fault resilient, self-organizing,
secure
• O(log N) routing steps (expected)
• O(log N) routing table size
• Network proximity routing
Outline
• Background
• Pastry
• Pastry proximity routing
• PAST
• SCRIBE
• Conclusions
PAST: Cooperative, archival file
storage and distribution
• Layered on top of Pastry
• Strong persistence
• High availability
• Scalability
• Reduced cost (no backup)
• Efficient use of pooled resources
PAST API
• Insert - store replica of a file at k diverse
storage nodes
• Lookup - retrieve file from a nearby live
storage node that holds a copy
• Reclaim - free storage associated with a file
Files are immutable
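A hedged Java rendering of these three operations (the fileId type, the explicit replication factor parameter, and the return types are assumptions, not the actual PAST interface):

// Hedged sketch of the PAST operations as a Java interface.
import java.math.BigInteger;

interface Past {
    // Store replicas of the (immutable) file on the k nodes with nodeIds closest to fileId.
    void insert(BigInteger fileId, byte[] content, int k);

    // Retrieve the file from a nearby live node that holds a replica.
    byte[] lookup(BigInteger fileId);

    // Free the storage associated with the file; copies may remain readable for a while.
    void reclaim(BigInteger fileId);
}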
PAST: File storage
[Figure: Insert fileId routes the file towards fileId in the id space; with k=4, replicas are placed on the 4 nodes with nodeIds closest to fileId.]
Storage invariant: file “replicas” are stored on the k nodes with nodeIds closest to fileId (k is bounded by the leaf set size)
PAST: File Retrieval
[Figure: client C issues a Lookup fileId, which is routed towards the k replicas stored near fileId.]
• file located in log16 N steps (expected)
• usually locates the replica nearest to client C
PAST: Exploiting Pastry
• Random, uniformly distributed nodeIds
– replicas stored on diverse nodes
• Uniformly distributed fileIds
– e.g. SHA-1(filename, public key, salt)
– approximate load balance
• Pastry routes to closest live nodeId
– availability, fault-tolerance
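For illustration, a fileId could be derived as sketched below (hedged: SHA-1 over the filename, the owner's public key and a salt, as the slide suggests; the exact encoding used by PAST may differ).

// Hedged sketch: derive a fileId as SHA-1(filename, public key, salt); the hash
// is interpreted as a number in the id space (truncated if the overlay uses fewer bits).
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileIds {
    static BigInteger fileId(String filename, byte[] ownerPublicKey, byte[] salt)
            throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(filename.getBytes(StandardCharsets.UTF_8));
        sha1.update(ownerPublicKey);
        sha1.update(salt);
        return new BigInteger(1, sha1.digest());   // non-negative 160-bit value
    }
}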
PAST: Storage management
• Maintain storage invariant
• Balance free space when global utilization
is high
– statistical variation in assignment of files to
nodes (fileId/nodeId)
– file size variations
– node storage capacity variations
• Local coordination only (leaf sets)
Experimental setup
• Web proxy traces from NLANR
– 18.7 Gbytes, 10.5K mean, 1.4K median, 0 min,
138MB max
• Filesystem
– 166.6 Gbytes, 88K mean, 4.5K median, 0 min, 2.7 GB max
• 2250 PAST nodes (k = 5)
– truncated normal distributions of node storage
sizes, mean = 27/270 MB
Need for storage management
• No diversion (tpri = 1, tdiv = 0):
– max utilization 60.8%
– 51.1% inserts failed
• Replica/file diversion (tpri = .1, tdiv = .05):
– max utilization > 98%
– < 1% inserts failed
PAST: File insertion failures
[Plot: size of failed insertions (bytes, left axis, up to ~2 MB) and failure ratio (%, right axis, 0-30%) vs. global utilization (0-100%).]
PAST: Caching
• Nodes cache files in the unused portion of their allocated disk space
• Files are cached on nodes along the route of lookup and insert messages
Goals:
• maximize query throughput for popular documents
• balance query load
• improve client latency
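As a hedged illustration of the caching idea (the evaluation below compares LRU and GreedyDual-Size; only a plain LRU variant is sketched here, with assumed class names):

// Hedged sketch: cache files in the unused portion of the local storage allocation,
// evicting in LRU order when space is needed.
import java.math.BigInteger;
import java.util.LinkedHashMap;
import java.util.Map;

class FileCache {
    private long freeBytes;                               // unused portion of the allocation
    private final LinkedHashMap<BigInteger, byte[]> cache =
            new LinkedHashMap<>(16, 0.75f, true);         // access-order = LRU

    FileCache(long unusedBytes) { this.freeBytes = unusedBytes; }

    // Called when a lookup or insert message for this file is routed through this node.
    void maybeCache(BigInteger fileId, byte[] content) {
        if (cache.containsKey(fileId)) return;
        while (freeBytes < content.length && !cache.isEmpty()) {
            Map.Entry<BigInteger, byte[]> lru = cache.entrySet().iterator().next();
            freeBytes += lru.getValue().length;           // evict the least recently used file
            cache.remove(lru.getKey());
        }
        if (freeBytes >= content.length) {
            cache.put(fileId, content);
            freeBytes -= content.length;
        }
    }

    byte[] get(BigInteger fileId) { return cache.get(fileId); }
}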
PAST: Caching
[Figure: a Lookup fileId message is routed towards fileId; nodes along the route cache the file.]
PAST: Caching
[Plot: global cache hit rate (left axis, 0-1) and average number of routing hops (right axis, 0-2.5) vs. utilization (0-100%), for GD-S and LRU caching and for no caching (# hops only).]
PAST: Security
• No read access control; users may encrypt
content for privacy
• File authenticity: file certificates
• System integrity: nodeIds and fileIds are non-forgeable; sensitive messages are signed
• Routing randomized
PAST: Storage quotas
Balance storage supply and demand
• user holds smartcard issued by brokers
– hides user private key, usage quota
– debits quota upon issuing file certificate
• storage nodes hold smartcards
– advertise supply quota
– storage nodes subject to random audits within
leaf sets
PAST: Related Work
• CFS [SOSP’01]
• OceanStore [ASPLOS 2000]
• FarSite [Sigmetrics 2000]
Outline
• Background
• Pastry
• Pastry locality properties
• PAST
• SCRIBE
• Conclusions
SCRIBE: Large-scale,
decentralized multicast
• Infrastructure to support topic-based
publish-subscribe applications
• Scalable: large numbers of topics,
subscribers, wide range of subscribers/topic
• Efficient: low delay, low link stress, low
node overhead
SCRIBE: Large scale multicast
[Figure: Subscribe topicId messages are routed towards topicId, building a per-topic multicast tree; Publish topicId messages reach the tree's root and are disseminated to the subscribers.]
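The hedged Java sketch below shows the idea behind the tree construction (class and message names are assumptions, and PastryHandle mirrors the Overlay handle assumed in the DHT sketch earlier): subscriptions are routed towards the topicId and recorded as children at each node along the way; publishes are routed to the topic's root and pushed down the children tables.

// Hedged sketch of SCRIBE-style multicast trees over Pastry (not the actual implementation).
import java.math.BigInteger;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Assumed handle onto Pastry, mirroring the Overlay sketch used earlier.
interface PastryHandle {
    void route(Object msg, BigInteger key);
    BigInteger localId();
}

record Subscribe(BigInteger topicId, BigInteger subscriber) {}
record Publish(BigInteger topicId, byte[] data) {}

class ScribeNode {
    private final Map<BigInteger, Set<BigInteger>> children = new HashMap<>(); // topicId -> children
    private final PastryHandle pastry;

    ScribeNode(PastryHandle pastry) { this.pastry = pastry; }

    void subscribe(BigInteger topicId) {
        pastry.route(new Subscribe(topicId, pastry.localId()), topicId);
    }

    void publish(BigInteger topicId, byte[] data) {
        pastry.route(new Publish(topicId, data), topicId);   // delivered at the topic's root
    }

    // Called on each node a Subscribe passes through on its way towards topicId.
    // Returning false would stop the message; the exact mechanism is overlay-specific.
    boolean forwarding(Object msg) {
        if (msg instanceof Subscribe s) {
            Set<BigInteger> kids = children.computeIfAbsent(s.topicId(), t -> new HashSet<>());
            boolean alreadyInTree = !kids.isEmpty();
            kids.add(s.subscriber());
            return !alreadyInTree;   // graft here if this node is already in the topic's tree
        }
        return true;
    }

    // Called at the node with nodeId closest to the key (the topic's root for a Publish).
    void deliver(Object msg) {
        if (msg instanceof Publish p) {
            for (BigInteger child : children.getOrDefault(p.topicId(), Set.of())) {
                pastry.route(new Publish(p.topicId(), p.data()), child);  // push down the tree
            }
        }
    }
}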
Scribe: Results
• Simulation results
• Comparison with IP multicast: delay, node
stress and link stress
• Experimental setup
– Georgia Tech Transit-Stub model
– 100,000 nodes randomly selected out of .5M
– Zipf-like subscription distribution, 1500 topics
Scribe: Topic popularity
[Plot: group size (1 to 100,000, log scale) vs. group rank (1 to 1,500).]
gsize(r) = floor(N * r^(-1.25) + 0.5); N = 100,000; 1,500 topics
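For reference, a few lines of Java reproducing the group-size formula above (the class name is illustrative):

final class TopicPopularity {
    // gsize(r) = floor(N * r^(-1.25) + 0.5) with N = 100,000
    static int gsize(int rank) {
        return (int) Math.floor(100_000 * Math.pow(rank, -1.25) + 0.5);
    }
    // e.g. gsize(1) = 100000, gsize(2) = 42045, gsize(1500) = 11
}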
Scribe: Delay penalty
[Plot: cumulative number of groups (0-1,500) vs. delay penalty (0-5); curves for RAD and RMD, the relative average and maximum delay penalty.]
Scribe: Node stress
[Histogram: number of nodes vs. total number of children-table entries per node (0-1,100), with an inset magnifying the tail of the distribution (50-1,050 entries).]
Scribe: Link stress
[Histogram: number of links vs. link stress (1 to 10,000, log scale) for Scribe and IP multicast, with the maximum stress marked. One message published in each of the 1,500 topics.]
Related work
• Narada
• Bayeux/Tapestry
• CAN-Multicast
Summary
Self-configuring p2p framework for topic-based publish-subscribe
• Scribe achieves reasonable performance when
compared to IP multicast
– Scales to a large number of subscribers
– Scales to a large number of topics
– Good distribution of load
Status
Functional prototypes
• Pastry [Middleware 2001]
• PAST [HotOS-VIII, SOSP’01]
• SCRIBE [NGC 2001, IEEE JSAC]
• SplitStream [submitted]
• Squirrel [PODC’02]
http://www.cs.rice.edu/CS/Systems/Pastry
Current Work
• Security
– secure routing/overlay maintenance/nodeId assignment
– quota system
• Keyword search capabilities
• Support for mutable files in PAST
• Anonymity/anti-censorship
• New applications
• Free software releases
Conclusion
For more information
http://www.cs.rice.edu/CS/Systems/Pastry
Download