Scalable peer-to-peer substrates: A new foundation for distributed applications?
Peter Druschel, Rice University
Antony Rowstron, Microsoft Research Cambridge, UK
Collaborators: Miguel Castro, Anne-Marie Kermarrec (MSR Cambridge); Y. Charlie Hu, Sitaram Iyer, Animesh Nandi, Atul Singh, Dan Wallach (Rice University)

Outline
• Background
• Pastry
• Pastry proximity routing
• PAST
• SCRIBE
• Conclusions

Background

Peer-to-peer systems
• distribution
• decentralized control
• self-organization
• symmetry (communication, node roles)

Peer-to-peer applications
• Pioneers: Napster, Gnutella, FreeNet
• File sharing: CFS, PAST [SOSP'01]
• Network storage: FarSite [Sigmetrics'00], OceanStore [ASPLOS'00], PAST [SOSP'01]
• Web caching: Squirrel [PODC'02]
• Event notification/multicast: Herald [HotOS'01], Bayeux [NOSSDAV'01], CAN-multicast [NGC'01], SCRIBE [NGC'01], SplitStream [submitted]
• Anonymity: Crowds [CACM'99], Onion routing [JSAC'98]
• Censorship resistance: Tangler [CCS'02]

Common issues
• Organize and maintain the overlay network
  – node arrivals
  – node failures
• Resource allocation/load balancing
• Resource location
• Network proximity routing
Idea: provide a generic p2p substrate.

Architecture
• P2P application layer: network storage, event notification, ...
• P2P substrate (self-organizing overlay network): Pastry
• Internet: TCP/IP

Structured p2p overlays
One primitive:
• route(M, X): route message M to the live node with nodeId closest to key X
• nodeIds and keys are drawn from a large, sparse id space

Distributed Hash Tables (DHT)
(Figure: key-value pairs k1..k6 mapped onto the nodes of a p2p overlay network.)
• Operations: insert(k,v), lookup(k)
• the p2p overlay maps keys to nodes
• completely decentralized and self-organizing
• robust, scalable

Why structured p2p overlays?
• Leverage pooled resources (storage, bandwidth, CPU)
• Leverage resource diversity (geographic, ownership)
• Leverage existing shared infrastructure
• Scalability
• Robustness
• Self-organization

Outline: Background • Pastry • Pastry proximity routing • PAST • SCRIBE • Conclusions

Pastry: Related work
• Chord [Sigcomm'01]
• CAN [Sigcomm'01]
• Tapestry [TR UCB/CSD-01-1141]
• PNRP [unpublished]
• Viceroy [PODC'02]
• Kademlia [IPTPS'02]
• Small World [Kleinberg '99, '00]
• Plaxton trees [Plaxton et al. '97]

Pastry: Object distribution
(Figure: 128-bit circular id space, from 0 to 2^128 - 1.)
• Consistent hashing [Karger et al. '97]
• nodeIds and objIds are chosen uniformly at random from the id space
• Invariant: an object is maintained by the node with the numerically closest nodeId

Pastry: Object insertion/lookup
(Figure: Route(X) travels around the id space to the node with the nodeId closest to X.)
• A message with key X is routed to the live node with the nodeId closest to X
• Problem: a complete routing table is not feasible

Pastry: Routing
Tradeoff:
• O(log N) routing table size
• O(log N) message forwarding steps

Pastry: Routing table (node 65a1fcx)
log16 N rows, 15 entries per row. Row i holds nodeIds that share the first i digits with the local node but differ in digit i; the column for the local node's own digit is empty ("x" stands for an arbitrary suffix):
Row 0: 0x 1x 2x 3x 4x 5x 7x 8x 9x ax bx cx dx ex fx  (6 omitted)
Row 1: 60x 61x 62x 63x 64x 66x 67x 68x 69x 6ax 6bx 6cx 6dx 6ex 6fx  (65 omitted)
Row 2: 650x 651x 652x 653x 654x 655x 656x 657x 658x 659x 65bx 65cx 65dx 65ex 65fx  (65a omitted)
Row 3: 65a0x 65a2x 65a3x 65a4x 65a5x 65a6x 65a7x 65a8x 65a9x 65aax 65abx 65acx 65adx 65aex 65afx  (65a1 omitted)

Pastry: Routing
(Figure: node 65a1fc routes Route(d46a1c); each hop (d13da3, d4213f, d462ba, ...) shares a progressively longer prefix with the key, until the message reaches the live node with the numerically closest nodeId.)
Properties:
• log16 N steps
• O(log N) state

Pastry: Leaf sets
Each node maintains the IP addresses of the L/2 nodes with the numerically closest larger nodeIds and the L/2 nodes with the numerically closest smaller nodeIds.
• routing efficiency/robustness
• fault detection (keep-alive)
• application-specific local coordination

Pastry: Routing procedure
if (the destination key X is within the range of our leaf set)
    forward to the numerically closest leaf set member
else
    let l = length of the prefix shared by X and the local nodeId
    let d = value of the l-th digit of X
    if (the routing table entry R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node that
        (a) shares at least as long a prefix with X, and
        (b) is numerically closer to X than this node
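To make the routing procedure above concrete, here is a minimal, illustrative Java sketch of a single forwarding step (the prototype described later is implemented in Java). The Node class, its fields, and the helper methods are hypothetical stand-ins, not Pastry's actual classes or API; wrap-around of the circular id space and many corner cases are ignored.

```java
import java.math.BigInteger;
import java.util.*;

/** Illustrative sketch of one Pastry forwarding step (hexadecimal digits, b = 4). */
public class PastryRouteStep {

    /** Hypothetical per-node state: routing table rows and the leaf set. */
    static class Node {
        final String nodeId;                              // e.g. "65a1fc" (hex digits)
        final String[][] routingTable;                    // routingTable[row][digit] = nodeId or null
        final TreeSet<String> leafSet = new TreeSet<>();  // nodeIds numerically around us

        Node(String nodeId, int rows) {
            this.nodeId = nodeId;
            this.routingTable = new String[rows][16];
        }

        /** Choose the next hop for key X, mirroring the routing procedure on the slide. */
        String nextHop(String key) {
            // 1. If the key falls within the leaf set range, go to the numerically closest member.
            if (!leafSet.isEmpty() && withinLeafSetRange(key)) {
                return closestNumerically(key, leafSet);
            }
            int l = sharedPrefixLength(nodeId, key);
            if (l == key.length()) return nodeId;         // key equals our own id: deliver locally
            // 2. Otherwise use the routing table entry that matches one more digit of the key.
            int d = Character.digit(key.charAt(l), 16);
            if (l < routingTable.length && routingTable[l][d] != null) {
                return routingTable[l][d];
            }
            // 3. Rare case: fall back to any known node that shares at least as long a prefix
            //    and is numerically closer to the key than this node.
            List<String> known = new ArrayList<>(leafSet);
            for (String[] row : routingTable)
                for (String n : row) if (n != null) known.add(n);
            for (String n : known) {
                if (sharedPrefixLength(n, key) >= l
                        && distance(n, key).compareTo(distance(nodeId, key)) < 0) {
                    return n;
                }
            }
            return nodeId;                                // no better node known: deliver locally
        }

        boolean withinLeafSetRange(String key) {
            return key.compareTo(leafSet.first()) >= 0 && key.compareTo(leafSet.last()) <= 0;
        }
    }

    static int sharedPrefixLength(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    /** Numeric distance between two ids (ignoring wrap-around, for simplicity). */
    static BigInteger distance(String a, String b) {
        return new BigInteger(a, 16).subtract(new BigInteger(b, 16)).abs();
    }

    static String closestNumerically(String key, Collection<String> ids) {
        return Collections.min(ids, Comparator.comparing((String id) -> distance(id, key)));
    }

    public static void main(String[] args) {
        Node n = new Node("65a1fc", 6);
        n.routingTable[0][0xd] = "d13da3";                // row 0: differs in the first digit
        n.leafSet.addAll(Arrays.asList("65a002", "65a213", "65b000"));
        System.out.println(n.nextHop("d46a1c"));          // -> d13da3 (matches one more digit)
    }
}
```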
Pastry: Performance
Integrity of overlay/message delivery:
• guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
Number of routing hops:
• no failures: < log16 N expected, 128/b + 1 maximum
• during failure recovery: O(N) worst case, average case much better

Pastry: Self-organization
Initializing and maintaining routing tables and leaf sets:
• node addition
• node departure (failure)

Pastry: Node addition
(Figure: a new node d46a1c contacts a nearby node 65a1fc and routes a join message, Route(d46a1c), with its own nodeId as the key; the message is forwarded via d13da3, d4213f, d462ba toward d467c4/d471f1, and the new node initializes its routing table and leaf set from the state of the nodes along this route.)

Node departure (failure)
Leaf set members exchange keep-alive messages.
• Leaf set repair (eager): request the set from the farthest live node in the set
• Routing table repair (lazy): obtain replacement entries from peers in the same row, then from higher rows

Pastry: Experimental results
Prototype:
• implemented in Java
• emulated network
• deployed testbed (currently ~25 sites worldwide)

Pastry: Average # of hops
(Plot: average number of routing hops vs. number of nodes, 1,000 to 100,000; Pastry closely tracks the log16 N curve. L=16, 100k random queries.)

Pastry: # of hops (100k nodes)
(Plot: probability distribution of the number of routing hops with 100,000 nodes; 4 hops is the mode with probability 0.64, while 3 and 5 hops occur with probability 0.16 and 0.17. L=16, 100k random queries.)

Pastry: # routing hops (failures)
(Plot: average hops per lookup is 2.73 with no failures, 2.96 immediately after failures, and 2.74 after routing table repair. L=16, 100k random queries, 5k nodes, 500 failures.)

Outline: Background • Pastry • Pastry proximity routing • PAST • SCRIBE • Conclusions

Pastry: Proximity routing
Assumption: a scalar proximity metric
• e.g. ping delay, # IP hops
• a node can probe its distance to any other node
Proximity invariant: each routing table entry refers to a node that is close to the local node (in the proximity space), among all live nodes with the appropriate nodeId prefix.
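One way to read the proximity invariant operationally: whenever a node with the required nodeId prefix is discovered, it replaces the current routing table entry only if it is closer under the proximity metric. The sketch below illustrates this rule; the ProximityTable class and the probe() callback are assumptions for illustration, not part of Pastry's published interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch of the proximity invariant: among all candidates with the
 * required nodeId prefix, keep the routing table entry that is closest to the
 * local node under a scalar proximity metric (e.g. measured ping delay).
 */
public class ProximityTable {
    /** Hypothetical proximity probe, e.g. a round-trip-time measurement in ms. */
    interface Prober { double probe(String nodeId); }

    private final String localId;
    private final Prober prober;
    // Entry for row r, digit d is stored under the key "r:d" for brevity.
    private final Map<String, String> entries = new ConcurrentHashMap<>();
    private final Map<String, Double> entryDistance = new ConcurrentHashMap<>();

    public ProximityTable(String localId, Prober prober) {
        this.localId = localId;
        this.prober = prober;
    }

    /** Consider a newly discovered node as a routing table candidate. */
    public void considerCandidate(String candidateId) {
        int row = sharedPrefixLength(localId, candidateId);
        if (row >= candidateId.length()) return;          // same id as ours: ignore
        int digit = Character.digit(candidateId.charAt(row), 16);
        String slot = row + ":" + digit;

        double dist = prober.probe(candidateId);           // proximity to the local node
        Double current = entryDistance.get(slot);
        // Keep the closer of the existing entry and the candidate (proximity invariant).
        if (current == null || dist < current) {
            entries.put(slot, candidateId);
            entryDistance.put(slot, dist);
        }
    }

    public String entry(int row, int digit) { return entries.get(row + ":" + digit); }

    private static int sharedPrefixLength(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        // Toy probe: pretend nodes whose ids start with "d1" are nearby (10 ms vs. 80 ms).
        ProximityTable t = new ProximityTable("65a1fc", id -> id.startsWith("d1") ? 10.0 : 80.0);
        t.considerCandidate("d46a1c");
        t.considerCandidate("d13da3");                      // closer candidate for the same slot
        System.out.println(t.entry(0, 0xd));                // -> d13da3
    }
}
```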
Pastry: Routes in proximity space
(Figure: the example route toward key d46a1c, via d13da3, d4213f, d462ba, shown both in the proximity space and in the nodeId space.)

Pastry: Distance traveled
(Plot: relative distance traveled vs. number of nodes, 1,000 to 100,000, for Pastry and for routing with a complete routing table. L=16, 100k random queries, Euclidean proximity space.)

Pastry: Locality properties
1) The expected distance traveled by a message in the proximity space is within a small constant of the minimum.
2) Routes of messages sent with the same key by nearby nodes converge at a node near the source nodes.
3) Among the k nodes with nodeIds closest to the key, a message is likely to reach the node closest to the source node first.

Pastry: Node addition
(Figure: the new node d46a1c routes its join message, Route(d46a1c), starting at a nearby node 65a1fc; the route is shown both in the proximity space and in the nodeId space.)

Pastry delay vs IP delay
(Scatter plot: distance traveled by a Pastry message vs. the distance between source and destination; mean ratio = 1.59. GATech topology, 0.5M hosts, 60K nodes, 20K random messages.)

Pastry: API
• route(M, X): route message M to the node with nodeId numerically closest to X
• deliver(M): deliver message M to the application
• forwarding(M, X): message M is being forwarded towards key X
• newLeaf(L): report a change in the leaf set L to the application

Pastry: Security
• Secure nodeId assignment
• Secure node join protocols
• Randomized routing
• Byzantine fault-tolerant leaf set membership protocol

Pastry: Summary
• Generic p2p overlay network
• Scalable, fault resilient, self-organizing, secure
• O(log N) routing steps (expected)
• O(log N) routing table size
• Network proximity routing

Outline: Background • Pastry • Pastry proximity routing • PAST • SCRIBE • Conclusions

PAST: Cooperative, archival file storage and distribution
• Layered on top of Pastry
• Strong persistence
• High availability
• Scalability
• Reduced cost (no backup)
• Efficient use of pooled resources

PAST API
• Insert: store replicas of a file at k diverse storage nodes
• Lookup: retrieve the file from a nearby live storage node that holds a copy
• Reclaim: free the storage associated with a file
Files are immutable.

PAST: File storage
(Figure: Insert(fileId) routes the file to the node with the nodeId closest to fileId.)

PAST: File storage
(Figure: the file is replicated at the k = 4 nodes with nodeIds closest to fileId.)
Storage invariant: file "replicas" are stored on the k nodes with nodeIds closest to the fileId (k is bounded by the leaf set size).

PAST: File retrieval
(Figure: client C issues Lookup(fileId); the file is located in log16 N steps (expected), and the lookup usually finds the replica nearest to the client C.)

PAST: Exploiting Pastry
• Random, uniformly distributed nodeIds
  – replicas are stored on diverse nodes
• Uniformly distributed fileIds
  – e.g. SHA-1(filename, public key, salt)
  – approximate load balance
• Pastry routes to the closest live nodeId
  – availability, fault tolerance

PAST: Storage management
• Maintain the storage invariant
• Balance free space when global utilization is high
  – statistical variation in the assignment of files to nodes (fileId/nodeId)
  – file size variations
  – node storage capacity variations
• Local coordination only (leaf sets)
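The fileId construction mentioned above (SHA-1 over the file name, the owner's public key, and a salt) can be sketched as follows. The helper below is illustrative only; the commented route() calls and message classes are hypothetical placeholders, not the PAST implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative sketch: deriving a PAST fileId as SHA-1(filename, public key, salt). */
public class PastFileId {

    /** 160-bit fileId, rendered as hex, from the file name, the owner's public key, and a salt. */
    public static String fileId(String filename, byte[] ownerPublicKey, byte[] salt)
            throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(filename.getBytes(StandardCharsets.UTF_8));
        sha1.update(ownerPublicKey);
        sha1.update(salt);
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] publicKey = new byte[]{0x01, 0x02, 0x03};   // placeholder key bytes
        byte[] salt = new byte[]{0x0a, 0x0b};
        String id = fileId("report.pdf", publicKey, salt);
        System.out.println("fileId = " + id);
        // In PAST, Insert then routes the file (with its certificate) toward this id,
        // and the k nodes with nodeIds closest to it store replicas:
        //   pastry.route(new InsertMessage(file, certificate), id);   // hypothetical call
        //   pastry.route(new LookupMessage(id), id);                  // hypothetical call
    }
}
```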
Experimental setup
• Web proxy traces from NLANR: 18.7 GB; file sizes 10.5 KB mean, 1.4 KB median, 0 min, 138 MB max
• Filesystem: 166.6 GB; file sizes 88 KB mean, 4.5 KB median, 0 min, 2.7 GB max
• 2250 PAST nodes (k = 5)
  – truncated normal distributions of node storage sizes, mean = 27 MB / 270 MB

Need for storage management
• No diversion (t_pri = 1, t_div = 0):
  – max utilization 60.8%
  – 51.1% of inserts failed
• Replica/file diversion (t_pri = 0.1, t_div = 0.05):
  – max utilization > 98%
  – < 1% of inserts failed

PAST: File insertion failures
(Plot: sizes of failed insertions, up to ~2 MB, left axis, and the insertion failure ratio, up to 30%, right axis, vs. global utilization in percent.)

PAST: Caching
• Nodes cache files in the unused portion of their allocated disk space
• Files are cached on nodes along the route of lookup and insert messages
Goals:
• maximize query throughput for popular documents
• balance the query load
• improve client latency

PAST: Caching
(Figure: copies of a file are cached at nodes along the Pastry route taken by lookup and insert messages for its fileId.)

PAST: Caching
(Plot: global cache hit rate, left axis, and average number of routing hops, right axis, vs. utilization in percent, for LRU and GreedyDual-Size (GD-S) cache replacement and for no caching.)

PAST: Security
• No read access control; users may encrypt content for privacy
• File authenticity: file certificates
• System integrity: nodeIds and fileIds are non-forgeable, sensitive messages are signed
• Routing is randomized

PAST: Storage quotas
Balance storage supply and demand:
• each user holds a smartcard issued by brokers
  – hides the user's private key and usage quota
  – debits the quota upon issuing a file certificate
• storage nodes hold smartcards
  – advertise the supply quota
  – storage nodes are subject to random audits within leaf sets

PAST: Related work
• CFS [SOSP'01]
• OceanStore [ASPLOS'00]
• FarSite [Sigmetrics'00]

Outline: Background • Pastry • Pastry proximity routing • PAST • SCRIBE • Conclusions

SCRIBE: Large-scale, decentralized multicast
• Infrastructure to support topic-based publish-subscribe applications
• Scalable: large numbers of topics and subscribers, wide range of subscribers per topic
• Efficient: low delay, low link stress, low node overhead

SCRIBE: Large-scale multicast
(Figure: nodes issue Subscribe(topicId); a Publish(topicId) message is disseminated to all subscribers of the topic.)

Scribe: Results
• Simulation results
• Comparison with IP multicast: delay, node stress, and link stress
• Experimental setup
  – Georgia Tech transit-stub model
  – 100,000 nodes randomly selected out of 0.5M
  – Zipf-like subscription distribution, 1,500 topics

Scribe: Topic popularity
(Plot: group size vs. group rank for the 1,500 topics; gsize(r) = floor(N * r^(-1.25) + 0.5), N = 100,000.)

Scribe: Delay penalty
(Plot: cumulative number of groups vs. delay penalty relative to IP multicast, for the relative average delay (RAD) and the relative maximum delay (RMD).)

Scribe: Node stress
(Plot: histograms of the number of nodes vs. the total number of children-table entries per node, including a zoomed view of the tail of the distribution.)

Scribe: Link stress
(Plot: number of links vs. link stress for Scribe and for IP multicast, with the maximum stress indicated; one message published in each of the 1,500 topics.)

Related work
• Narada
• Bayeux/Tapestry
• CAN-Multicast

Summary
Self-configuring p2p framework for topic-based publish-subscribe.
• Scribe achieves reasonable performance compared to IP multicast
  – scales to a large number of subscribers
  – scales to a large number of topics
  – good distribution of load
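A rough sketch of how topic-based subscribe/publish can be layered on Pastry's route primitive: subscriptions are routed toward the topicId, each node on the path remembers the child it heard from, and publishes are forwarded down the resulting tree rooted at the node whose nodeId is closest to the topicId. The class below is a simplified illustration only; it omits tree repair, access control, and unsubscription, and its names and message handling are not Scribe's actual API.

```java
import java.util.*;

/**
 * Simplified illustration of a Scribe-style, topic-based multicast layer.
 * Each node keeps, per topic, the set of children that subscribed through it;
 * the resulting reverse-path tree is rooted at the node whose nodeId is
 * closest to the topicId.
 */
public class ScribeSketch {

    /** Per-node children tables: topicId -> children that reached us with a SUBSCRIBE. */
    private final Map<String, Set<String>> childrenByTopic = new HashMap<>();
    private final String nodeId;
    private final boolean isRoot;   // true if this node is the one closest to the topicId

    public ScribeSketch(String nodeId, boolean isRoot) {
        this.nodeId = nodeId;
        this.isRoot = isRoot;
    }

    /** Called when a SUBSCRIBE message for topicId passes through (or terminates at) this node. */
    public void onSubscribe(String topicId, String childNodeId) {
        Set<String> children = childrenByTopic.computeIfAbsent(topicId, t -> new HashSet<>());
        boolean alreadyForwarder = !children.isEmpty();
        children.add(childNodeId);
        // If we were not yet part of this topic's tree, subscribe further toward the root
        // by routing our own SUBSCRIBE toward topicId (placeholder call).
        if (!alreadyForwarder && !isRoot) {
            // pastry.route(new SubscribeMessage(topicId, nodeId), topicId);  // hypothetical
        }
    }

    /** Called when a PUBLISH for topicId arrives; forwards it to this node's children. */
    public void onPublish(String topicId, String payload) {
        for (String child : childrenByTopic.getOrDefault(topicId, Collections.emptySet())) {
            // send(child, new PublishMessage(topicId, payload));  // hypothetical direct send
            System.out.println(nodeId + " forwards '" + payload + "' for topic "
                    + topicId + " to child " + child);
        }
    }

    public static void main(String[] args) {
        ScribeSketch root = new ScribeSketch("d46a1c", true);   // node closest to the topicId
        root.onSubscribe("d46a2f", "65a1fc");
        root.onSubscribe("d46a2f", "d13da3");
        root.onPublish("d46a2f", "hello subscribers");
    }
}
```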
Status
Functional prototypes:
• Pastry [Middleware 2001]
• PAST [HotOS-VIII, SOSP'01]
• SCRIBE [NGC 2001, IEEE JSAC]
• SplitStream [submitted]
• Squirrel [PODC'02]
http://www.cs.rice.edu/CS/Systems/Pastry

Current work
• Security
  – secure routing/overlay maintenance/nodeId assignment
  – quota system
• Keyword search capabilities
• Support for mutable files in PAST
• Anonymity/anti-censorship
• New applications
• Free software releases

Conclusion
For more information: http://www.cs.rice.edu/CS/Systems/Pastry