Pastry
Peter Druschel, Rice University
Antony Rowstron, Microsoft Research UK
Some slides are borrowed from the original presentation by the authors.

Primary issues
• Organize and maintain the overlay network
• Resource allocation / load balancing
• Object / resource location
• Network proximity routing
Pastry provides a generic p2p substrate.

Architecture
• P2p application layer: event notification, network storage, ?
• P2p substrate (self-organizing overlay network): Pastry
• Internet: TCP/IP

Pastry: Object distribution
• 128-bit circular id space: 0 .. 2^128 - 1
• nodeIds (uniform random)
• objIds (uniform random)
• Consistent hashing [Karger et al. '97]
• Invariant: the node with the numerically closest nodeId maintains the object (recall Chord).

Pastry: Routing table
Routing table for node 65a1fc (b = 4, so 2^b = 16 columns; log16 N rows):
• Row 0: entries 0x, 1x, 2x, 3x, 4x, 5x, 7x, 8x, 9x, ax, bx, cx, dx, ex, fx (every first digit except the local node's 6)
• Row 1: entries 60x, 61x, 62x, 63x, 64x, 66x, 67x, 68x, 69x, 6ax, 6bx, 6cx, 6dx, 6ex, 6fx (every second digit except 5)
• Row 2: entries 650x, 651x, ..., 659x, 65bx, 65cx, 65dx, 65ex, 65fx (every third digit except a)
• ... one row per digit of the nodeId
Each node also keeps a leaf set (next slide).

Pastry: Leaf sets
Each node maintains the IP addresses of the nodes with the L/2 numerically closest larger and the L/2 numerically closest smaller nodeIds, respectively.
• routing efficiency / robustness
• fault detection (keep-alive)
• application-specific local coordination

Pastry: Routing
Prefix routing: each hop forwards the message to a node whose nodeId shares a longer prefix with the key.
[Figure: nodes d13da3, d4213f, d462ba, d467c4, d471f1, 65a1fc on the id circle; Route(d46a1c) from 65a1fc reaches the node numerically closest to key d46a1c, each hop sharing a longer prefix with the key.]
Properties
• log16 N steps
• O(log N) size routing table per node

Pastry: Routing procedure

    if the destination D is within the range of the leaf set then
        forward to the numerically closest leaf set member
    else
        let l = length of the prefix shared between D and the local nodeId
        let d = value of the l-th digit in D's address
        if R[l][d] exists   (R[l][d] = routing table entry at row l, column d)
        then forward to R[l][d]
        else   {rare case}
            forward to a known node that
            (a) shares at least as long a prefix with D, and
            (b) is numerically closer to D than this node
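A minimal runnable sketch of this routing step in Python, assuming nodeIds are equal-length lowercase hex strings (so string order equals numeric order); the leaf_set and routing_table shapes below are illustrative assumptions, not Pastry's actual data structures:

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(local_id, dest, leaf_set, routing_table):
    dist = lambda a, b: abs(int(a, 16) - int(b, 16))  # ignores ring wraparound for brevity
    # Case 1: dest falls within the leaf set's id range:
    # forward to the numerically closest member (possibly ourselves).
    if leaf_set and min(leaf_set) <= dest <= max(leaf_set):
        return min(leaf_set + [local_id], key=lambda node: dist(node, dest))
    # Case 2: routing table entry at row l (shared prefix length),
    # column d (the next digit of dest).
    l = shared_prefix_len(local_id, dest)
    d = int(dest[l], 16)
    if routing_table[l][d] is not None:
        return routing_table[l][d]
    # Rare case: any known node that shares an at-least-as-long prefix
    # and is numerically closer to dest than we are.
    known = leaf_set + [e for row in routing_table for e in row if e]
    for node in known:
        if shared_prefix_len(node, dest) >= l and dist(node, dest) < dist(local_id, dest):
            return node
    return local_id  # we are the closest live node we know of

# Example: at node 65a1fc the first hop for key d46a1c comes from row 0, column d.
table = [[None] * 16 for _ in range(6)]
table[0][0xd] = "d13da3"
print(next_hop("65a1fc", "d46a1c", ["65a0aa", "65a2bb"], table))  # -> d13da3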
Pastry: Performance
Integrity of overlay / message delivery:
• guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
Number of routing hops:
• No failures: < log16 N expected, 128/b + 1 max
• During failure recovery: O(N) worst case (a loose upper bound); average case much better

Self-organization
How are the routing tables and leaf sets initialized and maintained?
• Node addition
• Node departure (failure)

Pastry: Node addition
New node: X = d46a1c. X asks node 65a1fc to route a message to X's own nodeId. The nodes along the route share their routing tables with X.
[Figure: the join message follows Route(d46a1c) from 65a1fc past d13da3, d4213f, d462ba, d467c4; X builds its state from their tables.]

Node departure (failure)
Leaf set members exchange heartbeats.
• Leaf set repair (eager): request the set from the farthest live leaf-set node
• Routing table repair (lazy): fetch entries from peers in the same row, then from higher rows

Pastry: Average # of hops
[Plot: average number of hops (0 to 4.5) vs. number of nodes (1,000 to 100,000), Pastry compared against log(N); L = 16, 100k random queries.]

Pastry: Proximity routing
Assumption: scalar proximity metric
• e.g. ping delay (time estimated by ping), # IP hops
• a node can probe its distance to any other node
Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.

Pastry: Routes in proximity space
[Figure: the route Route(d46a1c) from 65a1fc shown twice, once in the proximity space and once in the nodeId space, over nodes d13da3, d4213f, d462ba, d467c4, d471f1.]

Pastry: Distance traveled
[Plot: relative distance traveled (0.8 to 1.4) vs. number of nodes (1,000 to 100,000), Pastry compared against a complete routing table; L = 16, 100k random queries, Euclidean proximity space.]

PAST API
• Insert: store a replica of a file at k diverse storage nodes
• Lookup: retrieve the file from a nearby live storage node that holds a copy
• Reclaim: free the storage associated with a file
Files are immutable.

PAST: File storage
Storage invariant: file "replicas" are stored on the k nodes with nodeIds closest to the fileId (k is bounded by the leaf set size).
[Figure: Insert(fileId) with k = 4; the four nodes numerically closest to fileId store the replicas.]

PAST operations
Insert: a fileId is computed from the filename and the client's public key using SHA-1. A file certificate is issued, signed with the owner's private key. The certificate and the file are then sent to the first of the k destinations. The storing node verifies the certificate, checks the content hash, and issues a store receipt.
Lookup: the client sends a lookup request to the fileId; a node that holds the file responds with the content and the file certificate.

PAST: File Retrieval
• Lookup(fileId) locates the file in log16 N steps (expected)
• Usually locates the replica nearest to the client C
[Figure: client C retrieves one of the k replicas.]

SCRIBE: Large-scale, decentralized multicast
• Infrastructure to support topic-based publish-subscribe applications
• Scalable: large numbers of topics and subscribers, wide range of subscribers per topic
• Efficient: low delay, low link stress, low node overhead

SCRIBE: Large scale multicast
[Figure: Subscribe(topicId) messages are routed toward the node responsible for topicId, building a multicast tree; Publish(topicId) messages are disseminated along that tree.]

Summary
Self-configuring P2P framework for topic-based publish-subscribe.
• Scribe achieves reasonable performance when compared to IP multicast:
– Scales to a large number of subscribers
– Scales to a large number of topics
– Good distribution of load
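To make the Subscribe/Publish mechanism in the Scribe figure above concrete, here is a minimal Python sketch of Scribe-style tree building on top of Pastry routing; the route() stand-in and all names are illustrative assumptions, not the actual Scribe API:

# A single topic is assumed for brevity.
children = {}  # nodeId -> set of child nodeIds in the multicast tree

def route(src, topic_id):
    # Stand-in for Pastry routing: returns the sequence of nodes a message
    # visits on its way to the node responsible for topic_id (the tree root).
    # In real Pastry each hop shares a longer nodeId prefix with topic_id;
    # here we return a fixed illustrative path.
    return ["d13da3", "d4213f", "d462ba"]

def subscribe(subscriber, topic_id):
    # Route toward the topic root; each node on the path adopts the previous
    # hop as a child. (Real Scribe stops at the first node already in the tree.)
    prev = subscriber
    for node in route(subscriber, topic_id):
        children.setdefault(node, set()).add(prev)
        prev = node

def publish(root, message):
    # Disseminate a message from the topic root down the tree.
    frontier, seen = [root], set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        print(f"{node} receives: {message}")  # interior forwarders need not be subscribers
        frontier.extend(children.get(node, ()))

subscribe("65a1fc", "d46a1c")  # build a one-subscriber tree for topic d46a1c
publish("d462ba", "event!")    # the root fans the event out along the tree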