SKIP GRAPHS
James Aspnes, Gauri Shah
To appear in SODA 2003.
[Figure: a skip graph drawn at levels 0, 1 and 2.]

Outline
• Peer-to-peer systems
• Existing approach: Distributed Hash Tables
• Our approach: Skip Graphs
• Algorithms and properties
• Experimental results
• Conclusions and open problems

P2P system
[Figure: peers holding resources, each resource identified by a key.]
• A collection of peers.
• Peers store resources identified by keys.
• Peers are subject to crash failures.
• Goal: locate resources efficiently.

Properties of ideal network
• Data availability
• Decentralization
• Fault-tolerance
• Scalability
• Load balancing
• Maintaining the network
• Dynamic node addition/deletion
• Self-stabilization
• Efficient searching
• Incorporating geography
• Incorporating locality [temporal, spatial]

Early P2P systems
• Napster: central server bottleneck.
• Gnutella: inefficient flooding.

Tapestry [JKZ’01]
Uses Plaxton’s algorithm: node xyz links to *XX, x*X and xy* [* = all digits, X = any digit].
[Figure: routing example among nodes 427, 768, 368, 123, 327, 135, 365, 360.]
Correct one digit at a time to reach the target.
Pastry [DR’01] is similar.

CAN [RFHKS’01]
Partition the d-dimensional coordinate space into zones.
[Figure: the unit square with corners (0,0), (1,0), (0,1), (1,1) partitioned into zones, d = 2.]
Nodes own zones, and keys are hashed to them.
Greedy routing: forward to the neighbor closest to the target.

Chord [SMKKB’01]
Nodes and resources are mapped to an identifier circle of size 2^m.
Routing table: successor nodes at distances 2^i.
[Figure: identifier circle (n = 8) with the successor table of each node.]
Greedy routing: forward to the node in the routing table closest to the target.

Distributed Hash Tables
[Figure: nodes and keys are hashed (HASH) into a virtual overlay network of nodes v1 to v4 joined by virtual links, layered over the physical network; a virtual route in the overlay corresponds to an actual route over physical links.]

Advantages
• Load balancing.
• Decentralization.
• O(log n) space and search time.
• Tolerance of random faults.
Disadvantages
• No locality properties.
• No tolerance to adversarial faults.
• O(log^2 n) insert and delete time [search for (log n) neighbors].
• No self-stabilization.
• No optimization with respect to geography.
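The Chord rule summarized above (routing table entries are successors at distances 2^i, and routing greedily forwards to the table entry closest to the target) can be illustrated with a small centralized simulation. This is only a sketch under stated assumptions: the function names are hypothetical, the target node is computed from a global view of the ring, and the node sets are chosen for illustration. It is not the Chord protocol itself.

```python
from bisect import bisect_left

def successor(nodes, ident, m):
    """First node at or clockwise after ident on the 2^m identifier circle.
    `nodes` is a sorted list of node identifiers (an assumption of this sketch)."""
    ident %= 2 ** m
    i = bisect_left(nodes, ident)
    return nodes[i % len(nodes)]          # wrap around past the largest node

def finger_table(nodes, n, m):
    """Routing table of node n: successors of n + 2^i for i = 0..m-1."""
    return [successor(nodes, n + 2 ** i, m) for i in range(m)]

def clockwise(a, b, m):
    """Clockwise distance from a to b on the circle."""
    return (b - a) % (2 ** m)

def route(nodes, start, key, m):
    """Greedy routing: repeatedly jump to the table entry that gets closest
    to the node responsible for `key` without passing it."""
    target = successor(nodes, key, m)     # global view, for illustration only
    path, cur = [start], start
    while cur != target:
        dist = clockwise(cur, target, m)
        candidates = [f for f in finger_table(nodes, cur, m)
                      if 0 < clockwise(cur, f, m) <= dist]
        cur = max(candidates, key=lambda f: clockwise(cur, f, m))
        path.append(cur)
    return path

# Illustrative ring with m = 6 (identifiers 0..63).
ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(finger_table(ring, 8, 6))
print(route(ring, 8, 54, 6))
```

Each hop at least halves the remaining clockwise distance when the table is complete, which is where the O(log n) search time quoted for DHTs comes from.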
Skip List [Pugh ’90]
Data structure based on a linked list.
[Figure: a skip list on keys A, G, J, M, R, W between HEAD and TAIL, drawn at levels 0, 1 and 2, with the random bit chosen by each node at each level.]
Each node is linked at the next higher level with probability 1/2.

Searching in a skip list
Search for key ‘R’.
[Figure: the search path from HEAD down through levels 2, 1 and 0, with successful and failed comparisons marked.]
Time for search: O(log n) on average.
On average, a constant number of pointers per node.

Skip lists for P2P?
Advantages
• O(log n) expected search time.
• Retains locality.
• Dynamic node additions/deletions.
Disadvantages
• Heavily loaded top-level nodes.
• Easily susceptible to random failures.
• Lacks redundancy.

A Skip Graph
[Figure: a skip graph on keys A, G, J, M, R, W, each with a 3-bit membership vector, drawn at levels 0, 1 and 2.]
Link at level i to nodes whose membership vectors share a prefix of length i.
Think of a tree of skip lists that share lower layers.

Properties of skip graphs
1. Searching.
2. Node insertions.
3. Independence from system size.
4. Locality and range queries.

Searching: avg. O(log n)
[Figure: the skip graph on keys A, G, J, M, R, W at levels 0, 1 and 2, with the lists containing the starting node highlighted.]
Restricting to the lists containing the starting element of the search, we get a skip list.
Same performance as DHTs.

Design aspects
Use doubly linked lists at each level to account for the absence of head and tail nodes, so a search can start at any node.
A circular singly linked list cannot be used, because it is hard to detect and repair an error like this:
[Figure: an incorrectly linked level-0 list on nodes 1 to 12.]

Node Insertion - 1
[Figure: new node J joins the skip graph on A, G, M, R, W through its buddy node.]
Starting at the buddy node, find the nearest key at level 0.
This is essentially a range query looking for the key closest to the new key.
Takes O(log n) on average.

Node Insertion - 2
At each level i, find the nearest node whose membership vector matches the new node's in a prefix of length i+1.
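The search rule and the two insertion steps above can be sketched as a single-process simulation. The sketch rests on several assumptions that are not in the slides: the class and function names are hypothetical, keys are distinct, membership vectors are fixed-length random bit strings (rather than being extended on demand), and nodes are plain in-memory objects instead of networked peers.

```python
import random

class Node:
    def __init__(self, key, mv):
        self.key, self.mv = key, mv          # mv: membership vector (bit string)
        self.left, self.right = {}, {}       # level -> neighbor Node or None

def search(start, key):
    """From any node, walk toward `key`, dropping a level when the next hop
    would overshoot. Returns the node nearest to `key` on the start's side."""
    cur = start
    level = max(set(cur.left) | set(cur.right), default=0)
    while level >= 0:
        if key >= cur.key:
            nxt = cur.right.get(level)
            ok = nxt is not None and nxt.key <= key
        else:
            nxt = cur.left.get(level)
            ok = nxt is not None and nxt.key >= key
        if ok:
            cur = nxt
        else:
            level -= 1
    return cur

def link(l, new, r, level):
    """Splice `new` between l and r in the doubly linked list at `level`."""
    new.left[level], new.right[level] = l, r
    if l is not None:
        l.right[level] = new
    if r is not None:
        r.left[level] = new

def nearest(node, level, prefix, side):
    """Walk along `level` in direction `side` to the first matching prefix."""
    cur = getattr(node, side).get(level)
    while cur is not None and not cur.mv.startswith(prefix):
        cur = getattr(cur, side).get(level)
    return cur

def insert(buddy, key, mv):
    new = Node(key, mv)
    if buddy is None:                        # first node in the graph
        return new
    near = search(buddy, key)                # step 1: nearest key at level 0
    if near.key < key:
        link(near, new, near.right.get(0), 0)
    else:
        link(near.left.get(0), new, near, 0)
    for level in range(len(mv)):             # step 2: climb level by level
        prefix = mv[:level + 1]
        l = nearest(new, level, prefix, "left")
        r = nearest(new, level, prefix, "right")
        if l is None and r is None:          # alone at the next level: stop
            break
        link(l, new, r, level + 1)
    return new

# Demo on the keys used in the slides, with random 16-bit membership vectors.
random.seed(0)
nodes = []
for k in "MAWGRJ":
    buddy = random.choice(nodes) if nodes else None
    nodes.append(insert(buddy, k, format(random.getrandbits(16), "016b")))
print(search(nodes[0], "R").key)
```

Because every list is doubly linked and has no head or tail, `search` may start from any node, which is what lets an arbitrary buddy node serve as the entry point for an insertion.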
[Figure: the skip graph on keys A, G, J, M, R, W at levels 0, 1 and 2 after J has been linked at every level.]
Total time for insertion: O(log n).
DHTs take O(log^2 n).

Independent of system size
No need to know the size of the keyspace or the number of nodes.
[Figure: a skip graph on keys E and Z with levels 0 and 1; after inserting J, the graph on E, J, Z has levels 0, 1 and 2 and longer membership vectors.]
Old nodes extend their membership vectors as required when new nodes arrive.
DHTs require knowledge of the keyspace size initially.

Locality and range queries
• Find key < F, > F.
• Find largest key < x.
• Find least key > x.
• Find all keys in interval [D..O].
• Initial node insertion at level 0.
[Figure: keys A, D, F, I, L, O, S in sorted order at level 0.]

Applications of locality
Version control
e.g. find the latest news from yesterday: find the largest key < news:10/29.
Level 0: news:10/25, news:10/26, news:10/27, news:10/28, news:10/29
Data replication
e.g. find any copy of some Britney Spears song.
Level 0: britney01, britney02, britney03, britney04, britney05
DHTs cannot do this easily, as hashing destroys locality.

So far...
• Decentralization.
• Locality properties.
• O(log n) space per node.
• O(log n) search, insert, and delete time.
• Independent of system size.
Coming up...
• Load balancing.
• Tolerance to faults.
   • Random faults.
   • Adversarial faults.
• Self-stabilization.

Load balancing
We are interested in the average load on a node u, i.e. the number of searches from source s to destination t that use node u.
Theorem: Let dist(u, t) = d. Then the probability that a search from s to t passes through u is < 2/(d+1), where V = {nodes v: u <= v <= t} and |V| = d+1.

Skip list restriction
[Figure: the lists containing s at levels 0, 1 and 2, with candidate nodes u marked.]
Node u is on the search path from s to t only if it is in the skip list formed from the lists of s at each level.

Tallest nodes
[Figure: two search paths from s to t, one in which u is not on the path and one in which u is on the path.]
Node u is on the search path from s to t only if it is in T = the set of k tallest nodes in [u..t].
$\Pr[u \in T] = \sum_{k=1}^{d+1} \Pr[|T| = k] \cdot \frac{k}{d+1} = \frac{E[|T|]}{d+1}$
Heights are independent of position, so distances are symmetric.

Load on node u
Start with n nodes. Each node goes to the next set with probability 1/2.
We want the expected size of T = the last non-empty set.
We show that E[|T|] < 2.
Asymptotically: E[|T|] = 1/(ln 2) ± 2×10^-5 ≈ 1.4427… [Trie analysis]
Average load on a node is inversely proportional to the distance from the destination.
We also show that the distribution of average load declines exponentially beyond this point.

Experimental result
[Figure: expected load and actual load on a node (0.0 to 1.1) versus node location (76400 to 76650), for destination = 76542.]

Fault tolerance
How do node failures affect skip graph performance?
• Random failures: randomly chosen nodes fail. Experimental results.
• Adversarial failures: an adversary carefully chooses the nodes that fail. Bound on expansion ratio.

Random faults
[Figure: size of the largest connected component as a fraction of live nodes (0.00 to 1.20) versus probability of node failure (0.00 to 0.95), for 131072 nodes.]

Searches with random failures
[Figure: fraction of failed searches (0.00 to 0.25) versus probability of node failure (0.0 to 0.6), for 131072 nodes and 10000 messages.]

Adversarial faults
[Figure: a set A of nodes and its boundary δA.]
δA = nodes adjacent to A but not in A.
Expansion ratio = min |δA|/|A|, 1 <= |A| <= n/2.
Theorem: A skip graph with n nodes has expansion ratio = Ω(1/log n).
f failures can isolate only O(f·log n) nodes.

Proof intuition
Consider the neighbors of a set A at level 0.
1. Clumpy sets: low probability of clumpy sets.
2. Non-clumpy sets: non-clumpy sets have many neighbors at level 0. This gives a high expansion ratio.
[Figure: a clumpy set A and a non-clumpy set A at level 0, each with its boundary δA.]

Expansion ratio
All sets have a low probability of having few neighbors at level h.
And there are not too many clumpy sets.
So there is a low probability that any set A has few neighbors at level 0 or h.
This gives expansion ratio = Ω(1/log n).
Same analysis applicable to DHTs?

Need for repair mechanism
[Figure: the skip graph on keys A, G, J, M, R, W at levels 0, 1 and 2 after node failures.]
Node failures can leave the skip graph in an inconsistent state.

Ideal skip graph
Let xR_i (xL_i) be the right (left) neighbor of x at level i.
Invariant
If xL_i and xR_i exist:
$xL_i < x < xR_i$
$xL_iR_i = xR_iL_i = x$
Successor constraints
$xL_i = xL_{i-1}^k$ for some k.
$xR_i = xR_{i-1}^k$ for some k.
[Figure: node x and its right neighbor xR_i at level i; at level i-1, x is followed by xR_{i-1}^1 and xR_{i-1}^2, each labelled with its membership vector bits (..00.. or ..01..).]

Basic repair
If a node detects a missing neighbor, it tries to patch the link using other levels.
[Figure: a list on nodes 1 to 6 in which a broken link is patched using the neighboring level.]
Also relink at the other lower levels.
Successor constraints may be violated by node arrivals or failures.

Constraint violation
A neighbor at level i is not present at level (i-1).
[Figure: node x at levels i and i-1 with membership vector bits ..00.. and ..01.. on its neighbors; the two levels are merged by a zipper operation.]

Self-stabilization
[Figure: nodes A to J at level i exchanging zipperOp messages zOp(A), zOp(B), zOp(D), zOp(E), zOp(F), zOp(I).]
Eventually we want each connected component of the skip graph to reorganize itself into an ideal skip graph.

Conclusions
Similarities with DHTs
• Decentralization.
• O(log n) space at each node.
• O(log n) search time.
• Load balancing properties.
• Tolerant of random faults.

Differences
Property                          DHTs          Skip Graphs
Insert/Delete time                O(log^2 n)    O(log n)
Locality                          No            Yes
Repair mechanism                  ?             Partial
Tolerance of adversarial faults   ?             Yes
Keyspace size                     Reqd.         Not reqd.

Open Problems
• Design an efficient repair mechanism.
• Incorporate geographical proximity.
• Study multi-dimensional skip graphs.
• Evaluate performance in practice.
• Study the effect of Byzantine failures.
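The invariant and successor constraints from the "Ideal skip graph" slide can be checked mechanically, which may help when experimenting with repair mechanisms. The sketch below is an assumption-laden illustration, not the repair algorithm from the talk: the function names, the dictionary representation of neighbor pointers, and the membership vectors are all hypothetical.

```python
def build_ideal(mvs, levels):
    """Build neighbor tables of an ideal skip graph from a global view.
    mvs maps key -> membership vector (bit string).
    Returns (L, R): dicts mapping (key, level) -> neighbor key or None."""
    L, R = {}, {}
    for i in range(levels + 1):
        groups = {}
        for key in sorted(mvs):                  # one sorted list per i-bit prefix
            groups.setdefault(mvs[key][:i], []).append(key)
        for members in groups.values():
            for j, x in enumerate(members):
                L[x, i] = members[j - 1] if j > 0 else None
                R[x, i] = members[j + 1] if j + 1 < len(members) else None
    return L, R

def reachable(nbr, x, i, target):
    """True if target is reached from x by k >= 1 steps along level i."""
    cur = nbr.get((x, i))
    for _ in range(len(nbr)):                    # bound guards against cycles
        if cur is None:
            return False
        if cur == target:
            return True
        cur = nbr.get((cur, i))
    return False

def violations(L, R):
    """List every (rule, node, level) where the ideal conditions fail."""
    bad = []
    for (x, i), r in R.items():
        if r is None:
            continue
        if not x < r:
            bad.append(("order", x, i))
        if L.get((r, i)) != x:
            bad.append(("xRiLi != x", x, i))
        if i > 0 and not reachable(R, x, i - 1, r):
            bad.append(("successor R", x, i))
    for (x, i), l in L.items():
        if l is None:
            continue
        if R.get((l, i)) != x:
            bad.append(("xLiRi != x", x, i))
        if i > 0 and not reachable(L, x, i - 1, l):
            bad.append(("successor L", x, i))
    return bad

# Hypothetical membership vectors for the keys used in the slides.
mvs = {"A": "000", "G": "100", "J": "001", "M": "011", "R": "110", "W": "101"}
L, R = build_ideal(mvs, levels=2)
print(violations(L, R))          # an ideal graph reports no violations
R["A", 0] = "J"                  # simulate a stale pointer that skips G
print(violations(L, R))
```

A checker like this only detects inconsistencies from a global view; deciding which node should repair which link using local information is the part left open in the slides.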