SKIP GRAPHS (Joint work with James Aspnes: SODA 2003)

SKIP GRAPHS
James Aspnes
Gauri Shah
To appear in SODA 2003.
2
Outline
• Peer-to-peer systems
• Existing approach: Distributed Hash Tables
• Our Approach: Skip Graphs
• Algorithms and Properties
• Experimental Results
• Conclusions and open problems
3
P2P system
[Figure: peers storing resources, each resource identified by a key]
• Bunch of peers.
• Store resources identified by keys.
• Peers subject to crash failures.
• Goal: locate resources efficiently.
4
Properties of ideal network
• Data availability
• Decentralization
• Fault-tolerance
• Scalability
• Load balancing
• Maintaining the network
• Dynamic node addition/deletion
• Self-stabilization
• Efficient searching
• Incorporating geography
• Incorporating locality [temporal, spatial]
5
Early P2P systems
Napster: central server bottleneck.
Gnutella: inefficient flooding.
6
Tapestry [JKZ’01]
Uses Plaxton’s Algorithm:
Node xyz links to *XX, x*X and xy*
[* = all digits, X = any digit]
[Figure: example overlay with nodes 427, 768, 368, 123, 327, 135, 365, 360]
Correct one digit at a time to reach target.
Pastry [DR’01] is also similar.
7
CAN [RFHKS’01]
Partition d-dimensional co-ordinate space into zones.
[Figure: unit square with corners (0,0), (0,1), (1,0), (1,1) partitioned into numbered zones, d = 2]
Nodes own zones and keys hashed to them.
Greedy routing: forward to neighbor closest to target.
8
Chord [SMKKB‘01]
Nodes and resources mapped to an identifier circle of size 2^m.
Routing table: successor nodes at distances 2^i.
[Figure: identifier circle (n=8) with the successor table of each node]
Greedy routing: forward to node in routing
table closest to target.
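The routing table and the greedy rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Chord's actual implementation: the names `successor`, `finger_table`, and `route`, the node set {0, 1, 3, 6}, and m = 3 are all hypothetical, and every node is assumed to know the full membership so that successors can be computed locally.

```python
def successor(ident, ring):
    """First node at or after ident, wrapping around the circle."""
    for n in ring:
        if n >= ident:
            return n
    return ring[0]


def finger_table(node, nodes, m):
    """Successors of node + 2^i (mod 2^m) for i = 0 .. m-1."""
    ring = sorted(nodes)
    return [successor((node + 2 ** i) % 2 ** m, ring) for i in range(m)]


def route(source, key, nodes, m):
    """Greedy routing: take the finger closest to the key's owner without passing it."""
    size = 2 ** m
    owner = successor(key % size, sorted(nodes))
    path, current = [source], source
    while current != owner:
        remaining = (owner - current) % size
        # Fingers that make clockwise progress but do not overshoot the owner.
        candidates = [f for f in finger_table(current, nodes, m)
                      if 0 < (f - current) % size <= remaining]
        current = max(candidates, key=lambda f: (f - current) % size)
        path.append(current)
    return path


nodes = [0, 1, 3, 6]  # hypothetical live nodes on an 8-point circle (m = 3)
print(finger_table(0, nodes, 3))
print(route(1, 0, nodes, 3))
```

Because the first finger is always the immediate successor, at least one candidate exists at every hop, so the loop terminates at the owner of the key.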
9
Distributed Hash Tables
[Figure: nodes and keys are hashed (HASH) onto a virtual overlay network of virtual nodes v1 to v4; a virtual link or virtual route in the overlay corresponds to an actual route over physical links in the physical network]
10
Advantages
• Load balancing.
• Decentralization.
• O(log n) space and search time.
• Tolerance of random faults.
Disadvantages
• No locality properties.
• No tolerance to adversarial faults.
• O(log^2 n) insert and delete time [search for (log n) neighbors].
• No self-stabilization.
• No optimization wrt. geography.
SKIP GRAPHS
11
Skip List [Pugh ’90]
Data structure based on a linked list.
[Figure: skip list between HEAD and TAIL. Level 0: A, G, J, M, R, W. Level 1: A, J, M. Level 2: J. Each node carries the random bits that decide its height.]
Each node linked at higher level with probability 1/2.
12
Searching in a skip list
Search for key ‘R’
[Figure: search path for 'R' from HEAD through levels 2, 1, and 0 of the skip list over A, G, J, M, R, W; the search ends in success or failure]
Time for search: O(log n) on average.
On average, constant number of pointers per node.
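A minimal Python sketch of a skip list and its top-down search may help here. It is an illustration only: the names `Node`, `random_height`, `build`, and `search` are hypothetical, and the fixed heights in the example (J on three levels, A and M on two) are one reading of the figure, used so that the run is repeatable.

```python
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[i] is the successor at level i


def random_height(rng):
    """Promote a node one level at a time, each time with probability 1/2."""
    height = 1
    while rng.random() < 0.5:
        height += 1
    return height


def build(keys, heights):
    """Link sorted keys into a skip list; HEAD is a sentinel with key None."""
    head = Node(None, max(heights))
    last = [head] * len(head.next)  # rightmost node seen so far at each level
    for key, height in sorted(zip(keys, heights)):
        node = Node(key, height)
        for level in range(height):
            last[level].next[level] = node
            last[level] = node
    return head


def search(head, target):
    """Start at the top level, move right while not overshooting, then drop down."""
    node, hops = head, []
    for level in reversed(range(len(head.next))):
        while node.next[level] is not None and node.next[level].key <= target:
            node = node.next[level]
            hops.append((level, node.key))
    return node.key == target, hops


# Assumed heights, chosen to resemble the figure rather than drawn at random.
head = build("AGJMRW", [2, 1, 3, 2, 1, 1])
print(search(head, "R"))
print(random_height(random.Random(0)))
```

In a real skip list the heights would come from `random_height`, which is what gives the constant expected number of pointers per node.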
13
Skip lists for P2P?
Advantages
• O(log n) expected search time.
• Retains locality.
• Dynamic node additions/deletions.
Disadvantages
• Heavily loaded top-level nodes.
• Easily susceptible to random failures.
• Lacks redundancy.
14
A Skip Graph
[Figure: skip graph over keys A, G, J, M, R, W at levels 0 to 2; each node is labeled with its 3-bit membership vector, and the lists at each level group nodes by membership-vector prefix]
Link at level i to nodes with matching prefix of length i.
Think of a tree of skip lists that share lower layers.
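The linking rule can be illustrated by grouping keys according to membership-vector prefixes. The sketch below is a centralized illustration, not a distributed implementation; the function names and the particular 3-bit membership vectors are hypothetical values chosen for the example.

```python
from collections import defaultdict


def build_skip_graph(membership):
    """Return levels[i][prefix] = sorted keys whose vector starts with prefix (length i)."""
    depth = max(len(vector) for vector in membership.values())
    levels = []
    for i in range(depth + 1):
        lists = defaultdict(list)
        for key in sorted(membership):
            lists[membership[key][:i]].append(key)
        levels.append(dict(lists))
    return levels


def neighbors(levels, membership, key, i):
    """Left and right neighbors of key in its level-i list (None at either end)."""
    members = levels[i][membership[key][:i]]
    pos = members.index(key)
    left = members[pos - 1] if pos > 0 else None
    right = members[pos + 1] if pos + 1 < len(members) else None
    return left, right


# Hypothetical membership vectors, not necessarily those in the figure.
membership = {"A": "000", "G": "100", "J": "001", "M": "011", "R": "110", "W": "101"}
levels = build_skip_graph(membership)
for i, lists in enumerate(levels):
    print(i, lists)
print(neighbors(levels, membership, "J", 1))
```

Level 0 has a single list holding every key; each higher level splits every list in two according to the next bit, which is the "tree of skip lists" picture.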
15
Properties of skip graphs
1. Searching.
2. Node insertions.
3. Independence from system size.
4. Locality and range queries.
16
Searching: avg. O (log n)
Restricting to the lists containing the starting element of the search, we get a skip list.
[Figure: the lists at levels 0, 1, and 2 over A, G, J, M, R, W, with the lists containing the starting node highlighted]
Same performance as DHTs.
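A search that starts at an arbitrary node can be sketched as follows. This is a simplified, single-process illustration under assumptions: the names `Node`, `build`, and `search` and the membership vectors are hypothetical, all vectors have the same length, and each hop that a real system would send as a message is just a pointer dereference here.

```python
class Node:
    def __init__(self, key, vector):
        self.key, self.vector = key, vector
        self.left = [None] * (len(vector) + 1)   # left[i]: left neighbor at level i
        self.right = [None] * (len(vector) + 1)  # right[i]: right neighbor at level i


def build(membership):
    """Link nodes into doubly linked lists, one list per prefix per level."""
    nodes = {key: Node(key, vector) for key, vector in membership.items()}
    depth = len(next(iter(membership.values())))  # assumes equal-length vectors
    for level in range(depth + 1):
        last = {}  # prefix -> most recent node seen in that list
        for key in sorted(nodes):
            node = nodes[key]
            previous = last.get(node.vector[:level])
            if previous is not None:
                previous.right[level], node.left[level] = node, previous
            last[node.vector[:level]] = node
    return nodes


def search(start, target):
    """From the top level of start, move toward target without overshooting, then drop."""
    node, path = start, [start.key]
    for level in reversed(range(len(start.vector) + 1)):
        if target > node.key:
            while node.right[level] is not None and node.right[level].key <= target:
                node = node.right[level]
                path.append(node.key)
        else:
            while node.left[level] is not None and node.left[level].key >= target:
                node = node.left[level]
                path.append(node.key)
    return node, path


# Hypothetical membership vectors, chosen for illustration.
nodes = build({"A": "000", "G": "100", "J": "001", "M": "011", "R": "110", "W": "101"})
print(search(nodes["A"], "R")[1])
print(search(nodes["W"], "G")[1])
```

The search only ever follows lists that contain the starting node's prefix, which is the skip list restriction described above.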
17
Design aspects
Use doubly linked lists at each level to account for
absence of head and tail nodes.
So search can start at any node.
Cannot use circular singly-linked list because it is
hard to detect and repair an error like this:
[Figure: a level 0 circular list over nodes 1 to 12 whose links have become inconsistent]
18
Node Insertion – 1
[Figure: skip graph over A, G, M, R, W at levels 0 to 2, with the new node J and its buddy node marked]
Starting at buddy node, find nearest key at level 0.
Basically a range query looking for key closest to new key.
Takes O(log n) on average.
19
Node Insertion - 2
At each level i, find nearest node with matching prefix of membership vector of length i+1.
[Figure: the skip graph over A, G, J, M, R, W at levels 0 to 2 after J has been linked in at every level]
Total time for insertion: O(log n).
DHTs take: O(log^2 n).
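The two insertion steps can be sketched in Python. This is an illustrative single-process sketch, not the authors' protocol: the names `Node`, `closest_at_level0`, and `insert` and the membership vectors are hypothetical, keys are assumed unique, vectors are assumed to have equal length, and concurrency and failures are ignored.

```python
class Node:
    def __init__(self, key, vector):
        self.key, self.vector = key, vector
        self.left = [None] * (len(vector) + 1)
        self.right = [None] * (len(vector) + 1)


def closest_at_level0(buddy, key):
    """Step 1: search from the buddy for the level-0 node adjacent to key."""
    node = buddy
    for level in reversed(range(len(buddy.vector) + 1)):
        if key > node.key:
            while node.right[level] is not None and node.right[level].key < key:
                node = node.right[level]
        else:
            while node.left[level] is not None and node.left[level].key > key:
                node = node.left[level]
    return node


def insert(buddy, new):
    near = closest_at_level0(buddy, new.key)
    if near.key < new.key:
        left, right = near, near.right[0]
    else:
        left, right = near.left[0], near
    # Step 2: at each level, walk outward along the list one level down
    # until a node sharing one more membership bit is found, then link.
    for level in range(len(new.vector) + 1):
        prefix = new.vector[:level]
        while left is not None and left.vector[:level] != prefix:
            left = left.left[level - 1]
        while right is not None and right.vector[:level] != prefix:
            right = right.right[level - 1]
        new.left[level], new.right[level] = left, right
        if left is not None:
            left.right[level] = new
        if right is not None:
            right.left[level] = new


# Hypothetical membership vectors; J is the new node, W is used as its buddy.
first = Node("A", "000")
others = {k: Node(k, v) for k, v in
          {"G": "100", "M": "011", "R": "110", "W": "101"}.items()}
for node in others.values():
    insert(first, node)
j = Node("J", "001")
insert(others["W"], j)
print([(n.key if n else None) for n in j.left])
print([(n.key if n else None) for n in j.right])
```

At level 0 the empty prefix matches every node, so the walks in step 2 only begin at level 1.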
20
Independent of system size
No need to know size of keyspace or number of nodes.
[Figure: a skip graph on E and Z with levels 0 and 1, shown before and after inserting J; after the insertion the membership vectors are extended and a level 2 appears]
Old nodes extend membership vector as required with arrivals.
DHTs require knowledge of keyspace size initially.
21
Locality and range queries
• Find key < F, > F.
• Find largest key < x.
• Find least key > x.
• Find all keys in interval [D..O].
• Initial node insertion at level 0.
[Figure: level 0 list over keys A, D, F, I, L, O, S]
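Because level 0 keeps all keys in sorted order, these queries reduce to walks along the level 0 list. The sketch below models only level 0 as a doubly linked list, so the walks are linear; in a skip graph the higher levels would first bring the search close to the target in O(log n) expected hops. All names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def build_level0(keys):
    """Doubly linked, sorted level-0 list; returns a dict key -> node."""
    nodes = [Node(k) for k in sorted(keys)]
    for a, b in zip(nodes, nodes[1:]):
        a.right, b.left = b, a
    return {n.key: n for n in nodes}


def largest_below(start, x):
    """Node with the largest key < x, or None."""
    node = start
    while node.right is not None and node.right.key < x:
        node = node.right
    while node is not None and node.key >= x:
        node = node.left
    return node


def least_above(start, x):
    """Node with the least key > x, or None."""
    node = start
    while node.left is not None and node.left.key > x:
        node = node.left
    while node is not None and node.key <= x:
        node = node.right
    return node


def keys_in_interval(start, lo, hi):
    """All keys in [lo..hi], found by locating lo and walking right."""
    node = start
    while node.left is not None and node.left.key >= lo:
        node = node.left
    while node is not None and node.key < lo:
        node = node.right
    found = []
    while node is not None and node.key <= hi:
        found.append(node.key)
        node = node.right
    return found


level0 = build_level0("ADFILOS")
start = level0["I"]  # a query may start at any node
print(largest_below(start, "F").key, least_above(start, "F").key)
print(keys_in_interval(start, "D", "O"))
```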
22
Applications of locality
Version Control
e.g. find latest news from yesterday.
find largest key < news:10/29.
Level 0: news:10/25, news:10/26, news:10/27, news:10/28, news:10/29
Data Replication
e.g. find any copy of some Britney Spears song.
Level 0: britney01, britney02, britney03, britney04, britney05
DHTs cannot do this easily as hashing destroys locality.
23
So far...
Decentralization.
Locality properties.
O(log n) space per node.
O(log n) search, insert, and delete time.
Independent of system size.
Coming up...
• Load balancing.
• Tolerance to faults.
   • Random faults.
   • Adversarial faults.
• Self-stabilization.
24
Load balancing
Interested in the average load on a node u,
i.e. the number of searches from source
s to destination t that use node u.
Theorem: Let dist(u, t) = d. Then the
probability that a search from s to t passes
through u is < 2/(d+1),
where V = {nodes v: u <= v <= t} and |V| = d+1.
25
Skip list restriction
[Figure: the skip list formed by the lists of s at levels 0 to 2, with candidate nodes u marked]
Node u is on the search path from s to t only if it is in
the skip list formed from the lists of s at each level.
26
Tallest nodes
[Figure: two cases for a search from s to t. In the first, u is not on the path; in the second, u is on the path.]
Node u is on the search path from s to t only if it is
in T = the set of k tallest nodes in [u..t].
$$\Pr[u \in T] = \sum_{k=1}^{d+1} \Pr[|T| = k] \cdot \frac{k}{d+1} = \frac{E[|T|]}{d+1}.$$
Heights independent of position, so distances are symmetric.
27
Load on node u
Start with n nodes. Each node goes to next set with prob. 1/2.
We want expected size of T = last non-empty set.
We show that: E[|T|] < 2.
Asymptotically: E[|T|] = 1/(ln 2) ± 2×10^-5 ≈ 1.4427… [Trie analysis]
Average load on a node is inversely proportional
to the distance from the destination.
We also show that the distribution of average load
declines exponentially beyond this point.
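The process that defines T is easy to simulate, which gives a quick sanity check on the bound E[|T|] < 2. The following Monte Carlo sketch is only an empirical illustration and proves nothing; the function names, the choice n = 1000, the number of trials, and the seed are all assumptions.

```python
import random


def last_nonempty_set_size(n, rng):
    """Start with n nodes; each node moves to the next set with probability 1/2.
    Return the size of the last non-empty set."""
    count = n
    while True:
        # The number of set bits among `count` random bits is Binomial(count, 1/2),
        # i.e. the number of nodes that advance to the next set.
        survivors = bin(rng.getrandbits(count)).count("1")
        if survivors == 0:
            return count
        count = survivors


def estimate_expected_T(n, trials, seed=0):
    """Average |T| over independent trials, with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return sum(last_nonempty_set_size(n, rng) for _ in range(trials)) / trials


print(estimate_expected_T(n=1000, trials=20000))
```

With these settings the printed estimate should fall well below 2 and near the asymptotic value quoted above, up to sampling noise.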
28
Experimental result
[Figure: expected load and actual load on a node (y-axis, 0.0 to 1.1) against node location (x-axis, 76400 to 76650); destination = 76542]
29
Fault tolerance
How do node failures affect skip graph performance?
Random failures: Randomly chosen nodes fail.
Experimental results.
Adversarial failures: Adversary carefully chooses
nodes that fail.
Bound on expansion ratio.
30
Random faults
[Figure: size of the largest connected component as a fraction of live nodes (y-axis) against probability of node failure (x-axis, 0.00 to 0.95); 131072 nodes]
31
Searches with random failures
[Figure: fraction of failed searches (y-axis, 0.00 to 0.25) against probability of node failure (x-axis, 0.0 to 0.6); 131072 nodes, 10000 messages]
32
Adversarial faults
[Figure: a set A and its neighbor set dA]
dA = nodes adjacent to A but not in A.
Expansion ratio = min |dA|/|A|, taken over sets A with 1 <= |A| <= n/2.
Theorem: A skip graph with n nodes has
expansion ratio = Ω(1/log n).
f failures can isolate only O(f log n) nodes.
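For a very small graph the expansion ratio can be computed by brute force straight from the definition. The sketch below enumerates every set A, so it is exponential in n and only useful for toy examples; the function names and the membership vectors are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def skip_graph_adjacency(membership):
    """Undirected adjacency sets: neighbors in every list at every level."""
    adjacency = {key: set() for key in membership}
    depth = max(len(vector) for vector in membership.values())
    for level in range(depth + 1):
        last = {}  # prefix -> previous key in that list
        for key in sorted(membership):
            prefix = membership[key][:level]
            if prefix in last:
                adjacency[key].add(last[prefix])
                adjacency[last[prefix]].add(key)
            last[prefix] = key
    return adjacency


def expansion_ratio(adjacency):
    """min |dA| / |A| over all sets A with 1 <= |A| <= n/2."""
    nodes = list(adjacency)
    best = None
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            # dA: nodes adjacent to A but not in A.
            boundary = {v for u in chosen for v in adjacency[u]} - chosen
            ratio = Fraction(len(boundary), len(chosen))
            if best is None or ratio < best:
                best = ratio
    return best


# Hypothetical membership vectors, chosen for illustration.
membership = {"A": "000", "G": "100", "J": "001", "M": "011", "R": "110", "W": "101"}
print(expansion_ratio(skip_graph_adjacency(membership)))
```

The same function works on any adjacency dictionary, for example a plain ring, which makes it easy to compare a skip graph with the level 0 list alone.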
33
Proof intuition
Consider neighbors of set A at level 0.
1. Clumpy sets
Low probability of clumpy sets.
2. Non-clumpy sets
Non-clumpy sets have many neighbors at level 0.
Gives high expansion ratio.
[Figure: a clumpy set A and a non-clumpy set A at level 0, each with its neighbors dA]
34
Expansion ratio
All sets have low probability of few neighbors at level h.
And there are not too many clumpy sets.
Low probability that any set A has few
neighbors at level 0 or h.
This gives expansion ratio = Ω (1/log n).
Same analysis applicable to DHTs?
35
Need for repair mechanism
[Figure: the lists at levels 0, 1, and 2 over A, G, J, M, R, W after node failures]
Node failures can leave skip graph in inconsistent state.
36
Ideal skip graph
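Let $xR_i$ ($xL_i$) be the right (left) neighbor of x
at level i.
Invariant: if $xL_i$, $xR_i$ exist:
$xL_i < x < xR_i$.
$xL_iR_i = xR_iL_i = x$.
Successor constraints: for some k,
$xL_i = xL_{i-1}^k$.
$xR_i = xR_{i-1}^k$.
[Figure: node x (..00..) at levels i and i-1; at level i-1 its first right neighbor $xR_{i-1}^1$ is ..01.. and its second, $xR_{i-1}^2$, is ..00.., which is also $xR_i$]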
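The invariant and the successor constraints translate directly into a consistency check. The sketch below is a centralized checker written for illustration only (a real node can inspect only its own neighbors); the names `Node`, `link`, `reaches`, and `violations` are hypothetical, and the bound on the walk length exists only to guard against corrupted, cyclic links.

```python
class Node:
    def __init__(self, key, levels):
        self.key = key
        self.left = [None] * levels
        self.right = [None] * levels


def link(a, b, level):
    """Make b the right neighbor of a at the given level."""
    a.right[level], b.left[level] = b, a


def reaches(x, target, level, side, limit):
    """True if target is x's side-neighbor at `level` applied k times, 1 <= k <= limit."""
    node = getattr(x, side)[level]
    for _ in range(limit):
        if node is None:
            return False
        if node is target:
            return True
        node = getattr(node, side)[level]
    return False


def violations(nodes):
    problems = []
    for x in nodes:
        for i in range(len(x.right)):
            left, right = x.left[i], x.right[i]
            # Invariant: xL_i < x < xR_i and xL_i R_i = xR_i L_i = x.
            if left is not None and not (left.key < x.key and left.right[i] is x):
                problems.append((x.key, i, "left invariant"))
            if right is not None and not (x.key < right.key and right.left[i] is x):
                problems.append((x.key, i, "right invariant"))
            # Successor constraints: level-i neighbors are reachable at level i-1.
            if i > 0:
                if left is not None and not reaches(x, left, i - 1, "left", len(nodes)):
                    problems.append((x.key, i, "left successor constraint"))
                if right is not None and not reaches(x, right, i - 1, "right", len(nodes)):
                    problems.append((x.key, i, "right successor constraint"))
    return problems


a, b, c = Node(1, 2), Node(2, 2), Node(3, 2)
link(a, b, 0)
link(b, c, 0)
link(a, c, 1)
print(violations([a, b, c]))  # empty list for this consistent graph
```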
37
Basic repair
If a node detects a missing neighbor, it tries
to patch the link using other levels.
[Figure: lists over nodes 1 to 6 at several levels, with a missing link patched using another level]
Also relink at other lower levels.
Successor constraints may be violated by node
arrivals or failures.
38
Constraint violation
Neighbor at level i not present at level (i-1).
[Figure: nodes with membership vectors ..00.. and ..01.. at levels i-1 and i, before and after a zipper operation on node x]
39
Self-stabilization
[Figure: nodes A to J at level i exchanging zipperOp (zOp) messages]
Eventually want each connected component of the skip
graph to reorganize itself into an ideal skip graph.
40
Conclusions
Similarities with DHTs
• Decentralization.
• O(log n) space at each node.
• O(log n) search time.
• Load balancing properties.
• Tolerant of random faults.
41
Differences
Property                          DHTs         Skip Graphs
Insert/Delete time                O(log^2 n)   O(log n)
Locality                          No           Yes
Repair mechanism                  ?            Partial
Tolerance of adversarial faults   ?            Yes
Keyspace size                     Reqd.        Not reqd.
42
Open Problems
• Design efficient repair mechanism.
• Incorporate geographical proximity.
• Study multi-dimensional skip graphs.
• Evaluate performance in practice.
• Study effect of byzantine failures.