Distributed Storage Systems for Efficient Repair and Scalable Construction
Joseph C. Koo and John T. Gill, III
Department of Electrical Engineering, Stanford University
Introduction
• Many distributed storage systems are built using commodity hardware, where individual nodes are unreliable
• Solution: use redundancy to improve the reliability of the entire system
  ◦ Failed storage nodes are replaced by accessing replicas of the lost packets
• New problem: how to repair a failed storage node without incurring large downtime on any remaining node
  ◦ Want to take advantage of parallel data access

Contributions
• Design of storage systems based on Steiner systems, via construction of special types of bipartite graphs
  ◦ Application of methods from graph theory
Scenario
• Parameters:

  variable   description
  k          number of replicas of each data packet
  l          number of data packets stored at each node
  v          total number of storage nodes
  u          total number of data packets (usually, u ≫ l)

Bipartite graph interpretation
• If the equivalent bipartite graph has no cycle of 4 vertices, then no two storage nodes share any pair of data packets
• Want to construct a biregular girth-6 cage graph
• Steiner systems guarantee that packets (and replicas) are sufficiently spread throughout the storage system
  ◦ A Steiner system (i.e., block design) ensures that no two packets occur together in more than one storage node
  ◦ The set of packets at a failed node can always be recovered by obtaining one packet from each of l other nodes
  ◦ Approach applied to distributed storage systems by El Rouayheb and Ramchandran [1], called fractional repetition codes
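The two properties above (no shared pair of packets, and repair by one packet from each of l other nodes) can be checked directly on a small block design. A minimal sketch in Python, using the 7-node, q = 2 design that appears as an example later on this poster; the `repair` helper is my own illustration, not code from the authors:

```python
from itertools import combinations

# Storage nodes of the q = 2 example design: v = 7 nodes, l = 3 packets
# per node, k = 3 replicas of each of the u = 7 packets.
nodes = [{0, 1, 2}, {0, 3, 6}, {0, 4, 5}, {1, 3, 5},
         {1, 4, 6}, {2, 3, 4}, {2, 5, 6}]

# Steiner property: no two storage nodes share more than one packet
# (equivalently, the bipartite graph has no 4-cycle).
assert all(len(a & b) <= 1 for a, b in combinations(nodes, 2))

def repair(failed, nodes):
    """Plan the repair of a failed node: read each lost packet from a
    distinct surviving node, so all l reads can proceed in parallel."""
    plan = {}   # packet -> index of the surviving node it is read from
    used = set()
    for packet in nodes[failed]:
        for i, node in enumerate(nodes):
            if i != failed and i not in used and packet in node:
                plan[packet] = i
                used.add(i)
                break
    # The Steiner property makes the greedy choice safe: two packets of
    # the failed node never share a second node, so the candidate helper
    # nodes for different packets are disjoint.
    return plan

plan = repair(0, nodes)
assert set(plan) == nodes[0]                     # every lost packet recovered
assert len(set(plan.values())) == len(nodes[0])  # from distinct nodes
```

The final assertions confirm that node b0 is rebuilt with one read apiece from three distinct surviving nodes.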
• Bipartite cages obtained from the constructions (with q = k − 1 and q·p_{n−1}(q) = l − 1):

                   regular cage       unbalanced cage
  replicas         k = q + 1          k = q + 1
  node size        l = q + 1          l = p_n(q)
  storage nodes    v = q² + q + 1     v = p_{n+1}(q)
  total packets    u = q² + q + 1     u = p_{n+1}(q)·p_n(q)/(q + 1)
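These closed-form parameters can be sanity-checked numerically. A small sketch (the function names are mine, not from the poster):

```python
def p(n, q):
    """p_n(q) = q**n + q**(n-1) + ... + q + 1."""
    return sum(q**i for i in range(n + 1))

def cage_params(q, n):
    """(k, l, v, u) for the recursively constructed unbalanced cage."""
    k = q + 1
    l = p(n, q)
    v = p(n + 1, q)
    u = p(n + 1, q) * p(n, q) // (q + 1)   # division is always exact
    return k, l, v, u

# Counting packet replicas two ways must give u*k == v*l.
for q in (2, 3, 4, 5, 7):
    for n in range(1, 7):
        k, l, v, u = cage_params(q, n)
        assert u * k == v * l

# n = 1 recovers the regular cage: k = l = q + 1, v = u = q**2 + q + 1.
assert cage_params(2, 1) == (3, 3, 7, 7)
assert cage_params(2, 2) == (3, 7, 15, 35)   # the 15-node example design
```

The identity u·k = v·l holds because both sides count the total number of stored packet replicas.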
• Some possible designs (q = 2, so k = 3):

  replicas k   node size l   storage nodes v   total packets u
  3            3             7                 7
  3            7             15                35
  3            15            31                155
  3            31            63                651
  3            63            127               2667
  3            127           255               10795
  3            255           511               43435
  3            511           1023              174251

• Some possible designs (q = 3, so k = 4):

  replicas k   node size l   storage nodes v   total packets u
  4            4             13                13
  4            13            40                130
  4            40            121               1210
  4            121           364               11011
  4            364           1093              99463
  4            1093          3280              896260
  4            3280          9841              8069620
  4            9841          29524             72636421
Example
• Each node holds 3 packets; each packet has 4 replicas
• Smallest storage system with these properties requires 12 storage nodes
• Entire system stores a total of 9 distinct data packets
• Storage node contents:

  b0 = {0,1,2}   b1 = {0,3,6}   b2 = {0,4,8}    b3 = {0,5,7}
  b4 = {1,3,8}   b5 = {1,4,7}   b6 = {1,5,6}    b7 = {2,3,7}
  b8 = {2,4,6}   b9 = {2,5,8}   b10 = {3,4,5}   b11 = {6,7,8}

• Connect bottom layer of vertices using q − 1 mutually-orthogonal Latin squares and 1 more mutually-orthogonal square (of dimensions q × q):

          ( 0 0 0 )          ( 0 1 2 )          ( 0 2 1 )
  L(0) =  ( 1 1 1 )   L(1) = ( 1 2 0 )   L(2) = ( 2 0 1 )ᵀ = ( 1 0 2 )
          ( 2 2 2 )          ( 2 0 1 )          ( 2 1 0 )

  ◦ One way to read the construction: numbering the packet in row i, column j of a 3 × 3 array as 3i + j, each symbol r of square L(s) yields one storage node holding the packets in the cells where L(s) equals r; the columns of the array supply the remaining parallel class of nodes
• Equivalent bipartite graph: the 12 storage-node vertices x0, …, x11 form one layer, each joined to the 3 packet vertices (among y0, …, y8) that it stores
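The Latin-square construction can be carried out mechanically. A sketch under my own indexing convention (packet 3i + j sits in row i, column j of a 3 × 3 array; the convention is mine, not stated on the poster):

```python
from itertools import combinations

q = 3
L = [
    [[0, 0, 0], [1, 1, 1], [2, 2, 2]],   # L(0)
    [[0, 1, 2], [1, 2, 0], [2, 0, 1]],   # L(1)
    [[0, 2, 1], [1, 0, 2], [2, 1, 0]],   # L(2)
]

# Square s, symbol r -> one storage node: the packets in cells (i, j)
# where L(s)[i][j] == r.  Each square yields a parallel class of q nodes.
nodes = [{q * i + j for i in range(q) for j in range(q) if L[s][i][j] == r}
         for s in range(q) for r in range(q)]
# One more parallel class: the columns of the array.
nodes += [{q * i + j for i in range(q)} for j in range(q)]

assert len(nodes) == q * q + q                 # v = 12 storage nodes
assert all(len(b) == q for b in nodes)         # node size l = 3
# Each of the u = 9 packets has k = q + 1 = 4 replicas.
assert all(sum(pkt in b for b in nodes) == q + 1 for pkt in range(q * q))
# Girth-6 condition: no two storage nodes share a pair of packets.
assert all(len(a & b) <= 1 for a, b in combinations(nodes, 2))
```

Orthogonality of the squares is exactly what guarantees the final assertion: two nodes from different squares meet in precisely one cell.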
• Example:
  ◦ Start with q = 2: replicas k = 3, node size l = 3, storage nodes v = 7, total packets u = 7
  ◦ Storage node contents:

    b0 = {0,1,2}   b1 = {0,3,6}   b2 = {0,4,5}   b3 = {1,3,5}
    b4 = {1,4,6}   b5 = {2,3,4}   b6 = {2,5,6}
Constructing regular cage graphs
• Construct bipartite graphs satisfying k = l = q + 1, where q is a prime number or a power of a prime number
  ◦ Bipartite graph G = (X, Y, E), where the degree of X vertices is k and the degree of Y vertices is l (also, u = |X| and v = |Y|)
  ◦ Require the shortest cycle to have at least 6 vertices
• Example with q = 3, so k = l = 4: [figure omitted: girth-6 bipartite graph on vertices x0, …, x12 and y0, …, y12]

Recursive construction of unbalanced cage graphs
• From the regular cage, can recursively construct bipartite cages with k = q + 1, l = p_n(q), v = p_{n+1}(q), and u = p_{n+1}(q)·p_n(q)/(q + 1), where
  p_n(q) = q^n + q^(n−1) + · · · + q² + q + 1
• Next iteration: [figure omitted: each storage node is joined to additional new packets x_{l+1}, x_{l+2}, …]
• When expanding the storage system, no moving of existing data is required
  ◦ Only need to expand the size of current storage nodes, and increase the number of storage nodes
  ◦ Works because at each iteration, the graph from the previous iteration is a subgraph of the new bipartite graph
• Constructions for scenarios parameterized by the number of replicas, k, and the storage node size, l
  ◦ Designs guaranteed to have the minimum number of nodes
  ◦ Can construct designs where the storage node size is much larger than the number of replicas (parameters highly unbalanced, i.e., l ≫ k)
Scalability of recursive designs
• Let q be any prime or prime power, and let n be any positive integer. Set the node size to
  l = p_n(q) = q^n + q^(n−1) + · · · + q² + q + 1
Other constructible designs
• Other designs with similar properties:

                   design 1       design 2
  replicas         k = q          k = q
  node size        l = q + 1      l = p_n(q)
  storage nodes    v = q²         v = q^(n+1)
  total packets    u = q² + q     u = q^n·p_n(q)
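These families also satisfy the replica-counting identity u·k = v·l. A quick numerical check (the helper names are mine):

```python
def p(n, q):
    """p_n(q) = q**n + ... + q + 1."""
    return sum(q**i for i in range(n + 1))

def design1(q):
    """First family: k = q, l = q + 1, v = q**2, u = q**2 + q."""
    return q, q + 1, q**2, q**2 + q

def design2(q, n):
    """Second family: k = q, l = p_n(q), v = q**(n+1), u = q**n * p_n(q)."""
    return q, p(n, q), q**(n + 1), q**n * p(n, q)

for q in (2, 3, 4, 5):
    k, l, v, u = design1(q)
    assert u * k == v * l
    for n in range(1, 7):
        k, l, v, u = design2(q, n)
        assert u * k == v * l

# The first family is the n = 1 case of the second:
assert design1(3) == design2(3, 1) == (3, 4, 9, 12)
```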
• Some possible designs: replicas k = 3, node size l = 7, storage nodes v = 15, total packets u = 35
• Storage node contents:

  b0 = {0,1,2,7,8,9,10}      b5 = {2,3,4,27,30,31,34}    b10 = {8,13,14,24,26,32,34}
  b1 = {0,3,6,11,14,15,18}   b6 = {2,5,6,28,29,32,33}    b11 = {9,15,17,19,20,31,32}
  b2 = {0,4,5,12,13,16,17}   b7 = {7,11,13,19,21,27,29}  b12 = {9,16,18,21,22,33,34}
  b3 = {1,3,5,19,22,23,26}   b8 = {7,12,14,20,22,28,30}  b13 = {10,15,16,23,24,27,28}
  b4 = {1,4,6,20,21,24,25}   b9 = {8,11,12,23,25,31,33}  b14 = {10,17,18,25,26,29,30}

• Note that, restricted to packets 0–6, nodes b0, …, b6 are exactly the nodes of the q = 2, v = 7 design, so the smaller system is expanded without moving any existing data
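The subgraph property of the recursive construction can be verified on the two example designs above (with the 7-node design as the previous iteration; the variable names are mine):

```python
from itertools import combinations

# Previous iteration: q = 2 design with v = 7 nodes, u = 7 packets.
small = [{0, 1, 2}, {0, 3, 6}, {0, 4, 5}, {1, 3, 5},
         {1, 4, 6}, {2, 3, 4}, {2, 5, 6}]

# Next iteration: the k = 3, l = 7, v = 15, u = 35 design listed above.
big = [
    {0, 1, 2, 7, 8, 9, 10},      {0, 3, 6, 11, 14, 15, 18},
    {0, 4, 5, 12, 13, 16, 17},   {1, 3, 5, 19, 22, 23, 26},
    {1, 4, 6, 20, 21, 24, 25},   {2, 3, 4, 27, 30, 31, 34},
    {2, 5, 6, 28, 29, 32, 33},   {7, 11, 13, 19, 21, 27, 29},
    {7, 12, 14, 20, 22, 28, 30}, {8, 11, 12, 23, 25, 31, 33},
    {8, 13, 14, 24, 26, 32, 34}, {9, 15, 17, 19, 20, 31, 32},
    {9, 16, 18, 21, 22, 33, 34}, {10, 15, 16, 23, 24, 27, 28},
    {10, 17, 18, 25, 26, 29, 30},
]

# The expanded design is still girth-6 with k = 3 replicas per packet.
assert all(len(a & b) <= 1 for a, b in combinations(big, 2))
assert all(sum(pkt in b for b in big) == 3 for pkt in range(35))

# No data movement on expansion: every old node's packets are a subset
# of the corresponding new node, so the old graph is a subgraph.
assert all(small[i] <= big[i] for i in range(7))
```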
Conclusion and discussion
• Storage system designs where the number of packet replicas, k, and the storage node size, l, may be very different
  ◦ Repair of storage nodes is always parallelizable
  ◦ Storage system may be readily expanded
• Steiner design makes the storage system resistant to multiple node failures
  ◦ In the worst case, if k nodes fail, the entire storage system loses at most 1 data packet (instead of all data packets at some node)
• Future work: system implementation or prototype
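The worst-case failure claim can be checked exhaustively on the small example design (the q = 2, 7-node system from the poster; the `lost` helper is my own):

```python
from itertools import combinations

nodes = [{0, 1, 2}, {0, 3, 6}, {0, 4, 5}, {1, 3, 5},
         {1, 4, 6}, {2, 3, 4}, {2, 5, 6}]
k = 3   # replicas per packet

def lost(failed):
    """Packets whose every replica lives on a failed node."""
    survivors = [b for i, b in enumerate(nodes) if i not in failed]
    return set().union(*nodes) - set().union(*survivors)

# Failing the k nodes that hold packet 0 loses exactly that one packet.
assert lost({0, 1, 2}) == {0}
# Exhaustive check: no set of k failed nodes ever loses more than 1 packet.
assert all(len(lost(set(f))) <= 1
           for f in combinations(range(len(nodes)), k))
```

Intuitively, if two packets each had all k replicas inside the same k failed nodes, every pair of those nodes would share both packets, contradicting the Steiner property.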
• Some possible designs (k = q family with q = 3):

  replicas k   node size l   storage nodes v   total packets u
  3            4             9                 12
  3            13            27                117
  3            40            81                1080
  3            121           243               9801
  3            364           729               88452
  3            1093          2187              796797
  3            3280          6561              7173360
  3            9841          19683             64566801
  ◦ Possibly in conjunction with distributed hash tables (DHTs)

Contact: Joseph C. Koo, Department of Electrical Engineering, Stanford University, Stanford, CA 94305; Email: jckoo@stanford.edu
References
[1] S. El Rouayheb and K. Ramchandran, "Fractional repetition codes for repair in distributed storage systems," in Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sept. 29 – Oct. 1, 2010, pp. 1510–1517.