Distributed Storage Systems for Efficient Repair and Scalable Construction

Joseph C. Koo and John T. Gill, III
Department of Electrical Engineering, Stanford University

Introduction

• Many distributed storage systems built using commodity hardware, where nodes individually unreliable
• Solution: Use redundancy to improve reliability of entire system
  ◦ Failed storage nodes replaced by accessing replicas of lost packets
• New problem: How to repair failed storage nodes without incurring large downtime of any remaining node
  ◦ Want to take advantage of parallel data access

Contributions

• Design of storage systems based on Steiner systems, via construction of special types of bipartite graphs
  ◦ Application of methods from graph theory
• Constructions for scenarios parameterized by number of replicas, k, and storage node size, l
  ◦ Designs guaranteed to have minimum number of nodes
  ◦ Can construct designs where storage node size much larger than number of replicas (where parameters are highly unbalanced, i.e., l ≫ k)

Scenario

• Parameters:

  variable   description
  k          number of replicas of each data packet
  l          number of data packets stored at each node
  v          total number of storage nodes
  u          total number of data packets (usually, u ≫ l)

• Steiner systems guarantee that packets (and replicas) are sufficiently spread throughout the storage system
  ◦ Steiner system (i.e., block design) ensures that no two packets occur together in more than one storage node
  ◦ Set of packets at failed node can always be recovered by obtaining one packet from each of l other nodes
  ◦ Approach applied to distributed storage systems by El Rouayheb and Ramchandran [1] (called fractional repetition codes)

Example

• Each node holds 3 packets; each packet has 4 replicas
• Smallest storage system with these properties requires 12 storage nodes
• Entire system stores total of 9 distinct data packets

  b0: 0 1 2     b4: 1 3 8     b8:  2 4 6
  b1: 0 3 6     b5: 1 4 7     b9:  2 5 8
  b2: 0 4 8     b6: 1 5 6     b10: 3 4 5
  b3: 0 5 7     b7: 2 3 7     b11: 6 7 8

• Equivalent bipartite graph:

  [Figure: bipartite graph joining each packet to the storage nodes that hold it]
  packets:        0 1 2 3 6 4 8 5 7
  storage nodes:  (0,1,2) (0,3,6) (0,4,8) (0,5,7) (3,4,5) (6,8,7) (1,4,7) (1,3,8) (1,6,5) (2,8,5) (2,3,7) (2,6,4)

Bipartite graph interpretation

• If equivalent bipartite graph has no cycle of 4 vertices, then no two storage nodes share any pair of data packets
• Want to construct biregular girth-6 cage graph:
  ◦ Bipartite graph G = (X, Y, E), where degree of X vertices is k and degree of Y vertices is l (also, u = |X| and v = |Y|)
  ◦ Require shortest cycle to have at least 6 vertices

Constructing regular cage graphs

• Construct bipartite graphs satisfying k = l = q + 1, where q is a prime number or a power of a prime number
• Example with q = 3, so k = l = 4:
  ◦ Start with

  [Figure: initial graph on vertices y0, ..., y12 and x0, ..., x12]

• Connect bottom layer of vertices using q − 1 mutually orthogonal Latin squares and 1 more mutually orthogonal square (of dimensions q × q):

$$
L^{(0)} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 1 & 1 \\ 2 & 2 & 2 \end{pmatrix}, \qquad
L^{(1)} = \begin{pmatrix} 0 & 1 & 2 \\ 1 & 2 & 0 \\ 2 & 0 & 1 \end{pmatrix}, \qquad
L^{(2)} = \begin{pmatrix} 0 & 2 & 1 \\ 1 & 0 & 2 \\ 2 & 1 & 0 \end{pmatrix}
$$

  [Figure: resulting graph on vertices y0, ..., y12 and x0, ..., x12]

Recursive construction of unbalanced cage graphs

• Let q be any prime or prime power, and n be any positive integer. Let $p_n(q) = q^n + q^{n-1} + \cdots + q^2 + q + 1$
• From regular cage, can recursively construct the following bipartite cages:

                   n = 1                  general n
  replicas         $k = q + 1$            $k = q + 1$
  node size        $l = q + 1$            $l = p_n(q)$
  storage nodes    $v = q^2 + q + 1$      $v = p_{n+1}(q)$
  total packets    $u = q^2 + q + 1$      $u = \frac{p_{n+1}(q)\, p_n(q)}{q + 1}$

  [Figure: recursive construction on vertices y0, x1, ..., xl, then xl+1, xl+2, ...; labels $q + 1 = k$, $q = k - 1$, $q\, p_{n-1}(q) = l - 1$]

• Example: q = 2, replicas k = 3, node size l = 3, storage nodes v = 7, total packets u = 7

  b0: 0 1 2     b3: 1 3 5     b5: 2 3 4
  b1: 0 3 6     b4: 1 4 6     b6: 2 5 6
  b2: 0 4 5

• Next iteration: q = 2, replicas k = 3, node size l = 7, storage nodes v = 15, total packets u = 35

  b0:  0 1 2 7 8 9 10
  b1:  0 3 6 11 14 15 18
  b2:  0 4 5 12 13 16 17
  b3:  1 3 5 19 22 23 26
  b4:  1 4 6 20 21 24 25
  b5:  2 3 4 27 30 31 34
  b6:  2 5 6 28 29 32 33
  b7:  7 11 13 19 21 27 29
  b8:  7 12 14 20 22 28 30
  b9:  8 11 12 23 25 31 33
  b10: 8 13 14 24 26 32 34
  b11: 9 15 17 19 20 31 32
  b12: 9 16 18 21 22 33 34
  b13: 10 15 16 23 24 27 28
  b14: 10 17 18 25 26 29 30

Scalability of recursive designs

• When expanding storage system, no moving of existing data required
  ◦ Only need to expand size of current storage nodes, and increase number of storage nodes
  ◦ Works because at each iteration, graph from previous iteration is subgraph of new bipartite graph
• Some possible designs:

  replicas k   node size l   storage nodes v   total packets u
  3            3             7                 7
  3            7             15                35
  3            15            31                155
  3            31            63                651
  3            63            127               2667
  3            127           255               10795
  3            255           511               43435
  3            511           1023              174251

  replicas k   node size l   storage nodes v   total packets u
  4            4             13                13
  4            13            40                130
  4            40            121               1210
  4            121           364               11011
  4            364           1093              99463
  4            1093          3280              896260
  4            3280          9841              8069620
  4            9841          29524             72636421

Other constructible designs

• Other designs with similar properties:

                   n = 1              general n
  replicas         $k = q$            $k = q$
  node size        $l = q + 1$        $l = p_n(q)$
  storage nodes    $v = q^2$          $v = q^{n+1}$
  total packets    $u = q^2 + q$      $u = q^n\, p_n(q)$

• Some possible designs:

  replicas k   node size l   storage nodes v   total packets u
  3            4             9                 12
  3            13            27                117
  3            40            81                1080
  3            121           243               9801
  3            364           729               88452
  3            1093          2187              796797
  3            3280          6561              7173360
  3            9841          19683             64566801

Conclusion and discussion

• Storage system designs where number of packet replicas, k, and storage node size, l, may be very different
  ◦ Repair of storage nodes always parallelizable
  ◦ Storage system may be readily expanded
• Steiner design makes storage system resistant to multiple node failures
  ◦ In worst case scenario, if k nodes fail, entire storage system will have lost at most 1 data packet (instead of all data packets for some node)
• Future work: system implementation or prototype
  ◦ Possibly in conjunction with distributed hash tables (DHTs)

Contact: Joseph C. Koo, Department of Electrical Engineering, Stanford University, Stanford, CA 94305; Email: jckoo@stanford.edu

References

[1] S. El Rouayheb and K. Ramchandran, “Fractional repetition codes for repair in distributed storage systems,” in Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sept. 29 – Oct. 1, 2010, pp. 1510–1517.
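
Illustration (not part of the original poster): a minimal Python sketch that checks the properties stated for the 12-node example design with blocks b0 to b11. The block list is copied from the example; the function names (`replica_counts`, `max_shared_packets`, `repair_plan`, `lost_packets`) are hypothetical helpers introduced only for this check, and the repair plan simply picks the first surviving replica, which is an assumption rather than the authors' repair procedure.

```python
from itertools import combinations

# Blocks b0..b11 of the example design: 12 storage nodes, 9 packets (0..8).
NODES = [
    (0, 1, 2), (0, 3, 6), (0, 4, 8), (0, 5, 7),
    (1, 3, 8), (1, 4, 7), (1, 5, 6), (2, 3, 7),
    (2, 4, 6), (2, 5, 8), (3, 4, 5), (6, 7, 8),
]


def replica_counts(nodes):
    """Return {packet: number of nodes storing it}."""
    counts = {}
    for block in nodes:
        for packet in block:
            counts[packet] = counts.get(packet, 0) + 1
    return counts


def max_shared_packets(nodes):
    """Largest number of packets that any two distinct nodes have in common."""
    return max(len(set(a) & set(b)) for a, b in combinations(nodes, 2))


def repair_plan(nodes, failed):
    """Map each packet of the failed node to one surviving node holding a replica.

    Picks the first surviving replica (an illustrative choice). Because no two
    nodes share more than one packet, the chosen helpers are all distinct,
    so every helper serves exactly one packet and the reads can run in parallel.
    """
    plan = {}
    for packet in nodes[failed]:
        plan[packet] = next(
            i for i, block in enumerate(nodes) if i != failed and packet in block
        )
    return plan


def lost_packets(nodes, failed_set):
    """Packets whose every replica sits on a failed node."""
    all_packets = {p for block in nodes for p in block}
    surviving = {
        p for i, block in enumerate(nodes) if i not in failed_set for p in block
    }
    return all_packets - surviving


if __name__ == "__main__":
    print(replica_counts(NODES))
    print(max_shared_packets(NODES))
    print(repair_plan(NODES, failed=0))
    print(lost_packets(NODES, {0, 1, 2, 3}))
```

In this sketch, `max_shared_packets` corresponds to the no-4-cycle condition on the bipartite graph, and `lost_packets` can be evaluated over every set of k = 4 failed nodes to check the worst-case claim in the conclusion.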