Bringing Reliability to the Cloud: Regenerating Codes for Distributed Storage

Nihar B. Shah (1), K. V. Rashmi (1), P. Vijay Kumar (1), Kannan Ramchandran (2)
(1) Indian Institute of Science, Bangalore; (2) University of California, Berkeley
Chinese University of Hong Kong, Hong Kong, Sep. 21, 2010
Shah, Rashmi, Kumar and Ramchandran   Regenerating Codes for Distributed Storage Networks   1/43

Outline
1 Distributed Data Storage
2 Regenerating Codes
  Network Coding
3 Results in Perspective
4 Constructions
  MBR Code with d = n − 1
  MISER Code
  The Product-Matrix Code Construction

Distributed Data Storage Network
Information pertaining to a file is dispersed across data centers (server nodes) in the network.
A user node taps into surrounding servers to retrieve the data.

Network Assumptions
Nodes are prone to failure (nodes down for maintenance, nodes coming and going).
We would like to minimize the total amount of storage for a desired level of performance.
It is desirable to avoid congestion across links in the network.
We restrict attention to linear operations at the individual nodes.

An Example Node: A Large Data Center

File Size and Alphabet
Data file of B symbols over an alphabet A of size q, e.g.,
A = F_q

Single-Server Option
All data is hosted on a single server node.
Vulnerable to even a single node failure.
Links surrounding the server node will be subject to congestion.

Replicated Server Option
Data is replicated across ℓ nodes in the network.
Can tolerate ℓ − 1 node failures.
But stores ℓB units of data.
Links surrounding each server may still be congested.

Erasure Code Option
Data is split into k fragments, and an [n, k] MDS code is used to generate n fragments, which are stored in n nodes in the network.
Each fragment represents a single symbol from the finite field F_q.
Stores (n/k)B units of data.
Links are now less likely to be congested, as the data is dispersed.
Can tolerate n − k node failures: greater probability of network survival.

RAID Example
Data symbols: A, B
Disk 1: A
Disk 2: B
Disk 3: A + B
Disk 4: A + θB
Each node stores one symbol; the data stored across the network then looks like a codeword dispersed in space.
This is a (4, 2) MDS code, used in RAID 6.

But How Does One Handle Node Failure?
The obvious option:
Connect to any k nodes, reconstruct the entire data file, then reconstruct the data stored in the failed node.
[Figure: (4, 2) MDS code used in RAID 6, with Disk 1: A, Disk 2: B, Disk 3: A + B, Disk 4: A + θB; the new disk 1 downloads B and A + B.]
But downloading B units of data to revive a node that stores B/k units of data is wasteful!
Is there a better option?

Yes! Introducing Regenerating Codes
Disk 1: A1, A2
Disk 2: B1, B2
Disk 3: 2A1 + 2A2 + B1, A2 + 2B1 + 2B2
Disk 4: 2A1 + 4A2 + 2B1, A2 + 2B1 + 4B2
Store vectors as opposed to scalars at each node and use vector MDS codes.
New alphabet: A = F_q^α, α > 1 (in the figure, α = 2 and q = 5).

How Does a Vector Alphabet Help?
We illustrate with an example: disk 1 fails, and the new disk 1 downloads one symbol from each surviving disk.
From Disk 2: B1
From Disk 3: 2A1 + 2A2 + B1
From Disk 4: 2A1 + 4A2 + 2B1
Subtracting the known B1 leaves two independent equations in A1 and A2 over F_5, so both can be recovered.
(download 3 half-symbols as opposed to 2 full-symbols)

Fractional Information Transfer
[Figure: the existing nodes and a new node 1, with a fractional transfer on each link.]
The key point is that when each node stores a vector with α components, it is possible to transfer a fraction of the information stored in a node to aid in node repair.
The potential savings can be quantified using principles of network coding.

Recap on Notation
{F_q, [n, k], [α, B]}
n = number of nodes in the network
k = number of nodes to connect to for reconstructing the file
B = file size
α = number of symbols stored per node
[Figure: n storage nodes, each of capacity α, and a data collector (DC).]

Two Additional Parameters: (d, β)
d = number of nodes to connect to for regeneration of a failed node (k ≤ d ≤ n − 1)
β = number of symbols downloaded from each node for regeneration (β < α)
The quantity dβ is termed the “repair bandwidth”.
Extended parameter set: {F_q, [n, k, d], [α, β, B]}
[Figure: new node 1, of capacity α, downloads β symbols from each of the d nodes it connects to.]

Network Coding Example
[Figure: butterfly network with source S sending symbols x and y, intermediate nodes a, b, c, d, and sinks t1 and t2; the link from c to d carries x + y.]
Both sinks want all the data: an example of a “multicast”.
Min-cut capacity to each sink = 2.
The max-flow min-cut (MFMC) theorem of commodity flow tells us that each sink can get the information.
Network coding tells us that in this situation, both sinks can recover the information!
An example solution is illustrated.

Cut-Set Bound for Regenerating Codes
The network here is the original set of n nodes plus all subsequent replacement nodes.
It turns out that the min-cut capacity to a data collector (DC) is given by the sum below, which must be at least the file size B:
\[ \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\} \ge B \]
[Figure: flow graph showing the DC, the in and out vertices of the nodes, and a cut.]

The Minimum-Storage Regeneration (MSR) Point
(minimize α, then β)
\[ B \le \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\} \]
\[ B = \sum_{i=0}^{k-1} \alpha, \qquad (d-i)\beta \ge \alpha \ \text{ for all } i \]
i.e.,
\[ \alpha = \frac{B}{k}, \qquad \beta = \frac{\alpha}{d-k+1} \]
or,
\[ \alpha = \beta(d-k+1), \qquad B = \alpha k \]
(Note: both α and B are multiples of β.)

The Minimum-Bandwidth Regeneration (MBR) Point
(minimize β, then α)
\[ B \le \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\} \]
\[ B = \sum_{i=0}^{k-1} (d-i)\beta, \qquad \alpha \ge (d-i)\beta \ \text{ for all } i \]
i.e.,
\[ B = \left[ dk - \binom{k}{2} \right] \beta, \qquad \alpha = d\beta \]
(Note: again, both α and B are multiples of β.)
(Other operating points are possible; in general, there is a tradeoff.)

The Storage-Repair Bandwidth Tradeoff Curve
[Figure: repair bandwidth dβ versus storage α.]
MSR and MBR are the two extreme points.

Summarizing MSR and MBR Parameters
(storage per node, repair bandwidth)
MSR point (store the least possible; download a little more):
\[ \left( \alpha_{\min},\ d\beta_{\min} \right) = \left( \frac{B}{k},\ \ \frac{B}{k} \cdot \frac{1}{1 - \frac{k-1}{d}} \right) \]
MBR point (store a little more; download only what you store):
\[ \left( \alpha_{\min},\ d\beta_{\min} \right) = \left( \frac{B}{k} \cdot \frac{1}{1 - \frac{k-1}{2d}},\ \ \frac{B}{k} \cdot \frac{1}{1 - \frac{k-1}{2d}} \right) \]

β = 1 and “Striping”
At both the MSR and MBR endpoints, α and B are multiples of β.
β dictates the smallest file size for which a solution exists.
A solution for β = 1 can be replicated to give solutions for β > 1 (“striping” of data).
Thus constructions for β = 1 are of greatest interest.

Exact versus Functional Regeneration
\[ B \le \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\} \]
The bound is known to be achievable for functional regeneration.
What if one demanded exact regeneration?

Brief History
Dimakis et al. 2007, Wu 2009: existence of functional regenerating codes
Wu et al. 2009: E-MSR [n, 2, n−1] (IA)
Our group 2009: E-MBR (uncoded repair) [n, k, n−1]; AE-MSR [n, k, k+1]
Cullina et al. 2009: E-MSR [5, 3, 4], [4, 2, 3] (computer search)
Wu 2009: S AE-MSR [n ≥ 2k, k, k+1]
Our group 2009: SE-MSR [n ≥ 2k, k, n−1]; non-achievability of MSR for d < 2k−3, β = 1 (IA)
Legend: [*, *, *] = [n, k, d]; E = exact; AE = approx. exact; SE = exact, only systematic nodes; IA = interference alignment. Explicit codes are marked separately in the figure.

Brief History (2)
Our group 2010: SE-MBR, all [n, k, d]; non-achievability of interior points
Suh et al. 2009: show regeneration of non-systematic nodes for the code in (5) (IA)
Our group 2010: flexible class of regenerating codes
Cadambe et al. 2010: E-MSR [n, k, n−1], B → ∞ (IA)
Our group 2010: E-MSR [n, k, d ≥ 2k−2], E-MBR [n, k, d] (product matrix)
Suh et al. 2010: E-MSR [n, k, d], B → ∞ (IA)

Our Contributions
Constructions:
  E-MBR code, d = n − 1
  AE-MSR code, d = k + 1
  SE-MSR code, d = n − 1: exact regeneration of systematic nodes
  Most recently, a unified product-matrix construction for MSR, MBR, all d (essentially)
Non-achievability of interior points under exact regeneration
Non-existence of an MSR code with d < 2k − 3 under exact regeneration when β = 1
The cut-set bound, existence of codes, and a preliminary construction for a more flexible regenerating-code setup

An Example Construction for the MBR d = n − 1 Point
n = 5, k = 3, d = 4
B = 9, α = 4, β = 1
[10, 9] MDS code with generator matrix [v1 v2 ··· v10]; the codeword is m^t [v1 v2 ··· v10], and each node stores the code symbols with the indices listed below.
Node 1: v1 v2 v3 v4
Node 2: v1 v5 v6 v7
Node 3: v2 v5 v8 v9
Node 4: v3 v6 v8 v10
Node 5: v4 v7 v9 v10
[Figure: fully connected graph on nodes 1 to 5, with its ten edges labelled v1, ..., v10; every pair of nodes shares exactly one symbol.]
K. V. Rashmi, N. B. Shah, P. V. Kumar, and K. Ramchandran, “Explicit Construction of Optimal Exact Regenerating Codes for Distributed Storage,” in Proc. Allerton Conference on Control, Computing and Communication, Urbana-Champaign, Sep. 2009.
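The n = 5, k = 3, d = 4 example can be tried out in a few lines. The following is only a minimal sketch under stated assumptions: the field size Q = 11 is a hypothetical choice, and a single-parity-check code stands in for the [10, 9] MDS code (it is MDS, but the slides do not say which MDS code is used). The function names are illustrative and do not come from the paper. Symbols are indexed by the edges of the complete graph on 5 nodes, in the same order as v1, ..., v10 in the table.

```python
from itertools import combinations

Q = 11  # assumed prime field size (hypothetical; any prime works for this sketch)
N, K = 5, 3
EDGES = list(combinations(range(N), 2))  # 10 edges of the complete graph, i.e. v1..v10

def mbr_params(n, k, beta=1):
    """(d, alpha, B) at the MBR point with d = n - 1."""
    d = n - 1
    return d, d * beta, (d * k - k * (k - 1) // 2) * beta

def encode(message):
    """Put a [10, 9] single-parity-check codeword on the edges; node i keeps its 4 edges."""
    assert len(message) == len(EDGES) - 1
    codeword = [m % Q for m in message] + [(-sum(message)) % Q]
    return {i: {e: s for e, s in zip(EDGES, codeword) if i in e} for i in range(N)}

def repair(nodes, failed):
    """Each surviving node passes the single symbol it shares with the failed node."""
    return {e: s for i, store in nodes.items() if i != failed
            for e, s in store.items() if failed in e}

def reconstruct(nodes, subset):
    """Recover the 9 message symbols from any K nodes."""
    known = {}
    for i in subset:
        known.update(nodes[i])
    for e in EDGES:
        if e not in known:  # exactly one edge is unseen when K = N - 2
            known[e] = (-sum(known.values())) % Q  # all ten symbols sum to zero
    return [known[e] for e in EDGES[:-1]]
```

Repair here is uncoded: the replacement node copies d = 4 symbols, one per helper, which equals what it stores (α = dβ). The parity-check shortcut in `reconstruct` relies on the code dimension being one less than its length, which holds for this example but not for general n and k.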

The General Case: MBR d = n − 1 Code
For the general case, the MDS code would have
\[ \text{length} = \binom{n}{2}, \qquad \text{dimension} = dk - \binom{k}{2} \]
[Figure: the fully connected graph on nodes 1 to 5 from the example, with edges labelled v1, ..., v10.]

MISER Code
Family of MSR codes for d = n − 1 ≥ 2k − 1 that perform
  reconstruction
  optimal exact repair of systematic nodes
Based on the concept of interference alignment
Suh and Ramchandran show that this code can perform optimal exact repair of parity nodes as well.
N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, “Explicit codes minimizing repair bandwidth for distributed storage,” in Proc. ITW, Cairo, Jan. 2010.
N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, “Interference Alignment in Regenerating Codes for Distributed Storage: Necessity and Code Constructions,” submitted to IEEE Trans. on Information Theory, available online at arXiv:1005.1634v2 [cs.IT].
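Interference-aligned repair can be checked by hand on the smallest case. The sketch below uses the n = 4, k = 2, d = 3 code over F_5 shown earlier in the talk (there with symbols A1, A2, B1, B2, here with z1, ..., z4). The node contents and the three symbols downloaded for node 1 follow the slides. The function names, and the analogous repair of node 2, are assumptions made for illustration rather than part of the original material.

```python
P = 5  # field of operation, F_5

def encode(z):
    """Return what nodes 1..4 store for the message z = (z1, z2, z3, z4)."""
    z1, z2, z3, z4 = z
    return {
        1: [z1 % P, z2 % P],
        2: [z3 % P, z4 % P],
        3: [(2*z1 + 2*z2 + z3) % P, (z2 + 2*z3 + 2*z4) % P],
        4: [(2*z1 + 4*z2 + 2*z3) % P, (z2 + 2*z3 + 4*z4) % P],
    }

INV2 = pow(2, -1, P)  # inverse of 2 in F_5 (three-argument pow needs Python 3.8+)

def repair_node1(stored):
    """Rebuild (z1, z2) from one symbol each of nodes 2, 3 and 4."""
    z3 = stored[2][0]                # the interference, sent as-is by node 2
    a = (stored[3][0] - z3) % P      # 2 z1 + 2 z2
    b = (stored[4][0] - 2 * z3) % P  # 2 z1 + 4 z2
    z2 = (b - a) * INV2 % P          # b - a = 2 z2
    z1 = (a - 2 * z2) * INV2 % P
    return [z1, z2]

def repair_node2(stored):
    """Analogous repair of (z3, z4); derived here, not taken from the slides."""
    z2 = stored[1][1]                # the interference, sent as-is by node 1
    a = (stored[3][1] - z2) % P      # 2 z3 + 2 z4
    b = (stored[4][1] - z2) % P      # 2 z3 + 4 z4
    z4 = (b - a) * INV2 % P
    z3 = (a - 2 * z4) * INV2 % P
    return [z3, z4]
```

In both cases the replacement node reads three symbols of F_5 in total, against the four it would need to rebuild the whole file first. The unwanted symbol appears in the parity downloads only as a multiple of the single symbol sent by the other systematic node, which is what lets it be cancelled.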

MISER Code: Toy Example
n = 4, k = 2, d = 3
α = d − k + 1 = 2, B = kα = 4
Field of operation: F_5
Node 1: z1, z2
Node 2: z3, z4
Node 3: 2z1 + 2z2 + z3, z2 + 2z3 + 2z4
Node 4: 2z1 + 4z2 + 2z3, z2 + 2z3 + 4z4
Repair of node 1: the new node 1 downloads
  z3 from node 2
  2z1 + 2z2 + z3 from node 3
  2z1 + 4z2 + 2z3 from node 4
The interference is aligned: the only unwanted symbol, z3, occupies a single dimension of the downloaded data.

Product-Matrix Framework
\[ \underbrace{C}_{n \times \alpha} \;=\; \underbrace{\Psi}_{n \times d}\ \underbrace{M}_{d \times \alpha} \]
M: message matrix
  Contains the message symbols, with some message symbols repeated
  Possesses a block-symmetry property
Ψ: encoding matrix
  Used to disperse information across the nodes
  Independent of the message symbols
C: code matrix
  Each row represents one node
  The i-th node stores \psi_i^t M, where \psi_i^t is the i-th row of Ψ
K. V. Rashmi, N. B. Shah, and P. V. Kumar, “Optimal Exact-Regenerating Codes for the MSR and MBR Points via a Product-Matrix Construction,” submitted to IEEE Transactions on Information Theory, available online at arXiv:1005.4178 [cs.IT].
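To make C = ΨM concrete, here is a rough sketch of one instance of the framework: the MBR-style symmetric message matrix and the repair procedure that the following slides describe. The prime P = 7, the Vandermonde evaluation points 1, ..., n, and all function names are assumptions chosen for illustration; the talk only requires that any d rows of Ψ be linearly independent.

```python
P = 7  # assumed prime field size, larger than n (hypothetical choice)

def vandermonde(n, d):
    """n x d encoding matrix Psi; row i is (1, x, x^2, ...) with x = i + 1."""
    return [[pow(x, j, P) for j in range(d)] for x in range(1, n + 1)]

def message_matrix(symbols, k, d):
    """Fill the symmetric d x d matrix M = [[S, T], [T^t, 0]] from B message symbols."""
    it = iter(symbols)
    M = [[0] * d for _ in range(d)]
    for i in range(k):
        for j in range(i, d):  # upper triangle of S, then row i of T
            M[i][j] = M[j][i] = next(it) % P
    return M

def matmul(A, B):
    """Matrix product over F_P."""
    return [[sum(a * b for a, b in zip(row, col)) % P for col in zip(*B)] for row in A]

def solve(A, b):
    """Solve A x = b over F_P by Gauss-Jordan elimination (A square and invertible)."""
    n = len(A)
    aug = [[v % P for v in row] + [bi % P] for row, bi in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c])
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], -1, P)
        aug[c] = [v * inv % P for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(vr - f * vc) % P for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def regenerate(C, Psi, failed, helpers):
    """Helper i sends the scalar psi_i^t M psi_f; the new node solves for M psi_f."""
    psi_f = Psi[failed]
    received = [sum(c * x for c, x in zip(C[i], psi_f)) % P for i in helpers]
    return solve([Psi[i] for i in helpers], received)  # M psi_f = (psi_f^t M)^t
```

For example, with n = 6, k = 3, d = 4, `message_matrix(range(1, 10), 3, 4)` consumes B = 9 symbols, `C = matmul(vandermonde(6, 4), M)` gives one row of α = 4 symbols per node, and `regenerate(C, Psi, 0, [1, 2, 3, 4])` returns row `C[0]` after downloading a single symbol (β = 1) from each of the four helpers. The last step works because M is symmetric, so the solved vector Mψ_f is the transpose of the stored row ψ_f^t M.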

The Product-Matrix MBR (PM-MBR) Code
α = d
\[ B = kd - \binom{k}{2} \;\rightarrow\; B = \binom{k+1}{2} + k(d-k) \]
Let S be a (k × k) symmetric matrix with \binom{k+1}{2} distinct message symbols.
Let T be a (k × (d − k)) matrix with k(d − k) distinct message symbols.
Thus all message symbols are accounted for.

Product-Matrix MBR Code
Message matrix (d × d, symmetric):
\[ M = \begin{bmatrix} S & T \\ T^t & 0 \end{bmatrix}, \qquad S: k \times k, \quad T: k \times (d-k), \quad T^t: (d-k) \times k \]
Encoding matrix (n × d):
\[ \Psi = \begin{bmatrix} \Phi & \Delta \end{bmatrix}, \qquad \Phi: n \times k, \quad \Delta: n \times (d-k) \]
Φ: any k rows linearly independent
Ψ: any d rows linearly independent
e.g., a Cauchy or Vandermonde matrix

Product-Matrix MBR Code: Data Reconstruction
Node i passes: \psi_i^t M
Aggregator: collects \Psi_{DC} M, where \Psi_{DC} = [\Phi_{DC} \ \Delta_{DC}] is k × d
\[ \Psi_{DC} M = \begin{bmatrix} \Phi_{DC} S + \Delta_{DC} T^t & \Phi_{DC} T \end{bmatrix} \]
Decoder: \Phi_{DC} is k × k and invertible
  Decode T
  Subtract \Delta_{DC} T^t, then decode S

Product-Matrix MBR Code: Exact Regeneration
Replacement node f needs: \psi_f^t M
Helper node i, 1 ≤ i ≤ d, stores: \psi_i^t M
Helper node i passes: \psi_i^t M \psi_f
Aggregator: collects \Psi_{repair} M \psi_f (\Psi_{repair} is d × d, invertible)
Partial decoder: recovers M \psi_f
Re-encoder: since M is symmetric, (M \psi_f)^t = \psi_f^t M

The Product-Matrix MSR Code: Parameters
Here again β = 1
α = d − k + 1

The MSR Point: Numerology
d < 2k − 3 is not possible with β = 1
This code is designed for d ≥ 2k − 2
Choose d = 2k − 2 first, then extend to higher d
This gives k = α + 1, d = 2α, B = α(α + 1)
S1, S2: (α × α) symmetric matrices, each with
α(α+1)/2 distinct message symbols

The Product-Matrix MSR Code
Message matrix (d × α):
\[ M = \begin{bmatrix} S_1 \\ S_2 \end{bmatrix}, \qquad S_1, S_2: \alpha \times \alpha \]
Encoding matrix (n × d):
\[ \Psi = \begin{bmatrix} \Phi & \Lambda\Phi \end{bmatrix}, \qquad \Phi: n \times \alpha, \quad \Lambda\Phi: n \times \alpha \]
Φ: any α rows linearly independent
Λ: n × n diagonal matrix with distinct diagonal elements
Ψ: any d rows linearly independent
e.g., Vandermonde

The Product-Matrix MSR Code: Data Reconstruction
Node i passes: \psi_i^t M
Aggregator: collects \Psi_{DC} M, where \Psi_{DC} = [\Phi_{DC} \ \Lambda_{DC}\Phi_{DC}] is k × d
\[ \Psi_{DC} M = \Phi_{DC} S_1 + \Lambda_{DC} \Phi_{DC} S_2 \]
Decoder: multiply on the right by \Phi_{DC}^t
\[ \Phi_{DC} S_1 \Phi_{DC}^t + \Lambda_{DC} \Phi_{DC} S_2 \Phi_{DC}^t = P + \Lambda_{DC} Q \qquad (P \text{ and } Q \text{ symmetric}) \]
Entry (i, j): P_{ij} + \lambda_i Q_{ij}; entry (j, i): P_{ij} + \lambda_j Q_{ij} (solve for P and Q)
Recover S1 and S2

The Product-Matrix MSR Code: Exact Regeneration
Replacement node f needs: \psi_f^t M
Helper node i stores: \psi_i^t M
Helper node i passes: \psi_i^t M \phi_f
Aggregator: collects \Psi_{rep} M \phi_f (\Psi_{rep} is d × d, invertible)
Partial decoder: recovers
\[ M \phi_f = \begin{bmatrix} S_1 \phi_f \\ S_2 \phi_f \end{bmatrix} \]
Re-encoder:
\[ \phi_f^t S_1 + \lambda_f \phi_f^t S_2 = \psi_f^t M \]

References
A. G. Dimakis, P. B. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, “Network Coding for Distributed Storage Systems,” to appear in IEEE Transactions on Information Theory, available online at arXiv:0803.0632 [cs.IT].
Y. Wu and A. Dimakis, “Reducing Repair Traffic for Erasure Coding-Based Storage via Interference Alignment,” in Proc. IEEE International Symposium on Information Theory, Jul. 2009.
Y. Wu, “Existence and Construction of Capacity-Achieving Network Codes for Distributed Storage,” in Proc. IEEE International Symposium on Information Theory, Seoul, Jul. 2009.
K. V. Rashmi, N. B. Shah, P. V. Kumar, and K.
Ramchandran, “Explicit Construction of Optimal Exact Regenerating Codes for Distributed Storage,” in Proc. Allerton Conference on Control, Computing and Communication, Urbana-Champaign, Sep. 2009.
D. Cullina, A. G. Dimakis, and T. Ho, “Searching for Minimum Storage Regenerating Codes,” in Proc. Allerton Conference on Control, Computing and Communication, Urbana-Champaign, Sep. 2009.
N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, “Explicit Codes Minimizing Repair Bandwidth for Distributed Storage,” in Proc. IEEE Information Theory Workshop, Cairo, Jan. 2010.
Y. Wu, “A Construction of Systematic MDS Codes with Minimum Repair Bandwidth,” submitted to IEEE Transactions on Information Theory, available online at arXiv:0910.2486v1 [cs.IT].
K. V. Rashmi, N. B. Shah, and P. V. Kumar, “Optimal Exact-Regenerating Codes for the MSR and MBR Points via a Product-Matrix Construction,” submitted to IEEE Transactions on Information Theory, available online at arXiv:1005.4178 [cs.IT].
N. B. Shah, K. V. Rashmi, and P. V. Kumar, “A Flexible Class of Regenerating Codes for Distributed Storage,” in Proc. IEEE International Symposium on Information Theory, Austin, Jun. 2010.
K. V. Rashmi, N. B. Shah, P. V. Kumar, and K. Ramchandran, “Explicit and Optimal Exact-Regenerating Codes for the Minimum-Bandwidth Point in Distributed Storage,” in Proc. IEEE International Symposium on Information Theory, Austin, Jun. 2010.
C. Suh and K. Ramchandran, “Interference Alignment Based Exact Regeneration Codes for Distributed Storage,” in Proc. IEEE International Symposium on Information Theory, Austin, Jun. 2010.
S. Pawar, S. El Rouayheb, and K. Ramchandran, “On Secure Distributed Data Storage Under Repair Dynamics,” in Proc. IEEE International Symposium on Information Theory, Austin, Jun. 2010.
V. R. Cadambe, S. A. Jafar, and H.
Maleki, “Distributed Data Storage with Minimum Storage Regenerating Codes - Exact and Functional Repair are Asymptotically Equally Efficient,” available online at arXiv:1004.4299v1 [cs.IT].
C. Suh and K. Ramchandran, “On the Existence of Optimal Exact-Repair MDS Codes for Distributed Storage,” available online at arXiv:1004.4663v1 [cs.IT].
D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas with Application to Linear Systems Theory, Princeton University Press, Princeton, NJ, p. 119, 2005.
S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubiatowicz, “Pond: the OceanStore prototype,” in Proc. USENIX File and Storage Technologies (FAST), 2003.
R. Bhagwan, K. Tati, Y.-C. Cheng, S. Savage, and G. M. Voelker, “Total recall: System support for automated availability management,” in NSDI, 2004.
R. Koetter and M. Medard, “An algebraic approach to network coding,” IEEE/ACM Transactions on Networking, vol. 11, no. 5, pp. 782-795, Oct. 2003.

Thanks!