Coding for Modern Distributed Storage Systems, Part 2: Locally Repairable Codes
Parikshit Gopalan, Windows Azure Storage, Microsoft.

Rate-distance-locality tradeoffs

Def: An [n, k, d]_q linear code has locality r if each coordinate can be expressed as a linear combination of r other coordinates.

What are the tradeoffs between n, k, d, r?

[G.-Huang-Simitci-Yekhanin'12]: Any linear code with information locality r satisfies
n ≥ k + ⌈k/r⌉ + d − 2.
• Algorithmic proof using linear algebra.
• [Papailiopoulos-Dimakis'12] Replace rank with entropy.
• [Prakash-Lalitha-Kamath-Kumar'12] Generalized Hamming weights.
• [Barg-Tamo'13] Graph-theoretic proof.

Generalizations
• Non-linear codes [Papailiopoulos-Dimakis, Forbes-Yekhanin].
• Vector codes [Papailiopoulos-Dimakis, Silberstein-Rawat-Koyluoglu-Vishwanath, Kamath-Prakash-Lalitha-Kumar].
• Codes over bounded alphabets [Cadambe-Mazumdar].
• Codes with short local MDS codes [Prakash-Lalitha-Kamath-Kumar, Silberstein-Rawat-Koyluoglu-Vishwanath].

Explicit codes with all-symbol locality

[Tamo-Papailiopoulos-Dimakis'13]
• Optimal-length codes with all-symbol locality for q = exp(n).
• Construction based on RS codes, analysis via matroid theory.

[Silberstein-Rawat-Koyluoglu-Vishwanath'13]
• Optimal-length codes with all-symbol locality for q = 2^n.
• Construction based on Gabidulin codes (a.k.a. linearized RS codes).

[Barg-Tamo'14]
• Optimal-length codes with all-symbol locality for q = O(n).
• Construction based on Reed-Solomon codes.

Stronger notions of locality
• Codes with local regeneration [Silberstein-Rawat-Koyluoglu-Vishwanath, Kamath-Prakash-Lalitha-Kumar, …].
• Codes with short local MDS codes [Prakash-Lalitha-Kamath-Kumar, Silberstein-Rawat-Koyluoglu-Vishwanath]. Avoids the slowest-node bottleneck [Shah-Lee-Ramchandran].
• Sequential local recovery [Prakash-Lalitha-Kumar].
• Multiple disjoint local parities [Wang-Zhang, Barg-Tamo]. Can serve multiple read requests in parallel.
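The rate-distance-locality bound above can be checked numerically. A minimal sketch in Python; the [16, 12, 4] code with locality r = 6 is the LRC reported for Windows Azure Storage, used here as an assumed example:

```python
from math import ceil

def min_length(k, d, r):
    """Smallest n allowed by the [G.-Huang-Simitci-Yekhanin'12] bound
    for an [n, k, d] linear code with information locality r:
    n >= k + ceil(k/r) + d - 2."""
    return k + ceil(k / r) + d - 2

# Assumed example: a [16, 12, 4] Azure-style LRC with locality r = 6
# (two local groups of six data symbols, each with a local parity,
# plus two global parities).
n, k, d, r = 16, 12, 4, 6
print(min_length(k, d, r))  # 16: this code meets the bound with equality
```

Meeting the bound with equality is what "optimal-length" means in the constructions listed above.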
Problem: Consider an [n, k]_q linear code in which, even after t arbitrary failures, every (information) symbol still has locality r. How large does n need to be? [Barg-Tamo'14] might be a good starting point.

Tutorial on LRCs
Part 1.1: Locality
1. Locality of codeword symbols.
2. Rate-distance-locality tradeoffs: lower bounds and constructions.
Part 1.2: Reliability
1. Beyond minimum distance: maximum recoverability.
2. Constructions of maximally recoverable LRCs.

Beyond minimum distance?
Is minimum distance the right measure of reliability? Two types of failures:
• Large correlated failures (power outage, upgrade): a whole data center goes offline.
• Further failures can be assumed independent.

Beyond minimum distance?
4 racks, 6 machines per rack.
• Machines fail independently with probability p.
• Racks fail independently with probability q ≈ p³.
• Some 7-failure patterns are more likely than some 5-failure patterns.

Beyond minimum distance
4 racks, 6 machines per rack. Want to tolerate 1 rack failure + 3 additional machine failures.

Beyond minimum distance
• Want to tolerate 1 rack + 3 more failures (9 total).
Solution 1: Use a [24, 15, 10] Reed-Solomon code. Corrects any 9 failures, but has poor locality after even a single failure.

Beyond minimum distance
• Want to tolerate 1 rack + 3 more failures (9 total).
[Plank-Blaum-Hafner'13]: Sector-Disk (SD) codes, also known as Partial MDS codes.
Solution 2: Use [24, 15, 6] LRCs derived from Gabidulin codes. A rack failure leaves an [18, 15, 4] MDS code: a stronger guarantee than minimum distance alone provides.

Maximally Recoverable Codes [Chen-Huang-Li'07, G.-Huang-Jenkins-Yekhanin'14]
The code has a topology that dictates the linear relations between symbols (locality). Any erasure pattern with sufficiently many (independent) constraints is correctible.
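The failure-model arithmetic behind the rack example above can be made concrete; a minimal sketch (the numeric value of p is an assumption for illustration):

```python
# Machines fail independently with probability p; a whole rack of 6
# machines fails with probability roughly p**3, as in the model above.
p = 1e-3
rack = p**3

# A correlated 7-failure pattern: one whole rack (6 machines) plus one
# more machine elsewhere -- probability ~ p**3 * p = p**4.
seven_correlated = rack * p

# Five independent machine failures -- probability ~ p**5.
five_independent = p**5

# Hence some 7-failure patterns are more likely than 5-failure
# patterns: minimum distance, which only counts failures, misranks
# these events.
assert seven_correlated > five_independent
```

This is why the slides argue for reliability measures beyond minimum distance: the code should be tuned to the likely (correlated) patterns, not just the worst-case count.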
[G.-Huang-Jenkins-Yekhanin'14]: Let α_1, …, α_t be variables.
1. The topology is given by a parity check matrix in which each entry is a linear function in the α_i's.
2. A code is specified by a choice of the α_i's.
3. The code is Maximally Recoverable (MR) if it corrects every erasure pattern that its topology permits:
• the relevant determinant is not identically zero;
• equivalently, there is some choice of the α's that corrects it.

Example 1: MDS codes
h global equations: α_{i,1} c_1 + ⋯ + α_{i,n} c_n = 0 for i = 1, …, h.
Reed-Solomon codes are Maximally Recoverable.

Example 2: LRCs (PMDS codes)
Assume r | k and (r + 1) | n; arrange the n symbols into columns of height r + 1. Want length-n codes satisfying:
1. Local constraints: the parity of each column is 0.
2. h global constraints: linear constraints over all symbols.
The code is MR if puncturing one entry per column gives a [k + h, k]_q MDS code.
The code is SD if puncturing any row gives a [k + h, k]_q MDS code.
Known constructions require fairly large field sizes.

Example 3: Tensor codes
Arrange the n symbols into an array. Want codes satisfying:
1. Column constraints: the parity of each column is 0.
2. h constraints per row: linear constraints over the symbols in each row.
Problem: When is an erasure pattern correctible? The tensor of a Reed-Solomon code with a parity code is not necessarily MR.

Maximally Recoverable Codes [Chen-Huang-Li'07, G.-Huang-Jenkins-Yekhanin'14]
Let α_1, …, α_t be variables.
1. Each entry in the parity check matrix is a linear function in the α_i's.
2. A code is specified by a choice of the α_i's.
3. The code is Maximally Recoverable if it corrects every erasure pattern its topology permits.
[G.-Huang-Jenkins-Yekhanin'14]: For any topology, random codes over sufficiently large fields are MR codes.
Do we need explicit constructions?
• Verifying that a given construction is good might be hard.
• Large field size is undesirable.

How encoding works
Encoding a file using an [n, k]_q code C. Ideally field elements are byte vectors, so q = 2^{8ℓ}.
1. Break the file into k equal-sized parts.
2. Treat each part as a long stream over F_q.
3.
Encode each row (of k elements) using C, creating n − k more streams.
4. Distribute the streams to the right nodes.
[Figure: the k data streams and n − k parity streams laid out across storage nodes.]

Step 3 requires finite-field arithmetic over F_q.
• Log tables can be used up to q = 2^24 (a few GB).
• Speed-ups via specialized CPU instructions.
• Beyond that, matrix-vector multiplication (dimension = bit-length of the field).
Field size matters even at encoding time.

How decoding works
Decoding from erasures = solving a linear system of equations.
• Whether an erasure pattern is correctible can be deduced from the generator matrix.
• If it is correctible, each missing stream is a linear combination of the available streams.
Random codes are as "good" as explicit codes for a given field size.

Maximally Recoverable Codes [Chen-Huang-Li'07, G.-Huang-Jenkins-Yekhanin'14]
Thm: For any topology, random codes over sufficiently large fields are MR codes.
• Large field size is undesirable.
• Is there a better analysis of the random construction?
[Kopparty-Meka'13]: Random [k + d − 1, k]_q codes are MDS with probability only exp(−k) when q ≤ k^{d−1}.
Random codes are MR with constant probability for q = O(d · k^d).
Could explicit constructions require smaller field size?

Maximally Recoverable LRCs
1. Local constraints: the parity of each column is 0.
2. h global constraints.
The code is MR if puncturing one entry per column gives a [k + h, k]_q MDS code.
1. Random codes give MR LRCs, and SD codes, over fields of size n^{O(h)}.
2. [Silberstein-Rawat-Koyluoglu-Vishwanath'13]: Explicit MR LRCs with q = 2^n.
[G.-Huang-Jenkins-Yekhanin]
• Basic construction: gives q = O(n^h).
• Product construction: gives q = O(n^{(1−ε)h}) for suitable h, ε.

Open problems:
• Are there MR LRCs over fields of size O(n)?
• When is a tensor code MR? Explicit constructions?
• Are there natural topologies for which MR codes exist only over exponentially large fields? Over super-linearly sized fields?

Thank you
• The Simons Institute, David Tse, Venkat Guruswami.
• Azure Storage + MSR: Brad Calder, Cheng Huang, Aaron Ogus, Huseyin Simitci, Sergey Yekhanin.
• My former colleagues at MSR Silicon Valley.
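Backup note on the decoding discussion above ("decoding from erasures = solving a linear system of equations"): a minimal runnable sketch over the small prime field F_7. The [6, 3] Reed-Solomon parameters and evaluation points are illustrative assumptions, not the production codes; real systems work over F_{2^8} or larger.

```python
# A [6, 3] Reed-Solomon code over F_7: codeword symbol j is m(x_j) for a
# degree-<3 message polynomial m, evaluated at the point x_j.
p = 7                       # field size (prime, so arithmetic is mod p)
n, k = 6, 3
points = [1, 2, 3, 4, 5, 6]

def encode(msg):
    """Evaluate the message polynomial at each point (mod p)."""
    return [sum(m * pow(x, i, p) for i, m in enumerate(msg)) % p
            for x in points]

def decode_from_erasures(codeword, erased):
    """Recover the message from any k surviving symbols by solving a
    k x k Vandermonde system mod p via Gaussian elimination."""
    alive = [j for j in range(n) if j not in erased][:k]
    # Augmented matrix [V | y] for the surviving positions.
    A = [[pow(points[j], i, p) for i in range(k)] + [codeword[j]]
         for j in alive]
    for col in range(k):                      # Gauss-Jordan elimination
        piv = next(r for r in range(col, k) if A[r][col] % p != 0)
        A[col], A[piv] = A[piv], A[col]
        inv = pow(A[col][col], p - 2, p)      # inverse via Fermat
        A[col] = [a * inv % p for a in A[col]]
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * b) % p for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

msg = [2, 5, 1]
cw = encode(msg)
# Any 3 erasures are correctible, since the surviving columns of the
# Vandermonde generator matrix always have full rank.
assert decode_from_erasures(cw, erased={0, 3, 5}) == msg
```

As the slides note, each missing stream is a linear combination of the available ones; here that combination falls out of the eliminated system.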