Erasure Coding for Storage Applications (Part II)
Cheng Huang, Microsoft Research
Tutorial at USENIX FAST 2013

Erasure Coding in Cloud Storage

Performance, Storage Cost, Reliability: good perf, minimize cost

Replication vs. Reed-Solomon coding (example: a=2, b=3)
• replication stores a, a, b, b
• Reed-Solomon coding stores a, b and a⊕b; a lost value is reconstructed from the other two

                    replication    Reed-Solomon coding
storage             2x             1.5x
reconstruction      1              2

Reed-Solomon coding: reconstruction
• permanent failure
• temporary unavailability (90+%)
• hot storage nodes
• rolling update
• reconstruction is on the critical path and frequent enough

High reconstruction cost: inevitable price for erasure coding

[Figure: reconstruction cost vs. storage overhead for Reed-Solomon codes and replication]

Pyramid Codes

Reed-Solomon 12 + 3
• 12 data nodes: d1 ... d12
• 3 parity nodes: C1, C2, C3
• reconstruction cost: 12
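The replication vs. coding trade-off in the a=2, b=3 example can be sketched in a few lines. This is a minimal illustration only: a single XOR parity stands in for Reed-Solomon coding (an assumption that is adequate for a 2 + 1 code), and the helper names `replicate` and `encode_parity` are hypothetical.

```python
def replicate(values):
    """2x replication: store every value twice."""
    return [v for v in values for _ in range(2)]


def encode_parity(a, b):
    """2 + 1 coding: store a, b and one XOR parity (stand-in for RS)."""
    return [a, b, a ^ b]


a, b = 2, 3
replicas = replicate([a, b])      # [2, 2, 3, 3]
coded = encode_parity(a, b)       # [2, 3, 2 ^ 3]

# Storage overhead = stored symbols / original symbols.
overhead_replication = len(replicas) / 2
overhead_coded = len(coded) / 2

# Suppose a is unavailable. Replication reads the one surviving copy;
# the coded layout must read both b and the parity.
reads_replication = 1
rebuilt_a = coded[1] ^ coded[2]
reads_coded = 2

print(overhead_replication, overhead_coded, rebuilt_a)
```

The coded layout saves a quarter of the storage but doubles the reads needed to serve an unavailable value, which is the cost the rest of the tutorial tries to reduce.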
Pyramid Codes construction:
• take an arbitrary Reed-Solomon (RS) code
• split one RS parity into multiple local parities
• 12 + 3 RS → 12 + 4 Pyramid (C1 is split into C1,1 and C1,2)

12 + 4 Pyramid code (reconstruction cost: 6)
  d1  d2  d3  d4  d5  d6     C1,1
  d7  d8  d9  d10 d11 d12    C1,2
  C2  C3

CASE I:
• recover d5 from C1,1
• recover d8 and d12 from C2 and C3

CASE II:
• combine C1,1 and C1,2 → C1
• convert the 12 + 4 Pyramid code back to the 12 + 3 RS code
• recover the 3 failures (d8, d11 and d12) in the RS code

Multi-level Pyramid (data d1 ... d12, parities C1, C2, C3)
• reconstruction cost of d1: 3
• reconstruction cost of d1 and d2: 6
• decoding is analogous to climbing up the Pyramid

[Figure: reconstruction cost vs. storage overhead for Reed-Solomon codes, Pyramid Codes and replication]

Maximal Recoverability

  d1  d2  d3  d4  d5  d6     C1,1
  d7  d8  d9  d10 d11 d12    C1,2
  C2  C3

Recoverability Theorem: a failure pattern is recoverable when its decoding Tanner graph contains a full matching.

Decoding Tanner graph
• left: failed data nodes
• right: surviving parity nodes
• example: failed d5, d6, d8, d12 against C1,1, C1,2, C2, C3: the decoding Tanner graph contains a full matching
• example: failed d1, d2, d5, d6: the decoding Tanner graph contains no full matching

Pyramid Codes
• first class of MR codes
• MR codes in cloud deployment (Windows Azure Storage)

LRC in Windows Azure Storage

Reed-Solomon 6 + 3 on a sealed extent (3 GB): d0 ... d5, p1, p2, p3
• storage overhead: 3x → 1.5x
• reconstruction cost: 6
• used in Google GFS II (as of 2012)

Overhead on a sealed extent (3 GB)
• d0 ... d5 with p0, p1, p2: (6 + 3) / 6 = 1.5x
• d0 ... d11 with p0, p1, p2, p3: (12 + 4) / 12 = 1.33x

With 12 + 4, reconstruction is twice as expensive, requiring 12 fragments (12 disk I/Os, 12 net transfers).

Conventional Reed-Solomon coding
               storage overhead    reconstruction cost
RS 6 + 3       1.5x                6 reads
RS 12 + 4      1.33x               12 reads

LRC on a sealed extent (3 GB): x0 ... x5, y0 ... y5
• LRC12+2+2: 12 data fragments, 2 local parities and 2 global parities
• storage overhead: (12 + 2 + 2) / 12 = 1.33x
• local parity: reconstruction requires only 6 fragments
• reliability: RS12+4 > LRC12+2+2 > RS6+3

[Figure: reconstruction read cost vs. storage overhead for Reed-Solomon (RS6+3, RS10+4, RS12+4) and LRC (12+2+2, 12+4+2)]
• LRC (12+2+2) vs. RS6+3: same cost, overhead 1.5x → 1.33x
• LRC (12+4+2) vs. RS6+3: same overhead, half cost (6 → 3)
• RS10+4: HDFS-RAID at Facebook
• RS6+3: GFS II (Colossus) at Google
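The overhead and read-cost figures for the codes compared above follow from simple arithmetic, sketched here. The function names `rs` and `lrc` are hypothetical, and the LRC read cost assumes equal-sized local groups and a single failed data fragment rebuilt from its local group.

```python
from fractions import Fraction


def rs(k, m):
    """(storage overhead, single-failure reads) for a k + m Reed-Solomon code."""
    return Fraction(k + m, k), k


def lrc(k, l, g):
    """(storage overhead, single-failure reads) for an LRC with k data
    fragments, l local parities (one per equal-sized group) and g global
    parities. A lost data fragment is rebuilt from the k // l surviving
    members of its local group, counting the local parity."""
    return Fraction(k + l + g, k), k // l


codes = {
    "RS 6+3": rs(6, 3),
    "RS 10+4": rs(10, 4),
    "RS 12+4": rs(12, 4),
    "LRC 12+2+2": lrc(12, 2, 2),
    "LRC 12+4+2": lrc(12, 4, 2),
}

for name, (overhead, reads) in codes.items():
    print(f"{name:<11} overhead {float(overhead):.2f}x  reads {reads}")
```

LRC 12+2+2 lands at the overhead of RS 12+4 with the read cost of RS 6+3, and LRC 12+4+2 keeps the 1.5x overhead of RS 6+3 while halving the reads.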
Reconstruction cost at scale
• RS (6 + 3): reconstruction cost = 6
• RS (14 + 4): reconstruction cost = 14
• LRC (14 + 2 + 2): reconstruction cost = 7
• 14% savings: millions of $ savings!

PMDS and SD Codes

• What is the most storage efficient solution?

PMDS Codes
• m rows, n columns (n drives, m x n sectors)
• r row parities in each row
• s global parities
• tolerate r failures per row and s additional failures anywhere

Layout for n=7, m=4, r=1, s=2 (p = row parity, q = global parity):
  d0   d1   d2   d3   d4   d5   p0
  d6   d7   d8   d9   d10  d11  p1
  d12  d13  d14  d15  d16  d17  p2
  d18  d19  d20  d21  q0   q1   p3

Recoverable case I
• r = 1 drive (column) failure
• s = 2 additional sector failures anywhere

Recoverable case II
• r = 1 failure per row
• s = 2 additional failures anywhere
• d11 and d19 are recoverable from their row parities
• 4 parities for the remaining 4 failures, similar to LRC

PMDS codes are Maximally Recoverable (MR) codes.

• What if restricting to only case I?
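Before narrowing to case I, the PMDS tolerance rule stated above (r failures per row plus s additional failures anywhere) can be sketched as a counting check. The function name `pmds_tolerates` and the concrete failure patterns are hypothetical; only d11 and d19 in the second pattern are taken from the slides.

```python
from collections import Counter


def pmds_tolerates(failed, r, s):
    """Return True when a set of failed (row, col) sectors fits the PMDS
    guarantee: after forgiving up to r failures in every row, at most s
    failures remain in total."""
    per_row = Counter(row for row, _ in set(failed))
    extra = sum(max(0, count - r) for count in per_row.values())
    return extra <= s


M, N, R, S = 4, 7, 1, 2  # layout from the slides: m=4, n=7, r=1, s=2

# Case I style: drive 2 fails completely, plus two more sectors (hypothetical).
case_one = [(row, 2) for row in range(M)] + [(0, 5), (3, 0)]

# Case II style: d11 = (1, 5) and d19 = (3, 1) are lone failures in their rows;
# the other four failed sectors are hypothetical.
case_two = [(1, 5), (3, 1), (0, 0), (0, 3), (2, 2), (2, 4)]

# One failure too many: a third "extra" sector on top of case I.
too_many = case_one + [(1, 6)]

print(pmds_tolerates(case_one, R, S))   # True
print(pmds_tolerates(case_two, R, S))   # True
print(pmds_tolerates(too_many, R, S))   # False
```

The check only counts failures; it does not decode anything. It relies on the maximal recoverability property stated above, under which every pattern that fits the count is decodable.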
SD Codes
• m rows, n columns (n drives, m x n sectors)
• r row parities in each row
• s global parities
• tolerate r column failures and s additional failures anywhere

Layout for n=7, m=4, r=1, s=2 (p = row parity, q = global parity):
  d0   d1   d2   d3   d4   d5   p0
  d6   d7   d8   d9   d10  d11  p1
  d12  d13  d14  d15  d16  d17  p2
  d18  d19  d20  d21  q0   q1   p3

SD codes handle case I, but not case II.
There are many constructions which are valid as SD codes, but not PMDS codes.

Efficient Repair of MDS Codes

Four nodes, two units each (one column per node):
  a1    b1    a1⊕b1    a1⊕b2
  a2    b2    a2⊕b2    a2⊕b1⊕b2

Efficient Repair of Existing Codes
• ~20+% savings in general

Theoretical Bound on Efficient Repair
• β_min = 1: one unit of information from each of the 3 surviving nodes
• Efficient repair: 1.83x (69% savings!)
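The 1.83x and 69% figures are consistent with the standard minimum-storage regenerating bound, under which each of d helpers sends 1/(d - k + 1) of a fragment. The sketch below assumes that bound is the source of the slide's numbers; the function names are hypothetical.

```python
from fractions import Fraction


def rs_repair_transfers(k):
    """Fragments downloaded to rebuild one fragment with plain RS decoding."""
    return Fraction(k)


def msr_repair_transfers(k, d):
    """Fragment-sized units downloaded when d helpers each send
    1 / (d - k + 1) of a fragment (minimum-storage regenerating bound)."""
    if d < k:
        raise ValueError("need at least k helpers")
    return Fraction(d, d - k + 1)


# 6 + 6 MDS code: n = 12 nodes, k = 6, all 11 survivors help.
k, n = 6, 12
d = n - 1
rs_cost = rs_repair_transfers(k)
rc_cost = msr_repair_transfers(k, d)
savings = 1 - rc_cost / rs_cost

print(f"RS: {float(rs_cost):.2f}x  regenerating: {float(rc_cost):.2f}x  "
      f"savings: {float(savings):.0%}")

# Four-node example: k = 2, d = 3 helpers, fragments of 2 units each.
# 3/2 of a fragment = 3 units, i.e. one unit from each surviving node.
units_per_fragment = 2
small_example_units = msr_repair_transfers(2, 3) * units_per_fragment
```

With k = 6 and d = 11 the ratio is 11/6, and the saving relative to reading 6 full fragments is 25/36.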
Single Failure Repair of 6 + 6 MDS Code
                                        Reed-Solomon Coding    Regenerating Coding
# of nodes participating in repair      6                      11
# of network transfers                  6x                     1.83x
# of disk I/Os                          6x                     up to 11x

Network transfer vs. disk I/O
  a1    b1    a1⊕b1    a1⊕b2
  a2    b2    a2⊕b2    a2⊕b1⊕b2
• network transfer: 3 (optimal), disk I/O: 4 (no saving)
• XOR before transmitting: a1, b2 and a1⊕b1⊕a2⊕b2 are sent to rebuild a1⊕b2 and a2⊕b1⊕b2

Regenerating Codes may require more disk I/Os than network transfers.
Unfortunately, most RC papers do not discuss the difference!

[Table: single failure repair by storage overhead (≥ 2x, ≥ 1.5x, < 1.5x) for data nodes and parity nodes; cell values: network optimal with disk I/O = network; network optimal with disk I/O > network; unknown]

[Figure labels: parity nodes; parity nodes; all nodes (data and parity)]

Simple Regenerating Codes

(n=6, k=4, f=2)-SRC
• MDS precode: (6,4)-RS, (6,4)-RS
• placement: node1, node2, node3, node4, node5, node6
• tolerates two arbitrary failures
• any chunk is recoverable with 2 I/Os
• overhead: 3/2 * 6/4 = 2.25x

Repair
• single failure recovered efficiently
• 2 I/Os for each chunk
• 6 I/Os in total for all three chunks
• disk I/O = network I/O in repair

What's the difference from Weaver codes?
• overhead of Weaver codes is always ≥ 2x
• overhead of SRC can be smaller than 2x

Rotated Reed-Solomon Codes
• ~23% savings for MDS 6 + 3 codes
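As a closing check on the four-node XOR layout from the regenerating-codes slides, the sketch below rebuilds the fourth node from one transmitted unit per survivor and counts disk reads separately from network transfers. Reading the slide as a repair of the fourth node is an interpretation; the node numbering, the unit size and the helper `xor` are assumptions made for illustration.

```python
def xor(*blocks):
    """Bytewise XOR of equally sized byte strings."""
    out = bytes(len(blocks[0]))
    for blk in blocks:
        out = bytes(x ^ y for x, y in zip(out, blk))
    return out


# Four data units (arbitrary fixed contents, 8 bytes each).
a1, a2 = b"A1A1A1A1", b"a2a2a2a2"
b1, b2 = b"B1B1B1B1", b"b2b2b2b2"

# Layout from the slides: one column per node, two units per node.
nodes = {
    1: (a1, a2),
    2: (b1, b2),
    3: (xor(a1, b1), xor(a2, b2)),
    4: (xor(a1, b2), xor(a2, b1, b2)),
}

# Node 4 is lost. Each survivor transmits exactly one unit.
sent_by_1 = nodes[1][0]          # a1: 1 disk read
sent_by_2 = nodes[2][1]          # b2: 1 disk read
sent_by_3 = xor(*nodes[3])       # a1^b1^a2^b2: 2 disk reads, XORed before sending

disk_reads = 1 + 1 + 2
network_transfers = 3

rebuilt = (
    xor(sent_by_1, sent_by_2),   # a1 ^ b2
    xor(sent_by_3, sent_by_1),   # a2 ^ b1 ^ b2
)

print(rebuilt == nodes[4], disk_reads, network_transfers)
```

Node 3 has to read both of its units to produce the single unit it sends, which is why the repair costs 4 disk reads for 3 network transfers.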