Erasure Coding for Storage Applications (Part II) Erasure Coding in Cloud Storage

Tutorial on Erasure Coding for Storage Applications (Part II)
Erasure Coding for Storage
Applications (Part II)
Cheng Huang
Microsoft Research
Tutorial at USENIX FAST 2013
Erasure Coding in Cloud Storage
Performance, Storage Cost, Reliability: good perf, minimize cost
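replication vs. Reed-Solomon coding (example: a=2, b=3)
• replication: stores a, a, b, b
• Reed-Solomon coding: stores a, b, a⊕b; reconstruction of a lost value uses the other value and a⊕b
• storage: 2x (replication) vs. 1.5x (Reed-Solomon)
• reconstruction: 1 read (replication) vs. 2 reads (Reed-Solomon)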
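The storage and reconstruction numbers for the a=2, b=3 example can be checked with a toy script. This is a minimal sketch, assuming the single parity is the bitwise XOR of the two values (the slide writes it as a⊕b); the variable names are illustrative only.

```python
# Toy 2 + 1 code: two data values and one parity.
a, b = 2, 3
parity = a ^ b  # assumption: the parity is the bitwise XOR a ⊕ b

# Replication keeps two copies of each value: 4 stored units for 2 units of data.
replication_overhead = 4 / 2
# The 2 + 1 code keeps a, b and the parity: 3 stored units for 2 units of data.
coding_overhead = 3 / 2

# Losing `a` under replication costs one read (the other copy of a).
replication_reads = 1
# Under the code it costs two reads: b and the parity.
recovered_a = b ^ parity
coding_reads = 2

print(replication_overhead, coding_overhead, recovered_a)
```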
Reed-Solomon coding
• permanent failure
• temporary unavailability (90+%): hot storage nodes, rolling update
• reconstruction on critical path and frequent enough
high reconstruction cost –
inevitable price for erasure coding
[Figure: reconstruction cost vs. storage overhead for Reed-Solomon codes and replication]
Pyramid Codes
Reed-Solomon 12 + 3: 12 data nodes (d1 … d12), 3 parity nodes (C1, C2, C3)
reconstruction cost: 12
Pyramid Codes Construction:
• take an arbitrary Reed-Solomon (RS) code
• split one RS parity into multiple local parities (C1 into C1,1 and C1,2)
• 12 + 3 RS → 12 + 4 Pyramid
12 + 4 Pyramid: d1 … d6 with local parity C1,1; d7 … d12 with local parity C1,2; plus C2, C3
reconstruction cost: 6
CASE I:
recover d5 from C1,1
recover d8 and d12 from C2 and C3
CASE II:
combine C1,1 and C1,2 → C1
convert 12 + 4 Pyramid code back to 12 + 3 RS code
recover the 3 failures (d8, d11 and d12) in the RS code
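The split-and-recombine idea behind the Pyramid construction can be sketched in a few lines. This is an illustration under a loud assumption: the parity being split is taken to be a plain XOR of the data blocks, whereas a general Reed-Solomon parity would use Galois-field coefficients. The helper `xor_blocks` and the example data are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda p, q: p ^ q, column) for column in zip(*blocks))

# d1 … d12 as small example blocks (index 0 holds d1).
data = [bytes((7 * i + j) % 256 for j in range(8)) for i in range(12)]

c1 = xor_blocks(data)         # one parity of the 12 + 3 code (assumed to be XOR)
c1_1 = xor_blocks(data[:6])   # local parity over d1 … d6
c1_2 = xor_blocks(data[6:])   # local parity over d7 … d12

# Combining the two local parities gives back the original parity,
# which is what lets a 12 + 4 Pyramid code fall back to the 12 + 3 code.
combined = xor_blocks([c1_1, c1_2])

# Local repair of d5 reads the 5 other members of its group plus C1,1: 6 blocks.
group_reads = data[:4] + data[5:6] + [c1_1]
d5 = xor_blocks(group_reads)
```

Because XOR is linear, the same split works for any partition of the data blocks into groups; the grouping 6 + 6 above simply mirrors the slides.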
[Figure: Pyramid code over d1 … d12 with parities C1, C2, C3]
reconstruction cost of d1: 3
reconstruction cost of d1 and d2: 6
decoding analogous to climbing up Pyramid
[Figure: reconstruction cost vs. storage overhead for Reed-Solomon codes, Pyramid Codes and replication]
Maximal Recoverability
[Figure: 12 + 4 Pyramid code: d1 … d6 with C1,1; d7 … d12 with C1,2; C2, C3]
Recoverability Theorem:
recoverable  full matching
Decoding Tanner graph
• Left: failed data nodes
• Right: surviving parity nodes
Example: d5, d6, d8, d12 (left); C1,1, C1,2, C2, C3 (right)
decoding Tanner graph contains full matching
Example: d1, d2, d5, d6 (left); C1,1, C1,2, C3, C4 (right, as labeled on the slide)
decoding Tanner graph contains no full matching
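The full-matching test on a decoding Tanner graph can be sketched with a small augmenting-path bipartite matcher. The coverage table below is an assumption drawn from the 12 + 4 Pyramid layout (C1,1 covers d1 … d6, C1,2 covers d7 … d12, C2 and C3 cover everything); the function name `has_full_matching` is hypothetical.

```python
# Assumed layout of the 12 + 4 Pyramid code: each parity lists the data
# nodes it covers (C1,1 and C1,2 are local, C2 and C3 cover all data).
GROUP1 = {f"d{i}" for i in range(1, 7)}
GROUP2 = {f"d{i}" for i in range(7, 13)}
COVERS = {
    "C1,1": GROUP1,
    "C1,2": GROUP2,
    "C2": GROUP1 | GROUP2,
    "C3": GROUP1 | GROUP2,
}

def has_full_matching(failed, surviving_parities):
    """True if every failed data node can be matched to a distinct parity."""
    match = {}  # parity -> failed data node currently matched to it

    def augment(node, seen):
        # Try to give `node` a parity, re-routing earlier matches if needed.
        for parity in surviving_parities:
            if node in COVERS[parity] and parity not in seen:
                seen.add(parity)
                if parity not in match or augment(match[parity], seen):
                    match[parity] = node
                    return True
        return False

    return all(augment(node, set()) for node in failed)

ALL_PARITIES = ["C1,1", "C1,2", "C2", "C3"]
spread = has_full_matching(["d5", "d6", "d8", "d12"], ALL_PARITIES)
clustered = has_full_matching(["d1", "d2", "d5", "d6"], ALL_PARITIES)
print(spread, clustered)
```

In this sketch the pattern spread over both groups has a full matching, while four failures inside one group do not, since only three parities cover that group.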
• First class of MR codes
• MR codes in cloud deployment (Windows Azure Storage)
LRC in Windows Azure Storage
Reed-Solomon 6 + 3
sealed extent (3 GB): data fragments d0 … d5, parities p1, p2, p3
• storage overhead 3x → 1.5x
• reconstruction cost 6
• used in Google GFS II (as of 2012)
sealed extent (3 GB): overhead
• Reed-Solomon 6 + 3 (d0 … d5; p0, p1, p2): (6+3)/6 = 1.5x
• Reed-Solomon 12 + 4 (d0 … d11; p0, p1, p2, p3): (12+4)/12 = 1.33x
Reed-Solomon 12 + 4 (d0 … d11; p0, p1, p2, p3)
• reconstruction twice as expensive
• requiring 12 fragments (12 disk I/Os, 12 net transfers)
Conventional Reed-Solomon Coding (sealed extent, 3 GB)

                      Storage Overhead   Reconstruction Cost
Reed-Solomon 6 + 3    1.5x               6 reads
Reed-Solomon 12 + 4   1.33x              12 reads

LRC
sealed extent (3 GB): data fragments x0 … x5 and y0 … y5
• LRC12+2+2: 12 data fragments, 2 local parities and 2 global parities
• storage overhead: (12 + 2 + 2) / 12 = 1.33x
• Local parity: reconstruction requires only 6 fragments
• reliability: RS12+4 > LRC12+2+2 > RS6+3
[Figure: Reconstruction Read Cost vs. Storage Overhead, with a Reed-Solomon curve (points RS12+4, RS10+4, RS6+3) and an LRC curve (points LRC (12+2+2), LRC (12+4+2))]
• LRC (12+2+2) vs. RS6+3: same cost, overhead 1.5x → 1.33x
• LRC (12+4+2) vs. RS6+3: same overhead, half cost (6 → 3)
• RS10+4: HDFS-RAID at Facebook
• RS6+3: GFS II (Colossus) at Google
• RS (6 + 3): reconstruction cost = 6
• RS (14 + 4): reconstruction cost = 14
• LRC (14 + 2 + 2): reconstruction cost = 7
• 14% savings: millions of $ savings!
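The overhead and reconstruction-cost figures quoted for the RS and LRC configurations follow from simple arithmetic. A minimal sketch, assuming an LRC k + l + r splits its k data fragments evenly into l local groups and that a single-fragment repair reads one whole group; the helper names `rs` and `lrc` are hypothetical.

```python
def rs(k, m):
    """Reed-Solomon k + m: (storage overhead, fragments read to rebuild one)."""
    return (k + m) / k, k

def lrc(k, l, r):
    """LRC k + l + r: k data, l local parities, r global parities.
    Assumes the k data fragments split evenly into l local groups."""
    return (k + l + r) / k, k // l

codes = {
    "RS 6+3": rs(6, 3),
    "RS 12+4": rs(12, 4),
    "RS 14+4": rs(14, 4),
    "LRC 12+2+2": lrc(12, 2, 2),
    "LRC 12+4+2": lrc(12, 4, 2),
    "LRC 14+2+2": lrc(14, 2, 2),
}

for name, (overhead, reads) in codes.items():
    print(f"{name:11s} overhead {overhead:.2f}x, reconstruction cost {reads}")

# Storage saved by moving from RS 6+3 to LRC 14+2+2.
savings = 1 - codes["LRC 14+2+2"][0] / codes["RS 6+3"][0]
print(f"savings: {savings:.0%}")
```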
PMDS and SD Codes
• What is the most storage efficient solution?
PMDS Codes
[Figure: n=7 drives, m=4 rows of sectors: d0 … d5, p0 / d6 … d11, p1 / d12 … d17, p2 / d18 … d21, q0, q1, p3]
• m rows, n columns: n drives, m x n sectors
• r row parities in each row (r=1)
• s global parities (s=2)
• tolerate r failures per row and s additional failures anywhere
recoverable case I
• r = 1 drive (column) failure
• s = 2 additional sector failures anywhere
recoverable case II
• r = 1 failure per row
• s = 2 additional failures anywhere
• d11 and d19 recoverable from their row parities
• 4 parities for the remaining 4 failures, similar to LRC
PMDS codes are Maximally Recoverable (MR) codes
case I
case II
• What if restricting to only case I?
• r
s
SD Codes
[Figure: n=7 drives, m=4 rows of sectors: d0 … d5, p0 / d6 … d11, p1 / d12 … d17, p2 / d18 … d21, q0, q1, p3]
• m rows, n columns: n drives, m x n sectors
• r row parities in each row (r=1)
• s global parities (s=2)
• tolerate r column failures and s additional failures anywhere
case I
case II
SD codes handle case I, but not case II
There are many constructions which are valid as SD codes, but not PMDS codes.
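The gap between the two guarantees can be made concrete by checking failure patterns against each rule. This is a sketch of the stated tolerance rules only, not of any code construction; the function names and the two example patterns are hypothetical, with failures given as (row, column) pairs on the m=4, n=7 grid.

```python
from collections import Counter
from itertools import combinations

def pmds_tolerates(failures, r, s):
    """PMDS rule: r failures per row plus s additional failures anywhere."""
    per_row = Counter(row for row, _ in failures)
    extra = sum(max(0, count - r) for count in per_row.values())
    return extra <= s

def sd_tolerates(failures, n, r, s):
    """SD rule: r whole columns (drives) plus s additional failures anywhere."""
    return any(
        sum(1 for _, col in failures if col not in drives) <= s
        for drives in combinations(range(n), r)
    )

M, N, R, S = 4, 7, 1, 2  # grid from the slides: m=4, n=7, r=1, s=2

# Case I: drive 2 fails entirely, plus two more sectors elsewhere.
case_one = {(row, 2) for row in range(M)} | {(0, 5), (3, 0)}
# Case II: failures spread over different columns (illustrative pattern).
case_two = {(0, 1), (0, 4), (1, 5), (2, 0), (2, 3), (3, 1)}

for name, pattern in (("case I", case_one), ("case II", case_two)):
    print(name, pmds_tolerates(pattern, R, S), sd_tolerates(pattern, N, R, S))
```

Under these rules the case I pattern satisfies both checks, while the case II pattern satisfies only the PMDS check.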
Efficient Repair of MDS Codes
[Figure: 4 nodes, each storing 2 symbols]
node 1: a1, a2
node 2: b1, b2
node 3: a1⊕b1, a2⊕b2
node 4: a1⊕b2, a2⊕b1⊕b2
Efficient Repair of Existing Codes
• ~20+% savings in general
Theoretical Bound on Efficient Repair
βmin = 1: one unit of information from each of the 3 surviving nodes
• Efficient repair: 1.83x (69% savings!)
Single Failure Repair of 6 + 6 MDS Code

                                      Reed-Solomon Coding   Regenerating Coding
# of nodes participating in repair    6                     11
# of network transfers                6x                    1.83x
# of disk I/Os                        6x                    up to 11x
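The 1.83x and 69% figures are consistent with the standard repair-bandwidth expression at the minimum-storage regenerating (MSR) point, where each of d helpers sends 1/(d − k + 1) of a fragment. A small sketch under that assumption; the function name `msr_repair_traffic` is hypothetical.

```python
from fractions import Fraction

def msr_repair_traffic(k, d):
    """Repair traffic at the MSR point, in units of one fragment,
    when d surviving nodes help (k <= d)."""
    return Fraction(d, d - k + 1)

rs_traffic = 6                          # Reed-Solomon transfers k = 6 whole fragments
rc_traffic = msr_repair_traffic(6, 11)  # 11 helpers in the 6 + 6 code
savings = 1 - rc_traffic / rs_traffic

print(float(rc_traffic), float(savings))
```

The same expression with k = 2 and d = 3 gives 3/2 of a fragment; with two symbols per fragment that is one symbol from each of the 3 surviving nodes.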
network transfer: 3 (optimal), disk I/O: 4 (no saving)
[Figure: 4-node layout (a1, a2 / b1, b2 / a1⊕b1, a2⊕b2 / a1⊕b2, a2⊕b1⊕b2); symbols transmitted for repair: a1, b2 and a1⊕b1⊕a2⊕b2 (⊕: XOR before transmitting), rebuilding a1⊕b2 and a2⊕b1⊕b2]
Regenerating Codes may require more disk I/Os than network transfers.
Unfortunately, most RC papers do not discuss the difference!
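The difference between network transfers and disk I/Os can be traced on the four-node example. This is a sketch that assumes the lost node is the one holding a1⊕b2 and a2⊕b1⊕b2, and that the node holding a1⊕b1 and a2⊕b2 XORs its two symbols before sending; the byte values and the `read` helper are illustrative.

```python
# Four nodes, two symbols each.
a1, a2, b1, b2 = 0x11, 0x22, 0x33, 0x44  # arbitrary example bytes
nodes = {
    1: (a1, a2),
    2: (b1, b2),
    3: (a1 ^ b1, a2 ^ b2),
    4: (a1 ^ b2, a2 ^ b1 ^ b2),
}

disk_reads = []  # one entry per symbol read from disk

def read(node, index):
    disk_reads.append((node, index))
    return nodes[node][index]

# Node 4 is lost; each survivor transmits exactly one symbol.
from_node1 = read(1, 0)               # a1
from_node2 = read(2, 1)               # b2
from_node3 = read(3, 0) ^ read(3, 1)  # XOR before transmitting
network_transfers = 3

# a1 ⊕ b2, and (a1⊕b1⊕a2⊕b2) ⊕ a1 = a2⊕b1⊕b2
rebuilt = (from_node1 ^ from_node2, from_node3 ^ from_node1)
print(rebuilt == nodes[4], network_transfers, len(disk_reads))
```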
storage overhead
single failure repair
≥ 2x
≥ 1.5x
< 1.5x
network
optimal
optimal
optimal
disk I/O
= network
= network
= network
network
optimal
data node
parity node
unknown
disk I/O
> network
parity nodes
parity nodes
all nodes (data and parity)
Simple Regenerating Codes
(n=6, k=4, f=2)-SRC
[Figure: MDS precode with two (6,4)-RS codes, then placement on node1 … node6]
• tolerating arbitrary two failures
• any chunk recoverable with 2 I/Os
• overhead: 3/2 * 6/4 = 2.25x
• single failure recovered efficiently
• 2 I/Os for each chunk
• 6 I/Os in total for all three chunks
• disk I/O = network I/O in repair
What’s the difference from Weaver codes?
• overhead of Weaver codes always ≥ 2x
• overhead of SRC can be smaller (< 2x)
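The 2 I/Os per chunk and 6 I/Os per node can be traced in a small sketch of the (n=6, k=4, f=2)-SRC. Two things here are assumptions: the (6,4)-RS codewords are replaced by arbitrary stand-in bytes, since repair only relies on the XOR relation, and the placement rule (node i keeps x[i], y[i+1], s[i+2], indices mod 6) is one plausible layout rather than something the slides spell out.

```python
N = 6  # nodes in the (n=6, k=4, f=2)-SRC

# Stand-ins for the two (6,4)-RS codewords and their XOR.
x = [bytes([17 * i + 1] * 4) for i in range(N)]
y = [bytes([29 * i + 5] * 4) for i in range(N)]
s = [bytes(p ^ q for p, q in zip(x[i], y[i])) for i in range(N)]

def stored(i):
    """Assumed placement: node i keeps x[i], y[i+1], s[i+2] (indices mod N),
    so the three chunks of any triple (x[j], y[j], s[j]) sit on three nodes."""
    return {
        ("x", i): x[i],
        ("y", (i + 1) % N): y[(i + 1) % N],
        ("s", (i + 2) % N): s[(i + 2) % N],
    }

def repair(failed):
    """Rebuild every chunk of one node; returns (chunks, disk reads)."""
    survivors = {}
    for i in range(N):
        if i != failed:
            survivors.update(stored(i))
    rebuilt, reads = {}, 0
    for kind, j in stored(failed):
        # Each lost chunk is the XOR of the other two chunks of its triple.
        others = [survivors[(t, j)] for t in "xys" if t != kind]
        rebuilt[(kind, j)] = bytes(p ^ q for p, q in zip(*others))
        reads += 2
    return rebuilt, reads

rebuilt, reads = repair(0)
overhead = (3 / 2) * (6 / 4)
print(rebuilt == stored(0), reads, overhead)
```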
Rotated Reed-Solomon Codes
• ~23% savings for MDS 6 + 3 codes