A “Hitchhiker’s” Guide to Fast and
Efficient Data Reconstruction in
Erasure-coded Data Centers
K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D.
Borthakur, K. Ramchandran
UC Berkeley, Facebook
ACM SIGCOMM 2014
1
A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed
Storage Systems: A Study on the Facebook Warehouse Cluster
K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K.
Ramchandran
UC Berkeley, Facebook
The 5th USENIX Workshop on Hot Topics in File and Storage Technologies,
HotStorage 2013
http://www.camdemy.com/media/11869
2
Outline
• Introduction
• Hitchhiker’s erasure code
• Evaluation results
• Conclusion
3
Need for Redundant Storage in Data
Centers
• Frequent unavailability events in data centers
– Unreliable components
– Software glitches, maintenance shutdowns, power
failures, etc.
• Redundancy necessary for reliability and
availability
4
Popular Approach for Redundant
Storage: Replication
• Distributed file systems used in data centers
store multiple copies of data on different
machines
• Machines typically chosen on different racks
– to tolerate rack failures
• E.g., Hadoop Distributed File System (HDFS)
stores 3 replicas by default
5
[Figure: replication in HDFS]
6
Massive Data Sizes: Need Alternative
to Replication
• Small to moderately sized data: disk storage is
inexpensive
– Replication viable
• No longer true for massive scales of operation
– e.g., Facebook data warehouse cluster stores
multiple tens of Petabytes (PBs)
“Erasure codes” are an alternative
7
Erasure Codes in Data Centers
• Facebook data warehouse cluster
– Uses Reed-Solomon (RS) codes instead of 3-replication
on a portion of the data
– Savings of multiple Petabytes of storage space
8
Erasure Codes

First-order comparison (both at 2x redundancy):

              Replication        Reed-Solomon (RS) code
  block 1     a                  a         (data blocks)
  block 2     a                  b
  block 3     b                  a+b       (parity blocks)
  block 4     b                  a+2b

  Redundancy  2x                 2x
  Tolerates   any one failure    any two failures

In general:
• Replication: lower MTTDL (Mean Time To Data Loss), high storage requirement
• Erasure codes: order of magnitude higher MTTDL with much less storage
Erasure Codes
• Using RS codes instead of 3-replication on
less-frequently accessed data has led to
savings of multiple Petabytes in the Facebook
warehouse cluster
• Facebook warehouse cluster employs a (k=10, r=4)
RS code, thus resulting in a 1.4x storage requirement
15
Reed-Solomon (RS) Codes
• (#data, #parity) RS code:
  – tolerates failure of any #parity blocks
  – these (#data + #parity) blocks constitute a “stripe”
• Facebook warehouse cluster uses a (10, 4) RS code

Example: (2, 2) RS code — 4 blocks in a stripe:
  a       (data block)       #data = 2
  b       (data block)
  a+b     (parity block)     #parity = 2
  a+2b    (parity block)
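To make the (2, 2) example concrete, here is a minimal sketch (not from the slides) of encoding and decoding; plain Python integers stand in for the finite-field (e.g., GF(2^8)) arithmetic a real RS implementation would use.

```python
# Toy (k=2, r=2) Reed-Solomon-style code from the example above.
# Integer arithmetic is used only for readability.

def encode(a, b):
    return [a, b, a + b, a + 2 * b]               # 2 data blocks + 2 parity blocks

def decode(available):
    """Recover (a, b) from ANY two surviving blocks.
    `available` maps block index (0..3) to the block's value."""
    (i, x), (j, y) = sorted(available.items())
    if (i, j) == (0, 1): return x, y              # both data blocks survive
    if (i, j) == (0, 2): return x, y - x          # a and a+b
    if (i, j) == (0, 3): return x, (y - x) // 2   # a and a+2b
    if (i, j) == (1, 2): return y - x, x          # b and a+b
    if (i, j) == (1, 3): return y - 2 * x, x      # b and a+2b
    if (i, j) == (2, 3): return 2 * x - y, y - x  # a+b and a+2b

a, b = 7, 9
blocks = encode(a, b)
# any 2 of the 4 blocks suffice, i.e., any 2 failures are tolerated
assert all(decode({i: blocks[i], j: blocks[j]}) == (a, b)
           for i in range(4) for j in range(i + 1, 4))
```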
Existing Systems
• Need additional storage
– Huang et al. (Windows Azure) 2012, Sathiamoorthy et
al. (Xorbas) 2013, Esmaili et al. (CORE) 2013
• Add additional parities to reduce download
– Hu et al. (NCFS 2011)
• Highly restricted parameters
– Khan et al. (Rotated-RS) 2012: #parity≤3
– Xiang et al., Wang et al. 2010, Hu (NCCloud) et al.
2012: #parity≤2
– Hitchhiker performs as well as or better than these
codes even in such restricted settings
17
Erasure codes in Data Centers:
HDFS-RAID
Borthakur, “HDFS and Erasure Codes (HDFS-RAID)”
Fan, Tantisiriroj, Xiao and Gibson, “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW 09
18
Erasure codes in Data Centers:
HDFS-RAID
(10, 4) Reed-Solomon code
• Any 10 blocks sufficient
• Can tolerate any 4 failures
Borthakur, “HDFS and Erasure Codes (HDFS-RAID)”
Fan, Tantisiriroj, Xiao and Gibson, “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW 09
19
Impact on Data Center Network
20
Impact on Data Center Network
RS codes significantly increase network
usage during reconstruction
21
Impact on Data Center Network
Burdens the already oversubscribed Top-of-Rack (TOR)
switches and the routers above them
22
Machine Unavailability Events
• From HDFS NameNode logs (http://hadoop.apache.org/)
• Logged when no heart-beat for > 15min
– machines unavailable for more than 15 minutes in
a day
– 15 minutes is the default wait-time of the cluster
to flag a machine as unavailable
• Blocks marked unavailable, periodic recovery
process
• The period 22nd Jan. to 24th Feb. 2013
Rashmi et al., “A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the
Facebook Warehouse Cluster”, USENIX HotStorage Workshop 2013.
23
Machine Unavailability Events
Median of ≈50 machine-unavailability events logged per day
24
Missing blocks per stripe
Dominant scenario: Single block recovery
25
Facebook Data Warehouse Cluster
• Median of 180 TB transferred across racks per day for recovery operations
• Around 5 times that under 3-replication
• Reduction of more than 50 TB of cross-rack traffic per day
26
RS codes: The Good and The Bad
• Maximum possible fault tolerance for given
storage overhead
– Storage capacity optimal
– (“maximum-distance-separable” in coding theory
parlance)
• Flexibility in choice of parameters
– Supports any number of data and parity blocks
• Not designed to handle reconstruction operations
efficiently
– Negative impact on the network
27
Goal
• To build a system with:
28
Hitchhiker
• Is a system with:
29
At an Abstract Level
HITCHHIKER
30
Hitchhiker’s Erasure Code: Toy Example

Take a (2, 2) Reed-Solomon code; each block holds 2 bytes (1 byte in each column):

             1st byte    2nd byte
  Block 1    a1          b1          (data blocks)
  Block 2    a2          b2
  Block 3    a1+a2       b1+b2       (parity blocks)
  Block 4    a1+2a2      b1+2b2
Hitchhiker’s Erasure Code: Toy Example

In the (2, 2) RS code, recovering Block 1 requires a download & IO of 4 bytes:
a2 and a1+a2 (to recover a1), and b2 and b1+b2 (to recover b1).
Hitchhiker’s Erasure Code: Toy Example

Add information from the first group onto a parity of the second group:

             1st byte    2nd byte
  Block 1    a1          b1
  Block 2    a2          b2
  Block 3    a1+a2       b1+b2
  Block 4    a1+2a2      b1+2b2+a1

No additional storage! (A minimal encoding sketch follows below.)
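A minimal sketch of this piggybacked encoding, again using plain integers in place of the finite-field symbols a real code would use:

```python
# Toy (2, 2) Hitchhiker code from the slide: two RS substripes (the a-bytes and
# the b-bytes), with a1 "hitchhiking" on the last parity of the second substripe.

def hh_encode(a1, a2, b1, b2):
    return [
        (a1,          b1),                 # Block 1 (data)
        (a2,          b2),                 # Block 2 (data)
        (a1 + a2,     b1 + b2),            # Block 3 (parity)
        (a1 + 2 * a2, b1 + 2 * b2 + a1),   # Block 4 (parity, carries piggyback a1)
    ]

print(hh_encode(3, 5, 2, 8))   # [(3, 2), (5, 8), (8, 10), (13, 21)]
```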
Fault-Tolerance (Toy Example)

Same fault tolerance as the RS code: can tolerate failure of any 2 blocks.
For example, if Blocks 1 and 2 fail:
• recover a1 and a2 from the first bytes of Blocks 3 and 4 (a1+a2 and a1+2a2), as in RS
• subtract a1 from the second byte of Block 4 to get back the RS parity b1+2b2
• recover b1 and b2 from b1+b2 and b1+2b2, as in RS
(A decode sketch follows below.)
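A hedged sketch of the two-failure decode walked through above, with the same integer stand-ins for field symbols:

```python
# Decode when Blocks 1 and 2 are lost and only the two parities survive.
# block3 = (a1+a2, b1+b2), block4 = (a1+2a2, b1+2b2+a1).

def decode_two_failures(block3, block4):
    p1, q1 = block3
    p2, q2 = block4
    a2 = p2 - p1              # first substripe is an ordinary RS code
    a1 = p1 - a2
    q2_rs = q2 - a1           # subtract the piggyback to recover the RS parity b1+2b2
    b2 = q2_rs - q1
    b1 = q1 - b2
    return (a1, b1), (a2, b2)

# with (a1, a2, b1, b2) = (3, 5, 2, 8), the parities are (8, 10) and (13, 21)
assert decode_two_failures((8, 10), (13, 21)) == ((3, 2), (5, 8))
```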
Efficient Reconstruction

To reconstruct Block 1, only 3 bytes are transferred (download & IO) instead of 4 as in RS:
• download b2, b1+b2, and b1+2b2+a1
• subtract b2 from b1+b2 to recover b1
• subtract b1 and 2b2 from b1+2b2+a1 to recover a1
(A reconstruction sketch follows below.)
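And a sketch of the 3-byte reconstruction of Block 1 described above (integer stand-ins for field symbols):

```python
# Reconstruct Block 1 = (a1, b1) from only 3 bytes:
#   b2 (from Block 2), b1+b2 (from Block 3), b1+2b2+a1 (from Block 4),
# instead of the 4 bytes an RS decode would need.

def reconstruct_block1(b2, p3, p4):
    b1 = p3 - b2              # second substripe gives b1
    a1 = p4 - b1 - 2 * b2     # peel off the RS parity b1+2b2 to expose the piggyback
    return a1, b1

# with (a1, a2, b1, b2) = (3, 5, 2, 8): p3 = 10, p4 = 21
assert reconstruct_block1(8, 10, 21) == (3, 2)
```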
Hitchhiker’s Erasure Code
• Builds on top of RS codes
• Uses our theoretical framework of
“Piggybacking”*
• Three versions
– XOR
– XOR+
– Non-XOR
* K. V. Rashmi, Nihar Shah, K. Ramchandran, “A Piggybacking Design Framework for Read- and Download-efficient
Distributed Storage Codes”, in IEEE International Symposium on Information Theory, 2013.
42
Hop and couple (disk layout)
• A way of choosing which bytes to mix
– couples bytes that lie farther apart within a block
– to minimize the degree of discontinuity in disk
reads during data reconstruction
• Translates the savings in network transfer into
savings in disk IO as well
– by making reads contiguous (a coupling sketch follows Figure 7 below)
43
RS vs Hitchhiker from the Network’s Perspective…
44
Data Transfer during Reconstruction
in RS-based System
Transfer: 10 full blocks
Connect to 10 machines
45
Data Transfer during Reconstruction
in Hitchhiker
Reconstruction of data blocks 1-9:
Transfer: 2 full blocks + 9 half blocks (= 6.5 blocks total)
Connect to 11 machines
46
Data Transfer during Reconstruction
in Hitchhiker
Reconstruction of data block 10:
Transfer: 13 half blocks (=6.5 blocks total)
Connect to 13 machines
47
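The 6.5-block figure on these two slides follows from simple arithmetic; the sketch below assumes a 256 MB block size purely for illustration:

```python
block = 256.0                                   # MB, assumed block size for illustration
rs = 10 * block                                 # RS: 10 full blocks
hh_blocks_1_to_9 = 2 * block + 9 * block / 2    # Hitchhiker, blocks 1-9: 2 full + 9 half
hh_block_10 = 13 * block / 2                    # Hitchhiker, block 10: 13 half blocks

print(rs, hh_blocks_1_to_9, hh_block_10)        # 2560.0 1664.0 1664.0  (= 6.5 blocks)
print(1 - hh_blocks_1_to_9 / rs)                # 0.35 -> 35% less data transferred
```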
Hop-and-Couple
• Technique to pair bytes under Hitchhiker’s
erasure code
• Makes disk reads during reconstruction
contiguous
48
Hop-and-Couple
hop length = 1 byte = 1
Figure 7: Two ways of coupling bytes to form stripes for Hitchhiker's erasure code. The shaded bytes
are read and downloaded for the reconstruction of the first unit. While both methods require the
same amount of data to be read, the reading is discontiguous in (a), while (b) ensures that the data to
be read is contiguous.
49
Hop-and-Couple
hop length = half the size of a unit
Figure 7: Two ways of coupling bytes to form stripes for Hitchhiker's erasure code. The shaded bytes
are read and downloaded for the reconstruction of the first unit. While both methods require the
same amount of data to be read, the reading is discontiguous in (a), while (b) ensures that the data to
be read is contiguous.
50
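A small sketch of the coupling pattern Figure 7 illustrates; the function names and the 8-byte unit are illustrative assumptions, not the paper's code:

```python
def coupled_pairs(unit_size, hop):
    """Byte-index pairs coupled within a unit: bytes are taken in chunks of
    2*hop, and byte i is coupled with byte i + hop."""
    return [(i, i + hop)
            for start in range(0, unit_size, 2 * hop)
            for i in range(start, start + hop)]

def bytes_read(unit_size, hop):
    """Indices read from an intact unit when reconstruction needs only the
    second byte of every coupled pair (as in the toy example)."""
    return sorted(j for _, j in coupled_pairs(unit_size, hop))

print(bytes_read(8, 1))        # [1, 3, 5, 7]  -> hop = 1 byte: discontiguous reads
print(bytes_read(8, 8 // 2))   # [4, 5, 6, 7]  -> hop = half the unit: one contiguous read
```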
Hitchhiker’s Erasure Code
Figure 2: Two stripes of a (k=10, r=4) Reed-Solomon (RS)
code. Ten units of data (first ten rows) are encoded using the
RS code to generate four parity units (last four rows).
51
Hitchhiker’s Erasure Code
Figure 2: Two stripes of a (k=10, r=4) Reed-Solomon (RS)
code. Ten units of data (first ten rows) are encoded using
the RS code to generate four parity units (last four rows).
Figure 3: The theoretical framework of
Piggybacking [22] for parameters (k=10, r=4).
Each row represents one unit of data.
52
Hitchhiker-XOR
• The XOR-only feature of these erasure codes
significantly reduces the computational
complexity of decoding, making degraded
reads and failure recovery faster.
• Hitchhiker's erasure code optimizes only the
reconstruction of data units; reconstruction of
parity units is performed as in RS codes.
53
Hitchhiker-XOR
Figure 4: Hitchhiker-XOR code for (k=10, r=4).
Each row represents one unit of data.
54
Hitchhiker-XOR+
• Hitchhiker-XOR+ further reduces the amount of data
required for reconstruction and employs only
additional XOR operations
• It requires the underlying RS code to satisfy the
all-XOR-parity property: at least one parity of the
RS code must be the XOR of all the data units
55
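As a concrete illustration of this property, a hedged helper (not from the paper) that checks whether a parity unit is the byte-wise XOR of all data units:

```python
from functools import reduce

def has_all_xor_parity(data_units, parity_unit):
    """True if `parity_unit` equals the byte-wise XOR of all `data_units`
    (the all-XOR-parity property required by Hitchhiker-XOR+)."""
    xor = reduce(lambda u, v: bytes(x ^ y for x, y in zip(u, v)), data_units)
    return xor == parity_unit

units = [bytes([i, i + 1]) for i in range(10)]   # 10 toy data units of 2 bytes each
parity2 = reduce(lambda u, v: bytes(x ^ y for x, y in zip(u, v)), units)
assert has_all_xor_parity(units, parity2)
```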
Hitchhiker-XOR+
Figure 5: Hitchhiker-XOR+ for (k=10, r=4). Parity 2 of the
underlying RS code is all-XOR.
56
Hitchhiker-nonXOR
• Hitchhiker-nonXOR guarantees the same
savings as Hitchhiker-XOR+ even when the
underlying RS code does not possess the
all-XOR-parity property, but at the cost of
additional finite-field arithmetic.
• Hitchhiker-nonXOR can be built on top of any
RS code.
57
Hitchhiker-nonXOR
Figure 6: Hitchhiker-nonXOR code for (k=10, r=4). This can be built on any RS
code. Each row is one unit of data.
58
Implementation & Evaluation Setup(1)
• Implemented on top of HDFS-RAID
– Erasure coding module in HDFS based on RS
– Used in the Facebook data warehouse cluster
• Deployed and tested on a 60-machine test
cluster at Facebook
– Verified 35% reduction in the network transfers
during reconstruction
59
Implementation & Evaluation Setup(2)
• Evaluation of timing metrics on the Facebook
data warehouse cluster in production
– under real-time production traffic and workloads
– using Map-Reduce to run encoding and
reconstruction jobs, just as HDFS-RAID
60
Decoding Time
36% reduction
• RS decoding is needed for only half of each block
• Faster computation for degraded reads and recovery
• XOR versions: 25% lower than non-XOR
61
Read & Transfer Time
35% less
• Read & transfer time 30% lower in Hitchhiker (HH)
• Similar reduction for other block sizes (4 MB and 64 MB) as well
62
[Plots (two slides): median and the 95th percentile of the read time]
63
64
Encoding Time
72% higher
Benefits outweigh higher encoding cost in many systems (e.g., HDFS):
• encoding of raw data into erasure-coded data is a one-time operation
• often run as a background job
65
Hitchhiker
66
Conclusion
• We present Hitchhiker, a new erasure-coded
storage system
• Hitchhiker reduces both network and disk
traffic during reconstruction by 25% to 45%
compared to RS-based systems
67
References
• [6] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran. Network coding for
  distributed storage systems. IEEE Trans. Inf. Theory, Sept. 2010.
• [17] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and
  D. Borthakur. XORing elephants: Novel erasure codes for big data. In VLDB, 2013.
• [19] D. Papailiopoulos, A. Dimakis, and V. Cadambe. Repair optimal erasure codes through
  Hadamard designs. IEEE Trans. Inf. Theory, May 2013.
• [20] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran. A solution to
  the network challenges of data recovery in erasure-coded distributed storage systems: A study on
  the Facebook warehouse cluster. In Proc. USENIX HotStorage, June 2013.
• [21] K. V. Rashmi, N. B. Shah, and P. V. Kumar. Optimal exact-regenerating codes for the MSR and
  MBR points via a product-matrix construction. IEEE Trans. Inf. Theory, 2011.
• [22] K. V. Rashmi, N. B. Shah, and K. Ramchandran. A piggybacking design framework for read-
  and download-efficient distributed storage codes. In IEEE International Symposium on Information
  Theory, 2013.
• [26] N. Shah, K. Rashmi, P. Kumar, and K. Ramchandran. Distributed storage codes with
  repair-by-transfer and non-achievability of interior points on the storage-bandwidth tradeoff.
  IEEE Trans. Inf. Theory, 2012.
68
Figure 9: Data read patterns during reconstruction of blocks 1,
4 and 10 in Hitchhiker-XOR+: the shaded bytes are read and
downloaded.
69