Jerasure Michael Jugan

advertisement
Multi-threaded erasure coding in Jerasure
Michael Jugan
mjugan1@utk.edu
Erasure Coding
a technique used in
computer systems to
handle data loss
and corruption
• network data transmission
• file archiving
• bar code reliability

D0 

Goals
Applications:
increase performance
provide an intuitive user interface
Update Jearasure to utilize multiple processor cores while coding
Start with K blocks of data.
Implementation
Encode to get M additional blocks of coding data.
Decoding recovers the data when up to M blocks are lost.
W
Performs matrix multiplication over Galois field GF(2 )
0
1
0
b
e
0
0
1
c
f
x
Original Data
Original Data
Original Data
Coding
=
Original Data 
D
Original Data  0
2
WPD
+1≥K+M
Parallel programming
Hydra
Cetus
CPU
Intel Xeon X5550
2.66 GHz Quad Core
Intel Core 2 Quad Q9300
2.55 GHz Quad Core
Bus speed
3200 MHz
1333 MHz
256 KB
256 KB
L2 Cache
1 MB
6 MB
L3 Cache
8 MB
-
OS
Ubuntu 10.04 64-bit
Ubuntu 10.04 64-bit
Ram
12GB (1066 MHz DDR3)
4GB (800 MHz DDR2)
Memcpy() speed
6.33 GB/s
2.77 GB/s
XOR speed
4.53 GB/s
1.67 GB/s
Core i5
Core 2 Duo
Pentium 4
1000
Pentium III
Pentium
Pentium II
0
1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012
Sources: [0],[1]
dual-core
Hydra, W = 8
700
Core 2 Quad
2000
Speed (MB/sec)
slices->NumberOfCores = 4;
slices->MultiThreadMethod = “disks”;
slices->Encode();
• Modern processor advancements
focus heavily on multi-core CPUs.
• Software must be specially programmed
to take advantage of the multiple cores.
WPD packetcolumns
Cauchy Reed-Solomon
• WPD = 5, (the smallest valid value for
all tested values of K and M
• K = 10
• M varied from 1 to 8
• NumberOfCores increased from 1 to 4
500
500
1 core
400
400
2 cores
300
300
3 cores
200
200
4 cores
100
100
0
0
1
2
3
4
5
M
6
7
8
1
2
3
4
M
5
Hydra
4
600
6
7
8
• Single-threaded encoding is extremely fast when M = 1.
- The first row of the Reed-Solomon coding matrix contains all ones.
- Therefore, Galois multiplication is not used to encode the first
coding drive; the data is simply copied and XOR’d.
800
800
600
600
400
400
200
200
0
0
1
2
3
1 core, disks
4
M
3
2
2
1
1
0
0
3 cores
4 cores
Number of cores
6
disks
7
8
1
packet_rows
2
3
4
M
5
6
packet_cols
7
8
packets
Hydra
3
3
2
2
1
1
0
0
3 cores
2 cores
3 cores
Cetus
4
4 cores
2 cores
3 cores
4 cores
Number of cores
• Max speedup of 3.62x using 4 cores on Hydra
• Slower than the single-threaded case when large
packet-sizes are used
Conclusion
• Increased performance
- 2 cores
almost 2x speedup
- depends heavily on the chosen coding parameters
• Easy to use
- requires two additional lines of code.
• Future work
- automate the process of finding optimal settings
Cetus
4
3
2 cores
5
Number of cores
• W = 8, 16, 32
600
1000
2 cores
Reed-Solomon
Both
Cetus
1000
4
• Encoded 100 MB with Reed-Solomon and Cauchy Reed-Solomon
• -O3 compiler optimizations
• Averaged 5 runs where each run averaged 10 encodes
Cetus, W = 8
700
One packetcolumn
Disks - fast when M is a multiple of NumberOfCores.
Otherwise, performance suffers because some
threads independently encode a second disk.
Packet_cols - consistently performs well when PS = 2KB
Reed-Solomon
Speed (MB/sec)
Frequency (MHz)
3000
Users add two additional lines of code before calling Encode().
Coding Parameters:
L1 Cache
In recent years , CPU speeds have leveled off due
to device limitations such as power consumption.
Core 2 Extreme
4 cores
Hydra
Performance tests and results
Cauchy Reed-Solomon Coding:
K=1
WPD = 2
3 cores
4 cores, 2KB PS
Testing Methods
Requirement:
2 cores
packets (PPS x M)
Each color represents the data coded by a single thread
D
Original Data
0
Original Data D1
Original Data D2
C

0
Data 
C

1
• Special case for W = 1
• Galois field arithmetic no longer needed
only memcpy() and XOR
• New data-layout parameter: words-per-drive (WPD)
400
Packet Layout
Average Performance
Relative to One Core
1
K = 3 0
0
a
M = 2
d
1 core
2KB PS
packet_rows (WPD x M) packet_cols (PPS / WPD)
Two new public variables were added to class JER_Slices.
- int NumberOfCores
- string MultiThreadMethod
As the number of
cores is increased,
it becomes vital to
use small packets.
600
0
C0 

Reed-Solomon coding:
K = 2 PacketSize PacketSize D
0
PPS = 2 PacketSize PacketSize D1
800
200


C1 

PacketSize (PS)
x PacketsPerSlice (PPS)
BlockSize
1000
method name (max # of threads supported)
Jerasure - an open-source erasure coding library
Data is organized in blocks.
Average speeds for all tested parameters (Hydra)
Four methods for splitting the work among threads:
disks (M)
2 KB PacketSize
One packet-column WPD packet-columns
Average Performance
Relative to One Core
Defined:
Tested with three different data layouts
Speed (MB/sec)
Background
Cauchy Reed-Solomon
4 cores
Number of cores
• The performance for a set number of
cores varied depending upon the value of M.
• Altering W did not significantly change
the relative speedup.
References
[0] "Intel Microprocessor Quick Reference Guide - Product Family,"
http://www.intel.com/pressroom/kits/quickreffam.htm.
[1] "Ark | Your source for information on Intel products," http://ark.intel.com.
Download