Multi-threaded erasure coding in Jerasure
Michael Jugan
mjugan1@utk.edu

Background

Erasure Coding
A technique used in computer systems to handle data loss and corruption.
Applications:
• network data transmission
• file archiving
• bar code reliability

Jerasure - an open-source erasure coding library
• Data is organized in blocks.
• Start with K blocks of data.
• Encode to get M additional blocks of coding data.
• Decoding recovers the data when up to M blocks are lost.
• Performs matrix multiplication over the Galois field GF(2^W).

Example with K = 3 and M = 2:

\[
\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
a & b & c \\
d & e & f
\end{pmatrix}
\times
\begin{pmatrix} D_0 \\ D_1 \\ D_2 \end{pmatrix}
=
\begin{pmatrix} D_0 \\ D_1 \\ D_2 \\ C_0 \\ C_1 \end{pmatrix}
\]

The first K rows reproduce the original data; the last M rows produce the coding blocks.

Parallel programming
In recent years, CPU speeds have leveled off due to device limitations such as power consumption.
• Modern processor advancements focus heavily on multi-core CPUs.
• Software must be specially programmed to take advantage of the multiple cores.

[Figure: Intel CPU frequency (MHz) by year, 1992 to 2012: Pentium, Pentium II, Pentium III, Pentium 4, Core 2 Duo (dual-core), Core 2 Quad and Core 2 Extreme (4 cores), Core i5. Sources: [0], [1]]

Goals
• Update Jerasure to utilize multiple processor cores while coding
• Increase performance
• Provide an intuitive user interface

Implementation

Packet Layout
Defined: BlockSize = PacketSize (PS) x PacketsPerSlice (PPS)

[Figure: Reed-Solomon coding with K = 2 and PPS = 2. Data blocks D0 and D1 and coding blocks C0 and C1 each hold PPS packets of PacketSize bytes.]

Cauchy Reed-Solomon coding
• Special case for W = 1
• Galois field arithmetic is no longer needed, only memcpy() and XOR
• New data-layout parameter: words-per-drive (WPD)
• Requirement: 2^WPD + 1 ≥ K + M

[Figure: Cauchy Reed-Solomon coding with K = 1 and WPD = 2. Data block D0 and coding blocks C0 and C1.]

User interface
Two new public variables were added to class JER_Slices.
- int NumberOfCores
- string MultiThreadMethod
Users add two additional lines of code before calling Encode().

    slices->NumberOfCores = 4;
    slices->MultiThreadMethod = "disks";
    slices->Encode();

Four methods for splitting the work among threads
Method name (max # of threads supported):
• disks (M)
• packets (PPS x M)
• packet_rows (WPD x M)
• packet_cols (PPS / WPD)

[Figure: block layouts for the four methods. Each color represents the data coded by a single thread.]

Testing Methods
• Encoded 100 MB with Reed-Solomon and Cauchy Reed-Solomon
• -O3 compiler optimizations
• Averaged 5 runs where each run averaged 10 encodes

Test machines:

                  Hydra                        Cetus
CPU               Intel Xeon X5550             Intel Core 2 Quad Q9300
                  2.66 GHz Quad Core           2.55 GHz Quad Core
Bus speed         3200 MHz                     1333 MHz
L1 Cache          256 KB                       256 KB
L2 Cache          1 MB                         6 MB
L3 Cache          8 MB                         -
OS                Ubuntu 10.04 64-bit          Ubuntu 10.04 64-bit
RAM               12 GB (1066 MHz DDR3)        4 GB (800 MHz DDR2)
memcpy() speed    6.33 GB/s                    2.77 GB/s
XOR speed         4.53 GB/s                    1.67 GB/s

Coding Parameters:
Reed-Solomon
• W = 8, 16, 32
Cauchy Reed-Solomon
• WPD = 5 (the smallest valid value for all tested values of K and M)
• Tested with three different data layouts: 2 KB PacketSize, one packet-column, WPD packet-columns
Both
• K = 10
• M varied from 1 to 8
• NumberOfCores increased from 1 to 4

Performance tests and results

Reed-Solomon

[Figure: encoding speed (MB/sec) versus M for 1 to 4 cores. Panels: Hydra, W = 8 and Cetus, W = 8.]

• Single-threaded encoding is extremely fast when M = 1.
  - The first row of the Reed-Solomon coding matrix contains all ones.
  - Therefore, Galois multiplication is not used to encode the first coding drive; the data is simply copied and XOR'd.

[Figure: average performance relative to one core versus number of cores (2, 3, 4). Panels: Hydra and Cetus.]

• The performance for a set number of cores varied depending upon the value of M.
• Altering W did not significantly change the relative speedup.

Cauchy Reed-Solomon

[Figure: encoding speed (MB/sec) versus M for the disks, packet_rows, packet_cols and packets methods, from 1 core to 4 cores with 2 KB PS. Panels: Hydra and Cetus.]

• Disks - fast when M is a multiple of NumberOfCores. Otherwise, performance suffers because some threads independently encode a second disk.
• Packet_cols - consistently performs well when PS = 2 KB.
• As the number of cores is increased, it becomes vital to use small packets.

[Figure: average speeds for all tested parameters (Hydra), and average performance relative to one core versus number of cores (2, 3, 4) for Hydra and Cetus.]

• Max speedup of 3.62x using 4 cores on Hydra
• Slower than the single-threaded case when large packet sizes are used

Conclusion
• Increased performance
  - 2 cores give almost a 2x speedup
  - depends heavily on the chosen coding parameters
• Easy to use
  - requires two additional lines of code
• Future work
  - automate the process of finding optimal settings

References
[0] "Intel Microprocessor Quick Reference Guide - Product Family," http://www.intel.com/pressroom/kits/quickreffam.htm.
[1] "Ark | Your source for information on Intel products," http://ark.intel.com.
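
Worked sketch: why M = 1 needs no Galois multiplication
The Reed-Solomon result for M = 1 follows from the all-ones first row of the coding matrix: multiplying by 1 in GF(2^W) leaves a word unchanged, and addition in GF(2^W) is XOR. The sketch below illustrates that case only. It is not Jerasure code; `Block`, `xorInto`, `encodeParity` and `recoverLost` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for this sketch: one block of bytes.
using Block = std::vector<std::uint8_t>;

// XOR src into dst byte by byte; both blocks must have the same size.
void xorInto(Block& dst, const Block& src) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] ^= src[i];
    }
}

// Coding block for an all-ones matrix row: copy D0, then XOR in D1..D(K-1).
Block encodeParity(const std::vector<Block>& data) {
    assert(!data.empty());
    Block coding = data[0];
    for (std::size_t d = 1; d < data.size(); ++d) {
        xorInto(coding, data[d]);
    }
    return coding;
}

// Rebuild one lost data block from the coding block and the survivors.
// The content of data[lost] is never read.
Block recoverLost(const std::vector<Block>& data, std::size_t lost,
                  const Block& coding) {
    assert(lost < data.size());
    Block out = coding;
    for (std::size_t d = 0; d < data.size(); ++d) {
        if (d != lost) {
            xorInto(out, data[d]);
        }
    }
    return out;
}
```

With K = 3 blocks, `encodeParity` yields one coding block, and `recoverLost` returns the missing data block when any single block is lost. This matches the poster's statement that up to M blocks can be recovered, here with M = 1.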
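
Worked sketch: load balance of the disks method
The observation that the disks method is fast only when M is a multiple of NumberOfCores can be illustrated with a simple load model. The sketch below is not taken from Jerasure. `disksPerThread` and `idealSpeedup` are hypothetical helpers. The model assumes that coding disks are dealt round-robin to threads, that every disk costs the same to encode, and that memory bandwidth is not a limit.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Jerasure): deal m coding disks
// round-robin to `cores` threads and return how many each thread encodes.
std::vector<int> disksPerThread(int m, int cores) {
    assert(m >= 1 && cores >= 1);
    std::vector<int> load(cores, 0);
    for (int disk = 0; disk < m; ++disk) {
        load[disk % cores] += 1;
    }
    return load;
}

// Idealized speedup over one thread under the stated assumptions:
// the run finishes when the busiest thread finishes its disks.
double idealSpeedup(int m, int cores) {
    const std::vector<int> load = disksPerThread(m, cores);
    const int busiest = *std::max_element(load.begin(), load.end());
    return static_cast<double>(m) / busiest;
}
```

Under this model, M = 4 on 4 cores gives one disk per thread and an ideal speedup of 4, while M = 5 on 4 cores forces one thread to encode a second disk and caps the ideal speedup at 2.5. Measured speedups will be lower than these bounds.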