PROC. OF ANNUAL PRODUCT CONFERENCE–2016, INDIAN INSTITUTE OF SCIENCE

Design Architecture and Implementation of a High Throughput Low Latency Reed-Solomon Decoder

Y Vamshikrishna, T SV Satyannarayana, Shayan G. Srinivasa, DESE, IISc

ABSTRACT

Error correcting codes (ECC) are universally used in communication systems involving transmission and storage. Today's ECC systems need not only efficient code designs that guarantee low error rates, but also architectures mindful of area, power and throughput for a viable system-on-chip (SoC). The Reed-Solomon (RS) code is one of the most powerful and versatile non-binary algebraic error correcting codes, with guaranteed burst error correction capability. High-speed data transmission techniques such as fiber-optical networking systems require efficient error correcting codes supporting high throughput to meet the continuing demand for higher data rates. In this paper, we propose a novel low complexity architecture for hard decision decoding (HDD) of RS codes with low latency and high throughput. We use our design towards the construction of a soft decision based Chase decoder for correcting symbol errors beyond minimum distance decoding.

I INTRODUCTION

Reed-Solomon codes were invented in 1960. It was not until the late sixties that Berlekamp and Massey invented an efficient algorithm for decoding them. Today, RS codes are ubiquitous (in data storage centres, storage media and transmission channels); billions of dollars are invested in products containing such error-correcting encoders and decoders, and petabytes of data are decoded each minute using such codes. It is no exaggeration to say that three-quarters of the codes used today are Reed-Solomon codes.
Reed-Solomon codes have many interesting properties, such as their guaranteed random-error-correction capability, burst-error-correction capability, and erasure-recovery capability, which make them very appealing for many applications [1]. There are two main categories of Reed-Solomon decoding: hard-decision decoding (HDD) and algebraic soft-decision decoding (ASD). ASD provides better error correction performance than HDD at the cost of higher complexity. The hard decision RS decoder can be implemented with two popular algorithms for solving the key equation [1] and computing the error locator polynomial, namely the Berlekamp-Massey (BM) algorithm and the modified Euclidean (ME) algorithm. The ME algorithm is easier to understand and implement, whereas the BM algorithm is more efficient in terms of computational complexity and implementation. ASD methods such as the Guruswami-Sudan (GS) algorithm, the Koetter-Vardy (KV) algorithm and bit-level generalized minimum distance (BGMD) decoding involve interpolation and factorization, where the interpolation stage accounts for a major part of the computations. These decoders are very complex and impractical. On the other hand, the low-complexity Chase (LCC) decoder requires a maximum multiplicity of one. LCC decoding with eight test vectors achieves the same frame error rate (FER) performance as the KV algorithm with multiplicity four. Owing to its maximum multiplicity of one and its lower interpolation and factorization complexity, LCC decoding is one of the preferred ASD methods.

© DESE - Dept. of Electronic Systems Engineering, IISc, Bangalore

We embarked on this project motivated by a flexible rate RS architecture with a provably efficient design, from a throughput and latency perspective, applicable to bursty optical communication. RS codes have a flexible range of codeword lengths and code rates.
Besides the traditional applications, such as magnetic and optical recording, digital televisions, cable modems, etc., RS codes are also finding their way into emerging applications for forward error correcting codes in medical implants.

II FUNCTIONAL ASPECTS

The primary objective of the proposed project is to implement an RS decoder using a hard decision decoding algorithm, with an eye towards a hardware-efficient soft decision decoding circuit. A soft-decision decoding algorithm is implemented to enhance the correction capability beyond bounded distance decoding. The implementation targets maximum throughput and minimum latency. Other important aspects such as power dissipation and area also have to be carefully minimized for a feasible design.

III HARD DECISION DECODING

The parameters of the RS $(n, k, t)$ code are the code length $n$, the message length $k$ and the error correcting capability $t$. Let a codeword $c(x)$ be corrupted by an additive error $e(x)$, resulting in $r(x)$. Suppose $\upsilon$ denotes the number of errors. Then $e(x)$ has the following form

$$ e(x) = \sum_{i=1}^{\upsilon} y_i x^{i}, \quad 0 \le i \le n-1, \tag{1} $$

where $y_i$ is the error magnitude at error location $i$. The syndromes can be calculated as

$$ S_j = r(\alpha^{j}) = e(\alpha^{j}) = \sum_{i=1}^{\upsilon} Y_i X_i^{j}, \quad 1 \le j \le 2t, \tag{2} $$

where $Y_i = y_i$ and $X_i = \alpha^{i}$. The aim is to solve the above $2t$ equations to get the pairs $(X_i, Y_i)$. Let us define a polynomial known as the error locator polynomial $\Lambda(x)$:

$$ \Lambda(x) = \prod_{i=1}^{\upsilon} (1 + X_i x) = \Lambda_0 + \Lambda_1 x + \cdots + \Lambda_{\upsilon-1} x^{\upsilon-1} + \Lambda_{\upsilon} x^{\upsilon}. \tag{3} $$

The $X_i$ values are obtained as the inverses of the roots of the above polynomial. Given the values $X_i$, the equations (2) are linear in $Y_i$ and can be solved. The error correction involves a 4-step procedure.

Step 1: Calculation of the syndromes $S_j$: The syndromes are evaluated from $r(x)$ as per (2).

Step 2: Calculation of the $\Lambda_i$ from the $S_j$ (Berlekamp): We can compute $\Lambda(x)$ iteratively in $2t$ steps [1].
Let $\Lambda^{(\mu)}(x)$ denote the error locator polynomial at the $\mu$-th step of the iteration. To find $\Lambda(x)$ iteratively, we start with the initialized Table 1 and proceed to fill out the table. Let $l_\mu$ be the degree of $\Lambda^{(\mu)}(x)$. Assuming that we have filled out the $\mu$-th row, we find the $(\mu+1)$-th row using the procedure below, where the discrepancy is $d_\mu = \sum_{i=0}^{l_\mu} S_{\mu+1-i}\,\Lambda_i^{(\mu)}$.

1. If $d_\mu = 0$, then $\Lambda^{(\mu+1)}(x) = \Lambda^{(\mu)}(x)$ and $l_{\mu+1} = l_\mu$.
2. If $d_\mu \neq 0$, then we search for another row $\rho$ prior to the $\mu$-th row where $d_\rho \neq 0$ and the number $\rho - l_\rho$ in the last column of Table 1 has the largest value. $\Lambda^{(\mu+1)}(x)$ and $l_{\mu+1}$ are then updated as

$$ \Lambda^{(\mu+1)}(x) = \Lambda^{(\mu)}(x) - d_\mu d_\rho^{-1} x^{\mu-\rho} \Lambda^{(\rho)}(x), \tag{4} $$

$$ l_{\mu+1} = \max[\,l_\mu,\; l_\rho + \mu - \rho\,]. \tag{5} $$

Table 1: Berlekamp's iterative procedure for finding the error locator polynomial $\Lambda(x)$ of a RS code [courtesy [1], chap. 6, page 210].

| $\mu$ | $\Lambda^{(\mu)}(x)$ | $d_\mu$ | $l_\mu$ | $\mu - l_\mu$ |
|---|---|---|---|---|
| $-1$ | $1$ | $1$ | $0$ | $-1$ |
| $0$ | $1$ | $S_1$ | $0$ | $0$ |
| $1$ | $1 - S_1 x$ | | | |
| $\vdots$ | | | | |
| $2t$ | | | | |

Step 3: Calculation of $X_i$ from the $\Lambda_i$ (Chien's search): If $\alpha^{-i}$ is a root of $\Lambda(x)$, an error is present at location $i$.

Step 4: Calculation of $Y_i$ (Forney's formula): To evaluate the error magnitudes, we use Forney's formula

$$ Y_i = -\frac{\Omega(X_i^{-1})}{\Lambda'(X_i^{-1})}, \tag{6} $$

where $\Lambda'(x)$ is the formal derivative of $\Lambda(x)$ and

$$ \Omega(x) = S_1 + (S_2 + \Lambda_1 S_1)x + (S_3 + \Lambda_1 S_2 + \Lambda_2 S_1)x^2 + \cdots + (S_\upsilon + \Lambda_1 S_{\upsilon-1} + \cdots + \Lambda_{\upsilon-1} S_1)x^{\upsilon-1}. $$

The final step is to add the $e(x)$ obtained from $X_i$ and $Y_i$ to $r(x)$ to get the decoded codeword polynomial $c(x)$.

ASD decoding facilitates the correction of errors beyond bounded distance decoding by using the channel reliability information. Among the ASD methods, the GS and KV algorithms give better performance at the expense of higher complexity. On the other hand, Chase decoding offers a low complexity solution with comparable performance. We have therefore proceeded with the Chase decoder implementation. The Chase decoder applies an HDD to a set of test vectors and is easy to implement without compromising on performance. It corrects codewords within a decoding radius larger than $d_{\min}/2$.
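Because the Chase decoder reuses the HDD of Section III as its inner block, it may help to see the four steps end to end in software. The following is a minimal Python sketch, not the hardware architecture of this paper: it uses Massey's shift-register form of the Berlekamp iteration (which yields the same $\Lambda(x)$ as the tabular procedure), it assumes the field polynomial 0x11D for GF($2^8$) since the text does not name one, and all function names, including the demo-only encoder `rs_encode`, are hypothetical.

```python
# GF(2^8) log/antilog tables. The primitive polynomial 0x11D is an assumption;
# the paper does not state which field polynomial was used.
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = EXP[_i + 255] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ginv(a):
    return EXP[255 - LOG[a]]  # a must be nonzero

def poly_eval(p, x):
    """Horner evaluation; p[i] is the coefficient of x**i."""
    y = 0
    for coeff in reversed(p):
        y = gmul(y, x) ^ coeff
    return y

def syndromes(r, t):
    """Step 1: S_j = r(alpha^j) for j = 1..2t; S[0] holds S_1."""
    return [poly_eval(r, EXP[j]) for j in range(1, 2 * t + 1)]

def berlekamp_massey(S):
    """Step 2: error locator Lambda(x) and its length L from the syndromes."""
    lam, prev = [1], [1]        # current Lambda and the Lambda saved at row rho
    L, shift, d_prev = 0, 1, 1  # shift = mu - rho, d_prev = d_rho
    for mu in range(len(S)):
        d = S[mu]               # discrepancy d_mu
        for i in range(1, L + 1):
            d ^= gmul(lam[i], S[mu - i])
        if d == 0:
            shift += 1
            continue
        coef = gmul(d, ginv(d_prev))
        new = lam + [0] * max(0, len(prev) + shift - len(lam))
        for i, c in enumerate(prev):
            new[i + shift] ^= gmul(coef, c)
        if 2 * L <= mu:         # length change: this row becomes the new rho
            prev, d_prev, L, shift = lam, d, mu + 1 - L, 1
        else:
            shift += 1
        lam = new
    return lam, L

def chien_search(lam, n):
    """Step 3: position i is in error if alpha^(-i) is a root of Lambda."""
    return [i for i in range(n) if poly_eval(lam, EXP[(255 - i) % 255]) == 0]

def forney(S, lam, locs):
    """Step 4: Y_i = Omega(X_i^-1) / Lambda'(X_i^-1); signs vanish in GF(2^m)."""
    v = len(lam) - 1
    omega = [0] * v             # Omega(x) = S(x) Lambda(x) mod x^v
    for k in range(v):
        for i in range(k + 1):
            omega[k] ^= gmul(lam[i], S[k - i])
    # Formal derivative: in characteristic 2 only odd-degree terms survive.
    dlam = [lam[i] if i % 2 else 0 for i in range(1, v + 1)]
    vals = {}
    for loc in locs:
        x_inv = EXP[(255 - loc) % 255]
        vals[loc] = gmul(poly_eval(omega, x_inv), ginv(poly_eval(dlam, x_inv)))
    return vals

def rs_decode(r, t):
    """Return the corrected word, or None when a decoding failure is detected."""
    S = syndromes(r, t)
    lam, L = berlekamp_massey(S)
    locs = chien_search(lam, len(r))
    if L > t or len(locs) != L:  # locator length and root count disagree
        return None
    c = list(r)
    for loc, y in forney(S, lam, locs).items():
        c[loc] ^= y
    return c

def rs_encode(msg, t):
    """Demo-only non-systematic encoder: c(x) = m(x) g(x)."""
    g = [1]
    for j in range(1, 2 * t + 1):  # g(x) <- g(x) * (x + alpha^j)
        g = [gmul(EXP[j], a) ^ b for a, b in zip(g + [0], [0] + g)]
    c = [0] * (len(msg) + 2 * t)
    for i, m in enumerate(msg):
        for k, gk in enumerate(g):
            c[i + k] ^= gmul(m, gk)
    return c

codeword = rs_encode(list(range(1, 240)), 8)    # RS(255,239), t = 8
received = list(codeword)
for pos in (0, 3, 17, 64, 100, 128, 200, 254):  # inject 8 symbol errors
    received[pos] ^= 0x5A
decoded = rs_decode(received, 8)
```

Returning `None` when the locator length and the number of Chien roots disagree gives the failure flag that a Chase-type decoder can use to reject a test vector.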
LCC decoding is based on generating $2^{\eta}$ test vectors from the received vector using the symbol reliability information: the $\eta$ least reliable symbols out of the $n$ symbols are selected, and for each of them either the hard decision or the second most reliable decision is employed. To create the test vectors, a ratio between the probability of the hard-decision symbol and that of the second-best decision is established. This ratio indicates how good the hard decision is. The desired probability ratio for the received message polynomial $r(x)$ is

$$ \Pi_{1\times n} = \left[ \frac{P(r_0 \mid r_0^{HD})}{P(r_0 \mid r_0^{2HD})} \;\cdots\; \frac{P(r_{n-1} \mid r_{n-1}^{HD})}{P(r_{n-1} \mid r_{n-1}^{2HD})} \right], \tag{7} $$

where $r_i^{HD}$ is the hard decision of the symbol $r_i$, and $r_i^{2HD}$ its second most reliable decision. Corresponding to the $\eta$ points with the worst probability ratio (between the hard decision and the second-best decision), a set of $2^{\eta}$ combinations called test vectors is created by selecting either the hard-decision symbol ($r_i^{HD}$) or the second-best decision ($r_i^{2HD}$) at the $\eta$ least reliable points. Obtaining the second-best decision from the message symbol probabilities is itself complex to compute. For the sake of implementation, we follow a reasonable and simple method of generating the second-best decision and the test vectors. Let us consider the following setting to generate symbol reliabilities and second most reliable decisions.

Setting: BPSK modulation over the AWGN channel. Consider a codeword $c = [c_1\, c_2 \cdots c_n]$, where $c_i = [c_{i1}\, c_{i2} \cdots c_{im}]$ is an $m$-bit vector. Each bit $c_{ij}$ is modulated to $x_{ij} = 1 - 2c_{ij}$. Let $r_{ij}$ denote the real-valued channel observation corresponding to $c_{ij}$, i.e., $r_{ij} = x_{ij} + n_{ij}$, where $n_{ij}$ are Gaussian random noise samples. At the receiver, hard slicing is done and the received vector thus formed is $y^{[HD]}$. A symbol decision $y_i^{[HD]}$ is made on each transmitted symbol $c_i$ (from the observations $r_{i1}, \cdots, r_{im}$). The channel provides the reliability of each bit received.
Among the $m$ bits in a symbol, the least reliable bit defines the worst case reliability of that symbol. Define the $i$-th symbol reliability $\lambda_i$ as [2]

$$ \lambda_i \triangleq \min_{1 \le j \le m} |r_{ij}|. $$

IV SOFT DECISION DECODING

Algebraic soft decision decoding (ASD) of RS codes provides higher coding gain over conventional HDD, but involves high computational complexity.

The value $\lambda_i$ indicates the confidence in the symbol decision $y_i^{[HD]}$. The higher the value of $\lambda_i$, the greater the reliability, and vice versa. The second most reliable decision $y_i^{[2HD]}$ is obtained from $y_i^{[HD]}$ by complementing (flipping) the bit that achieves the minimum in the above equation. The key steps in the LCC algorithm are as follows.

• The set of $2^{\eta}$ test vectors is generated using the relation

$$ y_i = \begin{cases} y_i^{[HD]}, & \text{for } i \notin I \\ \{y_i^{[HD]},\, y_i^{[2HD]}\}, & \text{for } i \in I \end{cases} $$

where $I$ is the index set of the $\eta$ least reliable symbols.
• Each test vector is passed serially to the HDD explained in Section III, and the vector for which no decoding failure results is taken as the estimated codeword at the receiver.

V DESIGN AND IMPLEMENTATION

We proceeded with the BM algorithm, explained in Section III, to implement a hard decision decoder (HDD) circuit on an FPGA. Pipelining is applied to the RS decoder to achieve higher throughput. A popular three-stage pipelined RS decoder is shown in Figure 1. In a pipelined architecture, the overall throughput is decided by the slowest pipeline stage and its computation time. Hence, to increase the throughput with efficient area utilization, each pipeline stage should complete its computations in about the same amount of time. The key equation solver (KES) stage, which computes the error locator polynomial using the BM algorithm, requires $2t$ iterations.
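The stage-balancing argument can be made concrete with the per-stage cycle counts of Table 2 and the throughput and latency expressions (8) to (10) given in Section VI. The short Python sketch below simply evaluates those expressions; the function names are hypothetical, $n/J$ is treated as a real number (a hardware implementation would presumably round it up), and $J$ is assumed to be the degree of parallelism in the Chien search stage, which the text does not define explicitly.

```python
def hdd_throughput_gbps(n, m, t, J, fmax_hz):
    """Expression (8): the slowest pipeline stage sets the block rate."""
    slowest = max(2 * t, t + n / J)  # KES stage vs. Chien/error-magnitude stage
    return n * m * fmax_hz / slowest / 1e9

def hdd_latency_cycles(n, t, J, merged_sc_kes):
    """Expression (9) for three stages; approximation (10) when SC and KES merge."""
    return (3 * t if merged_sc_kes else 5 * t) + n / J

# RS(255,239) over GF(2^8): n = 255, m = 8, t = 8, with J = 10 and 200 MHz
# taken from Section VI.
tput = hdd_throughput_gbps(n=255, m=8, t=8, J=10, fmax_hz=200e6)
lat3 = hdd_latency_cycles(255, 8, 10, merged_sc_kes=False)
lat2 = hdd_latency_cycles(255, 8, 10, merged_sc_kes=True)
print(f"{tput:.2f} Gbps, {lat3} cycles (3-stage), {lat2} cycles (2-stage)")
# prints: 12.18 Gbps, 65.5 cycles (3-stage), 49.5 cycles (2-stage)
```

With these parameters the Chien search and error magnitude stage (33.5 cycles) is slower than the KES stage (16 cycles) and therefore limits the throughput. The computed 12.18 Gbps is in line with the 12 Gbps reported in Table 3, while the approximate two-stage latency of 49.5 cycles is somewhat below the 55 cycles reported there.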
Parallelism can be employed to adjust the number of clock cycles required for the syndrome computation (SC) stage and for the Chien search and error magnitude computation stage. In addition, we use a delay buffer to hold the received symbols. The size of the delay buffer depends on the latency of the decoder.

Figure 1: Three-stage pipelined decoder with the delay buffer; the stages are separated by pipeline registers [courtesy [3], chap. 4, page 70]. (Blocks: received symbols, syndrome computation, key equation solver, Chien search & error magnitude computation, estimated codeword symbols, delay buffer.)

Figure 2: A two-stage low latency pipelined decoder in which the syndrome computation and the KES are merged into a single stage. (Blocks: received symbols, syndrome computation, key equation solver, Chien search & error magnitude computation, estimated codeword symbols, delay buffer.)

For the LCC algorithm of Section IV, the index set $I$ of the least reliable symbols is formed as follows.

• Sort the $n$ symbol reliabilities $\lambda_1, \lambda_2, \cdots, \lambda_n$ in increasing order, i.e., $\lambda_{i_1} \le \lambda_{i_2} \le \cdots \le \lambda_{i_n}$.
• Form an index set $I \triangleq \{i_1, i_2, \cdots, i_\eta\}$ denoting the $\eta$ smallest reliability values.

The LCC (255,239) decoder based on the HDD is shown in Figure 3. The main steps for LCC decoding are (i) multiplicity assignment, (ii) creation of the test vectors, and (iii) selection of the correct test vector and decoding. The LCC decoding process creates $2^{\eta}$ different test vectors using the reliability information from the received points, where $\eta < 2t$. It is proved in [4] that for $\eta > 2t$ the decoder complexity is so high that its implementation is impractical. If one of the $2^{\eta}$ test vectors has fewer than $t + 1$ errors, then the decoder will be able to correct them in all cases. We use the BMA-based hard-decision two-stage pipelined RS (255,239) decoder that we designed within this architecture. For the BMA-based RS decoder, a decoder failure happens if the received message generates an error locator polynomial $\Lambda(x)$ whose degree is higher than the number of roots obtained during the Chien search procedure.
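The test-vector construction and the failure-driven selection described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the hardware implementation: the bit ordering (most significant bit first), the function names, and the `hdd` callable (any hard-decision decoder that returns `None` on a decoding failure) are all hypothetical.

```python
from itertools import product

def hard_and_second(obs):
    """obs: the m real channel observations of one symbol (BPSK, x = 1 - 2c).

    Returns (hard decision, second most reliable decision, reliability lambda).
    Bits are packed most significant bit first, which is an assumed convention.
    """
    m = len(obs)
    hd = 0
    for r in obs:
        hd = (hd << 1) | (1 if r < 0 else 0)  # hard slicing of each bit
    j = min(range(m), key=lambda k: abs(obs[k]))  # least reliable bit
    second = hd ^ (1 << (m - 1 - j))              # flip that bit
    return hd, second, abs(obs[j])

def lcc_test_vectors(received, eta):
    """received: n symbols, each a list of m observations. Returns 2**eta vectors."""
    info = [hard_and_second(sym) for sym in received]
    hd = [h for h, _, _ in info]
    # Index set I: the eta symbols with the smallest reliabilities.
    order = sorted(range(len(received)), key=lambda i: info[i][2])
    index_set = order[:eta]
    vectors = []
    for choice in product((0, 1), repeat=eta):
        v = list(hd)
        for pick, i in zip(choice, index_set):
            if pick:
                v[i] = info[i][1]  # use the second-best decision at position i
        vectors.append(v)
    return vectors

def lcc_decode(received, eta, hdd):
    """Pass the test vectors serially to hdd; keep the first that does not fail."""
    for v in lcc_test_vectors(received, eta):
        c = hdd(v)
        if c is not None:
            return c
    return None

# Toy example: n = 4 symbols of m = 2 bits each, eta = 2.
toy = [[0.9, -1.1], [0.1, 0.8], [-0.7, -0.2], [1.2, 1.0]]
vectors = lcc_test_vectors(toy, eta=2)
# vectors == [[1, 0, 3, 0], [1, 0, 2, 0], [1, 2, 3, 0], [1, 2, 2, 0]]
```

In the toy example the two least reliable symbols are at indices 1 and 2, so only those positions vary across the four test vectors. With $\eta = 3$, as in the decoder specified in this paper, eight test vectors would be produced per received word.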
When such a decoding failure occurs, the number of errors is more than $t$, and therefore the decoder cannot recover the message correctly.

Figure 3: LCC soft-decoder based on decoding failure. (Blocks: channel info, multiplicity assignment, test vector generation, syndrome computation, key equation solver, Chien search, Forney's algorithm, delay buffer, decoding failure detector, degree counter, root counter; signals $r(x)$, $e(x)$, $c(x)$ and decode failure.)

Specifications of the LCC decoder are as follows.

• RS (255,239) decoder over GF($2^8$) with $\eta = 3$.
• The channel reliability information is represented with 4 bits.
• Inputs to the decoder: $y^{[HD]}$, the reliability information $\lambda_i$ of each symbol $y_i^{[HD]}$, and information about the bit that needs to be flipped in a symbol.

The latency of the decoder can be reduced significantly, without affecting the throughput much, if the SC and the KES are merged into a single pipeline stage as shown in Figure 2. This is possible because the SC stage outputs one syndrome every clock cycle and the KES stage takes one syndrome as input every clock cycle from an array of syndromes.

VI RESULTS AND COMPARISONS

The proposed hard and soft RS decoders were implemented in VHDL on a Kintex-7 FPGA KC705 Evaluation Kit. They were functionally verified using the ModelSim simulator and ChipScope Pro. The outputs from the VHDL-coded architecture were validated against Matlab and a C-coded model. The throughput and latency expressions for the decoders are given below.

Table 2: Clock cycles required for each stage of the RS decoder.

| Stage | Clock cycles |
|---|---|
| Syndrome computation | $2t$ |
| Key equation solver | $2t$ |
| Chien and error magnitude computation | $t + \frac{n}{J}$ |

Table 2 shows the clock cycles taken by each pipeline stage in the HDD. The throughput of the HDD as a function of the error correction capability is given by

$$ \text{Throughput}(t) = \frac{n \times m \times f_{\max}}{\max\{2t,\; t + \frac{n}{J}\}} \ \text{(Gbps)}, \tag{8} $$

where $n$, $m$, $t$ and $f_{\max}$ denote the code length, symbol width, error correcting capability and maximum operating frequency respectively. The total latency of the designed 3-stage HDD in Figure 1 is the sum of the cycles required for each stage as per Table 2:

$$ \text{Latency} = 2t + 2t + t + \frac{n}{J} = 5t + \frac{n}{J} \ \text{(cycles)}. \tag{9} $$

The latency is reduced further by the proposed 2-stage HDD architecture in Figure 2:

$$ \text{Latency} \approx 3t + \frac{n}{J} \ \text{(cycles)}. \tag{10} $$

An RS decoder of block length $n = 255$ with $J = 10$ is implemented over GF($2^8$). Table 3 shows the comparison between our decoder with $J = 10$ and the RS decoder v9.0 Xilinx IP for RS (255,239) on a Kintex-7 FPGA. No BRAMs are used in our decoder, as all the needed Galois field arithmetic is implemented using LUTs and FFs. The Xilinx IP [5] uses a large amount of memory, namely two 18k BRAMs and one 36k BRAM, which reduces its utilization of LUTs and FFs compared to our decoder. If we consider BRAMs, LUTs and FFs altogether, we estimate that our decoder and the Xilinx IP use almost the same hardware resources. The throughput of our architecture is almost five times higher, which is a significant improvement. The latency of our decoder is almost nine times lower than that of the Xilinx IP.

Table 3: Comparison of RS (255,239) designs.

| | Xilinx IP | Proposed |
|---|---|---|
| LUTs | 1177 | 8774 |
| FFs | 1169 | 5272 |
| 36k BRAMs | 1 | 0 |
| 18k BRAMs | 2 | 0 |
| Max. freq (MHz) | 292 | 200 |
| Max. throughput (Gbps) | 2.33 | 12 |
| Latency (cycles) | 470 | 55 |

Similarly, Table 4 shows the comparison between the existing LCC decoder [6] and our designed LCC decoder.

Table 4: Comparison of RS (255,239) LCC designs with $\eta = 3$.

| | Existing [6] | Designed |
|---|---|---|
| LUTs | 5114 | 18584 |
| FFs | 5399 | 16621 |
| BRAMs | - | 0 |
| Max. freq (MHz) | 150.5 | 161.29 |
| Max. throughput (Gbps) | 0.71 | 1.29 |
| Latency (cycles) | 1312 | 559 |
| Platform | Virtex-V | Kintex-7 |
| Xilinx ISE version | 9.2i | 14.6 |

The LCC soft decision decoder performs better than the HDD, as shown in Figure 4. The LCC decoder achieves 0.25 dB of coding gain over the HDD and corrects errors beyond the bounded distance.

Figure 4: FER performance for a RS (255,239) code over an AWGN channel with BPSK modulation.

VII CONCLUDING REMARKS

We designed a high-speed low-complexity Reed-Solomon decoder architecture based on the Berlekamp-Massey hard decision decoding algorithm. Our design consists of three modules, namely syndrome computation, key equation solver (KES) and error magnitude computation. We proposed a low complexity KES architecture that takes exactly $2t$ cycles. This architecture results in a 5 times improvement in throughput relative to the Xilinx IP [5]. We use an efficient error locator and magnitude computation architecture presented in [3] to find error locations and magnitudes simultaneously. We also propose and employ a two-stage pipelined RS decoder that merges the SC and KES blocks, resulting in a 9 times reduction in the latency of the decoder at the same throughput. The number of slice LUTs and slice registers is higher in our HDD because we used logic gates for Galois field arithmetic instead of BRAMs. The proposed design can run at a maximum frequency of 200 MHz and can correct a maximum of 8 symbol errors within a 255 symbol codeword. As mentioned in Section 2.5, the user has the freedom to change the error correcting capability of the decoder, which we designed as a programmable parameter. We also implemented a soft decision based Chase decoder that additionally takes the channel reliability information as input. The designed HDD is used as a module within the Chase decoder. The Chase decoder can correct more than 8 symbol errors within a 255 symbol codeword.
Our Chase decoder achieves approximately twice the throughput and half the latency of the Chase decoder presented in [6]. It also provides a 0.25 dB coding gain over the hard decision based decoder.

REFERENCES

[1] Shu Lin and D. J. Costello, Error Control Coding (2nd Edition), Prentice Hall, 2004.
[2] Wei An, "Complete VLSI Implementation of Improved Low Complexity Chase Reed-Solomon Decoders", Thesis, Massachusetts Institute of Technology, September 2010.
[3] Xinmiao Zhang, VLSI Architectures for Modern Error-Correcting Codes, CRC Press, 2015.
[4] J. Bellorado, "Low-complexity soft decoding algorithms for Reed-Solomon codes", Ph.D. thesis, Harvard University, 2006.
[5] Reed-Solomon Decoder v9.0 LogiCORE IP Product Guide, Vivado Design Suite, PG107, November 18, 2015. Available: http://www.xilinx.com
[6] F. G. Herrero et al., "High-Speed RS (255, 239) Decoder Based on LCC Decoding", Circuits Syst Signal Process, 30:1643-1669, DOI 10.1007/s00034-011-9327-4, June 2011.