PROC. OF ANNUAL PRODUCT CONFERENCE–2016
INDIAN INSTITUTE OF SCIENCE
Design Architecture and Implementation of a High
Throughput Low Latency Reed - Solomon Decoder
Y Vamshikrishna, T SV Satyannarayana, Shayan G. Srinivasa, DESE, IISc
ABSTRACT
Error correcting codes (ECC) are used in virtually all communication systems involving transmission and
storage. The needs of today's ECC systems include not only efficient code designs that guarantee low error
rates, but also architectures mindful of area, power and throughput for a viable system-on-chip (SoC).
The Reed-Solomon (RS) code is one of the most powerful and versatile non-binary algebraic error correcting
codes, with guaranteed burst error correction capability. High-speed data transmission techniques such as
fiber-optic networking systems require efficient error correcting codes that support high throughput to
meet the continuing demand for higher data rates. In this paper, we propose a novel low complexity
architecture for hard decision decoding (HDD) of RS codes with minimal latency and maximal throughput.
We use our design towards the construction of a soft decision based Chase decoder for correcting
symbol errors beyond the minimum distance decoding.
I INTRODUCTION
Reed-Solomon codes were invented in 1960, but it was not until the late sixties that Berlekamp and Massey
devised an efficient algorithm for decoding them. Today, RS codes are ubiquitous (in data storage centres,
storage media and transmission channels); billions of dollars are invested in products containing such
error-correcting encoders and decoders, and petabytes of data are decoded each minute using such codes.
It is no exaggeration to say that three-quarters of the codes used today are Reed-Solomon codes.
Reed-Solomon codes have many interesting properties, such as their guaranteed random-error-correction
capability, burst-error-correction capability, and erasure-recovery capability, making them very
appealing for many applications [1].
There are two main categories of Reed-Solomon decoding: hard-decision decoding (HDD) and algebraic
soft-decision decoding (ASD). ASD provides better error correction performance than HDD, at the cost of
higher complexity. The hard decision RS decoder can be implemented with two popular algorithms, namely the
Berlekamp-Massey (BM) algorithm and the modified Euclidean (ME) algorithm, which solve the key equation [1]
to compute the error locator polynomial. The ME algorithm is easier to understand and implement, whereas
the BM algorithm is efficient in terms of computational complexity and implementation. ASD methods such as
the Guruswami-Sudan (GS) algorithm, the Koetter-Vardy (KV) algorithm and bit-level generalized minimum
distance (BGMD) decoding involve interpolation and factorization, where the interpolation stage accounts
for a major part of the computations. These decoders are very complex and impractical. On the other hand,
the low-complexity Chase (LCC) decoder requires a maximum multiplicity of one. LCC decoding with eight test
vectors achieves the same frame error rate (FER) performance as the KV algorithm with multiplicity four.
Owing to its maximum multiplicity of one and its lower interpolation and factorization complexity, LCC
decoding is one of the preferred ASD methods.
© DESE-Dept. Electronic Systems Engineering, IISc, Bangalore
We embarked on this project motivated by the need for a flexible-rate RS architecture with a provably
efficient design from a throughput and latency perspective, applicable to bursty optical communication.
RS codes offer a flexible range of codeword lengths and code rates. Besides traditional applications such
as magnetic and optical recording, digital television, cable modems, etc., RS codes are also finding
their way into emerging applications of forward error correcting codes in medical implants.
II FUNCTIONAL ASPECTS
The primary objective of the proposed project is to implement an RS decoder using a hard decision decoding
algorithm, with an eye towards a hardware-efficient soft decision decoding circuit. A soft-decision
decoding algorithm is implemented to enhance the correction capability beyond bounded distance decoding.
The implementation targets maximum throughput and minimum latency. Other important aspects such as power
dissipation and area must also be carefully minimized for a feasible design.
III HARD DECISION DECODING
The parameters of the RS (n, k, t) code are the code length (n), the message length (k) and the error
correcting capability (t).
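For reference, the three parameters are tied together by the standard relation for RS codes (a textbook fact that the text above does not state explicitly). Applied to the RS (255,239) code implemented in this work, it gives the 8-symbol correction capability quoted in the concluding remarks:

```latex
d_{\min} = n - k + 1, \qquad
t = \left\lfloor \frac{d_{\min} - 1}{2} \right\rfloor
  = \left\lfloor \frac{n - k}{2} \right\rfloor ,
\qquad\text{so for } n = 255,\ k = 239:\quad d_{\min} = 17,\quad t = 8 .
```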
Let a codeword $c(x)$ be corrupted by an additive error $e(x)$, resulting in $r(x)$. Suppose $\upsilon$
denotes the number of errors. Then, $e(x)$ has the following form
$$e(x) = \sum_{i=1}^{\upsilon} y_i x^{i}, \quad 0 \le i \le n-1, \tag{1}$$
where $y_i$ is the error magnitude at error location $i$.
The syndromes can be calculated as
$$S_j = r(\alpha^{j}) = e(\alpha^{j}) = \sum_{i=1}^{\upsilon} Y_i X_i^{j}, \quad 1 \le j \le 2t, \tag{2}$$
where $Y_i = y_i$ and $X_i = \alpha^{i}$. The aim is to solve the above $2t$ equations to get the pairs
$(X_i, Y_i)$. Let us define a polynomial known as the error locator polynomial $\Lambda(x)$
$$\Lambda(x) = \prod_{i=1}^{\upsilon} (1 + X_i x) = \Lambda_0 + \Lambda_1 x + \cdots + \Lambda_{\upsilon-1} x^{\upsilon-1} + \Lambda_{\upsilon} x^{\upsilon}. \tag{3}$$
The $X_i$ values are obtained as the inverses of the roots of $\Lambda(x)$ in (3). Given the values $X_i$,
the equations (2) are linear in $Y_i$ and can be solved.
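As a minimal sketch of the syndrome computation in (2), the Python fragment below evaluates r(α^j) for j = 1, ..., 2t over GF(2^8). The primitive polynomial 0x11D, the table-based multiplier and the helper names (`gf_mul`, `poly_eval`, `syndromes`) are illustrative assumptions; the paper does not specify them, and a hardware decoder would use parallel multiply-accumulate cells rather than a software loop.

```python
# Illustrative GF(2^8) arithmetic. The primitive polynomial 0x11D
# (x^8 + x^4 + x^3 + x^2 + 1) is an assumption, not taken from the paper.
PRIM = 0x11D
EXP = [0] * 510          # EXP[i] = alpha^i (table doubled to avoid a modulo)
LOG = [0] * 256          # LOG[alpha^i] = i
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(p, x):
    """Evaluate p at x by Horner's rule; p[i] is the coefficient of x^i."""
    acc = 0
    for coeff in reversed(p):
        acc = gf_mul(acc, x) ^ coeff
    return acc


def syndromes(r, t):
    """Return [S_1, ..., S_2t] with S_j = r(alpha^j), as in equation (2)."""
    return [poly_eval(r, EXP[j]) for j in range(1, 2 * t + 1)]


# Example: the all-zero codeword hit by one error of magnitude 5 at location 3.
r = [0] * 255
r[3] = 5
S = syndromes(r, t=8)
# For a single error, S_j = Y * X^j with X = alpha^3 and Y = 5.
assert all(S[j - 1] == gf_mul(5, EXP[(3 * j) % 255]) for j in range(1, 17))
```

An error-free codeword produces all-zero syndromes, which is why the example can use the all-zero codeword plus an error pattern.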
The error correction involves a 4-step procedure.
Step 1: Calculation of the syndromes $S_j$:
The syndromes are evaluated from $r(x)$ as per (2).
Step 2: Calculation of the $\Lambda_i$ from the $S_j$: (Berlekamp)
We can compute $\Lambda(x)$ iteratively in $2t$ steps [1]. Let $\Lambda^{(\mu)}(x)$ denote the error
locator polynomial at the $\mu$th step of the iteration. To find $\Lambda(x)$ iteratively, we start with
the initialized Table 1 and proceed to fill out the table. Let $l_\mu$ be the degree of
$\Lambda^{(\mu)}(x)$. Assuming that we have filled out the $\mu$th row, we find the $(\mu+1)$th row using
the procedure shown below.
1. If $d_\mu = \sum_{i=0}^{l_\mu} S_{\mu+1-i} \Lambda_i^{(\mu)} = 0$, then
$\Lambda^{(\mu+1)}(x) = \Lambda^{(\mu)}(x)$ and $l_{\mu+1} = l_\mu$.
2. If $d_\mu \neq 0$, then we search for another row $\rho$ prior to the $\mu$th row where
$d_\rho \neq 0$ and the number $\rho - l_\rho$ in the last column of Table 1 has the largest value.
$\Lambda^{(\mu+1)}(x)$ and $l_{\mu+1}$ are then updated as
$$\Lambda^{(\mu+1)}(x) = \Lambda^{(\mu)}(x) - d_\mu d_\rho^{-1} x^{\mu-\rho} \Lambda^{(\rho)}(x), \tag{4}$$
$$l_{\mu+1} = \max[\,l_\mu,\ l_\rho + \mu - \rho\,]. \tag{5}$$
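Table 1: Berlekamp's iterative procedure for finding the error locator polynomial $\Lambda(x)$ of an RS
code [courtesy [1], chap. 6, page 210].
| $\mu$ | $\Lambda^{(\mu)}(x)$ | $d_\mu$ | $l_\mu$ | $\mu - l_\mu$ |
|---|---|---|---|---|
| $-1$ | $1$ | $1$ | $0$ | $-1$ |
| $0$ | $1$ | $S_1$ | $0$ | $0$ |
| $1$ | $1 - S_1 x$ | | | |
| $\vdots$ | | | | |
| $2t$ | | | | |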
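The row-by-row procedure of Step 2 can be sketched in Python as follows. This is an illustrative software model only: it keeps every row of Table 1 in a dictionary, which mirrors the description above but is not how a hardware KES stage would be organised, and the GF(2^8) primitive polynomial 0x11D and all function names are assumptions of the sketch.

```python
# Illustrative GF(2^8) tables; the primitive polynomial 0x11D is an assumption.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_inv(a):
    """Multiplicative inverse of a nonzero GF(2^8) element."""
    return EXP[255 - LOG[a]]


def berlekamp(S):
    """Berlekamp's iterative procedure following Table 1.

    S holds [S_1, ..., S_2t]; the result is the coefficient list of
    Lambda(x), lowest degree first.
    """
    rows = {-1: ([1], 1, 0)}      # mu -> (Lambda^(mu), d_mu, l_mu)
    lam, l = [1], 0               # row mu = 0
    for mu in range(len(S)):
        # Discrepancy d_mu = sum_{i=0}^{l_mu} S_{mu+1-i} * Lambda_i^(mu)
        d = 0
        for i in range(min(l, mu, len(lam) - 1) + 1):
            d ^= gf_mul(S[mu - i], lam[i])
        rows[mu] = (lam, d, l)
        if d == 0:
            continue              # case 1: row mu+1 repeats row mu
        # Case 2: earlier row rho with d_rho != 0 and the largest rho - l_rho
        rho = max((r for r in rows if r < mu and rows[r][1] != 0),
                  key=lambda r: r - rows[r][2])
        lam_rho, d_rho, l_rho = rows[rho]
        scale = gf_mul(d, gf_inv(d_rho))
        shift = mu - rho
        new = lam + [0] * max(0, len(lam_rho) + shift - len(lam))
        for i, coeff in enumerate(lam_rho):
            new[i + shift] ^= gf_mul(scale, coeff)   # subtraction is XOR
        lam, l = new, max(l, l_rho + mu - rho)       # equations (4) and (5)
    return lam


# Example: two errors, magnitudes 5 and 7 at locations 3 and 10, t = 8.
X1, X2 = EXP[3], EXP[10]
S = [gf_mul(5, EXP[(3 * j) % 255]) ^ gf_mul(7, EXP[(10 * j) % 255])
     for j in range(1, 17)]
lam = berlekamp(S)
# Lambda(x) = (1 + X1 x)(1 + X2 x) = 1 + (X1 + X2) x + X1 X2 x^2
assert lam == [1, X1 ^ X2, gf_mul(X1, X2)]
```

In GF(2^m) addition and subtraction coincide, so the minus sign in (4) becomes an XOR in the sketch.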
Step 3: Calculation of $X_i$ from the $\Lambda_i$: (Chien's search)
If $\alpha^{-i}$ is a root of $\Lambda(x)$, the error is present at location $i$.
Step 4: Calculation of $Y_i$: (Forney's formula)
To evaluate the error magnitudes, we use Forney's formula
$$Y_i = -\frac{\Omega(X_i^{-1})}{\Lambda'(X_i^{-1})}, \tag{6}$$
where $\Lambda'(x)$ is the formal derivative of $\Lambda(x)$ and
$$\Omega(x) = S_1 + (S_2 + \Lambda_1 S_1)x + (S_3 + \Lambda_1 S_2 + \Lambda_2 S_1)x^2 + \cdots + (S_\upsilon + \Lambda_1 S_{\upsilon-1} + \cdots + \Lambda_{\upsilon-1} S_1)x^{\upsilon-1}.$$
The final step is to add $e(x)$, obtained from $X_i$ and $Y_i$, to $r(x)$ to get the decoded
codeword polynomial $c(x)$.
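Steps 3 and 4 can be sketched in Python as below: a brute-force Chien search over all n locations followed by Forney's formula (6) with Ω(x) built exactly as defined above. The GF(2^8) primitive polynomial 0x11D and the function names are assumptions made for illustration; the architecture of [3] used in this work computes locations and magnitudes simultaneously in hardware, which this sequential sketch does not attempt to model.

```python
# Illustrative GF(2^8) tables; the primitive polynomial 0x11D is an assumption.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_inv(a):
    return EXP[255 - LOG[a]]


def poly_eval(p, x):
    """Horner evaluation; p[i] is the coefficient of x^i."""
    acc = 0
    for coeff in reversed(p):
        acc = gf_mul(acc, x) ^ coeff
    return acc


def chien_search(lam, n=255):
    """Return every location i for which alpha^(-i) is a root of Lambda."""
    return [i for i in range(n)
            if poly_eval(lam, EXP[(255 - i) % 255]) == 0]


def forney(S, lam, locations):
    """Error magnitudes Y_i = Omega(X_i^-1) / Lambda'(X_i^-1), equation (6).

    The minus sign of (6) vanishes in characteristic 2.
    """
    v = len(lam) - 1
    # Omega coefficient of x^k is sum_{i=0}^{k} Lambda_i * S_{k+1-i}
    omega = [0] * v
    for k in range(v):
        for i in range(k + 1):
            omega[k] ^= gf_mul(lam[i], S[k - i])
    # Formal derivative: only odd-degree terms of Lambda survive in GF(2^m)
    deriv = [lam[i] if i % 2 == 1 else 0 for i in range(1, len(lam))]
    mags = []
    for loc in locations:
        x_inv = EXP[(255 - loc) % 255]
        mags.append(gf_mul(poly_eval(omega, x_inv),
                           gf_inv(poly_eval(deriv, x_inv))))
    return mags


# Example: two errors, magnitudes 5 and 7 at locations 3 and 10.
X1, X2 = EXP[3], EXP[10]
lam = [1, X1 ^ X2, gf_mul(X1, X2)]      # (1 + X1 x)(1 + X2 x)
S = [gf_mul(5, EXP[(3 * j) % 255]) ^ gf_mul(7, EXP[(10 * j) % 255])
     for j in range(1, 17)]
locations = chien_search(lam)
magnitudes = forney(S, lam, locations)
assert locations == [3, 10] and magnitudes == [5, 7]
```

Adding (XOR-ing) each magnitude into the received symbol at its location then yields the corrected codeword.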
IV SOFT DECISION DECODING
Algebraic soft decision decoding (ASD) of RS codes provides higher coding gain over conventional HDD, but
involves high computational complexity. ASD facilitates the correction of errors beyond bounded distance
decoding by using the channel reliability information. Among the ASD methods, the GS and KV algorithms
give better performance at the expense of higher complexity. On the other hand, Chase decoding offers a
low complexity solution with comparable performance. We have therefore proceeded with the Chase decoder
implementation.
The Chase decoder applies an HDD decoder to a set of test vectors and is easy to implement without
compromising on performance. It corrects codewords within a decoding radius $t > d_{\min}/2$. LCC decoding
is based on generating $2^{\eta}$ test vectors from the received vector, based on the symbol reliability
information, where $\eta$ symbols are selected as the least reliable symbols out of $n$ symbols, for which
either the hard decision or the second most reliable decision is employed. To create the test vectors, a
ratio between the probability of the hard-decision symbol and that of the second-best decision is
established. This ratio indicates how good the hard decision is. The desired probability ratio for the
received message polynomial $r(x)$ is
$$\Pi = \left[ \frac{P(r_0 \mid r_{0,HD})}{P(r_0 \mid r_{0,2HD})} \ \cdots\ \frac{P(r_{n-1} \mid r_{n-1,HD})}{P(r_{n-1} \mid r_{n-1,2HD})} \right]_{1 \times n} \tag{7}$$
where $r_{i,HD}$ is the hard decision of the symbol $r_i$, and $r_{i,2HD}$ its second most reliable
decision. Corresponding to the $\eta$ points with the worst probability ratio (between the hard decision
and the second-best decision), a set of $2^{\eta}$ combinations called test vectors is created by
selecting the hard-decision symbol ($r_{i,HD}$) or the second-best decision ($r_{i,2HD}$) at the $\eta$
least reliable points. Obtaining the second-best decision from the message symbol probabilities is
computationally complex. For the sake of implementation, we follow a reasonable and simple method of
generating the second-best decision and the test vectors. Let us consider the following setting to
generate symbol reliabilities and second most reliable decisions.
Setting: BPSK modulation over the AWGN channel.
Consider a codeword $c = [c_1\ c_2\ \cdots\ c_n]$, where $c_i = [c_{i1}\ c_{i2}\ \cdots\ c_{im}]$ is an
$m$-bit vector. Each bit $c_{ij}$ is modulated to $x_{ij} = 1 - 2c_{ij}$. Let $r_{ij}$ denote the
real-valued channel observation corresponding to $c_{ij}$, i.e., $r_{ij} = x_{ij} + n_{ij}$, where the
$n_{ij}$ are Gaussian random noise samples.
At the receiver, a hard slicing is done and the received vector thus formed is $y^{[HD]}$. A symbol
decision $y_i^{[HD]}$ is made on each transmitted symbol $c_i$ (from the observations
$r_{i1}, \cdots, r_{im}$). The channel provides the reliability of each received bit. Among the $m$ bits
in a symbol, the least reliable bit defines the worst case reliability of that symbol.
Define the $i$th symbol reliability $\lambda_i$ as [2]
$$\lambda_i \triangleq \min_{1 \le j \le m} |r_{ij}|.$$
The value $\lambda_i$ indicates the confidence in the symbol decision $y_i^{[HD]}$. The higher the value
of $\lambda_i$, the greater the reliability, and vice versa. The second most reliable decision
$y_i^{[2HD]}$ is obtained from $y_i^{[HD]}$ by complementing (flipping) the bit that achieves the minimum
in the above equation. The key steps in the LCC algorithm are as follows.
• Sort the $n$ symbol reliabilities $\lambda_1, \lambda_2, \cdots, \lambda_n$ in increasing order, i.e.,
$\lambda_{i_1} \le \lambda_{i_2} \le \cdots \le \lambda_{i_n}$.
• Form an index set $I \triangleq \{i_1, i_2, \cdots, i_\eta\}$ denoting the $\eta$ smallest reliability
values.
• The set of $2^{\eta}$ test vectors is generated using the relation
$$y_i = \begin{cases} y_i^{[HD]}, & \text{for } i \notin I \\ \{y_i^{[HD]},\ y_i^{[2HD]}\}, & \text{for } i \in I \end{cases}$$
• Each test vector is passed serially to the HDD explained in Section III, and the vector for which no
decoding failure results is taken as the estimated codeword at the receiver.
V DESIGN AND IMPLEMENTATION
We proceeded with the BM algorithm, explained in Section III, to implement a hard decision decoder (HDD)
circuit on an FPGA. Pipelining is applied to the RS decoder to achieve higher throughput. A popular
three-stage pipelined RS decoder is shown in Figure 1. In a pipelined architecture, the overall throughput
is decided by the slowest pipeline stage and its computation time. Hence, to increase the throughput with
efficient area utilization, each pipeline stage should complete its computations in about the same amount
of time. The key equation solver (KES) stage, which computes the error locator polynomial using the BM
algorithm, requires $2t$ iterations. Parallelism can be employed to adjust the number of clock cycles
required for the syndrome computation (SC) and the Chien search and error magnitude computation stages.
In addition, we use a delay buffer to buffer the received symbols. The size of the delay buffer depends on
the latency of the decoder.
[Figure 1: block diagram with blocks Received symbols, Syndrome computation, Key equation solver, Chien
search & Error magnitude computation, Estimated codeword symbols, Delay buffer]
Figure 1: Three-stage pipelined decoder with the delay buffer; each stage is separated using pipeline
registers [courtesy [3], chap. 4, page 70].
[Figure 2: block diagram with blocks Received symbols, Syndrome computation, Key equation solver, Chien
search & Error magnitude computation, Estimated codeword symbols, Delay buffer]
Figure 2: A two-stage low latency pipelined decoder in which the syndrome computation and the KES are
merged into a single stage.
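The LCC test-vector construction of Section IV (bit-level hard slicing, symbol reliabilities, the index set of the η least reliable symbols and the 2^η candidate vectors) can be sketched in Python as follows. The function name `lcc_test_vectors`, the representation of a symbol as a tuple of bits and the toy 2-bit symbols in the example are assumptions made for illustration; the decoder described here works on 8-bit symbols with η = 3 and quantised reliabilities.

```python
from itertools import product


def lcc_test_vectors(r, eta):
    """Build the 2**eta LCC test vectors from soft BPSK observations.

    r is a list of n symbols, each a list of m real-valued observations.
    With the mapping x = 1 - 2c, a negative observation slices to bit 1.
    Returns (index_set, test_vectors); each test vector is a list of
    n symbols and each symbol is a tuple of m bits.
    """
    hard, second, reliability = [], [], []
    for obs in r:
        bits = [1 if value < 0 else 0 for value in obs]
        # The least reliable bit has the smallest magnitude.
        j_min = min(range(len(obs)), key=lambda j: abs(obs[j]))
        flipped = list(bits)
        flipped[j_min] ^= 1                  # second most reliable decision
        hard.append(tuple(bits))
        second.append(tuple(flipped))
        reliability.append(abs(obs[j_min]))  # symbol reliability lambda_i

    # Index set I: the eta symbols with the smallest reliability.
    index_set = sorted(range(len(r)), key=lambda i: reliability[i])[:eta]

    vectors = []
    for choice in product((0, 1), repeat=eta):
        candidate = list(hard)
        for position, use_second in zip(index_set, choice):
            if use_second:
                candidate[position] = second[position]
        vectors.append(candidate)
    return index_set, vectors


# Toy example: n = 4 symbols of m = 2 bits, eta = 2.
observations = [[0.9, -1.1], [0.1, 0.8], [-0.2, -1.0], [1.2, 0.7]]
index_set, vectors = lcc_test_vectors(observations, eta=2)
print(index_set)        # positions of the two least reliable symbols
print(len(vectors))     # number of test vectors, 2**eta
```

Each returned vector would then be handed to the hard decision decoder in turn, as described in the last step of the LCC algorithm.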
The LCC (255,239) decoder based on the HDD decoder is shown in Figure 3. The main steps of LCC decoding
are (i) multiplicity assignment, (ii) creation of the test vectors, and (iii) selection of the correct
test vector and decoding. The LCC decoding process creates $2^{\eta}$ different test vectors using the
reliability information from the received points, where $\eta < 2t$. It is proved in [4] that for
$\eta > 2t$ the decoder complexity is so high that its implementation is impractical. If one of the
$2^{\eta}$ test vectors has fewer than $t + 1$ errors, then the decoder will be able to correct them in
all cases. Within this architecture we use the BMA-based two-stage pipelined HD RS (255,239) decoder that
we designed. For the BMA-based RS decoder, a decoder failure happens if the received message generates an
error locator polynomial $\Lambda(x)$ whose degree is higher than the number of roots obtained during the
Chien search procedure. In such a case, the number of errors is more than $t$, and therefore the decoder
cannot recover the message correctly. The specifications of the LCC decoder are as follows.
[Figure 3: block diagram with blocks Channel info, r(x), Multiplicity assignment, Test vector generation,
Syndrome computation, Key equation solver, Chien search, Forney's algorithm, Degree counter, Root counter,
Decoding failure detector (Decode failure), Delay buffer; signals e(x) and c(x)]
Figure 3: LCC soft-decoder based on decoding failure
• RS (255,239) decoder over GF($2^8$) with $\eta = 3$.
• Number of bits for channel reliability information is 4 bits.
• Inputs to decoder: $y^{[HD]}$, reliability information $\lambda_i$ of each symbol $y_i^{[HD]}$, and
information about the bit that needs to be flipped in a symbol.
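The decoding-failure test used to accept or reject a test vector (degree of the error locator polynomial versus the number of roots found by the Chien search) can be sketched as below. The GF(2^8) primitive polynomial 0x11D and the function names are assumptions of this sketch; in the decoder itself the comparison is made by dedicated degree and root counters rather than in software.

```python
# Illustrative GF(2^8) tables; the primitive polynomial 0x11D is an assumption.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(p, x):
    """Horner evaluation; p[i] is the coefficient of x^i."""
    acc = 0
    for coeff in reversed(p):
        acc = gf_mul(acc, x) ^ coeff
    return acc


def decoding_failure(lam, n=255):
    """Return True when deg Lambda exceeds the number of Chien-search roots."""
    degree = max(i for i, coeff in enumerate(lam) if coeff != 0)
    roots = sum(1 for i in range(n)
                if poly_eval(lam, EXP[(255 - i) % 255]) == 0)
    return degree > roots


X = EXP[3]
ok = decoding_failure([1, X])                    # one root, degree one
# (1 + X x)^2 = 1 + X^2 x^2 in GF(2^m): degree two but a single distinct root
bad = decoding_failure([1, 0, gf_mul(X, X)])
print(ok, bad)
```

A polynomial with a repeated root is used here only as a convenient way to trigger the mismatch; in practice the mismatch arises when more than t errors are present.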
The latency of the decoder can be reduced significantly, without affecting the throughput much, if the SC
and the KES are merged into a single pipeline stage as shown in Figure 2. This is possible because the SC
stage outputs one syndrome every clock cycle and the KES stage takes one syndrome as input every clock
cycle from an array of syndromes.
VI RESULTS AND COMPARISONS
The proposed hard and soft RS decoders were implemented in VHDL on a Kintex-7 FPGA KC705 Evaluation Kit.
They were functionally verified using the ModelSim simulator and ChipScope Pro. The outputs of the
VHDL-coded architecture were validated using Matlab and a C-coded model.
The throughput and latency equations for the hard and soft decision RS decoders are given below.
Table 2: Clock cycles required for each stage of the RS decoder.
| Stage | Clock cycles |
|---|---|
| Syndrome Computation | $2t$ |
| Key Equation Solver | $2t$ |
| Chien and Error Magnitude Computation | $t + n/J$ |
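As a rough numerical sketch, the per-stage cycle counts of Table 2 can be turned into throughput and latency figures: throughput is set by the slowest pipeline stage, and latency by the sum of the stage times. The function names are hypothetical, and rounding n/J up to a whole clock cycle is an assumption of this sketch rather than something the paper states.

```python
from math import ceil


def stage_cycles(n, t, J):
    """Clock cycles per pipeline stage, following Table 2.

    Rounding n/J up to a whole cycle is an assumption of this sketch.
    """
    return {
        "syndrome": 2 * t,
        "key_equation_solver": 2 * t,
        "chien_and_magnitude": t + ceil(n / J),
    }


def throughput_gbps(n, m, t, J, fmax_hz):
    """Bits per codeword times the clock rate, divided by the slowest stage."""
    slowest = max(stage_cycles(n, t, J).values())
    return n * m * fmax_hz / slowest / 1e9


def latency_three_stage(n, t, J):
    """Sum of the three stage times (syndrome, KES, Chien and magnitude)."""
    return sum(stage_cycles(n, t, J).values())


def latency_two_stage(n, t, J):
    """Approximate latency when syndrome computation and KES are merged."""
    return 3 * t + ceil(n / J)


# RS(255,239): n = 255 symbols of m = 8 bits, t = 8, J = 10, 200 MHz clock.
params = dict(n=255, t=8, J=10)
print(stage_cycles(**params))
print(throughput_gbps(m=8, fmax_hz=200e6, **params))   # Gbps
print(latency_three_stage(**params), latency_two_stage(**params))
```

With these values the slowest stage takes 34 cycles and the throughput evaluates to 12 Gbps, the figure reported for the proposed decoder in Table 3. The two-stage latency estimate of 50 cycles is only an approximation and comes out slightly below the 55 cycles reported there.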
Table 2 shows the clock cycles taken by each pipeline stage in the HDD. The throughput of the HDD as a
function of the error correction capability is given by
$$\text{Throughput}(t) = \frac{n \times m \times f_{\max}}{\max\{2t,\ t + n/J\}} \ \text{(Gbps)}, \tag{8}$$
where $n$, $m$, $t$ and $f_{\max}$ denote the code length, symbol width, error correcting capability and
maximum operating frequency, respectively.
The total latency of the designed 3-stage HDD in Figure 1 is given by the sum of the cycles required for
each stage as per Table 2:
$$\text{Latency} = 2t + 2t + t + \frac{n}{J} = 5t + \frac{n}{J} \ \text{(cycles)}. \tag{9}$$
The latency is reduced further by the proposed 2-stage HDD architecture in Figure 2:
$$\text{Latency} \approx 3t + \frac{n}{J} \ \text{(cycles)}. \tag{10}$$
An RS decoder of block length $n = 255$ with $J = 10$ is implemented over GF($2^8$). Table 3 shows the
comparison between our decoder with $J = 10$ and the Xilinx RS decoder v9.0 IP [5] for RS (255,239) on a
Kintex-7 FPGA. No BRAMs are used in our decoder, as all the needed Galois field arithmetic is implemented
using LUTs and FFs. The Xilinx IP [5] uses a large amount of memory, two 18k BRAMs and one 36k BRAM, which
reduces its utilization of LUTs and FFs compared to our decoder. Considering BRAMs, LUTs and FFs
altogether, we estimate that our decoder and the Xilinx IP use almost the same hardware resources. The
throughput gain achieved with our architecture is almost five times, which is a significant improvement.
The latency of our decoder is almost nine times better than that of the Xilinx IP.
Table 3: Comparison of RS (255,239) designs.
| | Xilinx IP | Proposed |
|---|---|---|
| LUTs | 1177 | 8774 |
| FFs | 1169 | 5272 |
| 36k BRAMs | 1 | 0 |
| 18k BRAMs | 2 | 0 |
| Max. freq (MHz) | 292 | 200 |
| Max. throughput (Gbps) | 2.33 | 12 |
| Latency (cycles) | 470 | 55 |
Similarly, Table 4 shows the comparison between the existing LCC decoder [6] and our designed LCC decoder.
Table 4: Comparison of RS (255,239) LCC designs with $\eta = 3$.
| | Existing [6] | Designed |
|---|---|---|
| LUTs | 5114 | 18584 |
| FFs | 5399 | 16621 |
| BRAMs | | 0 |
| Max. freq (MHz) | 150.5 | 161.29 |
| Max. throughput (Gbps) | 0.71 | 1.29 |
| Latency (cycles) | 1312 | 559 |
| Platform | Virtex-V | Kintex-7 |
| Xilinx ISE version | 9.2i tool | 14.6 tool |
The LCC soft decision decoder performs better than the HDD, as shown in Figure 4. The LCC decoder achieves
0.25 dB of coding gain over HDD and corrects errors beyond the bounded distance.
Figure 4: FER performance for an RS (255,239) code over an AWGN channel with BPSK modulation
VII CONCLUDING REMARKS
We designed a high-speed, low-complexity Reed-Solomon decoder architecture based on the Berlekamp-Massey
hard decision decoding algorithm. Our design consists of 3 modules, namely syndrome computation, key
equation solver (KES) and error magnitude computation. We proposed a low complexity KES architecture that
takes exactly 2t cycles. This architecture results in a 5 times improvement in throughput relative to the
Xilinx IP [5]. We use the efficient error locator and magnitude computation architecture presented in [3]
to find error locations and magnitudes simultaneously. We also propose and employ a two-stage pipelined
RS decoder that merges the SC and KES blocks, resulting in a 9 times reduction in the latency of the
decoder at the same throughput. The number of slice LUTs and slice registers is higher in our HDD because
we use logic gates for Galois field arithmetic instead of BRAMs. The proposed design can run at a maximum
frequency of 200 MHz and can correct a maximum of 8 symbol errors within a 255-symbol codeword. As
mentioned in Section 2.5, the user has the freedom to change the error correcting capability of the
decoder, which we designed as a programmable parameter.
We also implemented a soft decision based Chase decoder that additionally takes the channel reliability
information as input. The designed HDD is used as a module within the Chase decoder. The Chase decoder can
correct more than 8 symbol errors within a 255-symbol codeword. It achieves approximately two times higher
throughput and two times lower latency than the Chase decoder presented in [6]. It also achieves a 0.25 dB
coding gain over the hard decision based decoder.
REFERENCES
[1] Shu Lin and D. J. Costello, Error Control Coding (2nd Edition), Prentice Hall (2004).
[2] Wei An, “Complete VLSI Implementation of Improved Low Complexity Chase Reed-Solomon Decoders”,
Thesis, Massachusetts Institute of Technology, September 2010.
[3] Xinmiao Zhang, VLSI Architectures for Modern Error-Correcting Codes, CRC Press (2015).
[4] J. Bellorado, Low-Complexity Soft Decoding Algorithms for Reed-Solomon Codes, Ph.D. thesis, Harvard
University, 2006.
[5] Reed-Solomon Decoder v9.0 LogiCORE IP Product Guide, Vivado Design Suite, PG107, November 18, 2015.
Available: http://www.xilinx.com
[6] F. G. Herrero et al., “High-Speed RS (255, 239) Decoder Based on LCC Decoding”, Circuits Syst Signal
Process, 30:1643-1669, DOI 10.1007/s00034-011-9327-4, June 2011.