International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 3 – March 2015

Adaptive Turbo Decoder with FSM Based Interleaver Address Generator

Varsha Ramesh#1, M. Thangamani#2
#1 II Year M.E Student, #2 Assistant Professor
#1,#2 ECE Department, Dhanalakshmi Srinivasan College of Engineering, Coimbatore, Tamilnadu, India. Anna University

Abstract---The fast spread of wireless data communication systems and the ever increasing demand for faster data rates require rapid design, implementation and test of new wireless algorithms and architectures for data communication. The most popular communication decoder, the Turbo decoder, requires an exponential increase in hardware complexity to achieve greater decoding accuracy. The interleaver is a critical component of a turbo decoder. In this work we first utilize a balanced scheduling scheme to avoid memory reading conflicts. Then, based on the statistical property of memory conflicts, the other critical parameters of access timing, power and area are reduced by using an alternative FSM based Interleaver Address Generator (IAG). This IAG is an efficient and fast parallel interleaver architecture supporting both interleaving and deinterleaving modes. We also describe the analysis and implementation of a reduced-complexity decoding approach, the adaptive turbo decoding architecture (ATA). Our ATA design takes full advantage of algorithm parallelism and specialization. To improve decoder performance, run-time dynamic reconfiguration is used in response to changing channel noise conditions. Implementation parameters for the decoder have been determined through simulation, and the decoder has been implemented on a Xilinx XC4036 FPGA.

Keywords--- Forward Error Correction, Turbo decoder, Trellis tree diagram.

I. INTRODUCTION

The growth of high-performance wireless communication systems has increased drastically over the last few years.
Due to rapid advancements and changes in radio communication systems, there is always a need for flexible and general purpose solutions for processing the data [1].
ISSN: 2231-5381
Such a solution must not only accommodate the variances within a particular standard but also cover a range of standards to enable a true multimode environment. To handle fast transitions between different standards, a fast and accurate platform is needed in both mobile devices and, especially, in base stations. In addition to symbol processing, one of the challenging areas is the provision of flexible subsystems for forward error correction (FEC). FEC subsystems can further be divided into two categories: channel coding/decoding and interleaving/deinterleaving. Of these, interleavers and deinterleavers are the more silicon consuming, due to the cost of the permutation tables used in conventional approaches. Therefore, hardware reuse among different interleaver modules to support a multimode processing platform is of significance. This paper introduces a flexible and low-cost hardware interleaver architecture and an adaptive turbo architecture for the trellis tree structure, covering a range of interleavers admitted in various communication standards.

II. BACKGROUND

Error correcting codes [9] can be used to detect and correct data transmission errors in communication channels. Encoding is accomplished through the addition of redundant bits to the transmitted information symbols. These redundant bits give decoders the capability to correct transmission errors. In convolutional coding, the encoded output of a transmitter (encoder) depends not only on the set of encoder inputs received within a particular time step, but also on the inputs received within the previous span of K - 1 time units, where K is greater than 1. The parameter K is the constraint length of the code.
A convolutional encoder is characterized by the number of output bits per input bit (v), the number of input bits accepted at a time (b), and the constraint length (K), leading to the representation (v, b, K).
http://www.ijettjournal.org Page 144
An encoder that accepts one input bit per time step and generates two output bits, with K = 3, is thus a (2, 1, 3) convolutional encoder. The two output bits depend on the present input and the previous two input bits. The constraint length K indicates the number of time steps over which each input bit has an effect on the output bits. Larger constraint lengths, e.g. K = 9 or higher, are preferable since they allow more accurate error correction. The operation of the encoder is represented by a state diagram: nodes represent the present state of the shift register, while edges represent the output sequence and point to the next state of the transition. The trellis diagram is a time-ordered mapping of encoder states, with each possible state represented as a point on the vertical axis; a node shows the present state of the shift register at a specific point in time. The upper branch leaving a node implies an input of 0 while the lower branch implies an input of 1. The decoder attempts to reconstruct the input sequence transmitted by the encoder by evaluating the received channel output. Values received by the decoder may differ from the values sent by the encoder due to channel noise. The interaction between states represented by the trellis diagram is used by the decoder to determine the likely transmitted data sequence [10] as v-bit symbols are received. At each node, the cumulative cost, or path metric, of the path is determined.
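As an illustration, the (2, 1, 3) encoder described above can be sketched in software. The text does not state the generator polynomials, so the common (7, 5) octal pair is assumed here; this is a behavioural sketch, not the paper's hardware design.

```python
def conv_encode(bits, g0=0b111, g1=0b101, K=3):
    """Encode a bit sequence with a (2, 1, K) convolutional encoder.

    g0 and g1 are the generator polynomials (the common (7, 5) octal
    pair is assumed -- the paper does not state its generators). Each
    input bit produces v = 2 output bits that depend on the current bit
    and the previous K - 1 bits held in the shift register.
    """
    state = 0  # contents of the (K-1)-bit shift register
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state              # current bit + history
        out.append(bin(reg & g0).count("1") % 2)  # parity under tap g0
        out.append(bin(reg & g1).count("1") % 2)  # parity under tap g1
        state = reg >> 1                          # shift history forward
    return out
```

Encoding the input sequence b = (0110) used later in the text yields eight output bits, two per input bit, each pair depending on the present input and the two previous inputs.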
After a series of time steps known as the truncation length (TL), the lowest-cost (minimum distance) path is determined, identifying the most likely transmitted symbol sequence. The value of the truncation length depends on the noise in the channel and has been empirically found to be 3-5 times the constraint length [9]. Each path in the trellis diagram is represented by a unique set of inputs; in the example, the lowest-cost path corresponds to the input sequence b = (0110). The performance of a decoder is characterized by the number of decoded output bits in error. The ratio of the number of bits in error to the total number of bits transmitted is the Bit Error Rate (BER), and for accurate communication it is desirable to achieve a low BER. The most popular decoding approach for convolutional codes, the turbo algorithm [9], determines a minimum distance path with regard to the Hamming distances of the received symbols. A limiting factor in turbo decoder implementations is the need to preserve candidate paths at all 2^(K-1) trellis states with each received symbol. This leads to exponential growth in the amount of computation performed and path storage retained as the constraint length K grows. The architecture of the Turbo algorithm [9] is split into three parts: the branch metric generator (BMG), the add-compare-select (ACS) units, and the survivor memory unit. The BMG determines Hamming distances between received and expected symbols. An ACS unit determines path costs and identifies lowest-cost paths. The survivor memory stores the lowest-cost bit-sequence paths based on decisions made by the ACS units.

III. RELATED WORK

An efficient parallel decoding approach called the segmented sliding window (SSW) approach was introduced by Zhongfeng. The idea is to divide a decoding frame into many sliding blocks and assign these sliding blocks to several segments.
Each segment consists of consecutive sliding blocks, and adjacent segments overlap by exactly two sliding blocks. When employing the sliding window approach for decoding each segment, the performance of the parallel decoder is expected to be almost the same as using the sliding window approach on the whole frame [9]. Maurizio et al. [10] presented a flexible UMTS/WiMax turbo decoder architecture together with a parallel WiMax interleaver architecture. Compared to a single-mode parallel WiMax architecture, the proposed one exhibits a limited complexity overhead. Moreover, compared to a separated dual-mode UMTS/WiMax turbo decoder architecture, it achieves a 17.1% logic reduction and a 27.3% memory reduction. Besides this, in order to cope with the severe transmission environments typical of wireless systems, channel codes ought to be adopted. The first dedicated approach that finds a conflict-free memory mapping for every type of code and every degree of parallelism in polynomial time is presented in [11]. The implementation of this highly efficient algorithm shows a significant improvement in computational time compared to state-of-the-art approaches. This could enable the memory mapping algorithm to be embedded on chips and executed on the fly to support multiple block lengths and standards. Vosoughi et al. propose a novel, highly scalable algorithm and architecture for on-the-fly parallel interleaved address generation in the UMTS/HSPA+ standard; the algorithm generates an interleaved memory address from an original input address without building or storing the complete interleaving pattern. Ahmad Sghaier et al. [7] described a look-up-table based method for address generation of the interleaver used in IEEE 802.11 WLAN. In [6] a special matrix based architecture for a multimode WLAN block interleaver is presented.
Algebraic constructions are of particular interest because they admit analytical design and simple, practical hardware implementation. Sun and Takeshita showed that the class of quadratic permutation polynomials over integer rings provides excellent performance for turbo codes. The interleaving method first introduced with this structure is an on-the-fly IAG. To eliminate the disadvantages of the on-the-fly IAG, a new interleaver architecture, the FSM based IAG, is described in the following sections. Several reconfigurable implementations of Turbo decoders have been reported. Although these systems are FPGA based, none of them use run-time reconfiguration to achieve performance improvement. Unlike our approach, these implementations do not evaluate all trellis states in parallel, resulting in slower decoding operation. Racer, a constraint length 14 Turbo decoder described in [16], uses 36 XC4010 FPGAs and seven processor cards and employs a novel approach to implementing survivor memory. Due to the use of a sizable number of FPGAs and significant inter-chip communication, system area is large. Racer exhibits significant parallelism, although some add-compare-select hardware is multiplexed across multiple trellis states per received symbol, and candidate paths are stored in memory external to the FPGAs. Our ATA approach achieves a fully parallel implementation on a single, large FPGA that contains significantly less total logic than the board used in [16] for the same constraint length (K = 14). In [5], a Turbo decoder of constraint length 7 using four XC4028EX FPGAs is described. The decoder is partitioned so that 64 ACS units fit into two of the FPGAs while the remaining two FPGAs house the survivor memory and its corresponding controller. The main issue with this approach is data transfer between the FPGAs.
Although [5] allowed for parallel trellis evaluation, only a limited data rate of 12 Kbps was achieved for the relatively small constraint length of 7. This reduced rate was primarily due to inter-chip data transfer overhead. No dynamic reconfiguration was performed.

IV. FINITE STATE MACHINE BASED INTERLEAVER ADDRESS GENERATOR

The interleaver is defined with a two-step permutation. The first permutation ensures that adjacent coded bits are mapped onto non-adjacent subcarriers. The second permutation ensures that adjacent coded bits are mapped alternately onto less or more significant bits of the constellation, avoiding long runs of lowly reliable bits. Here d represents the number of columns of the block interleaver, typically chosen as 16; mk is the output after the first level of permutation, with k varying from 0 to Ncbps - 1; and s is a parameter defined as s = max{1, Ncpc/2}, where Ncpc is the number of coded bits per subcarrier.

A. Address Generator

Our proposed address generator block is described in the schematic diagram of Fig. 1. The bulk of the circuitry is used for generation of the write address. It contains three multiplexers (muxes): mux-1 and mux-2 implement the unequal increments required for 16-QAM and 64-QAM, whereas mux-3 routes the outputs received from mux-1 and mux-2 along with the equal increments of BPSK and QPSK. The select input of mux-1 is controlled by a T flip-flop named qam16_sel, whereas that of mux-2 is controlled by a mod-3 counter, qam64_sel. The two lines of mod_typ (modulation type) are used as the select input of mux-3. The 6-bit output of mux-3, after zero padding, acts as one input of a 9-bit adder; the other adder input comes from the accumulator, which holds the previous address. After addition, the new address is written into the accumulator. The preset logic is a hierarchical FSM whose principal function is to generate the correct beginning addresses for all subsequent iterations. This block contains a 4-bit counter keeping track of the end of states during the iteration. The FSM enters the first state (SF) with clr = 1.

Figure 1. Schematic diagram of address generator

Based on the value of mod_typ, the FSM makes a transition to one of four possible next states (SMT0, SMT1, SMT2 or SMT3). Each state at this level represents one of the possible modulation schemes. The FSM thereafter makes transitions to the next states (e.g. S000, S001 and so on) based on the value in the accumulator. When the FSM at this level reaches the terminal value of the iteration, it makes a transition to a state (e.g. S000) in which it loads the accumulator with the initial value (e.g. preset = 1) of the next iteration. This continues until all of the interleaver addresses are generated for the selected mod_typ. If the value of mod_typ does not change, the FSM follows the same route of transitions and the same set of interleaver addresses is continually generated; any change in the mod_typ value causes the interleaver to follow an alternate path. To give the address generator an on-the-fly address computation feature, we have made the circuit respond to the clr input followed by mod_typ inputs at any stage of the FSM. With clr = 1 the FSM returns to state SF irrespective of its current position, and thereafter transits to the desired states in response to the new value of mod_typ.

B. Interleaver Memory

The interleaver memory block comprises two memory modules (RAM-1 and RAM-2), three muxes and an inverter, as shown in Fig. 2. In block interleaving, while one memory block is being written the other is read, and vice versa. Each memory module receives either the write address or the read address through the mux connected to its address input lines (A) and the sel line. At the beginning, RAM-1 receives the read address and RAM-2 the write address, with the write enable (WE) signal of RAM-2 active. After a particular memory block has been read/written up to the desired location, the status of the sel line changes and the operation is reversed. The mux at the output of the memory modules routes the interleaved data stream from the read memory block to the output.

Figure 2. Schematic view of Interleaver Memory block

V. ADAPTIVE TURBO ALGORITHM

The adaptive turbo algorithm [4] is introduced with the goal of reducing the average computation and path storage required by the Turbo decoding algorithm. Instead of computing and retaining all 2^(K-1) paths, only those paths which satisfy certain cost conditions are retained for each received symbol at each state node. First, a threshold T indicates that a path is retained only if its path metric is less than dm + T, where dm is the minimum cost among all surviving paths in the previous trellis stage. Second, the total number of survivor paths per trellis stage is limited to a fixed number, Nmax, which is pre-set prior to the start of communication. The first criterion allows high-cost paths that do not represent the transmitted data to be eliminated from consideration early in the decoding process; where many paths have similar cost, the second criterion restricts the number of paths to Nmax. Nmax has a similar effect on the BER as T: if too small a value of Nmax is chosen, paths which satisfy the threshold condition may be discarded, potentially leading to a large BER. A trellis diagram of the adaptive Turbo algorithm for constraint length 3 with a threshold value T = 1 is shown in Figure 3. To demonstrate the benefit of the adaptive Turbo algorithm we have developed the interleaver and the adaptive decoding architecture.
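The two retention criteria can be sketched in software as follows. This is an illustrative model, not the hardware implementation; the second function reflects the sorting-free feedback adjustment of T described later for the architecture, and the state labels are illustrative.

```python
def prune_paths(metrics, dm_prev, T, Nmax):
    """Adaptive pruning: a path survives only if its metric is below
    dm_prev + T, where dm_prev is the minimum metric among surviving
    paths of the previous stage; at most Nmax survivors are kept,
    lowest metrics first."""
    survivors = {s: m for s, m in metrics.items() if m < dm_prev + T}
    return dict(sorted(survivors.items(), key=lambda sm: sm[1])[:Nmax])

def prune_without_sorting(metrics, dm_prev, T, Nmax):
    """Sorting-free variant: iteratively lower T by 2 for the current
    stage until at most Nmax paths satisfy the threshold condition."""
    survivors = {s: m for s, m in metrics.items() if m < dm_prev + T}
    while len(survivors) > Nmax:
        T -= 2
        survivors = {s: m for s, m in metrics.items() if m < dm_prev + T}
    return survivors
```

For the Figure 3 example at t = 1 (received symbol 00, so the metric of state 00 is 0 and of state 10 is 2, with dm = 0 and T = 1), only the path to state 00 survives.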
Figure 3. Trellis diagram for a hard-decision adaptive Turbo decoder with T = 1 and Nmax = 3

At each stage, the minimum cost (path metric) of the previous stage dm, the threshold T and the maximum number of survivors Nmax are used to prune the surviving paths. Initially, at t = 0, the decoder state is set to 00. Two branches flow out from state 00 to states 00 and 10 at t = 1, representing encoded transmissions of 0 and 1 respectively by the encoder. If the received value at t = 0 is 00, it is more likely that b = 0 (v = 00) was transmitted than b = 1 (v = 11), since both bits of the latter v would have had to be corrupted by noise. Since state 00 is the only state at t = 0, dm is the path metric of state 00, which is 0, so dm + T = 1. At t = 1, the path leading to state 10 does not survive because its path metric is greater than 1, the value of dm + T. As a result, only one branch, the branch leading to state 00, survives at t = 1. The new dm used at t = 2 is the minimum among the metrics of all surviving paths at t = 1; since only one path survives at t = 1, dm is the path metric of state 00, which is 0. Pruning can result in an increased BER, since the decision on the most likely path has to be taken from a reduced number of possible paths. When a large value of T is selected, the BER reduces and the average number of survivor paths increases; the increased decoding accuracy comes at the expense of additional computation and a larger path storage memory. The value of T should be selected so that the BER stays within allowable limits while matching the resource capabilities of the hardware. Nmax denotes the maximum number of survivor paths retained at any trellis stage. This architecture takes advantage of parallelization, specialization of hardware for specific constraint lengths, and dynamic reconfiguration to adapt the decoder hardware to changing channel noise characteristics.

A. Description of the Architecture
The architecture of the implemented adaptive Turbo decoder is shown in Figure 4 for an encoder with parameters (2, 1, 3). The branch metric generator determines the difference between the received v-bit value and the 2^v possible expected values; this difference is the Hamming distance between the values. A total of 2^v branch metrics are determined by the branch metric generator; for v = 2 these metrics are labeled b00, b01, b10 and b11. At each trellis stage, the minimum surviving path metric among all path metrics of the preceding trellis stage, dm, is computed. New path metrics are compared to the sum dm + T to identify path metrics with excessive cost. As shown on the left of Figure 4, the path metric di of each potential next-state path is computed by the ACS unit. Comparators then determine the survival of each path based on the threshold T: if the threshold condition dm + T is not satisfied by a path metric, the corresponding path is discarded. Present and next state values for the trellis are stored in two column arrays, PRESENTSTATE and NEXTSTATE, of dimensions Nmax and 2Nmax respectively, as shown in Figure 4. There are at most Nmax survivor paths at any stage, and since each path is associated with a state, the number of present states is at most Nmax. Each path can potentially create two child paths before pruning, as there are two possible branches from each present state based on a received 0 or 1 symbol. Entries in the NEXTSTATE array need not be in the same row as their respective source present states. In order to correlate the next-state paths with the next states located in the NEXTSTATE array, an array of size 2Nmax, called Path Identify, is used. For each next-state element, this array also indicates the corresponding row in the path storage (survivor) memory for the path.
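The branch metric computation described above can be sketched in software as follows (a hard-decision model; the string symbol labels are illustrative, matching the b00..b11 naming used for v = 2).

```python
def branch_metrics(received, v=2):
    """Compute the 2**v Hamming distances between the received v-bit
    symbol and every expected symbol -- the quantities labeled b00,
    b01, b10 and b11 in Figure 4 for v = 2."""
    return {
        format(sym, f"0{v}b"): sum(
            r != e for r, e in zip(received, format(sym, f"0{v}b"))
        )
        for sym in range(2 ** v)
    }
```

For a received symbol 01 this yields b00 = 1, b01 = 0, b10 = 2, b11 = 1; an ACS unit then adds the appropriate metric to each surviving path and compares the result against dm + T.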
Once the paths that meet the threshold condition are determined, the lowest-cost Nmax paths are selected. To avoid the need for the sorting circuit described in [6] for the M-algorithm, we developed a new path pruning approach: sorting circuitry is eliminated by making feedback adjustments to the parameter T. If the number of paths that survive the threshold is less than Nmax, no sorting is required. For stages where the number of paths surviving the threshold condition is greater than Nmax, T is iteratively reduced by 2 for the current trellis stage until the number of surviving paths is equal to or less than Nmax.

Figure 4. Adaptive Turbo decoder architecture

VI. EXPERIMENTAL RESULTS

ModelSim version 6 is used to model the entire system process, i.e. encoding, interleaving, transmitting the bits, receiving the bits, decoding and deinterleaving based on the mod_typ value. Figure 5 shows the combined waveform for the entire process. ModelSim SE 6.3f offers VHDL, Verilog or mixed-language simulation. Model Technology's Single Kernel Simulation (SKS) technology enables transparent mixing of VHDL and Verilog in a design, and ModelSim's architecture allows platform independent compilation along with the performance of native compiled code.

Figure 5. Combined output for the entire process

Figure 7. Timing Summary

Figure 8. Power Summary

Figure 6 shows the design summary, i.e. the total gate count, which determines the area required for the on-board implementation.
The speed of the FSM based turbo decoder with the adaptive turbo architecture is shown in Figure 7; the minimum period is the time taken to complete the entire process. The estimated power is given in Figure 8: the entire process consumes less power than the on-the-fly parallel architecture (Table 1). Power is a major concern in designing any 3G/4G system, and low power consumption is an important advantage for wireless communication equipment, which must run on battery power.

Figure 6. Design Summary

VII. PERFORMANCE COMPARISON

The FSM based approach with the ATA architecture provides a higher operating frequency and excellent FPGA resource utilization. Use of the FPGA's internal memory offers reduced access time, less circuit board area and lower power consumption than external memory based techniques. Table 1 below compares the parallel architecture of the turbo decoder with on-the-fly interleaver against the adaptive turbo decoder with FSM based IAG.

Parameter                Parallel architecture       Adaptive Turbo decoder
                         with on-the-fly IAG         with FSM based IAG
Total gate count         109,870                     810
Timing Summary (ns)      14.220                      9.325
Power (mW)               3429                        169

Table 1. Comparison between the parallel architecture with on-the-fly IAG and the adaptive turbo decoder with FSM based IAG

VIII. CONCLUSION

A novel FSM based technique with an adaptive turbo decoder for IEEE 802.11a and IEEE 802.11g based WLANs has been presented. In this work, the interleaver address generator preset logic is first changed to a finite state machine based address generator. Then, based on the statistical property of memory conflicts, the other critical parameters of access timing, power and area are reduced by using the alternative FSM based IAG.
This IAG is an efficient, unified parallel interleaver architecture supporting both interleaving and deinterleaving modes. The FSM based IAG hardware model of the interleaver is completely implemented on a Spartan-3 FPGA. The trellis tree diagram used in decoding was then modified using the adaptive turbo decoding architecture. Assigning a threshold value at each stage reduces the buffer size, so that area is reduced, and decoding is made more accurate by the adaptive architecture. A critical analysis of the implementation results of both approaches has been made to ease a system designer's decision regarding which technique to adopt.

REFERENCES

[1]. Guohui Wang, Hao Shen, Joseph R. Cavallaro and Aida Vosoughi, "Parallel interleaver design for a high throughput HSPA+/LTE multi-standard turbo decoder," IEEE Transactions on Circuits and Systems, vol. 61, no. 5, pp. 1376-1389, 2014.
[2]. Z. Wang, Z. Chi and K. K. Parhi, "Area-efficient high-speed decoding schemes for turbo decoders," IEEE Trans. VLSI Syst., vol. 10, no. 6, pp. 902-912, 2002.
[3]. C. Benkeser, A. Burg, T. Cupaiulo and Q. Huang, "Design and optimization of an HSDPA turbo decoder ASIC," IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 98-106.
[4]. F. Speziali and J. Zory, "Scalable and area efficient concurrent interleaver for high throughput turbo-decoders," in Proc. Euromicro Symp. Digit. Syst. Des. (DSD), Aug. 2004, pp. 334-341.
[5]. A. Vosoughi, G. Wang, H. Shen, J. R. Cavallaro and Y. Guo, "Highly scalable on-the-fly interleaved address generation for UMTS/HSPA+ parallel turbo decoder," in Proc. IEEE Int. Conf. ASAP, Jun. 2013, pp. 356-362.
[6]. F. Chan and D. Haccoun, "Adaptive turbo decoding of convolutional codes over memoryless channels," IEEE Trans. Commun., pp. 1389-1400, Nov. 2001.
[7]. S. J. Simmons, "Breadth-first trellis decoding with adaptive effort," IEEE Trans. Commun., vol. 38, pp. 3-12, Jan. 1990.
[8]. S. Swaminathan, "An FPGA-based Adaptive Turbo Decoder," Master's thesis, University of Massachusetts Amherst, Dept. of Electrical and Computer Engineering, 2001.
[9]. M. Kivioja and J. B. Anderson, "M-algorithm decoding of channel convolutional codes," Int. Conf. on Information Sciences and Systems, pp. 362-366, Mar. 1986.
[10]. Y. Sun and J. R. Cavallaro, "Efficient hardware implementation of a highly parallel 3GPP LTE/LTE-Advanced turbo decoder," VLSI J. Integr., vol. 44, no. 4, pp. 305-315, 2011.
[11]. A. Nimbalker, T. Blankenship, B. Classon, T. Fuja and D. Costello, "Contention-free interleavers for high throughput turbo decoding," IEEE Trans. Commun., vol. 56, no. 8, pp. 2701-2704.