International Journal of Engineering Trends and Technology, Volume 4, Issue 1, 2013
ISSN: 2231-5381, http://www.internationaljournalssrg.org

Design of Memory Based Implementation Using LUT Multiplier

Charan Kumar K.1, S. Vikrama Narasimha Reddy2, Neelima Koppala3
1,2 M.Tech (VLSI) Student, 3 Assistant Professor, ECE Department, Sree Vidyanikethan Engineering College (Autonomous), A. Rangampet, Tirupati.

Abstract - Multiplication is a major arithmetic operation in signal processing. In ALUs, the multiplier can use a look-up table (LUT) as memory for its computations. We do not find any significant work on LUT optimization for memory-based multiplication. In this project, antisymmetric product coding (APC) and odd-multiple storage (OMS) are used for the LUT design of memory-based multipliers used in signal processing applications such as filter design. Each of these techniques reduces the LUT size by a factor of two. A different form of APC and a modified OMS scheme can be combined for an efficient memory implementation that reduces the LUT size to one-fourth of that of the conventional LUT. Owing to operand decomposition, the proposed LUT-based multiplier involves less area-delay product for higher word sizes than canonical-signed-digit (CSD)-based multipliers. The design is coded in Verilog HDL, synthesized using Xilinx ISE 10.1i, and implemented on a Spartan-3E FPGA.

Key words - digital signal processing (DSP) chip, lookup-table (LUT)-based computing, memory-based computing, very large scale integration (VLSI).

I. INTRODUCTION

Owing to rapid advances in technology, semiconductor devices are now prominent in every field. These devices operate very fast, consume less power and area, and have become more efficient with respect to factors such as reliability, flexibility, and scaling. This has led to significant growth and improvement, and the devices have become cheaper. Semiconductor memory is embedded in these devices and has a dominating presence in systems-on-chip (SoCs), exceeding 90% of the total SoC content [2]. Compared with logic components, semiconductor memory has a high transistor packing density that is increasing at a fast rate [1]. Apart from that, memory-based computing structures offer further advantages over multiply-accumulate structures, such as greater potential for high-throughput and low-latency implementation and less dynamic power consumption. Memory-based computing is well suited for many DSP algorithms that involve multiplication with a fixed set of coefficients. Fig. 1 shows the block diagram of the conventional LUT-based multiplier.

Fig. 1: Conventional LUT-based multiplier.

In Fig. 1, X is the input address and A is the fixed coefficient by which X is multiplied; the resulting product is taken as the output. Suppose X is a positive binary number of word length L. There are then $2^L$ possible values of X and, correspondingly, $2^L$ possible values of the product $C = A \cdot X$. In memory-based multiplication, a conventional LUT of $2^L$ words stores the precomputed product values for all possible values of X. When an L-bit binary word $X_i$ is applied as the address, the LUT outputs the corresponding product $A \cdot X_i$. The product $A \cdot X_i$ is therefore stored at location $X_i$ for $0 \le X_i \le 2^L - 1$.

Several architectures have been reported in earlier work for the memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters [5]-[12], but no significant work on LUT optimization is found in them. Recently, a new approach to LUT optimization was introduced in which only the odd multiples of the fixed coefficient are stored; it is termed the odd-multiple-storage (OMS) scheme [3]. The LUT size can also be reduced to half by another approach, known as antisymmetric product coding (APC), in which the product words are recoded as antisymmetric pairs [4]. Although the APC approach reduces the LUT size by a factor of two, it takes additional time and area to perform the 2's complement operation on the LUT output for sign modification according to the corresponding input. We find that combining the APC and OMS techniques simplifies the 2's complement operations, because the input address and the LUT output can always be transformed into odd integers. However, combining the OMS scheme of [3] directly with the APC approach of [4] does not give an efficient implementation, since the APC of [4] works for odd multiples only. Therefore, for efficient memory-based multiplication, a modified form of the OMS scheme is combined with a different form of APC.

The modified OMS scheme [4] and the combined OMS-APC approach are discussed in Section II. The implementation of the combined schemes and the design of the LUT-based multiplier are described in Section III. The synthesis results of the proposed multiplier are presented in Section IV, followed by the conclusion.

II. LUT OPTIMIZATIONS FOR MEMORY-BASED MULTIPLICATION

This section describes the APC technique and its optimization by combining it with a modified form of OMS.

A. APC for LUT Optimization

For simplicity of presentation, both X and A are assumed to be positive integers. Table I shows the product values for different values of the input X for L = 5.

Table I: The 4-bit LUT address values and corresponding APC words for different input words, for L = 5.

Because the representation of the product values in Table I is derived from the antisymmetric behavior of the products, it is termed antisymmetric product code.
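The body of Table I does not survive in this copy. As a hedged reconstruction from the definitions used in this section, its rows can be regenerated with a short Python sketch. The sketch assumes that each row pairs an input X in the range 00001 to 10000 with its 5-bit 2's complement, that the products are expressed as multiples of A, and that the APC word is half the difference of the two products on a row. The function name `apc_table_rows` is hypothetical and is not part of the paper.

```python
def apc_table_rows(L=5):
    """Return Table I style rows; products are given as multiples of A.

    Each row is (X, u, X_complement, v, LUT address, APC word), where
    u and v are the multiples of A for the two inputs on the row.
    """
    half = 1 << (L - 1)   # 16 for L = 5
    full = 1 << L         # 32 for L = 5
    rows = []
    for x in range(1, half + 1):
        x_comp = full - x          # L-bit 2's complement of x
        u, v = x, x_comp           # products u = x*A and v = x_comp*A
        address = x_comp - half    # low L-1 bits of the word whose MSB is 1
        apc_word = (v - u) // 2    # stored multiple of A
        rows.append((
            format(x, f"0{L}b"), u,
            format(x_comp, f"0{L}b"), v,
            format(address, f"0{L - 1}b"), apc_word,
        ))
    return rows
```

Calling `apc_table_rows()` yields 16 rows. On every row the two products sum to 32A, which is the property that the derivation of the APC word relies on.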
From Table I, it is clear that the input word X in the third column of each row is the 2's complement of the input word X in the first column of the same row. In addition, the sum of the product values of the two input values on the same row is 32A. Let u and v be the product values in the second and fourth columns of a row, respectively. We can then write

$$u = \frac{u + v}{2} - \frac{v - u}{2} \quad \text{and} \quad v = \frac{u + v}{2} + \frac{v - u}{2}.$$

Since $u + v = 32A$, we have

$$u = 16A - \frac{v - u}{2} \quad \text{and} \quad v = 16A + \frac{v - u}{2}. \qquad (1)$$

From (1), the product values in the second and fourth columns of Table I exhibit negative mirror symmetry. This symmetry can be used to reduce the LUT size: instead of storing u and v, only $(v - u)/2$ is stored for the pair of inputs on a given row. The fifth and sixth columns of Table I show the 4-bit LUT address and the APC word stored for each row.

The 4-bit address $X' = (x_3' x_2' x_1' x_0')$ of the APC word is given by

$$X' = \begin{cases} X_L, & \text{if } x_4 = 1 \\ X_L', & \text{if } x_4 = 0 \end{cases} \qquad (2)$$

where $X_L = (x_3 x_2 x_1 x_0)$ denotes the four less significant bits of X, and $X_L'$ is the 2's complement of $X_L$. The required product is obtained by adding the stored value $(v - u)/2$ to, or subtracting it from, the fixed value 16A when $x_4$ is 1 or 0, respectively, i.e.,

$$\text{Product word} = 16A + (\text{sign value}) \times (\text{APC word}) \qquad (3)$$

where sign value = 1 for $x_4 = 1$ and sign value = −1 for $x_4 = 0$.

For X = (00000), the encoded word to be stored is 16A. The product value for X = (10000) corresponds to the APC value "zero," which can be derived by resetting the LUT output instead of storing it in the LUT.

B. Modified OMS for LUT Optimization

As the name indicates, the OMS scheme stores only the odd multiples of the fixed coefficient. For the multiplication of a binary word X of size L with a fixed coefficient A, instead of storing all $2^L$ possible product values, the LUT stores only the $2^L/2$ words corresponding to the odd multiples of A, since every even multiple of A can be derived from an odd multiple by left-shift operations. On this basis, the LUT for the multiplication of an L-bit input with a W-bit coefficient can be designed by the following strategy.

1) A memory unit of $[(2^L/2) + 1]$ words of (W + L)-bit width is used to store the product values, where the first $(2^L/2)$ words are odd multiples of A, and the last word is zero.
2) A barrel shifter for producing a maximum of (L − 1) left shifts is used to derive all the even multiples of A.
3) The L-bit input word is mapped to the (L − 1)-bit address of the LUT by an address encoder, and control bits for the barrel shifter are derived by a control circuit.

Table II: OMS-based design of the LUT of APC words for L = 5.

Table II shows that the eight odd multiples $A \times (2i + 1)$ are stored as $P_i$, for i = 0, 1, ..., 7, in eight memory locations. The even multiples 2A, 4A, and 8A are derived by left-shift operations of A. Similarly, 6A and 12A are derived by left shifting 3A, while 10A and 14A are derived by left shifting 5A and 7A, respectively. All the even multiples of A are thus derived by a barrel shifter that produces a maximum of three left shifts.

As stated in connection with (3), the word to be stored for X = (00000) is not 0 but 16A, which can be obtained from A by four left shifts using a barrel shifter. However, if 16A is not derived from A, a maximum of only three left shifts is required to obtain all the other even multiples of A. A two-stage logarithmic barrel shifter suffices for a maximum of three shifts, whereas a maximum of four shifts requires a three-stage barrel shifter. For the input X = (00000), the modified OMS scheme therefore stores 2A, so that the product 16A can be obtained by three arithmetic left shifts.

Table III: Products and encoded words for X = (00000) and (10000).

Table III shows the product values and encoded words for the input words X = (00000) and (10000). For X = (00000), the required encoded word 16A is obtained by a 3-bit left shift of 2A [stored at address (1000)]. For X = (10000), the APC word "0" is derived by resetting the LUT output with an active-high RESET signal given by

$$\text{RESET} = \overline{(x_0 + x_1 + x_2 + x_3)} \cdot x_4. \qquad (4)$$

From Tables II and III, the 5-bit input word X can be mapped into a 4-bit LUT address $(d_3 d_2 d_1 d_0)$ by a simple set of mapping relations

$$d_i = x''_{i+1}, \ \text{for } i = 0, 1, 2, \quad \text{and} \quad d_3 = \overline{x''_0} \qquad (5)$$

where $X'' = (x_3'' x_2'' x_1'' x_0'')$ is generated by shifting out all the trailing (least-significant) zeros of $X'$ by an arithmetic right shift followed by address mapping, i.e.,

$$X'' = \begin{cases} Y_L, & \text{if } x_4 = 1 \\ Y_L', & \text{if } x_4 = 0 \end{cases} \qquad (6)$$

where $Y_L$ and $Y_L'$ are derived by circularly shifting out all the trailing zeros of $X_L$ and $X_L'$, respectively.

Fig. 2 shows the structure and function of the LUT-based multiplier for L = 5 using the APC technique. It consists of a four-input LUT of 16 words that stores the APC values of the product words given in the sixth column of Table I, except on the last row, where 2A is stored for input X = (00000) instead of storing a "0" for input X = (10000). In addition, it consists of an address-mapping circuit and an add/subtract circuit. The address-mapping circuit generates the desired address $(x_3' x_2' x_1' x_0')$ according to (2). A straightforward implementation of address mapping is to multiplex $X_L$ and $X_L'$ using $x_4$ as the control bit. The address-mapping circuit can be optimized to a realization with three XOR gates, three AND gates, two OR gates, and a NOT gate, as shown in Fig. 2. According to (4), RESET can be generated by a control circuit (not shown in the figure).
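Relations (2) to (6) can be checked with a small behavioural model. The sketch below is an illustration in Python, not the authors' Verilog design. It assumes a nine-word table holding the eight odd multiples $P_i = (2i + 1)A$ at addresses 0 to 7 and the word 2A at address 1000, as described for Tables II and III, and it reads "shifting out zeros" as removing least-significant zeros. The names `build_lut` and `lut_multiply` are hypothetical.

```python
def build_lut(a):
    """Nine-word LUT: (2i + 1) * a at addresses 0..7, and 2a at address 8."""
    return [(2 * i + 1) * a for i in range(8)] + [2 * a]


def lut_multiply(x, a, lut=None):
    """Return a * x for a 5-bit unsigned x using relations (2) to (6)."""
    if not 0 <= x < 32:
        raise ValueError("x must be a 5-bit unsigned value")
    lut = build_lut(a) if lut is None else lut
    x4 = x >> 4
    x_low = x & 0xF

    # Eq. (2): X' is X_L for x4 = 1, else the 4-bit 2's complement of X_L.
    x_p = x_low if x4 else (-x_low) & 0xF

    # Eq. (6): shift out the least-significant zeros (at most three shifts).
    x_pp, shifts = x_p, 0
    while x_pp and (x_pp & 1) == 0:
        x_pp >>= 1
        shifts += 1
    if x_pp == 0:
        shifts = 3  # address 1000 holds 2a; three left shifts give 16a

    # Eq. (5): d3 = NOT x0'', and (d2 d1 d0) = (x3'' x2'' x1'').
    d3 = 1 - (x_pp & 1)
    address = (d3 << 3) | (x_pp >> 1)

    # Eq. (4): RESET is active only for X = 10000.
    reset = x4 == 1 and x_low == 0
    apc_word = 0 if reset else lut[address] << shifts

    # Eq. (3): add the APC word for x4 = 1, subtract it for x4 = 0.
    return 16 * a + apc_word if x4 else 16 * a - apc_word
```

For example, `lut_multiply(12, 7)` maps X = 01100 to X' = 0100, reads A at address 0000, shifts it left twice to obtain 4A, and returns 16A − 4A = 12A. The model needs nine stored words instead of the 32 words of a conventional LUT for L = 5.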
III. IMPLEMENTATION OF THE LUT-BASED MULTIPLIER USING THE PROPOSED LUT OPTIMIZATION SCHEME

This section deals with the implementation of the LUT-based multiplier using the proposed scheme, where the LUT is optimized by a combination of the APC scheme and a modified OMS technique.

A. Implementation of the LUT Multiplier Using APC for L = 5

Fig. 2: LUT-based multiplier using the APC technique for L = 5.

In the multiplier of Fig. 2, the output of the LUT is added with or subtracted from 16A, for $x_4 = 1$ or 0, respectively, according to (3) by the add/subtract cell. Hence, $x_4$ is used as the control for the add/subtract cell.

B. Implementation of the Optimized LUT Using Modified OMS

Fig. 3: APC-OMS combined LUT design for the multiplication of a W-bit fixed coefficient.

Fig. 3 shows the proposed APC-OMS combined design of the LUT for L = 5 and for any coefficient width W. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address-generation circuit, and a control circuit for generating the RESET signal and the control word $(s_1 s_0)$ for the barrel shifter.

The precomputed values of $A \times (2i + 1)$ are stored as $P_i$, for i = 0, 1, 2, ..., 7, in eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for the input X = (00000) at LUT address "1000," as specified in Table III. The decoder takes the 4-bit address and generates nine word-select lines to select the required word from the LUT. The 4-to-9-line decoder is a simple modification of a 3-to-8-line decoder, as shown in Fig. 4(a). The control bits $s_0$ and $s_1$, used by the barrel shifter to produce the desired number of shifts, are generated according to the relations

$$s_0 = \overline{x_0 + \overline{(x_1 + \overline{x_2})}} \qquad (7a)$$

$$s_1 = \overline{(x_0 + x_1)}. \qquad (7b)$$

As noted in Tables II and III, $(s_1 s_0)$ is the 2-bit binary equivalent of the required number of shifts. The RESET signal of (4) can alternatively be generated as ($d_3$ AND $x_4$). The generation of the control signals and the RESET signal is shown in Fig. 4(b). The address-generator circuit receives the 5-bit input operand X and maps it onto the 4-bit address word $(d_3 d_2 d_1 d_0)$ according to (5) and (6).

Fig. 4: (a) Four-to-nine-line decoder. (b) Control circuit.

IV. RESULTS AND DISCUSSION

Table IV

Comparison factors            Word size of LUT
                              4-bit     5-bit     6-bit
No. of slices (4656)          8         6         8
No. of 4-input LUTs (9312)    15        10        14
No. of IOs                    46        60        67
No. of bonded IOs (232)       19        30        56
Delay (ns)                    7.376     6.736     6.736

Fig. 5: Simulated results for L = 4.

In Fig. 5, the input bit sequence X = 4'h0 is applied and the output response q = 8'h03 is obtained.

Fig. 6: Simulated results for L = 5.

In Fig. 6, the input bit sequence X = 5'h00 is applied and the output response q = 9'h003 is obtained.

Fig. 7: Simulated results for L = 6.

In Fig. 7, the input bit sequence X = 6'h00 is applied and the output response q = 10'h003 is obtained.

As shown in Table IV, when the word size of the LUT multiplier increases from L = 4 to L = 5, the delay decreases from 7.376 ns to 6.736 ns, and from L = 5 to L = 6 there is no change in delay, with optimum utilization of memory.

This work has derived the possibility of using LUT-based multipliers to implement constant multiplication, especially for DSP applications. Future scope includes the implementation of the derived OMS-APC-based LUTs for higher input sizes, with different forms of decomposition, to obtain a suitable area-delay product.
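The operand decomposition mentioned as future scope can be illustrated with a hedged Python sketch. It assumes, purely for illustration, that a wide input is split into 5-bit segments, that each segment is multiplied by A through a small table of precomputed products (standing in here for an optimized LUT such as the one in Fig. 3), and that the partial products are combined by shift-and-add. The names `make_segment_lut` and `multiply_decomposed` and the segment width are hypothetical and are not taken from the paper.

```python
def make_segment_lut(a, seg_bits=5):
    """Precompute a * x for every value x of a seg_bits-wide segment."""
    return [a * x for x in range(1 << seg_bits)]


def multiply_decomposed(x, lut, seg_bits=5):
    """Multiply a non-negative x by the coefficient behind `lut`.

    x is split into seg_bits-wide segments, starting from the least
    significant end; each segment addresses the same table, and the
    partial products are accumulated with the matching left shift.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    mask = (1 << seg_bits) - 1
    product, shift = 0, 0
    while x:
        product += lut[x & mask] << shift
        x >>= seg_bits
        shift += seg_bits
    return product
```

With this arrangement the table size depends only on the segment width, not on the full input width: a 10-bit input needs two look-ups in a 32-word table and one addition, rather than one look-up in a 1024-word table.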
The LUT multipliers for L = W = 4, 5, and 6 bits are coded in Verilog HDL and synthesized in the Xilinx ISE 10.1i environment for a Spartan-3E FPGA (device XC3S500E, FG320 package, speed grade -5).

V. CONCLUSION

The LUTs are implemented as arrays of constants for efficient utilization of the area-delay product. The area and delay complexities of the multipliers estimated from the synthesis results are listed in Table IV. It is found that the proposed LUT design involves comparable area and time complexities for a word size of 4 bits, but for higher word sizes it has comparatively less delay.

REFERENCES

[1] P. K. Meher, "LUT optimization for memory-based computation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 4, Apr. 2010.
[2] International Technology Roadmap for Semiconductors. [Online]. Available: http://public.itrs.net/
[3] P. K. Meher, "New approach to LUT implementation and accumulation for memory-based multiplication," in Proc. IEEE ISCAS, May 2009, pp. 453–456.
[4] P. K. Meher, "New look-up-table optimizations for memory-based multiplication," in Proc. Int. Symp. Integr. Circuits (ISIC'09), Dec. 2009, to be published.
[5] P. K. Meher, "Memory-based hardware for resource-constrained digital signal processing systems," in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 1–4.
[6] P. K. Meher, "Systolic designs for DCT using a low-complexity concurrent convolutional formulation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 9, pp. 1041–1050, Sep. 2006.
[7] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125–1137, Jun. 2005.
[8] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, "A memory-efficient realization of cyclic convolution and its application to discrete cosine transform," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445–453, Mar. 2005.
[9] A. K. Sharma, Advanced Semiconductor Memories: Architectures, Designs, and Applications. Piscataway, NJ: IEEE Press, 2003.
[10] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "A systolic array architecture for the discrete sine transform," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347–2354, Sep. 2002.
[11] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consum. Electron., vol. 39, no. 3, pp. 619–629, Aug. 1993.
[12] J.-I. Guo, C.-M. Liu, and C.-W. Jen, "The efficient memory-based VLSI array design for DFT and DCT," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10, pp. 723–733, Oct. 1992.