Design of Memory Based Implementation Using LUT Multiplier Charan Kumar .k

International Journal of Engineering Trends and Technology- Volume4Issue1- 2013
Design of Memory Based Implementation
Using LUT Multiplier
Charan Kumar .k (1), S. Vikrama Narasimha Reddy (2), Neelima Koppala (3)
(1,2) M.Tech (VLSI) Student, (3) Assistant Professor, ECE Department,
Sree Vidyanikethan Engineering College (Autonomous), A. Rangampet, Tirupati.
Abstract- Multiplication is a major arithmetic operation in signal processing. In ALUs, the multiplier can use a look-up table (LUT) as memory for its computations. We do not find any significant work on LUT optimization for memory-based multiplication. In this project, anti-symmetric product coding (APC) and odd-multiple storage (OMS) are used for the LUT design of memory-based multipliers used in signal processing applications such as filter design. Each of these techniques reduces the LUT size by a factor of two. A different form of APC and a modified OMS scheme can be combined for efficient memory implementation, which reduces the LUT size to one-fourth of the conventional LUT. Owing to operand decomposition, the proposed LUT-based multiplier involves a lower area-delay product for higher word sizes than canonical-signed-digit (CSD)-based multipliers. The design is coded in Verilog HDL, synthesized using Xilinx ISE 10.1i and implemented on a Spartan-3E FPGA.

Key words- digital signal processing (DSP) chip, lookup-table (LUT)-based computing, memory-based computing, very large scale integration (VLSI).

I. INTRODUCTION
Due to the rapid development of technology, semiconductor devices are now prominent in every field. These devices operate fast, consume less power and area, and have become more efficient with respect to factors such as reliability, flexibility and scaling; this has led to significant growth and improvement, and the devices have become cheaper. Embedded memory has a dominating presence in SoCs, exceeding 90% of the total SoC [2]. Compared to logic components, semiconductor memory has a high transistor packing density, which is increasing at a fast rate [1]. Apart from that, memory-based computing structures offer further advantages over multiply-accumulate structures, such as greater potential for high throughput, low-latency implementation and less dynamic power consumption. Memory-based computing is well suited for many digital signal processing (DSP) algorithms, which involve multiplication with a fixed set of coefficients. Fig. 1 shows the block diagram of the conventional LUT-based multiplier.
ISSN: 2231-5381 http://www.internationaljournalssrg.org
Fig. 1: Conventional LUT-based multiplier.
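The conventional scheme of Fig. 1 can be illustrated with a minimal Python sketch: an L-bit input word X is used directly as the address of a table holding all 2^L pre-computed products A·X. The function names and the values A = 13, L = 5 are hypothetical, chosen only for illustration; this models the behaviour, not the hardware.

```python
def build_conventional_lut(A: int, L: int) -> list[int]:
    """Return the 2**L pre-computed products A*X, indexed by the address X."""
    return [A * x for x in range(2 ** L)]


def lut_multiply(lut: list[int], x: int) -> int:
    """Multiply by table lookup: the input word itself is the LUT address."""
    return lut[x]


A, L = 13, 5                       # hypothetical coefficient and word length
lut = build_conventional_lut(A, L)
print(len(lut), lut_multiply(lut, 19))   # table size and the product 13 * 19
```

The table grows as 2^L words, which is the cost that LUT optimization schemes aim to reduce.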
In Fig. 1, X is the input address and A is the fixed coefficient by which X is multiplied; the resulting product is taken as the output. Suppose X is a positive binary number of word length L. There are then $2^L$ possible values of X, and correspondingly $2^L$ possible values of the product $C = A \cdot X$. In memory-based multiplication, a conventional LUT of $2^L$ words provides the pre-computed product values for all possible values of X. When an L-bit binary word $X_i$ is applied as the input address, the LUT outputs the corresponding product $A \cdot X_i$. Therefore the product $A \cdot X_i$ is stored at location $X_i$ for $0 \le X_i \le 2^L - 1$.

Several architectures have been reported earlier for the memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters [5]-[12], but we could not find any significant work on LUT optimization. Recently, a new approach to LUT optimization was introduced in which only the odd multiples of the fixed coefficient are stored; it is termed the odd-multiple-storage (OMS) scheme [3]. The LUT size can also be reduced to half by another approach known as anti-symmetric product coding (APC), in which the product words are recoded as anti-symmetric pairs [4].

Although the APC approach reduces the LUT size by a factor of two, it takes more time and area to perform the 2's complement operation on the LUT output for sign modification and on the corresponding input. We find that, by combining the APC and OMS techniques, the 2's complement operations can be simplified, since the input address and the LUT output can always be transformed into odd integers. However, the OMS scheme of [3] does not give an efficient result when combined directly with the APC approach of [4], since that APC works with odd multiples only. Therefore, for efficient memory-based multiplication, a modified form of the OMS scheme is combined with a different form of APC. The modified OMS scheme [4] and the combined OMS–APC approach are discussed in Section 2, the implementation of the combined scheme is described in Section 3, and the design of the LUT-based multiplier is described in Section 4. Finally, the conclusion and the synthesis results of the proposed multiplier are presented in Section 5.

II. LUT OPTIMIZATIONS FOR MEMORY-BASED MULTIPLICATION
This section describes the APC technique and its optimization by combining it with a modified form of OMS.

A. APC for LUT Optimization:
For convenience, we assume both X and A to be positive integers to simplify the discussion. Table I shows the product values for the different values of the input X for L = 5.
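As a rough illustration of the APC idea for L = 5, the following Python sketch tabulates the paired rows (an input X, its 5-bit 2's complement 32 − X, the two products u and v, the shared 4-bit address and the stored word (v − u)/2) and then recovers every product as 16A plus or minus the stored word. The names `apc_rows` and `apc_multiply` and the coefficient A = 13 are assumptions made for this sketch; the handling of X = 0 and X = 16 follows the special cases noted in the text.

```python
def apc_rows(A: int, L: int = 5):
    """Return (X, u, X_comp, v, address, apc_word) for X = 1 .. 2**(L-1) - 1."""
    half = 2 ** (L - 1)                      # 16 for L = 5
    rows = []
    for x in range(1, half):
        x_comp = 2 ** L - x                  # L-bit 2's complement of X
        u, v = A * x, A * x_comp             # products on one row, u + v = 2**L * A
        rows.append((x, u, x_comp, v, x_comp - half, (v - u) // 2))
    return rows


def apc_multiply(A: int, x: int, L: int = 5) -> int:
    """Recover A*x from a half-size table of APC words."""
    half = 2 ** (L - 1)
    lut = [A * addr for addr in range(half)]          # APC word at address X'
    msb = x >> (L - 1)                                # x4 for L = 5
    x_low = x & (half - 1)                            # X_L
    address = x_low if msb else (-x_low) & (half - 1) # X_L or its 2's complement
    # X = 0 is the special case whose encoded word is 16A rather than 0.
    word = half * A if x == 0 else lut[address]
    return half * A + word if msb else half * A - word


A = 13                                       # hypothetical coefficient
for row in apc_rows(A)[:3]:
    print(row)
print(all(apc_multiply(A, x) == A * x for x in range(32)))
```

Only 16 words are kept instead of 32, at the price of one addition or subtraction with the fixed value 16A.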
Table I
APC words for different input values for L = 5

For X = (00000), the encoded word to be stored is 16A.

From Table I it is clear that the input word X in the third column of each row is the 2's complement of the input word X in the first column of the same row. In addition, the sum of the product values of the two input words on the same row is 32A. Let u and v be the product values in the second and fourth columns of a row, respectively. We can then write

$$u = \frac{u + v}{2} - \frac{v - u}{2}, \qquad v = \frac{u + v}{2} + \frac{v - u}{2},$$

and, since $u + v = 32A$,

$$u = 16A - \frac{v - u}{2}, \qquad v = 16A + \frac{v - u}{2}. \tag{1}$$

Hence the product values in the second and fourth columns of Table I show negative mirror symmetry. This symmetry can be used to reduce the LUT size: instead of storing u and v, only $(v - u)/2$ is stored for the pair of inputs on a given row. The fifth and sixth columns of the table show the 4-bit LUT address values and the corresponding coded words, respectively. Since this representation of the product values is derived from the anti-symmetric behavior of the products, we term it anti-symmetric product code. The 4-bit address $X' = (x'_3 x'_2 x'_1 x'_0)$ of the APC word is given by

$$X' = \begin{cases} X_L, & \text{if } x_4 = 1 \\ X'_L, & \text{if } x_4 = 0 \end{cases} \tag{2}$$

where $X_L = (x_3 x_2 x_1 x_0)$ is the four less significant bits of X and $X'_L$ is the 2's complement of $X_L$. The required product can be obtained by adding the stored value $(v - u)/2$ to, or subtracting it from, the fixed value 16A when $x_4$ is 1 or 0, respectively, i.e.,

$$\text{Product word} = 16A + (\text{sign value}) \times (\text{APC word}) \tag{3}$$

where sign value = 1 for $x_4 = 1$ and sign value = −1 for $x_4 = 0$. The product value for X = (10000) corresponds to the APC value "zero," which can be derived by resetting the LUT output instead of storing it in the LUT.

B. Modified OMS for LUT Optimization
As the name itself indicates, the OMS scheme stores only odd multiples of the fixed coefficient. For the multiplication of a binary word X of size L with a fixed coefficient A, instead of storing all $2^L$ possible product values, the LUT stores only the $2^L/2$ words corresponding to the odd multiples of A, while all the even multiples of A are derived from the odd multiples by left-shift operations. Based on these assumptions, the LUT for the multiplication of an L-bit input with a
W-bit coefficient could be designed by the following strategy.
1) A memory unit of $[(2^L/2) + 1]$ words of (W + L)-bit width is used to store the product values, where the first $(2^L/2)$ words are odd multiples of A, and the last word is zero.
2) A barrel shifter for producing a maximum of (L − 1) left shifts is used to derive all the even multiples of A.
3) The L-bit input word is mapped to the (L − 1)-bit address of the LUT by an address encoder, and control bits for the barrel shifter are derived by a control circuit.

Table II shows that the eight odd multiples $A \times (2i + 1)$ are stored in eight memory locations as $P_i$ for i = 0, 1, ..., 7. The even multiples 2A, 4A and 8A are derived by left-shift operations of A. Similarly, 6A and 12A are derived by left shifting 3A, while 10A and 14A are derived by left shifting 5A and 7A, respectively. All the even multiples of A are derived by a barrel shifter that produces a maximum of three left shifts.

Table II
OMS-based design of the LUT of APC words for L = 5

As follows from (3), the word to be stored for X = (00000) is not 0 but 16A, which can be obtained from A by four left shifts using a barrel shifter. However, if 16A is not derived from A, a maximum of only three left shifts is required to obtain all the other even multiples of A. A two-stage logarithmic barrel shifter suffices for a maximum of three shifts, whereas four shifts require a three-stage barrel shifter. For the input X = (00000), it is therefore more efficient in the modified OMS scheme to store 2A, so that the product 16A can be obtained by three arithmetic left shifts.

Table III shows the product values and encoded words for the input words X = (00000) and (10000). For X = (00000), the required encoded word 16A is obtained by a 3-bit left shift of 2A [stored at address (1000)]. For X = (10000), the APC word "0" is derived by resetting the LUT output with an active-high RESET signal given by

$$\text{RESET} = \overline{(x_0 + x_1 + x_2 + x_3)} \cdot x_4. \tag{4}$$

Table III
Products and encoded words for X = (00000) and (10000)
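The modified OMS storage for L = 5 can be sketched in Python as follows. Nine words are kept (the eight odd multiples A × (2i + 1) and 2A), and every 4-bit APC address is served by one lookup followed by at most three left shifts. The helper names, the trailing-zero computation used in place of the address encoder and control circuit, and the value A = 13 are assumptions of this sketch, not the authors' circuit.

```python
def build_oms_lut(A: int) -> list[int]:
    """Nine stored words: P_i = A*(2i+1) for i = 0..7, then 2A as the last word."""
    return [A * (2 * i + 1) for i in range(8)] + [2 * A]


def oms_apc_word(lut: list[int], addr4: int, x4: int) -> int:
    """Return the APC word for the 4-bit address addr4, given the input MSB x4."""
    if addr4 == 0:
        # X = (10000): LUT output is reset to 0.
        # X = (00000): 16A is produced as 2A shifted left three times.
        return 0 if x4 else lut[8] << 3
    shifts = (addr4 & -addr4).bit_length() - 1   # trailing zeros, at most 3
    odd = addr4 >> shifts                        # odd part of the address
    return lut[odd >> 1] << shifts               # odd multiple, then left shifts


A = 13                                           # hypothetical coefficient
lut = build_oms_lut(A)
print([oms_apc_word(lut, a, 1) // A for a in range(16)])   # multiples 0 .. 15
print(oms_apc_word(lut, 0, 0) // A)                        # 16 for X = (00000)
```

Storing 2A rather than A for the X = (00000) case keeps the largest shift at three positions, which is what allows a two-stage barrel shifter.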
From Tables II and III, the 5-bit input word X can be mapped into a 4-bit LUT address $(d_3 d_2 d_1 d_0)$ by a simple set of mapping relations

$$d_i = x''_{i+1} \ \text{for } i = 0, 1, 2, \quad \text{and} \quad d_3 = \overline{x''_0} \tag{5}$$

where $X'' = (x''_3 x''_2 x''_1 x''_0)$ is generated by shifting out all the trailing zeros of $X'$ by an arithmetic right shift followed by address mapping, i.e.,

$$X'' = \begin{cases} Y_L, & \text{if } x_4 = 1 \\ Y'_L, & \text{if } x_4 = 0 \end{cases} \tag{6}$$

where $Y_L$ and $Y'_L$ are derived by circularly shifting out all the trailing zeros of $X_L$ and $X'_L$, respectively.

III. IMPLEMENTATION OF THE LUT-BASED MULTIPLIER USING THE PROPOSED LUT OPTIMIZATION SCHEME
This section deals with the implementation of the LUT-based multiplier using the proposed scheme, where the LUT is optimized by a combination of the APC scheme and a modified OMS technique.

A. Implementation of the LUT Multiplier Using APC for L = 5
Fig. 2 shows the structure and function of the LUT-based multiplier for L = 5 using the APC technique. It consists of a four-input LUT of 16 words to store the APC values of the product words as given in the sixth column of Table I, except on the last row, where 2A is stored for input X = (00000) instead of storing a "0" for input X = (10000). Besides, it consists of an address-mapping circuit and an add/subtract circuit. The address-mapping circuit generates the desired address $(x'_3 x'_2 x'_1 x'_0)$ according to (2). A straightforward implementation of address mapping multiplexes $X_L$ and $X'_L$ using $x_4$ as the control bit. The address-mapping circuit can be optimized to three XOR gates, three AND gates, two OR gates and a NOT gate, as shown in Fig. 2. According to (4), RESET can be generated by a control circuit (not shown in the figure). The output of the LUT is added to or subtracted from 16A, for $x_4 = 1$ or 0, respectively, according to (3) by the add/subtract cell. Hence, $x_4$ is used as the control for the add/subtract cell.

Fig. 2: LUT-based multiplier using the APC technique for L = 5.

B. Implementation of the Optimized LUT Using Modified OMS

Fig. 3: APC-OMS combined LUT design for multiplication of a W-bit fixed coefficient.
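A behavioural sketch of the combined APC–OMS data path of Fig. 3 for L = 5 is given below in Python. It chains the address mapping of (2), the shift-out and address relations of (5), a nine-word LUT, the barrel shifter, RESET generated as (d3 AND x4), and the add/subtract cell. The gate-level expressions used for the shifter control bits s1 and s0 are one possible realization assumed for this sketch (they are validated only by the exhaustive check at the end), and the function names and A = 13 are hypothetical.

```python
def build_lut(A: int) -> list[int]:
    """Nine stored words: P_i = A*(2i+1) for i = 0..7, then 2A at address 1000."""
    return [A * (2 * i + 1) for i in range(8)] + [2 * A]


def multiply(A: int, lut: list[int], x: int) -> int:
    """Compute A*x for a 5-bit input x using the nine-word LUT."""
    x0, x1, x2, x4 = x & 1, (x >> 1) & 1, (x >> 2) & 1, (x >> 4) & 1
    x_low = x & 0xF                                  # X_L
    xp = x_low if x4 else (-x_low) & 0xF             # X' as in eq. (2)

    # Barrel-shifter control word (s1 s0) = number of left shifts (assumed gate form).
    s1 = 1 - (x0 | x1)                               # NOR(x0, x1)
    s0 = 1 - (x0 | (1 - (x1 | (1 - x2))))
    shifts = (s1 << 1) | s0

    xpp = xp >> shifts                               # X'': trailing zeros shifted out
    d = ((1 - (xpp & 1)) << 3) | (xpp >> 1)          # d3 = NOT x''0, d2 d1 d0 = x''3 x''2 x''1
    reset = (d >> 3) & x4                            # RESET = d3 AND x4

    word = 0 if reset else lut[d] << shifts          # APC word after the barrel shifter
    return 16 * A + word if x4 else 16 * A - word    # add/subtract cell controlled by x4


A = 13                                               # hypothetical coefficient
lut = build_lut(A)
print(len(lut), all(multiply(A, lut, x) == A * x for x in range(32)))
```

With nine stored words instead of 32, the sketch reproduces all 32 products, which corresponds to the roughly one-fourth LUT size claimed for the combined scheme.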
Fig. 3 shows the proposed APC–OMS combined design of the LUT for L = 5 and any coefficient width W. It consists of an LUT of nine words of (W + 4)-bit width, a four-to-nine-line address decoder, a barrel shifter, an address generation circuit, and a control circuit for generating the RESET signal and the control word (s1 s0) for the barrel shifter.

Fig. 4: (a) Four-to-nine-line address decoder. (b) Control circuit.

The pre-computed values of $A \times (2i + 1)$ are stored as $P_i$, for i = 0, 1, 2, ..., 7, in eight consecutive locations of the memory array, as specified in Table II, while 2A is stored for the input X = (00000) at LUT address "1000," as specified in Table III. The decoder takes the 4-bit address and generates nine word-select lines to select the required word from the LUT. The 4-to-9-line decoder is a simple modification of a 3-to-8-line decoder, as shown in Fig. 4(a). The control bits s0 and s1 used by the barrel shifter to produce the desired number of shifts are generated according to the relations

$$s_0 = \overline{x_0 + \overline{(x_1 + \overline{x_2})}} \tag{7a}$$

$$s_1 = \overline{(x_0 + x_1)} \tag{7b}$$

As noted in Tables II and III, (s1 s0) is the 2-bit binary equivalent of the required number of shifts. The RESET signal of (4) can alternatively be generated as (d3 AND x4). The generation of the control signals and the RESET signal is shown in Fig. 4(b). The address-generator circuit receives the 5-bit input operand X and maps it onto the 4-bit address word (d3 d2 d1 d0) according to (5) and (6).

IV. RESULTS AND DISCUSSION

Table IV
Comparison factors for LUT word sizes of 4, 5 and 6 bits

Comparison factor              4-bit    5-bit    6-bit
No. of 4-input LUTs (9312)     15       10       14
No. of slices (4656)           8        6        8
No. of IOs                     46       60       67
No. of bonded IOs (232)        19       30       56
Delay (ns)                     7.376    6.736    6.736

Fig. 5: Simulated results for L = 4.
In Fig. 5, the input bit sequence X = 4'h0 is applied and the output response q = 8'h03 is obtained.
Fig. 6: Simulated results for L = 5.
In Fig. 6, the input bit sequence X = 5'h00 is applied and the output response q = 9'h003 is obtained.

Fig. 7: Simulated results for L = 6.
In Fig. 7, the input bit sequence X = 6'h00 is applied and the output response q = 10'h003 is obtained. As shown in Table IV, when the word size of the LUT multiplier increases from L = 4 to L = 5 the delay decreases from 7.376 ns to 6.736 ns, and for L = 6 there is no change in delay with respect to L = 5, with optimum utilization of memory. The LUT multipliers for L = W = 4, 5 and 6 bits are coded in Verilog HDL and synthesized in the Xilinx ISE 10.1i environment for a Spartan-3E FPGA (device XC3S500E, FG320 package, speed grade -5).

V. CONCLUSION
The LUTs are implemented as arrays of constants for efficient utilization of area-delay product. The area and delay complexities of the multipliers estimated from the synthesis results are listed in Table IV. It is found that the proposed LUT design involves comparable area and time complexities for a word size of 4 bits, but for higher word sizes it has comparatively less delay. In this brief, we have derived the possibility of using LUT-based multipliers for the implementation of constant-coefficient operations such as multiplication, especially for DSP applications. Future scope for this work is the implementation of the derived OMS–APC-based LUTs for higher input sizes, with different forms of decomposition, for a suitable area-delay product.

REFERENCES
[1] P. K. Meher, "LUT optimization for memory-based computation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 4, Apr. 2010.
[2] International Technology Roadmap for Semiconductors. [Online]. Available: http://public.itrs.net/
[3] P. K. Meher, "New approach to LUT implementation and accumulation for memory-based multiplication," in Proc. IEEE ISCAS, May 2009, pp. 453–456.
[4] P. K. Meher, "New look-up-table optimizations for memory-based multiplication," in Proc. Int. Symp. Integr. Circuits (ISIC'09), Dec. 2009, to be published.
[5] P. K. Meher, "Memory-based hardware for resource-constrained digital signal processing systems," in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 1–4.
[6] P. K. Meher, "Systolic designs for DCT using a low-complexity concurrent convolutional formulation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 9, pp. 1041–1050, Sep. 2006.
[7] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST," IEEE Trans.
Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125–
1137, Jun. 2005.
[8] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, "A memory-efficient realization of cyclic convolution and its application to discrete cosine transform," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445–453, Mar. 2005.
[9] A. K. Sharma, Advanced Semiconductor Memories: Architectures, Designs, and Applications. Piscataway, NJ: IEEE Press, 2003.
[10] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, "A systolic array architecture for the discrete sine transform," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347–2354, Sep. 2002.
[11] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consum. Electron., vol. 39, no. 3, pp. 619–629, Aug. 1993.
[12] J.-I. Guo, C.-M. Liu, and C.-W. Jen, “The efficient
memory-based VLSI array design for DFT and DCT,”
IEEE Trans. Circuits Syst. II, Analog Digit. Signal
Process., vol. 39, no. 10, pp. 723–733, Oct. 1992.