Design and Implementation of High

advertisement
Journal of VLSI Signal Processing 34, 227–237, 2003
c 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
Design and Implementation of High-Performance RNS Wavelet Processors
Using Custom IC Technologies
JAVIER RAMÍREZ
Department of Electronics and Computer Technology, University of Granada, Spain
UWE MEYER-BÄSE
Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310-6046, USA
FRED TAYLOR
High-Speed Digital Architecture Laboratory, University of Florida, Gainesville, FL 32611-6130, USA
ANTONIO GARCÍA AND ANTONIO LLORIS
Department of Electronics and Computer Technology, University of Granada, Spain
Received September 6, 2001; Revised August 14, 2002; Accepted August 14, 2002
Abstract. The design of high performance, high precision, real-time digital signal processing (DSP) systems, such
as those associated with wavelet signal processing, is a challenging problem. This paper reports on the innovative
use of the residue number system (RNS) for implementing high-end wavelet filter banks. The disclosed system
uses an enhanced index-transformation defined over Galois fields to efficiently support different wavelet filter
instantiations without adding any extra cost or additional look-up tables (LUT). A selection of a small wordwidth
modulus set are the keys for attaining low-complexity and high-throughput. An exhaustive comparison against
existing two’s complement (2C) designs for different custom IC technologies was carried out. Results reveal a
performance improvement of up to 100% for high-precision RNS-based systems. These structures demonstrated to
be well suited for field programmable logic (FPL) assimilation as well as for CBIC (cell-based integrated circuit)
technologies.
Keywords: discrete wavelet transform, RNS arithmetic, custom integrated circuit, field-programmable logic
devices
1.
Introduction
There is a growing demand that digital image processing be performed at greater real-time bandwidths,
with higher precision, and lower complexity. Since
these systems are intrinsically SAXPY (S = AX + Y)
dominant, advanced solutions must overcome existing
arithmetic limitations. An arithmetic system capable
of surmounting this barrier is the residue number system, or RNS. Computer arithmeticians have long held
that the RNS offers a distinct MAC (multiply and accumulate) speed-area advantage [1] in SAXPY-intensive
applications. The development of new RNS structures
to better build signal processing systems with custom
IC technologies is a field of continuous interest and
study.
The evolution of the DSP market and technology makes necessary considering not only cell-based
ASICs but modern CPLD (complex programmable
logic device) families, such as Altera FLEX10K [2] or
228
Ramı́rez et al.
Virtex [3] in the design and implementation of signal
processing systems. ASIC are becoming the dominant
technology with the Y-2000 DSP CBIC (cell-based integrated circuit) ASIC market valued in excess of $13B,
compared to $8B for PDSPs (programmable digital signal processors). The FPL ASIC market is expected to
expand at a rate of 20% per annum rate, with DSP
applications leading the way. While FPL houses champion their technology as a provider of system-on-achip (SOC) DSP solutions, engineers have historically
viewed FPLs as a prototyping technology. It should be
noted that 40% of the current FPL design starts are
rated at 1,500 gates. This figure falls well below the
reported 50,000+ gates that account for 50% of standard cell ASIC designs [1]. When one considers that an
FPGA typically requires 10× more gates than a CBIC
to implement a common logic function, a typical 50k
gate standard cell ASIC design would require a large
500k gate FPGA. In order for FPL to begin to compete
in areas currently controlled by low-end standard cell,
a means must be found to more efficiently implement
DSP objects.
In [4], the RNS was used to design a wavelet transform using field-programmable logic (FPL). The design was compared to a two’s complement (2C), and
distributed arithmetic (DA) implementation. The RNS
solution was found to be superior to the 2C case and
compared favourably with the DA instantiation, but
unlike a DA design, was fully programmable. An enhanced RNS implementation of FIR filters by means
of the DA method can be found in [5]. Later this
RNS-DA mechanization was enhanced and applied to
wavelet filter banks [6, 7]. Thus, a DWT filterbank
having a 14-bit input, designed by means of the reported RNS-DA methodology, achieved a performance
improvement over the equivalent 2C system of up to
156.27%, and with the conversion stage not degrading
the throughput of the overall system. The RNS speed
advantage is gained by reducing arithmetic to a set of
concurrent operations that reside in small wordlength
non-communicating channels. This attribute makes the
RNS potentially attractive for implementing DSP objects with commercially available FPL technology and
CBIC technologies. Another demonstration of the RNS
benefits is found in [8] for use in orthogonal wavelet
filter bank applications. The filter banks were designed
to accept 8-bit input signals, process using 10-bit coefficients, and ran 23.45% and 96.58% faster than a
2C design for one and two octaves, respectively. A
weakness of the reported RNS solution was that fixed
coefficient multiplication was mapped into look-up
tables (LUTs). Consequently, the tables needed to be
re-programmed whenever a different set of wavelet coefficients were selected. This paper explores an efficient means of obtaining efficient discrete wavelet
transform (DWT) architectures defined over multiple
filter coefficient sets, by means of the RNS. The paper extends these ideas and develops a mechanism of
achieving synergy within FPL-defined environments
and cell-based CMOS IC technologies to better implement arithmetic intensive DSP solutions. The quantifiable benefits of this approach are studied in the
context of a programmable wavelet filterbank. The
work will build upon previous works and RNS-FPL
design studies [6, 7, 9–13].
2.
Index-Based Arithmetic over Galois Fields
There is emerging evidence that an arithmetic technology, called the RNS, can avoid the throughput degradation with the increase in precision and become a
custom IC enabling technology [3, 6, 7, 12, 13]. Computer arithmeticians have long held that the RNS offers
the best MAC speed-area advantage [1]. In the RNS,
numbers are represented in terms of a relatively prime
basis set (moduli set) P = {m 1 , . . . , m L }. Anynumber
L
X ∈ Z M = {0, 1, . . . , M − 1}, where M = i=1
mi ,
has a unique RNS representation X ↔ {X 1 , . . . , X L },
where X i = X mod m i . Like the 2C system, the
RNS arithmetic is exact as long as the final result is
bounded within the system’s dynamic range Z M . Mapping from the RNS back to the integer domain is defined
by the Chinese Remainder Theorem (CRT) [1]. RNS
arithmetic is defined by pair-wise modular operations:
Z = X ± Y ↔ X m 1 ± Ym 1 m 1 , . . . , X m L ± Ym L m L
Z = X × Y ↔ X m × Ym , . . . , X m × Ym 1
1
m1
L
L
mL
(1)
where |Q|m j denotes Q mod m j . The individual modular arithmetic operations are typically performed as
LUT calls to small memories. The RNS differs from
traditional weighted numbering systems in the fact that
the RNS arithmetic is a carry-free and can operate at a
constant speed over a wide range of precisions.
A variety of RNS multipliers are available, including pure LUT multipliers, square law multipliers [14],
index-transform multipliers [15, 16], and array multipliers [17]. Pure LUT multipliers require a double
precision LUT and are only a good choice for small
Design and Implementation
moduli. Square law multipliers require two LUTs, two
adders and a modulo adder. Galois field multipliers
are based on index transformation and require a single LUT to implement modulo multiplication in a DSP
system [13]. Array multipliers are used, for instance in
cryptographic systems, since large moduli are required
and any LUT-based multiplier would require very large
LUTs.
The index-transformation multiplier [15, 16] constitutes an efficient means of designing high performance,
reduced complexity DSP systems. They are based on
the mathematical properties associated with a Galois
fields denoted GF( p), where p is prime. All the
non-zero elements in a Galois field can be generated by exponentiating a primitive element denoted
g j . This property can be exploited for multiplication in GF(m j ) through the use of a well known isomorphism existing between the multiplicative group
Q = {1, 2, . . . , m j − 1}, with multiplication performed
modulo m j , and the additive group I = {0, 1, . . . ,
m j − 2}, with addition performed modulo (m j − 1).
The mapping is given by:
i
q = −1
j (i) = g j mod m j
|q j qk |m j = g |i j +ik |m j −1
(3)
Thus, the multiplication of two numbers, say q j and
qk , can be performed by adding exponents in a modular sense. The exponents, or indexes i j and i k , can
be pre-computed and stored in a lookup table. Adding
the indexes can be performed with a modulo (m j − 1)
adder, and the inverse index transformation of i j into
q j can be performed again using a LUT.
3.
thogonal wavelets, W j+1 is defined as the orthogonal complement of V j+1 in V j . Assuming a sequence
ḡ n ∈ V0 exists such that {ḡ n−2k }k∈Z is a basis for V1 , a sequence h̄ n ∈ V0 can then be found such that {h̄ n−2k }k∈Z
is a basis for W1 . Thus, V0 can be decomposed as:
V0 = W1 ⊕ W2 ⊕ · · · ⊕ W J ⊕ V J by simply iterating
the decomposition rule J times. An attractive feature
of the wavelet series expansion is that the underlying
multiresolution structure leads to an efficient discretetime algorithm based on a filter bank implementation.
The octave-band analysis filter bank computes the inner
products with the basis functions for W1 , W2 , . . . , W J ,
and V J . The orthogonal projection of the input signal
onto W1 , W2 , . . . , W J , and V J is computed after convolution with the synthesis filters. Then, the sequence
is decomposed into a coarse resolution version in V J
with added details in Wi (i = 1, 2, . . . , J ). Thus a 1-D
N th-order DWT decomposition of a sequence xn is
defined by the recurrent equations:
an(i) =
dn(i)
(2)
q ∈ Q, i ∈ I and multiplication, using index arithmetic,
is based on:
Discrete Wavelet Transform
Interest in the wavelet transform [18, 19] has
grown dramatically during the last decade [20–25].
Wavelet transforms are routinely used in speech, image and video signal processing, and other applications. Discrete wavelet transforms (DWT) are defined over a sequence of embedded closed subspaces,
V J ⊂ V J −1 ⊂ . . . ⊂ V1 ⊂ V0 , where V0 = l2 (Z ) is the
space of square-summable sequences. These subspaces
satisfy the upward completeness property, ∪V j =
l2 (Z ), j ∈ [0, J ]. Assume that any element in V j can be
uniquely expressed as the sum of two elements from
V j+1 and W j+1 , where V j = V j+1 ⊕ W j+1 . For or-
229
=
N
−1
k=0
N
−1
k=0
(i−1)
gk a2n−k
i = 1, 2, . . . , J
(i−1)
h k a2n−k
an(0)
(4)
≡ xn
where an(i) and dn(i) are level-i approximation and detail
sequences, respectively, and gk and h k (k = 0, 1, . . . ,
N −1) correspond to the low-pass and high-pass analysis filter coefficients. On the other hand, the signal
xn can be perfectly recovered through its multiresolution decomposition {an(J ) , dn(J ) , dn(J −1 ), . . . , dn(1) } by
iteration on:
âm(i−1)
 N /2−1
N
/2−1


+
ḡ2k â (i)
h̄ 2k dˆ(i)
m
m

−k

2
2 −k
k=0
k=0
=
N
/2−1
N
/2−1




+
ḡ2k+1 â (i)
h̄ 2k+1 dˆ(i)
m−1
m−1
k=0
2
−k
k=0
2
m even
−k
m odd
(5)
where ḡ k and h̄ k represent low-pass and high-pass
synthesis filter coefficients. In order to ensure perfect
recovery of the input signal, the coefficients of the
analysis and synthesis filter banks are conveniently related to each other according to the perfect reconstruction condition [18, 19].
4.
DWT Solutions Enhanced by the RNS
The design of wavelet filter banks using the RNS,
presents new opportunities. If the wavelet filter
230
Ramı́rez et al.
coefficients are fixed a priori, the LUT-based modulo multiplier represents the most efficient solution to
meeting low-latency and hardware efficiency [8]. However, if the wavelet filter coefficients are to be run-time
programmable, then the solution may require an unacceptably large number of LUTs to cover all coefficient
instances [13].
The use of index-transformation multipliers [15, 16],
and re-timing techniques leads to DWT filterbanks designs requiring a single 2n j × n j LUT for each filter coefficient, where n j = log2 (m j ), is the modulus wordwidth. Figure 1 shows the design based on
index transformations of a modulo m j channel, for
an octave-i 8-tap decomposition filter bank. The input sequence |an(i−1) |m j is decomposed into even and
odd sequences that are converted to the index-domain
by means of two LUTs storing the j function. Some
circuitry is added to the input to detect zero values
of the input sequences. Notice that clearable registers have been added to make zero the filter products in case zero is detected in the even- and oddindexed sequences. The reason for this is that multiplication by zero is not defined in the index domain
and must be considered to be a special case. After
the filter products are computed in the index-domain,
the LUT storing the function −1
j maps the indices
back to the RNS domain, and the remaining filtering
or addition stage is carried out by a modular adder
tree. The system exhibits symmetry for the computation of the approximation and detail sequences. The
complete RNS design consists of a number of parallel channels whose combined wordwidth is sufficient
to ensure that the dynamic range requirements are met
[18, 19].
In a similar manner, an index-based architecture
may be derived for the reconstruction (synthesis)
filter bank. The resulting architecture for the 1-D
IDWT is shown in Fig. 2. The two input sequences
|ân(i) |m j and |dˆn(i) |m j are converted into their index
representations by means of two parallel LUT storing
the j function. The filter products are computed by
parallel and efficient index-based multipliers with each
filter product requiring a single LUT storing −1
j and a
modulo (m j − 1) adder. Additional logic and clearable
registers are used to detect a zero input values and
make zero the corresponding filter products. Finally,
two separate modulo m j addition stages are used to
compute the output sequence |ân(i−1) |m j in even and
odd clock cycles as required by Eq. (5).
5.
Results and Discussion
An 8-tap 1-D DWT filter bank was used to illustrate
the design of 2C and RNS-based system. The comparison was carried out using VHDL models over Altera
FLEX10KE field programmable logic (FPL) devices
and two standard cell ASIC technologies. The selected ASIC reference libraries were the 0.8 µm MSU
SCMOS and the Chip Express 0.35 µm triple-level
metal CX3003 CMOS technologies. The 0.8 µm MSU
SCMOS cell library consists of a set of gates implementing low-level logic functions. The Chip Express
0.35 µm CMOS CX3003 technology is based on the
definition of a high-level module that can be configured
to operate in a very wide range of simple and complex
circuit functions and combinations. The logic module
is a universal function composed of three multiplexers
and one AND gate. It is based on the fact that a multiplexer can implement any logic function, which may
be either combinatorial or sequential.
Table 1 shows the total area and maximum sampling rate obtained for 8-tap RNS and 2C designs using
Table 1. Total area and maximum sampling rate obtained for an 8-tap DWT filter bank. Notice that, [x, y, z] represents x-bit input, y-bit
coefficients and z-bit output.
Two’s complement
Area
(µm2 or no. of modules)
RNS
F (MHz)
Area
(µm2 or no. of modules)
F (MHz)
Wordwidths and modulus set
0.8 µm
0.35 µm
0.8 µm
0.35 µm
0.8 µm
0.35 µm
0.8 µm
0.35 µm
[8, 10, 21] {31, 29, 23, 19, 17}
748608
19810
106.38
367.65
820360
29500
209.64
584.80
[10, 10, 23] {61, 59, 53, 47}
855016
21849
105.71
353.36
910688
36436
188.32
515.46
[12, 12, 27] {61, 59, 53, 47, 43}
1026864
25507
86.43
293.26
1138360
45545
188.32
515.46
[14, 12, 29] {61, 59, 53, 47, 43, 41}
1111376
28441
84.89
223.21
1366032
54654
188.32
515.46
Design and Implementation
an( i )
mj
+m
+m
CLR6
+m
CLR4
j
+m
j
+m
j
231
+m
j
j
+m
j
j
CLR2
CLR0
CLR0
CLR1
CLR2
CLR3
CLR4
CLR5
CLR6
CLR7
Shift register
Φ −j 1
Φ −j 1
n
Even sequence
n
Φ −j 1
n
Φ −j 1
n
Φ −j 1
n
Φ −j 1
n
Φ −j 1
n
n
2 j × nj
2 j × nj
2 j × nj
2 j × nj
2 j × nj
2 j × nj
2 j × nj
+m
+m
+m
+m
+m
+m
+m
+m
LUT
nj
Φ −j 1
n
2 j × nj
j
LUT
−1
j
LUT
−1
j
LUT
−1
j
LUT
−1
j
LUT
−1
j
LUT
−1
j
LUT
−1
Φ j ( g0 )
Φ j ( g1 )
Φ j ( g2 )
Φ j ( g3 )
Φ j ( g4 )
Φ j ( g5 )
Φ j ( g6 )
Φ j ( g7 )
Φ j (h0 )
Φ j ( h1 )
Φ j (h2 )
Φ j (h3 )
Φ j (h4 )
Φ j (h5 )
Φ j (h6 )
Φ j (h7 )
j
−1
j
−1
Φj
2 j × nj
an( i −1)
LUT
m j nj
n
Φj
2 j × nj
Odd sequence
LUT
nj
+m
j
+m
−1
Φ −j 1
CLR1
+m
−1
Φ −j 1
n
Shift register
j
CLR0
CLR1
+m
−1
CLR4
+m
−1
Φ −j 1
+m
−1
Φ −j 1
Φ −j 1
n
2 j × nj
LUT
CLR5
j
n
2 j × nj
LUT
CLR3
j
n
2 j × nj
LUT
CLR2
+m
−1
n
2 j × nj
LUT
j
Φ −j 1
n
2 j × nj
LUT
j
Φ −j 1
n
2 j × nj
LUT
+m
−1
Φ −j 1
n
2 j × nj
j
2 j × nj
LUT
CLR6
LUT
CLR7
CLR3
CLR5
CLR7
+m
+m
j
+m
+m
j
+m
j
+m
d n(i)
Figure 1.
+m
j
j
j
j
mj
Design of an RNS-based 1-D DWT architecture with index-transformation.
0.8 µm and 0.35 µm CBIC technologies. The solution
adopted here for the 2C arithmetic DWT architecture
was to use pipelined 2C multipliers based on Booth
encoding and Wallace trees [26]. Hardware complex-
ity and delay rapidly increase as the precision of the
input and coefficients increases. These facts are shown
in Table 1 and Fig. 3. Note that performance is considerably higher for an RNS-based solution than for a
232
Figure 2.
Ramı́rez et al.
Design of an RNS-based 1-D IDWT architecture with index-transformation.
Design and Implementation
233
250
Sampling rate (MHz)
5-bit RNS
200
6-bit RNS
7-bit RNS
150
100
2C
50
19
21
23
25
27
29
31
MSU CMOS Technology
Output precision
700
5-bit RNS
Sampling rate (MHz)
600
6-bit RNS
500
7-bit RNS
400
300
2C
200
100
19
21
23
25
Output precision
27
29
31
Chip Express CMOS Technology
Figure 3. Sampling rate as a function of the output precision for index-based and 2C arithmetic 1-D DWT filter banks implemented by means
of CBIC technologies.
2C design. In order to maximize the sample rate gain,
small wordwidth channels are desirable. However, only
prime moduli are suitable for use in an index arithmetic
system. For a 5-bit modulus set, the only admissible
moduli are {17, 19, 23, 29, 31} which leads to a 22.7-bit
maximum dynamic range. With a 6-bit modulus set, the
dynamic range can be up to 39 bits using the moduli set
{37, 41, 43, 47, 53, 59, 61}. The use of a 6-bit modulus
set was found to be attractive for the designs demanding
23-, 27- and 29-bit outputs, while for the design with
a 21-bit output a 5-bit modulus set is more efficient in
terms of area and speed. The efficient hardware implementation of modulo multiplication by means of index
transformations reveals 2C and RNS-based systems to
have similar hardware complexities, while an RNS solution will take advantage of higher speed and better
ASIC routability inside each channel. For instance, a
DWT filter bank enhanced by RNS arithmetic and having 21-, 23-, 27-, and 29-bit output is about 97%, 78%,
118% and 122% faster than a 2C design when using the
MSU SCMOS 0.8µm technology. Notice that, using a
six-bit modulus set for RNS wavelet filterbanks with 23
bits output or above, makes the overall throughput improvement to not steadily increase with the wordlength.
On the other hand, filters having 27 and 29 bits are twice
as fast as the 2C equivalent design.
FPL devices have recently generated interest for
use in DSP systems due to their ability to implement
custom solutions while maintaining flexibility through
device reprogramming. FPL technology providing embedded LUTs and dedicated logic blocks are potential
solutions for MAC-intensive RNS-based DSP systems.
234
Ramı́rez et al.
Table 2. Total resources required and maximum sampling rate obtained for a 4-tap DWT filter bank on an Altera FLEX10KE
device (grade-1). Notice that, [x, y, z] represents x-bit input, y-bit coefficients and z-bit output.
Two’s complement
Wordwidths and modulus set
[8, 9, 19] {61, 59, 53, 47}
RNS
No. of LEs
No. of EABs
(Memory bits)
F (MHz)
No. of LEs
No. of EABs
(Memory bits)
F (MHz)
3470
0
39.06
4 × 314
4 × 10 (15360)
135.13
[8, 10, 20] {61, 59, 53, 47}
3440
0
38.16
4 × 314
4 × 10 (15360)
135.13
[9, 10, 21] {61, 59, 53, 47}
3647
0
34.24
4 × 314
4 × 10 (15360)
135.13
[10, 10, 22] {61, 59, 53, 47}
4354
0
30.67
4 × 314
4 × 10 (15360)
135.13
[12, 12, 26] {61, 59, 53, 47, 43}
5446
0
27.93
5 × 314
5 × 10 (19200)
135.13
[14, 12, 28] {61, 59, 53, 47, 43}
7972
0
26.95
5 × 314
5 × 10 (19200)
135.13
Modern CPLDs consist of LUTs (frequently called
logic elements) and dedicated memory blocks. Depending on the family, each LE (logic element) includes
one or more variable input size LUTs (typical 25 × 1 or
24 × 1), fast carry propagation logic and one or more
flip-flops. Specifically, each LE included in the Altera
FLEX10K [5] device consists of a 24 × 1 LUT, an output register and dedicated logic for fast carry and cascade chains in arithmetic mode; a number of embedded
array blocks (EABs), providing a 2K-bit RAM or ROM
and configurable as 28 × 8, 29 × 4, 210 × 2 or 211 × 1,
are the cores for the implementation of RNS LUT-based
multipliers. Likewise, LUTs allow building specialized memory functions such as ROM or RAM. Table 2
shows the total resources required and maximum sampling rate obtained for a 4-tap DWT filter bank using a
grade–1 Altera FLEX10KE FPL device, as well as the
moduli selected to cover the dynamic range. Hardware
requirements were assessed in terms of the number of
LEs and EABs while performance was evaluated in
terms of the register-to-register path maximum delay.
Figure 4 shows the sampling rate as a function of the
output precision. The use of 5- and 6- bit modulus set
was found to be an attractive choice since performance
is only limited by the LUT operation. Thus, the presented RNS-enhanced DWT filterbanks, with 19-, 20-,
21- and 22-bit output, are about two times faster than
a 2C implementation. This dramatic increase in the
system performance is gained due to the fast implementation of the index multipliers taking advantage of
the FPL embedded resources. Thus, LUTs storing the
−1
j function were able to operate at 135 MHz (when
mapped on EABs) and 5- and 6-bit modulo adders took
advantage of the fast carry propagation paths inside
the 8-bit LEs of a logic array block (LAB). In opposition to a 2C design, the presented RNS-enabled DWT
150
5,6-bit RNS
Sampling rate (MHz)
130
7-bit RNS
110
90
70
50
30
2C
10
18
20
22
24
Output precision
Figure 4.
devices.
26
28
30
Altera FLEX10KE (grade -1)
Sampling rate as a function of the output precision for index-based and 2C arithmetic 1-D DWT filter banks implemented with FPL
Design and Implementation
235
Table 3. Area required for binary-to-RNS and ε-CRT RNS-to-binary converters. Notice that, [x, y, z] represents x-bit input, y-bit coefficients
and z-bit output.
0.8 µm MSU CA/NCA (µm2 )
0.35 µm CX3003 CA/NCA (No. of modules)
RNS → 2C 16-bit
RNS → 2C 16-bit
2C → RNS (No. of stages) output (No. of stages) 2C → RNS (No. of stages) output (No. of stages)
[8, 10, 21] {31, 29, 23, 19, 17}
8696/4200 (2)
24252/15845 (4)
180/85 (2)
502/361 (4)
[10, 10, 23] {61, 59, 53, 47}
14893/6045 (2)
43378/23457 (4)
314/119 (2)
910/534 (4)
[12, 12, 27] {61, 59, 53, 47, 43}
19545/7582 (2)
54878/29345 (4)
412/152 (2)
1137/668 (4)
[14, 12, 29] {61, 59, 53, 47, 43, 41}
23587/9280 (2)
64897/34920 (4)
490/186 (2)
1364/795 (4)
CA: Combinational area.
NCA: Non combinational area.
solutions do not need long propagation paths or communicate information or carries between LABs since
carry chains are no longer than 7-bit. This fact is motivated by the reduced wordlength of the RNS channels
and made possible to mask the FPL device architectural
limitations.
6.
Binary-to-RNS and RNS-to-Binary
Converters
A historical barrier to the use of the RNS at the
system-level has been the overhead penalty associated with binary-to-RNS and RNS-to-binary conversion. Binary-to-RNS conversion can be carried out
efficiently by decomposing the B-bit 2C word, say x,
into a weighted sum of smaller words x̄i (e.g., 4-bit
words). Equation (6) exemplifies the case where a 4-bit
decomposition, namely:
|x|m j
B−2
B−1
l = −2 x B−1 +
2 xl l=0
mj
p−1
= −2 B−1 x B−1 +
x̄i 24i i=0
(6)
mj
and requires only 24 × n j LUTs and a modulo addition
stage.
RNS-to-binary conversion implies the use of a
CRT (Chinese Remainder Theorem)-based converter.
However, CRT conversion can often be a barrier in
certain applications. The auto-scaling RNS-to-binary
converter (ε-CRT) proposed by Griffin et al. [27] can
overcome these drawbacks by using a few LUTs and
binary (modulo 2n ) adders. For a scaled n-bit binary
output, and a n j -bit modulus set, this converter needs
one 2n j × n LUT for each modulus of the RNS and
a n-bit adder tree. This solution results more appropriate for most applications demanding high data rates
[28]. Implementation data, using cell-based integrated
circuit, of the 2C-to-RNS and RNS-to-2C converters
are provided in Table 3 for 5- and 6-bit modulus sets.
The design for the 2C-to-RNS converter was derived
from Eq. (6) while the ε-CRT algorithm with a 16-bit
output was used for the RNS-to-2C converter. The operating frequency of both converters was adapted to the
system performance by inserting a number of pipeline
stages (shown in Table 3), so the high throughput of the
presented index-based RNS architectures for forward
and inverse wavelet transforms was not degraded when
converters were inserted in the system.
7.
Conclusion
This paper reports on the design and implementation
using FPL devices and CBIC technologies of forward
and inverse wavelet filter banks by means of the RNS.
The architecture is based on index-transformation over
Galois fields, and requires a single LUT for each filter
coefficient multiplication. Efficient circuitry is used to
detect a zero value in the input sequence, a requirement of the design paradigm. The RNS design was
compared to a 2C architecture of comparable size. The
reported methodology demonstrated a performance
improvement over a 2C design.
Acknowledgments
J. Ramı́rez, A. Garcı́a and A. Lloris were supported
by the Comisión Interministerial de Ciencia y Tecnologı́a (Spain) under project PB98-1354. CAD tools
and supporting material were provided by Altera Corp.,
San Jose, CA, under the Altera University Program,
and Synopsys Inc., Mountain View, CA, under the
236
Ramı́rez et al.
Synopsys University Program. We would like to thank
the anonymous reviewers for their valuable comments and suggestions that contributed to enhance the
material presented in this paper.
References
1. M.A. Sodersterand, W.K. Jenkins, G.A. Jullien, and F.J. Taylor,
Residue Number System Arithmetic: Modern Applications in
Digital Signal Processing. New York: IEEE Press, 1986.
2. Altera Corporation, FLEX10K Embedded Programmable Logic
Device Family, ver. 4.1, 2001.
3. Xilinx Inc., The Programmable Logic Data Book, 1999.
4. U. Meyer-Baese, J. Buros, W. Trautmann, and F. Taylor, “Fast
Implementation of Orthogonal Wavelet Filterbanks Using FieldProgrammable Logic,” in Proc. of the 1999 IEEE International
Conference on Acoustics, Speech and Signal Processing, 1999,
vol. 4, pp. 2119–2122.
5. A. Garcı́a, U. Meyer-Bäse, A. Lloris, and F. Taylor, “RNS Implementation of FIR Filters based on Distributed Arithmetic Using
Field-Programmable Logic,” in Proc. of the 1999 IEEE International Symposium on Circuits and Systems, 1999, vol. 1, pp. 486–
489.
6. J. Ramı́rez, A. Garcı́a, U. Meyer-Baese, F. Taylor, and A. Lloris,
“Implementation of RNS-Based Distributed Arithmetic Discrete
Wavelet Transform Architectures Using Field-Programmable
Logic,” Journal of VLSI Signal Processing (Special Issue on
Computer Arithmetic and Applications), 2003, vol. 33, pp. 171–
190.
7. J. Ramı́rez, A. Garcı́a, U. Meyer-Bäse, F. Taylor, P.G. Fernández,
and A. Lloris, “Design of RNS-Based Distributed Arithmetic
DWT Filterbanks,” in Proc. of the 2001 International Conference on Acoustics, Speech and Signal Processing ICASSP 2001,
May 2001, vol. 2, pp. 1193–1196.
8. J. Ramı́rez, A. Garcı́a, P. G. Fernández, L. Parrilla, and A. Lloris,
“RNS-FPL Merged Architectures for the Orthogonal DWT,”
Electronics Letters, vol. 36, no. 14, 2000, pp. 1198–1199.
9. V. Hamann and M. Sprachmann, “Fast Residual Arithmetic with
FPGAs,” in Proc. of the Workshop on Design Methodologies for
Microelectronics, Slovakia, Sept. 1995.
10. E. Di Claudio, F. Piazza, and G. Orlandi, “Fast Combinational
RNS Processors for DSP Applications,” IEEE Transactions on
Computers, May 1995, pp. 624–633.
11. H. Safiri, H. Ahamadi, G. Jullien, and V. Dimitrov, “Design and
FPGA Implementation of Systolic FIR Filters Using the Fermat
ALU,” Proc. of the Asilomar Conference on Signals, Systems
and Computers, Pacific Grove, 1996.
12. U. Meyer-Bäse, A. Garcı́a, and F. Taylor, “Implementation of
a Communications Channelizer Using FPGAs and RNS Arithmetic,” Journal of VLSI Signal Processing, May 2001, vol. 28,
no. 1/2, pp. 115–128.
13. J. Ramı́rez, P.G. Fernández, U. Meyer-Bäse, F. Taylor, A. Garcı́a,
and A. Lloris, “Index-based RNS DWT Architectures for Custom IC Designs,” in Proc. of the IEEE Workshop on Signal
Processing Systems SiPS 2001, Oct. 2001, pp. 70–79.
14. F. Taylor, “Large Moduli Multipliers for Signal Processing,”
IEEE Transactions on Circuits and Systems, vol. CAS-28, no. 7,
1981, pp. 731–736.
15. G.A. Jullien, “Implementation of Multiplication, Modulo a
Prime Number, with Applications to Number Theoretic Transforms”, IEEE Trans. on Computer, vol. C-29, no. 10, 1980,
pp. 899–905.
16. D. Radhakrishnan and Y. Yuan, “Fast and Highly Compact RNS
Multipliers,” International Journal of Electronics, vol. 70, no. 2,
1991, pp. 281–293.
17. A.A. Hiasat, “New Efficient Structure for a Modular Multiplier
for RNS,” IEEE Transactions on Computers, vol. 49, no. 2, 2000,
pp. 170–174.
18. M. Vetterli and J. Kovacevic, Wavelets and Subband Coding,
Prentice Hall, 1995.
19. G. Strang and T. Nguyen, Wavelets and Filter Banks, WelleslyCambridge Press, 1997.
20. K.K. Parhi and T. Nishitani, “VLSI Architectures for Discrete
Wavelet Transforms,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 1, June 1993, pp. 191–202.
21. J. Fridman and E.S. Manolakos, “Distributed Memory and Control VLSI Architectures for the 1-D Discrete Wavelet Transform,” VLSI Signal Processing, vol. VII, 1994, pp. 388–397.
22. C. Chakrabarti and M. Vishwanath, “Efficient Realizations of
the Discrete and Continuous Wavelet Transform: From Single
Chip Implementations to Mappings on SIMD Array Computers,” IEEE Transactions on Signal Processing, vol. 43, March
1995, pp. 759–771.
23. M. Vishwanath, R.M. Owens, and M.J. Irwin, “VLSI Architectures for the Discrete Wavelet Transform,” IEEE Transactions on
Circuits and Systems II, vol. 42, no. 5, May 1995, pp. 305–316.
24. T.C. Denk and K.K. Parhi, “VLSI Architectures for Lattice
Structure Based Orthogonal Discrete Wavelet Transforms,”
IEEE Transactions on Circuits and Systems II, vol. 44, no. 2,
Feb. 1997, pp. 129–132.
25. F. Marino, “A ‘Double-Face’ Bit-Serial Architecture for the 1-D
Discrete Wavelet Transform,” IEEE Transactions on Circuits
and Systems II, vol. 47, no. 1, Jan. 2000, pp. 65–71.
26. J. Pihl and E.J. Aas, “A Multiplier and Squared Generator for
High Performance DSP Applications,” in Proc. of the 39th
Midwest Symposium on Circuits and Systems, 1996.
27. M. Griffin, F.J. Taylor, and M. Sousa, “New scaling algorithms
for the Chinese Remainder Theorem,” in Proc. of the 22nd Asilomar Conf. on Signals, Syst. and Comp., CA, 1988.
28. J. Ramı́rez, A. Garcı́a, P.G. Fernández, L. Parrilla, and A. Lloris,
“A New Architecture to Compute the Discrete Cosine Transform
using the Quadratic Residue Number System,” in Proc. of the
2000 International Symposium on Circuits and Systems, vol. 5,
May 2000, pp. 321–324.
Javier Ramı́rez received the M.A.Sc. degree in Electronic Engineering in 1998, and the Ph.D degree in Electronic Enginnering in 2001,
Design and Implementation
all from the University of Granada. Since 2001, he is an Assistant professor at the Department of Electronics and Computer Technology
of the University of Granada (Spain). His research interest includes
residue number system arithmetic, high performance digital signal
processing and FPGA and VLSI signal processing systems. He is author of more than 50 technical journal and conference papers in these
areas. He has served as reviewer for several international journals and
conferences and is a member of IEEE.
jramirez@ieee.org
Uwe Meyer-Bäese received his BSEE, MSEE, and Ph.D. “Summa
cum Laude” from the Darmstadt University of Technology in 1987,
1989, and 1995, respectively. In 1994 and 95 he hold a post-doc position in the “Inst. of Brain Research” in Magdeburg. In 1996 and
1997 he was a Visiting Professor at the University of Florida. From
1998 to 2000 Dr. Meyer-Baese worked in the ASIC industry. He
is now a Professor in the Electrical and Computer Engineering Department at Florida State University. During his graduate studies he
worked part time for TEMIC, Siemens, Bosch, and Blaupunkt. He
holds 3 patents, has supervised more than 60 master thesis projects
in the DSP/FPGA area, and gave four lectures at the University of
Darmstadt in the DSP/FPGA area. He is author of three books including “Digital Signal Processing with Field Programmable Gate
Arrays” and “Fast Digital Signal Processing” published by SpringerVerlag. He received in 1997 the Max-Kade Award in Neuroengineering. Dr. Meyer-Baese is a IEEE, BME, SP and C&S society member.
Uwe.Meyer-Baese@ieee.org
Fred J. Taylor received his Ph.D. from the University of Colorado
in 1969. Since then he has held professional positions at Texas Instruments and the University of Texas at El Paso, Cincinnati, and
Florida where he is currently a Professor of Electrical and Computer Engineering and Computer and Information Science, along
with being president of the Athena Group, Inc. He has authored
237
over 100 archived papers, nine books, contributed chapters to four
monographs and encyclopedias, and holds four U.S. patents. His
professional interests include digital design and architecture, digital
signal processing, and engineering education.
fjt@hsdal.ufl.edu
Antonio Garcı́a received the M.A.Sc. degree in Electronic Engineering (being awarded the Nation Best Academic Record) in 1995,
the M.Sc. degree in Physics (majoring in Electronics) in 1997 and
the Ph.D. degree in Electronic Engineering in 1999, all from the
University of Granada (Spain). He was an Associate Professor at the
Department of Computer Engineering of the Universidad Autónoma
de Madrid before joining the Deparment of Electronics and Computer
Technology at the University of Granada as an Associate Professor.
His research interests include Residue Number System arithmetic,
the application of RNS to high-performance digital signal processing, VLSI and FPL implementation of RNS-based systems and the
use of RNS for low-power VLSI systems. He has authored over
50 technical papers in international journals and conferences and
has served as reviewer for several international journals and conferences. He is a member of IEEE and a C, C&S and SP Society
member.
agarcia@ieee.org
Antonio Lloris received the M.Sc. Degree and the Ph.D. degree
from the Universidad Complutense (Madrid). He was at the Centro
de Investigaciones Tècnicas de Guipúzcoa (Spain) as a researcher
and, as a lecturer, at the Escuela Tècnica Superior de Ingenieros
Industriales de San Sebastian. He was at the Universities of Malaga
and Murcia (Spain). Now he is a Full Professor at the University of
Granada (Spain). His research interest include multiple-value logic,
testing of digital circuits and signal processing using the residue
number system.
lloris@ditec.ugr.es
Download