Journal of VLSI Signal Processing 34, 227–237, 2003
© 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

Design and Implementation of High-Performance RNS Wavelet Processors Using Custom IC Technologies

JAVIER RAMÍREZ
Department of Electronics and Computer Technology, University of Granada, Spain

UWE MEYER-BÄSE
Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310-6046, USA

FRED TAYLOR
High-Speed Digital Architecture Laboratory, University of Florida, Gainesville, FL 32611-6130, USA

ANTONIO GARCÍA AND ANTONIO LLORIS
Department of Electronics and Computer Technology, University of Granada, Spain

Received September 6, 2001; Revised August 14, 2002; Accepted August 14, 2002

Abstract. The design of high performance, high precision, real-time digital signal processing (DSP) systems, such as those associated with wavelet signal processing, is a challenging problem. This paper reports on the innovative use of the residue number system (RNS) for implementing high-end wavelet filter banks. The disclosed system uses an enhanced index-transformation defined over Galois fields to efficiently support different wavelet filter instantiations without adding any extra cost or additional look-up tables (LUT). The selection of a small-wordwidth modulus set is the key to attaining low complexity and high throughput. An exhaustive comparison against existing two’s complement (2C) designs for different custom IC technologies was carried out. Results reveal a performance improvement of up to 100% for high-precision RNS-based systems. These structures proved to be well suited to field programmable logic (FPL) assimilation as well as to CBIC (cell-based integrated circuit) technologies.

Keywords: discrete wavelet transform, RNS arithmetic, custom integrated circuit, field-programmable logic devices

1. Introduction

There is a growing demand that digital image processing be performed at greater real-time bandwidths, with higher precision, and lower complexity. Since these systems are intrinsically SAXPY (S = AX + Y) dominant, advanced solutions must overcome existing arithmetic limitations. An arithmetic system capable of surmounting this barrier is the residue number system, or RNS. Computer arithmeticians have long held that the RNS offers a distinct MAC (multiply and accumulate) speed-area advantage [1] in SAXPY-intensive applications.

The development of new RNS structures to better build signal processing systems with custom IC technologies is a field of continuous interest and study. The evolution of the DSP market and technology makes it necessary to consider not only cell-based ASICs but also modern CPLD (complex programmable logic device) families, such as Altera FLEX10K [2] or Xilinx Virtex [3], in the design and implementation of signal processing systems. ASICs are becoming the dominant technology, with the year-2000 DSP CBIC (cell-based integrated circuit) ASIC market valued in excess of $13B, compared to $8B for PDSPs (programmable digital signal processors). The FPL ASIC market is expected to expand at a rate of 20% per annum, with DSP applications leading the way. While FPL houses champion their technology as a provider of system-on-a-chip (SOC) DSP solutions, engineers have historically viewed FPLs as a prototyping technology. It should be noted that 40% of the current FPL design starts are rated at 1,500 gates. This figure falls well below the reported 50,000+ gates that account for 50% of standard cell ASIC designs [1]. When one considers that an FPGA typically requires 10× more gates than a CBIC to implement a common logic function, a typical 50k gate standard cell ASIC design would require a large 500k gate FPGA.
In order for FPL to begin to compete in areas currently controlled by low-end standard cell, a means must be found to implement DSP objects more efficiently. In [4], the RNS was used to design a wavelet transform using field-programmable logic (FPL). The design was compared with two’s complement (2C) and distributed arithmetic (DA) implementations. The RNS solution was found to be superior to the 2C case and compared favourably with the DA instantiation but, unlike a DA design, was fully programmable. An enhanced RNS implementation of FIR filters by means of the DA method can be found in [5]. Later, this RNS-DA mechanization was enhanced and applied to wavelet filter banks [6, 7]. Thus, a DWT filter bank having a 14-bit input, designed by means of the reported RNS-DA methodology, achieved a performance improvement over the equivalent 2C system of up to 156.27%, with the conversion stage not degrading the throughput of the overall system.

The RNS speed advantage is gained by reducing arithmetic to a set of concurrent operations that reside in small-wordlength, non-communicating channels. This attribute makes the RNS potentially attractive for implementing DSP objects with commercially available FPL and CBIC technologies. Another demonstration of the RNS benefits is found in [8] for use in orthogonal wavelet filter bank applications. The filter banks were designed to accept 8-bit input signals and process them using 10-bit coefficients, and ran 23.45% and 96.58% faster than a 2C design for one and two octaves, respectively. A weakness of the reported RNS solution was that fixed coefficient multiplication was mapped into look-up tables (LUTs). Consequently, the tables needed to be re-programmed whenever a different set of wavelet coefficients was selected. This paper explores a means of obtaining efficient discrete wavelet transform (DWT) architectures defined over multiple filter coefficient sets, by means of the RNS.
The paper extends these ideas and develops a mechanism for achieving synergy within FPL-defined environments and cell-based CMOS IC technologies to better implement arithmetic-intensive DSP solutions. The quantifiable benefits of this approach are studied in the context of a programmable wavelet filter bank. The work builds upon previous works and RNS-FPL design studies [6, 7, 9–13].

2. Index-Based Arithmetic over Galois Fields

There is emerging evidence that an arithmetic technology, called the RNS, can avoid the throughput degradation that accompanies increasing precision and become a custom IC enabling technology [3, 6, 7, 12, 13]. Computer arithmeticians have long held that the RNS offers the best MAC speed-area advantage [1]. In the RNS, numbers are represented in terms of a relatively prime basis set (moduli set) $P = \{m_1, \ldots, m_L\}$. Any number $X \in Z_M = \{0, 1, \ldots, M-1\}$, where $M = \prod_{i=1}^{L} m_i$, has a unique RNS representation $X \leftrightarrow \{X_1, \ldots, X_L\}$, where $X_i = X \bmod m_i$. Like the 2C system, RNS arithmetic is exact as long as the final result is bounded within the system’s dynamic range $Z_M$. Mapping from the RNS back to the integer domain is defined by the Chinese Remainder Theorem (CRT) [1]. RNS arithmetic is defined by pair-wise modular operations:

$$
\begin{aligned}
Z = X \pm Y &\leftrightarrow \left\{ |X_1 \pm Y_1|_{m_1}, \ldots, |X_L \pm Y_L|_{m_L} \right\} \\
Z = X \times Y &\leftrightarrow \left\{ |X_1 \times Y_1|_{m_1}, \ldots, |X_L \times Y_L|_{m_L} \right\}
\end{aligned}
\tag{1}
$$

where $|Q|_{m_j}$ denotes $Q \bmod m_j$. The individual modular arithmetic operations are typically performed as LUT calls to small memories. The RNS differs from traditional weighted numbering systems in that RNS arithmetic is carry-free and can operate at a constant speed over a wide range of precisions. A variety of RNS multipliers are available, including pure LUT multipliers, square law multipliers [14], index-transform multipliers [15, 16], and array multipliers [17].
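Before turning to the individual multiplier types, the channel-wise arithmetic of Eq. (1) and the CRT mapping back to the integers can be illustrated with a minimal software sketch. It is not a model of the hardware: the function names (`to_rns`, `rns_add`, `rns_mul`, `from_rns`) are hypothetical, and the moduli are simply the 5-bit set {31, 29, 23, 19, 17} used later in the paper.

```python
from math import prod

# Assumption for illustration: the 5-bit prime modulus set quoted in Section 5.
MODULI = (31, 29, 23, 19, 17)
M = prod(MODULI)  # dynamic range Z_M = {0, ..., M - 1}


def to_rns(x, moduli=MODULI):
    """Forward mapping X -> {X_1, ..., X_L}, with X_i = X mod m_i."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Channel-wise modular addition, first line of Eq. (1)."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise modular multiplication, second line of Eq. (1)."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Textbook CRT reconstruction (not the epsilon-CRT converter of Section 6)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = big_m // m
        # pow(m_hat, -1, m) is the multiplicative inverse of m_hat modulo m
        total += r * m_hat * pow(m_hat, -1, m)
    return total % big_m


x, y = 1234, 567
product = from_rns(rns_mul(to_rns(x), to_rns(y)))
total = from_rns(rns_add(to_rns(x), to_rns(y)))
```

No carries cross between channels in `rns_add` or `rns_mul`; the result is exact here only because both `x * y` and `x + y` are smaller than `M`.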
Pure LUT multipliers require a double precision LUT and are only a good choice for small moduli. Square law multipliers require two LUTs, two adders and a modulo adder. Galois field multipliers are based on index transformation and require a single LUT to implement modulo multiplication in a DSP system [13]. Array multipliers are used, for instance in cryptographic systems, since large moduli are required and any LUT-based multiplier would require very large LUTs.

The index-transformation multiplier [15, 16] constitutes an efficient means of designing high performance, reduced complexity DSP systems. It is based on the mathematical properties of Galois fields, denoted GF(p), where p is prime. All the non-zero elements in a Galois field can be generated by exponentiating a primitive element, denoted $g_j$. This property can be exploited for multiplication in GF($m_j$) through the use of a well known isomorphism existing between the multiplicative group $Q = \{1, 2, \ldots, m_j - 1\}$, with multiplication performed modulo $m_j$, and the additive group $I = \{0, 1, \ldots, m_j - 2\}$, with addition performed modulo $(m_j - 1)$. The mapping is given by:

$$
q = \Phi_j^{-1}(i) = g_j^{\,i} \bmod m_j, \qquad q \in Q,\ i \in I
\tag{2}
$$

and multiplication, using index arithmetic, is based on:

$$
|q_j q_k|_{m_j} = \left| g_j^{\,|i_j + i_k|_{m_j - 1}} \right|_{m_j}
\tag{3}
$$

Thus, the multiplication of two numbers, say $q_j$ and $q_k$, can be performed by adding exponents in a modular sense. The exponents, or indexes $i_j$ and $i_k$, can be pre-computed and stored in a look-up table. Adding the indexes can be performed with a modulo $(m_j - 1)$ adder, and the inverse index transformation of $i_j$ into $q_j$ can be performed again using a LUT.

3. Discrete Wavelet Transform

Interest in the wavelet transform [18, 19] has grown dramatically during the last decade [20–25]. Wavelet transforms are routinely used in speech, image and video signal processing, and other applications. Discrete wavelet transforms (DWT) are defined over a sequence of embedded closed subspaces, $V_J \subset V_{J-1} \subset \ldots \subset V_1 \subset V_0$, where $V_0 = l_2(Z)$ is the space of square-summable sequences. These subspaces satisfy the upward completeness property, $\cup V_j = l_2(Z)$, $j \in [0, J]$. Assume that any element in $V_j$ can be uniquely expressed as the sum of two elements from $V_{j+1}$ and $W_{j+1}$, where $V_j = V_{j+1} \oplus W_{j+1}$. For orthogonal wavelets, $W_{j+1}$ is defined as the orthogonal complement of $V_{j+1}$ in $V_j$. Assuming a sequence $\bar{g}_n \in V_0$ exists such that $\{\bar{g}_{n-2k}\}_{k \in Z}$ is a basis for $V_1$, a sequence $\bar{h}_n \in V_0$ can then be found such that $\{\bar{h}_{n-2k}\}_{k \in Z}$ is a basis for $W_1$. Thus, $V_0$ can be decomposed as:

$$
V_0 = W_1 \oplus W_2 \oplus \cdots \oplus W_J \oplus V_J
$$

by simply iterating the decomposition rule J times.

An attractive feature of the wavelet series expansion is that the underlying multiresolution structure leads to an efficient discrete-time algorithm based on a filter bank implementation. The octave-band analysis filter bank computes the inner products with the basis functions for $W_1, W_2, \ldots, W_J$, and $V_J$. The orthogonal projection of the input signal onto $W_1, W_2, \ldots, W_J$, and $V_J$ is computed after convolution with the synthesis filters. Then, the sequence is decomposed into a coarse resolution version in $V_J$ with added details in $W_i$ ($i = 1, 2, \ldots, J$). Thus, a 1-D Nth-order DWT decomposition of a sequence $x_n$ is defined by the recurrent equations:

$$
\begin{aligned}
a_n^{(i)} &= \sum_{k=0}^{N-1} g_k\, a_{2n-k}^{(i-1)} \\
d_n^{(i)} &= \sum_{k=0}^{N-1} h_k\, a_{2n-k}^{(i-1)}
\end{aligned}
\qquad i = 1, 2, \ldots, J, \qquad a_n^{(0)} \equiv x_n
\tag{4}
$$

where $a_n^{(i)}$ and $d_n^{(i)}$ are level-i approximation and detail sequences, respectively, and $g_k$ and $h_k$ ($k = 0, 1, \ldots, N-1$) correspond to the low-pass and high-pass analysis filter coefficients. On the other hand, the signal $x_n$ can be perfectly recovered through its multiresolution decomposition $\{a_n^{(J)}, d_n^{(J)}, d_n^{(J-1)}, \ldots, d_n^{(1)}\}$ by iteration on:

$$
\hat{a}_m^{(i-1)} =
\begin{cases}
\displaystyle \sum_{k=0}^{N/2-1} \bar{g}_{2k}\, \hat{a}^{(i)}_{\frac{m}{2}-k} + \sum_{k=0}^{N/2-1} \bar{h}_{2k}\, \hat{d}^{(i)}_{\frac{m}{2}-k}, & m \text{ even} \\[2ex]
\displaystyle \sum_{k=0}^{N/2-1} \bar{g}_{2k+1}\, \hat{a}^{(i)}_{\frac{m-1}{2}-k} + \sum_{k=0}^{N/2-1} \bar{h}_{2k+1}\, \hat{d}^{(i)}_{\frac{m-1}{2}-k}, & m \text{ odd}
\end{cases}
\tag{5}
$$

where $\bar{g}_k$ and $\bar{h}_k$ represent low-pass and high-pass synthesis filter coefficients. In order to ensure perfect recovery of the input signal, the coefficients of the analysis and synthesis filter banks are conveniently related to each other according to the perfect reconstruction condition [18, 19].

4. DWT Solutions Enhanced by the RNS

The design of wavelet filter banks using the RNS presents new opportunities. If the wavelet filter coefficients are fixed a priori, the LUT-based modulo multiplier represents the most efficient solution for meeting low-latency and hardware-efficiency requirements [8]. However, if the wavelet filter coefficients are to be run-time programmable, then the solution may require an unacceptably large number of LUTs to cover all coefficient instances [13]. The use of index-transformation multipliers [15, 16] and re-timing techniques leads to DWT filter bank designs requiring a single $2^{n_j} \times n_j$ LUT for each filter coefficient, where $n_j = \lceil \log_2 m_j \rceil$ is the modulus wordwidth.

Figure 1 shows the design, based on index transformations, of a modulo $m_j$ channel for an octave-i 8-tap decomposition filter bank. The input sequence $|a_n^{(i-1)}|_{m_j}$ is decomposed into even and odd sequences that are converted to the index domain by means of two LUTs storing the $\Phi_j$ function. Some circuitry is added at the input to detect zero values of the input sequences. Notice that clearable registers have been added to force the filter products to zero when a zero is detected in the even- or odd-indexed sequences. The reason for this is that multiplication by zero is not defined in the index domain and must be treated as a special case.
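The index-domain multiplication of Eqs. (2) and (3), together with the zero special case just described, can be sketched in software for a single modulo $m_j$ channel of the analysis recurrence in Eq. (4). This is only an illustrative model under stated assumptions: the function names are hypothetical, the primitive element is found by brute-force search, samples outside the input are treated as zero, and the output length is taken as half the input length. Python lists and dictionaries stand in for the $\Phi_j$ and $\Phi_j^{-1}$ LUTs.

```python
def primitive_root(p):
    """Smallest g whose powers generate every non-zero residue modulo prime p."""
    for g in range(2, p):
        if len({pow(g, i, p) for i in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive element found; is p an odd prime?")


def build_tables(p):
    """Return (forward, inverse): Phi maps residue -> index, Phi^-1 maps index -> residue."""
    g = primitive_root(p)
    inverse = [pow(g, i, p) for i in range(p - 1)]       # Eq. (2)
    forward = {q: i for i, q in enumerate(inverse)}      # zero has no index
    return forward, inverse


def index_mul(a, b, forward, inverse, p):
    """Modulo-p product computed by adding indexes modulo (p - 1), Eq. (3)."""
    if a == 0 or b == 0:
        # Zero detection: plays the role of the clearable registers.
        return 0
    return inverse[(forward[a] + forward[b]) % (p - 1)]


def dwt_channel(a_prev, g, h, p):
    """One octave of Eq. (4) inside a single modulo-p channel (zero-padded borders)."""
    forward, inverse = build_tables(p)
    x = [v % p for v in a_prev]

    def fir(coeffs, n):
        acc = 0
        for k, c in enumerate(coeffs):
            pos = 2 * n - k
            if 0 <= pos < len(x):
                acc = (acc + index_mul(c % p, x[pos], forward, inverse, p)) % p
        return acc

    n_out = len(x) // 2  # assumed output length
    approx = [fir(g, n) for n in range(n_out)]
    detail = [fir(h, n) for n in range(n_out)]
    return approx, detail


# Hypothetical integer data, not taken from the paper.
samples = [5, 0, 17, 30, 2, 9, 0, 11]
low, high = [3, -1, 4, 7], [2, 0, -5, 1]
approx, detail = dwt_channel(samples, low, high, 31)
```

In the architecture of Fig. 1 the coefficient indexes $\Phi_j(g_k)$ and $\Phi_j(h_k)$ are stored once and each input sample is converted once, whereas this sketch looks both operands up for every product to keep the code short.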
After the filter products are computed in the index domain, the LUT storing the function $\Phi_j^{-1}$ maps the indices back to the RNS domain, and the remaining filtering or addition stage is carried out by a modular adder tree. The system exhibits symmetry for the computation of the approximation and detail sequences. The complete RNS design consists of a number of parallel channels whose combined wordwidth is sufficient to ensure that the dynamic range requirements are met [18, 19].

In a similar manner, an index-based architecture may be derived for the reconstruction (synthesis) filter bank. The resulting architecture for the 1-D IDWT is shown in Fig. 2. The two input sequences $|\hat{a}_n^{(i)}|_{m_j}$ and $|\hat{d}_n^{(i)}|_{m_j}$ are converted into their index representations by means of two parallel LUTs storing the $\Phi_j$ function. The filter products are computed by parallel and efficient index-based multipliers, with each filter product requiring a single LUT storing $\Phi_j^{-1}$ and a modulo $(m_j - 1)$ adder. Additional logic and clearable registers are used to detect zero input values and force the corresponding filter products to zero. Finally, two separate modulo $m_j$ addition stages are used to compute the output sequence $|\hat{a}_n^{(i-1)}|_{m_j}$ in even and odd clock cycles, as required by Eq. (5).

5. Results and Discussion

An 8-tap 1-D DWT filter bank was used to illustrate the design of 2C and RNS-based systems. The comparison was carried out using VHDL models over Altera FLEX10KE field programmable logic (FPL) devices and two standard cell ASIC technologies. The selected ASIC reference libraries were the 0.8 µm MSU SCMOS and the Chip Express 0.35 µm triple-level metal CX3003 CMOS technologies. The 0.8 µm MSU SCMOS cell library consists of a set of gates implementing low-level logic functions. The Chip Express 0.35 µm CMOS CX3003 technology is based on the definition of a high-level module that can be configured to operate in a very wide range of simple and complex circuit functions and combinations.
The logic module is a universal function composed of three multiplexers and one AND gate. It is based on the fact that a multiplexer can implement any logic function, which may be either combinatorial or sequential. Table 1 shows the total area and maximum sampling rate obtained for 8-tap RNS and 2C designs using 0.8 µm and 0.35 µm CBIC technologies. The solution adopted here for the 2C arithmetic DWT architecture was to use pipelined 2C multipliers based on Booth encoding and Wallace trees [26]. Hardware complexity and delay rapidly increase as the precision of the input and coefficients increases. These facts are shown in Table 1 and Fig. 3. Note that performance is considerably higher for an RNS-based solution than for a 2C design.

Table 1. Total area and maximum sampling rate obtained for an 8-tap DWT filter bank. [x, y, z] represents x-bit input, y-bit coefficients and z-bit output. Area is given in µm² (0.8 µm) or number of modules (0.35 µm); F is the maximum sampling rate in MHz.

                                       Two's complement                    RNS
                                       Area              F (MHz)           Area              F (MHz)
Wordwidths and modulus set             0.8 µm   0.35 µm  0.8 µm   0.35 µm  0.8 µm   0.35 µm  0.8 µm   0.35 µm
[8, 10, 21] {31, 29, 23, 19, 17}       748608   19810    106.38   367.65   820360   29500    209.64   584.80
[10, 10, 23] {61, 59, 53, 47}          855016   21849    105.71   353.36   910688   36436    188.32   515.46
[12, 12, 27] {61, 59, 53, 47, 43}      1026864  25507    86.43    293.26   1138360  45545    188.32   515.46
[14, 12, 29] {61, 59, 53, 47, 43, 41}  1111376  28441    84.89    223.21   1366032  54654    188.32   515.46

[Figure 1. Design of an RNS-based 1-D DWT architecture with index-transformation.]

[Figure 2. Design of an RNS-based 1-D IDWT architecture with index-transformation.]

[Figure 3. Sampling rate (MHz) as a function of the output precision for index-based (5-, 6- and 7-bit RNS) and 2C arithmetic 1-D DWT filter banks implemented by means of CBIC technologies. Panels: MSU CMOS Technology; Chip Express CMOS Technology.]

In order to maximize the sample rate gain, small wordwidth channels are desirable. However, only prime moduli are suitable for use in an index arithmetic system. For a 5-bit modulus set, the only admissible moduli are {17, 19, 23, 29, 31}, which leads to a 22.7-bit maximum dynamic range. With a 6-bit modulus set, the dynamic range can be up to 39 bits using the moduli set {37, 41, 43, 47, 53, 59, 61}. The use of a 6-bit modulus set was found to be attractive for the designs demanding 23-, 27- and 29-bit outputs, while for the design with a 21-bit output a 5-bit modulus set is more efficient in terms of area and speed.
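The dynamic range figures quoted above follow directly from the product of the admissible prime moduli, and can be checked with a short sketch. The helper names are hypothetical, and the greedy selection in `select_moduli` (largest primes first until the range covers the output wordwidth) is only an assumption made for illustration; it happens to reproduce the modulus sets listed in Table 1, but the paper does not state how those sets were chosen.

```python
from math import log2, prod


def primes_with_wordwidth(bits):
    """All primes that need exactly `bits` bits, i.e. 2^(bits-1) <= p < 2^bits."""
    lo, hi = 2 ** (bits - 1), 2 ** bits
    return [n for n in range(max(lo, 2), hi)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]


def dynamic_range_bits(moduli):
    """log2 of M = product of the moduli."""
    return log2(prod(moduli))


def select_moduli(bits, out_bits):
    """Greedy pick (assumption): largest primes first until M >= 2^out_bits."""
    chosen = []
    for p in reversed(primes_with_wordwidth(bits)):
        chosen.append(p)
        if prod(chosen) >= 2 ** out_bits:
            return chosen
    raise ValueError("wordwidth too small for the requested output precision")


five_bit = primes_with_wordwidth(5)   # [17, 19, 23, 29, 31]
six_bit = primes_with_wordwidth(6)    # [37, 41, 43, 47, 53, 59, 61]
range_5 = dynamic_range_bits(five_bit)
range_6 = dynamic_range_bits(six_bit)
set_23 = select_moduli(6, 23)
set_29 = select_moduli(6, 29)
```

Each extra modulus adds one more parallel channel, so the area grows with the output precision while the per-channel critical path does not.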
The efficient hardware implementation of modulo multiplication by means of index transformations reveals 2C and RNS-based systems to have similar hardware complexities, while an RNS solution takes advantage of higher speed and better ASIC routability inside each channel. For instance, a DWT filter bank enhanced by RNS arithmetic and having 21-, 23-, 27-, and 29-bit output is about 97%, 78%, 118% and 122% faster than a 2C design when using the MSU SCMOS 0.8 µm technology. Notice that using a 6-bit modulus set for RNS wavelet filter banks with 23-bit output or above means that the overall throughput improvement does not increase steadily with the wordlength. On the other hand, filters having 27- and 29-bit output are twice as fast as the equivalent 2C design.

FPL devices have recently generated interest for use in DSP systems due to their ability to implement custom solutions while maintaining flexibility through device reprogramming. FPL technologies providing embedded LUTs and dedicated logic blocks are potential solutions for MAC-intensive RNS-based DSP systems.

Table 2. Total resources required and maximum sampling rate obtained for a 4-tap DWT filter bank on an Altera FLEX10KE device (grade-1). [x, y, z] represents x-bit input, y-bit coefficients and z-bit output.

                                   Two's complement                   RNS
Wordwidths and modulus set         No. of LEs  No. of EABs  F (MHz)   No. of LEs  No. of EABs (Memory bits)  F (MHz)
[8, 9, 19] {61, 59, 53, 47}        3470        0            39.06     4 × 314     4 × 10 (15360)             135.13
[8, 10, 20] {61, 59, 53, 47}       3440        0            38.16     4 × 314     4 × 10 (15360)             135.13
[9, 10, 21] {61, 59, 53, 47}       3647        0            34.24     4 × 314     4 × 10 (15360)             135.13
[10, 10, 22] {61, 59, 53, 47}      4354        0            30.67     4 × 314     4 × 10 (15360)             135.13
[12, 12, 26] {61, 59, 53, 47, 43}  5446        0            27.93     5 × 314     5 × 10 (19200)             135.13
[14, 12, 28] {61, 59, 53, 47, 43}  7972        0            26.95     5 × 314     5 × 10 (19200)             135.13

Modern CPLDs consist of LUTs (frequently called logic elements) and dedicated memory blocks.
Depending on the family, each LE (logic element) includes one or more variable input size LUTs (typically $2^5 \times 1$ or $2^4 \times 1$), fast carry propagation logic and one or more flip-flops. Specifically, each LE included in the Altera FLEX10K [2] device consists of a $2^4 \times 1$ LUT, an output register and dedicated logic for fast carry and cascade chains in arithmetic mode; a number of embedded array blocks (EABs), each providing a 2K-bit RAM or ROM and configurable as $2^8 \times 8$, $2^9 \times 4$, $2^{10} \times 2$ or $2^{11} \times 1$, are the cores for the implementation of RNS LUT-based multipliers. Likewise, LUTs allow building specialized memory functions such as ROM or RAM.

Table 2 shows the total resources required and maximum sampling rate obtained for a 4-tap DWT filter bank using a grade-1 Altera FLEX10KE FPL device, as well as the moduli selected to cover the dynamic range. Hardware requirements were assessed in terms of the number of LEs and EABs, while performance was evaluated in terms of the register-to-register path maximum delay. Figure 4 shows the sampling rate as a function of the output precision. The use of 5- and 6-bit modulus sets was found to be an attractive choice since performance is only limited by the LUT operation. Thus, the presented RNS-enhanced DWT filter banks, with 19-, 20-, 21- and 22-bit output, are more than twice as fast as a 2C implementation (Table 2). This dramatic increase in system performance is due to the fast implementation of the index multipliers, which takes advantage of the FPL embedded resources. Thus, LUTs storing the $\Phi_j^{-1}$ function were able to operate at 135 MHz (when mapped on EABs) and 5- and 6-bit modulo adders took advantage of the fast carry propagation paths inside the 8-bit LEs of a logic array block (LAB). In contrast to a 2C design, the presented RNS-enabled DWT solutions need neither long propagation paths nor the communication of information or carries between LABs, since carry chains are no longer than 7 bits. This follows from the reduced wordlength of the RNS channels and makes it possible to mask the architectural limitations of the FPL device.

[Figure 4. Sampling rate (MHz) as a function of the output precision for index-based (5,6-bit and 7-bit RNS) and 2C arithmetic 1-D DWT filter banks implemented with FPL devices. Panel: Altera FLEX10KE (grade -1).]

Table 3. Area required for binary-to-RNS and ε-CRT RNS-to-binary converters. [x, y, z] represents x-bit input, y-bit coefficients and z-bit output. Each entry gives CA/NCA followed by the number of pipeline stages in parentheses. CA: Combinational area. NCA: Non combinational area.

                                       0.8 µm MSU CA/NCA (µm²)                0.35 µm CX3003 CA/NCA (No. of modules)
Wordwidths and modulus set             2C → RNS        RNS → 2C, 16-bit out   2C → RNS      RNS → 2C, 16-bit out
[8, 10, 21] {31, 29, 23, 19, 17}       8696/4200 (2)   24252/15845 (4)        180/85 (2)    502/361 (4)
[10, 10, 23] {61, 59, 53, 47}          14893/6045 (2)  43378/23457 (4)        314/119 (2)   910/534 (4)
[12, 12, 27] {61, 59, 53, 47, 43}      19545/7582 (2)  54878/29345 (4)        412/152 (2)   1137/668 (4)
[14, 12, 29] {61, 59, 53, 47, 43, 41}  23587/9280 (2)  64897/34920 (4)        490/186 (2)   1364/795 (4)

6. Binary-to-RNS and RNS-to-Binary Converters

A historical barrier to the use of the RNS at the system level has been the overhead penalty associated with binary-to-RNS and RNS-to-binary conversion. Binary-to-RNS conversion can be carried out efficiently by decomposing the B-bit 2C word, say x, into a weighted sum of smaller words $\bar{x}_i$ (e.g., 4-bit words). Equation (6) exemplifies the case of a 4-bit decomposition, namely:

$$
|x|_{m_j} = \left| -2^{B-1} x_{B-1} + \sum_{l=0}^{B-2} 2^{l} x_l \right|_{m_j}
= \left| -2^{B-1} x_{B-1} + \sum_{i=0}^{p-1} \bar{x}_i\, 2^{4i} \right|_{m_j}
\tag{6}
$$

which requires only $2^4 \times n_j$ LUTs and a modulo addition stage.

RNS-to-binary conversion implies the use of a CRT (Chinese Remainder Theorem)-based converter. However, CRT conversion can often be a barrier in certain applications.
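For the forward (binary-to-RNS) direction, the 4-bit decomposition of Eq. (6) can be sketched as follows. This is a software illustration only: the function names are hypothetical, the number of 4-bit words is assumed to be $p = \lceil (B-1)/4 \rceil$ (the paper does not define p explicitly), and each inner list models one $2^4 \times n_j$ LUT.

```python
def build_nibble_luts(m, n_words):
    """LUT i maps a 4-bit word v to |v * 2^(4i)|_m; 16 entries per LUT."""
    return [[(v << (4 * i)) % m for v in range(16)] for i in range(n_words)]


def binary_to_residue(x, width, m):
    """Residue modulo m of a `width`-bit two's complement integer x, via Eq. (6)."""
    if not -(1 << (width - 1)) <= x < (1 << (width - 1)):
        raise ValueError("x does not fit in the given two's complement width")
    word = x & ((1 << width) - 1)                 # raw B-bit 2C encoding
    sign = word >> (width - 1)                    # x_{B-1}
    magnitude = word & ((1 << (width - 1)) - 1)   # bits x_{B-2} ... x_0
    n_words = -(-(width - 1) // 4)                # assumed p = ceil((B - 1) / 4)
    luts = build_nibble_luts(m, n_words)

    acc = (-(sign << (width - 1))) % m            # sign term -2^(B-1) * x_{B-1}
    for i in range(n_words):
        nibble = (magnitude >> (4 * i)) & 0xF
        acc = (acc + luts[i][nibble]) % m         # modulo addition stage
    return acc


# Example: an 8-bit input converted to the 5-bit modulus set of Table 3.
residues = [binary_to_residue(-77, 8, m) for m in (31, 29, 23, 19, 17)]
```

The sign term takes only two values, so in hardware it could be folded into one of the LUTs or handled by a small constant selection; the sketch keeps it separate to mirror the form of Eq. (6).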
The auto-scaling RNS-to-binary converter (ε-CRT) proposed by Griffin et al. [27] can overcome these drawbacks by using a few LUTs and binary (modulo $2^n$) adders. For a scaled n-bit binary output and an $n_j$-bit modulus set, this converter needs one $2^{n_j} \times n$ LUT for each modulus of the RNS and an n-bit adder tree. This solution is better suited to most applications demanding high data rates [28]. Implementation data, using cell-based integrated circuits, of the 2C-to-RNS and RNS-to-2C converters are provided in Table 3 for 5- and 6-bit modulus sets. The design of the 2C-to-RNS converter was derived from Eq. (6), while the ε-CRT algorithm with a 16-bit output was used for the RNS-to-2C converter. The operating frequency of both converters was adapted to the system performance by inserting a number of pipeline stages (shown in Table 3), so the high throughput of the presented index-based RNS architectures for forward and inverse wavelet transforms was not degraded when converters were inserted in the system.

7. Conclusion

This paper reports on the design and implementation, using FPL devices and CBIC technologies, of forward and inverse wavelet filter banks by means of the RNS. The architecture is based on index-transformation over Galois fields, and requires a single LUT for each filter coefficient multiplication. Efficient circuitry is used to detect a zero value in the input sequence, a requirement of the design paradigm. The RNS design was compared to a 2C architecture of comparable size. The reported methodology demonstrated a performance improvement over a 2C design.

Acknowledgments

J. Ramírez, A. García and A. Lloris were supported by the Comisión Interministerial de Ciencia y Tecnología (Spain) under project PB98-1354. CAD tools and supporting material were provided by Altera Corp., San Jose, CA, under the Altera University Program, and Synopsys Inc., Mountain View, CA, under the Synopsys University Program.
We would like to thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the material presented in this paper.

References

1. M.A. Soderstrand, W.K. Jenkins, G.A. Jullien, and F.J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. New York: IEEE Press, 1986.
2. Altera Corporation, FLEX10K Embedded Programmable Logic Device Family, ver. 4.1, 2001.
3. Xilinx Inc., The Programmable Logic Data Book, 1999.
4. U. Meyer-Baese, J. Buros, W. Trautmann, and F. Taylor, “Fast Implementation of Orthogonal Wavelet Filterbanks Using Field-Programmable Logic,” in Proc. of the 1999 IEEE International Conference on Acoustics, Speech and Signal Processing, 1999, vol. 4, pp. 2119–2122.
5. A. García, U. Meyer-Bäse, A. Lloris, and F. Taylor, “RNS Implementation of FIR Filters based on Distributed Arithmetic Using Field-Programmable Logic,” in Proc. of the 1999 IEEE International Symposium on Circuits and Systems, 1999, vol. 1, pp. 486–489.
6. J. Ramírez, A. García, U. Meyer-Baese, F. Taylor, and A. Lloris, “Implementation of RNS-Based Distributed Arithmetic Discrete Wavelet Transform Architectures Using Field-Programmable Logic,” Journal of VLSI Signal Processing (Special Issue on Computer Arithmetic and Applications), 2003, vol. 33, pp. 171–190.
7. J. Ramírez, A. García, U. Meyer-Bäse, F. Taylor, P.G. Fernández, and A. Lloris, “Design of RNS-Based Distributed Arithmetic DWT Filterbanks,” in Proc. of the 2001 International Conference on Acoustics, Speech and Signal Processing ICASSP 2001, May 2001, vol. 2, pp. 1193–1196.
8. J. Ramírez, A. García, P.G. Fernández, L. Parrilla, and A. Lloris, “RNS-FPL Merged Architectures for the Orthogonal DWT,” Electronics Letters, vol. 36, no. 14, 2000, pp. 1198–1199.
9. V. Hamann and M. Sprachmann, “Fast Residual Arithmetic with FPGAs,” in Proc. of the Workshop on Design Methodologies for Microelectronics, Slovakia, Sept. 1995.
10. E. Di Claudio, F. Piazza, and G. Orlandi, “Fast Combinational RNS Processors for DSP Applications,” IEEE Transactions on Computers, May 1995, pp. 624–633.
11. H. Safiri, H. Ahamadi, G. Jullien, and V. Dimitrov, “Design and FPGA Implementation of Systolic FIR Filters Using the Fermat ALU,” in Proc. of the Asilomar Conference on Signals, Systems and Computers, Pacific Grove, 1996.
12. U. Meyer-Bäse, A. García, and F. Taylor, “Implementation of a Communications Channelizer Using FPGAs and RNS Arithmetic,” Journal of VLSI Signal Processing, May 2001, vol. 28, no. 1/2, pp. 115–128.
13. J. Ramírez, P.G. Fernández, U. Meyer-Bäse, F. Taylor, A. García, and A. Lloris, “Index-based RNS DWT Architectures for Custom IC Designs,” in Proc. of the IEEE Workshop on Signal Processing Systems SiPS 2001, Oct. 2001, pp. 70–79.
14. F. Taylor, “Large Moduli Multipliers for Signal Processing,” IEEE Transactions on Circuits and Systems, vol. CAS-28, no. 7, 1981, pp. 731–736.
15. G.A. Jullien, “Implementation of Multiplication, Modulo a Prime Number, with Applications to Number Theoretic Transforms,” IEEE Trans. on Computers, vol. C-29, no. 10, 1980, pp. 899–905.
16. D. Radhakrishnan and Y. Yuan, “Fast and Highly Compact RNS Multipliers,” International Journal of Electronics, vol. 70, no. 2, 1991, pp. 281–293.
17. A.A. Hiasat, “New Efficient Structure for a Modular Multiplier for RNS,” IEEE Transactions on Computers, vol. 49, no. 2, 2000, pp. 170–174.
18. M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice Hall, 1995.
19. G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, 1997.
20. K.K. Parhi and T. Nishitani, “VLSI Architectures for Discrete Wavelet Transforms,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 1, June 1993, pp. 191–202.
21. J. Fridman and E.S. Manolakos, “Distributed Memory and Control VLSI Architectures for the 1-D Discrete Wavelet Transform,” VLSI Signal Processing, vol. VII, 1994, pp. 388–397.
22. C. Chakrabarti and M. Vishwanath, “Efficient Realizations of the Discrete and Continuous Wavelet Transform: From Single Chip Implementations to Mappings on SIMD Array Computers,” IEEE Transactions on Signal Processing, vol. 43, March 1995, pp. 759–771.
23. M. Vishwanath, R.M. Owens, and M.J. Irwin, “VLSI Architectures for the Discrete Wavelet Transform,” IEEE Transactions on Circuits and Systems II, vol. 42, no. 5, May 1995, pp. 305–316.
24. T.C. Denk and K.K. Parhi, “VLSI Architectures for Lattice Structure Based Orthogonal Discrete Wavelet Transforms,” IEEE Transactions on Circuits and Systems II, vol. 44, no. 2, Feb. 1997, pp. 129–132.
25. F. Marino, “A ‘Double-Face’ Bit-Serial Architecture for the 1-D Discrete Wavelet Transform,” IEEE Transactions on Circuits and Systems II, vol. 47, no. 1, Jan. 2000, pp. 65–71.
26. J. Pihl and E.J. Aas, “A Multiplier and Squarer Generator for High Performance DSP Applications,” in Proc. of the 39th Midwest Symposium on Circuits and Systems, 1996.
27. M. Griffin, F.J. Taylor, and M. Sousa, “New scaling algorithms for the Chinese Remainder Theorem,” in Proc. of the 22nd Asilomar Conf. on Signals, Syst. and Comp., CA, 1988.
28. J. Ramírez, A. García, P.G. Fernández, L. Parrilla, and A. Lloris, “A New Architecture to Compute the Discrete Cosine Transform using the Quadratic Residue Number System,” in Proc. of the 2000 International Symposium on Circuits and Systems, vol. 5, May 2000, pp. 321–324.

Javier Ramírez received the M.A.Sc. degree in Electronic Engineering in 1998, and the Ph.D. degree in Electronic Engineering in 2001, both from the University of Granada. Since 2001, he has been an Assistant Professor at the Department of Electronics and Computer Technology of the University of Granada (Spain). His research interests include residue number system arithmetic, high performance digital signal processing, and FPGA and VLSI signal processing systems.
He is the author of more than 50 technical journal and conference papers in these areas. He has served as a reviewer for several international journals and conferences and is a member of the IEEE.
jramirez@ieee.org

Uwe Meyer-Bäse received his BSEE, MSEE, and Ph.D. “Summa cum Laude” from the Darmstadt University of Technology in 1987, 1989, and 1995, respectively. In 1994 and 1995 he held a postdoctoral position at the “Inst. of Brain Research” in Magdeburg. In 1996 and 1997 he was a Visiting Professor at the University of Florida. From 1998 to 2000 Dr. Meyer-Bäse worked in the ASIC industry. He is now a Professor in the Electrical and Computer Engineering Department at Florida State University. During his graduate studies he worked part time for TEMIC, Siemens, Bosch, and Blaupunkt. He holds 3 patents, has supervised more than 60 master's thesis projects in the DSP/FPGA area, and has given four lectures at the University of Darmstadt in the DSP/FPGA area. He is the author of three books, including “Digital Signal Processing with Field Programmable Gate Arrays” and “Fast Digital Signal Processing,” published by Springer-Verlag. In 1997 he received the Max-Kade Award in Neuroengineering. Dr. Meyer-Bäse is an IEEE, BME, SP, and C&S society member.
Uwe.Meyer-Baese@ieee.org

Fred J. Taylor received his Ph.D. from the University of Colorado in 1969. Since then he has held professional positions at Texas Instruments, the University of Texas at El Paso, the University of Cincinnati, and the University of Florida, where he is currently a Professor of Electrical and Computer Engineering and Computer and Information Science, as well as president of the Athena Group, Inc. He has authored over 100 archived papers and nine books, contributed chapters to four monographs and encyclopedias, and holds four U.S. patents. His professional interests include digital design and architecture, digital signal processing, and engineering education.
fjt@hsdal.ufl.edu

Antonio García received the M.A.Sc. degree in Electronic Engineering (being awarded the National Best Academic Record) in 1995, the M.Sc. degree in Physics (majoring in Electronics) in 1997, and the Ph.D. degree in Electronic Engineering in 1999, all from the University of Granada (Spain). He was an Associate Professor at the Department of Computer Engineering of the Universidad Autónoma de Madrid before joining the Department of Electronics and Computer Technology at the University of Granada as an Associate Professor. His research interests include Residue Number System arithmetic, the application of RNS to high-performance digital signal processing, VLSI and FPL implementation of RNS-based systems, and the use of RNS for low-power VLSI systems. He has authored over 50 technical papers in international journals and conferences and has served as a reviewer for several international journals and conferences. He is a member of the IEEE and a C, C&S, and SP Society member.
agarcia@ieee.org

Antonio Lloris received the M.Sc. degree and the Ph.D. degree from the Universidad Complutense (Madrid). He worked as a researcher at the Centro de Investigaciones Técnicas de Guipúzcoa (Spain) and as a lecturer at the Escuela Técnica Superior de Ingenieros Industriales de San Sebastián. He was also at the Universities of Malaga and Murcia (Spain). He is now a Full Professor at the University of Granada (Spain). His research interests include multiple-valued logic, testing of digital circuits, and signal processing using the residue number system.
lloris@ditec.ugr.es