International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number 1 – Dec 2014

Modular Delay Commutator for DHT Algorithm

Thamma Sai Sireesha 1, G. Malyadri 2
1 PG Student (M.Tech), Dept. of ECE, KKR & KSR Institute of Technology & Sciences, Guntur
2 Assistant Professor, Dept. of ECE, KKR & KSR Institute of Technology & Sciences, Guntur

Abstract – This paper proposes a new discrete Hartley transform (DHT) algorithm that is well suited to VLSI implementation on a highly parallel and modular architecture with a regular structure. It can be used to design a completely novel VLSI architecture for the DHT. The algorithm can be split efficiently into several parallel parts that execute concurrently. Moreover, it is well suited to the subexpression sharing technique, which can significantly reduce the hardware complexity of a highly parallel VLSI implementation. Because multipliers with the same constant can be shared efficiently, the number of multipliers is significantly reduced and is very small compared with that of existing algorithms.

Keywords – Fast Fourier transform (FFT), Discrete Hartley Transform (DHT), VLSI architecture, Domain Processing.

I. INTRODUCTION

Image compression, the art and science of reducing the amount of data required to represent an image, is one of the most useful and commercially successful technologies in the field of digital image processing. Digital image and video compression is now essential.
Internet teleconferencing, High Definition Television (HDTV), satellite communications and digital storage of movies would not be feasible without a high degree of compression. Compression reduces the use of expensive resources such as memory (hard disks) and transmission bandwidth. In an age in which everything is shrinking, smaller is better. On the downside, lossy compression schemes introduce distortion, and additional computational resources are required to compress and decompress the data.

The discrete Fourier transform (DFT) is used in many digital signal processing applications, such as signal and image compression, filter banks [1], signal representation, and harmonic analysis [2]. The discrete Hartley transform (DHT) can efficiently replace the DFT when the input sequence is real. The classical split-radix algorithm is difficult to implement in VLSI because of its irregular computational structure and because the butterflies differ significantly from stage to stage. It is therefore necessary to derive new algorithms that are suited to a parallel VLSI system.

In the first step of the encoding process, the image f(x, y) is mapped to a format that reduces spatial redundancy [2]. The transforms used for this mapping include:
• Discrete cosine transform
• Discrete wavelet transform
• Discrete Hartley transform

Next, quantization is performed; this is where the loss of information takes place. Since quantization is irreversible, this step is omitted in a lossless coding technique. The final step is symbol coding, which represents the information in the minimum possible number of bits. Coding techniques in use include Huffman coding, run-length coding, LZW coding, bit-plane coding, block transform coding and many others.
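As a rough illustration of the symbol-coding step, run-length coding can be sketched in a few lines of Python. The function names and the sample coefficient row are hypothetical and chosen only for illustration; the paper does not state which coder its encoder uses.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Hypothetical row of quantized coefficients ending in a run of zeros.
row = [12, -3, -3, 0, 0, 0, 0, 0]
encoded = rle_encode(row)
assert rle_decode(encoded) == row  # the coding step itself is lossless
```

Run-length coding pays off only when the input contains long runs, which is why it is usually applied after quantization has driven many coefficients to zero.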
ISSN: 2231-5381 http://www.ijettjournal.org

Figure 1 Functional block diagram of a general image compression system

FPGAs need to be programmed, i.e. their logic circuits and interconnection switches are configured to implement a desired circuit structure. Applications of FPGAs include digital signal processing, software-defined radio, aerospace and defence systems, ASIC prototyping, medical imaging, computer vision, speech recognition, cryptography, bioinformatics, computer hardware emulation, radio astronomy, metal detection and a growing range of other areas.

II. DISCRETE HARTLEY TRANSFORM

The Hartley transform is an integral transform closely related to the Fourier transform, but one that transforms real-valued functions to real-valued functions. It was proposed as an alternative to the Fourier transform by R. V. L. Hartley in 1942 [8]. Compared to the Fourier transform, the Hartley transform has the advantages of transforming real functions to real functions (as opposed to requiring complex numbers) and of being its own inverse. The discrete version of the transform, the discrete Hartley transform, was introduced by R. N. Bracewell in 1983.

Formula

Formally, the discrete Hartley transform is a linear, invertible function H: R^N → R^N (where R denotes the set of real numbers). The N real numbers x_0, x_1, …, x_{N−1} are transformed into the N real numbers H_0, H_1, …, H_{N−1} according to the formula

$$H_k = \sum_{n=0}^{N-1} x_n \left[ \cos\frac{2\pi nk}{N} + \sin\frac{2\pi nk}{N} \right], \qquad k = 0, 1, \ldots, N-1$$

The inverse transform is given by:

$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} H_k \left[ \cos\frac{2\pi nk}{N} + \sin\frac{2\pi nk}{N} \right]$$

The cas function is given by:

$$\operatorname{cas}\frac{2\pi nk}{N} = \cos\frac{2\pi nk}{N} + \sin\frac{2\pi nk}{N}$$

One of the properties of the cas function is:

$$2\operatorname{cas}(a+b) = \operatorname{cas}(a)\operatorname{cas}(b) + \operatorname{cas}(-a)\operatorname{cas}(b) + \operatorname{cas}(a)\operatorname{cas}(-b) - \operatorname{cas}(-a)\operatorname{cas}(-b)$$

The two-dimensional DHT of an array x(m, n) of size M×N may be defined as:

$$X(k,l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m,n) \operatorname{cas}\left( \frac{2\pi mk}{M} + \frac{2\pi nl}{N} \right)$$

The inverse transform is given by the same formula along with a scaling factor of 1/MN, i.e.

$$x(m,n) = \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} X(k,l) \operatorname{cas}\left( \frac{2\pi mk}{M} + \frac{2\pi nl}{N} \right)$$

Image compression minimizes the size in bytes of a graphics file without degrading the quality of the image to an unacceptable level. The reduction in file size allows more images to be stored in a given amount of disk or memory space. It also reduces the time and bandwidth required for images to be sent over the Internet or downloaded from Web pages.

There are several different ways in which image files can be compressed. For Internet use, the two most common compressed graphic image formats are the JPEG format and the GIF format. The JPEG method is more often used for photographs, while the GIF method is commonly used for line art and other images in which geometric shapes are relatively simple.

First, the image is divided into blocks of 8×8 pixel values. These blocks are then fed to the encoder, from which the compressed image is obtained.

The next step is the mapping of the pixel intensity values to another domain. The mapper transforms images into a (usually non-visual) format designed to reduce spatial and temporal redundancy. Quantizing the transformed coefficients discards information that is irrelevant for the specified purpose. Source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through use of specific encoding schemes. The block diagram of the steps is shown in Figure 2.

Figure 2 Energy quantization based image compression encoder

To retrieve the image, the steps of the forward process are reversed. First the data is decoded using the decoder. Next the inverse transform (IDHT) is calculated to recover the 8×8 blocks. These blocks are then joined to form the final image. The reconstructed pixel values show that some of the high-frequency components are preserved, which indicates that the edge property of the image is preserved.

The pixel values are converted into another domain so that they are easier to compress. A transform operates on an image's pixel values and converts them to a set of less correlated transformed coefficients. Natural images (the most common images to be compressed) have a lot of spatial correlation between the intensities of neighbouring pixels. The transform exploits these correlations, so that spatial and temporal redundancy is reduced. This operation is generally reversible and may or may not reduce the data content of the images.
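To make the transform step concrete, the following Python sketch evaluates the 2-D DHT of Section II directly on one 8×8 block and then inverts it with the same kernel and a 1/(MN) scaling. It is a plain O((MN)^2) evaluation of the definition, not the fast algorithm or the hardware architecture proposed in the paper; the function names and the test block are assumptions made for illustration.

```python
import math

def cas(theta):
    """Hartley kernel: cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)

def dht2(block):
    """Direct 2-D DHT of an M x N block using cas(2*pi*m*k/M + 2*pi*n*l/N)."""
    M, N = len(block), len(block[0])
    return [[sum(block[m][n] * cas(2 * math.pi * (m * k / M + n * l / N))
                 for m in range(M) for n in range(N))
             for l in range(N)]
            for k in range(M)]

def idht2(coeffs):
    """Inverse 2-D DHT: the same sum as dht2, scaled by 1/(M*N)."""
    M, N = len(coeffs), len(coeffs[0])
    return [[value / (M * N) for value in row] for row in dht2(coeffs)]

# Hypothetical 8x8 block of pixel intensities (a smooth gradient).
block = [[16 * r + 2 * c for c in range(8)] for r in range(8)]
coeffs = dht2(block)
restored = idht2(coeffs)
max_err = max(abs(restored[r][c] - block[r][c])
              for r in range(8) for c in range(8))
assert max_err < 1e-8  # the transform is reversible up to rounding error
```

Because the input and all coefficients are real, no complex arithmetic is needed, and the same routine serves as both the forward and the inverse transform apart from the scaling.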
Here the discrete Hartley transform (DHT) is used to generate the coefficients.

Figure 3 Energy quantization based image compression decoder

Quantization is the process of approximating a continuous range of values (or a very large set of possible discrete values) by a relatively small, finite set of discrete symbols or values. In other words, it maps a broad range of input values to a limited number of output values. It reduces the accuracy of the transformed coefficients in accordance with a pre-established fidelity criterion. The goal is to reduce the amount of irrelevant information present in the image.

The human eye is fairly good at seeing small differences in brightness over a relatively large area, but not so good at distinguishing the exact strength of a high-frequency brightness variation. This allows the amount of information in the high-frequency components to be greatly reduced. This is done by simply dividing each component in the frequency domain by a constant for that component, and then rounding to the nearest integer. This is the main lossy operation in the whole process. As a result, many of the higher-frequency components are typically rounded to zero, and many of the rest become small positive or negative numbers.

The quantization matrices are formed for different transforms according to their frequency distribution in the coefficient matrix. A quantization matrix for the DCT can be obtained easily, but it is more difficult for the DHT because the DHT requires a special scanning order.

Figure 4 Scanning Order for DHT

Block diagram description

Because computing the DFT of an N-point sequence requires N summations, each involving N operations, the total computation requires O(N^2) operations.
Writing out the entire computation by hand will show, however, that many of these operations are redundant and can be eliminated. Using Danielson and Lanczos' [3] observation that an N-point DFT can be expressed as the sum of two N/2-point DFTs, these redundancies can be eliminated, as shown in the block diagram, by adopting the conventional definition.

Figure 5 Functional block diagram

The basic radix-2 FFT algorithm is very symmetrical, but it accepts general complex input when all that is needed here is the ability to transform real sequences. The Fourier transform of a real sequence has conjugate symmetry (the real part of the transform is even while the imaginary part is odd), which can be exploited to reduce the number of computations in an FFT by one half.

The discrete Fourier transform converts discrete data from a time wave into a frequency spectrum. Using the DFT implies that the finite segment that is analysed is one period of an infinitely extended periodic signal. The DFT equation is

$$F(n) = \sum_{k=0}^{N-1} x(k)\, e^{-j 2\pi k n / N}$$

where x(k) is the time wave that is converted to a frequency spectrum by the DFT. The key concepts required to understand a DFT are:
• The "sampling rate", sr, is the number of samples taken over a time period. For simplicity the time interval between samples is kept equal; this is the "sample interval".
• The fundamental period, T, is the period of all the samples taken. This is also called the "window".
• The "fundamental frequency" is f0, which is 1/T. f0 is the first harmonic, the second harmonic is 2*f0, the third is 3*f0, etc.
• The number of samples is N.
• The "Nyquist frequency", fc, is half the sampling rate. It is the maximum frequency that can be detected for a given sampling rate, because at least two sample points (trough and peak) are needed to identify a wave.
• "Euler's formula".
• The sampled part of the time wave, x(t), should be "typical" of how the wave behaves over all the time that it exists.

Compared to the DCT adder/subtractor of existing algorithms, the DHT adder/subtractor requires fewer ALUs and adders (only 4 ALUs + 7 adders in the proposed DHT, against 9 ALUs + 6 adders in the DCT). The blocks indicate the multiplications involved in the convolution of the inputs with the twiddle factors. The FFT operation is shown in Fig. 7, in which the imaginary multiplication involved is restricted, which reduces the complexity as well as the area.

Figure 6 Twiddle factor constant multiplier

Figure 7 Multiplications involved in the DHT algorithm

MDC Architecture

The goal is to convert the input streams in Fig. 8 to the format in Fig. 9. There are 12 memory banks at the input stage for converting the parallel input streams into serial blocks, such that one butterfly at each stage can compute the four data streams without an idle period. The 12 memory banks are grouped into four memory sets as shown in Fig. 4(a), that is, memory sets a, b, c, and d, which are used to store the input streams A, B, C, and D, respectively. There are two grouping methods: one for even-indexed symbols and one for odd-indexed symbols. Let the index of the OFDM symbols begin from 0. For even-indexed OFDM symbols, the grouping method on the left side of Fig. 8 is used, and for odd-indexed OFDM symbols, the grouping method on the right side of Fig. 8 is used. Fig. 8 illustrates the memory scheduling for even-indexed OFDM symbols.
The scheduling for odd-indexed OFDM symbols will become clear after the illustration for even-indexed OFDM symbols. Let us take N = 2048 as an example and explain the input scheduling as follows.

Initially the 12 memory banks are logically grouped into four sets {a1, a2, a3}, {b1, b2, b3}, {c1, c2, c3}, and {d1, d2, d3} as shown in Fig. 4.1(a). Each set is in charge of one input stream. From the first to the 3N/4th cycle, the memory banks keep the first to the 3N/4th samples of each input stream. For the case of N = 2048, the memory sets {a1, a2, a3}, {b1, b2, b3}, {c1, c2, c3}, and {d1, d2, d3} store samples {1st–512th, 513th–1024th, 1025th–1536th} of the first, the second, the third, and the fourth input streams, respectively.

From the (3N/4 + 1)th to the Nth cycle, shown in Fig. 4.1(b), the radix-4 butterfly processes the data read out from the memory set {a1, a2, a3}, and this memory set is then updated with the incoming samples from streams B, C, and D. That is, together with the previously stored first to 3N/4th samples, the radix-4 butterfly can now process the samples of stream A, because its (3N/4 + 1)th to Nth samples are ready at this moment. Also, since only one butterfly is used at each stage, the (3N/4 + 1)th to the Nth samples of input streams B, C, and D are stored in the vacated memories a1, a2, and a3, respectively. Continuing with the example of N = 2048, at the end of the 2048th clock cycle, the radix-4 butterfly has computed the 2048 samples of stream A, and the memory set {a1, a2, a3} is updated with the 1537th to the 2048th samples of streams B, C, and D, respectively.

Similarly, in the next N/4 cycles, the contents of memory set b are updated as shown in Fig. 4(c). The processor reads out the 2048 samples of stream B from the memory banks a1 and {b1, b2, b3} and sends them to the radix-4 butterfly.
Then, the emptied memories a1 and {b1, b2, b3} are updated with the first to the N/4th samples of streams A, B, C and D, respectively, of the second OFDM symbol. Continuing with the example of N = 2048, at the end of the 2560th clock cycle, the radix-4 butterfly has computed the 2048 samples of stream B, and the memories a1 and {b1, b2, b3} are updated with the first to the 512th samples of streams A, B, C, and D, respectively, of the second OFDM symbol. At the end of the 3072nd and the 3584th clock cycles, the radix-4 butterfly has handled streams C and D, respectively. Moreover, at the end of the 3584th clock cycle, all the memories are updated with the first to the 1536th samples of the second OFDM symbol. Similar procedures are then used to handle the second OFDM symbol.

For a practical implementation, the control mechanism of the proposed input scheduling is summarized in Fig. 4.2, where the switch-box at stage s updates the routing rule every $N/4^{s+1}$ OFDM symbol time. Each of the four scheduled sequences occupies 1/4 of one OFDM symbol time, hence all four scheduled sequences can be handled within one OFDM symbol duration using one radix-4 butterfly at each stage. As a result, the utilization rates for adders, multipliers and memories are 100%. The computational complexity of each stage is thus one radix-4 butterfly, three twiddle-factor multipliers, and a switch-box with first-in first-out buffers (FIFOs). Since stage s needs $3N/4^s$ words of FIFOs, together with the input scheduling memory of 3N words, the overall required memory size of the proposed radix-4 MDC FFT/IFFT processor with four parallel input streams is

$$3N + \sum_{s=1}^{\log_4 N - 1} \frac{3N}{4^s}$$

Butterfly Operations: The proposed FFT/IFFT processor uses radix-4 butterflies as fundamental computing elements, as shown in Fig. 8. Each stage adopts the same radix-4 butterfly, while the last stage uses a radix-8 butterfly which can also be configured as a radix-4 butterfly.
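The input-scheduling timeline and the memory-size expression described above can be checked numerically with a short Python sketch. The function names are hypothetical, and the memory calculation assumes that N is an exact power of 4 so that log4 N is an integer; the N = 2048 case, which ends in a radix-8 stage, is therefore used only for the timeline.

```python
def input_schedule(n):
    """Clock cycle at which the butterfly finishes each stream of the first symbol.

    Stream A is complete after n cycles; B, C and D each need a further n/4.
    """
    quarter = n // 4
    return {stream: n + i * quarter for i, stream in enumerate("ABCD")}

def log4_exact(n):
    """Integer log base 4; raises ValueError if n is not a power of 4."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages

def total_memory_words(n):
    """3N words of input scheduling memory plus 3N/4^s FIFO words for stage s."""
    stages = log4_exact(n)
    fifo_words = sum(3 * n // 4 ** s for s in range(1, stages))
    return 3 * n + fifo_words

print(input_schedule(2048))      # completion cycles for streams A to D
print(total_memory_words(1024))  # memory words for a 1024-point transform
```

For N = 2048 the schedule reproduces the cycle counts quoted in the text (2048, 2560, 3072 and 3584). Under the power-of-4 assumption the geometric sum collapses to N − 4, so the total is 4N − 4 words.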
As for the storage requirement of the twiddle factors, Lin suggested keeping only the twiddle factors whose phase indices are within N/8; the rest of the twiddle factors can be derived by quadrant conversion. As for the complex multiplications, each radix-4 butterfly needs three multipliers and five real adders. We adopted the previously proposed routing rule for the switch-box. We propose a configurable radix-8/radix-4 butterfly for the last stage, where the twiddle-factor multiplications can be realized by constant multipliers. This butterfly is composed of one radix-4 and four radix-2 butterflies, as shown in Fig. 4.2. When a radix-4 instead of a radix-8 computation is needed, this butterfly enables only the internal radix-4 computations and disables the other radix-2 computations.

Figure 8 MDC architecture for the Proposed System

Figure 9 Memory scheduling for the Even and Odd indexed terms.

III. RESULTS AND CONCLUSIONS

A DHT of 8-bit input is implemented with a radix-4 implementation. The selection line switches between the DHT and its inverse. Inputs of variable length have been tested and synthesized. The FFT consumed 8 adders and 4 multipliers, whereas the proposed scheme uses only 4 adders and 2 shared multipliers, since one of the twiddle factors is one. The implementation is done in Verilog using the Xilinx tool.

Split-radix techniques are very attractive since they provide both compact size and minimum operation counts. As processors evolve, the finite register set limitation also becomes less stringent. Such an environment could be used to exploit the advantages of larger and more complex algorithms such as vector radix techniques. Many promising hybrid techniques have also been developed and deserve attention. Image should also support saving the FHT buffer to disk as well as its power spectrum. The ability to view and alter the amplitude and phase of images should also be supported. Finally, the dyadic frequency domain operations deserve optimization, since their speed could be doubled without too much difficulty.

Figure 10 Stage box 1
The above simulation shows the twiddle factors distributed to the multipliers. W0..W3 represent the twiddle factors in memory; f0..f3 represent the switching activity of the multiplier.

Figure 11 Stage box 2
The above simulation shows the twiddle factors distributed to the multipliers. W0..W3 represent the twiddle factors in memory; f0..f3 represent the switching activity of the multiplier. The multiplier consists of 1 and -1.

Figure 8 Stage 3
The above simulation shows the twiddle factors distributed to the multipliers. W0..W3 represent the twiddle factors in memory; f0..f3 represent the switching activity of the multiplier. As the multiplier consists of the common digits -1 and 1, it can be shared.

Figure 9 Top module
The above simulation shows the real values obtained; the imaginary values are set to zero. Two successive inputs are given, and the sel line selects between the inverse and the normal DHT output.

Figure 10 Multiplication
The above simulation is the multiplier output for the twiddle factors and the inputs. The outputs are shown by out1 to out4.

Figure 11 Imaginary multiplication
The above simulation is the multiplier output for the twiddle factors and the inputs. The imaginary multiplication is done by inverting the polarity.

Fig. 16 and 17 below show RTL schematics of the DHT module.

Figure 12 RTL schematic DHT Module

Figure 13 RTL schematic DHT Module

Figure 14 Comparison for Modular DHT and MDC DHT

REFERENCES

1. G. Bi, Y. Chen, and Y. Zeng, "Fast algorithms for generalized discrete Hartley transform of composite sequence length," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 47, no. 9, pp. 893–901, Sep. 2000.
2. D. F. Chiper, "Radix-2 fast algorithm for computing discrete Hartley transform of type III," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 5, pp. 297–301, May 2012.
3. H. Z. Shu, J. S. Wu, C. F. Yang, and L. Senhadji, "Fast radix-3 algorithm for the generalized discrete Hartley transform of type II," IEEE Signal Process. Lett., vol. 19, no. 6, pp. 348–351, Jun. 2012.
4. G. Bi, "New split-radix algorithm for the discrete Hartley transform," IEEE Trans. Signal Process., vol. 45, no. 2, pp. 297–302, Feb. 1997.
5. P. K. Meher, J. C. Patra, and M. N. S. Swamy, "High throughput memory-based architecture for DHT using a new convolutional formulation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 7, pp. 606–610, Jul. 2007.
6. P. K. Meher, T. Srikanthan, and J. C. Patra, "Scalable and modular memory-based systolic array architectures for discrete Hartley transform," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 5, pp. 1065–1077, May 2006.
7. P. K. Meher, "LUT optimization for memory-based computation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 4, pp. 285–289, Apr. 2010.
8. R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1983.

Authors Profile:

Thamma Sai Sireesha is pursuing her Master's degree (M.Tech) in Very Large Scale Integration (VLSI) Systems at KKR & KSR Institute of Technology & Science.

G. Malyadri is working as an Assistant Professor at KKR & KSR Institute of Technology & Science. He has over seven years of teaching experience.