Image Compression with 2-D Discrete Cosine Transforms
Final Report – 12/09/99
John Hill, David Oltmanns, Delayne Vaughn

Table of Contents

Acknowledgements
Abstract
Chapter 1: Introduction
    Project Objectives
Chapter 2: Background on DCT
Chapter 3: Algorithm Choices
Chapter 4: Algorithm Solutions
    System Description
    DCT Explanation
    Quantization Factor
    Huffman Coding Explanation
    Breakdown of C Code
    Verilog Code
Chapter 5: System Components
    FPGA Internal Layout
    DCT/IDCT Blocks
    DCT Module Pin Descriptions
    Main Control Module
    Main Control Module Descriptions
    Serial Module
    Byte-Stream-to-Bus Control Module
    Bit Stream to Bus Control Module Pin Descriptions
    Serial Hardware
    Hardware Detail
    Final FPGA
    Developmental Components
Chapter 6: Development Tools
    Hardware
    Software
Chapter 7: Possible Improvements
    Serial Alternatives
    Memory and Camera Considerations
    PC User Interface
Chapter 8: Challenges & Solutions
    Software Algorithm
    Verilog Algorithm
    The Camera Hardware
    Serial Considerations
    FPGA(s)
Chapter 9: Time Line
Chapter 10: Results and Discussions
    Software Results
Individual Contributions
    John Hill
    David Oltmanns
    Delayne Vaughn
References
Appendices Index

Acknowledgements

Foremost we’d like to thank Dr. Rabi N. Mahapatra. Dr. Mahapatra helped us to find an intriguing and extremely challenging project and has continuously aided our development. He provided much of our original research and also served as a great resource along the way. Also, our teaching assistant Nan Ni has helped us by supplying parts and, in many cases, technical data sheets so that we could build the hardware portions of our project. Finally, we’d like to mention Trey Griffin. Trey shared his time and know-how to help us complete the serial interface for our system. Many of the components and even test banks were borrowed from him and modified to suit our needs. Without these people, we could not have taken even the first step towards our goal.

Abstract

Our project is a compression/decompression system utilizing an FPGA interfaced with a Connectix QuickCam and a PC.
The system will allow an image to be captured by the camera, then compressed and decompressed using an FPGA for some of the more time-consuming tasks. Finally, the resulting image will be displayed. The FPGA will be outfitted with 2-D DCT logic and will also incorporate a serial interface, camera interface, and memory control module. Our goal is for the system to increase the speed at which an image can be compressed and decompressed using these transforms. We expect that the FPGA will allow us to do so, as it should be significantly faster with regard to digital signal processing.

Chapter 1: Introduction

Image compression has been of interest to the computing community for quite some time. With bandwidth limited in today's marketplace, the ability to compress data, especially images, is in demand. The two factors highest in priority are speed and compression ratio. The aspect we have chosen to attempt to optimize is speed. By porting the computationally intensive portions of the image compression process to an FPGA, significant speed gains should be possible. By choosing an efficient algorithm for the computation of the Discrete Cosine Transform (DCT), even greater speed gains should be attainable. The implications of this are myriad, as the speed gains could allow for the use of image compression in real-time applications. Our original project objectives are listed below.

Project Objectives

Our design will use a Connectix QuickCam and PC interfaced to a Xilinx FPGA. We hope to achieve the following goals:

- Capture the image using the QuickCam
- Compress the image using a 2-D Discrete Cosine Transform (DCT) on a Xilinx FPGA
- Decompress the image using the Inverse DCT (IDCT)
- Display the decompressed image on a PC using a serial port as the interface

In addition to these objectives, the project was first implemented entirely in C to provide accurate simulations.
Once satisfactory software simulation was achieved, it was possible to begin the process of designing the hardware.

Chapter 2: Background on DCT

It has long been acknowledged that the Karhunen-Loève Transform (KLT) is optimal for signal compression, but it is not feasible to implement. Introduced in 1974 by Ahmed, Natarajan, and Rao, the Discrete Cosine Transform (DCT) presents a viable alternative to the KLT. They demonstrated the close approximation that the DCT provides to the KLT. The two-dimensional DCT for a square N x N matrix is defined in Figure 2.1.

Figure 2.1 - DCT Definition
c(i,j) is given by c(0,j) = 1/N, c(i,0) = 1/N, and c(i,j) = 2/N for both i and j ≠ 0. The input matrix is s, and t is the output matrix.

Like the Fourier transform, the DCT maps the signal to the frequency domain. In fact, the DCT is often computed indirectly by first computing a Fast Fourier Transform (FFT). The output matrix, t, represents the different frequency components of the image. The upper left-hand corner of the DCT matrix provides the coefficients for the low-frequency components while the lower right-hand corner corresponds to high frequencies.

Once the DCT has been computed, the next step in compression is quantization. This essentially discards less-important information. The human eye is much less responsive to very high-frequency components, so a quantization matrix is derived which rounds the entries in the DCT matrix, giving more attention to the lower frequencies while often virtually disregarding the higher frequencies. Quantization is the part of the process that actually allows for compression. The quantization matrix can be altered to create an acceptable balance between image quality and compression ratio. Once quantization has occurred, the data are encoded to a bit-stream, in which form they are stored or transported.

Image decompression follows essentially the reverse flow. First the data are decoded, then they are dequantized.
Finally, the inverse DCT (IDCT) is used to reconstruct the original signal. The IDCT is defined in Figure 2.2.

Figure 2.2 - Inverse DCT
All variables are defined in Figure 2.1.

Being very computationally intensive and thus somewhat cumbersome, the DCT does not lend itself to real-time applications when implemented in software. To combat this speed lag, we are proposing to implement the DCT in hardware. Through the years, a variety of algorithms have been presented to quickly compute the DCT. We have chosen to implement the recursive algorithm presented by Cvetkovic and Popovic. The benefits of this algorithm are myriad. First, it is spatially efficient, requiring fewer adders and multipliers than direct row-column approaches. Second, it is a direct computation of the DCT, i.e. it does not require the use of the FFT or some other transform. Finally, it provides excellent speed gains, which, after all, is the reason for porting the transform to hardware in the first place. A signal flow graph for this algorithm follows in Figure 2.3, and Figure 2.4 shows how we have extended it for the two-dimensional case.

[Figure 2.3: signal flow graph with inputs x(0) to x(7), outputs X(0) to X(7), multipliers C1/4, C1/8, C5/8, C1/16, C5/16, C9/16, C13/16, and stages labeled bitrev, scramble, fwd_butterflies, fwd_sums]

Figure 2.3. Signal flow graph for 1D-DCT. Circles indicate addition, squares subtraction, and arrows multiplication. Ca/b = (2 cos(aπ/b))^-1. Named sections indicate routines in C code.
[Figure 2.4: an 8 x 8 block of inputs x(0) to x(63) passes through sixteen 1-D FCT blocks in stages labeled ROWS and COLUMNS; each output X(0) to X(63) is multiplied by 1/4]

Figure 2.4 – 2-D DCT Signal Flow. Arrows indicate multiplication.

Chapter 3: Algorithm Choices

In developing our project, it became necessary to choose algorithms for the actual implementation of the various stages of the code. While some of these issues were discussed in earlier sections, we will once again review those decisions along with some other choices that have not previously been discussed.

First, it was necessary to choose an algorithm to compute the DCT. Many papers have been written about this, and we read several of them. Because of its simplicity and its spatial and temporal efficiency, we chose the algorithm presented by Cvetkovic and Popovic for the one-dimensional DCT. For simplicity, we then chose to compute the two-dimensional DCT using a row-column computation of the one-dimensional DCT.
The row-column approach provided a good combination of speed, simplicity, and size. The signal flow graphs for these algorithms are shown above in Figures 2.3 and 2.4.

A variety of encoding schemes is available which would have served well, but we chose to use Huffman encoding. Several factors influenced this decision. First, this is the encoding scheme most often used in the Joint Photographic Experts Group (JPEG) image format, the industry standard for natural image compression. It is also simpler than arithmetic encoding, which is sometimes used for JPEG compression, and the technology is not proprietary, as is the case with each of several arithmetic encoding algorithms. These were the primary considerations when choosing an encoding algorithm. The remaining portions of the code did not require the development of a “formal” algorithm.

Chapter 4: Algorithm Solutions

The purpose of this project is to present a method of compressing images with an FPGA, combining 2-Dimensional Discrete Cosine Transforms (DCT) and Huffman encoding. A camera will take an 8-bit grayscale image and send it to a PC, in which it is stored as a bitmap (BMP) file. In the computer, header information is removed from the BMP file, and the remaining part of the BMP file is the luminance information, one byte per pixel. This is broken up into 8-byte by 8-byte blocks and sent to the FPGA for compression/decompression. Since this is the computationally intensive part, the FPGA should allow for quicker processing. See Figure 4.1.

[Figure 4.1: blocks labeled Camera, Personal Computer, Serial Hardware, and FPGA]

Figure 4.1 - Data Flow

System Description

An enlarged view of the FPGA from Figure 4.1 is shown below in Figure 4.2. The FPGA is broken up into the serial interface, control module, bus control, DCT data bus, compression, and decompression algorithms.
The serial interface receives the bits from the computer byte by byte and puts the bytes onto the DCT data bus until the bus control tells the control module that the bus is full and ready for compression/decompression to take place. Depending on what signal is sent to the FPGA, the control module will call the correct compression/decompression algorithm.

[Figure 4.2: blocks labeled Serial Interface, Control Module, Bus Control, DCT Data Bus, Compression, and Decompression]

Figure 4.2 - Components of FPGA

Within the compression components there are three parts: DCT, Quantizer, and Encoder. Figure 4.3 illustrates this. The encoder/decoder is where the Huffman encoding occurs. When the information leaves the encoder, it is in its compressed form.

Figure 4.3 - Compression/Decompression Breakdown

DCT Explanation

The DCT is the most computationally intensive portion of the process. Figure 10.7 below shows that the DCT/IDCT takes about 75% of the processor cycles for the entire project. The DCT is calculated for each row sequentially and then repeated for the columns. This allows for a 2-dimensional DCT to be implemented using just a 1-dimensional DCT. The two-dimensional DCT for a square N x N matrix is defined in Figure 4.4.

Figure 4.4 - DCT Definition
c(i,j) is given by c(0,j) = 1/N, c(i,0) = 1/N, and c(i,j) = 2/N for both i and j ≠ 0. The input matrix is s, and t is the output matrix.

The output matrix provides the values for the low frequencies in the upper left-hand corner while the lower right-hand corner contains the high frequencies. The human eye is more responsive to the low frequencies, and fortunately they tend to retain higher values. This causes more of the important information to be retained during quantization while the high frequencies often go to zero.

Quantization Factor

Quantization is the part that actually allows for compression.
The quantization factor can be altered to create an acceptable balance between image quality and compression ratio. The pixel information in the DCT matrix is divided by the quantization factor, thus producing fewer distinct values for the Huffman coding to deal with.

Huffman Coding Explanation

The idea behind Huffman coding is simply to use shorter bit patterns for more common numbers and longer bit patterns for less common numbers. The first step in Huffman encoding is to find the frequency of the numbers involved and then to create an encoding tree with this information. In the encoding tree, each different value is a node, and its position in the tree determines its bit pattern. The tree is used to compress the information and to send a smaller bit stream to represent the original bit stream. This, along with the encoding tree, is sent in the compressed file.

Breakdown of C code

We broke the software into three main programs: compress.c, DCT.c, and huffman.c. Each file had its own header file. DCT.c contains the procedures that handle the DCT. huffman.c contains the Huffman encoding routines, which actually do the compression. The compress.c file integrates the other programs. The code in the DCT.c and huffman.c files was found online and is cited in our reference section. The modifications to this code were minimal.

In the compress.c file, first, the size of the header and the height and width of the image are obtained from the image file. Second, the pixel information is extracted from the BMP file. Then, this information is divided into 8x8 blocks and processed by the DCT. Next, the overall matrix is quantized. Finally, the encoding routine is called. At this point, the file is in its compressed form and we have achieved the goal of our project.

Now, the reverse process can begin. First, the file is decoded using the huffman.c code again.
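The Huffman idea described above (count how often each value occurs, build a tree, give common values short codes) can be sketched in C as follows. This is not the huffman.c used in the project. The names `count_frequencies` and `huffman_code_lengths` are hypothetical, and for brevity the sketch finds the two lightest nodes with a linear scan and reports only the code length of each byte value rather than the bit patterns themselves.

```c
#include <stddef.h>

#define NSYM 256

/* Count how often each byte value occurs in the data. */
void count_frequencies(const unsigned char *data, size_t len,
                       unsigned long freq[NSYM])
{
    for (int s = 0; s < NSYM; s++) freq[s] = 0;
    for (size_t i = 0; i < len; i++) freq[data[i]]++;
}

/* Build the Huffman tree and store each symbol's code length in bits.
 * Symbols that never occur get length 0. */
void huffman_code_lengths(const unsigned long freq[NSYM], int length[NSYM])
{
    unsigned long weight[2 * NSYM];
    int parent[2 * NSYM];
    int alive[2 * NSYM];            /* 1 while a node still has no parent */
    int nodes = NSYM, remaining = 0;

    for (int i = 0; i < 2 * NSYM; i++) {
        weight[i] = 0; parent[i] = -1; alive[i] = 0;
    }
    for (int s = 0; s < NSYM; s++) {
        length[s] = 0;
        weight[s] = freq[s];
        if (freq[s] > 0) { alive[s] = 1; remaining++; }
    }
    if (remaining == 1) {           /* a lone symbol still needs one bit */
        for (int s = 0; s < NSYM; s++)
            if (alive[s]) length[s] = 1;
        return;
    }
    while (remaining > 1) {
        int a = -1, b = -1;         /* a = lightest node, b = second lightest */
        for (int i = 0; i < nodes; i++) {
            if (!alive[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        /* merge the two lightest nodes under a new internal node */
        weight[nodes] = weight[a] + weight[b];
        alive[nodes] = 1;
        alive[a] = alive[b] = 0;
        parent[a] = parent[b] = nodes;
        nodes++;
        remaining--;
    }
    /* a symbol's code length is its depth below the root */
    for (int s = 0; s < NSYM; s++) {
        if (freq[s] == 0) continue;
        for (int n = s; parent[n] >= 0; n = parent[n]) length[s]++;
    }
}
```

For the seven bytes "aaaabbc" this assigns 1 bit to 'a' and 2 bits each to 'b' and 'c', so the payload shrinks from 56 bits to 10 bits before the cost of storing the tree is added.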
After decoding, the pixel information is extracted and put into a matrix so that dequantization can be done. Next, the matrix is broken up into 8x8 blocks and the IDCT code within DCT.c is used. Now we must clean up the values to ensure all values are between 0 and 255 inclusive. (Values can fall outside this range because quantization introduces error.) Finally, the matrix is reattached to the header and a new BMP file is created.

In the DCT.c file we have a bit reverse function, inverse/forward sums, unscramble/scramble, inverse/forward butterflies, a routine to initialize the cosine array, the ifdct/fdct noscale function which calls the above functions, and a main that calls the functions in the correct order. As previously mentioned, the DCT code was obtained from the web and is capable of handling any block size divisible by two. We should have updated the code and made it explicitly for 8 by 8 blocks. This will be discussed in greater detail in a later section.

In the huffman.c file we first count the number of occurrences of each byte in the data. Next we build the initial heap from the frequency count. The code tree can then be built, and this is what is used to generate the compression code. At this point the image is compressed. The reversal of this process is much easier since the code tree is passed along with the compressed code. All that remains is to decompress the image using the code tree.

Verilog code

In the beginning we thought the translation from C to Verilog was going to be straightforward and easy. This was NOT the case. Things as simple as add, subtract, and multiply took over 2 pages of Verilog code to implement. The first step was to take all 8-bit numbers and extend them to 24-bit fixed-point numbers, with 12 bits to the left of the binary point and 12 bits to the right. This allows for precision to approximately 10^-3. After the DCT was run, some of the numbers might be negative, so we had to assign a sign array to follow the matrix information.
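The number format described above can be modeled in C to see why the arithmetic becomes long once the sign is kept apart from the value. The following is a minimal sketch under stated assumptions, not a translation of the Verilog. The type `fixed24` and the functions `fixed_from_pixel`, `fixed_add`, `fixed_sub`, and `fixed_mul` are hypothetical names, a sign value of 1 is taken to mean negative, and results are truncated rather than rounded.

```c
#include <stdint.h>

#define FRAC_BITS 12            /* bits to the right of the binary point */
#define MAG_MASK  0xFFFFFFu     /* keep magnitudes within 24 bits */

typedef struct {
    uint32_t mag;               /* 24-bit magnitude in 12.12 format */
    uint8_t  sign;              /* 1 = negative, 0 = positive */
} fixed24;

/* Extend an 8-bit pixel to the 24-bit format with a positive sign. */
fixed24 fixed_from_pixel(uint8_t pixel)
{
    fixed24 v = { (uint32_t)pixel << FRAC_BITS, 0 };
    return v;
}

/* Sign-magnitude addition: the case analysis is the "big if-then-else". */
fixed24 fixed_add(fixed24 a, fixed24 b)
{
    fixed24 r;
    if (a.sign == b.sign) {             /* same sign: add magnitudes */
        r.mag = (a.mag + b.mag) & MAG_MASK;
        r.sign = a.sign;
    } else if (a.mag >= b.mag) {        /* opposite signs: larger one wins */
        r.mag = a.mag - b.mag;
        r.sign = a.sign;
    } else {
        r.mag = b.mag - a.mag;
        r.sign = b.sign;
    }
    if (r.mag == 0) r.sign = 0;         /* avoid a negative zero */
    return r;
}

/* a - b is a + (-b): flip the sign of b and reuse the adder. */
fixed24 fixed_sub(fixed24 a, fixed24 b)
{
    b.sign ^= 1;
    return fixed_add(a, b);
}

/* Multiply magnitudes, then shift out the 12 extra fraction bits. */
fixed24 fixed_mul(fixed24 a, fixed24 b)
{
    fixed24 r;
    uint64_t wide = (uint64_t)a.mag * b.mag;
    r.mag = (uint32_t)(wide >> FRAC_BITS) & MAG_MASK;
    r.sign = (r.mag == 0) ? 0 : (uint8_t)(a.sign ^ b.sign);
    return r;
}
```

One step of the 12-bit fraction is 2^-12, roughly 0.00024, which is consistent with the stated precision of about 10^-3.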
The sign array uses the convention 1 = negative and 0 = positive. The add/subtract tasks were long because they had to handle the signs of the values separately from the actual values. Granted, this is just one big if-then-else statement, but it is an example of something that was simple in C and not so simple in Verilog. To get around the problem of initializing the cosine array, we simply translated the cosine values into the format above and saved them as constants in Verilog.

In the Verilog code we made one module for the DCT and one module for the IDCT. The modules are basically the same, so only the DCT module is discussed here. We created the DCT module and made tasks for all of the steps we had to do inside. We made a task to extend the numbers to 24-bit numbers and to zero out the sign array. Next we created tasks for add, subtract, and multiply. The tasks for bit reverse, half-bit reverse, scramble, forward sums, and forward butterflies are all represented in the signal flow graph below (Figure 4.5). We had to add a special task to round all of the values to whole integers and to handle the quantization. We chose a quantization factor of 64 for the project.

[Figure 4.5: signal flow graph with inputs x(0) to x(7), outputs X(0) to X(7), multipliers C1/4, C1/8, C5/8, C1/16, C5/16, C9/16, C13/16, and stages labeled bitrev, scramble, fwd_butterflies, fwd_sums]

Figure 4.5 - Signal flow graph for 1D-DCT. Circles indicate addition, squares subtraction, and arrows multiplication. Ca/b = (2 cos(aπ/b))^-1. Named sections indicate routines in Verilog code.

Chapter 5: System Components

FPGA internal layout

At the heart of our system is our Field Programmable Gate Array (FPGA). Our final design utilizes a Xilinx Virtex 400HQ240 chip, which has 166 user input/output pins and a 40 x 60 CLB array. We chose this package based on our CLB needs, which we estimated from a similar transform algorithm. Xilinx also recommended this particular FPGA series for DSP (digital signal processing).
Figure 5.1 shows the FPGA itself and how it is seated on our board. The pins on the FPGA were closer together than we expected, so we had to modify the board to allow it to fit properly. Below is a more detailed description of the core logical units that were designed for the Virtex chip.

Figure 5.1

DCT/IDCT blocks

Figure 5.2 shows the DCT module as generated from our Verilog code. The DCT module interfaces with both the control module and the Bit-Stream to Bus Control Module. The DCT_Start and DCT_Stop signals are connected to the control module. The DCT_Start signal is raised by the control module after it has been alerted that all the data on the input bus (FF[0:511]) is ready and stable. Once the DCT_Start signal is raised, the DCT module begins compressing the input data. Once the DCT module has completed its tasks, it asserts the DCT_Stop signal to alert the control module that the output is ready for transfer back to the PC via the serial module/hardware.

Figure 5.2

DCT Module Pin descriptions

INPUTS: GO, FF[0:511]
OUTPUTS: DCT_STOP, OUT_MAT[0:511], F_SIGN[0:63], DONE

See the appendix for the Verilog code that fully defines this and the IDCT macro. Also, see Chapter 4 for detail on the internals of the DCT/IDCT.

Main Control Module

The main control module is a state machine implemented in Verilog. It uses handshakes with the other modules to move the data safely and reliably through the system. Initially, when the system is started, the control module resets the serial module via the SERIALRSET signal. The control module then asserts the SERIALGO signal to allow the serial module to begin receiving data from the PC. When MEMDONE and SERIALDONE are both asserted, the control module knows that the data has arrived successfully and is now on the input bus of the DCT/IDCT. It then initiates the DCT compression via the DCTGO signal. The DCT module will then return the DCTDONE signal when it has finished.
Continuing, the control module then re-asserts the SERIALGO signal and the data begins transfer back to the PC by way of the bit stream to bus control module. The bit stream to bus control module is the controlling mechanism that determines that data should be transmitted and not received during this process. Figure 5.3 shows a detail of this module and the wires that connect it to the other modules in the FPGA.

Figure 5.3

Main Control Module descriptions

INPUTS: DCTDONE, MEMDONE, SERIALDONE
OUTPUTS: DCTGO, SERIALGO, SERIALSTOP, SERIALRSET

Also reference the appendix (imagecompress.zip) for the full Verilog description of this module.

Serial Module

The serial module is largely taken from the PDACS project done in the spring of 1999. We modified it to fit into our project. It acts in one of two ways. Either it receives data from the PC and delivers it to the bit-stream-to-bus control module, or it receives data from the bit-stream-to-bus control module and delivers it to the PC. The bit-stream-to-bus control module itself controls the direction of data transfer, and the two modules together receive “begin” and “stop” instructions from the main control module. The serial module is unlike any other module because it interfaces between the FPGA and the serial hardware. Let’s take a closer look at this process. (Note: the pin descriptions below may aid in this explanation.)

Initially, the control module resets the serial module. As mentioned, this action affects the hardware as well, as the two are tightly coupled. When the SERIALSTART signal is given, the serial module and hardware work together to either receive data on the DIN[7:0] bus (from the PC) and deliver it to the MEMDOUT[7:0] bus, or receive data on the MEMDIN[7:0] bus and deliver it to the hardware (and eventually the PC) on the DOUT[7:0] bus. The bit-stream-to-bus control module and ultimately the main control module control the direction.
The majority of the other signals are for timing, preventing the buffers on either side of the module from being overrun or otherwise corrupted. Figure 5.4 shows the serial module schematic. Figure 5.5 is a detail of the I/O array that interfaces the DIN[7:0] and DOUT[7:0] busses with the serial hardware.

Figure 5.4

Figure 5.5

Main Control Module Pin descriptions

INPUTS: CLK, DSRCLK, DSR, SERRSET, TXRDY, RXRDY, MEMEND, MEMBUSY, MEMDIN[7:0], DIN[7:0]
OUTPUTS: MR, RD, WR, MEMRD, MEMWR, DOUT[7:0], MEMDOUT[7:0], SERIALSTART, SERIALSTOP, A[2:0]

See the appendix for our full Verilog description of this module.

Byte-Stream-to-Bus Control Module

This module is what allows the DCT and serial modules to interface smoothly. The DCT module requires a parallel input. It needs to act on all 64 bytes of information simultaneously. This is also how it naturally returns the compressed information. The IDCT behaves similarly but inversely. The serial module, when receiving, supplies data to this module one byte at a time (not a true bit stream). The bit-stream-to-bus control module essentially validates that information and lines it up on the DCT bus. When all the data is there, it signals the main control module to proceed. Without this module, data would be lost and never processed. On the return trip the same is true. When the data is returned from the DCT/IDCT stage, the bit stream to bus control module breaks it up and delivers it back to the serial module one value at a time. Figure 5.6 below shows this module and the wires we use to connect it to the rest of the components in the system.

Figure 5.6

Bit stream to Bus Control Module Pin descriptions

INPUTS: SerialDataIN[0:7], DCTDataIN[0:511], DCTSignIN[0:63], Write, Read
OUTPUTS: SerialDataOUT[0:7], DCTDataOUT[0:511], Busy, Done

Also reference the appendix (imagecompress.zip) for the full Verilog description of this module.

Serial Hardware

The serial hardware was built based on the explanations of the PDACS group from the spring.
We were able to fully implement our proposed design as seen in Figure 5.7. In this diagram the two center blocks represent the hardware that was physically built on our board. On the left and right ends are the proposed block representations of the computer and FPGA, respectively.

Figure 5.7

Hardware Detail

Figure 5.8 is the full detail diagram that we used in constructing the circuit. The PDACS group constructed this diagram. Their work was very well done and easy to follow. We used a different crystal because there weren’t any matching the original specifications. We then had to reconfigure the UART to match the new speed of 9600 baud. This may be a little slow, but with a different crystal the rate could be increased.

Figure 5.8

Final FPGA

Our final design utilizes a Xilinx Virtex 400HQ240 FPGA. These are the standard specifications as listed on the Xilinx web site. For additional details see the full data sheet in the appendix of this document. We chose this particular chip because of cost, I/O pins, and CLB concerns. Xilinx also recommended it for use in DSP applications.

XCV400HQ240
    CLB Array (Row x Col.):       40x60
    Logic Cells:                  10,800
    System Gates:                 468,252
    Max. Block RAM Bits:          81,920
    Max. Distributed RAM Bits:    153,600
    Delay Locked Loops (DLLs):    4
    I/O Standards Supported:      17
    Speed Grades:                 4, 5, 6
    I/O Pins:                     166

Figure 5.9 (See Page 61 of Xilinx Data Sheet)

Developmental Components

There were also several hardware and software components developed during the course of the project that were not part of our final design. These helped us to learn about, develop, and test our system.

The first of these components is the original camera control. This involved hardware and FPGA-programmed logic. It was our original goal to use this methodology for controlling the camera and capturing images. We based the design around the Basic Stamp 2 device as detailed in Chapter 22 of the online tutorials.
This was fairly simple and only took about two weeks to get up and working. The FPGA could capture an image from the camera and then return it to the PC by way of Windows HyperTerminal. In this way we could test the module by itself before we built it into our larger system. However, our camera had faulty wiring, and replacing it rendered our prior work useless. The new camera was incompatible with our hardware, and to make things worse the manufacturer provided no datasheets. This portion had to be scrapped and redesigned. Figure 5.10 shows the camera hardware. The above-mentioned Chapter 22 documentation is also included in the appendix.

Figure 5.10

During our development, we also utilized two other FPGAs that are worth mentioning. We realized early that we would need to order a much larger FPGA to complete our project. While we were waiting for the new FPGA to arrive we continued working by using the 4003E demo boards and also a free-standing 4010 FPGA. The 4003E was utilized for testing the camera interface. When it was time to start the serial module, we had to upgrade to the 4010 because the module required more CLBs. Both chips, especially the stand-alone one, required significant amounts of setup time, and it would have been more efficient if we had had the large FPGA from the start. Below are photos of these two FPGAs.

Figure 5.11

Chapter 6: Development Tools

Hardware

Basic Wire-wrapping Tools (wire-wrapper, wire, various pin connectors, snips, wire strippers, etc.): We used these simple tools to build our circuits on the board.

Multimeter: We used this to test voltages, resistances, and various other signals.

Logic Probe: The logic probe was the most useful of the devices we used to analyze the hardware. It made it easy to test the discrete values on different wires.

Software

Microsoft Visual C++ 6.0/GNU C Compiler: These were the main environments in which we developed the original software version.
They were helpful in debugging the software and getting it to a useful state.

Xilinx Foundation Series: This was used at length in the design of the serial module and the various control modules. It also would have been the final implementation tool for the whole project. While it was very helpful in design, its simulation capabilities were lacking.

Microsoft Visual Basic: Modules for the testing of the serial and control modules were developed in Visual Basic.

VeriLogger/VeriWell: These Verilog simulators were used extensively in designing and testing the hardware-based DCT. They provided much simpler simulation and verification than did the Xilinx Foundation Series.

Microsoft Visual FoxPro (HexEdit): We used the HexEdit tool within FoxPro to view the contents of image files so that we could understand their composition and verify the accuracy of our results.

Microsoft Photo Editor/Paint: These somewhat buggy image editors were used to manipulate images received from the QuickCam and to view the results of our compression/decompression software.

Chapter 7: Possible Improvements

Serial Alternatives

The first improvement would be another way to transfer the information to the FPGA, instead of the serial interface we borrowed from other groups. This was a huge bottleneck for our project. Perhaps USB or even FireWire would be better alternatives. Regardless, serial communication is not a good choice when shooting for speed improvements.

Memory and Camera Considerations

If the new QuickCam we received had been grayscale and had worked with the FPGA camera interface we designed, then we could have gone directly to the FPGA and avoided the serial communication slowdown on one side. Had we done this, we would also have needed to include enough memory outside of the FPGA to hold the entire image. Again, speed is important here.

PC User Interface

A better PC interface is another major improvement that would benefit our project.
Having one button to click to take the picture and another to compress it would be much easier for the end user. Because the new camera requires proprietary software, for now we have to use the QuickCam software to take the picture and then use Microsoft Photo Editor to save the picture as an 8-bit grayscale image. All of these things should be done as part of the PC interface, but they were not planned for in the original proposal. We were unaware that the new camera would require extra development time.

Chapter 8: Challenges & Solutions

Software Algorithm

The first major challenge we faced was in the development of the software version of the project. While it was relatively simple to get each of the major portions of the code (DCT, quantization, encoding, decoding, dequantization, IDCT) to work individually, the integration of the components proved to be a formidable task. By rigorous restructuring and clarification of the code we were able to overcome this obstacle and develop a software project that worked satisfactorily.

Verilog Algorithm

Many challenges were faced in the Verilog design of the DCT component. We naïvely thought that the conversion from C to Verilog would be relatively simple. In the final analysis, however, it appears that the best course would have been to ignore the C code when developing the Verilog code. By going back to the original algorithm, a much simpler Verilog solution could have been provided.

As mentioned earlier, one of the first problems we tackled was that of implementing floating point. While this was a challenge, it was by no means insurmountable, and we were able to overcome the obstacles along the way.

The second major issue we faced in the Verilog design was the fact that the Xilinx tools do not allow arrays of vectors, a standard Verilog construction. Because of this we had to begin viewing our 8-by-8-byte input not as an array of 64 8-bit vectors, but as one 512-bit vector.
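The change of view from 64 separate 8-bit values to a single 512-bit vector comes down to an index calculation. The C fragment below is only a sketch of that calculation, not a translation of our Verilog. It assumes row-major ordering of the 8x8 block and least-significant-bit-first storage within each element, and the names flat_bit_index, flatten_block, and read_element are hypothetical. Each bit of the flat vector is held in its own array slot so that the layout is easy to inspect.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_DIM   8                        /* the block is 8x8 elements  */
#define BLOCK_ELEMS (BLOCK_DIM * BLOCK_DIM)  /* 64 elements of 8 bits each */
#define BLOCK_BITS  (BLOCK_ELEMS * 8)        /* 512 bits in the flat view  */

/* Position of bit `bit` of element (row, col) inside the flat vector.
   Row-major element order is an assumption made for this sketch. */
static int flat_bit_index(int row, int col, int bit)
{
    assert(row >= 0 && row < BLOCK_DIM);
    assert(col >= 0 && col < BLOCK_DIM);
    assert(bit >= 0 && bit < 8);
    return 8 * (BLOCK_DIM * row + col) + bit;
}

/* Spread an 8x8 block of bytes over 512 single-bit slots. */
static void flatten_block(uint8_t block[BLOCK_DIM][BLOCK_DIM],
                          uint8_t flat[BLOCK_BITS])
{
    for (int r = 0; r < BLOCK_DIM; r++)
        for (int c = 0; c < BLOCK_DIM; c++)
            for (int b = 0; b < 8; b++)
                flat[flat_bit_index(r, c, b)] =
                    (uint8_t)((block[r][c] >> b) & 1u);
}

/* Rebuild element i (occupying bits 8*i .. 8*i+7) one bit at a time. */
static uint8_t read_element(const uint8_t flat[BLOCK_BITS], int i)
{
    assert(i >= 0 && i < BLOCK_ELEMS);
    uint8_t value = 0;
    for (int b = 0; b < 8; b++)
        value = (uint8_t)(value | (flat[8 * i + b] << b));
    return value;
}
```

Under these assumptions, element (1, 2) starts at bit 8 * (8 * 1 + 2) = 80 of the flat vector, and reading any element back returns the byte that was stored.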
All arrays and vectors had to be converted to this one-dimensional perspective.

Once the dimensionality hurdle was cleared, many problems appeared which this initial issue had masked. The most notable of these was the fact that we were not able to use non-constant ranges for the indices of our vectors. For example, we often had a for loop with the index i. To choose the i-th byte of a given vector a, we would select a[8 * i : 8 * i + 7]. However, this was not allowable, so we had to take the byte one bit at a time.

There were many other issues with which we had to deal before we were able to get the simulation of the DCT running. Once it was running, it appeared that the timing between different tasks within the DCT module was a problem. This issue has not yet been resolved.

We faced many obstacles in the development of the compression/decompression software and hardware, but we were able to overcome most of them. Given more time, the rest of these problems could be eliminated.

The Camera Hardware

The QuickCam is an important piece of our project. In order to compress an image, you first have to have an image. The question then was how to capture this image and manipulate it such that it could be compressed on the FPGA. The solution started off fairly simple. Several previous projects used QuickCams, and there was enough existing documentation to re-create their earlier work. This is where we started work. We were issued a camera and we built interfacing hardware utilizing a Basic Stamp 2 chip. By using HyperTerminal we could see that the camera was indeed returning a bitmap, but there was something odd about the resulting image. After a little diagnosis, it was discovered that the camera was broken and wasn’t able to take a clear, reliable picture. We ordered a new camera, and after working with it for only a few hours, we found that the new camera was incompatible with the hardware.
Furthermore, the company that makes the camera didn’t have any datasheets on that camera. The Basic Stamp 2 method for interfacing with the camera was scrapped after about 2 ½ weeks of development effort. Our final design conceded that a new FPGA interface for the camera would be far too complex for our timeframe. Instead we turned to our serial interface as a means of delivering the image to the FPGA. This meant that the previous serial interface had to be redesigned to incorporate this new need, and this is what was done.

Serial Considerations

Our original plan for the serial interface was simply to use it as a means of sending the final compressed image back to the PC. This unidirectional and relatively simple plan was laid out and researched. We found that other groups (especially PDACS) had used a serial connection before, and we were able to build from the existing base of their projects. The physical hardware was built identically to their design with the exception of a few resistors and a different crystal. Our choice of crystals was limited by the supply in the design lab, but we were able to find one that delivered a transfer rate of 9600. We were temporarily satisfied with this speed, as we knew that it could always be increased later.

The internal program for the FPGA was more difficult. After running test cases to ensure that the serial hardware was working, we redirected our efforts toward this task. A need developed for a module to interface with the serial module. The legacy code would never be able to provide the parallel data needed by the DCT/IDCT algorithms. A large bus was used and a new module (bit-stream-to-bus) was designed to handshake with the serial module and to enable this large bus to deliver the data for compression. This turned out to be a very successful method, as the existing serial module needed only minor tweaking.
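The hand-off between the byte-at-a-time serial side and the wide DCT bus can be modeled in a few lines of C. The fragment below is a hypothetical software model and not the Verilog module itself: the type stream_to_bus and the functions s2b_reset, s2b_write_byte, and s2b_read_byte are names invented for this sketch, and the handshake signals are reduced to return values. It assumes that one transfer is exactly one 64-byte block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUS_BYTES 64  /* one 8x8 block, i.e. the 512-bit bus */

typedef struct {
    uint8_t bus[BUS_BYTES]; /* models the wide bus toward the DCT/IDCT  */
    int     fill;           /* bytes received so far from the serial side */
    int     drain;          /* bytes already handed back to the serial side */
} stream_to_bus;

/* Start a new block transfer. */
static void s2b_reset(stream_to_bus *m)
{
    m->fill = 0;
    m->drain = 0;
}

/* Inbound path: accept one byte from the serial side.
   Returns true (the "Done" condition) once all 64 bytes are lined up. */
static bool s2b_write_byte(stream_to_bus *m, uint8_t byte)
{
    assert(m->fill < BUS_BYTES);
    m->bus[m->fill++] = byte;
    return m->fill == BUS_BYTES;
}

/* Return path: hand the bus contents back one byte at a time.
   Returns false when there is nothing left to send. */
static bool s2b_read_byte(stream_to_bus *m, uint8_t *out)
{
    if (m->drain >= m->fill)
        return false;
    *out = m->bus[m->drain++];
    return true;
}
```

In this model the caller keeps writing bytes until s2b_write_byte reports that the block is complete, processes the 64 bytes on the bus, and then reads them back out in the same order.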
The new module essentially emulates a memory controller, at least from the point of view of the serial module.

FPGA(s)

We knew from the start that we would need a big FPGA to make our project successful. As mentioned before, we ordered a Virtex 400HQ240 based primarily on our CLB needs, but we couldn’t afford to wait until it arrived to start developing our hardware. Instead we began work testing individual components of the system on small chips that were available in the lab. Two smaller FPGAs were used along the way. The first was a 4003 that was part of the standard demo boards. We were able to successfully utilize this chip while we were building the first camera hardware. It was a bit messy with all the jumper wires that were needed to tie the chip into our main board. On the other hand, it was extremely convenient because the boards already had LEDs, switches, and buttons that proved useful. This FPGA had to be retired when it came time to start the serial interface. It simply wasn’t big enough.

Searching for more CLBs, we were given a 4010. This was a sizable step up from the previous FPGA, but it was not incorporated into a demo board. We had to wire the chip into a new board and connect not only the power and ground pins but also all of the LEDs and switches we would need. This was a simple matter, but it nevertheless took some time. We were eventually successful at getting the serial hardware and FPGA logic working on this chip.

When the FPGA we ordered finally arrived, we had about 2 ½ weeks remaining. Much of the project was completed, but it proved to be a very time-consuming task to integrate the new FPGA. The primary problem was the size of the wire-wrapping socket. The pins were smaller than our wire-wrapping tools, and each pin could only hold about two wires before the wrap would touch neighboring pins. This was one of the few obstacles that were never overcome.
Chapter 9: Time Line

Date
Week 3: 9/12-9/18
Week 4: 9/19-9/25
Week 5: 9/26-10/2
Week 6: 10/3-10/9 (Oct. 7)
Week 7: 10/10-10/16
Week 8: 10/17-10/23 (Oct. 21)
Week 9: 10/24-10/30
Week 10: 10/31-11/6 (Nov. 4)
Week 11: 11/7-11/13
Week 12: 11/14-11/20 (Nov. 18)
Week 13: 11/21-11/27
Week 14: 11/28-12/3
Week 15: 12/4-12/10

Task
Research DCT and subsystems
Begin Proposal
Finalize Proposal
Present Proposal
Status: Done, Done, Done, Done
Begin Software Compression Algorithm
Begin Camera Interface
Begin Software Decompression Algorithm
Finish Software Compression Algorithm
Finish Camera Interface
Begin Serial Hardware
Biweekly Report 1
Begin FPGA Compression Module
Continue Serial Interface Testing
Continue Serial Testing
Done
Begin Control Modules
Biweekly Report 2
Begin FPGA Decompression Module
Continue test bank cases on Serial Module
Begin Midterm Presentation
Continue FPGA Decompression
Mid-term Presentation
Begin Integration of Serial and DCT modules with the Control Module
Continue Integration of Modules
Done, Done, Done
Continue IDCT on FPGA
Begin Setup of New FPGA for programming
Biweekly Report 3
Begin final report and documentation
Continue working with the new FPGA
Thanksgiving
Finalize secondary modules
Finish working with the new FPGA
Finish IDCT
Finish Reports and Documentation
Done, Done, Done

Notes
Lots of good references found on the web and from other groups.
Pretty smooth - few things to reconsider.
Choose Data flow graph methodology.
Using Chapter 22 Tutorial
Hard to test without FPGA
Done, Done, Done, Done, Done
Tested with 4003 and HyperTerm
PDACS group is very helpful.
Done
Different Crystal used - 9600
Done
Test bank seems to be working on 4010 FPGA
Main Control Module for Data Flow
Done
DTR select and spooling test cases working!
Done, Done, Done, Done, Done
Can’t test this until the big FPGA arrives - but it should work.
Timing is critical. Handshaking used for stability.
Yeah! - The new Chip is here.
Done, Done, Done
Small pins cause difficulties
Done
X
Umm - Turkey
Main Control, Stream to Bus
This proved way more time consuming and largely impossible.
X
Done
Turned in: 12/9/99

Chapter 10: Results and Discussions

Software Results

We tested our software originally with an 8x8 image. Our final test runs were based on a 320x240 image. Our results from these tests are shown below. The original picture is presented (Figure 10.1) along with the pictures corresponding to several different quantization factors. These compressed images (Figures 10.2-10.5) degrade in quality as the quantization factor (indicated below each image) is increased.

Figure 10.1 - Original Picture

Figure 10.2 - Quantization Factor = 16

Figure 10.3 - Quantization Factor = 32

Figure 10.4 - Quantization Factor = 64

Figure 10.5 - Quantization Factor = 128

Figure 10.6 below illustrates the differences between the compression percentages, and Chart 10.1 shows the actual values. There is not much difference among them: the compressed files all fall between roughly twenty and twenty-six percent of the original size. We therefore have to weigh image quality against compression ratio to determine which quantization factor gives the best balance. The biggest effect of the quantization factor is on image quality, as shown in the figures above.

Quantization Factor   Initial File (bytes)   Compressed File (bytes)   Compression Percentage
128                   76800                  15845                     20.63
64                    76800                  16396                     21.35
32                    76800                  17386                     22.64
16                    76800                  19906                     25.92

Chart 10.1 - File Compression Data

Figure 10.6 - Compression Percentages (Percentage of Original versus Quantization Factors)

Since the difference in compression percentage is not that dramatic, a quantization factor of 16 has been chosen as the most beneficial. To calculate the time it takes for the entire compression/decompression to run, we recorded the number of processor ticks used by each activity. This is shown in Chart 10.2. The DCT/IDCT consumes over 75% of the process.
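The figures quoted here can be checked with simple arithmetic. The C sketch below is illustrative only, and the function names percent_of, sum_ticks, and dct_idct_share are hypothetical. It assumes that the compression percentage is the compressed size divided by the initial size, and that the DCT/IDCT share is taken over the six per-stage overall averages reported in Chart 10.2 (DCT, quantize, encode, decode, dequantize, IDCT), leaving out the header break-up and file re-creation steps.

```c
#include <assert.h>

/* `part` expressed as a percentage of `whole`. */
static double percent_of(double part, double whole)
{
    assert(whole > 0.0);
    return 100.0 * part / whole;
}

/* Sum of n tick counts. */
static double sum_ticks(const double *ticks, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += ticks[i];
    return total;
}

/* Combined DCT and IDCT share of the six per-stage averages.
   Assumed stage order: DCT, quantize, encode, decode, dequantize, IDCT. */
static double dct_idct_share(const double avg[6])
{
    return percent_of(avg[0] + avg[5], sum_ticks(avg, 6));
}
```

For example, percent_of(15845, 76800) gives about 20.63, matching the first row of Chart 10.1. With the overall averages 225.95, 28.55, 52.55, 40.05, 28.55, and 232.85, the DCT and IDCT together account for about 75.4% of the 608.5 ticks.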
Thus, we determined that it was the part that most needed to be implemented on the FPGA. The encoding takes a little more time than decoding because it has to make the encoding tree. Chart 10.2 below shows the data collected and the graph (Figure 10.7) illustrates this.

Quantization Factor  Break up header  DCT     Quantize  Encode  Decode  Dequantize  IDCT    Remake file
128                  100              221     30        70      30      30          220     110
128                  100              221     30        60      40      30          250     151
128                  101              230     20        40      40      20          230     111
128                  100              231     30        40      40      30          240     101
128                  100              230     30        50      30      30          221     100
AVG 128                               226.6   28        52      36      28          232.2
64                   100              231     20        50      60      20          230     111
64                   90               221     30        50      40      30          250     121
64                   90               221     20        50      40      20          280     121
64                   100              240     30        51      40      30          210     110
64                   90               230     30        60      30      30          221     110
AVG 64                                228.6   26        52.2    42      26          238.2
32                   100              230     40        50      40      40          221     120
32                   90               230     20        50      40      20          231     110
32                   100              210     30        50      41      30          210     120
32                   100              231     30        50      40      30          220     110
32                   90               221     40        60      30      40          250     121
AVG 32                                224.4   32        52      38.2    32          226.4
16                   90               230     30        50      50      30          241     170
16                   91               220     20        60      40      20          261     120
16                   100              220     30        50      40      30          221     150
16                   100              231     30        60      40      30          230     120
16                   100              220     31        50      50      31          220     110
AVG 16                                224.2   28.2      54      44      28.2        234.6
AVG                                   225.95  28.55     52.55   40.05   28.55       232.85

Chart 10.2 - Processor Ticks Per Activity

Figure 10.7 - Time Consumption for Image Compression (DCT 37%, Quantization 5%, Encoding 9%, Decoding 7%, Dequantization 5%, IDCT 37%)

The working part of our project ends with the C implementation of the entire design. Granted, we have designed the rest but are far from finishing the implementation.

Individual Contributions

John Hill

Researched legacy components from previous project groups and determined their relevance to our project. Built serial, camera, and stand-alone FPGA hardware (for two FPGAs) and worked on modifications to old designs to fit our project. Designed all modules for download onto the FPGA with the help of my other group members. Built, modified, and ran test cases on the hardware individually to verify correctness. Integrated FPGA program with hardware and built the final schematics in the Xilinx Foundation Series.
Worked with the new FPGA to attempt to get it integrated into our project. Wrote the outline for, and compiled, the proposal, midterm, and final reports and presentations, and contributed my part to each biweekly report.

David Oltmanns

Researched the DCT heavily and looked into the other sub-systems. Contributed to all documentation associated with the project. Worked jointly with Delayne to get the compression, decompression, and interface all working and integrated in C. Again in a joint effort with Delayne, translated the C code for the DCT and quantization into Verilog. This is what we wanted to download to the FPGA, since these were the computationally intensive parts of the project.

Delayne Vaughn

Mainly responsible for testing and debugging the C/Verilog code. Spent most of his time finding and fixing the problems mentioned in the challenges section. Also collaborated with David to write several of the functions in compress.c. Divided the C code from one monolithic file into the six more manageable files which comprise the final program. Assisted David in the design of the Verilog DCT module.

References

N. Ahmed, T. Natarajan, and K.R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, pp. 90-93, Jan. 1974.

K. Aldrich, D. Brandenberger, C. Chilek, and B. Raymond, "Sign Language Acquisition and Recognition System," www.cs.tamu.edu/course-info/cpsc483/common/99b/g3/Final.htm (Sept. 20, 1999)

M. Berger, J. Curtin, T. Griffin, A. King, M. Nordfelt, and J. Whitted, "Portable Digital Compression/Decompression System," www.cs.tamu.edu/course-info/cpsc483/common/99a/g5/g5.html (Sept. 20, 1999)

J. Berglund, R. Cuaycong, W. Day, A. Fikes, and K. Shah, "Autonomous Tracking Unit," www.cs.tamu.edu/course-info/cpsc483/common/99a/g1/g1.html (Sept. 20, 1999)

N.I. Cho and S.U. Lee, "Fast algorithm and implementation of 2-D discrete cosine transform," IEEE Trans. CAS, Mar. 1991, pp. 297-305.

N.I. Cho, I.D. Yun, and S.U. Lee, "On the regular structure for the fast 2D DCT algorithm," IEEE Trans. CAS, Apr. 1993, pp. 259-266.

S.C. Chan and K.L. Ho, "A new 2D fast cosine transform algorithm," IEEE Trans. SP, Feb. 1991, pp. 481-485.

H.S. Hou, "A fast recursive algorithm for computing the discrete cosine transform," IEEE Trans. ASSP, Oct. 1987, pp. 1455-1461.

C.W. Kok, "Fast Algorithm for Computing 2D Discrete Cosine Transform," Unpublished article, pp. 1-4.

R. Mahapatra, A. Kumar, and B. Chatterji, "Performance Analysis of 2-D Inverse Fast Cosine Transform Employing Multiprocessors," Article, pp. 1-31.

Cvetkovic, Popovic, "New fast recursive algorithms for the computation of discrete cosine and sine transforms," IEEE Trans., Aug. 1992, pp. 2083-2086.

http://dmsun4.bath.ac.uk/dcts/fastdct.html (location for DCT info, 10/2/99)

http://bbs.galilei.com/libs/tools.htm (location for Huffman encoding info, 10/2/99)

Previous Project Groups

PDACS - Spring 1999
http://www.cs.tamu.edu/course-info/cpsc483/common/99a/g5/g5.html

Sign Language Acquisition and Recognition System - Summer 1999
http://www.cs.tamu.edu/course-info/cpsc483/common/99b/g3/Final.htm

Appendices Index

Proposal
Proposal presentation
Midterm report
Midterm report presentation
Biweekly Report 1
Biweekly Report 2
Biweekly Report 3
HW DATA SHEETS
Dct.h
Dct.c
Compress.c
Compress.h
Huffman.h
Huffman.c
Fdct.v
Ifdct.v
Serial control code from PC
Serial control on FPGA
Write into wires and call the *.v modules code