EE175WS-00-11

JPEG Decoder
Design Team # 11
EE 175AB: Senior Design Project
June 14, 2000
John P. Jones
Technical Advisor: Frank Vahid
Project Advisor: Barry S. Todd

Executive Summary

This design project consisted of designing and implementing a JPEG decoder system. JPEG is a commonly used digital image compression algorithm officially known as ISO Standard 10918-1. JPEG coding allows digital images to be stored in a compressed form that achieves compression ratios anywhere from 12:1 to 100:1, depending on the acceptable loss in image quality.

This JPEG decoder system is primarily intended for use in consumer electronics devices such as digital cameras. This use requires the design to have a number of features, including low price, low power, compact design, and high speed. To meet these requirements the system was designed as a custom digital logic component described in the standard hardware description language VHDL (VHSIC Hardware Description Language). This type of design is a high-level description of the system that is then translated into a digital circuit.

A number of challenges needed to be met to design and implement a JPEG decoder in hardware rather than in software running on a microprocessor. JPEG coding normally requires many floating-point calculations. Since these types of calculations are not efficiently implemented in custom hardware, they were replaced by scaled fixed-point approximations. The JPEG decoding algorithm also requires a substantial amount of memory. To reduce the memory requirements, only the core algorithm, which works on relatively small blocks of data, was implemented.

To test and demonstrate the design a Field Programmable Gate Array (FPGA) prototype board was purchased. Unfortunately, due to time constraints, the project cannot currently be downloaded to the prototype board. The JPEG decoder system does work in simulation, however, and the results of that simulation will be presented.

Acknowledgements

I would like to thank the following individuals for their assistance throughout the course of this project. Their help has greatly improved the project and my understanding of the concepts and of the design process in general.

Dr. Frank Vahid
Dr. Vahid has been very helpful in providing the resources necessary for this project. Dr. Vahid also helped me to decide upon this project and gave me ideas on where to find information relevant to it.

Barry Todd
Mr. Todd has provided a lot of guidance on the design process and project management.

Tony Givargis
Tony Givargis has helped considerably by providing me with a book on Graphics File Formats and a rough Inverse Discrete Cosine Transform unit, which I was able to modify and incorporate into the design. Tony’s PC side serial interface library for Windows was also used to communicate with the XESS board.

Jeremy Thorpe
Jeremy Thorpe helped to configure and test the board and to develop the serial communications devices on the board. In the early parts of testing Jeremy and I were able to work together to solve our common problems communicating with the development board.

Keywords / Terminology

Following is a list of some important terminology used in this report and a brief description of its usage.

JPEG (Joint Photographic Experts Group)
The Joint Photographic Experts Group is a standardization body for the development of continuous tone computer image algorithms.

ISO 10918-1
ISO 10918-1 is the formal name for the basic image compression algorithm developed by the Joint Photographic Experts Group and is commonly called JPEG. This is the algorithm that is discussed and decoded in this project.

VHDL (VHSIC (Very High Speed Integrated Circuit) Hardware Description Language)
VHDL is an IEEE standardized language for describing the function and behavior of a digital logic device.

VLSI (Very Large Scale Integration)
Very Large Scale Integration is a description of the process of designing and implementing digital systems using CMOS Integrated Circuit technology.

SOC (System On a Chip)
Today, as minimum chip feature sizes decrease, the effective area on a single IC is growing rapidly. Many designers are working to design entire systems on a single chip from components, the same way that ICs are commonly interconnected on a printed circuit board today.

Rapid Prototyping
Rapid Prototyping is the effort to reduce turn-around time for design testing by initially testing designs on a programmable logic device before they are sent to a fabrication plant for prototyping.

FPGA (Field Programmable Gate Array)
A Field Programmable Gate Array is a type of re-configurable logic device that uses an array of logic blocks that can be programmed and interconnected to one another to implement both combinational and sequential logic.

CPLD (Complex Programmable Logic Device)
A Complex Programmable Logic Device is another type of re-configurable logic device that uses a number of PLA (Programmable Logic Array) type devices interconnected on a single chip.

XILINX
XILINX is a company that builds and sells FPGAs, CPLDs and a number of software packages that allow these chips to be programmed from a number of different sources, including VHDL code.

XESS (X Engineering Software Systems)
XESS Corporation is a manufacturer of prototype boards with Xilinx FPGAs. The prototype board used for testing in this project was manufactured by XESS.

DCT (Discrete Cosine Transform)
JPEG compression of an image is based primarily on a discrete frequency transformation called the Discrete Cosine Transform. This transformation is related to the standard Discrete Fourier Transform but has specific properties that make it applicable to image processing.

Huffman Coding
Huffman Coding is an algorithm to minimize the length of messages by using a short code word to encode highly probable symbols such as the letter ‘e’ and longer code words for less probable symbols such as the letter ‘z’.

JFIF (JPEG File Interchange Format)
The JPEG File Interchange Format is a commonly used file format for storing JPEG encoded streams of data for storage and communication. JFIF files are commonly named with a .JPG file extension.

BMP (Bitmap)
A Bitmap is a device independent format for describing a graphics image as a simple array of pixel values. This method is commonly used for displays and image processing algorithms since it provides a simple Cartesian representation of the data.

Table Of Contents

EXECUTIVE SUMMARY
ACKNOWLEDGEMENTS
KEYWORDS / TERMINOLOGY
TABLE OF CONTENTS
INTRODUCTION
PROBLEM STATEMENT
SPECIFICATION
    General Description
    Performance Requirements
SOLUTION
ALTERNATE SOLUTIONS ANALYSIS
    Software Implementation
    Hardware Implementation
    Solutions Analysis Table
ENGINEERING ANALYSIS
DESIGN OVERVIEW
HUFFMAN DECODER
RUN-LENGTH DECODER
QUANTIZATION DECODER
INVERSE DISCRETE COSINE TRANSFORMATION
XESS DEVELOPMENT BOARD INTERFACING
TESTING PROCEDURE
    Simulation
    Synthesis & Hardware Testing
RESULTS
BUDGET / RESOURCES
COMPARISON TO SPECIFICATIONS
    Precision
    Chip Area
    Speed
    Power
CONCLUSIONS AND RECOMMENDATIONS
WHAT WAS LEARNED
WHAT WENT WRONG
FUTURE WORK
REFERENCE DOCUMENTS
APPENDICES
FIXED-POINT ARITHMETIC
SCHEMATICS
    JPEG Decoder Unit
    Huffman Decoder / Run-Length Decoder Unit
    Quantization Decoder Unit
    Inverse Discrete Cosine Transformation Unit
VHDL SOURCE CODE
    JPEG Library
    JPEG Decoder Unit
    Huffman / Run-Length Decoder Unit
    Quantization Decoder Unit
    Inverse Discrete Cosine Transform Unit
    Serial Input Controller
    Serial Output Controller
    Memory Input Controller
MATLAB & C++ CODE
    Data Create & Test Matlab Script
    Huffman Coding in C
    DCT Test Matlab Code
    Computation of DCT Coefficient Matrix
    Quantization Testing in Matlab
    Image DCT, Quantization, De-Quantization, IDCT Testing in Matlab
XESS XSV BOARD V1.0 MANUAL

Introduction

Since modern computer systems are required to store and transmit vast amounts of data, the field of data compression has become very important. One form of data that is commonly processed by computer systems is graphic images. To compress graphic images the Joint Photographic Experts Group (JPEG) developed a method of compressing images by reducing the precision of the high-frequency portions of images. This allows the images to be stored more compactly without sacrificing the important low-frequency portions.

This is done by first dividing the image into an array of 8-pixel by 8-pixel data blocks and performing a transformation on these data blocks that expresses each data block as a linear combination of sinusoidal components of harmonic frequencies. Then the magnitudes of the components corresponding to the higher frequency harmonics are stored with less precision than the lower frequencies. This filtering loses some of the detail of the image but retains most of the image’s information, since the human eye acts as an integrator, which reduces the contribution of high detail portions of our visual field.
After being filtered, the data is coded so that large values will be stored with larger numbers of bits than smaller values. This process allows a variable-length coding of the data for compression. Finally the data is Huffman coded so that more frequent data values are stored as shorter codes.

This algorithm for image compression is formally known as ISO 10918-1 but is commonly referred to as JPEG after the standardization body that developed it. JPEG is frequently used both on the Internet and in consumer electronics devices such as digital cameras. To decode JPEG images into uncompressed data, commonly stored as bitmaps (a device independent representation of the array of pixels that make up an image), a device called a JPEG decoder is needed. This device performs the inverse of the JPEG encoder, which encodes bitmap images as JPEG streams. Usually JPEG encoders and decoders are written as programs in a high-level language such as C or C++ and run on general-purpose microprocessors.

The purpose of this project is to design and implement a JPEG decoding system that can be incorporated into a digital camera design. This application requires the JPEG decoder to be simple, fast, low power, and easily integrated into a larger system. Such systems have been built before and are commonly used in consumer electronics devices such as digital cameras. JPEG decoder designs such as the one built for this project are also available for purchase and can be incorporated into larger designs.

Problem Statement

This project involved the design of a JPEG decoder. JPEG, the Joint Photographic Experts Group, is a standardization body that produces standards for continuous tone image coding. Perhaps the best known such standard is ISO 10918-1, which is a widely used image compression standard. The JPEG decoder designed in this project will be used to decode a JPEG File Interchange Format (JFIF) file into an uncompressed bitmap file.
JFIF is the file format that is commonly associated with JPEG and is used widely on the Internet and in consumer electronics devices to store still image data. While the JPEG standard (ISO 10918-1) defines a large class of related compression algorithms, the JPEG decoder designed for this project will focus on the simplest and most widely used such algorithm, known as baseline JPEG.

The wide use of JPEG in consumer electronics devices such as digital cameras produces a need for a fast, low-power implementation that is capable of meeting the demands of the overall system. As with any digital system, the JPEG decoder could be implemented either in software running on a general-purpose microprocessor (or, more likely, a special purpose microprocessor such as a Digital Signal Processor, DSP) or with custom hardware circuitry. The advantages and disadvantages of both software and hardware implementations will be discussed shortly. While this project will produce only the JPEG decoder, much of the design would be reusable in the design of a JPEG encoder.

This project will demonstrate the JPEG decoder using a Field Programmable Gate Array (FPGA). The FPGA will be programmed with the JPEG decoder design, will receive input JPEG images over a serial communication link with a computer system, and will send the decoded output images back to the computer for viewing.

Specification

General Description
This project requires that a system for decoding JPEG images into a standard bitmap image representation be implemented. This system must adhere to the baseline JPEG standard described by ISO 10918-1.

Performance Requirements

Precision
Keep average per pixel error to within 3% of a standard floating-point implementation of JPEG decoding.

Chip Area
Maintain a reasonable area for the implementation of the JPEG decoder. The target FPGA has a capacity of about 300 thousand gates.
Since routing of components produces a less than optimal usage of an FPGA, it is desired to keep the gate count at about 140 thousand gates. This target gate count will help ensure that the entire design will fit onto the FPGA.

Speed
It is desirable to maximize the JPEG decoder’s speed. While the speed of the circuit is again highly dependent on implementation technology, the JPEG decoder must be able to perform at speeds between that of a software implementation of JPEG decoding and that of a fully optimized, commercially available JPEG decoder design. Although speed is a crucial design point in a production design, it will not be emphasized in this FPGA prototype; the design should nevertheless be suitable for optimization towards a specific usage.

Power
Lastly, the power consumption of the JPEG decoder must be within acceptable limits. However, since the power consumed is determined by the FPGA used and not by the design itself, only simulation data will be available to predict the power consumption of the JPEG decoder when implemented as an ASIC (Application Specific Integrated Circuit).

Solution

Alternate Solutions Analysis

As previously mentioned, a digital system can be implemented either in software running on a microprocessor or with a custom designed digital logic circuit. These are the major realms of digital system design; each of these solutions has a wide variety of design decisions associated with it.

Software Implementation
Implementation of the JPEG decoding algorithm in software is very common. There are numerous open-source software implementations of JPEG in languages such as C and C++. The existence of this software and the ready availability of C compilers for most microprocessor designs simplify the software design to the point where only moderate coding would be required to modify one of these implementations for a specific use.
Since microprocessors are relatively affordable at low volumes, such an implementation would be essential for a small-volume product. In some applications that use a microprocessor it would be reasonable to bear the extra load of JPEG decoding on the microprocessor, but in many situations the microprocessor is a valued resource that would be better utilized performing other calculations.

Hardware Implementation
The repetitive, well defined nature of JPEG decoding lends itself very well to a hardware implementation, where ease of design and implementation is traded for a faster, lower-power solution that allows greater computational flexibility at the cost of design effort. In addition to these conventional arguments for a hardware implementation, the constantly expanding capacity of chips produced by the continual progress of Moore’s Law, which states that chip capacity will double every 18 months, provides another reason to consider a hardware implementation of JPEG decoding. The increased chip capacity has allowed, in recent years, the combination of a general-purpose microprocessor with custom logic units on a single chip. By placing these formerly external units on the same piece of silicon as the microprocessor, the costs of communication are greatly reduced. As the capacity of integrated circuits continues to increase, such System-On-a-Chip designs will continue to grow in popularity.

VHDL – VHSIC Hardware Description Language
VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL) is a powerful language used for the description of digital circuits. VHDL allows high-level behavioral descriptions and low-level structural descriptions to be connected and used together.
Using these multiple levels of abstraction together allows the design process to focus first on testing functionality and then on optimizing the critical areas of the design by specifying them at a more detailed level, where the designer can optimize the circuit as needed to meet specifications.

Verilog
Verilog is another standardized hardware description language. In contrast to VHDL, Verilog is more commonly used in the United States since it is more popular in industry.

Schematic Capture
Before hardware description languages such as VHDL and Verilog became popular, the standard industry practice was to use CAD programs to draw board and chip layouts from standard components and simple logic gates. This method is similar to structural architecture description in hardware description languages. Schematic capture is basically a graphical way of connecting standard and custom discrete components together to develop a digital system. While sophisticated automated routing tools are available in packages such as Protel, the placement of the components must be done by hand in most instances, which increases the complexity of the design process. Modern HDL synthesis tools take advantage of regular design structures such as Field Programmable Gate Arrays (FPGAs) to simplify the task of placement and routing of logic for a design.

Magic Layout Design Editor
Magic is a layout design editor for CMOS technology, where actual transistors are created from layers of silicon of varying impurity levels and from insulation layers. These transistors are then routed by metal layers into the actual physical structure of a microchip. The output of such a tool would then be extracted to a simulation tool such as PSpice, and accurate, albeit intolerably slow, simulations could be run on the design. Finally the masks defined by the Magic generated layout would be sent to manufacturing, where the actual manufacturing masks would be generated and the chip could be produced.

Solutions Analysis Table

Design Metrics Vs. Solutions   Software     VHDL – Behavioral Synthesis   VHDL – Structural Description   Schematic Capture / Magic
Design Ease / Design Cost      Winner       Good                          Acceptable                      Unacceptable
Speed                          Worst Case   Acceptable                    Good                            Good
Power                          Worst Case   Good                          Good                            Winner
Chip Area Efficiency           Worst Case   Acceptable                    Good                            Winner
Accuracy                       Winner       Good                          Good                            Good
Unit Cost (@ High Volume)      Poor         Good                          Good                            Good

This table clearly shows that while the low-level solutions towards the right side have better performance attributes, the high-level solutions towards the left side have better practicality attributes. The behavioral or algorithmic VHDL description method provides a strong compromise between software design, in which JPEG is usually implemented, and higher performance hardware solutions.

Engineering Analysis

The relationships between layout tools, schematic capture tools, and hardware description languages are closely analogous to the corresponding relationships in software between machine languages, assembly languages, and programming languages. Just as the current focus in software design is on reusable, machine independent algorithmic descriptions, the use of a technology independent description language such as VHDL or Verilog is strongly preferred. The effort expended on designing HDL descriptions of digital circuits can be reused, and the resulting circuits improve as logic synthesis tools become more powerful. This potential for improving designs through the advancement of synthesis tools and implementation technologies makes it possible for large libraries of digital designs to be built and reused as software libraries are today. This concept of a design described in an HDL has been termed Intellectual Property, or IP, which conveys the great potential importance of reusable designs.
For all these reasons digital system design in a High-Level Description Language is becoming the preferred method for the design of hardware, and the relative ease of this design in an HDL is comparable to software implementation in a High-Level Programming Language such as C or C++.

The proposed design for the JPEG decoder will allow the stages of the JPEG decoding pipeline to operate in parallel, which increases the operation’s concurrency and thus its speed. Due to this explicit parallelism, a hardware implementation of JPEG decoding has great potential for being faster than software implementations.

Design Overview

The JPEG decoder device was designed and implemented in VHDL at a behavioral level of abstraction, to be synthesized to logic gates. The design of a JPEG decoder in VHDL will provide a robust, hardware technology independent description. The decoder could then be downloaded to a Field Programmable Gate Array for testing and verification.

Figure 1 is a general block diagram representing the JPEG encoding process. A good understanding of the encoding process will help illuminate some of the design options of the decoding process while describing the fundamental problem at hand in greater detail.

Figure 1 [1]

As shown in Figure 1, the JPEG encoding process is performed on blocks of an image that are 8 pixels wide by 8 pixels high. Each of these JPEG data blocks is encoded in a sequence of three operations. First the image block is transformed using a 2-dimensional Forward Discrete Cosine Transform (FDCT) to determine the spectral components of the image. After the FDCT is performed, the upper left corner of the coefficient matrix contains the DC component of the block and the lower right corner contains the highest frequency components of the image.
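
To make the layout of the coefficient matrix concrete, the following is a minimal sketch in Python of a direct, unoptimized 8x8 forward DCT using the standard DCT-II definition. It is an illustration only and is not the project's VHDL or Matlab code; the function name fdct_8x8 and the test block are assumptions made for this example, and the level shift that JPEG applies before the transform is left out.

```python
import math

N = 8  # JPEG block size


def fdct_8x8(block):
    """Direct (unoptimized) 2-D forward DCT of an 8x8 block of samples.

    Hypothetical helper for illustration; result[v][u] holds the coefficient
    for vertical frequency v and horizontal frequency u.
    """
    def c(k):
        # Normalization factor: 1/sqrt(2) for the zero-frequency term.
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[v][u] = 0.25 * c(u) * c(v) * total
    return out


# A perfectly flat block has no variation, so only the DC term is non-zero.
flat = [[100] * N for _ in range(N)]
coeffs = fdct_8x8(flat)
print(round(coeffs[0][0], 6))  # upper left corner: approximately 800.0
```

For the flat block every coefficient other than the upper left one comes out as zero up to rounding error, which is the behavior the quantization stage relies on: smooth image regions concentrate their energy near the upper left corner.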
Since the human eye does not readily perceive high frequency changes, the high frequency components can be stored with less precision than the more important low frequency components. This low pass filtering of the image is performed by the next stage, which quantizes the data in exactly this manner. The Quantization table used in this step determines the exact filter characteristics and thus the compression ratio and quality of the encoded JPEG image. Finally the Coding stage transforms the 8x8 quantized block into a linear stream of values and then assigns the more frequently occurring values to shorter binary codes and the less frequently occurring values to longer binary codes, to minimize the length of the encoded message. The Coding table used in this step also influences the compression ratio, since the table must accurately match the relative frequencies of the input values to achieve good compression.

The JPEG decoding process is the inverse transformation. The encoded data is first decoded and restored to 8 pixel by 8 pixel data blocks by a preliminary Decoding stage. This stage includes both Huffman decoding and Run-Length decoding, two distinct coding schemes. Next the Quantization table specification is used to approximately regain the spectral components of the image block. While the low frequency components may be fully restored, the high frequency components may be severely distorted; however, this distortion is barely perceptible. Finally the Inverse Discrete Cosine Transform approximately recovers the original 8x8 data block. Figure 2 is a detailed block diagram showing this process, which is implemented by the JPEG decoder designed for this project.

Figure 2 [1]

Huffman Decoder

The first stage of the JPEG decoding process is the decoding of data values using a Huffman coding. The Huffman decoder was designed to read in a Huffman Table that was extracted from the JPEG data file and use that data to determine the decoding of the input.
Since the Huffman encoded data is of variable length, the decoder must make decisions one bit at a time. To do this, the decoder reads data 1 bit at a time from the input, using a separate process to buffer the input data and deliver it to the decoder as needed. The process of decoding is exactly like walking down a binary tree. At each step from the root of the Huffman tree, the decoder makes a single decision based on the next bit of input until it reaches the end of the path. At this point the decoder can determine the appropriate decoded value from the Huffman Table. The following figure shows a Huffman tree and the corresponding codes for a 2-bit message where ‘00’ is very common and ‘10’ and ‘11’ are very rare.

Root
  0: value 00
  1:
    0: value 01
    1:
      0: value 10
      1: value 11

Value   Code
00      0
01      10
10      110
11      111

Figure 3

Run-Length Decoder

The decoded 8-bit word from the Huffman Decoder represents a 4-bit run-length followed by a 4-bit data-length. The 4-bit run-length is a count of the zero data values that occurred between the last non-zero data value and the current one. The 4-bit data-length is the number of bits following this 8-bit word that make up the actual non-zero data value. A data-length of 0 signifies either the end of a data block or, if the run-length is 15, a run of 16 consecutive zero data values. Since both the Huffman Decoder and the Run-Length Decoder have to read bit by bit from the input to decode the data, I decided to merge these two distinct operations into a single VHDL entity that performs both.

The data values are then read from the input and decoded according to the following rules. If the high order bit of the data value is 0, it corresponds to a negative number and should be sign extended with ones, since we are using a signed 2’s complement numbering system. If the high order bit of the value is 1, it corresponds to a positive value and should be extended with zeros.
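The bit-serial steps described so far (the Huffman tree walk and the split of the decoded word into run-length and data-length) can be modeled in software as follows. This is only an illustrative Python sketch, not the VHDL entity from this project. The code table is the one from Figure 3; the function names build_tree, huffman_decode and split_run_size and the sample inputs are hypothetical.

```python
# Software model of the bit-serial decoding steps (illustrative only).
# Code table from Figure 3: 2-bit values and their Huffman codes.
FIGURE3_CODES = {"00": "0", "01": "10", "10": "110", "11": "111"}


def build_tree(codes):
    """Build a binary tree as nested dicts keyed by '0'/'1'; leaves are values."""
    root = {}
    for value, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = value
    return root


def huffman_decode(bits, tree):
    """Walk the tree from the root, one input bit per step, emitting a value at each leaf."""
    decoded = []
    node = tree
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            decoded.append(node)
            node = tree            # restart at the root for the next code
    return decoded


def split_run_size(word):
    """Split a decoded 8-bit word into (kind, run_length, data_length)."""
    run_length = (word >> 4) & 0xF   # high nibble: zeros preceding the value
    data_length = word & 0xF         # low nibble: number of value bits that follow
    if data_length == 0:
        if run_length == 15:
            return ("ZERO_RUN_16", 16, 0)  # sixteen consecutive zero values
        return ("END_OF_BLOCK", 0, 0)
    return ("VALUE", run_length, data_length)


tree = build_tree(FIGURE3_CODES)
print(huffman_decode("0101100111", tree))
print(split_run_size(0x23))
```

In hardware the same walk is a state machine that consumes one bit per step; the nested dictionaries here simply stand in for the stored Huffman Table.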
Then 1 is added to the negative values to give them the correct value. This creates a gap between the least negative number and the least positive number that is exactly large enough to hold the values that can be represented with fewer bits and would thus have another sign bit. Table 4 below lists the 3-bit data values and their decodings.

Value to Be Coded   Conversion If Negative   8-bit 2’s Complement   Encoded Value
-7                  -8                       11111000               000
-6                  -7                       11111001               001
-5                  -6                       11111010               010
-4                  -5                       11111011               011
 4                                           00000100               100
 5                                           00000101               101
 6                                           00000110               110
 7                                           00000111               111

Table 4 [1]

Quantization Decoder

The Quantization Decoder requests data values from its input. It multiplies each data value by the corresponding value in the Quantization table and then places the result in the appropriate location in the 8x8 JPEG data block. During JPEG encoding the frequency components of the data block are ordered so that the low frequency components come first and the higher frequency components follow. To do this, the frequency matrix is traversed in a zig-zag fashion as shown in the following diagram.

Figure 5: zig-zag ordering of the frequency matrix [1]

This data block is then passed on to the Inverse Discrete Cosine Transform unit. Since the Quantization Decoder is in the middle of the JPEG decoder pipeline and is relatively simple, I decided to make it the master device of the JPEG decoder. It requests data from the Huffman Decoder, assembles that data into a data block, and requests the Inverse Discrete Cosine Transform unit to decode the block. This allows almost all of the operations of the Quantization Decoder to be done while the Huffman Decoder, which takes a long time since it has to make a decision at every bit, is running. This increased parallelism is one of the major advantages of a hardware-based design.

Inverse Discrete Cosine Transformation

The Inverse Discrete Cosine Transform unit is definitely the most complex unit in the JPEG decoder.
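Before going into the IDCT itself, the two preceding steps (value decoding per Table 4 and de-quantization in zig-zag order) can be summarized in a short software sketch. This is a hedged Python model rather than the project's VHDL. It assumes that the data values and the Quantization table are both supplied as 64 entries in the same zig-zag stream order, and the names decode_value, zigzag_order and dequantize are hypothetical.

```python
def decode_value(bits, size):
    """Turn `size` raw bits into a signed value using the Table 4 rule."""
    if size == 0:
        return 0
    if (bits >> (size - 1)) & 1:
        return bits                      # high bit 1: positive, extend with zeros
    return (bits - (1 << size)) + 1      # high bit 0: extend with ones, then add 1


def zigzag_order(n=8):
    """Return (row, col) positions in zig-zag order, lowest frequencies first."""
    order = []
    for s in range(2 * n - 1):           # s = row + col selects one anti-diagonal
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diagonal.reverse()           # even diagonals run from bottom-left to top-right
        order.extend(diagonal)
    return order


def dequantize(values, quant_table):
    """Multiply each value by its table entry and place it in the 8x8 block.

    Assumption: both lists hold 64 entries in the same zig-zag stream order.
    """
    block = [[0] * 8 for _ in range(8)]
    for (row, col), value, q in zip(zigzag_order(), values, quant_table):
        block[row][col] = value * q
    return block


# Hypothetical input: two non-zero 3-bit values followed by zeros, flat table of 2s.
stream = [decode_value(0b101, 3), decode_value(0b000, 3)] + [0] * 62
block = dequantize(stream, [2] * 64)
print(block[0][:2])
```

Generating the zig-zag positions from the diagonals avoids typing a 64-entry lookup table by hand; a hardware implementation would more likely store that table as a constant.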
The IDCT requires many multiplications and additions of irrational values and is computationally intensive. Since a floating-point ALU is very difficult to design, very large, and very slow, floating-point arithmetic is generally avoided in custom hardware designs, except in the datapath of a microprocessor where it can be shared among many different uses. For this design I quickly realized that I would have to work around this problem. I chose to implement the IDCT using only scaled fixed-point arithmetic. After extensive Matlab testing I decided to use a 16-bit whole part followed by an 8-bit fractional extension. Thus the input is extended from 16 bits to 24 bits before the calculations are performed. After computation the output is rounded back to whole numbers and reduced to the final 8-bit output.

Here are the equations for the 2-dimensional 8x8 Discrete Cosine Transform and its inverse:

FDCT: $$S_{u,v} = \frac{1}{4} C_u C_v \sum_{x=0}^{7} \sum_{y=0}^{7} s_{x,y} \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}$$

IDCT: $$s_{x,y} = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C_u C_v S_{u,v} \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}$$

$$C_n = \begin{cases} \frac{1}{\sqrt{2}} & n = 0 \\ 1 & n \neq 0 \end{cases}$$

These equations can be rewritten in the form of linear transformations using matrices as follows:

$$S = D\, s\, D^{T}$$

$$D^{T} S\, D = D^{T} D\, s\, D^{T} D = s$$

Here D is a constant 8x8 matrix formed by the cosine values and constants above, with entries $D_{u,x} = \frac{C_u}{2} \cos\frac{(2x+1)u\pi}{16}$. This form shows that D is orthogonal, since its transpose is also its inverse. The JPEG decoder uses this linear transformation equation to compute the IDCT, since computing the product of two matrices is relatively straightforward.

XESS Development Board Interfacing

The XESS board that I decided to use for this project has turned out to be a very useful and versatile development board. The documentation and development tools provided by XESS have been very helpful and made it possible to develop the basic devices required to communicate with the board.
To communicate with the PC, Jeremy and I decided to use the onboard serial communications port and to implement VHDL entities on the board that communicate with the PC via the serial port. To do this we had to reprogram the CPLD (Complex Programmable Logic Device) on the board to route the serial pins to the FPGA, since these serial lines were not originally configured to connect to the FPGA. Once the serial pins were routed to the FPGA, we designed two very simple VHDL entities to control the receiving and transmitting of data. The Serial Input Controller listens to the serial receive line, reads off bytes of data as they arrive, and presents this data as an 8-bit output value. The Serial Output Controller waits for a signal to send an 8-bit input value and, when it receives this signal, transmits the value over the serial port to the PC. Each of these entities is based on a simple Finite State Machine that reads or writes the data one bit at a time.

Testing Procedure

Since the prototype fabrication process is prohibitively expensive for this project, the actual JPEG decoder chip could not be built and tested. Instead, testing of the JPEG decoder design proceeded in two phases: software simulation testing and hardware testing on a Field Programmable Gate Array (FPGA). This process of software simulation of VHDL code followed by FPGA testing is commonly referred to as Rapid Prototyping, since these tools allow the design to iterate much more quickly than under a conventional development cycle.

Simulation

To simulate the JPEG decoder, the program Active HDL was used. This program allows the VHDL code to be written and simulated in a single development environment and offers an advanced logic-analyzer-style waveform display for observing how signals evolve through simulation time. This allows different signals to be viewed and analyzed for errors easily.
The procedure basically consists of starting an empty design project and adding all the VHDL source code to the design. The code can then be compiled and the simulation started. The first time, the software asks which entity is the top-level entity to simulate; this can be changed later as needed. The simulator then displays the available VHDL entities, their input and output ports, and their internal signals for observation. Once the desired signals have been added to the waveform viewer, the simulation is started by specifying how long to run. The simulation runs for the specified interval and displays the waveforms during that interval. The simulation of the JPEG decoder has been successful and has yielded good results. The simulation helped considerably during the design process, as I was able to pinpoint which signals were behaving incorrectly and where in the code the problem arose.

Synthesis & Hardware Testing

The plan was to complete the synthesis of the JPEG decoder and have the device downloaded to the FPGA on the XESS board and running, so that JPEG data blocks could be sent to the board for decoding and the decoded data could be returned to the PC and assembled into a viewable image. While many of the pieces for such a setup are in place, the project ran out of time and this testing could not be performed. Preliminary work on designing the serial communications units has been completed, and additional work has been done to design a memory interface controller. Specifically, the memory interface controller can successfully read and write the on-board memory, but some timing issues remain unresolved when the device is both reading and writing. I have been using a logic analyzer to discover and fix errors, and it has proven to be a very important tool. The logic analyzer is easily connected to the expansion port pins on the XESS development board, and signals from the FPGA can be easily viewed.
Results

The implementation of the JPEG decoder was not completed. All of the pieces of the JPEG decoder have been written and simulated, and I have determined that they work. These pieces were then assembled and a full simulation of the JPEG decoder was run for a single block of a JPEG image. The second part of the project, which involves the synthesis of the JPEG decoder and testing on the XESS development board, has not been completed.

Budget / Resources

Since this project consisted of the design and testing of a JPEG decoder in simulation followed by testing of the design on a re-programmable logic device, it relied heavily on expensive software systems and prototyping devices, but the project itself incurred no direct expenses. The VHDL tools necessary to write, compile, test, and synthesize a VHDL description and the FPGA prototype hardware are very expensive, and the Technical Advisor, Frank Vahid, generously provided access to these resources. Dr. Vahid supplied the resources necessary to purchase the XESS XSV-300 Virtex Prototyping Board, which I selected for use on this project. This prototype board has all the required features, such as a gate capacity of approximately 300K gates and an easy interface with the software packages being used. The total cost of this board was $899.00. This board also has many other features that will be used by Dr. Vahid and his research group, as well as by possible future Senior Design Projects under Dr. Vahid. The board features a Xilinx FPGA of the Virtex class that is capable of implementing a design of approximately 300 thousand gates with reasonable speed.

Comparison to Specifications

The specifications required that the JPEG decoder operate not just in simulation but actually function on the board.
Also, the glue logic to repeat the JPEG decoding process for every block of an image and to interface with a standard JPEG file format (JFIF) was not completed. I will discuss these two issues separately.

First, there was not enough time for me to get the JPEG decoder to work on the board. This was due to the amount of time needed to get the JPEG decoder operational in a simulation environment. There were no major stumbling blocks except for time constraints. I believe that if I had been able to schedule the project with more time devoted to slowly integrating the working parts of the design on the board, the project would be able to operate on the board.

Second, the current JPEG decoder needs a substantial amount of external control to deliver the appropriate data. Ideally a standard JFIF file could be delivered to the decoder and a decoded image bitmap would be produced. Unfortunately, time did not permit me to add this control logic. As it stands now, the JPEG decoder design, if implemented in hardware, would be able to significantly increase the speed of decoding JPEG images when connected to a microprocessor programmed to control it and do the necessary bookkeeping. This in itself is an important feature that demonstrates how a system can be implemented as a hybrid hardware / software implementation to improve performance without sacrificing versatility.

Precision

The specification required that the average per pixel error be below 3% of the error rate introduced by a standard floating-point implementation. This design goal appears to have been met. Matlab experiments show that my approximate method of computing the IDCT will introduce approximately 0.245% error into the image. The VHDL simulation results agree with this prediction to within 0.06%. The actual error introduced in the simulation was found to be 0.227%, which is actually less than the predicted value. This data shows that the JPEG decoder is sufficiently accurate.
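A comparison of this kind can be sketched in software. The Python model below builds the constant matrix D from the IDCT definition, computes the inverse transform as the matrix product D^T S D both in floating point and in a 16.8 scaled fixed-point format, and reports the mean per-pixel difference. It is not the Matlab script or the VHDL used in this project, and it is not expected to reproduce the error figures reported above. The function names and the DC-only test block are hypothetical, and Python integers do not overflow, so a real 24-bit datapath would need additional range checks.

```python
import math

FRAC_BITS = 8                 # 16.8 format: 8 fractional bits
ONE = 1 << FRAC_BITS          # fixed-point representation of 1.0


def dct_matrix():
    """D[u][x] = (C_u / 2) * cos((2x + 1) * u * pi / 16), with C_0 = 1/sqrt(2)."""
    return [[(math.sqrt(0.5) if u == 0 else 1.0) / 2
             * math.cos((2 * x + 1) * u * math.pi / 16)
             for x in range(8)] for u in range(8)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(8)) for j in range(8)]
            for i in range(8)]


def fixed_matmul(a, b):
    """Integer matrix product; each raw product has 16 fractional bits, so shift by 8."""
    return [[sum((a[i][k] * b[k][j]) >> FRAC_BITS for k in range(8))
             for j in range(8)] for i in range(8)]


def idct_float(S):
    D = dct_matrix()
    return matmul(matmul(transpose(D), S), D)       # s = D^T S D


def idct_fixed(S):
    D = [[round(v * ONE) for v in row] for row in dct_matrix()]  # scaled constants
    Sf = [[v * ONE for v in row] for row in S]                   # extend the input
    s = fixed_matmul(fixed_matmul(transpose(D), Sf), D)
    return [[(v + ONE // 2) >> FRAC_BITS for v in row] for row in s]  # round to integers


def mean_abs_error(a, b):
    return sum(abs(a[i][j] - b[i][j]) for i in range(8) for j in range(8)) / 64


S = [[0] * 8 for _ in range(8)]
S[0][0] = 800                 # hypothetical DC-only coefficient block
print(mean_abs_error(idct_fixed(S), idct_float(S)))
```

Most of the difference in this small example comes from rounding the constants in D to 8 fractional bits, which is the same trade-off that the choice of fractional width controls in the hardware design.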
Chip Area

The specification required that the design be able to synthesize and fit on the FPGA on the development board, which has a capacity of about 300 thousand gates. Since this part of the project was not completed, this design goal has not been evaluated. I believe that the JPEG decoder is sufficiently simple to fit within this limit, but I have not been able to determine the actual number of gates required due to time restrictions.

Speed

From simulation it has been determined that decoding one 8x8 block of data takes about 3,700 clock cycles. I anticipate no problems implementing this design to run at 50 MHz, and possibly much faster at about 100 MHz. At 50 MHz, a 640 by 480 pixel image consisting of 4,800 data blocks requires about 4,800 x 3,700 = 17.8 million clock cycles and will therefore take about one third of a second. This would give a throughput of about 8 megabits per second. This speed is very acceptable, and while it is slower than a standard PC decoding images, it can keep up with less powerful microprocessors.

Power

As mentioned in the specifications, implementing this project on an FPGA does not give a good measure of the power requirements of the design, since the power used depends on the FPGA that it is running on.

Conclusions And Recommendations

This project proved to be considerably more challenging and time-consuming than originally projected. While the overall project was not completed, the decoder is basically complete and simulation shows that it works. Many of the necessary components for communication and memory storage on the prototype board were also developed. I believe that future projects building upon this one would be able to use the work I have done to complete this project or a similar one. I believe that a two-member team could expand the results of this project and deliver a system capable of downloading JPEG images.
The theoretical foundations of the JPEG algorithm, such as the Discrete Cosine Transform and Huffman coding, are good applications of the concepts learned in Digital Signal Processing and Digital Communications.

What Was Learned

The project showed that the design and implementation of complex systems in hardware is considerably different than in software. While the JPEG algorithm is relatively simple to implement in software, it does not easily translate to a hardware implementation. I also gained a lot of valuable experience in project management. It was also very instructive to see how the frequency transformations learned about throughout my studies can be applied to a seemingly simple problem such as compression. I have an increased awareness of the complexities of the JPEG algorithm. The chosen prototype board produced by XESS worked well and was well documented. I believe that the board can be useful for a number of different design projects.

What Went Wrong

The failure of this project is due primarily to time delays and the complexity of the project. Even though the project was not finished, I believe that I successfully completed a lot of work on it and that my final progress, a proper simulation of the JPEG decoder, is significant. The project turned out to be considerably more challenging than I originally anticipated, and I believe that it would have been much better if I had taken only a minimal course load during the second quarter of the project so that I could focus on the project almost exclusively.

Future Work

First of all, I believe that this project is too much for a single beginning engineer to handle alone. Further work on this project should be attempted by a group whose members are well trained in VHDL programming and experienced in the design of digital systems.
Beyond finishing this project, there are a number of related topics that would be interesting to explore:

Prototype Board – Since this project did not succeed in getting the JPEG decoder working on the prototype board, future work should start with replicating the successful simulation results on the actual board.

Huffman Table Extraction – The Huffman Table format used in the JPEG decoder is not the same as the information stored in a normal JPEG file. This could be changed by extracting the useful form of the Huffman Tree from the data stored in the file. I have done this with some simple C code, which could be used as a basis for doing it in VHDL.

Color Space Transformation – The JPEG decoder does not currently use a specific color space and generally treats the data as an array of bytes. For image applications, hardware usually uses the RGB color space, but many JPEG streams use the YCbCr color space since the Cb and Cr coordinates can be stored more efficiently [1].

‘Fast’ Cosine Transformation Realization – The Discrete Cosine Transformation is very similar to the Discrete Fourier Transform, and the efficient algorithms for computing the Discrete Fourier Transform, collectively known as the Fast Fourier Transform algorithms, can be applied to improve the efficiency of the Discrete Cosine Transformation as well. While I used a simple algorithm based on linear transformations to implement the Inverse Discrete Cosine Transformation, a Fast Cosine Transform implementation may be a better alternative.

JPEG Encoder – A JPEG encoder could also be developed as an extension of this project. I would suggest that the encoder be designed to use a pre-determined Huffman code tree and one of a few selected Quantization tables. This would decrease the complexity of the design considerably. The corresponding decoder would also be simplified, since the codes would be known without having to extract them from the data stream, which I found to be difficult and inefficient.
Also, the compression would be better since the codes would not need to be explicitly stored in the file. I believe that this type of implementation is better for a digital camera, since many features of the data set are well known and generality is not needed. (A digital camera does not need to encode images of any size, for example, but only one of a few different resolutions.)

JPEG 2000 – The Joint Photographic Experts Group has now submitted a draft for a completely new image compression standard to replace JPEG. This new standard is expected to be approved later this year and to quickly replace the current JPEG standard. The new algorithm uses more advanced Wavelet analysis methods to provide better compression, image quality, and versatility. I believe that the design of a JPEG 2000 system, perhaps using a Digital Signal Processor, would make a good project.

Reference Documents

The following is a list of the documents that I have referred to and found to be useful in the course of this project.

[1] C. W. Brown and B. J. Shepherd, Graphics File Formats: Reference and Guide. Greenwich, CT: Manning Publications Co., 1995.
[2] XSV Board V1.0 Manual, XESS Corporation, Apex, NC, 2000.
[3] Hsu et al., VHDL Modeling for Digital Design Synthesis. Norwell, MA: Kluwer Academic Publishers, 1995.
[4] Xentec, “JPEG_CODEC – X_JPEG Short Form Datasheet,” http://www.xentecinc.com/X-datasheets/x_jpeg_rev1.4.pdf (current June 11, 2000).
[5] T. Tran, “Fast Multiplierless Approximation of the DCT,” Johns Hopkins University, ECE Department, http://thanglong.ece.jhu.edu/Tran/Pub/intDCT-SPL.pdf (current June 11, 2000).
[6] Collosseum Builders Inc, “Image Library Source Code Version 3,” http://www.collosseumbuilders.com/imageformats/compressedimageformats.html (current June 11, 2000).

Appendices

Fixed-Point Arithmetic

To efficiently compute the Inverse Discrete Cosine Transform in hardware, a scaled fixed-point arithmetic system was used to approximate real numbers.
This numbering system consists of 16 bits of whole number followed by 8 bits of fractional part. This allows numbers from -32,768 to 32,767.99609375 to be stored in 24 bits as follows.

$$A = a_{15} a_{14} a_{13} a_{12} a_{11} a_{10} a_{9} a_{8} a_{7} a_{6} a_{5} a_{4} a_{3} a_{2} a_{1} a_{0} \,.\, a_{-1} a_{-2} a_{-3} a_{-4} a_{-5} a_{-6} a_{-7} a_{-8}, \quad a_i \in \{0, 1\}$$

$$A = \sum_{i=-8}^{15} a_i 2^{i} = 2^{-8} \sum_{i=0}^{23} a_{i-8}\, 2^{i} = \bar{A} \cdot 2^{-8}$$

Here $\bar{A}$ is a normal 24-bit binary number (for negative values in 2’s complement, the top bit $a_{15}$ carries the weight $-2^{15}$). Addition works as expected for this system due to linearity:

$$A + B = (\bar{A} + \bar{B}) \cdot 2^{-8}$$

However, multiplication does introduce an extra term:

$$A \cdot B = (\bar{A} \cdot \bar{B}) \cdot 2^{-8} \cdot 2^{-8}$$

To correct for this extra term, the product of two numbers must be shifted right an extra 8 bits. In principle this could be done by modifying the multiplication algorithm (the ordinary hand multiplication algorithm from grade school) to shift right one extra time per iteration, but this is not feasible when working with a pre-defined multiplication algorithm. The effect can be described simply with a decimal analogy: while 1.0 x 1.0 = 1.0, once the numbers are scaled to allow one fractional digit the computation becomes 10 x 10 = 100, which reads as 10.0 unless we include an additional shift to arrive at the expected answer of 1.0. With this scaled fixed-point system and the appropriate correction for multiplication, I was able to approximate the Inverse Discrete Cosine Transform using only standard integer arithmetic.

Schematics

The following sections show the schematics of the different components of the JPEG decoder. They primarily depict the I/O characteristics of each device and their interconnection structure. The implementation of each device is described by a VHDL process that is basically a high-level representation of a Finite State Machine with Datapath. This behavior is best analyzed by reading the VHDL source presented later.

JPEG Decoder Unit

[Schematic: JPEG Decoder Unit, with ports in_req, inp(0:7), go, value(0:7), in_rdy, rdy, rst, clk]

Here the JPEG Decoder Unit takes a stream of input data in the form of 8-bit words and produces the JPEG decoded values from the data stream.
The input data stream begins with both the Huffman Code Table and the Quantization Table. The decoder logic extracts these tables and sends them to the Huffman Decoder and the Quantization Decoder, which are serially connected along with the Inverse Discrete Cosine Transform Unit. The output of the IDCT unit is then placed one entry at a time on the 8-bit output value.

Huffman Decoder / Run-Length Decoder Unit

[Schematic: Huffman Decoder, with ports in_req, inp(0:7), value(15:0), in_rdy, go, rdy, code, rst, clk]

Here code is the Huffman Code Table extracted from the input stream, inp is an 8-bit word from the data stream, and value is the Huffman and Run-Length decoded data value that has been extracted from the data stream. The other signals are control signals.

Quantization Decoder Unit

[Schematic: Quantization Decoder, with ports in_req, in_val(15:0), outp, in_rdy, go, rdy, quan, rst, clk]

Here quan is the Quantization Table extracted from the input stream and in_val is the 16-bit word from the Huffman Decoder. Outp is the assembled 8x8 JPEG data block matrix. It consists of 64 entries, each a 16-bit value that is the product of one of the last 64 in_vals from the Huffman Decoder and the corresponding term from the Quantization Table. These are the restored DCT coefficients, which are sent to the IDCT unit to recover the actual data values. The other signals are control signals.

Inverse Discrete Cosine Transformation Unit

[Schematic: Inverse Discrete Cosine Transform, with ports X, go, O, rdy, rst, clk]

Here the IDCT Unit computes the Inverse Discrete Cosine Transformation of the input and produces it on the output. To do this the IDCT Unit has two matrix multiplication sub-units and an output-rounding sub-unit, which are described next. First the input is extended to form the internal scaled fixed-point form of the input matrix, then the two matrix multiplications by the constant DCT matrix and its inverse are performed, and then the output is rounded back to an 8-bit integer valued matrix.

Matrix Multiplication Sub-Unit

[Schematic: Matrix Multiplier, with ports inp1, inp2, outp, go, rdy, rst, clk]

Here the Matrix Multiplier computes the product of inp1 and inp2 with inp1 on the left (inp1*inp2) and delivers the result on outp.

Output Rounding Sub-Unit

[Schematic: Output Rounder, with ports inpt, go, outp, rdy, rst, clk]

Here the Output Rounder takes the input, which is an 8x8 matrix of internal scaled fixed-point values, rounds it to an integral 8x8 matrix, and outputs it.

VHDL Source Code

JPEG Library
JPEG Decoder Unit
Huffman / Run-Length Decoder Unit
Quantization Decoder Unit
Inverse Discrete Cosine Transform Unit
Serial Input Controller
Serial Output Controller
Memory Input Controller

Matlab & C++ Code

Data Create & Test Matlab Script
Huffman Coding in C
DCT Test Matlab Code
Computation of DCT Coefficient Matrix
Quantization Testing in Matlab
Image DCT, Quantization, De-Quantization, IDCT Testing in Matlab

XESS XSV Board V1.0 Manual