Image Compression
with 2-D Discrete Cosine Transforms
Final Report – 12/09/99
John Hill
David Oltmanns
Delayne Vaughn
Acknowledgements
Abstract
Chapter 1: Introduction
    Project Objectives
Chapter 2: Background on DCT
Chapter 3: Algorithm Choices
Chapter 4: Algorithm Solutions
    System Description
    DCT Explanation
    Quantization Factor
    Huffman Coding Explanation
    Breakdown of C Code
    Verilog Code
Chapter 5: System Components
    FPGA Internal Layout
    DCT/IDCT Blocks
    DCT Module Pin Descriptions
    Main Control Module
    Main Control Module Descriptions
    Serial Module
    Byte-Stream-to-Bus Control Module
    Bit Stream to Bus Control Module Pin Descriptions
    Serial Hardware
    Hardware Detail
    Final FPGA
    Developmental Components
Chapter 6: Development Tools
    Hardware
    Software
Chapter 7: Possible Improvements
    Serial Alternatives
    Memory and Camera Considerations
    PC User Interface
Chapter 8: Challenges & Solutions
    Software Algorithm
    Verilog Algorithm
    The Camera Hardware
    Serial Considerations
    FPGA(s)
Chapter 9: Time Line
Chapter 10: Results and Discussions
    Software Results
Individual Contributions
    John Hill
    David Oltmanns
    Delayne Vaughn
References
Appendices Index
Acknowledgements
Foremost we’d like to thank Dr. Rabi N. Mahapatra. Dr. Mahapatra helped us to
find an intriguing and extremely challenging project and has continuously aided
our development. He provided much of our original research but also served as a
great resource along the way. Also, our teaching assistant Nan Ni has helped us
by supplying parts and, in many cases, technical data sheets so that we could build
the hardware portions of our project. Finally, we’d like to mention Trey Griffin.
Trey shared his time and know-how to help us complete the serial interface for
our system. Many of the components and even test benches were borrowed from
him and modified to suit our needs. Without these people, we could not have
taken even the first step towards our goal.
Abstract
Our project is a compression/decompression system utilizing an FPGA interfaced
with a Connectix QuickCam and a PC. The system will allow an image to be
captured by the camera, then compressed and decompressed, with the FPGA
handling some of the more time-consuming tasks. Finally, the resulting image will
be displayed.
The FPGA will be outfitted with 2-D DCT logic and will also incorporate a serial
interface, camera interface, and memory control module. Our goal is for the
system to increase the speed at which an image can be compressed and
decompressed using these transforms. We expect that the FPGA will allow us to
do so, as it should be significantly faster with regard to digital signal processing.
Chapter 1: Introduction
Image compression has been of interest to the computing community for quite
some time. Because bandwidth is limited, there is a strong need to
compress data, especially images. The two factors of highest priority are
speed and compression ratio.
The aspect we have chosen to attempt to optimize is the speed. By porting the
computationally intensive portions of the image compression process to an FPGA,
significant speed gains should be possible. By choosing an efficient algorithm for
the computation of the Discrete Cosine Transform (DCT), even greater speed
gains should be attainable. The implications of this are myriad as the speed gains
could allow for the use of image compression for real-time applications. Our
original project objectives are listed below.
Project Objectives
Our design will use a Connectix QuickCam and PC interfaced to a Xilinx FPGA.
We hope to achieve the following goals:
- Capture the image using the QuickCam
- Compress the image using a 2-D Discrete Cosine Transform (DCT) on a Xilinx FPGA
- Decompress the image using the Inverse DCT (IDCT)
- Display the decompressed image on a PC using a serial port as the interface
In addition to these objectives, the project was first implemented entirely in C to
provide accurate simulations. Once satisfactory software simulation was achieved,
it was possible to begin the process of designing the hardware.
Chapter 2: Background on DCT
It has long been acknowledged that the Karhunen-Loève Transform (KLT) is
optimal for signal compression, but it is not feasible to implement. Introduced in
1974 by Ahmed, Natarajan, and Rao, the Discrete Cosine Transform (DCT) presents
a viable alternative to the KLT. Its authors demonstrated the close approximation
that the DCT provides to the KLT. The two-dimensional DCT for a square N x N
matrix is defined in Figure 2.1.
Figure 2.1-DCT Definition
c(i,j) is given by c(0,j) = 1/N, c(i,0) = 1/N, and c(i,j) = 2/N for both i and
j ≠ 0. The input matrix is s, and t is the output matrix.
Like the Fourier transform, the DCT maps the signal to the frequency domain. In
fact, the DCT is often computed indirectly by first computing a Fast Fourier
Transform (FFT). The output matrix, t, represents the different frequency
components of the image. The upper left-hand corner of the DCT matrix provides
the coefficients for the low frequency components while the lower right-hand
corresponds to high frequencies.
Once the DCT has been computed, the next step in compression is quantization.
This is basically discarding less-important information. The human eye is much
less responsive to very high-frequency components, so a quantization matrix is
derived which rounds the entries in the DCT matrix giving more attention to the
lower frequencies while often virtually disregarding the higher frequencies.
Quantization is the part of the process that actually allows for compression. The
quantization matrix can be altered to create an acceptable balance between image
quality and compression ratio. Once quantization has occurred, the data are
encoded to a bit-stream in which form they are stored or transported.
Image decompression follows the opposite flow. First the bit-stream is decoded,
then it is dequantized. Finally, the inverse DCT (IDCT) is used to reconstruct an
approximation of the original signal. The IDCT is defined in Figure 2.2.
Figure 2.2-Inverse DCT
All variables are defined in Figure 2.1
Being very computationally intensive and thus somewhat cumbersome, the DCT
does not lend itself to real-time applications when implemented in software. To
combat this speed lag, we propose to implement the DCT in hardware.
Through the years, a variety of algorithms have been presented to quickly
compute the DCT. We have chosen to implement the recursive algorithm
presented by Cvetkovic and Popovic. This algorithm has several benefits.
First, it is spatially efficient, requiring fewer adders and multipliers than direct
row-column approaches. Second, it is a direct computation of the DCT, i.e. it does not
require the use of the FFT or some other transform. Finally, it provides excellent
speed gains, which, after all, is the reason for porting the transform to hardware in
the first place. A signal flow graph for this algorithm follows in Figure 2.3, and
Figure 2.4 shows how we have extended it for the two-dimensional case.
[Flow graph: inputs x(0)..x(7), outputs X(0)..X(7); multipliers C1/4, C1/8, C5/8, C1/16, C5/16, C9/16, C13/16; sections bitrev, scramble, fwd_butterflies, fwd_sums]
Figure 2.3. Signal flow graph for 1D-DCT.
Circles indicate addition, squares subtraction, and arrows multiplication.
Ca/b = (2cos(aπ/b))^-1. Named sections indicate routines in C code.
[Diagram: the 64 inputs x(0)..x(63) pass through eight 1-D FCT blocks for the rows and eight 1-D FCT blocks for the columns; each of the 64 outputs X(0)..X(63) is multiplied by 1/4]
Figure 2.4 – 2-D DCT Signal Flow
Arrows indicate multiplication.
Chapter 3: Algorithm Choices
In developing our project, it became necessary to choose algorithms for the actual
implementation of the various stages of the code. While some of these issues were
discussed in earlier sections, we will once again review those decisions along with
some other choices that have not previously been discussed.
First, it was necessary to choose an algorithm to compute the DCT. Many papers
have been written about this, and we read several of them. Because of its
simplicity and its spatial and temporal efficiency, we chose the algorithm
presented by Cvetkovic and Popovic for the one-dimensional DCT. For
simplicity, we chose to compute the two-dimensional DCT using a
row-column computation of the one-dimensional DCT. This provided a
good combination of speed, simplicity, and size. The signal flow-graphs for these
algorithms are shown above in Figures 2.3 and 2.4.
A variety of encoding schemes would have served well, but we
chose to use Huffman encoding. Several factors influenced this decision. First,
this is the encoding scheme most often used in the Joint Photographic Experts
Group (JPEG) image format, the industry standard for natural image
compression. It is also simpler than arithmetic encoding, which is sometimes used
for JPEG compression, and the technology is not proprietary, unlike
several arithmetic encoding algorithms. These were the primary
considerations when choosing an encoding algorithm.
The remaining portions of the code did not require the development of a “formal”
algorithm.
Chapter 4: Algorithm Solutions
The purpose of this project is to present a method of compressing images with an
FPGA, combining 2-Dimensional Discrete Cosine Transforms (DCT) and
Huffman encoding. A camera will take an 8-bit grayscale image and send it to a
PC, where it is stored as a bitmap (BMP) file. In the computer, header
information is removed from the BMP file; what remains
is the luminance information, one byte per pixel. This is broken up into 8-byte
by 8-byte blocks and sent to the FPGA for compression/decompression. Since
this is the computationally intensive part, the FPGA should allow for quicker
processing. See Figure 4.1.
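Breaking the luminance data into 8 by 8 blocks can be sketched as follows. This is a minimal illustration, assuming a row-major buffer with one byte per pixel and image dimensions that are multiples of 8; the function name get_block is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 8

/* Copy the 8x8 block at block coordinates (bx, by) out of a row-major
   luminance buffer, one byte per pixel. Assumes width and height are
   multiples of 8 (an assumption of this sketch). */
void get_block(const unsigned char *image, size_t width,
               size_t bx, size_t by, unsigned char block[BLOCK * BLOCK])
{
    assert(image != NULL && block != NULL);
    for (size_t row = 0; row < BLOCK; row++)
        for (size_t col = 0; col < BLOCK; col++)
            block[row * BLOCK + col] =
                image[(by * BLOCK + row) * width + (bx * BLOCK + col)];
}
```

Each call yields the 64 bytes that would be sent to the FPGA as one unit of work.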
[Block diagram: Camera, Personal Computer, Serial Hardware, FPGA]
Figure 4.1-Data Flow
System Description
An enlarged view of the FPGA from Figure 4.1 is shown below in Figure 4.2.
The FPGA is broken up into the serial interface, control module, bus control,
DCT data bus, and the compression and decompression algorithms. The serial
interface receives the data from the computer byte by byte and puts the bytes onto
the DCT data bus until the bus control tells the control module that the bus is full
and ready for compression/decompression to take place. Depending on which signal
is sent to the FPGA, the control module invokes the corresponding compression or
decompression algorithm.
[Block diagram: Serial Interface, Control Module, Bus Control, DCT Data Bus, Compression, Decompression]
Figure 4.2 - Components of FPGA
Within the compression components there are three parts: DCT, Quantizer, and
Encoder. Figure 4.3 illustrates this. The encoder/decoder is where the Huffman
encoding occurs. When the information leaves the encoder, it is in
its compressed form.
Figure 4.3 - Compression/Decompression Breakdown
DCT Explanation
The DCT is the most computationally intensive portion of the process. Figure
10.7 below shows that the DCT/IDCT takes about 75% of the processor cycles for
the entire project. The DCT is calculated for each row sequentially and then
repeated for the columns. This allows a 2-dimensional DCT to be
implemented using just a 1-dimensional DCT. The two-dimensional DCT for a
square N x N matrix is defined in Figure 4.4.
Figure 4.4-DCT Definition
where c(i,j) is given by c(0,j) = 1/N, c(i,0) = 1/N, and c(i,j) = 2/N for both i and j
≠ 0. The input matrix is s, and t is the output matrix. The output matrix provides
the values for the low frequencies in the upper left-hand corner while the lower
right-hand corner contains the high frequencies. The human eye is more
responsive to the low frequencies, and fortunately these tend to retain higher
values. This causes more of the important information to be retained during
quantization while the high frequencies often go to zero.
Quantization Factor
Quantization is the part that actually allows for compression. The quantization
factor can be altered to create an acceptable balance between image quality and
compression ratio. Each coefficient in the DCT matrix is divided by the quantization
factor, thus producing fewer distinct values for the Huffman coding to deal
with.
Huffman Coding Explanation
The idea behind Huffman coding is simply to use shorter bit patterns for more
common numbers, and longer bit patterns for less common numbers. The first
step in Huffman encoding is to find the frequency of the numbers involved and
then to create an encoding tree with this information. In the encoding tree each
distinct value is a leaf, and the path from the root to that leaf determines its bit
pattern. The tree is used to compress the information, producing a
smaller bit stream that represents the original bit stream. This, along with the
encoding tree, is stored in the compressed file.
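The tree construction can be illustrated with a short C sketch that computes only the code length of each byte value. It is not the huffman.c code used in the project: it picks the two lightest nodes with a simple scan instead of a heap, and the name huffman_lengths is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SYMBOLS 256

/* Compute a Huffman code length for every byte value from its frequency.
   Leaves occupy indices 0..255; internal nodes are appended after them. */
void huffman_lengths(const unsigned long freq[SYMBOLS], int length[SYMBOLS])
{
    unsigned long weight[2 * SYMBOLS];
    int parent[2 * SYMBOLS];
    int alive[2 * SYMBOLS];
    int nodes = SYMBOLS, remaining = 0;

    assert(freq != NULL && length != NULL);
    for (int i = 0; i < SYMBOLS; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        alive[i] = freq[i] > 0;
        remaining += alive[i];
        length[i] = 0;
    }
    while (remaining > 1) {
        int a = -1, b = -1;          /* the two lightest live nodes */
        for (int i = 0; i < nodes; i++) {
            if (!alive[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        weight[nodes] = weight[a] + weight[b];   /* merge into a new parent */
        parent[nodes] = -1;
        alive[nodes] = 1;
        parent[a] = parent[b] = nodes;
        alive[a] = alive[b] = 0;
        nodes++;
        remaining--;
    }
    for (int i = 0; i < SYMBOLS; i++) {
        if (freq[i] == 0) continue;
        for (int n = i; parent[n] >= 0; n = parent[n])
            length[i]++;             /* depth of the leaf = code length */
        if (length[i] == 0)
            length[i] = 1;           /* lone symbol still needs one bit */
    }
}
```

For frequencies a=5, b=2, c=1, d=1 the lengths come out as 1, 2, 3 and 3 bits, so the nine symbols take 15 bits instead of 72 at one byte each.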
Breakdown of C code
We broke the software into three main programs: compress.c, DCT.c, and
huffman.c. Each file had its own header file. DCT.c contains the procedures
that handle the DCT. huffman.c contains the Huffman
encoding routines, which actually do the compression. The compress.c file
integrates the other programs. The code in the dct.c and huffman.c files was found
online and is cited in our reference section. The modifications to this code were minimal.
In the compress.c file, first, the size of the header and the height and width of the
image are obtained from the image file. Second, the pixel information is
extracted from the BMP file. Then, this information is divided into 8x8 blocks
and processed by the DCT. Next, the overall matrix is quantized. Finally, the
encoding routine is called.
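The first step, reading the header size and image dimensions, might look like the sketch below. It relies on the standard BMP layout (pixel-data offset at byte 10, width at byte 18, height at byte 22, all little-endian); the type BmpInfo and the function bmp_parse are hypothetical names, not the ones in compress.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

typedef struct {
    uint32_t data_offset;  /* bytes before the first pixel (headers + palette) */
    int32_t  width;
    int32_t  height;       /* positive means rows are stored bottom-up */
} BmpInfo;

/* Returns 1 on success, 0 if the buffer is too short or lacks the "BM" tag. */
int bmp_parse(const unsigned char *file, size_t len, BmpInfo *info)
{
    assert(file != NULL && info != NULL);
    if (len < 26 || file[0] != 'B' || file[1] != 'M')
        return 0;
    info->data_offset = read_le32(file + 10);
    info->width  = (int32_t)read_le32(file + 18);
    info->height = (int32_t)read_le32(file + 22);
    return 1;
}
```

For a typical 8-bit BMP with a 256-entry palette the offset is 14 + 40 + 1024 = 1078 bytes; everything from that offset onward is pixel data.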
At this point, the file is in its compressed form and we have achieved the goal of
our project. Now, the reverse process can begin. First, the file is decoded using
the huffman.c code again. Then pixel information is extracted and put into a
matrix so the dequantization can be done. Next the matrix is broken up into 8x8
blocks and the IDCT code within DCT.c is used. Now we must clean up the
values to ensure all values are between 0 and 255 inclusive. (They may fall outside
this range because quantization introduces rounding error.) Finally, the matrix
is reattached to the header and a new BMP file is created.
In the DCT.c file we have a bit reverse function, inverse/forward sums,
unscramble/scramble, inverse/forward butterflies, initialize cosine array, the
ifdct/fdct noscale functions, which call the above functions, and a main that calls
the functions in the correct order. As previously mentioned, the DCT code was
obtained from the web and is capable of handling any block size divisible by two. We
should have updated the code and specialized it for 8 by 8 blocks. This will
be discussed in greater detail in a later section.
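Of the routines listed, the bit reverse step is the easiest to show in isolation. The sketch below is the standard bit-reversal permutation, written under the assumption that the block length is a power of two; it is not copied from DCT.c, and the names bit_reverse and bitrev_permute are hypothetical.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of index (3 bits for an 8-point block). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (index & 1u);
        index >>= 1;
    }
    return r;
}

/* Reorder n = 2^bits values in place: element i swaps with bit_reverse(i). */
void bitrev_permute(double *x, unsigned bits)
{
    unsigned n;
    assert(bits > 0 && bits < 32);
    n = 1u << bits;
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse(i, bits);
        if (i < j) {                 /* swap each pair only once */
            double t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
}
```

For eight points the order 0, 1, 2, 3, 4, 5, 6, 7 becomes 0, 4, 2, 6, 1, 5, 3, 7. Fixing the block size at 8 would let this permutation be hard-wired rather than computed.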
In the huffman.c file we first count the number of occurrences of
each byte in the data. Next we build the initial heap from the frequency count.
The code tree can then be built, and this is what is used to generate the
compression code. Now the image is compressed. The reversal of this process is
much easier since the code tree is passed along with the compressed code:
all we do is decompress the image using the code tree.
Verilog code
In the beginning we thought the translation from C to Verilog was going to be
straightforward and easy. This was NOT the case. Things as simple as add,
subtract, and multiply took over 2 pages of Verilog code to implement. The first
step was to take all 8-bit numbers and extend them to 24-bit numbers, with 12 bits to
the left of the binary point and 12 bits to the right. This allows for precision to
approximately 10^-3. After the DCT was run some of the numbers might be
negative, so we had to assign a sign array to follow the matrix information. We
did this with the sign convention 1 = negative and 0 = positive. The add/subtract
tasks were long because they had to handle the signs of the values separately from
the actual values. Granted, this is just one big if-then-else statement, but it is an
example of something that was simple in C and not so simple in Verilog. To get
around the problem of initializing the cosine array we simply translated the cosine
values into the format above and saved them as constants in Verilog.
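The number format described above can be modelled in C to show why add and multiply need explicit case handling. This is a software sketch of the idea, not a translation of the Verilog tasks; the type Fixed and the functions fx_from_int, fx_add and fx_mul are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 12          /* 12 bits right of the binary point */
#define MAG_MASK  0xFFFFFFu   /* magnitude is kept within 24 bits */

typedef struct {
    uint32_t mag;  /* 12.12 fixed-point magnitude */
    int      neg;  /* 1 = negative, 0 = positive (the report's convention) */
} Fixed;

Fixed fx_from_int(int v)
{
    Fixed f;
    assert(v > -4096 && v < 4096);  /* must fit in 12 integer bits */
    f.neg = v < 0;
    f.mag = ((uint32_t)(v < 0 ? -v : v) << FRAC_BITS) & MAG_MASK;
    return f;
}

/* Sign-magnitude add: equal signs add the magnitudes; different signs
   subtract the smaller magnitude and keep the sign of the larger. */
Fixed fx_add(Fixed a, Fixed b)
{
    Fixed r;
    if (a.neg == b.neg) {
        r.mag = (a.mag + b.mag) & MAG_MASK;
        r.neg = a.neg;
    } else if (a.mag >= b.mag) {
        r.mag = a.mag - b.mag;
        r.neg = a.neg;
    } else {
        r.mag = b.mag - a.mag;
        r.neg = b.neg;
    }
    if (r.mag == 0)
        r.neg = 0;  /* avoid a negative zero */
    return r;
}

/* Multiply the magnitudes in 64 bits, then drop the extra 12 fraction bits. */
Fixed fx_mul(Fixed a, Fixed b)
{
    Fixed r;
    r.mag = (uint32_t)(((uint64_t)a.mag * b.mag) >> FRAC_BITS) & MAG_MASK;
    r.neg = (r.mag != 0) && (a.neg != b.neg);
    return r;
}
```

One least-significant bit in this format is 2^-12, about 0.00024. The three-way branch in fx_add corresponds to the if-then-else structure mentioned above, and a cosine constant would be stored as a Fixed value with a precomputed magnitude.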
In the Verilog code we made one module for the DCT and one module for the
IDCT. The modules are basically the same, so we will discuss only the DCT module
here. We created the DCT module and made tasks for all of the steps we had to
do inside. We made a task to extend the numbers to 24-bit numbers and to zero
out the sign array. Next we created tasks for add, subtract, and multiply. The
tasks for bit reverse, half-bit reverse, scramble, forward sums, and forward
butterflies are all represented in the signal flow graph below (Figure 4.5). We had
to add a special task to round all of the values to whole integers and to handle the
quantization. We chose a quantization factor of 64 for the project.
[Flow graph: inputs x(0)..x(7), outputs X(0)..X(7); multipliers C1/4, C1/8, C5/8, C1/16, C5/16, C9/16, C13/16; sections bitrev, scramble, fwd_butterflies, fwd_sums]
Figure 4.5 - Signal flow graph for 1D-DCT.
Circles indicate addition, squares subtraction, and arrows multiplication.
Ca/b = (2cos(aπ/b))^-1. Named sections indicate routines in Verilog code.
Chapter 5: System Components
FPGA internal layout
At the heart of our system is our Field Programmable Gate Array (FPGA). Our
final design utilizes a Xilinx Virtex 400HQ240 chip, which has 166 User
Input/Output pins and a 40 x 60 CLB array. We chose this package based on our
CLB needs which we estimated from a similar transform algorithm. Xilinx also
recommended this particular FPGA series for DSP (digital signal processing).
Figure 5.1 shows the FPGA itself and how it is seated on our board. The pins on
the FPGA were closer together than we expected, so we had to modify the board
to allow it to fit properly. Below is a more detailed description of the
core logical units that were designed for the Virtex chip.
FIGURE 5.1
DCT/IDCT blocks
Figure 5.2 shows the DCT module as generated from our Verilog code. The DCT
module interfaces with both the control module and the Bit-Stream to Bus Control
Module. The DCT_Start and DCT_Stop signals are connected to the control
module. The DCT_Start signal is raised by the control module after it has been
alerted that all the data on the input bus (FF[0:511]) is ready and stable. Once
the DCT_Start signal is raised, the DCT module begins compressing the input
data. Once the DCT module has completed its tasks, it asserts the DCT_Stop
signal to alert the control module that the output is ready for transfer back to the
PC via the serial module/hardware.
Figure 5.2
DCT Module Pin descriptions
INPUTS
GO
FF[0:511]
OUTPUTS
DCT_STOP
OUT_MAT[0:511]
F_SIGN[0:63]
DONE
See the appendix for the Verilog code that fully defines this and the IDCT macro.
Also, see Chapter 4 for detail on the internals of the DCT/IDCT.
Main Control Module
The main control module is a state machine implemented in Verilog. It uses
handshakes with the other modules to move the data safely and reliably through
the system.
When the system is started, the control module resets the serial
module via the SERIALRSET signal. The control module then asserts the
SERIALGO signal to allow the serial module to begin receiving data from the PC.
When MEMDONE and SERIALDONE are both asserted, the control module
knows that the data has arrived successfully and is now on the input bus of the
DCT/IDCT. It then initiates the DCT compression via the DCTGO signal. The
DCT module returns the DCTDONE signal when it has finished.
The control module then re-asserts the SERIALGO signal and the data
is transferred back to the PC by way of the bit-stream-to-bus control module.
The bit-stream-to-bus control module is the mechanism that determines
that data should be transmitted rather than received during this phase.
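The handshake sequence can be modelled as a small state machine in C. This is a software model for illustration only, not the Verilog in the appendix. The state names are hypothetical, the SERIALSTOP output is left out, and the condition for leaving the send state (SERIALDONE) is an assumption, since the text does not say how the end of transmission is detected.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ST_RESET, ST_RECEIVE, ST_TRANSFORM, ST_SEND, ST_IDLE } State;

typedef struct { int dctdone, memdone, serialdone; } Inputs;
typedef struct { int dctgo, serialgo, serialrset; } Outputs;

/* One clock step: fills in the outputs driven in the current state and
   returns the next state. */
State control_step(State s, Inputs in, Outputs *out)
{
    assert(out != NULL);
    out->dctgo = out->serialgo = out->serialrset = 0;
    switch (s) {
    case ST_RESET:        /* reset the serial module once at start-up */
        out->serialrset = 1;
        return ST_RECEIVE;
    case ST_RECEIVE:      /* let the serial module take in one block */
        out->serialgo = 1;
        return (in.memdone && in.serialdone) ? ST_TRANSFORM : ST_RECEIVE;
    case ST_TRANSFORM:    /* data is on the DCT bus: run the DCT */
        out->dctgo = 1;
        return in.dctdone ? ST_SEND : ST_TRANSFORM;
    case ST_SEND:         /* re-assert SERIALGO to return the results */
        out->serialgo = 1;
        return in.serialdone ? ST_IDLE : ST_SEND;
    default:
        return ST_IDLE;
    }
}
```

Waiting for both MEMDONE and SERIALDONE before raising DCTGO is what keeps the DCT from starting on a partly filled bus.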
Figure 5.3 shows a detail of this module and the wires that connect it to the other
modules in the FPGA.
Figure 5.3
Main Control Module descriptions
INPUTS
DCTDONE
MEMDONE
SERIALDONE
OUTPUTS
DCTGO
SERIALGO
SERIALSTOP
SERIALRSET
See also the appendix (imagecompress.zip) for the full Verilog description of
this module.
Serial Module
The serial module is largely taken from the PDACS project done in the spring of
1999. We modified it to fit into our project. It acts in one of two ways. Either it
receives data from the PC and delivers it to the bit-stream-to-bus control module,
or it receives data from the bit-stream-to-bus control module and delivers it to the
PC. The bit-stream-to-bus control module itself controls the direction of data
transfer and the two modules together receive “begin” and “stop” instructions
from the main control module.
The serial module is unlike any other module because it interfaces between the
FPGA and the serial hardware. Let’s take a closer look at this process. (Note: the
pin descriptions below may aid in this explanation.) Initially, the control module
resets the serial module. As mentioned, this action affects the hardware as well,
as the two are tightly coupled. When the SERIALSTART signal is given, the
serial module and hardware work together either to receive data on the DIN[7:0]
bus (from the PC) and deliver it to the MEMDOUT[7:0] bus, or to receive data on
the MEMDIN[7:0] bus and deliver it to the hardware (and eventually the PC) on
the DOUT[7:0] bus. The bit-stream-to-bus control module and ultimately the
main control module control the direction. The majority of the other signals are
for timing, preventing the buffers on either side of the module from being overrun
or otherwise corrupted. Figure 5.4 shows the serial module schematic. Figure
5.5 is a detail of the I/O array that interfaces the DIN[7:0] and DOUT[7:0] buses
with the serial hardware.
Figure 5.4
Figure 5.5
Serial Module Pin descriptions
INPUTS
CLK
DSRCLK
DSR
SERRSET
TXRDY
RXRDY
MEMEND
MEMBUSY
MEMDIN[7:0]
DIN[7:0]
OUTPUTS
MR
RD
WR
MEMRD
MEMWR
DOUT[7:0]
MEMDOUT[7:0]
SERIALSTART
SERIALSTOP
A[2:0]
See the appendix for our full Verilog description of this module.
Byte-Stream-to-Bus Control Module
This module is what allows the DCT and serial modules to interface smoothly.
The DCT module requires a parallel input. It needs to act on all 64 bytes of
information simultaneously. This is also how it naturally returns the compressed
information. IDCT behaves similarly but inversely.
The serial module, when receiving, supplies data to this module one byte at a time
(not a true bit stream). The bit-stream-to-bus control module essentially validates
that information and lines it up on the DCT bus. When all the data is there, it
signals the main control module to proceed. Without this module data would be
lost and never processed. On the return trip the same is true. When the data is
returned from the DCT/IDCT stage, the bit stream to bus control module breaks it
up and delivers it back to the serial module one value at a time.
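The receive direction of this module behaves like a 64-byte collector that raises a flag when the bus is full. The C model below is only a sketch of that behaviour: the type ByteToBus and its functions are hypothetical names, and the sign inputs and the return path are left out.

```c
#include <assert.h>
#include <string.h>

#define BUS_BYTES 64  /* 64 bytes = the 512-bit DCT bus */

typedef struct {
    unsigned char bus[BUS_BYTES];
    int count;  /* bytes collected so far */
    int done;   /* 1 once the whole block is on the bus */
} ByteToBus;

void b2b_reset(ByteToBus *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof *b);
}

/* Accept one byte from the serial side and return the Done flag. */
int b2b_write(ByteToBus *b, unsigned char byte)
{
    assert(b != NULL);
    if (b->done)
        return 1;                /* full: ignore further bytes until reset */
    b->bus[b->count++] = byte;
    if (b->count == BUS_BYTES)
        b->done = 1;
    return b->done;
}
```

The Done flag plays the role of the signal that tells the main control module to proceed; refusing bytes once the bus is full is one way to keep a completed block from being overwritten.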
Figure 5.6 below shows this module and the wires we use to connect it to the rest
of the components in the system.
Figure 5.6
Bit stream to Bus Control Module Pin descriptions
INPUTS
SerialDataIN[0:7]
DCTDataIN[0:511]
DCTSignIN[0:63]
Write
Read
OUTPUTS
SerialDataOUT[0:7]
DCTDataOUT[0:511]
Busy
Done
See also the appendix (imagecompress.zip) for the full Verilog description of this module.
Serial Hardware
The serial hardware was built based on the explanations of the PDACS group
from the spring. We were able to fully implement our proposed design as seen in
Figure 5.7. In this diagram the two center blocks represent the hardware that was
physically built on our board. At the left and right ends are block
representations of the computer and FPGA, respectively.
Figure 5.7
Hardware Detail
Figure 5.8 is the full detail diagram that we used in constructing the circuit. The
PDACS group constructed this diagram. Their work was very well done and easy
to follow. We used a different crystal because none were available matching the
original specifications. We then had to reconfigure the UART to match the new
speed of 9600 baud. This may be a little slow, but with a different crystal the rate
could be increased.
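The link between crystal frequency and baud rate can be sketched as follows. The report does not name the UART or the crystal frequency, so this assumes a UART that samples at 16 times the baud rate and takes a programmable divisor, as 16550-style parts do; the function names and the frequencies in the note are examples only.

```c
#include <assert.h>

/* Divisor value for a UART clocked at 16x the baud rate, rounded to the
   nearest integer. */
unsigned uart_divisor(unsigned long crystal_hz, unsigned long baud)
{
    assert(baud > 0);
    return (unsigned)((crystal_hz + 8UL * baud) / (16UL * baud));
}

/* Microseconds needed to move `bytes` bytes, assuming 10 bits on the wire
   per byte (1 start + 8 data + 1 stop, no parity). Intended for small
   byte counts such as a single block. */
unsigned long transfer_us(unsigned long bytes, unsigned long baud)
{
    assert(baud > 0);
    return bytes * 10UL * 1000000UL / baud;
}
```

With a 1.8432 MHz crystal, for example, 9600 baud needs a divisor of 12. At 9600 baud one 64-byte block takes about 67 ms in each direction, which gives a sense of how much a faster serial rate would help.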
Figure 5.8
Final FPGA
Our final design utilizes a Xilinx Virtex 400HQ240 FPGA. These are the
standard specifications as listed on the Xilinx web site. For additional details see
the full data sheet in the appendix of this document. We chose this particular chip
because of cost, I/O pins, and CLB concerns. Xilinx also recommended it for use
in DSP applications.
XCV400HQ240
CLB Array (Row x Col.):        40x60
Logic Cells:                   10,800
System Gates:                  468,252
Max. Block RAM Bits:           81,920
Max. Distributed RAM Bits:     153,600
Delay Locked Loops (DLLs):     4
I/O Standards Supported:       17
Speed Grades:                  4, 5, 6
I/O Pins:                      166
Figure 5.9 (See Page 61 of Xilinx Data Sheet)
Developmental Components
There were also several hardware and software components that were developed
during the course of the project that were not part of our final design. These
things helped us to learn about, develop, and test our system.
The first of these components is the original camera control. This involved
hardware and FPGA-programmed logic. It was our original goal to use this
methodology for controlling the camera and capturing images. We based the
design around the Basic Stamp 2 device as detailed in Chapter 22 of the online
tutorials. This was fairly simple and took only about two weeks to get up and
working. The FPGA could capture an image from the camera and then return it to
the PC by way of Windows HyperTerminal. In this way we could test the
module by itself before we built it into our larger system. However, our camera
had faulty wiring, and replacing it rendered our prior work useless. The
new camera was incompatible with our hardware, and to make things worse the
manufacturer provided no datasheets. This portion had to be scrapped and
redesigned. Figure 5.10 shows the camera hardware. The above-mentioned
Chapter 22 documentation is also included in the appendix.
Figure 5.10
During our development, we also utilized two other FPGAs that are worth
mentioning. We realized early that we would need to order a much larger FPGA
to complete our project. While we were waiting for the new FPGA to arrive we
continued working by using the 4003E demo boards and also a free-standing 4010
FPGA. The 4003E was utilized for testing the camera interface. When it was
time to start the serial module, we had to upgrade to the 4010 because the module required
more CLBs. Both chips, especially the stand-alone one, required significant amounts
of setup time, and it would have been more efficient if we had had the
large FPGA from the start. Below are photos of these two FPGAs.
Figure 5.11
Chapter 6: Development Tools
Hardware

Basic Wire-wrapping Tools: (wire-wrapper, wire, various pin connectors,
snips, wire strippers, etc.) We used these simple tools to build our circuits on
the board.

Multimeter: This we used to test voltages, resistances, and various other
signals.

Logic Probe: The logic probe was the most useful of the devices we used
to analyze the hardware. It made it easy to test the discrete values on different
wires.
Software

Microsoft Visual C++ 6.0/GNU C Compiler: These were the main
environments in which we developed the original software version. They were
helpful in debugging the software and getting it to a useful state.

Xilinx Foundation Series: This was used at length in the design of the serial
module and the various control modules. It also would have been the final
implementation tool for the whole project. While it was very helpful in design,
its simulation capabilities were lacking.

Microsoft Visual Basic: Modules for the testing of the serial and control
modules were developed in Visual Basic.

VeriLogger/VeriWell: These Verilog simulators were used extensively in
designing and testing the hardware-based DCT. They provided much simpler
simulation and verification than did the Xilinx Foundation Series.

Microsoft Visual FoxPro (HexEdit): We used the HexEdit tool within
FoxPro to view the contents of image files so that we could understand their
composition and verify the accuracy of our results.

Microsoft Photo Editor/Paint: These somewhat buggy image editors were
used to manipulate images received from the QuickCam and to view the
results of our compression/decompression software.
Chapter 7: Possible Improvements
Serial Alternatives
One improvement would be another way to transfer the information to the FPGA, in place of the
serial interface we borrowed from other groups. The serial link was a huge bottleneck for
our project. Perhaps USB or even FireWire would be better alternatives.
Regardless, serial communication is not a good choice when aiming for speed
improvements.
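To put the bottleneck in numbers, the sketch below estimates how long one uncompressed image takes to cross the serial link. It assumes the 9600 rate is in bits per second and that each byte is framed with one start bit and one stop bit (10 bits on the wire), which the report does not state. The function name serial_transfer_seconds is hypothetical.

```c
#include <assert.h>

/* Seconds needed to move `bytes` over a serial link running at `baud` bits
   per second. bits_per_byte is the framed size of one byte on the wire;
   10 corresponds to one start bit, eight data bits, and one stop bit
   (an assumed framing, not taken from the report). */
double serial_transfer_seconds(unsigned long bytes, unsigned long baud,
                               unsigned bits_per_byte) {
    assert(baud > 0);
    return (double)bytes * (double)bits_per_byte / (double)baud;
}
```

Under these assumptions, the 76,800-byte test image (320x240 at 8 bits per pixel) needs 76,800 x 10 / 9600 = 80 seconds in one direction, before any processing is done.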
Memory and Camera Considerations
Had the new QuickCam we received been grayscale and compatible with
the FPGA camera interface we designed, we could have gone directly to the
FPGA and avoided the serial communication slowdown on one side. Doing so
would also have required enough memory outside of
the FPGA to hold the entire image. Again, speed is important here.
PC User Interface
A better PC interface is another major improvement that would benefit our
project. Having a button to click on to take the picture and then one to compress
would be much easier for the end user. Because the new camera requires
proprietary software, for now we have to use the QuickCam software to take the
picture and then use Microsoft Photo Editor to save the picture as an 8-bit
grayscale image. All of these things should be done as a part of the PC interface,
but were not planned for in the original proposal. We were unaware that the new
camera would require extra development time.
Chapter 8: Challenges & Solutions
Software Algorithm
The first major challenge we faced was in the development of the software
version of the project. While it was relatively simple to get each of the major
portions—DCT, quantization, encoding, decoding, dequantization, IDCT—of the
code to work individually, the integration of the components proved to be a
formidable task. By rigorous restructuring and clarification of the code we were
able to overcome this obstacle and develop a software project that worked
satisfactorily.
Verilog Algorithm
Many challenges were faced in the Verilog design of the DCT component. We
naïvely thought that the conversion from C to Verilog would be relatively simple.
In the final analysis, however, it appears that the best course would have been to
ignore the C code when developing the Verilog code. By going back to the
original algorithm, a much simpler Verilog solution could have been provided.
As mentioned earlier, one of the first problems we tackled was that of
implementing floating point. While this was a challenge, it was by no means
insurmountable, and we were able to overcome it.
The second major issue we
faced in the Verilog design was the fact that Xilinx does not allow for arrays of
vectors, a standard Verilog construction. Because of this we had to begin viewing
our 8-by-8-byte input not as an array of 64 8-bit vectors, but as one 512-bit vector.
All arrays and vectors had to be converted to this one-dimensional perspective.
Once the dimensionality hurdle was cleared, many problems appeared which this
initial issue had masked. The most notable of these was the fact that we were not
able to use non-constant ranges for the indices of our vectors. For example we
often had a for loop with the index i. To choose the i-th byte of a given vector a,
we would select a[8 * i : 8 * i + 7]. However, this was not allowable, so we had to
take the byte one bit at a time.
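The bit-at-a-time workaround can be illustrated in C. This is a sketch of the idea rather than the project's Verilog: the 512-bit vector is modeled as an array holding one bit per element, bit 8*i is assumed to be the most significant bit of byte i (to match the a[8 * i : 8 * i + 7] notation), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VECTOR_BITS 512

/* Extract the i-th byte of a 512-bit vector one bit at a time.
   bits[k] holds bit k of the vector (0 or 1). Bit 8*i is treated as the
   most significant bit of byte i, which is an assumed ordering. */
uint8_t get_byte_bitwise(const uint8_t bits[VECTOR_BITS], size_t i) {
    assert(i < VECTOR_BITS / 8);
    uint8_t value = 0;
    for (size_t b = 0; b < 8; b++) {
        /* Shift in one bit per iteration instead of slicing a range. */
        value = (uint8_t)((value << 1) | (bits[8 * i + b] & 1u));
    }
    return value;
}

/* Inverse operation: store a byte at position i, again one bit at a time. */
void set_byte_bitwise(uint8_t bits[VECTOR_BITS], size_t i, uint8_t value) {
    assert(i < VECTOR_BITS / 8);
    for (size_t b = 0; b < 8; b++) {
        bits[8 * i + b] = (uint8_t)((value >> (7 - b)) & 1u);
    }
}
```

Each loop iteration touches a single bit whose position is computed from i, so no variable-width range is ever selected. The only range bounds that remain constant are the fixed count of eight bits per byte.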
There were many other issues with which we had to deal before we were ever
able to get the simulation of the DCT running. Once it was running, a problem appeared
with the timing between different tasks within the DCT module. This issue has not yet
been resolved.
We faced many obstacles in the development of the compression/decompression
software and hardware, but we were able to overcome most of them. Given more
time the rest of these problems could be sufficiently eliminated.
The Camera Hardware
The QuickCam is an important piece of our project. In order to compress an
image, you first have to have an image. The question then was how to capture
this image and manipulate it such that it could be compressed on the FPGA.
The solution started off fairly simple. Several previous projects used
QuickCams and there was enough existing documentation to re-create their earlier
work. This is where we started. We were issued a camera and we built
interfacing hardware utilizing a Basic Stamp 2 chip. By using HyperTerminal we
could see that the camera was indeed returning a bitmap, but there was something
odd about the resulting image. After a little diagnosis, we discovered that the
camera was broken and could not take a clear, reliable picture.
We ordered a new camera, and after working with it for only a few hours, we
found that the new camera was incompatible with the hardware. Furthermore, the
company that makes the camera didn’t have any datasheets on that camera. The
Basic Stamp 2 method for interfacing with the camera was scrapped after about 2
½ weeks of development effort.
Our final design conceded that a new FPGA interface for the camera would be far
too complex for our timeframe. Instead we turned to our serial interface as a
means for delivering the image to the FPGA. This meant that the previous serial
interface had to be redesigned to meet this new need, which is
what we ended up doing.
Serial Considerations
Our original plan for the serial interface was to simply use it as a means of
sending the final compressed image back to the PC. This unidirectional and
relatively simple plan was laid out and researched.
We found that other groups (especially PDACS) had used a serial connection
before and we were able to build from the existing base of their projects. The
physical hardware was built identically to their design with the exception of a few
resistors and a different crystal. Our choice of crystals was limited by the supply
in the design lab, but we were able to find one that delivered a transfer rate of
9600 baud. We were temporarily satisfied with this speed, as we knew that it could
always be increased later.
The internal program for the FPGA was more difficult. After running test cases
to ensure that the serial hardware was working, we redirected our efforts toward
this task. A need developed for a module to interface with the serial module. The
legacy code would never be able to provide the parallel data needed by the
DCT/IDCT algorithms. A large bus was used and a new module (bit-stream-to-bus)
was designed to handshake with the serial module and to enable this large
bus to deliver the data for compression.
This turned out to be a very successful method, as the existing serial module
needed only minor tweaking. The other module essentially emulates a memory
controller, at least from the point of view of the serial module.
FPGA(s)
We knew from the start that we would need a big FPGA to make our project
successful. As mentioned before, we ordered a Virtex 400HQ240 based
primarily on our CLB needs, but we couldn’t afford to wait until it arrived to start
developing our hardware. Instead we began work testing individual components
of the system on small chips that were available in the lab.
Two smaller FPGAs were used along the way. The first was a 4003 that was part
of the standard demo boards. We were able to successfully utilize this chip
while we were building the first camera hardware. It was a bit messy with all the
jumper wires that were needed to tie the chip into our main board. On the other
hand, it was extremely convenient because the boards already had LEDs, switches
and buttons that proved useful. This FPGA had to be retired when it came time to
start the serial interface. It simply wasn’t big enough.
Searching for more CLBs, we were given a 4010. This was a sizable step up from
the previous FPGA, but it was not incorporated into a demo board. We had to
wire the chip into a new board and connect not only the power and ground pins
but also all of the LEDs and switches we would need. This was a simple matter,
but nevertheless took some amount of time. We were eventually successful at
getting the serial hardware and FPGA logic working on this chip.
When the FPGA we ordered finally arrived, we had about 2 ½ weeks remaining.
Much of the project was completed, but it proved to be a very time-consuming
task to integrate the new FPGA. The primary problem was the size of the wire-wrapping
socket. The pins were smaller than our wire-wrapping tools, and each
pin could only hold about two wires before the wrap would touch neighboring
pins. This was one of the few obstacles that were never overcome.
Chapter 9: Time Line
Date
Week 3
9/12–9/18
Week 4
9/19-9/25
Week 5
9/26-10/2
Week 6
10/3-10/9
(Oct. 7)
Week 7
10/10-10/16
Week 8
10/17-10/23
(Oct. 21)
Week 9
10/24-10/30
Week 10
10/31-11/6
(Nov. 4)
Week 11
11/7-11/13
Week 12
11/14-11/20
(Nov. 18)
Week 13
11/21-11/27
Week 14
11/28-12/3
Week 15
12/4-12/10
Task
Research DCT and subsystems
Begin Proposal
Finalize Proposal
Present Proposal
Status
Done
Done
Done
Done
Begin software Compression
Algorithm
Begin Camera Interface
Begin Software Decompression
Algorithm
Finish Software Compression
Algorithm
Finish Camera Interface
Begin Serial Hardware
Biweekly Report 1
Begin FPGA Compression
Module
Continue Serial Interface
Testing
Continue Serial Testing
Done
Begin Control Modules
Biweekly Report 2
Begin FPGA Decompression
Module
Continue test bank cases on
Serial Module
Begin Midterm Presentation
Continue FPGA
Decompression
Mid-term Presentation
Begin Integration of Serial and
DCT modules with the Control
Module
Continue Integration of Modules
Done
Done
Done
Continue IDCT on FPGA
Begin Setup of New FPGA for
programming
Biweekly Report 3
Begin final report and
documentation
Continue working with the new
FPGA
Thanksgiving
Finalize secondary modules
Finish working with the new
FPGA
Finish IDCT
Finish Reports and
Documentation
Done
Done
Done
Notes
Lots of good references found on the
web and from other groups.
Pretty smooth – few things to
reconsider.
Choose Data flow graph
methodology.
Using Chapter 22 Tutorial
Hard to test without FPGA
Done
Done
Done
Done
Done
Tested with 4003 and HyperTerm
PDACS group is very helpful.
Done
Different Crystal used – 9600
Done
Test bank seems to be working on
4010 FPGA
Main Control Module for Data Flow
Done
DTR select and spooling test cases
working!
Done
Done
Done
Done
Done
Can’t test this until the big FPGA
arrives – but it should work.
Timing is critical. Handshaking used
for stability.
Yeah! – The new Chip is here.
Done
Done
Done
Small pins cause difficulties
Done
X
Umm - Turkey
Main Control, Stream to Bus
This proved way more time
consuming and largely impossible.
X
Done
Turned in: 12/9/99
Chapter 10: Results and Discussions
Software Results
We tested our software originally with an 8x8 image. Our final test runs were
based on a 320x240 image. Below, our results from these tests are shown. The
original picture is presented (Figure 10.1) along with the pictures corresponding
to several different quantization factors. These compressed images (Figures 10.2-10.5)
degrade in quality as the quantization factor (indicated below each image) is
increased.
Figure 10.1 - Original Picture
Figure 10.2 - Quantization Factor = 16
Figure 10.3 - Quantization Factor = 32
Figure 10.4 - Quantization Factor = 64
Figure 10.5 - Quantization Factor = 128
Figure 10.6 below illustrates the differences between the compression
percentages, and Chart 10.1 shows the actual values. There is not much difference among
them: in every case the compressed file is between about twenty and twenty-six percent
of the original size. To choose a quantization factor we therefore have to weigh
compression against image quality. The biggest effect of the quantization
factor is on image quality, as shown in Figures 10.2-10.5 above.
Quantization   Initial File   Compressed File   Compression
Factor         (bytes)        (bytes)           Percentage
128            76800          15845             20.63
64             76800          16396             21.35
32             76800          17386             22.64
16             76800          19906             25.92
Chart 10.1 – File Compression Data
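The percentages in Chart 10.1 are the compressed size expressed as a fraction of the original size. A small C sketch of that arithmetic follows. The function names are hypothetical, and the compression ratio helper is an added convenience that the report does not tabulate.

```c
#include <assert.h>

/* Compressed size as a percentage of the original size, as in Chart 10.1. */
double compression_percentage(unsigned long compressed_bytes,
                              unsigned long original_bytes) {
    assert(original_bytes > 0);
    return 100.0 * (double)compressed_bytes / (double)original_bytes;
}

/* Original size divided by compressed size (larger means smaller output). */
double compression_ratio(unsigned long compressed_bytes,
                         unsigned long original_bytes) {
    assert(compressed_bytes > 0);
    return (double)original_bytes / (double)compressed_bytes;
}
```

For example, a quantization factor of 128 gives 100 x 15845 / 76800, or about 20.63 percent, and a factor of 16 gives 100 x 19906 / 76800, or about 25.92 percent, matching the chart.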
Figure 10.6 - Compression Percentages (percentage of original file size versus quantization factor)
Since the difference in compression percentage is not that dramatic, a quantization
factor of 16 was chosen as the most beneficial. To determine how long
the entire compression/decompression takes to run, we measured the number of
processor ticks used by each activity. This is shown in Chart 10.2. The DCT and IDCT together consume
over 75% of the processing time. Thus, we determined that they were the parts that most
needed to be implemented on the FPGA. The encoding takes a little more time
than decoding because it has to build the encoding tree. Chart 10.2 below shows
the data collected, and Figure 10.7 illustrates the breakdown.
Quantization  Break up                                                           Remake
Factor        header    DCT     Quantize  Encode  Decode  Dequantize  IDCT      file
128           100       221     30        70      30      30          220       110
128           100       221     30        60      40      30          250       151
128           101       230     20        40      40      20          230       111
128           100       231     30        40      40      30          240       101
128           100       230     30        50      30      30          221       100
AVG 128                 226.6   28        52      36      28          232.2
64            100       231     20        50      60      20          230       111
64            90        221     30        50      40      30          250       121
64            90        221     20        50      40      20          280       121
64            100       240     30        51      40      30          210       110
64            90        230     30        60      30      30          221       110
AVG 64                  228.6   26        52.2    42      26          238.2
32            100       230     40        50      40      40          221       120
32            90        230     20        50      40      20          231       110
32            100       210     30        50      41      30          210       120
32            100       231     30        50      40      30          220       110
32            90        221     40        60      30      40          250       121
AVG 32                  224.4   32        52      38.2    32          226.4
16            90        230     30        50      50      30          241       170
16            91        220     20        60      40      20          261       120
16            100       220     30        50      40      30          221       150
16            100       231     30        60      40      30          230       120
16            100       220     31        50      50      31          220       110
AVG 16                  224.2   28.2      54      44      28.2        234.6
AVG                     225.95  28.55     52.55   40.05   28.55       232.85
Chart 10.2 - Processor Ticks Per Activity
Figure 10.7 - Time Consumption for Image Compression (DCT 37%, Quantization 5%, Encoding 9%, Decoding 7%, Dequantization 5%, IDCT 37%)
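The share of time taken by each stage can be recomputed from the overall averages in Chart 10.2. The C sketch below assumes that only the six processing stages are counted, leaving out the header break-up and file rebuild steps as the pie chart does. The enum, array, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* The six processing stages, in the column order of Chart 10.2. */
enum { STAGE_DCT, STAGE_QUANTIZE, STAGE_ENCODE, STAGE_DECODE,
       STAGE_DEQUANTIZE, STAGE_IDCT, STAGE_COUNT };

/* Overall average ticks per stage, copied from the final AVG row of
   Chart 10.2. */
const double avg_ticks[STAGE_COUNT] = {
    225.95, 28.55, 52.55, 40.05, 28.55, 232.85
};

/* Percentage of the summed stage time that is spent in one stage. */
double stage_share(const double ticks[], size_t n, size_t stage) {
    assert(stage < n);
    double total = 0.0;
    for (size_t k = 0; k < n; k++) {
        total += ticks[k];
    }
    assert(total > 0.0);
    return 100.0 * ticks[stage] / total;
}
```

With these averages the six stages sum to 608.5 ticks, of which the DCT and IDCT account for 458.8, or a little over 75 percent. That agrees with the text's claim. Individual shares computed this way may differ by a point from the rounded values in the pie chart.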
The working part of our project ends with the C implementation of the
entire design. Granted, we have designed the rest but are far from
finishing the implementation.
Individual Contributions
John Hill
Researched legacy components from previous project groups and determined their
relevance to our project. Built serial, camera, and stand-alone FPGA hardware
(for two FPGAs) and worked on modifications to old designs to fit our project.
Designed all modules for download onto the FPGA with the help of my other
group members. Built, modified, and ran test cases on the hardware individually
to verify correctness. Integrated FPGA program with hardware and built the final
schematics in the Xilinx Foundation Series. Worked with the new FPGA to
attempt to get it integrated into our project. Wrote outline for, and compiled the
proposal, midterm, and final reports and presentations, and contributed my part to
each biweekly report.
David Oltmanns
Researched the DCT heavily and looked into the other sub-systems. Contributed
to all documentation associated with the project. Worked jointly with Delayne to
get the compression, decompression, and interface all working and integrated in
C. Again in a joint effort with Delayne, translated the C code for the DCT and
quantization into Verilog. This is what we wanted to download to the
FPGA, since these were the computationally intensive parts of the project.
Delayne Vaughn
Mainly responsible for testing and debugging the C/Verilog code. Spent most of
his time finding and fixing the problems mentioned in the challenges section.
Also collaborated with David to write several of the functions in compress.c.
Divided the C code from one monolithic file into the six more manageable files
which comprise the final program. Assisted David in design of Verilog DCT
module.
References
N. Ahmed, T. Natarajan, and K.R. Rao, "Discrete cosine transform," IEEE
Trans. Comput., vol. C-23, pp. 90-93. Jan. 1974.
K. Aldrich, D. Brandenberger, C. Chilek, and B. Raymond, "Sign Language
Aquisition and Recognition System,"
www.cs.tamu.edu/course-info/cpsc483/common/99b/g3/Final.htm (Sept. 20,
1999)
M. Berger, J. Curtin, T. Griffin, A. King, M. Nordfelt, and J. Whitted,
"Portable Digital Compression/Decompression System,"
www.cs.tamu.edu/course-info/cpsc483/common/99a/g5/g5.html (Sept. 20, 1999)
J. Berglund, R. Cuaycong, W. Day, A. Fikes, and K. Shah, "Autonomous
Tracking Unit," www.cs.tamu.edu/course-info/cpsc483/common/99a/g1/g1.html
(Sept. 20, 1999)
N.I. Cho and S.U. Lee, "Fast algorithm and implementation of 2-D discrete
cosine transform," IEEE Trans. CAS, Mar. 1991, pp. 297-305
N.I. Cho and I.D. Yun, and S.U. Lee, "On the regular structure for the fast
2D DCT algorithm," IEEE Trans. CAS, Apr. 1993, pp.259-266
S.C. Chan and K.L. Ho, "A new 2D fast cosine transform algorithm," IEEE
Trans. SP, Feb. 1991, pp.481-485
H.S. Hou, "A fast recursive algorithm for computing the discrete cosine
transform," IEEE Trans. ASSP, Oct. 1987, pp. 1455-1461.
C.W. Kok, "Fast Algorithm for Computing 2D Discrete Cosine Transform,"
Unpublished article, pp. 1-4
R. Mahapatra, A. Kumar, and B. Chatterji, "Performance Analysis of 2-D
Inverse Fast Cosine Transform Employing Multiprocessors," Article, pp. 1-31
Cvetkovic, Popovic, "New fast recursive algorithms for the computation of
discrete cosine and sine transforms," IEEE Trans. Aug. 1992, pp 2083-2086
http://dmsun4.bath.ac.uk/dcts/fastdct.html
(location for DCT info 10/2/99)
http://bbs.galilei.com/libs/tools.htm
(location for Huffman encoding info. 10/2/99)
Previous Project Groups
- PDACS – Spring 1999
http://www.cs.tamu.edu/course-info/cpsc483/common/99a/g5/g5.html
- Sign Language Acquisition and Recognition System – Summer 1999
http://www.cs.tamu.edu/course-info/cpsc483/common/99b/g3/Final.htm
Appendices Index
Proposal
Proposal presentation
Midterm report
Midterm report presentation
Biweekly Report 1
Biweekly Report 2
Biweekly report 3
HW DATA SHEETS
Dct.h
Dct.c
Compress.c
Compress.h
Huffman.h
Huffman.c
Fdct.v
Ifdct.v
Serial control code from PC
Serial control on FPGA
Write into wires and call the *.v modules code