HW/SW Implementation of JPEG Decoder
ARINDAM GOSWAMI
ERIC HUNEKE
MERT USTUN
ADVANCED EMBEDDED SYSTEMS ARCHITECTURE
SPRING 2011

Division of Labor
- Software
  - Profiling - Arindam/Eric
  - Timing analysis - Arindam/Eric
  - Interface to hardware - Arindam
  - Test data for hardware - Eric
- Hardware - Mert
  - C to Verilog Conversion
  - Scheduling & Resource Allocation on FPGA
  - Bus Communication Interface

Outline
- What is JPEG?
- Project Description
- JPEG Algorithm
- Profile Data
- Software Design
- Hardware Design
- Results
- Conclusion

What is JPEG?
- Image codec released by the Joint Photographic Experts Group in 1992
  - Joint committee between the ISO/IEC JTC1 and ITU-T standards committees
- Informally used to describe the file format that JPEG-encoded images are packed in
  - The file format specified in the original standard, JPEG Interchange Format (JIF), is rarely used, however
  - Exif or JFIF, both based on JIF, are commonly used

What is JPEG? (cont.)
- Optimized for realistic images and photographs
  - Color transitions should be smooth for best results
- Lossy compression, which can be tuned to trade image quality against file size
  - Up to 20:1 without visible loss in quality for appropriate images
  - Better ratios than other formats such as GIF, but slower to compress and decompress
- Has a lossless mode, but it is not widely used

Project Description
- Selected an existing software JPEG implementation that we could modify to increase performance
- Criteria
  - Small enough to be easily understood and modified
  - Reasonably fast, but not optimized

Project Description (cont.)
- The most common JPEG implementation is libjpeg, from the Independent JPEG Group
  - Fast, but hard to modify due to its complexity
- Various other open source implementations
  - Tiny Jpeg Decoder
  - jpeg-compressor

Project Description (cont.)
- We ended up choosing NanoJPEG, written by Martin Fiedler
  - Reasonably fast, but not optimized
  - Very small code size (< 1000 lines) in a single file
  - Easy to understand
- I/O
  - Decompresses grayscale or YCbCr images
  - Outputs grayscale or RGB raw images
- Other details
  - Written in C
  - No floating point

JPEG Algorithm
- Step 1
  - Convert the image to the YCbCr color space (typically from RGB)
    - Y for brightness
    - Cb and Cr for blue and red color components
  - The human eye is less sensitive to color changes than it is to brightness changes
    - JPEG takes advantage of this

JPEG Algorithm (cont.)
- Step 2
  - Downsample the color data (CbCr) by averaging pixels together horizontally and vertically
    - Factor of two horizontally (along rows)
    - Factor of one or two vertically (along columns)
  - Data can thus be reduced by 1/2 or 1/3
  - Imperceptible loss in quality

JPEG Algorithm (cont.)
- Step 3
  - For each component, split the pixel data into 8x8 blocks
  - Run each block through a discrete cosine transform (DCT)
  - End up with a matrix containing one DC value and 63 AC components

JPEG Algorithm (cont.)
- Step 4
  - Divide each cell of the matrix by the corresponding value in a quantization matrix, then round to the nearest integer
  - The values in the quantization matrix are customizable
  - The larger the values, the more cells are reduced to zero, and hence lost

JPEG Algorithm (cont.)
- Step 5
  - Take the reduced blocks and perform Huffman encoding (or arithmetic encoding) to eliminate redundant values
  - Lossless compression
- Step 6
  - Wrap the data in a standard file format, along with compression data including the quantization and Huffman tables

JPEG Algorithm (cont.)
- Decoding is simply the reverse of the encoding process
  - Get the reduced matrices back
  - Multiply them by the quantization matrix
  - Run an inverse DCT (IDCT)
  - Upsample
  - Convert to RGB

Profile Data
- Profiled NanoJPEG on a sample image with the armsd simulator
- 55.10% of total time spent upsampling and converting the image to RGB
  - Logically separate from the decode phase
- 38.34% of total time spent decoding the 8x8 blocks
  - So really 85.39% of the time not spent converting/upsampling
- Row and column IDCTs were about half of the block decode time
  - Our main focus for speedup, since they took about 42% of decode time and were an obvious candidate for FPGA implementation

Software Design
- Block decoding code
  - Row and column IDCT calls

Software Design
- Row IDCT
- Column IDCT

Software Design
- Interface
  - Write 8x8 integers to FPGA addresses D3000100-D30001FF
  - Read 8x8 integers from D3000200-D30002FF (output of row IDCT)
  - Read 8x8 bytes from D3000300-D300033F (output of column IDCT)
- Code
  - Replace calls to the IDCT functions with reads/writes to the FPGA addresses

Hardware Design - Architecture
Overlapping steps are shown as FPGA activity | ARM activity.
1. ARM writes row 0
2. Row IDCT: row 0 | ARM writes row 1
3. …
4. Row IDCT: row 7 | ARM reads row 0
5. Col IDCT: col 0 - 7 | ARM reads rest of the block
6. ARM reads colIDCT results
[Block diagram; labeled components: AMBA BUS, BUS COMM. IF, 8x8x32b BLOCK Register File, 8x8x8b COL_OUT Register File, IDCT CORE (ROW IDCT, COL IDCT)]

Hardware Design - Optimizations
- Register files are used instead of RAMs to allow random access to any word in the block matrix
- Arithmetic operations were distributed over multiple stages to share resources and therefore reduce area
- Column IDCT and row IDCT have a lot of operations in common
  - Use a single datapath for both = Core IDCT

Hardware Design – Core IDCT
- Row IDCT
- Column IDCT

Hardware Design – Optimizations (2)
- The hardware speed is limited by the ARM-FPGA bus transactions (block transfers)
- Optimize the bus state machine:
  - Started with the 6-state bus machine of Lab 2
  - Reduced it to only 3 states!
- Total number of FPGA cycles to process one 8x8 block:
  - 3 x (64 writes + (64+16) reads) = 432 cycles
  - 432 cycles for 8 row and 8 column IDCTs

Results
- Hardware produces correct outputs in simulation
- Integrated system does not yet match simulation
- Communication overhead between ARM and FPGA is the major bottleneck
- Expected speed-up:
  - ARM: 8 x 60 + 8 x 120 = 1440 ARM cycles (optimistic approximation)
  - FPGA: 3 x (64 writes + (64+16) reads) = 432 FPGA cycles

Conclusion
- Work completed
  - Parallelized IDCT routines for each block decode in FPGA
- Work to be completed
  - Get the interface working
- What we would have done differently
  - Use DMA to reduce communication overhead even more
  - Parallelize ARM and FPGA block processing
  - Additional speed-up possible by moving njConvert (upsampling & color conversion) into the FPGA

References
- Joint Photographic Experts Group: http://www.jpeg.org/jpeg/index.html
- Introduction to JPEG: http://www.faqs.org/faqs/compression-faq/part2/
- NanoJPEG: http://keyj.s2000.ws/?p=137

Questions?
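
Backup: the software interface from the Software Design slides (write 8x8 integers at offset 0x100, read the row IDCT output at 0x200, read the column IDCT output bytes at 0x300) could look roughly like the C sketch below. This is an illustration under assumptions, not the project's actual code: the names `idct_block_fpga` and `idct_bus_cycles` are hypothetical, the byte order of the packed column output is assumed, and no handshake or wait for the hardware is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets inside the FPGA window. The slides give the absolute
   addresses D3000100, D3000200 and D3000300; the base pointer is passed
   in so the routine can also be exercised against a plain buffer. */
#define IDCT_IN_OFFSET      0x100u /* 64 x 32-bit coefficients in      */
#define IDCT_ROW_OUT_OFFSET 0x200u /* 64 x 32-bit row IDCT results out */
#define IDCT_COL_OUT_OFFSET 0x300u /* 64 x 8-bit column IDCT results   */

/* Hypothetical helper (not part of NanoJPEG): push one 8x8 block through
   the FPGA window and read both result areas back. */
void idct_block_fpga(volatile uint8_t *base, const int32_t coeffs[64],
                     int32_t row_out[64], uint8_t pixels[64])
{
    volatile int32_t *in;
    volatile const int32_t *row;
    volatile const uint32_t *col;
    size_t i, k;

    assert(base != NULL);
    in  = (volatile int32_t *)(base + IDCT_IN_OFFSET);
    row = (volatile const int32_t *)(base + IDCT_ROW_OUT_OFFSET);
    col = (volatile const uint32_t *)(base + IDCT_COL_OUT_OFFSET);

    for (i = 0; i < 64; i++)       /* 64 word writes */
        in[i] = coeffs[i];
    for (i = 0; i < 64; i++)       /* 64 word reads of the row IDCT output */
        row_out[i] = row[i];
    for (i = 0; i < 16; i++) {     /* 16 word reads, four pixels per word */
        uint32_t word = col[i];
        for (k = 0; k < 4; k++)    /* ASSUMPTION: first pixel in the low byte */
            pixels[4 * i + k] = (uint8_t)(word >> (8 * k));
    }
}

/* Bus cycles per block for a bus state machine that spends the given
   number of states on each transfer: 64 writes + (64 + 16) reads. */
unsigned idct_bus_cycles(unsigned states_per_transfer)
{
    return states_per_transfer * (64u + (64u + 16u));
}
```

On the target, the base argument would presumably be `(volatile uint8_t *)0xD3000000`, inferred from the addresses on the slide. With 3 states per transfer, `idct_bus_cycles` reproduces the 432-cycle figure; the original 6-state machine would need twice as many.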