ARM-Optimized JPEG Decoder

HW/SW Implementation of JPEG Decoder
ARINDAM GOSWAMI, ERIC HUNEKE, MERT USTUN
ADVANCED EMBEDDED SYSTEMS ARCHITECTURE, SPRING 2011

Division of Labor
• Software
  - Profiling – Arindam/Eric
  - Timing analysis – Arindam/Eric
  - Interface to hardware – Arindam
  - Test data for hardware – Eric
• Hardware – Mert
  - C to Verilog conversion
  - Scheduling & resource allocation on FPGA
  - Bus communication interface

Outline
• What is JPEG?
• Project Description
• JPEG Algorithm
• Profile Data
• Software Design
• Hardware Design
• Results
• Conclusion

What is JPEG?
• Image codec released by the Joint Photographic Experts Group in 1992
  - Joint committee between the ISO/IEC JTC1 and ITU-T standards committees
• Informally used to describe the file format JPEG-encoded images are packed in
  - The file format specified in the original standard, JPEG Interchange Format (JIF), is rarely used
  - Exif and JFIF, both based on JIF, are commonly used instead

What is JPEG? (cont.)
• Optimized for realistic images and photographs
  - Color transitions should be smooth for best results
• Lossy compression, which can be tuned to trade quality against size
  - Up to 20:1 without visible loss in quality for appropriate images
  - Better ratios than other algorithms such as GIF, but slower to compress and decompress
  - Has a lossless mode, but it is not widely used

Project Description
• Selected an existing software JPEG implementation that we could modify to increase performance
• Criteria
  - Small enough to be easily understood and modified
  - Reasonably fast, but not optimized

Project Description (cont.)
• The most common JPEG implementation is libjpeg, from the Independent JPEG Group
  - Fast, but hard to modify due to its complexity
• Various other open source implementations
  - Tiny Jpeg Decoder
  - jpeg-compressor

Project Description (cont.)
• We ended up choosing NanoJPEG, written by Martin Fiedler
  - Reasonably fast, but not optimized
  - Very small code size (< 1000 lines) in a single file
  - Easy to understand
• I/O
  - Decompresses grayscale or YCbCr images
  - Outputs grayscale or RGB raw images
• Other details
  - Written in C
  - No floating point

JPEG Algorithm
• Step 1
  - Convert the image to the YCbCr color space (typically from RGB)
    - Y for brightness
    - Cb and Cr for the blue and red color components
• The human eye is less sensitive to color changes than it is to brightness changes
  - JPEG takes advantage of this

JPEG Algorithm (cont.)
• Step 2
  - Downsample the color data (CbCr) by averaging neighboring pixels horizontally and vertically
    - Factor of two along rows
    - Factor of one or two along columns
    - Total data can thus be reduced by 1/3 or 1/2
  - Imperceptible loss in quality

JPEG Algorithm (cont.)
• Step 3
  - For each component, split the pixel data into 8x8 blocks
  - Run each block through a discrete cosine transform (DCT)
  - End up with a matrix containing one DC value and 63 AC components

JPEG Algorithm
• Step 4
  - Divide each cell of the matrix by the corresponding value in a quantization matrix, then round to the nearest integer
  - The values in the quantization matrix are configurable
  - The larger the values, the more cells are reduced to zero, and hence lost

JPEG Algorithm (cont.)
• Step 5
  - Take the reduced blocks and perform Huffman encoding (or arithmetic encoding) to eliminate redundant values
  - Lossless compression
• Step 6
  - Wrap the data in a standard file format, along with compression data including the quantization and Huffman tables

JPEG Algorithm (cont.)
• Decoding is simply the reverse of the encoding process
  - Get the reduced matrices back
  - Multiply each one, cell by cell, with the quantization matrix
  - Run an inverse DCT (IDCT)
  - Upsample
  - Convert to RGB

Profile Data
• Profiled NanoJPEG on a sample image with the armsd simulator
• 55.10% of total time spent upsampling and converting the image to RGB
  - Logically separate from the decode phase
• 38.34% of total time spent decoding the 8x8 blocks
  - So really 85.39% of the time not spent converting/upsampling
• Row and column IDCTs were about half of the block decode time
  - Our main focus for speedup, since they took about 42% of decode time and were an obvious candidate for FPGA implementation

Software Design
[Figure: block decoding code, with the row and column IDCT calls marked]

Software Design
[Figure: Row IDCT and Column IDCT code]

Software Design
• Interface
  - Write 8x8 integers to FPGA addresses D3000100-1FF
  - Read 8x8 integers from D3000200-2FF (output of Row IDCT)
  - Read 8x8 bytes from D3000300-33F (output of Col IDCT)
• Code
  - Replace calls to the IDCT functions with reads/writes to the FPGA addresses

Hardware Design - Architecture
1. ARM writes row 0
2. Row IDCT: row 0 / ARM writes row 1
3. …
4. Row IDCT: row 7 / ARM reads row 0
5. Col IDCT: col 0 - 7 / ARM reads rest of the block
6. ARM reads colIDCT results
[Block diagram: AMBA bus, bus communication interface, 8x8x32b BLOCK register file, IDCT core (Row IDCT, Col IDCT), 8x8x8b COL_OUT register file]

Hardware Design - Optimizations
• Register files are used instead of RAMs to allow random access to any word in the block matrix
• Arithmetic operations were distributed over multiple stages to share resources and therefore reduce area
• Column IDCT and Row IDCT have a lot of common operations
  - Use only a single datapath for both = Core IDCT

Hardware Design – Core IDCT
[Figure: Row IDCT and Column IDCT]

Hardware Design – Optimizations (2)
• The hardware speed is limited by the ARM–FPGA bus transactions (block transfers)
• Optimize the bus state machine:
  - Started with the 6-state bus machine of Lab 2
  - Reduced it to only 3 states
• Total # of FPGA cycles to process one 8x8 block:
  - 3 x (64 writes + (64+16) reads) = 432 cycles
  - 432 cycles for 8 row and 8 column IDCTs

Results
• Hardware produces correct outputs in simulation
• Integrated system does not yet match simulation
• Communication overhead between ARM and FPGA is the major bottleneck
• Expected speed-up:
  - ARM: 8 x 60 + 8 x 120 = 1440 ARM cycles (optimistic approximation)
  - FPGA: 3 x (64 writes + (64+16) reads) = 432 FPGA cycles

Conclusion
• Work completed
  - Parallelized IDCT routines for each block decode in FPGA
• Work to be completed
  - Get the interface working
• What we would have done differently
  - Use DMA to reduce communication overhead even more
  - Parallelize ARM and FPGA block processing
  - Additional speed-up possible by moving njConvert (upsampling & color conversion) into the FPGA

References
• Joint Photographic Experts Group: http://www.jpeg.org/jpeg/index.html
• Introduction to JPEG: http://www.faqs.org/faqs/compression-faq/part2/
• NanoJPEG: http://keyj.s2000.ws/?p=137

Questions?
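The cycle estimates on the Results slide can be checked with a few lines of C. This is only a back-of-the-envelope sketch: the function names are hypothetical, the per-IDCT ARM costs (60 and 120 cycles) are the optimistic figures quoted on that slide, and reading the "16" as 64 output bytes fetched four per 32-bit word is an assumption. The comparison also ignores any difference between the ARM and FPGA clock rates.

```c
#include <assert.h>

/* FPGA cycles for one 8x8 block, given the number of bus-state-machine
 * states spent on each bus transaction (3 after the optimization). */
unsigned fpga_cycles_per_block(unsigned states_per_transfer)
{
    unsigned writes = 64;       /* 64 coefficients written by the ARM */
    unsigned reads  = 64 + 16;  /* 64 row-IDCT words + 16 reads of byte output
                                   (assumed: 64 bytes, four per word) */

    assert(states_per_transfer > 0);
    return states_per_transfer * (writes + reads);
}

/* Software-only cycles for one block: 8 row IDCTs plus 8 column IDCTs. */
unsigned arm_cycles_per_block(unsigned row_idct_cost, unsigned col_idct_cost)
{
    return 8 * row_idct_cost + 8 * col_idct_cost;
}
```

Calling `fpga_cycles_per_block(3)` and `arm_cycles_per_block(60, 120)` gives the 432 and 1440 cycles quoted on the slide, a ratio of roughly 3.3. With the original 6-state bus machine the same formula gives 864 cycles, which shows how directly the bus protocol drives the total.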
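The register interface described on the Software Design slide could be driven from C roughly as follows. This is a minimal sketch, not the project's actual code: the helper name `idct_block_hw`, the split of the slide's addresses into a base of 0xD3000000 plus offsets, and the absence of any ready/busy polling (the bus is assumed to stall until data is valid) are all assumptions. The output bytes are read one at a time for simplicity, which does not match the 16 reads counted in the cycle estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets derived from the address ranges on the Interface slide,
 * relative to an assumed base address of 0xD3000000. */
#define IDCT_IN_OFFSET   0x100u  /* 64 x 32-bit input coefficients */
#define IDCT_ROW_OFFSET  0x200u  /* 64 x 32-bit row-IDCT output    */
#define IDCT_COL_OFFSET  0x300u  /* 64 x 8-bit column-IDCT output  */

/* Hypothetical helper: send one 8x8 block to the FPGA and read it back.
 * `base` must be 4-byte aligned; `row_out` may be NULL if not needed. */
void idct_block_hw(volatile uint8_t *base, const int32_t in[64],
                   int32_t *row_out, uint8_t out[64])
{
    volatile int32_t *in_regs  = (volatile int32_t *)(base + IDCT_IN_OFFSET);
    volatile int32_t *row_regs = (volatile int32_t *)(base + IDCT_ROW_OFFSET);
    volatile uint8_t *col_regs = base + IDCT_COL_OFFSET;
    size_t i;

    assert(((uintptr_t)base & 3u) == 0);

    /* Write the 64 coefficients of the block. */
    for (i = 0; i < 64; i++)
        in_regs[i] = in[i];

    /* Optionally read back the intermediate row-IDCT result. */
    if (row_out != NULL)
        for (i = 0; i < 64; i++)
            row_out[i] = row_regs[i];

    /* Read the final 8-bit samples produced by the column IDCT. */
    for (i = 0; i < 64; i++)
        out[i] = col_regs[i];
}
```

On the target, `base` would be `(volatile uint8_t *)0xD3000000`. Passing the base as a parameter also lets the transfer logic be exercised on a host machine against an ordinary memory buffer.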
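The Conclusion slide names color conversion as the next candidate for the FPGA, and the decoder avoids floating point. The per-pixel arithmetic involved can be sketched in integer-only C as below. This is an illustration of the standard JFIF YCbCr-to-RGB equations in 8-bit fixed point, not a copy of njConvert: the function names are hypothetical, and the constants are the JFIF coefficients 1.402, 0.344136, 0.714136 and 1.772 scaled by 256 and rounded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a value with 8 fractional bits to an 8-bit sample, clamping to
 * 0..255. Negative inputs are caught before shifting, so the result does not
 * depend on how the compiler shifts negative numbers. */
uint8_t clip_fix8(int32_t v)
{
    if (v < 0)
        return 0;
    v >>= 8;
    return (v > 255) ? 255 : (uint8_t)v;
}

/* Convert one YCbCr pixel to RGB using integer arithmetic only. */
void ycbcr_to_rgb(uint8_t y_in, uint8_t cb_in, uint8_t cr_in, uint8_t rgb[3])
{
    int32_t y  = (int32_t)y_in * 256;   /* luma with 8 fractional bits   */
    int32_t cb = (int32_t)cb_in - 128;  /* chroma is centered on 128     */
    int32_t cr = (int32_t)cr_in - 128;

    assert(rgb != NULL);

    /* The +128 term rounds to nearest before the fraction is dropped. */
    rgb[0] = clip_fix8(y + 359 * cr + 128);             /* R */
    rgb[1] = clip_fix8(y - 88 * cb - 183 * cr + 128);   /* G */
    rgb[2] = clip_fix8(y + 454 * cb + 128);             /* B */
}
```

Each output channel needs only a few multiplies by constants, adds, and a clamp, with no dependence between pixels. That independence is what makes this stage a plausible candidate for a hardware datapath.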
