A NOVEL ARCHITECTURE OF OPTIMIZED TRUNCATION FOR JPEG2000

NING CHEN, SHIGEN XIE, XIANG XIE, LEIBO LIU, LI ZHANG, ZHIHUA WANG
Integrated Circuits and Systems Lab, Department of Electronics Engineering, Tsinghua University
Beijing, 100084
CHINA

Abstract: - To cope with recent mobile applications in which images are used intensively, this paper presents a novel optimized truncation architecture dedicated to JPEG2000 image coding. Using a method based on rate-distortion slopes, the proposed architecture aims at low computational cost and a small working memory while maintaining high image quality. By deciding the appropriate number of coding passes at which to truncate as soon as each code-block has been coded, the architecture eliminates iterative computations, reduces the working memory for bitstream buffering from the full image size down to three code-blocks, and provides the same rate-distortion performance as the software architecture. It has been implemented and verified inside a JPEG2000 encoder on a Xilinx xc2v4000-ff1152 FPGA chip.

Key-Words: - Optimized Truncation, EBCOT, JPEG2000, VLSI, FPGA

1 Introduction
With recent advances in the functionality of mobile systems and in the transmission bandwidth of networks, there is strong demand for digital image processing, including still and moving picture coding. When image or video data are transmitted over a communication channel with a specified bandwidth, such as a broadcasting or wireless environment, rate control of the image codestream is indispensable. Rate control makes it possible to meet a particular target bitrate or transmission time, and ensures that the desired number of bytes is used in the codestream while achieving the highest possible image quality. JPEG2000 [1], standardized in 2001, offers strong rate control capability based on the wavelet transform and embedded block coding.
It is highly advantageous to adopt embedded block coding, since this scheme allows a post-compression rate-distortion optimization (PCRD) algorithm [2]. In other words, a JPEG2000 encoder can truncate the codestream of each block in an optimal way so that the required bitrate is attained. JPEG2000 can therefore be regarded as a viable image coding scheme for the coming network era.

Looking into the details of the algorithm, however, coefficient bit modeling and arithmetic coding are executed on a code-block basis, where a code-block is a small rectangular portion of the decomposed image. The code-blocks are encoded independently of each other; hence, to control the total number of bits of the coded image, a set of truncation (coding termination) points for all code-blocks must be determined according to the specified bitrate. One approach to rate control is rate-distortion optimization [3]. Given a target bitrate, this scheme evaluates the distortion incurred in the reconstructed image for each candidate set of truncation points, and then selects the set that gives the minimum image distortion. The scheme attains fairly good image quality but suffers from high computational cost: coefficient bit modeling and arithmetic coding must be completed before rate control starts, which demands a large buffer for the whole image, and iterative computations are needed to find the appropriate set of truncation points. Implementing optimized truncation on chip therefore remains difficult. Current JPEG2000 implementations employ one of two methods to work around this problem:
1. using quantization coefficients instead of optimized truncation for rate control;
2. leaving the optimized truncation to software on an MCU.
Both methods sacrifice flexibility, and the second also imposes a heavy burden on system throughput and cost.
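The truncation problem described above can be stated compactly as a constrained minimization. The notation below is assumed here for illustration and does not appear in the original text: n_i is the truncation point chosen for code-block i, R_i^{n} and D_i^{n} are the rate and distortion of block i when truncated at point n, R_max is the bit budget, and lambda is the slope threshold.

```latex
\[
\min_{\{n_i\}} \; D = \sum_i D_i^{n_i}
\qquad \text{subject to} \qquad
R = \sum_i R_i^{n_i} \le R_{\max}
\]
% Rate-distortion slope at candidate truncation point k of block i,
% where k-1 denotes the previous candidate point of the same block:
\[
S_i^{k} = \frac{D_i^{k-1} - D_i^{k}}{R_i^{k} - R_i^{k-1}},
\qquad
n_i = \max \{\, k : S_i^{k} > \lambda \,\}
\]
```

Under this formulation, once the candidate points of a block are restricted to those with monotonically decreasing slopes, the truncation point of each block follows from a comparison against a single threshold, without reference to the other blocks.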
Motivated by this problem, we devise a new rate control architecture that executes optimized truncation in parallel with arithmetic coding. The architecture first stores the codestream and the code-block information of a code-block in separate buffers, then estimates the rate-distortion slope for each truncation point and selects the monotonically decreasing subset. Once all rate-distortion slope metrics are available, the optimal truncation point for the current block can be decided easily. By referring to the information buffer to obtain the truncated block length, the architecture accomplishes rate control by simply shifting the buffer address to truncate the block stream. An arbiter is needed to manage the buffers and the output stream. As a result, iterative truncation is avoided and the computational cost is reduced considerably. At the same time, the buffer size is reduced from tile size to code-block size. Because it greatly reduces computational cost, the proposed optimized truncation architecture is especially effective in low target bitrate applications that demand flexible rate control and optimal rate-distortion performance.

2 JPEG2000 Coding and Rate Control

2.1 JPEG2000 Coding Algorithm
In the JPEG2000 coding scheme, the target image is first divided into rectangular regions called tiles. A 2-D discrete wavelet transform (DWT) then decomposes a tile into LL, HL, LH, and HH subbands. The LL subband is a low-resolution version of the original tile and is in turn decomposed recursively into four subbands. Thus each decomposition level contributes three subbands, except the last, which has four. This is called Mallat decomposition. A subband is divided into code-blocks, typically 64x64, each of which is coded individually by the coefficient bit modeling described below.
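The subband structure produced by the Mallat decomposition described above can be enumerated with a short sketch. It assumes a square tile whose side is a power of two and ignores precinct partitioning; the function name `mallat_subbands` and its parameters are hypothetical and chosen only for illustration.

```python
from math import ceil


def mallat_subbands(tile_size, levels, block_size=64):
    """List (name, level, side, code_block_count) for a square tile.

    Assumption: the tile side halves exactly at each level, which holds
    for power-of-two sizes; other sizes are only approximated here.
    """
    bands = []
    side = tile_size
    for level in range(1, levels + 1):
        side = ceil(side / 2)                    # each level halves the side
        blocks = ceil(side / block_size) ** 2    # code-blocks per subband
        for name in ("HL", "LH", "HH"):          # three detail subbands per level
            bands.append((name, level, side, blocks))
    # The last level additionally keeps its LL subband.
    bands.append(("LL", levels, side, ceil(side / block_size) ** 2))
    return bands


if __name__ == "__main__":
    for band in mallat_subbands(512, 3):
        print(band)
```

For a 512x512 tile with three levels, this lists 3 x 3 + 1 = 10 subbands, and with 64x64 code-blocks the block counts sum to 64, which is simply the tile area divided by the code-block area.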
Wavelet coefficients in a code-block are quantized, the quantized coefficients are separated into sign bits and absolute values, and bit-planes are generated from the bits of the absolute values such that each bit-plane holds the bits of the same significance in all coefficients of the subband. Coefficient bit modeling is a process that labels the bits of a bit-plane, based on statistical information, through three different coding passes, which allows efficient compression by the subsequent MQ-coder, an arithmetic coder. The MQ-coder generates compressed image data and information for each code-block independently.

2.2 Rate Control Mechanisms
In the encoder, rate control can be achieved through two distinct mechanisms: 1) the choice of quantizer step sizes, and 2) the selection of the subset of coding passes to include in the codestream. When the first mechanism is employed, quantizer step sizes are adjusted in order to control the rate. Although this mechanism is conceptually simple, it has one potential drawback: every time the quantizer step sizes are changed, the tier-1 encoding must be performed again. Since tier-1 coding requires a considerable amount of computation, this approach to rate control may not be practical in computationally constrained encoders.

When the second mechanism is used, the encoder can elect to discard coding passes in order to control the rate. The encoder knows the contribution that each coding pass makes to the rate, and can also calculate the distortion reduction associated with each coding pass. Using this information, the encoder can include the coding passes in order of decreasing distortion reduction per unit rate until the bit budget has been exhausted. This process is the optimized truncation in EBCOT. The approach is very flexible in that different distortion metrics can be accommodated easily (e.g., mean squared error, visually weighted mean squared error).

The rate-distortion optimization adopted in the JPEG2000 verification model (VM) [4] tries to achieve the best image quality for the specified bitrate. Since every code-block in JPEG2000 is coded completely independently of the others, this scheme first calculates pairs of coded data amount and distortion value at the end of every pass in each code-block. Among the sets of pairs that meet the target bitrate, it selects the one that gives the lowest total image distortion. Then, for each code-block, the truncation point is set to the end of the corresponding pass in the selected set.

3 Parallel Optimized Truncation Architecture
As mentioned before, rate-distortion optimized truncation can find the set of truncation points that attains the best image quality. However, this scheme requires coefficient bit modeling and arithmetic coding of all bit-planes of the code-blocks in the entire image, which is the most computationally intensive process in JPEG2000 coding. Furthermore, in terms of working memory, the scheme must keep the compressed code-block data of the whole image together with a considerable amount of information about the distortion values of the coded data. In contrast, our approach reduces the amount of memory required and determines the truncation point of each code-block as soon as its coding process ends.

3.1 Proposed Optimized Truncation Scheme
The proposed architecture is composed of three modules: code truncation, info truncation, and a buffer arbiter. Fig. 1 shows the architecture.

[Figure 1: block diagram. Labels: Arith I/F, RD_Slope I/F, Info Control, Code Control, SRAM (one per control block), Buffer Manager; signals: optimize_valid, optimized_length]
Fig.1. Proposed Optimized Truncation Block Diagram

The data needed to construct a JPEG2000 codestream can be divided into two categories: code and info. Code refers to the bytes generated by the entropy encoder, while info refers to the code-block information needed by the decoder.
Info consists of three types of data: zero bit-plane information, the pass number, and the cumulative length for each pass in the code-block. In addition to these data, the rate-distortion slope of each pass is needed for optimized truncation. Note that only the truncated block length is needed in the end, not all the cumulative lengths. These characteristics allow some simplification in the implementation, as discussed later. In this design, separate handlers for code and info simplify and clarify the architecture and minimize the memory addressing effort. Although the packetization process is not included in this encoder, the independent info buffer provides a convenient interface for integrating it. On the other hand, the design demands two sets of memory blocks, which imposes a burden on the backend design, especially when multi-band encoders are used, as discussed later. In addition, the buffer sizes for code and info differ. To avoid buffer overflow, each buffer must be twice the maximum capacity of a code-block. For the default block size of 32x32, the code buffer and the info buffer can be chosen as 1024x16 and 128x16, since a block seldom generates more than 1024 bytes or 64 passes.

The proposed optimized truncation method can be described as follows:
1. When block coding starts, the info handler gets zero_bit_plane and pass_number.
2. For each code_byte, the code handler writes it into the buffer and the address increases by one.
3. When a pass finishes, rd_slope gives out the pass_length, the info handler writes it into the buffer, and the address increases by one.
4. When the block finishes, rd_slope gives out the slopes of all passes one by one, in monotonically decreasing order. The slopes of passes that are not suitable for optimized truncation are set to zero.
5. The info handler compares these slopes with the given threshold:
       if slope > threshold
           optimized_pass_number = current_pass_number
       else if slope == zero
           next        (skip this pass and examine the following one)
       else if slope < threshold
           break
   As noted earlier, the rate-distortion slope is not needed for constructing the codestream, so no memory access is needed.
6. The info handler moves the address to the beginning of the block and writes optimized_pass_number.
7. The info handler increases the address by optimized_pass_number, gets optimized_block_length from the buffer, and asserts optimize_valid.
8. When optimize_valid is asserted, the code handler gets optimized_block_length and shifts its address to current_address - block_length + optimized_block_length.

3.2 Parallel Architecture for Three-Band Encoders
Since almost all JPEG2000 encoders employ three entropy encoders for speed, we use three copies of the truncation logic to serve them, with a single shared arbiter. Fig. 2 shows the block diagram of the JPEG2000 encoder architecture. Because the proposed scheme truncates each block independently, just as the entropy encoder codes each block independently, it can be integrated seamlessly into the three-band parallel architecture. Fig. 3 shows the three-band parallel optimized truncation diagram.

[Figure 2: block diagram. Labels: CCD/CMOS I/F, Capture, SDRAM I/F, Control Unit, Wavelet Engine, Coefficient SRAM, Quantizer, Entropy Coding Engine (x3), Optimized Truncation, Arbiter, HOST I/F]
Fig.2. JPEG2000 Encoder System Block Diagram

[Figure 3: block diagram. Labels: Arith I/F (x3), Info Control and Code Control (x3, each with SRAM), Buffer Manager, HOST I/F; signals: optimize_valid, optimized_length]
Fig.3. Proposed Architecture Block Diagram

Compared with the software architecture used in the JPEG2000 Verification Model, the proposed architecture provides the same rate-distortion performance because it chooses the same truncation points. At the same time, parallel optimized truncation needs a much smaller buffer and does not have to search the entire codestream.
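The eight-step truncation procedure of Section 3.1 can be sketched in software as follows. This is a behavioral model under stated assumptions, not the hardware design: the function names `select_truncation` and `truncate_code_address` are hypothetical, the sample slopes and lengths are invented for illustration, and a slope exactly equal to the threshold is treated as below it, a case the text leaves open.

```python
def select_truncation(slopes, cumulative_lengths, threshold):
    """Model of steps 4 to 7: choose the truncation pass for one code-block.

    slopes[k] is the rate-distortion slope of pass k+1; passes that are not
    valid truncation points carry a slope of zero. cumulative_lengths[k] is
    the number of code bytes up to and including pass k+1.
    Returns (optimized_pass_number, optimized_block_length).
    """
    optimized_pass_number = 0
    for pass_number, slope in enumerate(slopes, start=1):
        if slope > threshold:
            optimized_pass_number = pass_number   # keep everything up to here
        elif slope == 0:
            continue                              # not a candidate; look further
        else:
            break                                 # slopes only decrease from here
    if optimized_pass_number == 0:
        return 0, 0                               # whole block is discarded
    return optimized_pass_number, cumulative_lengths[optimized_pass_number - 1]


def truncate_code_address(current_address, block_length, optimized_block_length):
    """Model of step 8: move the code buffer write address back to the cut."""
    return current_address - block_length + optimized_block_length


if __name__ == "__main__":
    slopes = [900, 0, 400, 150, 0, 60]            # invented sample values
    lengths = [12, 20, 35, 51, 60, 78]            # cumulative bytes per pass
    passes, length = select_truncation(slopes, lengths, threshold=100)
    print(passes, length)
    print(truncate_code_address(1078, lengths[-1], length))
```

Because the slopes arrive in decreasing order, a single forward scan suffices and no slope has to be stored, which is consistent with the remark in step 5 that no memory access is needed for the slopes.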
It is worth pointing out that a post-compression scheme that runs only after the entire image has been encoded would, when implemented in VLSI, incur a large addressing effort.

3.3 Comparison with other schemes
Compared with other schemes, the proposed architecture achieves low computational effort by enabling a one-pass encoding process with parallel rate control, and reduces the memory demand from tile size (512x512x8) to three code-blocks (32x32x8x3), that is, from 2,097,152 bits to 24,576 bits, without loss of rate-distortion performance. Table 1 shows the comparison.

        Quantization                  MCU-based Optimized     Proposed
        Method                        Truncation              Architecture
Mem     N/A                           Tile-size               block-size x3
Area    N/A                           Depends on MCU          7.9 kgates
Time    Multi-pass encoding process   Depends on MCU speed    Parallel one-pass encoding
PSNR    Good                          Best                    Best

Table 1. Comparison with other schemes

4 Conclusion
In this paper, a novel optimized truncation architecture dedicated to JPEG2000 image coding is proposed. The architecture runs concurrently with code-block coding, including coefficient bit modeling and arithmetic coding. It removes a considerable part of the computational load and working memory while achieving the same high image quality as the software architecture. The proposed architecture is therefore suitable for use in an ASIC. It has been integrated into a complete JPEG2000 encoder system and verified on FPGA.

5 Acknowledgement
This work was supported in part by the National Basic Research Priorities Program of China (973 Program) Grant G2000036508 and the National High Technologies Research and Development Program of China (863 Program) Grant 2002AA1Z1420.

References:
[1] ISO/IEC JTC1/SC29/WG1 N2165, JPEG2000 Verification Model 9.1 (Technical Description), June 2001.
[2] D. Taubman, High Performance Scalable Image Compression with EBCOT, IEEE Transactions on Image Processing, Vol. 9, No. 7, pp. 1158-1170, June 2000.
[3] Jin Li, Shawmin Lei, An Embedded Still Image Coder with Rate-Distortion Optimization, IEEE Transactions on Image Processing, Vol. 8, No. 7, July 1999.
[4] ISO/IEC JTC1/SC29/WG1 N1684, JPEG2000 Verification Model 7.0 (Technical Description), April 2000.