International Journal of Engineering Trends and Technology (IJETT), Volume 14 Number 3, Aug 2014. ISSN: 2231-5381. http://www.ijettjournal.org

Design and Implementation of Multi-Standard Video Encoder Supporting Different Coding Standards

Karthika Sudersanan#1, R. Ramya*2
#1 Student, *2 Associate Professor, Department of Electronics and Communication, Sree Buddha College of Engineering, University of Kerala, Kerala, India

Abstract: This paper presents the VLSI architecture of a multi-standard video encoder that supports the MPEG-2/4, VC-1, H.264/AVC and AVS video coding standards. A multi-standard transform unit with reduced path delay generates the transform coefficients for the different standards through static reconfiguration. The encoder is suitable for real-time processing of video with little motion, such as video calls. The architecture is synthesised in the Xilinx ISE Design Suite and implemented on a Virtex-5 FPGA board.

Keywords: Motion estimation, Video coding standards

I. INTRODUCTION

In a computer, a still image is represented as an array of integers. The array is usually two dimensional (2-D) for a black and white image and three dimensional (3-D) for a colour image. Each number in the array represents an intensity value at a particular location in the image, called a picture element, or pixel for short. For an 8-bit representation, the pixel values are non-negative integers in the range 0 to 255. Each pixel of a black and white image therefore occupies 1 byte of memory, and the image has a gray scale resolution of 8 bits per pixel (bpp). A colour image, on the other hand, has a triplet of values for each pixel, one each for the red, green, and blue primary colours, and so needs 3 bytes of storage per pixel. A video signal is composed of a series of still images, referred to as frames, as shown in Fig 1.
Fig 1: Series of still images representing "scenes in motion"

The number of frames displayed per second is referred to as the frame rate, which determines how smoothly the changes between frames are perceived. The human visual system can process 10 to 12 separate images per second, perceiving each image individually. Therefore, the higher the frame rate, the smoother the moving picture appears. A video source may produce 30 or more frames per second, in which case the raw data rate is higher still. No practical channel offers the transmission bandwidth that such raw video would require. Efficient data compression schemes are therefore needed to bring the raw video data rate down to a manageable value, so that practical communication channels can carry the data to its destination in real time.

Compression is the process of reducing the size of the data sent, thereby reducing the bandwidth required for the digital representation of a signal. Because less data is transmitted, compression shortens transmission time, and because less data is stored, it lowers storage requirements. Text files, pictures, voice, video, and in fact any data that contains redundancy can be made smaller by compression.

Many video coding standards currently exist, such as MPEG-2, MPEG-4, H.264/AVC, VC-1, AVS and HEVC. Communication between video devices that use different standards is inconvenient, so a video codec that supports multiple standards is more useful and more attractive. In this paper we develop a 2-D multi-standard transform unit with reduced path delay, which generates transform coefficients through static reconfiguration for the MPEG-2/4, VC-1, H.264/AVC and AVS video coding standards. Using this transform unit, a multi-standard video encoder is developed that generates a compressed bit stream for transmission or storage.
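The raw data rate argument made earlier in this section can be made concrete with a little arithmetic. The sketch below is only an illustration: the 640×480 resolution and the function names are assumptions chosen for the example, not values from this work; the 1 and 3 bytes per pixel and the 30 frames per second follow the text above.

```python
def raw_frame_bytes(width, height, bytes_per_pixel):
    """Storage needed for one uncompressed frame, in bytes."""
    return width * height * bytes_per_pixel


def raw_bit_rate(width, height, bytes_per_pixel, frames_per_second):
    """Uncompressed video data rate in bits per second."""
    return raw_frame_bytes(width, height, bytes_per_pixel) * 8 * frames_per_second


# Assumed example: a 640x480 colour source (3 bytes per pixel) at 30 frames/s.
gray_frame = raw_frame_bytes(640, 480, 1)      # 8 bpp gray scale frame
colour_rate = raw_bit_rate(640, 480, 3, 30)    # RGB video stream

print(f"gray scale frame: {gray_frame} bytes")
print(f"raw colour video: {colour_rate / 1e6:.3f} Mbit/s")
```

Even at this modest assumed resolution the uncompressed colour stream exceeds 200 Mbit/s, which is the motivation for the compression stages described in the following sections.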
The rest of this paper is organized as follows. Section II reviews important contributions of research in the field of video coding units. Section III discusses the architecture of the multi-standard video encoder. Section IV deals with the architecture of the multi-standard transform unit, and Section V presents the experimental results. The last section concludes the work.

II. RELATED WORK

Kim and Koh [1] presented an area-efficient VLSI architecture of the transform coding module for an MPEG-2 video encoder. This design and implementation are applicable to MPEG-2 video encoders.

Malvar et al. [2] presented an overview of the transform and quantization designs in H.264. Unlike the popular 8×8 discrete cosine transform used in previous standards, the 4×4 transforms in H.264 can be computed exactly in integer arithmetic, thus avoiding inverse transform mismatch problems.

Lee and Cho [3] presented an area-efficient architecture of a VLSI circuit that can perform various DCT-based transforms for a video decoder supporting multiple standards such as JPEG, MPEG-4, VC-1 and H.264. The proposed architecture uses a novel concept of a delta coefficient matrix and shares resources such as adders and shifters (instead of multipliers) as much as possible.

Qi, Huang and Gao [4] presented a low-cost very large scale integration (VLSI) architecture for a multi-standard inverse transform. The architecture is used in a multi-standard decoder for MPEG-2, MPEG-4 ASP, H.264/AVC and VC-1. Two circuit sharing strategies, factor share (FS) and adder share (AS), are applied to the inverse transform architecture to save circuit resources.

Jignesh Patel et al. [5] presented a VHDL implementation of the H.264 video coding standard. Depending on the application, H.264 defines Profiles and Levels that specify restrictions on bit streams, like some of the previous video standards. The design covers transform and quantization, scaling the input video and converting it to generate the byte stream of the input video.

Kanwen Wang and Jialin Chen [6] presented a reconfigurable VLSI architecture designed for a multi-transform codec covering the MPEG-2/4, VC-1, H.264/AVC and AVS video coding standards.

III. ARCHITECTURE OF MULTI-STANDARD VIDEO ENCODER

Video compression exploits three kinds of data redundancy within a video scene. In standard lossy video compression such as H.264, the temporal redundancy between adjacent frames is the most important redundancy in video-type data, and a large proportion of the data dependency can be reduced through motion estimation. Spatial redundancy, which can be exploited within a frame, is reduced via transform techniques. It also constitutes a significant portion of the redundancy, since there is usually a high correlation between neighbouring pixels. Lastly, statistical redundancy, which occurs in any kind of data source, is reduced by an entropy coder in the last stage of the encoder.

Every compression system involves two complementary units, an encoder (compression unit) and a decoder (decompression unit). The encoder exploits the redundancy in the given data and converts it to a compressed data stream. The decoder interprets the compressed data stream and restores it to the original format. The encoder/decoder pair is often described as a codec (coder/decoder). The complete architecture of the multi-standard video encoder is shown in Fig. 2.

Fig. 2 Multi-standard video encoder

Initially the video sequence is converted into frames, whose pixel values are fed serially to the input buffer section. The input buffer temporarily stores the pixel values of a frame; 64 pixels form a frame. From the input buffer, the first frame and every consecutive 8th frame is sent as a reference frame to the motion-estimation block, and the remaining frames are fed directly to the motion-estimation block.

In video compression, the discrete cosine transform (DCT) is widely used because it concentrates signal information in a few low-frequency components. Implementing the DCT with low error, however, requires arithmetic with a long word length and a floating-point implementation, which makes the implementation complex and the coder hardware more expensive. A transform without floating-point multiplication, namely an integer transform, is therefore used in the multi-standard transform unit discussed in Section IV. It is similar, but not identical, to the DCT. The transformed data are called transform coefficients, and they are passed to the quantization unit.

Most modern multimedia codecs (both encoder and decoder) employ a transform-quantization pair. The objective is to minimize the number of bits that must be transmitted to the decoder. Reducing the quantization accuracy reduces the number of bits needed to represent a given transform coefficient.

The entropy encoding unit is the last stage in a video compression system. It accepts the motion vectors output by the motion estimation unit and the quantized transform coefficients from the transform unit, and produces the compressed bit stream that can be transmitted or stored. After quantization, the 8×8 blocks of transform coefficients are scanned in a zigzag manner to turn the 2-D array into a serial sequence of quantized coefficients. This orders the transform coefficients from low to high spatial frequencies, which allows more compression. The next step in the entropy encoder is the run length encoder, the basic form of lossless compression. The run length encoder replaces consecutive repeating occurrences of a symbol by one occurrence of the symbol followed by the number of occurrences, i.e., (run, length). The Huffman encoder assigns a variable-length code word to each input data item, that is, characters are not coded to a fixed number of bits. A shorter code word is assigned to a more frequently occurring input.

IV. MULTI-STANDARD TRANSFORM UNIT

A. Design of Multi-standard Transform unit

The transform unit reduces spatial redundancy within a picture. Its input is the residue picture calculated by the motion estimation unit. The 1-D 8-point forward transform coefficient matrix is described as follows:

\[
C_{tran8} =
\begin{bmatrix}
a & a & a & a & a & a & a & a \\
b & c & d & e & -e & -d & -c & -b \\
f & g & -g & -f & -f & -g & g & f \\
c & -e & -b & -d & d & b & e & -c \\
a & -a & -a & a & a & -a & -a & a \\
d & -b & e & c & -c & -e & b & -d \\
g & -f & f & -g & -g & f & -f & g \\
e & -d & c & -b & b & -c & d & -e
\end{bmatrix}
\]

Matrix Ctran8 is the 8×8 transform coefficient matrix. In total, there are 64 coefficients in this matrix, with 7 different values, a to g. The positions of the coefficients a to g are the same for all video coding standards, but their values change from standard to standard. By using a fast algorithm, the 8-point forward transform matrix can be decomposed into two 4-point forward transform matrices, which are

\[
U_{tran4} =
\begin{bmatrix}
a & a & a & a \\
f & g & -g & -f \\
a & -a & -a & a \\
g & -f & f & -g
\end{bmatrix},
\qquad
V_{tran4} =
\begin{bmatrix}
b & c & d & e \\
c & -e & -b & -d \\
d & -b & e & c \\
e & -d & c & -b
\end{bmatrix}
\]

Note that the 4-point U matrix is also used in 4-point transform coding. The 1-D 8-point transform is written as

\[
Y_8 = C_8 \times X_8
\]

where

\[
Y_8 = [y_0\ y_1\ y_2\ y_3\ y_4\ y_5\ y_6\ y_7]^T, \qquad
X_8 = [x_0\ x_1\ x_2\ x_3\ x_4\ x_5\ x_6\ x_7]^T
\]

and C_8 = Ctran8 is the 8-point transform coefficient matrix. The 1-D 8-point forward transform can be decomposed as

\[
\begin{bmatrix} y_0 \\ y_2 \\ y_4 \\ y_6 \end{bmatrix}
= U_{tran4}
\begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix},
\qquad
\begin{bmatrix} y_1 \\ y_3 \\ y_5 \\ y_7 \end{bmatrix}
= V_{tran4}
\begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}
\]

Obtaining the results of the matrix calculation requires many multiplications and additions. The positions and signs of the coefficients a to g are the same, although each standard has its own coefficient values in the transform matrix. The coefficients for the different video coding standards are given in Table I.

Table I Coefficients for different video coding standards

Matrix  Coefficient  MPEG-2/4  8-point VC-1  8-point H.264  4-point VC-1  4-point H.264  AVS
U       a            362       12            8              17            1              8
U       f            473       16            8              22            2              10
U       g            196       6             4              10            1              4
V       b            502       16            12             -             -              10
V       c            426       15            10             -             -              9
V       d            284       9             6              -             -              6
V       e            100       4             3              -             -              2

B. Architecture of Multi-standard Transform unit

The architecture of the multi-standard transform unit is given in Fig. 2. It mainly consists of the U matrix calculation block, the V matrix calculation block, the pre-processing block, and the adder tree block. The matrix calculation is multiplierless, built only from adders and shifters. Two kinds of constant multipliers are used to calculate each term of the matrix product: an AFG constant multiplier is in charge of the U matrix calculations, and a BCDE constant multiplier is in charge of the V matrix calculations. The constant multipliers calculate the coefficient products in parallel and are reconfigured to support a given standard according to the input at sel_standards. In the proposed multi-standard transform unit, the ripple carry adder in the design is replaced by a carry select adder, which reduces the combinational delay and the memory usage of the design. The pre-processing block realizes the butterfly structure of the forward transform. The adder tree block obtains the sum of the matrix product.
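The butterfly decomposition of Section IV-A can be checked numerically. The following is a minimal Python sketch, not the Verilog design described in this paper: it builds the 8-point matrix and the 4-point U and V matrices from the Table I coefficients and confirms that the pre-processing (sum and difference) step followed by two 4-point products gives the same result as the direct 8×8 product. The function names, the dictionary keys and the test input are illustrative assumptions.

```python
# Coefficient values a..g per standard, copied from Table I (8-point cases).
# The dictionary keys are illustrative labels, not identifiers from the paper.
COEFFS = {
    "MPEG-2/4":      dict(a=362, b=502, c=426, d=284, e=100, f=473, g=196),
    "8-point VC-1":  dict(a=12, b=16, c=15, d=9, e=4, f=16, g=6),
    "8-point H.264": dict(a=8, b=12, c=10, d=6, e=3, f=8, g=4),
    "AVS":           dict(a=8, b=10, c=9, d=6, e=2, f=10, g=4),
}


def u_matrix(k):
    """4-point U matrix (coefficients a, f, g)."""
    a, f, g = k["a"], k["f"], k["g"]
    return [[a, a, a, a], [f, g, -g, -f], [a, -a, -a, a], [g, -f, f, -g]]


def v_matrix(k):
    """4-point V matrix (coefficients b, c, d, e)."""
    b, c, d, e = k["b"], k["c"], k["d"], k["e"]
    return [[b, c, d, e], [c, -e, -b, -d], [d, -b, e, c], [e, -d, c, -b]]


def full_matrix(k):
    """8-point forward transform matrix, written out row by row."""
    a, b, c, d, e, f, g = (k[n] for n in "abcdefg")
    return [
        [a, a, a, a, a, a, a, a],
        [b, c, d, e, -e, -d, -c, -b],
        [f, g, -g, -f, -f, -g, g, f],
        [c, -e, -b, -d, d, b, e, -c],
        [a, -a, -a, a, a, -a, -a, a],
        [d, -b, e, c, -c, -e, b, -d],
        [g, -f, f, -g, -g, f, -f, g],
        [e, -d, c, -b, b, -c, d, -e],
    ]


def matvec(m, x):
    """Plain matrix-vector product on lists of integers."""
    return [sum(r * v for r, v in zip(row, x)) for row in m]


def transform_direct(x, k):
    """Y8 = C8 x X8 computed with the full 8x8 matrix."""
    return matvec(full_matrix(k), x)


def transform_butterfly(x, k):
    """Same transform via the butterfly stage and two 4-point products."""
    sums = [x[i] + x[7 - i] for i in range(4)]    # feeds the U matrix
    diffs = [x[i] - x[7 - i] for i in range(4)]   # feeds the V matrix
    y = [0] * 8
    y[0::2] = matvec(u_matrix(k), sums)           # y0, y2, y4, y6
    y[1::2] = matvec(v_matrix(k), diffs)          # y1, y3, y5, y7
    return y


x = [10, 20, 30, 40, 50, 60, 70, 80]  # arbitrary test row, not from the paper
for name, k in COEFFS.items():
    assert transform_direct(x, k) == transform_butterfly(x, k)
    print(name, transform_butterfly(x, k))
```

The butterfly form needs two 4×4 products (32 multiplications) instead of one 8×8 product (64 multiplications), which is the saving the U and V calculation blocks rely on.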
Fig. 2 1-D Multi-standard Transform VLSI architecture

In the proposed design, the "standardsel_afg" structure consists of a select_function unit and an afg_multiplier unit, as shown in Fig 3. The select_function unit generates seven select lines (m0, m1, s0, s1, s2, s3, s4), with which the a, f and g coefficients of the different standards are produced. The generated select lines are fed to the "afg_multiplier" unit. Similarly, "bcde_coefficient" consists of a select_function unit and a bcde_multiplier unit. For the bcde_multiplier, the select_function unit generates six select lines.

Fig 3 Standardsel_afg unit

Fig 4 shows the AFG multiplier unit. The select lines m0 and m1 are used for the add/sub modules m0 and m7 respectively. When m0 and m1 are 1, the add/sub module acts as a subtractor, otherwise as an adder. The select lines s0 and s1 are for the 2:1 MUXes m5 and m9 respectively. The select lines s4, s3 and s2 are used as the common select lines for all three 8:1 multiplexers.

Fig 4 AFG constant multiplier

Fig 5 shows the BCDE multiplier unit. With the help of six select lines (s0, s1, s2, s3, s4, s5), the b, c, d and e coefficients for the different standards are generated. The select lines s0 and s1 are used as common select lines for the four 4:1 multiplexers. s4 and s5 are the select lines for the 2:1 MUXes m7 and m9 respectively. s2 and s3 are the select lines for the 4:1 MUX (m6).

Fig 5 BCDE constant multiplier

The 2-D multi-transform can be implemented with the 1-D multi-transform and a transpose memory in a row-column decomposition manner. First, the multi-transform is applied to the rows of the 8×8 image block. The resulting transform coefficients are stored temporarily in the transpose memory, and the multi-transform is then applied to the columns of the result.

V. EXPERIMENTAL RESULTS

A. Performance Evaluation

The main performance evaluation parameter considered for this architecture is the combinational path delay.

B. Simulation and Implementation setup

The proposed multi-standard video encoder is designed in the hardware description language Verilog using the Xilinx ISE Design Suite. The design is simulated using the ISim simulator and implemented on a Virtex-5 FPGA board. Throughout the design, the value of the code selects the standard as follows.

Code 0: MPEG 2
Code 1: MPEG 4
Code 2: 8 Point VC-1
Code 3: 8 Point H.264
Code 4: 4 Point VC-1
Code 5: 4 Point H.264
Code 6: AVS

Figure 6 shows the simulation result of the forward transform applied to the first row of the 8×8 image block. 'x0' to 'x7' are the inputs, i.e., the first row of the image block. Code denotes the video standard. 'y0' to 'y7' are the transformed outputs.

Fig 6 Simulation result of 1 D forward transform unit

Figure 7 shows the simulation results of the multi-standard video encoder. 'out_sig' is the bit stream generated after compression. This bit stream can be either transmitted or stored. 'out_rdy' is high for the bit stream of each pixel.

Fig. 7 Multi-standard video encoder

D. Comparison of timing summary

Comparing the existing multi-transform architecture with the proposed architecture shows that the maximum combinational path delay is reduced from 24.118 ns to 21.846 ns, a reduction of about 9.4%, while the slice register utilisation is unchanged. Table II gives the comparison.

Table II Comparison of Timing summary

Area of comparison                  Multi-transform architecture   Proposed architecture
Maximum combinational path delay    24.118 ns                      21.846 ns
No. of slice registers              5%                             5%

VII. CONCLUSION

In this paper, the design and architecture of a multi-standard video encoder is developed using the proposed multi-standard transform unit, which generates transform coefficients for different video standards through static reconfiguration.

ACKNOWLEDGEMENTS

I would like to thank Ms. R. Ramya for providing the required assistance and extending all the facilities for carrying out my work successfully.

REFERENCES

[1] K. Kim and J. S. Koh, "An area efficient DCT architecture for MPEG-2 video encoder," IEEE Trans. Consum. Electron., vol. 45, no. 1, pp. 62-67, Feb. 1999.
[2] H. S. Malvar, A. Hallapuro, M. Karczewicz and L. Kerofsky, "Low complexity transform and quantization in H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 598-603, July 2003.
[3] S. Lee and K. Cho, "Architecture of transform circuit for video decoder supporting multiple standards," Electronics Letters, vol. 44, no. 5, Feb. 2008.
[4] H. Qi, Q. Huang and W. Gao, "A low-cost very large scale integration architecture for multistandard inverse transform," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 7, pp. 551-555, Jul. 2010.
[5] J. Patel, H. Suthar and J. Gadit, "VHDL Implementation of H.264 Video Coding Standard," International Journal of Reconfigurable and Embedded Systems (IJRES), vol. 1, no. 3, pp. 95-102, ISSN: 2089-4864, November 2012.
[6] K. Wang and J. Chen, "A reconfigurable multi transform VLSI architecture supporting video codec design," IEEE Trans. Circuits Syst. II, vol. 58, no. 7, pp. 432-436, July 2011.