VIDEO COMPRESSION AND DECOMPRESSION USING ADAPTIVE ROOD PATTERN SEARCH

Nirav Shah
B.E., Dharmsinh Desai University, India, 2006

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

ELECTRICAL AND ELECTRONIC ENGINEERING

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010

VIDEO COMPRESSION AND DECOMPRESSION USING ADAPTIVE ROOD PATTERN SEARCH

A Project by Nirav Shah

Approved by:

__________________________________, Committee Chair
Jing Pang, Ph.D.

__________________________________, Second Reader
Preetham Kumar, Ph.D.

____________________________
Date

Student: Nirav Shah

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the Project.

__________________________, Graduate Coordinator        ________________
Preetham Kumar, Ph.D.                                    Date
Department of Electrical and Electronic Engineering

Abstract of VIDEO COMPRESSION AND DECOMPRESSION USING ADAPTIVE ROOD PATTERN SEARCH by Nirav Shah

Video compression is becoming increasingly important in the electronic world, as growing video usage over the internet keeps raising bandwidth and storage requirements. Pioneering advances in video compression algorithms are therefore important. This project discusses several algorithms currently available in the commercial market, along with their advantages and disadvantages. One of them is the H.264 standard. H.264 is a block-oriented, motion-compensated codec standard developed by the ITU-T. The aim of this standard is to provide better video quality with a smaller amount of transferred information.

The final goal of the project was to implement a video encoder and decoder in Matlab. A video captured in RGB format was encoded, with each frame processed by dividing it into several macro blocks. For the encoder, several motion estimation algorithms were studied. The algorithms were compared with respect to the number of calculations each requires and the arithmetic complexity involved. The peak signal to noise ratio over multiple frames was also calculated for the different algorithms as a measure of their quality. From the discussed algorithms, the ARPS (Adaptive Rood Pattern Search) algorithm was used in the final encoder. Motion vectors generated by ARPS were given to motion compensation to generate a compensated image. The difference between the current and compensated images was transformed using the DCT (Discrete Cosine Transform). Finally, the transformed coefficients were quantized and encoded using RLE (run-length encoding). The encoded video stream was successfully decoded by a decoder following the reverse process to regenerate the video in its original format.

_______________________, Committee Chair
Jing Pang, Ph.D.

_______________________
Date

ACKNOWLEDGMENTS

Before discussing the project, I would like to offer some heartfelt words to those who motivated and helped me complete this project successfully. I am thankful to Dr. Jing Pang for providing me an opportunity to work on this project, which was a great exposure to the field of video processing. I thank her for providing all the resources, help and guidance needed to finish the project. Her knowledge and expertise in the field were very helpful for understanding the project, and without her guidance and help this project would not have been completed successfully. I would like to thank Dr.
Preetham Kumar for reviewing my report and providing valuable suggestions that helped me improve it. I would like to thank my family members for providing me strength and inspiration during the critical phases of this project over the last year. Finally, I would like to thank all the faculty members of the Department of Electrical and Electronic Engineering at California State University, Sacramento for their help and support in completing my graduation successfully.

TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures

Chapter
1. INTRODUCTION
   1.1 Purpose of the Project
   1.2 Significance of the Project
   1.3 Organization of Report
2. BASIC COMPONENTS OF H.264 ENCODER AND DECODER
   2.1 Types of Redundancy
   2.2 Encoder Module
       2.2.1 Intra Frame Prediction
       2.2.2 Inter Frame Prediction
       2.2.3 Motion Compensation
       2.2.4 Residual Frame
       2.2.5 Residual Frame Encoding
   2.3 Decoder Module
3. IMAGE COMPRESSION COMPONENTS
   3.1 Transformation
   3.2 Quantization
   3.3 Entropy Encoding and Decoding
4. INTER FRAME PREDICTION BASICS
5. INTER FRAME PREDICTION ALGORITHMS
   5.1 Exhaustive Search
   5.2 Three Step Search
   5.3 Diamond Search
   5.4 Adaptive Rood Pattern Search
6. SIMULATION RESULTS
7. CONCLUSION
References

LIST OF TABLES

1. Table 3-1: 8x8 Discrete Cosine Transform Matrix
2. Table 3-2: 8x8 Pixel Block
3. Table 3-3: DCT of the Block Shown in Table 3-2
4. Table 3-4: Quantization Matrix Example
5. Table 3-5: DCT Matrix Before Quantization
6. Table 3-6: De-quantized DCT Matrix
7. Table 5-1: Average Cost in ES, TSS and NTSS
8. Table 6-1: Videos Taken as Reference
9. Table 6-2: Average Number of Search Points per MV Generation

LIST OF FIGURES

1. Figure 2-1: Spatial Redundancy Example
2. Figure 2-2: Temporal Redundancy
3. Figure 2-3: Difference Between Consecutive Frames
4. Figure 2-4: Transform Distribution of the Image Shown in Figure 2-1
5. Figure 2-5: Block Diagram of H.264 Encoder
6. Figure 2-6: Intra Frame Prediction Block Diagram
7. Figure 2-7: Prediction Using One Reference Frame
8. Figure 2-8: Prediction Using Two Reference Frames
9. Figure 2-9: Compensated Frame
10. Figure 2-10: Residual Frame
11. Figure 2-11: Residual Frame 3D Mesh Plot
12. Figure 2-12: Block Diagram of H.264 Decoder
13. Figure 3-1: Image Compression Components
14. Figure 4-1: A 16x16 Macro Block Breaking
15. Figure 4-2: Area of Search for a Macro Block
16. Figure 4-3: Motion Vector for a Macro Block
17. Figure 5-1: Search Block and Current Macro Block
18. Figure 5-2: Basic Three Step Search
19. Figure 5-3: New Three Step Search
20. Figure 5-4: LDSP and SDSP
21. Figure 5-5: Example of Diamond Search
22. Figure 5-6: Matching Error Surface
23. Figure 5-7: Types of Prediction of Motion
24. Figure 5-8: First Step of ARPS
25. Figure 6-1: Comparison of PSNR for 20 Frames Between ES and ARPS
26. Figure 6-2: Pseudo Block Diagram of the Video Encoder

Chapter 1
INTRODUCTION

Usage of HD (High Definition) video is increasing day by day in applications like television, video over the internet, gaming and video surveillance [1]. Considerable storage space is required to store such information, and the computational requirement to process this large amount of data is also very high. Pervasive, seamless, high-quality digital video has long been the goal of companies, researchers and standards bodies. Areas like television and consumer video storage have already captured a huge share of the consumer electronics market. Applications like videoconferencing, video email and mobile video are also growing day by day, demanding intense video processing.

Getting digital video from its source (a camera or a stored clip) to its destination (a display) involves a chain of components or processes. Key to this chain are the processes of compression (encoding) and decompression (decoding), in which bandwidth-intensive 'raw' digital video is reduced to a manageable size for transmission or storage, then reconstructed for display. Getting the compression and decompression processes 'right' can give a significant technical and commercial edge to a product, by providing better image quality, greater reliability and/or more flexibility than competing solutions. The challenge is the adoption of a common process that can be used by a variety of audiences. Two major groups are actively involved in providing standards for video compression: the Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG). MPEG and VCEG have jointly developed a standard that promises to outperform the earlier MPEG-4 and H.263 standards, providing better compression of video images. The new standard is entitled 'Advanced Video Coding' (AVC) and is published jointly as Part 10 of MPEG-4 and ITU-T Recommendation H.264 [2].

1.1 Purpose of the Project

A video stream needs to pass through several processing steps in order to encode and decode it such that it is compressed efficiently with the limited hardware and software resources available. Each step can be implemented with different algorithms to accomplish the required task. The advantages and disadvantages of the available algorithms should be known in order to implement a codec that meets the final requirement. The purpose of this project is to implement all basic building blocks of an H.264 video encoder and decoder.
1.2 Significance of the Project

The significance of the project is the inclusion of all components required to encode and decode a video in Matlab. This project contains several algorithms for inter frame prediction. Inter frame prediction predicts the position of a macro block within the current frame by taking past and future frames as reference. Along with predicting a macro block's position, the challenge is processing the huge amount of data in each frame. This is because usage of high definition video is increasing widely. High definition videos are captured at rates of several frames per second, and such videos have a large number of macro blocks in each frame. The components designed in Matlab for this project will be helpful when implementing them in hardware, by giving information on factors like complexity and performance.

1.3 Organization of Report

Chapter two contains brief information on all basic components of a video encoder and decoder. First, the top-level diagram of the encoder module is explained, including the use of inter frame prediction, intra frame prediction and motion compensation inside the encoder to encode video information. Then the top-level diagram of the decoder module is explained, including each component inside the decoder used to decode video information.

The residual image is compressed with image compression concepts. Chapter three contains information on the components involved in image compression and decompression. First, the discrete cosine transform is explained, which transforms a given image into the frequency domain. Because of that, frequencies containing more information and those containing less information are separated.

Chapter four contains basic information on inter frame prediction. Using inter frame prediction, the location of a macro block in the current frame is matched with a macro block of a past or future frame. Each macro block in the current frame is compared with several macro blocks of a past or future frame to find the best match. Because of that, inter frame prediction is the most computational part of video encoding.

Various algorithms used for inter frame prediction are explained in Chapter five. The exhaustive search algorithm is the most basic inter frame prediction algorithm. Implementing this algorithm is simple and the PSNR achieved by it is very high, but the number of calculations required is very large. Later, the adaptive rood pattern search algorithm is explained. It has a much lower computational requirement compared to the other algorithms, yet the PSNR achieved by it is very close to that of the exhaustive search algorithm.

Chapter six contains simulation results obtained using exhaustive search, diamond search and adaptive rood pattern search. For the comparison, the computational complexity involved in each algorithm and the peak signal to noise ratio were compared.

Chapter 2
BASIC COMPONENTS OF H.264 ENCODER AND DECODER

The intention of video compression algorithms is to remove different types of redundant information from a video. Video compression algorithms operate by removing redundancy in the temporal, spatial and/or frequency domains. Algorithms are implemented in such a way that redundant information is removed before sending the video information. Because of that, the amount of space required to save a video is largely reduced. On the decoder side, with the help of the algorithms used, the redundant information is reconstructed from the available compressed data.

2.1 Types of Redundancy

Spatial redundancy refers to redundancy available within a picture.
Notice the spatial redundancy available in the following figure [4]:

Figure 2-1: Spatial redundancy example

In the figure above, notice the area surrounded by the black border. In this area there is no significant change in information. In this case, space can be saved by sending the information of a very small part of the block and then indicating that the same information should be used for the remaining area inside the black border.

Temporal redundancy refers to redundancy between multiple frames. Multiple frames are captured every second to form a video. In normal video capture, objects are not moving rapidly, and not all objects in a frame move between frames located near each other. The following figure shows two consecutive frames:

Figure 2-2: Temporal redundancy

The two figures above are consecutive. Notice that there is not much difference between them; in other words, almost all objects are steady between the two frames. In this case, while sending the second frame, we can send only the information relating it to the previous frame and save a huge amount of information that would otherwise need to be transferred or saved. The following figure shows the difference of the above two frames computed in Matlab. Notice that there is very little movement between the two frames. With the help of the inter frame prediction technique described in Chapter 4, the movement related information is transferred using a motion vector for each macro block. On the receiver end, with the help of a reference frame and the motion vector of each macro block, the other frame is reconstructed. By sending only motion vector related information, a large amount of data that would need to be transferred or saved for a frame is saved.

Figure 2-3: Difference between consecutive frames

The human eye and brain (the Human Visual System) are more sensitive to lower frequencies [3]. Because of that, even if we remove high frequency information, an image is still recognizable. The following figure shows the discrete cosine transform of the frame shown in Figure 2-1:

Figure 2-4: Transform distribution of the image shown in Figure 2-1

Notice that the image is 190x256 pixels. After the transform, the information is concentrated towards the center. Because of this characteristic of concentration in the lower frequencies, higher frequency information is removed from a frame during compression to achieve a higher compression ratio. In video compression, by removing the different types of redundancy (spatial, frequency and/or temporal) it is possible to compress the data significantly at the expense of a certain amount of information loss (distortion). Further compression can be achieved by encoding the processed data using an entropy coding scheme such as Huffman coding or arithmetic coding.

2.2 Encoder Module

An H.264 video encoder mainly comprises inter frame prediction, motion compensation, intra frame prediction, discrete cosine transform, quantization and entropy encoding [1]. Below is a block diagram showing how all the components connect to each other:

Figure 2-5: Block diagram of H.264 encoder (current frame Fn, transform and quantization, entropy encoder, inter prediction, motion compensation, reference frame F'n-1, intra prediction with mode choice, filter, and inverse transform and quantization feedback)

As shown in the figure, intra frame prediction is done in almost the same way as image compression. Notice the feedback path in the figure for intra prediction. The feedback is there to prevent a mismatch at the decoder side compared to the encoder side [5]. Intra predicted frames are known as I-frames.
Inter frame prediction is done based on the current frame and a reference frame. The reference frame can be a past frame or a future frame, and inter frame prediction can be done by taking one or more frames as reference. Based on that, an inter predicted frame can be a P-frame or a B-frame. If a frame is predicted from one reference frame, the predicted frame is known as a P-frame. On the other hand, if a frame is predicted from more than one reference frame, the predicted frame is known as a B-frame. Inter prediction gives a motion vector (MV) for each macro block (MB). The motion vector for a macro block gives the location of the macro block in the current frame with respect to its location in the reference frame.

The motion compensation block generates a compensated image by taking the motion vector information of the current frame and the reference frame (or frames) as input. The frame generated by this block is almost the same as the current frame. That is because the motion vector for the current frame tells the motion compensation block from which position of the reference frame to take each macro block to place in the current frame. The motion compensated image is not exactly the same as the current frame. Because of that, a residual image is generated from the current frame and the compensated image. The following example shows how the residual image helps to regenerate the current frame at the decoder end, using only the MVs of the current frame and the reference frame:

Cur_fr  = Current frame
Ref_fr  = Reference frame
Comp_fr = Compensated frame
Res_fr  = Residual frame

MV      = motion_est (Cur_fr, Ref_fr)
Comp_fr = motion_comp (MV, Ref_fr)
Res_fr  = Cur_fr - Comp_fr

Notice that the decoder already has Ref_fr and MV, so it can generate Comp_fr. Ref_fr is recovered at the decoder end after passing the input through the entropy decoding and inverse discrete cosine transform stages. From the residual frame and the compensated frame, the current frame is regenerated as follows:

Cur_fr = Res_fr + Comp_fr

Notice that prior to sending the residual frame, it is discrete cosine transformed, quantized and entropy encoded. These stages are similar to JPEG compression. By transforming a frame, low frequency information and high frequency information are separated from each other. Quantization allows more low frequency information to pass compared to high frequency information. Finally, the M x N frame is converted to a serial stream by a zigzag pattern and entropy encoded.

2.2.1 Intra Frame Prediction

Intra frame prediction refers to prediction of a macro block with respect to the position of another macro block within the same frame. The following is a block diagram showing the major components involved in intra frame prediction:

Figure 2-6: Intra frame prediction block diagram (current frame Fn, intra prediction for search and for coding, mode decision, transform and quantization, inverse transform and quantization, reconstructed frame, entropy encoder)

Intra frame prediction is done based on mode selection. Nine modes are defined for intra frame prediction in H.264. In practice, these intra prediction modes take the correlation between neighboring macro blocks into consideration. During intra prediction mode selection, the prediction cost of each mode is computed and compared, and the mode with the least cost is selected in video coding [5].
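As a concrete illustration of this mode decision, the following is a minimal Matlab sketch. It is a sketch under assumptions, not the project's actual code: it uses stand-in random data, a 4x4 block, SAD as the cost, and only three of the nine H.264 modes (vertical, horizontal and DC).

    % Minimal sketch of cost-based intra mode selection (illustrative only).
    % A 4x4 block is predicted from its reconstructed top row and left column
    % with three of the nine H.264 modes; the mode with the least SAD wins.
    blk  = double(randi(255, 4, 4));    % current 4x4 block (stand-in data)
    top  = double(randi(255, 1, 4));    % reconstructed row above the block
    left = double(randi(255, 4, 1));    % reconstructed column left of the block

    pred = cell(1, 3);
    pred{1} = repmat(top, 4, 1);                   % mode 0: vertical
    pred{2} = repmat(left, 1, 4);                  % mode 1: horizontal
    pred{3} = ones(4) * round(mean([top left']));  % mode 2: DC

    cost = zeros(1, 3);
    for m = 1:3
        cost(m) = sum(sum(abs(blk - pred{m})));    % SAD cost of this mode
    end
    [bestCost, bestMode] = min(cost);              % least-cost mode is selected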
2.2.2 Inter Frame Prediction

Inter frame prediction refers to prediction of the position of a macro block with respect to the position of a macro block in a reference frame. There can be one or more reference frames for predicting the position of a macro block of the current frame. Based on that, an inter predicted frame can be a P-frame or a B-frame: if only one frame is used as reference, the predicted frame is known as a P-frame; otherwise it is known as a B-frame. The following figure shows an example of prediction done using a single reference frame:

Figure 2-7: Prediction using one reference frame

Notice that the object in the reference frame has simply moved to another position in the current frame. In inter prediction, only the displacement information is passed for the current frame, which is known as a P-frame.

Figure 2-8: Prediction using two reference frames

As shown in the figure above, the light blue object has moved less with respect to the past frame than with respect to the future frame. Because of that, the position of the blue object is predicted with respect to its position in the past frame. The square object has moved less with respect to the future frame than with respect to the past frame. Because of that, the square object is predicted with respect to its position in the future frame. As the current frame is predicted from a past frame as well as a future frame, the predicted frame is known as a B-frame.

A frame is divided into macro blocks. After that, each macro block is searched for in a nearby search area of the reference frame. The computation required to do this prediction is very large, and several algorithms have been developed to reduce the required computations. Chapter 5 contains the algorithms used for inter frame prediction.

2.2.3 Motion Compensation

In motion compensation, a compensated image is generated by taking the reference image and the motion vectors of the current frame as input [6]. The motion compensation block has the motion vector for each macro block. The motion vector indicates from where in the reference frame to take a macro block for the current frame. The following example shows how a macro block for the current frame is taken from the reference frame with the help of a motion vector:

MV (i, j) = (p, q)                            (2.1)
MBCF (i, j) = MBRF (i+p, j+q)                 (2.2)

where MV is the motion vector of a macro block, MBRF is the macro block at the given location in the reference frame and MBCF is the macro block at the given location in the current frame.

Figure 2-9: Compensated frame

The figure above was generated from the motion vectors of the current frame and the reference frame. Notice that the compensated frame almost matches the current frame. The difference between the compensated frame and the current frame is transmitted so that the decoder can regenerate the current frame with the help of the motion vectors for the frame and the reference frame.

2.2.4 Residual Frame

The compensated image is not exactly equal to the current frame. Because of that, the compensated image is subtracted from the current frame and the difference is sent to the decoder. The decoder can already construct the compensated frame from the motion vectors and the reference frame; it then adds the residual frame to the compensated frame to regenerate the current frame. The following figure shows a residual frame:

Figure 2-10: Residual frame

Notice that the difference between the compensated frame and the current frame is very small. In the figure above, the black area indicates no difference between the compensated frame and the current frame.
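To make equations (2.1) and (2.2) and the residual relationship concrete, the following is a minimal Matlab sketch of per macro block motion compensation and residual generation. It is illustrative only: the frames are stand-in random data, the motion vectors are all zero to keep every index in range, and a real implementation must clip r+p and c+q to the frame borders.

    % Minimal sketch of motion compensation and residual generation for one
    % 64x64 frame with 16x16 macro blocks. mv holds one (p, q) offset per MB.
    ref = double(randi(255, 64, 64));   % reference frame (stand-in data)
    cur = double(randi(255, 64, 64));   % current frame (stand-in data)
    mb  = 16;
    [R, C] = size(cur);
    mv  = zeros(R/mb, C/mb, 2);         % all-zero MVs keep indices in range

    comp = zeros(R, C);
    for bi = 1:R/mb
        for bj = 1:C/mb
            r = (bi-1)*mb + (1:mb);     % row range of this macro block
            c = (bj-1)*mb + (1:mb);     % column range of this macro block
            p = mv(bi, bj, 1);
            q = mv(bi, bj, 2);
            comp(r, c) = ref(r+p, c+q); % MBCF(i, j) = MBRF(i+p, j+q), eq. (2.2)
        end
    end
    res = cur - comp;                   % residual frame sent to the decoder
    rec = comp + res;                   % decoder side: regenerates cur exactly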
The following is a 3D mesh plot of the frame, giving more information about the difference between the compensated frame and the current frame:

Figure 2-11: Residual frame 3D mesh plot

In the figure above, notice that throughout the frame the difference between the compensated frame and the current frame is negligible. Because of this small difference throughout the frame, huge spatial redundancy is available in the residual frame.

2.2.5 Residual Frame Encoding

The residual frame is in M x N format. As the residual frame contains the difference between the compensated frame and the current frame, it has huge spatial redundancy. This redundancy is removed in the frequency domain by taking the discrete cosine transform of the residual frame. Finally, to transmit or save the encoded data, it is converted to a serial stream and entropy encoded.

2.3 Decoder Module

The decoder regenerates the residual frame by entropy decoding the input serial stream and taking the inverse discrete cosine transform. The decoder also receives the motion vector for each macro block of the current frame. Based on the available information, the decoder is able to generate the current frame at the decoder side. The following figure shows the basic building components of a decoder module. Notice that, except for the motion estimation component, the decoder has all the same components as the encoder module. Because there is no motion estimation part in the decoder, its implementation is relatively simple compared to the encoder module.

Figure 2-12: Block diagram of H.264 decoder (input stream, entropy decoder, inverse transform and quantization, motion compensation from reference frame F'n-1, intra prediction, filter, reconstructed frame F'n)

The residual frame, in the form of a serial stream, is received as input to the decoder. As shown in the block diagram, the input stream is first entropy decoded, which recovers the serially compacted information. Similar to entropy coding, entropy decoding can be considered a two-step process. The first step converts the input bit stream into intermediate symbols. The second step converts the intermediate symbols into the quantized DCT coefficients.

The quantized DCT coefficient information is de-quantized with the same algorithm used to quantize the DCT coefficients. The de-quantized DCT coefficients are passed through the inverse discrete cosine transform to get the residual image back. The process is similar to the decompression method used to decompress an image, and is described in more detail in Chapter 3. If a received frame is a P-frame or a B-frame, the compensated frame is generated from the motion vectors associated with the frame, taking a past frame as reference. The compensated frame is added to the residual frame to reconstruct the current frame. If a received frame is an I-frame, the frame is reconstructed with the help of the intra frame prediction component.
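The following is a minimal Matlab sketch of this decoder-side order of operations for one 8x8 block. It is a sketch under assumptions: entropy decoding is taken as already done, the quantization matrix is the example of Table 3-4, rounding is used as the quantizer (the tables in Chapter 3 truncate instead), the data is a stand-in, and the Image Processing Toolbox functions dct2 and idct2 stand in for the transform of Chapter 3.

    % Minimal sketch of decoder-side reconstruction of one 8x8 residual block.
    % qcoef plays the role of the entropy-decoded, quantized DCT coefficients.
    Q = 3 + 2*repmat((0:7)', 1, 8) + 2*repmat(0:7, 8, 1);  % Table 3-4 matrix

    blk   = double(randi(255, 8, 8));   % stand-in block at the encoder side
    qcoef = round(dct2(blk) ./ Q);      % what the encoder would have sent

    coef = qcoef .* Q;                  % de-quantize with the same matrix
    res  = idct2(coef);                 % inverse DCT recovers the residual
    % For a P- or B-frame this residual is added to the motion compensated
    % frame; for an I-frame it is added to the intra prediction instead.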
Chapter 3
IMAGE COMPRESSION COMPONENTS

In the block diagram of the H.264 video encoder shown in Figure 2-5, notice that the components towards the end of the chain match the components involved in image compression. These components are re-drawn in the following figure in more detail:

Figure 3-1: Image compression components (input frame, breaking into 8x8 blocks, DCT of each block, quantization, zigzag scan, entropy encoding, output serial stream)

As shown in the figure above, the input image is first divided into 8x8 blocks. The discrete cosine transform is applied to each block. By doing that, time domain information is transferred into the frequency domain. In this frequency domain representation, low frequency information is separated from high frequency information. An 8x8 quantization matrix is formed to quantize an 8x8 discrete cosine transformed block. Human eyes are more sensitive to low frequency components compared to high frequency components [3]. Because of that, the quantization matrix is implemented in such a way that low frequency components are quantized less than high frequency components. By zigzag patterning, the 8x8 blocks are converted to a serial stream ordered from the lowest frequency information to the highest frequency components. Finally, the serial data is entropy encoded to compress the serial stream without losing any information.

3.1 Transformation

A discrete cosine transform (DCT) is a Fourier-related transform. It is similar to the discrete Fourier transform but without imaginary numbers. In a DCT, a sequence of numbers is represented in terms of a sum of cosine functions [7]. There are different types of discrete cosine transform; among them, the type-II DCT is used in applications like image and video compression [1]. The DCT of an NxN block can be expressed as follows:

Y = AXB

In the above equation, A is the discrete cosine transform matrix, X is the block of samples to be transformed, B is the transpose of matrix A, and Y is the transformed matrix. The inverse DCT of an NxN block can be expressed as:

X = BYA

Notice that both the forward and inverse discrete cosine transforms use the same discrete cosine transform matrix. This matrix is a constant which is shared by all 8x8 blocks of a frame. The elements of the discrete cosine transform matrix are given by the following equation:

A(i, j) = c(i) cos[ (2j + 1) i pi / (2N) ],  i, j = 0, 1, ..., N-1       (3.1)

where

c(i) = sqrt(1/N) for i = 0, and c(i) = sqrt(2/N) for i > 0               (3.2)

The following is the discrete cosine transform matrix used for an 8x8 pixel frame block:

 0.353553  0.353553  0.353553  0.353553  0.353553  0.353553  0.353553  0.353553
 0.490393  0.415818  0.277992  0.097887 -0.097106 -0.277329 -0.415375 -0.490246
 0.461978  0.191618 -0.190882 -0.461673 -0.462282 -0.192353  0.190145  0.461366
 0.414818 -0.097106 -0.490246 -0.278653  0.276667  0.490710  0.099448 -0.414486
 0.353694 -0.353131 -0.354256  0.352567  0.354819 -0.352001 -0.355378  0.351435
 0.277992 -0.490246  0.096324  0.416700 -0.414486 -0.100228  0.491013 -0.274673
 0.191618 -0.462282  0.461366 -0.189409 -0.193822  0.463187 -0.460440  0.187195
 0.097887 -0.278653  0.416700 -0.490862  0.489771 -0.413593  0.274008 -0.092414

Table 3-1: 8x8 discrete cosine transform matrix

The above set of 64 numbers is created by multiplying a horizontally oriented set of one-dimensional 8-point cosine basis functions by a vertically oriented set of the same functions. In the table above, horizontal frequencies are represented by the horizontal set of coefficients, while vertical frequencies are represented by the vertical set of coefficients.
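The following is a minimal Matlab sketch of building the matrix of equations (3.1) and (3.2) for N = 8 and applying the separable transform Y = AXB of this section; the block X is stand-in data.

    % Minimal sketch: build the 8x8 DCT matrix A of equation (3.1) and apply
    % the separable transform of this section to one block.
    N = 8;
    A = zeros(N);
    for i = 0:N-1
        for j = 0:N-1
            if i == 0
                c = sqrt(1/N);      % c(i) for i = 0, equation (3.2)
            else
                c = sqrt(2/N);      % c(i) for i > 0, equation (3.2)
            end
            A(i+1, j+1) = c * cos((2*j + 1) * i * pi / (2*N));  % equation (3.1)
        end
    end
    X  = double(randi(255, N, N));  % stand-in 8x8 block of samples
    Y  = A * X * A';                % forward DCT, Y = AXB with B = A'
    Xr = A' * Y * A;                % inverse DCT, X = BYA, recovers X

The first row of A computed this way is the constant 0.353553 = sqrt(1/8), matching the first row of Table 3-1.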
The following tables show a block of 8x8 pixels and its discrete cosine transform, respectively:

140 144 152 168 162 147 136 148
144 152 155 145 148 167 156 155
147 140 136 156 156 140 123 136
140 147 167 160 148 155 167 155
140 140 163 152 140 155 162 152
155 148 162 155 136 140 144 147
179 167 152 136 147 136 140 147
175 179 172 160 162 162 147 136

Table 3-2: 8x8 pixel block

186   21  -10   -8   -3    4    9    0
-18  -34  -24   -5   10   -2    1   -8
 15   26   -2   14    8  -18   -3   -2
 -9   -9    6  -15    1    8    4    2
 23  -11  -18   -8  -11    8   -1    1
 -9   11    3   -3   18   -4   -7    4
-14   14  -20   -3   18    1   -1   -6
 19    7   -1    8   15   -7   -2    0

Table 3-3: DCT of the block shown in Table 3-2

In the transformed matrix, low frequency information is concentrated towards the top left. Notice how the larger values are concentrated there in the table above.

3.2 Quantization

Each element of an 8x8 pixel block occupies 8 bits of memory. Each element of the discrete cosine transformed matrix occupies more memory than in the original block: the value of an element of the transformed matrix can range from a low of -1024 to a high of 1023, so 11 bits of memory are required to store each element. Human eyes are more sensitive to lower frequencies compared to higher frequencies [3]. A huge amount of memory can be saved by quantizing the high frequency information of a discrete cosine transformed matrix. A quantization matrix is implemented in such a way that low frequency information is quantized less than high frequency information. By doing quantization, the number of bits required to save an element is reduced without much compromise in distortion of the lower frequency components. The following is an example of a quantization matrix:

 3  5  7  9 11 13 15 17
 5  7  9 11 13 15 17 19
 7  9 11 13 15 17 19 21
 9 11 13 15 17 19 21 23
11 13 15 17 19 21 23 25
13 15 17 19 21 23 25 27
15 17 19 21 23 25 27 29
17 19 21 23 25 27 29 31

Table 3-4: Quantization matrix example

Notice that with a larger quantization value, a larger error can be generated in the DCT output during de-quantization. As errors in high frequency components have less effect on human eyes, an almost identical frame is seen after de-quantizing and inverse transforming a block. The following tables show a DCT matrix before quantization and after de-quantization:

92 -39 -84 -52 -86 -62 -17 -54
 3 -58  62 -36 -40  65  14  32
-9  12   1 -10  49 -12 -36  -9
-7  17 -18  14  -7  -2  17  -9
 3  -2   3 -10  17   3 -11  22
-1   2   4   4  -6  -8   3   0
 0   4  -5  -2  -2  -2   3   1
 2   2   5   0   5   0  -1   3

Table 3-5: DCT matrix before quantization

90 -35 -84 -45 -77 -52 -15 -51
 0 -56  54 -33 -39  60   0  19
-7  -9   0   0  45   0 -19   0
 0  11 -13   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0

Table 3-6: De-quantized DCT matrix

Even though there is a difference between the DCT matrix before quantization and after de-quantization, this difference is not visible to human eyes. Also, notice that the discrete cosine transformed matrix is divided by numbers that occupy 4 bits. Because of that, only seven bits are required to save each element of the quantized DCT matrix, compared to the eleven bits required for each element of the un-quantized DCT matrix.
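Before entropy encoding (Section 3.3), the quantized block is serialized by the zigzag scan and, in this project's encoder, run-length encoded (see the abstract). The following is a minimal Matlab sketch of both steps; the zigzag order is derived here by sorting anti-diagonals, and the block data is a stand-in.

    % Minimal sketch of zigzag scanning an 8x8 quantized block into a serial
    % stream and run-length encoding it as (value, run length) pairs.
    N = 8;
    qblk = zeros(N);                    % stand-in quantized block, mostly zero
    qblk(1, 1:4) = [90 -35 -84 -45];    % a few nonzero low-frequency values

    [cIdx, rIdx] = meshgrid(1:N, 1:N);
    dIdx = rIdx + cIdx;                 % anti-diagonal index of each element
    tie  = rIdx;                        % odd diagonals: scan rows downward
    tie(mod(dIdx, 2) == 0) = -rIdx(mod(dIdx, 2) == 0);  % even ones: upward
    [~, order] = sortrows([dIdx(:) tie(:)]);
    zz = qblk(order);                   % low-to-high frequency serial stream

    changes = [true; diff(zz) ~= 0];    % start position of each run
    vals    = zz(changes);              % run values
    runs    = diff([find(changes); numel(zz) + 1]);  % run lengths
    % vals/runs now encode the stream; long zero runs shrink to single pairs.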
3.3 Entropy Encoding and Decoding

Entropy encoding refers to compressing information without losing any information from the source. In H.264, two types of entropy encoding are used: context-based adaptive variable length coding (CAVLC) and context-based adaptive binary arithmetic coding (CABAC). In context-based adaptive variable length coding, an input 4x4 block is first converted to a serial stream using zigzag patterning. The serial data is encoded in the form of different parameters [6]. To send each type of parameter, the input serial stream is passed through an algorithm to decide the parameter value. The following parameters are transmitted to represent the serial information in a decreased number of bits:

1. Number of nonzero coefficients (numCoef) and trailing ones (T1)
2. The pattern of trailing ones (T1)
3. The non-zero coefficients (Levels)
4. Number of zeros embedded in the non-zero coefficients (Total Zeros)
5. The location of those embedded zeros (run_before)

Chapter 4
INTER FRAME PREDICTION BASICS

Inter frame prediction refers to prediction done by taking one or more other frames as reference. By doing inter frame prediction, temporal redundancy between consecutive frames is removed. For doing that, each frame is divided into 16x16, 8x8 or 4x4 macro blocks. In H.264, macro blocks can be of variable size [1]. A larger macro block size is used for portions of a picture that are continuous over a given area. To capture small changes in a picture from one frame to another, a smaller macro block size is used.

Figure 4-1: A 16x16 macro block breaking (into 16x8, 8x16 or 8x8 partitions)

As shown in Figure 4-1 above, a 16x16 macro block can be further divided into 16x8, 8x16 or 8x8 macro blocks. To send a 16x16 macro block, only one motion vector is required. If a macro block is further divided into sub-macro blocks, then one motion vector is required to give the displacement of each sub-macro block. In the above case, two motion vectors are required for a 16x16 macro block with 16x8 or 8x16 sub-macro block breaking. In the case of processing a 16x16 macro block with 8x8 sub-macro blocks, four motion vectors are required to give the position of one macro block. Thus there is a tradeoff between precision of movement and the computation required to achieve that precision. To get higher precision of movement, a macro block needs to be further divided into multiple sub-macro blocks, and in that case multiple motion vectors are required to locate the position of one macro block, which requires more calculation.

Figure 4-2: Area of search for a macro block (macro block mb-x and its search window)

Each macro block of the current frame is searched for within a small area surrounding the position of the current macro block in a reference frame, to find the best match for its location. In a video capture, multiple frames are captured per second, so there is not much movement of an object from one consecutive frame to the next. Because of that, a macro block of the current frame is not searched for over the entire reference frame. Refer to Figure 4-2 for an example. Notice that mb-x of the current frame is searched for within the dotted area of the reference frame. The search for a matching mb-x in the reference frame starts from the top-left corner of the dotted box shown in the figure. The SAD (sum of absolute differences) or MAD (mean of absolute differences) is calculated at that point and saved in a temporary variable associated with the search point. After that, mb-x is moved within the dotted box of the reference frame, and the SAD or MAD at each point is calculated and saved in another temporary variable associated with that search point.
The SAD and MAD are defined by the following equations:

SAD = sum(i = 1..N) sum(j = 1..M) |Cij - Rij|                          (4.1)

MAD = (1 / (N*M)) sum(i = 1..N) sum(j = 1..M) |Cij - Rij|              (4.2)

where N is the number of rows, M is the number of columns, Cij is a pixel of the current frame and Rij is the corresponding pixel of the reference frame. The motion vector for mb-x is taken as the search point where the SAD or MAD is minimum compared to all other search points. The following figure is an example of a motion vector found at location (-1, -1) in the reference frame relative to the location of mb-x in the current frame:

Figure 4-3: Motion vector for a macro block

Notice that the macro block at the location indicated by the macro block of the current frame does not match exactly inside the reference frame. A frame generated from the calculated motion vectors and the reference frame is known as a compensated frame. As the compensated frame does not exactly match the current frame, the difference between the current frame and the compensated frame is transferred to the decoder; this is known as the residual frame. The PSNR (peak signal to noise ratio) between the current frame and the compensated frame can be calculated by the following equation:

PSNR = 10 log10 (255^2 / MSE)                                           (4.3)

where

MSE = (1 / (N*M)) sum(i = 1..N) sum(j = 1..M) (Cij - Rij)^2            (4.4)

To adopt an appropriate algorithm, the PSNR between the compensated frame and the current frame is calculated for the different algorithms. Based on the complexity involved in an algorithm and the average PSNR associated with it, the appropriate algorithm is chosen.
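The following is a minimal Matlab sketch of equations (4.1) through (4.4) written as anonymous functions. The macro blocks are stand-in random data, and the PSNR function is named psnr_db here only to avoid shadowing any toolbox function of the same name.

    % Minimal sketch of the block matching cost measures and PSNR.
    sad = @(C, R) sum(abs(C(:) - R(:)));              % equation (4.1)
    mad = @(C, R) mean(abs(C(:) - R(:)));             % equation (4.2)
    mse = @(C, R) mean((C(:) - R(:)).^2);             % equation (4.4)
    psnr_db = @(C, R) 10 * log10(255^2 / mse(C, R));  % equation (4.3)

    cur = double(randi(255, 16, 16));   % stand-in current macro block
    ref = double(randi(255, 16, 16));   % stand-in reference macro block
    fprintf('SAD %.0f, MAD %.2f, PSNR %.2f dB\n', ...
            sad(cur, ref), mad(cur, ref), psnr_db(cur, ref));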
Chapter 5
INTER FRAME PREDICTION ALGORITHMS

Finding a motion vector for a macro block of the current frame with respect to a macro block in the reference frame is the most computationally expensive task in any video encoding [8]. Several algorithms are available for inter frame motion estimation, providing a range of computational complexity and PSNR. In this project, four algorithms were studied in detail. Of these, exhaustive search is a very simple search algorithm that provides the best PSNR [8] among all available search algorithms, but it has the maximum computational complexity. The new three step search and diamond search algorithms have less computational complexity at some cost in PSNR. Finally, the adaptive rood pattern search algorithm is discussed, which has less computational complexity than the new three step search and diamond search, and a higher PSNR than the three step search and diamond search. The following figure shows the current macro block and the search block within which the current macro block is searched for in the reference frame during motion estimation. Note that the same process is repeated for each macro block of the current frame.

Figure 5-1: Search block and current macro block (current MB of size m x m pixels with search margin P on each side)

In the figure above, the current MB is a macro block of the current frame of size m x m pixels, while P is the search margin starting from the position of the current MB in the reference frame. Comparison between the algorithms is done on two factors: cost function and PSNR. The PSNR is calculated after a motion vector is found for the current MB; it is explained in more detail in Chapter 4. The cost function counts the number of calculations required to complete the search for the motion vector of a macro block. Depending on the algorithm used, the cost can be different for different macro blocks. The total motion estimation cost for a frame can be written as:

total cost per frame = (M/m)^2 x (cost of one macro block search)      (5.1)

where M is the number of rows or columns of a frame and m is the size of a macro block. In high definition video, the value of M is very large. Notice that for a larger value of m, fewer macro block searches are required. Because of that, a large value of m is used in areas of a picture in which the positions of pixels in a nearby area are not changing from one frame to another. The cost of one MB search depends on the algorithm used. P is the search range within which the best match is calculated. A higher value of P leads to more computations during the search for the motion vector of each macro block. Because of that, in areas of a picture where there is little movement of an object from one frame to another, the value of P is kept small.

5.1 Exhaustive Search

The exhaustive search algorithm calculates the cost function at each possible location in the search window. Because of that, this algorithm is also known as full search. By finding the cost function at each possible position in the reference frame, the best possible match for the current macro block is found within the reference frame. Searching at the maximum number of points in the search area leads to the highest PSNR of any block matching algorithm. Fast algorithms try to achieve a PSNR matching that of exhaustive search with reduced computational complexity. The following equation shows the computations required in exhaustive search:

cost of one macro block search = (2P + 1)^2 x m^2                      (5.2)

where m is the length of the current macro block and P is the search range, assuming that the macro block has equal height and width. Note that the m^2 term in the above equation refers to the calculations required to compare the current macro block of the current frame with one candidate macro block of the reference frame within the searching range. That term is common to any algorithm used. Special purpose processors are used to compare one macro block with another in parallel [8]; by doing that, the computation time required for one comparison is reduced. Notice that the (2P + 1)^2 term, the number of search points, is the main factor making the cost of exhaustive search very large. Fast search algorithms are implemented in such a way that this factor is reduced, minimizing the overall cost of the motion vector search for each macro block of the current frame.
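The following is a minimal Matlab sketch of exhaustive search for one macro block: every offset (p, q) in the +/-P window is tried and the offset with the minimum SAD becomes the motion vector. Frame data, frame size and block position are stand-ins.

    % Minimal sketch of exhaustive (full) search for one 16x16 macro block.
    P  = 7;  mb = 16;
    ref = double(randi(255, 64, 64));   % stand-in reference frame
    cur = double(randi(255, 64, 64));   % stand-in current frame
    r0 = 17;  c0 = 17;                  % top-left corner of the macro block
    blk = cur(r0:r0+mb-1, c0:c0+mb-1);

    best = inf;  mv = [0 0];
    for p = -P:P
        for q = -P:P
            r = r0 + p;  c = c0 + q;
            if r < 1 || c < 1 || r+mb-1 > size(ref,1) || c+mb-1 > size(ref,2)
                continue;               % candidate falls outside the frame
            end
            cost = sum(sum(abs(blk - ref(r:r+mb-1, c:c+mb-1))));  % SAD
            if cost < best
                best = cost;  mv = [p q];   % keep the minimum-cost offset
            end
        end
    end
    % mv now holds the motion vector; (2P+1)^2 = 225 points were examined.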
5.2 Three Step Search

The three step search algorithm is one of the earliest attempts to calculate motion estimation quickly [9]. General information about the algorithm is shown in the figure below:

Figure 5-2: Basic three step search (first, second and third step points)

As shown in the figure above, for P = 7 the initial step size is taken as 4. With a step size of four, the current macro block is compared with the reference frame at the nine positions shown in the figure. At the end of the first step, the best of the nine points is taken as the center for the second step. The best point is the point where the MAD is minimum compared to the eight other search points. For the second step, the step size is reduced to two and the current macro block is again compared with the reference frame at nine points. From the best point found in step two, the search is done again at nine points with a step size of one.

Notice how drastically the number of points required for motion estimation decreases in the basic three step search. For P = 7, the number of comparisons required in exhaustive search is 225; in the three step search it is reduced to 25. The 25 searches break down as follows:

9 points: during the first step
8 points: during the second step
8 points: during the third step

During the first step of the three step search, the cost function at the best matching point is already calculated. Because of that, a search at only eight points is required in the second step. The same reasoning applies to the third step. The TSS uses a uniformly allocated checking pattern for motion detection; we can say that because the step size for searching decreases by half at every step. The three step search always stops after 25 search points, irrespective of the PSNR obtained with the motion vector of the last step. Because of that, it is prone to missing small motions [9].

The new three step search (NTSS) algorithm is an improved method compared to the original three step search algorithm. In this algorithm, center biased searching is provided by adding an extra eight search points to the original three step search method. It was widely used with earlier standards like MPEG-1 and H.261. The following figure shows an example of how a motion vector is calculated in the new three step search:

Figure 5-3: New three step search (outer and inner first-step points, 5-point second step, 3-point third step)

As shown in the figure above, in the new three step search the current macro block is searched at 17 places in the first step. If, among those 17 cost functions, the minimum cost is found at the center, then the search for the motion vector stops at that point. Otherwise, the ordinary three step method is continued by dividing the search step by two and calculating cost functions at the required points. The average number of search points can be estimated by the following equation, consistent with the step counts described above:

N = 17 x P1 + 25 x P2 + 33 x P3                                        (5.3)

where P1 is the probability of finding a good match in the 1st step, P2 is the probability of finding a good match in the 2nd step and P3 is the probability of finding a good match in the 3rd step. Note that in the new three step search method, searching continues until the best cost function is found at the center. Because of that, in the best case the motion vector can be found in 17 search points; on the other hand, it can take up to 33 search points if the minimum cost is not found at the center. The following table shows the average computations required in exhaustive search, three step search and new three step search:

             ES    TSS   NTSS
Board move   137   25    18
Taxi move    165   25    21

Table 5-1: Average cost in ES, TSS and NTSS

In the table above, notice that the number of calculations required in TSS and NTSS is much less than in ES. The peak signal to noise ratio of these methods is, however, lower than that of ES. The PSNR difference is explained in more detail in section 5.4.
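The following is a minimal Matlab sketch of the basic three step search for one macro block, with the same stand-in frame setup as the exhaustive search sketch above; the step size halves from 4 to 2 to 1, examining nine points (eight plus the carried-over center) per step.

    % Minimal sketch of the basic three step search for one 16x16 macro block.
    mb = 16;
    ref = double(randi(255, 64, 64));   % stand-in reference frame
    cur = double(randi(255, 64, 64));   % stand-in current frame
    r0 = 17;  c0 = 17;
    blk = cur(r0:r0+mb-1, c0:c0+mb-1);
    sadAt = @(p, q) sum(sum(abs(blk - ref(r0+p:r0+p+mb-1, c0+q:c0+q+mb-1))));

    mv = [0 0];
    step = 4;                           % 4 -> 2 -> 1: three steps for P = 7
    while step >= 1
        best = inf;  bestMv = mv;
        for dp = [-step 0 step]
            for dq = [-step 0 step]
                p = mv(1) + dp;  q = mv(2) + dq;
                if r0+p < 1 || c0+q < 1 || ...
                   r0+p+mb-1 > size(ref,1) || c0+q+mb-1 > size(ref,2)
                    continue;           % point falls outside the frame
                end
                cost = sadAt(p, q);
                if cost < best
                    best = cost;  bestMv = [p q];  % best of the nine points
                end
            end
        end
        mv = bestMv;                    % recenter on the best point
        step = step / 2;
    end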
5.3 Diamond Search

Diamond search algorithms provide PSNR almost similar to the exhaustive search algorithm [10]. In this algorithm, computation continues until the minimum cost function is found at the center of a diamond. The following are the two fixed search patterns used in the diamond search algorithm:

Figure 5-4: LDSP and SDSP

As shown in the figure above, in the case of the LDSP (large diamond search pattern), comparisons are done at nine different places with a step size of two. In the case of the SDSP (small diamond search pattern), comparisons are done at five different places with a step size of one. The area covered by the large diamond search pattern is large; with it, the current macro block is matched against the reference frame over a broader area, with comparatively lower precision than the small diamond search pattern. The following figure shows how both patterns are used together for finding a motion vector:

Figure 5-5: Example of diamond search (four LDSP steps followed by one SDSP step)

The large diamond search pattern is used from the beginning until the minimum cost function is found at the center. Once the minimum cost function is found at the center, the small diamond pattern is used to find the precise location of the motion vector in the reference frame. Because the small diamond search pattern is the last step of diamond search, the PSNR of the diamond search algorithm is almost the same as the PSNR of the exhaustive search algorithm. The diamond search algorithm compares well to NTSS in most conditions as far as the required number of computations is concerned [10]. That is because motion vectors of objects that move faster are caught quickly by the large diamond search pattern, which lowers the computations required to find a motion vector compared to the new three step search algorithm.
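The following is a minimal Matlab sketch of diamond search for one macro block under the same stand-in setup: the LDSP is repeated until its minimum stays at the center, then one SDSP pass refines the result.

    % Minimal sketch of diamond search for one 16x16 macro block.
    mb = 16;
    ref = double(randi(255, 64, 64));   % stand-in reference frame
    cur = double(randi(255, 64, 64));   % stand-in current frame
    r0 = 17;  c0 = 17;
    blk = cur(r0:r0+mb-1, c0:c0+mb-1);
    sadAt = @(p, q) sum(sum(abs(blk - ref(r0+p:r0+p+mb-1, c0+q:c0+q+mb-1))));
    inFrame = @(p, q) r0+p >= 1 && c0+q >= 1 && ...
                      r0+p+mb-1 <= size(ref,1) && c0+q+mb-1 <= size(ref,2);

    ldsp = [0 0; -2 0; 2 0; 0 -2; 0 2; -1 -1; -1 1; 1 -1; 1 1];  % 9 points
    sdsp = [0 0; -1 0; 1 0; 0 -1; 0 1];                          % 5 points
    mv = [0 0];
    while true                          % repeat LDSP until center is best
        best = inf;  bestMv = mv;
        for k = 1:size(ldsp, 1)
            p = mv(1) + ldsp(k,1);  q = mv(2) + ldsp(k,2);
            if inFrame(p, q)
                cost = sadAt(p, q);
                if cost < best, best = cost; bestMv = [p q]; end
            end
        end
        if isequal(bestMv, mv), break; end
        mv = bestMv;
    end
    center = mv;  best = inf;
    for k = 1:size(sdsp, 1)             % one final SDSP refinement
        p = center(1) + sdsp(k,1);  q = center(2) + sdsp(k,2);
        if inFrame(p, q)
            cost = sadAt(p, q);
            if cost < best, best = cost; mv = [p q]; end
        end
    end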
5.4 Adaptive Rood Pattern Search

The algorithms discussed so far follow a fixed pattern to find the motion vector for a given macro block of the current frame. Adaptive rood pattern search is based on a fixed pattern as well as on a natural property of video. Experiments have shown that objects near each other move at the same rate in the same direction [11]. The fast block based motion estimation techniques explained so far use fixed sets of search patterns. These algorithms are implemented based on the assumption that the motion estimation matching error decreases gradually as the searching point moves towards the position where the minimum error is found. The following figure shows the sum of absolute differences at different points for a macro block of the current and reference frames shown earlier:

Figure 5-6: Matching error surface

In the figure above, notice that the sum of absolute differences decreases gradually towards the point where the error is minimum compared to all other points. The mesh plot showing the sum of absolute differences was plotted using the exhaustive search pattern; the plots for the three step search pattern and diamond search pattern are similar.

The advantage of the algorithms mentioned so far is that they are simple to implement as far as algorithmic complexity is concerned. The implementation of these algorithms is regular across the different steps. Because of that, these algorithms were widely used prior to H.264. Except for diamond search, all the other algorithms are less efficient in tracking large motions. Because of this advantage, the diamond search pattern algorithm was accepted in the MPEG-4 verification model. Both of the diamond search patterns are center biased. Because of that, the diamond search algorithm provides higher search accuracy with a small increase in computational complexity.

Instead of using a pre-determined search pattern, algorithms are available that exploit the correlation between the current block and the reference block to predict the target motion vector [12]. In such methods, prediction is done by taking a statistical average of the motion vectors of neighboring macro blocks, and the size of the macro block and the step size are decided accordingly. Thus, in these kinds of methods, the search window is redefined. These methods give comparable performance at the expense of higher computation. Notice that additional memory is required in this method for storing the neighboring motion vectors.

There are algorithms that do motion estimation using a multi-resolution frame structure [12]. These algorithms are based on the fact that an image keeps the same structure at different resolutions. Motion vectors are calculated at a lower resolution with a smaller macro block size and then used to predict the direction of the motion vector in the actual, larger frame. These algorithms lead to poor performance if this assumption does not hold in the prediction process. The block size is kept proportional in the lower resolution while predicting the direction of the motion vector; by doing that, a motion vector found in the lower resolution frame is easily mapped to the associated block in the larger frame.

The algorithms explained so far calculate the MAD by taking all horizontal and vertical pixels as reference. There are algorithms that use sub-sampled pixels for calculating the mean absolute difference [11]. That is based on the fact that pixels located next to each other possess spatial redundancy within a macro block. By doing calculations on sub-sampled pixels, several computations are saved at each position while finding the cost function used to decide the appropriate motion vector.

Each class of algorithm explained so far achieves a different trade-off. The algorithmic complexity involved, the speed of the search for a motion vector and the picture quality after prediction are some of the factors against which the trade-off is evaluated. In the algorithms explained so far, there are two major parts on which an algorithm focuses: prediction of the direction of a motion vector, for finding the motion vector of a block quickly, and the most suitable size and shape of the search pattern. The adaptive rood pattern search algorithm uses the motion vector of the previous block at its beginning to obtain a prediction of the motion vector direction. The following figure shows four scenarios for predicting the direction of motion of the current block with reference to other blocks:

Figure 5-7: Types of prediction of motion (type-1 through type-4)

In the figure above, in the case of type-1 prediction, the directions of the four adjacent macro blocks are checked. Based on the motion vectors of these four blocks, the motion vector for the current macro block is predicted. If the directions of the motion vectors differ among the adjacent macro blocks, a majority rule is applied to predict the motion vector of the current macro block. In this project, type-4 is used for prediction. In this type, only the motion vector of the previously processed macro block is taken as reference for predicting the direction of the motion vector of the current block. The following figure gives an example of adaptive rood pattern search:

Figure 5-8: First step of ARPS (rood pattern with adaptive step size plus the predicted MV point)

As shown in the figure above, during the first step of the adaptive rood pattern search, cost functions are found at the five locations used in the small diamond search pattern as well as at the point given by the motion vector of the previous macro block. Notice that the step size used in this rood pattern is not equal to the step size used in the ordinary small diamond search explained in section 5.3. In the case of adaptive rood pattern search, the step size is selected in such a way that it matches the size of the predicted motion vector. The following equation is used to find the step size:

step size = round( sqrt( MV'x^2 + MV'y^2 ) )                            (5.4)

where

MV' = (MV'x, MV'y)                                                      (5.5)

In the equations above, MV' is the motion vector of the previous macro block. Notice that computing this step size in hardware requires many calculations. To minimize that complexity, the step size is instead chosen as the maximum of the horizontal and vertical components of the predicted motion vector:

step size = max( |MV'x|, |MV'y| )
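The following is a minimal Matlab sketch of the first ARPS step under the same stand-in setup, with a hypothetical predicted motion vector standing in for the previous block's MV: the rood arm length uses the simplified step size above, and the minimum of the checked points then seeds the SDSP refinement loop of section 5.3.

    % Minimal sketch of the first ARPS step for one 16x16 macro block.
    mb = 16;
    ref = double(randi(255, 64, 64));   % stand-in reference frame
    cur = double(randi(255, 64, 64));   % stand-in current frame
    r0 = 17;  c0 = 17;
    blk = cur(r0:r0+mb-1, c0:c0+mb-1);
    sadAt = @(p, q) sum(sum(abs(blk - ref(r0+p:r0+p+mb-1, c0+q:c0+q+mb-1))));
    inFrame = @(p, q) r0+p >= 1 && c0+q >= 1 && ...
                      r0+p+mb-1 <= size(ref,1) && c0+q+mb-1 <= size(ref,2);

    mvPred = [2 -1];                    % hypothetical, nonzero predicted MV
    G = max(abs(mvPred));               % simplified step size of section 5.4
    pts = [0 0; -G 0; G 0; 0 -G; 0 G; mvPred];  % rood points + predicted MV

    best = inf;  mv = [0 0];
    for k = 1:size(pts, 1)
        p = pts(k,1);  q = pts(k,2);
        if inFrame(p, q)
            cost = sadAt(p, q);
            if cost < best, best = cost; mv = [p q]; end
        end
    end
    % mv now seeds the small diamond (SDSP) loop, repeated until the minimum
    % stays at the center, exactly as at the end of diamond search.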
Chapter 6
SIMULATION RESULTS

In this project, the following videos were taken as reference to implement the components of the video encoder and decoder, including the various motion estimation algorithms mentioned in the report. The following table shows the list of videos taken as reference:

            Format   Bit Rate (kbps)   Frame Rate (fps)   Search Range (pixels)
Car phone   QCIF     190               29                 8
Room        CIF      48                18                 8
Crossing    QCIF     232               14                 16
Board       CIF      512               15                 8
Foreman     CIF      313               25                 8

Table 6-1: Videos taken as reference

In Matlab, each video was read into a data structure. Frames were extracted from the data structure based on the video format, bit rate and frame rate information. After extraction, the frames were passed into a Matlab function along with the macro block size and search range as parameters. The following table shows the average number of search points required to perform motion estimation with the exhaustive search, diamond search and adaptive rood pattern search algorithms:

            ES    DS    ARPS
Car phone   260   24    10
Room        255   19    10
Crossing    934   28    16
Board       255   14    8
Foreman     275   29    13

Table 6-2: Average number of search points per MV generation

From the table above it is clear that, in the case of adaptive rood pattern search, a motion vector for a macro block is found very quickly compared to the other algorithms. Notice also that the difference in the number of computations required between NTSS and DS is not major. Because of that, NTSS and DS were widely and interchangeably used in past video encoding schemes: codecs requiring less complexity used NTSS, and codecs capable of more complex calculations used DS. Note that the adaptive rood pattern search algorithm requires a very small number of search points to find an appropriate motion vector for a macro block of the current frame. On the other hand, the control complexity involved in the algorithm is large compared to the earlier algorithms. Because of that, industries related to video transmission and storage are concentrating on implementing this algorithm in hardware to process high definition video faster and more efficiently [10].

Finding a search point quickly is not the only requirement for videos demanding high quality; a good peak signal to noise ratio is also important. The following figure shows a comparison of PSNR over 20 frames:

Figure 6-1: Comparison of PSNR (dB) for 20 frames between ES and ARPS

In the figure above, notice that the peak signal to noise ratio found with the adaptive rood pattern search algorithm over 20 frames is around one dB less than that found with the exhaustive search algorithm. On the other hand, from Table 6-2, notice that the number of computations required for finding the motion vector of a given macro block in exhaustive search is very large compared to the number required in the adaptive rood pattern search algorithm.

The following is a pseudo block diagram of the encoder module used in the project. Prior to giving frames as input to the encoder, a Matlab function was used to generate frames from a given video. Those frames were saved in a folder for reference by the encoder module. The encoder module separated the current frame and reference frames from each other, and the input frames were then processed by the following video processing components written in Matlab:
Figure 6-2: Pseudo block diagram of the video encoder (Video_encode_n.m: current and reference frame extractor; AdaptiveRoodPattern.m produces Motion_vect; Compansated_motion.m produces Comp_frame; Residual_image.m produces Resi_frame; FrameEncodeTx.m; all driven by EncoderModule.m)

In the figure above, the adaptive rood pattern search file generates the motion vectors and the cost of doing the motion estimation. The motion vector information is then given to the motion compensation module to generate the compensated frame. The compensated frame is not exactly equal to the current frame; because of that, the difference between those frames is calculated by the residual image module and transmitted.

Chapter 7
CONCLUSION

H.264 provides a mechanism for compressing video very efficiently without much loss in video quality. Because of that, this standard meets practical multimedia communication requirements. The Matlab implementation of all major components of the H.264 encoder and decoder gives information on the complexity involved in implementing the codec on a hardware platform. Inter frame prediction, which estimates the movement of a macro block with respect to a macro block in a reference frame, was the main focus of the project. High definition videos are captured at higher frame rates; because of that, a huge amount of temporal redundancy is available between consecutive frames. By removing the temporal redundancy with effective inter frame prediction algorithms, much less information needs to be sent. Intra frame prediction and variable length encoding are also attractions of the encoder compared to previous schemes.

Inter frame prediction is the most complex component of an H.264 encoder. Different algorithms for inter frame prediction were discussed in the project, and factors like the computational complexity and peak signal to noise ratio associated with each algorithm were compared. The adaptive rood pattern search algorithm presented in this project requires the minimum number of computations to find a motion vector, while the average peak signal to noise ratio it achieves almost matches that of the exhaustive search algorithm, which has the maximum peak signal to noise ratio.

REFERENCES

1. T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard", IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, July 2003.
2. Sadka, Compressed Video Communications, John Wiley & Sons, 2002.
3. Edwin Paul J. Tozer, Broadcast Engineer's Reference Book, Elsevier, 2004.
4. Current and Previous, http://extra.cmis.csiro.au/IA/changs/motion/taxi.0.gif
5. Ilker Hamzaoglu, Ozgur Tasdizen, and Esra Sahin, "An Efficient H.264 Intra Frame Encoder System Design", Faculty of Engineering and Natural Sciences, Sabanci University, 34956 Tuzla, Istanbul, Turkey.
6. Hirohisa Jozawa, Kazuto Kamikura, and Atsushi Sagata, "Two-Stage Motion Compensation Using Adaptive Global MC and Local Affine MC", IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, no. 1, February 1997.
7. N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete Cosine Transform", IEEE Trans. Computers, pp. 90-93, January 1974.
8. Jianhua Lu and Ming L. Liou, "A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, no. 2, April 1997.
9. Renxiang Li, Bing Zeng, and Ming L. Liou, "A New Three-Step Search Algorithm for Block Motion Estimation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, no.
4, pp. 438-442, August 1994.
10. Yao Nie and Kai-Kuang Ma, "Adaptive Rood Pattern Search for Fast Block-Matching Motion Estimation", IEEE Trans. Image Processing, vol. 11, no. 12, pp. 1442-1448, December 2002.
11. B. Liu and A. Zaccarin, "New Fast Algorithms for the Estimation of Block Motion Vectors", IEEE Trans. Circuits and Systems for Video Technology, vol. 3, no. 2, pp. 148-157, 1993.
12. L.-J. Luo, C. Zou, and X.-Q. Gao, "A New Prediction Search Algorithm for Block Motion Estimation in Video Coding", IEEE Trans. Consumer Electronics, vol. 43, pp. 56-61, February 1997.