Avishek Saha, Jayanta Mukherjee and Shamik Sural

Department of Computer Science and Engineering, IIT Kharagpur, India
{avishek, jay}@cse.iitkgp.ernet.in
School of Information Technology, IIT Kharagpur, India
shamik@sit.iitkgp.ernet.in

Keywords: Video compression, SAD computation, Heuristic approach, Quality tradeoff, Low-power architecture

Abstract

This paper presents a heuristic approach to reduce the computational power requirement of the SAD cost function used in the motion estimation phase of video encoding. The proposed technique selects the best matched macroblock based on partial SAD sums. This results in lower power consumption with a marginal loss in image quality. Different levels at which image quality can be traded off against computational power are also proposed, and the mathematical intuition supporting the heuristic is discussed. Our approach shows good performance, especially at low bitrates.

1 Introduction

In the past few years, multimedia mobile services have witnessed manifold growth. This has resulted in a strong demand for efficient implementation of image and video applications in next generation wireless multimedia systems. However, since a limited power supply is one of the major constraints in such designs, designers are often forced to trade off quality for power. Moreover, in applications such as portable multimedia devices, the best image quality may not always be required. Therefore, algorithmic and architectural approaches to low power design, based on a trade-off between image quality and power consumption, are required.

In video coding, similarities between video frames can be exploited to achieve high compression ratios. Motion estimation is used to obtain this higher compression efficiency. It is performed for all MacroBlocks (MBs) of size 16×16 pixels: a best match for an MB of the current frame is searched in the reference frame. The most commonly used metric to evaluate the quality of the match is the "sum of absolute differences" (SAD). SAD computation is quite time consuming due to the complex nature of the absolute-value operation and the subsequent multitude of additions.

Accurate estimation of motion vectors is an important step in low bit-rate video coding. However, in all video encoders, block motion estimation is the most computationally intensive module [15]. Fast algorithms to speed up motion estimation exist and can be divided into the following categories. The first category tries to improve motion estimation by reducing the number of search points [13, 19]. The second category tries to reduce the number of pixels of a block that are used for evaluating the match [3, 4, 10, 16, 18]. The third category attempts to reduce the number of operations for measuring the distortion [7, 13]. In the final category, only part of all the blocks within a predefined search window is searched for the block-motion vector [8]. Here, we focus on the first three categories only.

Initially, Bierling used an orthogonal sampling lattice with a uniform 4:1 subsampling pattern to select pixels for the search of motion vectors [3]. This scheme resulted in reduced-accuracy motion vectors, since the fixed pixel pattern always left out some pixels. Later, Liu and Zaccarin [10] ensured that all pixels in the current block are used in the evaluation of the block match by selecting one out of four different alternating 4:1 subsampling patterns in each step; a sketch of such a decimated SAD is given below.
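For concreteness, the following is a minimal C sketch of a SAD evaluated on a fixed 4:1 subsampling lattice, in the spirit of the decimation schemes cited above. The function name, the row-major layout with an explicit stride, and the way the four alternating patterns are indexed are our own illustrative assumptions, not code from [3] or [10].

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD of a 16x16 block evaluated on a 4:1 subsampling lattice.
 * pattern (0..3) selects one of four alternating offsets inside each
 * 2x2 cell, so that over consecutive steps every pixel position gets
 * used, in the spirit of the alternating patterns of Liu and Zaccarin. */
static unsigned sad_16x16_quarter(const uint8_t *cur, const uint8_t *ref,
                                  int stride, int pattern)
{
    int ro = (pattern >> 1) & 1;   /* row offset inside the 2x2 cell */
    int co = pattern & 1;          /* column offset inside the 2x2 cell */
    unsigned sad = 0;

    for (int r = ro; r < 16; r += 2)
        for (int c = co; c < 16; c += 2)
            sad += (unsigned)abs(cur[r * stride + c] - ref[r * stride + c]);

    return sad;   /* only 64 of the 256 pixels contribute */
}
```

Only 64 of the 256 pixels are visited, which is where the 4:1 reduction in arithmetic comes from; the adaptive schemes discussed next differ mainly in how those positions are chosen.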
However, better coding efficiency can be achieved by using adaptive patterns rather than fixed patterns [3, 10, 18], at the additional overhead of selecting the most representative pattern. Compared to random 4:1 subsampling, Chan and Siu [4] presented a more efficient local pixel-decimation scheme which divides a block into several regions and selects more pixels from regions containing higher detail or edges. Wang et al. [16] extended the concept of adaptivity from local to global by suggesting an algorithm that directly looks for edge pixels in 1-D space with the help of the Hilbert scan. Experimental results show that this global adaptivity is more efficient than the existing locally adaptive techniques. Another fast motion estimation algorithm, the N-queen decimation lattice, was proposed in [15]. It tries to find the most representative sampling lattice for a block, one that captures the spatial information in all directions. Simulation results show that the technique is faster than existing ones, with some loss in image quality.

In this paper, we propose a heuristic to lower the power requirements of SAD computation. It is based on an efficient trade-off between image quality and computational complexity. In this approach, the best match is selected based on partially computed SAD values. The approach requires less computation and lower power, and hence is of great advantage for implementations in power-constrained devices such as PDAs and mobile phones. Our approach deals with reducing the SAD computation itself, rather than with selecting the best representative block. Consequently, it can be combined with any of the previously mentioned techniques: first select the best representative block, and then reduce the SAD computation between the selected block and the current block.

The rest of the paper is organized as follows. The next section describes our heuristic approach to SAD computation: Subsection 2.1 provides the necessary background, Subsection 2.2 describes our proposed approach and Subsection 2.3 presents the experimental results. A low-power reconfigurable architecture based on our SAD computation technique is proposed in Section 3. Finally, Section 4 concludes the discussion, followed by the Acknowledgment and References.

2 Heuristic Approach to SAD Computation

2.1 SAD Computation

The block diagram of a typical video encoder is shown in Fig. 1.

Figure 1: Motion-Compensated Video Encoder (ME, DCT, Q and entropy coding (EC) on the forward path; DQ and IDCT on the reconstruction path)

Each input video frame is encoded individually. Input frames, which are further subdivided into MacroBlocks (MBs), first undergo motion estimation. The motion-estimated frame is then transformed from the spatial to the frequency domain on a block level (8×8) using the Discrete Cosine Transform (DCT). The transformed coefficients are quantized and then entropy coded to generate the output bitstream. On the encoder side, the quantized coefficients are also de-quantized and inverse transformed to obtain the reconstructed frame, which is subsequently used to estimate the motion in the next input frame.

Motion-compensated prediction assumes that the pixels within the current picture can be modeled as a translation of those within a previous picture. In forward-prediction based video compression, each MB is predicted from the previous frame. This implies an assumption that each pixel within the MB undergoes the same amount of translational motion, as illustrated by the sketch below.
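As a small illustration of this translational model (a sketch under our own naming and layout assumptions, not taken from any particular encoder), forward prediction of a 16×16 MB then amounts to copying a block displaced by the motion vector from the reference frame:

```c
#include <stdint.h>

/* Forward prediction under a purely translational model: the predicted
 * 16x16 MB at (mb_x, mb_y) in the current frame is the block at
 * (mb_x + dx, mb_y + dy) in the reference frame. The caller must keep
 * the displaced block inside the reference frame. */
static void predict_mb(uint8_t *pred, const uint8_t *ref_frame, int stride,
                       int mb_x, int mb_y, int dx, int dy)
{
    const uint8_t *src = ref_frame + (mb_y + dy) * stride + (mb_x + dx);

    for (int r = 0; r < 16; r++)
        for (int c = 0; c < 16; c++)
            pred[r * 16 + c] = src[r * stride + c];
}
```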
This motion information is represented by two-dimensional displacement vectors, or motion vectors. Due to the block-based picture representation, many motion estimation algorithms employ block-matching techniques. In such techniques, the motion vector is obtained by minimizing a cost function that measures the mismatch between the current MB and a reference MB. Although several cost measures have been introduced, the most widely used one is the SAD, defined by

SAD(u, v) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} | c(m, n) - r(m+u, n+v) |        (1)

where c(m, n) represents a pixel of the 16×16 (N = 16) MB from the current image at array location (m, n), and r(m+u, n+v) represents the corresponding pixel of a candidate MB from the reference picture, displaced by the vector (u, v). To find the MB producing the minimum mismatch error, we need to calculate the SAD at several locations within a search window. The simplest but most computationally intensive search method, known as the full search or exhaustive search method, evaluates the SAD at every possible pixel location in the search area. To lower the computational complexity, several algorithms that restrict the search to a few points have been proposed [13, 19].

2.2 Proposed Approach

Camera motion in a video being encoded is usually translational in nature. If we consider a particular MB of the k-th frame, then any new object that becomes visible in the same MB of the (k+1)-th frame usually enters the MB through one of its four boundaries. We aim to catch the motion of the new object entering the MB at the boundary of the MB itself. To detect any new object entering the concerned MB, we compute the SAD values of the boundary rows and columns only. Our conjecture is that we need not compute the complete SAD sum but can take a decision based on the first few partial sums only. While doing this, we would also like to keep the quality of the encoded sequence as close as possible to that of the sequence encoded with the full SAD. This motivates us to study the effect of video reconstruction from a compressed sequence encoded with partial SAD computation.

Let us suppose that the pixels of the MB are visited following a spiral scan from the boundary to the interior, as shown in Fig. 2(a)-(e).

Figure 2: Spiral scan based SAD calculation for (a) k = 1, (b) k = 2, (c) k = 4, (d) k = 6, (e) k = 8

We can partition the sequence of visited pixels into different layers, where the pixels of each layer lie at the same distance from the boundary (taking the chessboard distance as the metric). The pixels of the i-th layer have chessboard distance i from the boundary, 0 ≤ i < N/2. The i-th layer is visited starting from the pixel at position (i, i), situated on the diagonal of the MB. Let j denote the position of a pixel in the visiting sequence of the i-th layer, 0 ≤ j < 4(N - 2i - 1). Let the pixel at the i-th layer and j-th sequential position of that layer for the N×N MB X be denoted as x_{i,j}; similarly, y_{i,j} denotes the corresponding pixel of the N×N MB Y.

In the proposed computational model, we consider partial computation of SAD values by performing the spiral scan. Let us denote the partial SAD sum contributed by the i-th layer as

S_i = \sum_{j=0}^{4(N-2i-1)-1} | x_{i,j} - y_{i,j} |        (2)

Hence, the full SAD sum of equation (1) can be expressed in terms of the S_i as

SAD(u, v) = \sum_{i=0}^{N/2-1} S_i        (3)

The partial SAD at level k is the sum of the first k layer sums S_0, ..., S_{k-1}; a software sketch of this layer-wise computation is given below.
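The following is a minimal C sketch of the layer-wise computation; N, the function names and the row-major layout with an explicit stride are illustrative assumptions. It accumulates one S_i per layer exactly as in equation (2); summing the first k entries gives the level-k partial SAD, and summing all N/2 entries gives the full SAD of equation (3).

```c
#include <stdint.h>
#include <stdlib.h>

#define N 16   /* macroblock size */

/* Layer sums S_i of equation (2): layer i holds the pixels whose
 * chessboard distance from the block boundary is exactly i. */
static void layer_sums(const uint8_t *cur, const uint8_t *ref, int stride,
                       unsigned s[N / 2])
{
    for (int i = 0; i < N / 2; i++)
        s[i] = 0;

    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++) {
            int d1 = r < N - 1 - r ? r : N - 1 - r;
            int d2 = c < N - 1 - c ? c : N - 1 - c;
            int layer = d1 < d2 ? d1 : d2;   /* chessboard distance to boundary */
            s[layer] += (unsigned)abs(cur[r * stride + c] - ref[r * stride + c]);
        }
    }
}

/* Partial SAD at level k; k = N/2 gives the full SAD of equation (3). */
static unsigned partial_sad(const unsigned s[N / 2], int k)
{
    unsigned sum = 0;
    for (int i = 0; i < k; i++)
        sum += s[i];
    return sum;
}
```

In an optimized implementation the layers would of course be visited in spiral order and the scan simply stopped after layer k - 1; the per-pixel layer test above is used only to keep the sketch short.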
2.3 Results

Let X = {x_0, x_1, ..., x_{P-1}} and Y = {y_0, y_1, ..., y_{P-1}} be any two sequences of partial (layer) SAD sums of length P, where P = N/2. Their partial sums X_k and Y_k can be defined as

X_k = \sum_{i=0}^{k-1} x_i,    Y_k = \sum_{i=0}^{k-1} y_i        (4)

We denote X ≤ Y iff X_P ≤ Y_P, i.e., iff the full SAD of X does not exceed that of Y. Let P_k be the probability of X_k ≤ Y_k given X ≤ Y. Then P_k can be expressed, for k = 1, 2, ..., P, as

P_k = Prob( X_k \le Y_k | X \le Y )        (5)

In other words, P_k is the probability that a decision based on only the first k layer sums agrees with the decision based on the full SAD.

We have carried out an experiment to estimate this probability for different values of k. The test data were generated such that all the partial SAD sums in the set X belong to the current MB and all the partial SAD sums in the set Y belong to the reference MB. Every partial SAD sum in X is compared only with those partial SAD sums in Y that belong to the search window in which the block from X would have been searched for the best match. These comparisons were made for all possible values of k. The probability estimation was carried out on the standard sequences Miss America, Foreman, Carphone and Akiyo. The discrete probability distribution P_k for the sequence Carphone is given in Table 1; the distributions obtained for the other test sequences yielded similar results.

k     1     2     3     4     5     6     7     8
P_k   0.70  0.75  0.78  0.82  0.87  0.92  0.97  1.00

Table 1: P_k for different k-values

The above results show that the probability of making a correct decision based on partial SAD sums is quite high: the higher the value of k, the higher the chance of selecting the optimal block. After k = 6, there is only a marginal improvement in the probability, so a k-value of 6 or higher can be used for most practical purposes. Lower values of k can also be selected, depending on the amount of image quality loss or the increase in bitrate that can be afforded.

The proposed SAD computation technique was implemented in a baseline H.263 [2] encoder and tested on standard test sequences. Fig. 3 shows the Rate-Distortion (RD) curves for the test sequence Carphone at 30 fps, 15 fps and 10 fps, plotted for k = 1, 2, 4, 6 and 8. Similar results were obtained for the other test sequences.

Figure 3: Rate-Distortion curves for Carphone at (a) 30 fps, (b) 15 fps and (c) 10 fps

The MPEG committee uses an informal threshold of 0.5 dB PSNR to decide whether to incorporate a coding optimization [1]. As is evident from the figure, at low bit-rates the PSNR loss between k = 8 and k = 6 is less than 0.5 dB. We therefore propose k = 7 or k = 6 as the preferred level of partial SAD computation for most practical purposes. Since our application is aimed at mobile devices, where the best quality may not always be required, k = 5 and k = 4 may also be considered.

Typical mobile applications operate in the range of 10-15 fps. As is evident from Fig. 3(b) and Fig. 3(c), our sub-optimal heuristic incurs no additional quality loss at 10 and 15 frames per second, while the bandwidth consumption at the reduced frame rates is much lower. Thus our approach is well suited for low frame rate mobile applications, where both the loss in image quality and the bandwidth consumption decrease.

It may be noted that the plots for k = 1 and k = 2 are very close to each other. So, given a choice, it is better to select a k-value of 1 over a k-value of 2, as it incurs almost the same loss in image quality but consumes much less power. The decrease in power consumption is due to the reduced number of addition operations; the sketch below illustrates how a chosen level k slots into the block-matching search.
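The sketch below shows how a chosen level k could plug into an exhaustive block-matching loop: every candidate displacement in a ±range window is scored with the level-k partial SAD only, and the candidate with the minimum score is taken as the best match. This is illustrative C under our own assumptions (function names, a square search window, and a caller that keeps the window inside the frame), not the baseline H.263 encoder code used for the experiments.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* Level-k partial SAD: only pixels in the outermost k layers of the
 * 16x16 block (chessboard distance from the boundary < k) contribute. */
static unsigned partial_sad_k(const uint8_t *cur, const uint8_t *ref,
                              int stride, int k)
{
    unsigned sad = 0;
    for (int r = 0; r < 16; r++) {
        for (int c = 0; c < 16; c++) {
            int dr = r < 15 - r ? r : 15 - r;
            int dc = c < 15 - c ? c : 15 - c;
            if ((dr < dc ? dr : dc) < k)
                sad += (unsigned)abs(cur[r * stride + c] - ref[r * stride + c]);
        }
    }
    return sad;
}

/* Exhaustive search over a +/-range window around (mb_x, mb_y); the
 * candidate with the smallest level-k partial SAD is kept. The caller
 * must ensure the whole window lies inside the reference frame. */
static void search_best_mv(const uint8_t *cur_mb, const uint8_t *ref_frame,
                           int stride, int mb_x, int mb_y, int range, int k,
                           int *best_dx, int *best_dy)
{
    unsigned best = UINT_MAX;
    *best_dx = 0;
    *best_dy = 0;

    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            const uint8_t *cand =
                ref_frame + (mb_y + dy) * stride + (mb_x + dx);
            unsigned cost = partial_sad_k(cur_mb, cand, stride, k);
            if (cost < best) {
                best = cost;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
}
```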
Let C(k) denote the number of operations (additions and subtractions) required for a given value of k. Since the first k layers of a 16×16 MB contain 4k(16 - k) pixels, each contributing one subtraction (with absolute value), and one fewer addition is needed to accumulate them, C(k) can be expressed as a function of k as

C(k) = 4k(16 - k) + [4k(16 - k) - 1] = 8k(16 - k) - 1        (6)

From the above equation we can compute the total number of additions and subtractions required for any value of k. Table 2 compares the number of additions and subtractions required by the proposed approach at the different levels.

k-value        8    6    4    2    1
Subtractions   256  240  192  112  60
Additions      255  239  191  111  59
Total          511  479  383  223  119

Table 2: Comparative study of the proposed SAD schemes

Fig. 4(a)-(b) shows the reduction in the number of operations and the reduction in image quality at different values of k, while Fig. 4(c) shows the change in bitrate with k at a fixed PSNR.

Figure 4: For different values of k, (a) reduction in the number of operations at a fixed bitrate, (b) reduction in PSNR at a fixed bitrate, (c) change in bitrate at a fixed PSNR

Software profiling of the proposed approach with k = 1, 2, 4, 6 and 8 was carried out using the GNU gprof profiling tool with Carphone as the test sequence. The encoding scheme was broadly divided into functional blocks based on the amount of power consumed: SAD (B1), Motion Estimation and Compensation (B2), DCT/IDCT (B3), Quantisation and Dequantisation (B4) and VLC/VLD (B5), with all functions that contribute insignificantly to the overall encoding time grouped under Others (B6). Table 3 summarizes the profiling results.

k    B1     B2     B3     B4     B5    B6
8    72.28  16.85  6.17   2.93   1.45  0.32
6    66.19  21.23  7.77   4.45   0.20  0.16
4    54.90  31.36  8.83   4.31   0.21  0.39
2    40.15  41.53  11.87  5.63   0.01  0.81
1    26.85  47.48  17.90  7.75   0.01  0.01

Table 3: Profiling results of the functional blocks. The values denote the percentage of power consumed by a particular block with respect to the total power consumed by the entire encoder.

We see that the full SAD (k = 8) consumes 72.28% of the total power. At lower values of k, the amount of power consumed by the partial SAD computation reduces significantly.
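The dominant share of B1 is consistent with the operation counts of equation (6) and Table 2. As a quick cross-check, the small, purely illustrative program below (not part of the encoder) prints the counts from the closed form: the first k layers of a 16×16 block contain 4k(16 - k) pixels, each contributing one subtraction, with one fewer addition needed to accumulate them.

```c
#include <stdio.h>

int main(void)
{
    /* Operation counts for the level-k partial SAD of a 16x16 block:
     * 4k(16-k) subtractions, 4k(16-k) - 1 additions, i.e.
     * C(k) = 8k(16-k) - 1 (cf. equation (6) and Table 2). */
    const int ks[] = { 8, 6, 4, 2, 1 };

    for (size_t i = 0; i < sizeof ks / sizeof ks[0]; i++) {
        int k = ks[i];
        int pixels = 4 * k * (16 - k);
        printf("k=%d: subtractions=%d additions=%d total=%d\n",
               k, pixels, pixels - 1, 2 * pixels - 1);
    }
    return 0;
}
```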
3 Low-Power SAD Architecture

In this section we present an architecture for our proposed SAD computation, which consumes less power at the cost of some degradation in image quality.

The design of power-efficient computational hardware involves two paradigms: behavioral and structural [11]. We focus on the behavioral domain, where power optimization is achieved at the algorithmic level by reordering or reducing the basic operations involved in the compression tools. A large number of hardware implementations of the SAD operation are already available [5, 6, 14, 17], but most of them are massively parallel and typically infeasible for power-constrained devices such as mobile phones and PDAs. The balanced adder tree is the most commonly used structure and has been used in [5, 9, 14]. Other approaches, which compute a large number of summands using a matrix reduction technique and 4:2 ratio compressors, have been discussed in [6]. Although the approaches suggested in [6] are efficient in terms of time and hardware complexity, they lack reconfigurability and hence are not suitable for our purposes.

In this paper, we implement an efficient reconfigurable adder tree to calculate partial SAD sums, based on a balanced adder tree structure. Fig. 5 shows the balanced adder tree used for calculating partial SAD sums; the adder hardware is constructed for a single row of the SAD operation. As shown, AND gates are added to the input signals to configure the hardware for different values of k. For the original full SAD, all the AND gates pass their inputs and the hardware works like a normal SAD processor without any modification. If a particular input needs to be turned off, the control signal of the corresponding AND gate is set to zero. In Fig. 5, the adders inside the dotted boxes are those turned off at the different values of k.

Figure 5: Balanced adder architecture for calculating partial SAD sums

Table 4 shows the decrease in adder requirement as we enforce our heuristic scheme.

k-value        8   6   4   2   1
Critical path  4   4   4   4   4
Adders ON      15  13  9   3   1
Adders OFF     0   2   6   12  14

Table 4: Adder requirements for the different schemes

As can be seen, the number of active adders reduces drastically with decreasing k. The hardware can be further optimized for power and critical path by using an unbalanced tree approach or by using different types of optimized adders, as in [12].
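The gating of Fig. 5 can also be described behaviourally. The following C model of a single row of the datapath is our own sketch (the RTL itself is not reproduced in this paper): each absolute difference is masked by an enable bit derived from k and the row/column position, so disabled inputs feed zeros into the accumulation, mimicking an adder tree whose unused branches are switched off.

```c
#include <stdint.h>

/* Behavioural model of one row of the reconfigurable SAD datapath.
 * For block row r (0..15), column c is enabled when the pixel lies in
 * one of the outermost k layers, i.e. min(r, c, 15-r, 15-c) < k.
 * Disabled inputs are forced to zero by the AND gates, so the adder
 * tree simply accumulates zeros for them. */
static unsigned row_sad_gated(const uint8_t cur_row[16],
                              const uint8_t ref_row[16], int r, int k)
{
    unsigned acc = 0;
    for (int c = 0; c < 16; c++) {
        int d = r;
        if (15 - r < d) d = 15 - r;
        if (c < d)      d = c;
        if (15 - c < d) d = 15 - c;

        int enable = (d < k);                 /* AND-gate control signal */
        int diff = cur_row[c] - ref_row[c];
        unsigned ad = (unsigned)(diff < 0 ? -diff : diff);
        acc += enable ? ad : 0u;              /* gated adder input */
    }
    return acc;
}
```

Summing this row function over the 16 rows of a block reproduces the level-k partial SAD of Section 2.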
Our architecture was implemented using a 0.18 µm CMOS NatSem library. Power consumption was measured on the netlist-level simulation with the Synopsys Power Compiler of the Synopsys Design Compiler. Table 5 shows the power consumption (in mW) of the proposed SAD architecture for different values of k.

k            8        6        4      2      1
Power (mW)   21.4647  15.7039  9.887  4.099  1.0371

Table 5: Power consumption (in mW) at different values of k

At k = 6, the proposed architecture shows around 26.84% power savings with a small degradation in image quality. When an even lower-power SAD operation is required, the reconfigurable SAD architecture can be switched to a k-value of 4, which gives 53.94% power savings. Depending on the amount of power reduction required, the proposed scheme allows the selection of still lower values of k, saving power at the expense of acceptable image quality degradation.

As is evident from the tables and figures in this section, the approach results in marginal quality loss for a k-value of 6. A similar approach based on a complexity versus quality trade-off is presented in [12], where the complexity of the encoder is reduced by modifying the DCT bases in a bit-wise manner. Table 6 presents a comparative analysis of our approach with that of [12]; for our approach, the PSNR and power values are those at low bitrates. As can be seen, the PSNR loss in [12] is very small initially but decreases sharply at the lower trade-off levels. In our approach, the initial fall in PSNR is larger, but the rate of fall decreases at the lower trade-off levels; the total difference between the original and the worst level is a little more than 1 dB, and at our preferred level k = 6 the PSNR loss is only 0.5 dB. The more interesting feature of our approach is the fall in power consumption at each level: we obtain substantial power savings of 26.83% and 53.94% at the initial levels k = 6 and k = 4. This is much better than the 15.48% and 30.95% power reductions at level 1 and level 2 of [12].

[12]                          Our approach
Level      PSNR   Savings     Level            PSNR    Savings
original   33.0   0           original (k=8)   31.75   0
level 1    32.9   15.48%      k=6              31.28   26.83%
level 2    32.7   30.95%      k=4              30.74   53.94%
level 3    32.6   45.82%      k=2              30.54   80.90%
level 4    30.5   59.46%      k=1              30.54   95.17%
level 5    27.9   67.42%

Table 6: Comparison between [12] and our approach

Moreover, our approach can be used in combination with any of the techniques in [3, 4, 8, 10, 13, 15, 16, 18, 19]. These techniques aim at reducing the number of locations at which the SAD has to be calculated, but at each location they compute the full SAD. Our approach emphasizes reducing the number of operations involved in the SAD computation itself. We can therefore use any of the aforementioned techniques to reduce the number of search locations and then, based on the target application and the amount of quality loss that can be afforded, select a particular value of k for our technique. This greatly reduces the overall complexity of the encoder.

4 Conclusions and Future Work

We have proposed a heuristic to lower the power requirements of the SAD function, which is the performance bottleneck in real-time implementations of video coders on power-constrained handheld devices. The proposed technique leads to substantial power savings without seriously compromising image quality and is particularly suited to low bit-rate mobile applications, where the best quality may not always be required. The loss in quality is smaller at lower bitrates, and the bandwidth consumption is also reduced at lower frame rates. Various trade-off levels can be selected as per individual requirements. Future work lies in implementing our heuristic in H.264, which employs variable-sized macroblocks; our heuristic needs to be modified so that it adapts to changing macroblock dimensions.

Acknowledgment

This work is partially supported by the Department of Science and Technology (DST), India, under Research Grant No. SR/S3/EECE/024/2003-SERC-Engg. The work of Shamik Sural is also supported by a grant from DST under Grant No. SR/FTP/ETA-20/2003 and a grant from IIT Kharagpur under ISIRD scheme No. IIT/SRIC/ISIRD/2002-2003.

References

[1]

[2] "Video coding for low bit rate communication", ITU-T Recommendation H.263, (1998).

[3] M. Bierling. "Displacement estimation by hierarchical block matching", Proceedings of the SPIE Conference on Visual Communication, volume 1001, pp. 942-951, (1988).

[4] Y. L. Chan, W. C. Siu. "New adaptive pixel decimation for block motion vector estimation", IEEE Transactions on Circuits and Systems for Video Technology, volume 6, pp. 113-118, (1996).

[5] R. Gao, D. Xu, J. P. Bentley. "Reconfigurable hardware implementation of an improved parallel architecture for MPEG-4 motion estimation in mobile applications", IEEE Transactions on Consumer Electronics, volume 49(4), pp. 1383-1390, (2003).

[6] D. Guevorkian, A. Launiainen, P. Liuha, V. Lappalainen. "Architectures for the sum of absolute differences operation", Proceedings of the IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS), pp. 57-62, (2002).

[7] K. Lengwehasatit, A. Ortega. "Probabilistic partial-distance fast matching algorithms for motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, volume 11, pp. 139-152, (2001).

[8] R. X. Li, B. Zeng, M. Liou. "A new three-step search algorithm for block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, volume 4, pp. 438-442, (1994).

[9] S. Lin, P. Tseng, L. Chen. "Low-power parallel tree architecture for full search block-matching motion estimation", Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 313-316, (2004).

[10] B. Liu, A. Zaccarin. "New fast algorithms for the estimation of block motion vectors", IEEE Transactions on Circuits and Systems for Video Technology, volume 3, pp. 148-157, (1993).

[11] V. Muresan, N. Connor, N. Murphy, S. Marlow, S. McGrath. "Low power techniques for video compression", Proceedings of the IEE Irish Signals and Systems Conference, (2002).

[12] J. Park, K. Roy. "A low power reconfigurable DCT architecture to trade off image quality for computational complexity", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, pp. V-17-20, (2004).

[13] A. M. Tourapis, O. C. Au, M. L. Liou, G. Shen, I. Ahmad. "Optimizing the MPEG-4 encoder - advanced diamond zonal search", Proceedings of the 2000 International Symposium on Circuits and Systems, volume 3, pp. 674-677, (2000).

[14] J. Tuan, T. Chang, C. Jen. "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture", IEEE Transactions on Circuits and Systems for Video Technology, volume 12(1), pp. 61-72, (2002).

[15] C. N. Wang, S. W. Yang, C. M. Liu, T. Chiang. "A hierarchical N-queen decimation lattice and hardware architecture for motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, volume 14(4), pp. 429-440, (2004).

[16] Y. K. Wang, Y. Q. Wang, H. Kuroda. "A globally adaptive pixel-decimation algorithm for block-motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, volume 10, pp. 1006-1011, (2000).

[17] S. Wong, S. Vassiliadis, S. Cotofana. "A sum of absolute differences implementation in FPGA hardware", Proceedings of the 28th Euromicro Conference, pp. 183-188, (2002).

[18] Y. Yu, J. Zhou, C. W. Chen. "A novel fast block motion estimation algorithm based on combined subsamplings on pixels and search candidates", Journal of Visual Communication and Image Representation, volume 12, pp. 96-105, (2001).

[19] S. Zhu, K. K. Ma. "A new diamond search algorithm for fast block-matching motion estimation", IEEE Transactions on Image Processing, volume 9, pp. 287-290, (2000).