International Journal of Communication Technology for Social Networking Services
Vol.3, No.1 (2015), pp.23-32
http://dx.doi.org/10.14257/ijctsns.2015.3.1.03

A Fast Trace Based Spiral Search Architecture for Motion Estimation and its Implementation Using FPGA

1 Manu T. M, 2 Linganagoud Kulkarni and 3 Basavaraj S. Anami
1,3 K.L.E Institute of Technology, Hubli, India
2 B.V.B College of Engineering and Technology, Hubli, India
1 manutmece@yahoo.com, 2 linganagoud@yahoo.co.uk, 3 anami_basu@hotmail.com

Abstract

H.264/AVC offers many advanced coding tools that achieve a compression ratio up to 50% higher than previous standards. These coding tools substantially increase the computational complexity of Motion Estimation (ME), which consumes up to 85% of the entire encoder's computations. In this paper, we propose a computationally efficient and accurate model that skips some of the computations to speed up the full-search block motion estimation algorithm. Instead of directly calculating the Sum of Absolute Differences (SAD) to exploit the motion activity between adjacent frames, the first step examines the motion activity by comparing the trace and off-diagonal sum of the current frame block with those of the candidate block in the previous frame. If the values match exactly or are highly similar, the SAD is calculated in the second step to find the best match; otherwise the candidate block is skipped. In traditional spiral search, the search point selection is not hardware friendly. We therefore also use a modified spiral search order that is easy to implement in hardware: it begins at the center of the search window and expands in a spiral fashion until the boundary of the search window is reached or a sufficiently good match is found. Simulation results show that trace matching combined with the modified spiral search reduces the computations by 80 to 85% while ensuring good compression quality.
The synthesis report for a Spartan-6 FPGA device shows a maximum operating clock frequency of 297.39 MHz with a power consumption of 34.96 mW.

Keywords: Motion Estimation, Trace, Spiral search, FPGA, Synthesis

1. Introduction

Because of the limited bandwidth and storage space available for high-quality multimedia content such as video broadcasting and DVD video data, video compression has become necessary to keep up with the ever-growing demand while maintaining the quality of the decoded video. Typically, multiple consecutive frames in a video sequence are similar to each other. This redundancy, called temporal redundancy, is exploited by video compression algorithms to achieve better compression. Motion estimation (ME) is the process that exploits the redundancy between two consecutive frames. After the first frame is transmitted, the next frame in the video sequence is coded only as the difference from the previous frame. Of the many ME algorithms that exist, the full-search (FS) block-matching algorithm is a popular one. The concept of block matching, which finds the movement of each block between the previous and current frames in the form of displacement vectors (motion vectors, MVs), is depicted in Figure 1. In the full-search algorithm, the motion vector is calculated in two stages: the SAD is calculated for each displacement vector, and the smallest SAD value is then found.

ISSN: IJCTSNS Copyright ⓒ 2015 SERSC

Figure 1. Block Matching Motion Estimation

Motion estimation is a computationally intensive process that consumes more than 85% of the encoding time of the encoder. Most of the available algorithms exhibit a trade-off between quality and speed. Since ME is scene dependent, no single technique reliably produces good visual quality.
A robust ME scheme therefore needs a combination of techniques, such as motion start-point prediction, motion search patterns and adaptive search control to terminate the search. Hence there is a need for improved algorithms and hardware architectures that are suitable for real-time applications. A literature survey was carried out to identify the state-of-the-art methods.

2. Literature Survey

Block-based Motion Estimation (BME) has been adopted in all existing video coding standards [2-5] to reduce the temporal redundancy between frames. Full search computes the SAD at each location in the search window. For a search window of ±P pixels, the number of search locations is $(2P+1)^2$. For a 32x32 search window and a 16x16 block size (P = 8), a total of 289 locations are searched to find the best match with the minimum SAD value. This results in significant computational complexity.

Many algorithms have been proposed to reduce the computational complexity of full-search motion estimation. Some of the popular ones are the Three Step Search (TSS [4]), the New Three Step Search (NTSS [5]), the Four Step Search (4SS [6]), the Diamond Search (DS [7]) and the Adaptive Rood Pattern Search (ARPS [8-9]). These algorithms perform a small-square (TSS, NTSS, 4SS) or diamond-shaped (DS) search around a search center and refine the search around the best matching block. Early termination techniques based on SAD threshold values are used to reduce the computation cost. Algorithms such as ARPS use sophisticated search center prediction to choose the start point. Although these algorithms reduce the computational cost well, their PSNR performance only comes close to that of the FS algorithm. Because of its PSNR performance, full search remains attractive for high-end applications, at the cost of increased computation and power consumption. Many hardware architectures have been proposed for full-search motion estimation.
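As a point of reference for the complexity figures above, the following is a minimal Python sketch of exhaustive block matching. It is an illustration under stated assumptions, not the authors' implementation: the function names `sad` and `full_search`, the parameter names, and the representation of frames as NumPy arrays are all hypothetical.

```python
import numpy as np


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    diff = block_a.astype(np.int64) - block_b.astype(np.int64)
    return int(np.abs(diff).sum())


def full_search(cur_block, ref_frame, top, left, p):
    """Exhaustive search over all displacements in [-p, p] x [-p, p].

    (top, left) is the position of the current block in its frame.
    Returns (best_mv, best_sad, candidates_checked).
    """
    n = cur_block.shape[0]
    h, w = ref_frame.shape
    best_mv, best_sad, checked = (0, 0), None, 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > h or x + n > w:
                continue  # candidate falls outside the reference frame
            checked += 1
            cost = sad(cur_block, ref_frame[y:y + n, x:x + n])
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dy, dx), cost
    return best_mv, best_sad, checked
```

For a 16x16 block that lies far enough from the frame border and P = 8, the loop visits all $(2P+1)^2 = 289$ candidates, which is the cost that the fast algorithms surveyed here try to avoid.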
These architectures compute the SAD at all locations in the search window. Popular hardware architectures are the partial propagate SAD architecture [13], the SAD tree [14], the 1D tree architecture [15] and the 2D architecture [15]. They access the blocks in raster-scan order to maximize pixel reuse. Their main disadvantage is the inability to start the search at the search center of the search window in the reference frame and to apply early termination techniques.

To address these issues, we propose a modified spiral search order for motion estimation whose search step is larger than one pixel, unlike the traditional search. This results in a certain number of blocks being skipped. To speed up the search further, the trace and off-diagonal sum of the current frame block are compared with those of the reference frame block to find a nearly matching block. If the values match, the SAD is calculated to find the best match. Thus, there is a significant reduction in SAD computations as well as in arithmetic operations, and the time required to find the motion vector is drastically reduced.

The rest of the paper is organized as follows. Section 3 explains spiral search motion estimation. Section 4 describes the proposed methodology. Section 5 gives the FPGA implementation details. Section 6 analyzes and discusses the method, and Section 7 draws the conclusion.

3. Spiral Search Motion Estimation

Among the motion estimation algorithms and architectures proposed so far, spiral-search-based motion estimation is more efficient because the best matching blocks are concentrated around the search center, and the probability of finding the best match decreases with increasing distance, as shown in Figure 2(a).

Figure 2.
a) Precedence of Best Match b) Traditional Search Order

The spiral search used in H.264/AVC with the reference software JM18 is shown in Figure 2(b). The number in each small square indicates the search order. Number zero is the search center, where the search starts. The search order proposed in H.264 is based only on the probability of the selected points and the predicted search center, which makes it difficult to implement in hardware [18]. To overcome this issue, a modified spiral search is proposed; it is explained in the following sections.

4. Proposed Method

The proposed method is divided into two parts: the modified spiral search order, and the Trace and Off-Diagonal Sum (ODS) match to find a nearly matching block.

4.1 Modified Spiral Search Order

In the modified spiral block search, the macro block of the current frame is matched against the corresponding macro block in the reference frame, with that position taken as the center of the search window, as shown in Figure 3. The numbers within the small blocks indicate the search order. The next block to search is selected by shifting more than two pixel positions, or N/4, up, down, left or right, because we observed that the SAD values of two blocks located one pixel apart do not differ greatly. The positional shift of the block in our method is linear, so fetching the search block from memory is easier than in the traditional order. Early termination is also employed: the SAD at the first position is taken as the initial minimum threshold, because a good match usually lies near the starting location. Whenever a sufficiently good match is found, the minimum SAD is replaced with the new value, until the local minimum is found. Before the SAD calculation,
Before the SAD calculation, Copyright ⓒ 2015 SERSC 25 International Journal of Communication Technology for Social Networking Services Vol.3, No.1 (2015) trace [17] and off diagonal sum are matched between the current and reference block is explained in the next section. In Table 1, number of average SAD calculations taken by the full search with different search order are given. Our proposed search order takes less number of SAD calculations compared to traditional search order. Figure 3. Proposed Modified Spiral Search Order Table 1. Average Number of SAD Calculations Required per Block Videos/ Scan Raster search Traditional Spiral search Akiyo 203 162 Modified Spiral search 128 Foot ball 212 170 132 Night 213 167 121 4.2 Trace and Off Diagonal Sum Match to Find Nearly Matching Block In matrix algebra, the sum of all diagonal elements is called trace and off-diagonal sum of a matrix [12], defined by the equation (2) and equation (3). Trace N 1 0 C ( x, x) (2) x N 1 x 0 0 N 1 Off _ diag C ( x, x) (3) x N 1 x 0 The Trace and off diagonal sum calculation has the following characteristics in terms matching the blocks: It contains two elements of each row and column of the entire block and therefore no Row or Column is left in the block matching. The diagonal elements considered for trace calculation are position wise linear and motion of pixels are uniform [17]. Hence, it is sufficient to find out a nearly matching block. If they match, then SAD is calculated to find the best match. The number of computations for trace and off diagonal seam is just O (2n) and for SAD calculation, it is O (n2) [21]. Hence, this method is more useful in reducing the arithmetic operation as required by the SAD calculation to find the match. In Table 2, average number of SAD calculations reduced after using trace and off diagonal sum along with spiral search to calculate one MV is tabulated. On an average 60 to 70 % SAD computations are skipped. 
Using the above two methods therefore saves a significant amount of execution time and power.

Table 2. Average Number of SAD Calculations with Trace and Off Diagonal Sum

| Video | Raster search | Traditional spiral search | Modified spiral search | Raster search with TR & ODS |
|---|---|---|---|---|
| Akiyo | 203 | 162 | 128 | 40 |
| Football | 212 | 170 | 132 | 36 |
| Night | 213 | 167 | 121 | 31 |

5. FPGA Implementation of Proposed Method

The top-level architecture of the modified spiral search with trace match is shown in Figure 4. It consists of two main modules: 1) the input frame processing block and 2) the trace and off-diagonal sum matching and SAD calculation block.

Figure 4. Block Diagram of Proposed Method

The Cfdatainput and Rfdatainput ports are used to load the current frame macro block and the reference macro block (candidate block) from memory when the load signal is high. The trace and off-diagonal sum of the blocks are then calculated by the trace unit. If they match, the load signal goes low. When the load signal is low, the motion estimation process begins, with even blocks sent to SAD Unit I and odd blocks to SAD Unit II. Each SAD unit has memory to store the calculated SAD values. The minimum SAD value, which represents the best match, is found by comparing the SAD values of the two units.

5.1 Current/Reference Frame Processing Unit

Figure 5 shows the RTL schematic for current/reference frame processing and the control signals used. The load signal controls the frames. When the "load" signal is high, the frames are read and stored in memory in the form of an array. A dedicated memory unit stores the pixel values serially, one per clock, to keep the process synchronous. For convenience, 3x3 current and reference blocks are used in the design. Nine clock cycles are required to load these blocks.
The frame counter keeps track of the number of frames read from the input; it is two bits wide because a pixel has at most four neighbors. The memory module provides input to the MB selector unit through a 4-bit "count" output, as shown in Figure 8. This "count" value controls the main memory unit. When the control signal "RST" is high or the "START" signal is low, the memory points to the 4th macro block location. Finally, a multiplexer unit decides which input block is passed to the motion estimation block; it has an 8-bit data line and an 8-bit current/reference number indicator.

Figure 5. Current / Reference Frame Processing Unit

Figure 6. Current/Reference Macro Block values

5.2 Trace Matching and SAD Calculation Unit

The motion vector and residue for the macro block values shown in Figure 4 are calculated as follows:

1) Overlap the current and reference frames.
2) Match the trace and the sum of diagonal elements; if they match, calculate the SAD.
3) Find the macro block with the minimum SAD.
4) Find the residue for the block with the minimum SAD.
5) Find the motion vector for the block with the minimum SAD.

Consider the case in which the current block is "2" and the reference block is "8", as shown in Figure 6. When "8" is the reference block, it has five neighboring blocks: 9, 6, 5, 4 and 7. When block 2 is subtracted from each of the neighbors, the block that gives the least SAD is at location 6 [23]. Hence, for current block 2 and reference block 8, the motion vector is (2,6) and the residue is 2. All other locations were considered in the same way and the simulation results were verified. At the output stage, this unit has a motion vector processor that stores the position of the matched block and the residue. Figure 7 shows the RTL schematic of the trace and SAD calculation unit.
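The matching steps listed in Section 5.2 can be sketched in software as follows. This is a hedged Python illustration, not a model of the RTL. Several details are assumptions introduced for the example: the ring-by-ring visiting order in `spiral_offsets` (Figure 3 is not reproduced in the text), the parameters `step`, `tol` and `good_enough`, and the definition of the residue as the element-wise difference between the current block and the best reference block.

```python
import numpy as np


def spiral_offsets(p, step):
    """Yield (dy, dx) offsets outward from the centre, ring by ring.

    Assumed ordering: square rings at distance step, 2*step, ... up to p.
    """
    yield (0, 0)
    for r in range(step, p + 1, step):
        seen = set()
        for d in range(-r, r + 1, step):
            # one point on each side of the square ring
            for off in ((-r, d), (d, r), (r, -d), (-d, -r)):
                if off not in seen:
                    seen.add(off)
                    yield off


def signature(block):
    """Trace and off-diagonal sum of a square block."""
    return int(np.trace(block)), int(np.trace(np.fliplr(block)))


def estimate_block(cur_block, ref_frame, top, left, p=8, step=4,
                   tol=0, good_enough=0):
    """Return (mv, sad, residue, sad_count); mv is None if nothing passes."""
    n = cur_block.shape[0]
    h, w = ref_frame.shape
    cur = cur_block.astype(np.int64)
    cur_tr, cur_ods = signature(cur)
    best_mv, best_sad, best_res, sad_count = None, None, None, 0
    for dy, dx in spiral_offsets(p, step):
        y, x = top + dy, left + dx
        if y < 0 or x < 0 or y + n > h or x + n > w:
            continue  # candidate outside the reference frame
        ref = ref_frame[y:y + n, x:x + n].astype(np.int64)
        ref_tr, ref_ods = signature(ref)
        if abs(cur_tr - ref_tr) > tol or abs(cur_ods - ref_ods) > tol:
            continue  # trace/ODS mismatch: skip the SAD entirely
        sad_count += 1
        residue = cur - ref
        cost = int(np.abs(residue).sum())
        if best_sad is None or cost < best_sad:
            best_mv, best_sad, best_res = (dy, dx), cost, residue
        if cost <= good_enough:
            break  # early termination on a sufficiently good match
    return best_mv, best_sad, best_res, sad_count
```

With `p=8` and `step=4` the generator visits 25 positions instead of 289, and `sad_count` reports how many of those survived the trace check, which mirrors the kind of counts tabulated in Tables 1 and 2.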
The motion vector is a 16-bit output that shows the relative motion of the frames, and the residue output, a 9-bit signed number, indicates the energy.

Figure 7. RTL schematic of Trace Matching and SAD Calculation Unit

5.3 Implementation Results

A test bench is written for every individual unit to verify its functionality. The units are then integrated into a top-level module to check the final results. Xilinx ISE 14.3 is used for simulation and Xilinx XST for synthesis. The simulation waveforms of the current/reference frame module, the trace with SAD calculation and the motion vector are shown in Figure 8.

Figure 8. Functional Simulation Waveforms

5.3.1 Estimation of Area and Power

Power, area and delay are the three major constraints for any digital design. To determine them, synthesis is performed with Xilinx XST, PlanAhead is used to estimate the area, and XPower Analyzer is used for power estimation. Figure 9 and Table 3 show the area estimation and resource utilization report. The power consumption, estimated by considering possible switching activity, is listed in Table 4.

Figure 9. Synthesis Report (Resource Utilization)

Table 3. Resource Utilization after

| Parameter | Utilized resource | % Utilization |
|---|---|---|
| Slice registers | 907 out of 18224 | 4 |
| Slice LUT | 3357 out of 9112 | 36 |
| Memory | 236 out of 2176 | 10 |
| Bonded IOBs | 45 out of 232 | 19 |

Table 4. Power Consumption Summary

| Implementation resources | Power used (mW) |
|---|---|
| Clocks | 1.69 |
| Logic | 2.33 |
| Signals | 1.86 |
| IOs | 9.04 |
| Quiescent | 20.04 |
| Total | 34.96 |

The total power consumption is 34.96 mW per MV calculation, which is low compared with other implementations. The synthesis report also gives a maximum operating clock frequency of 297.390 MHz and a critical-path delay of 5.134 ns.

6.
Proposed Method Analysis and Discussions

The proposed method is analyzed with respect to two parameters: 1) speed-up (average execution time, CPU speed only) and 2) coding quality (PSNR). The results are tabulated together with those of other techniques, namely Full Search, Three Step Search (3SS), Four Step Search (4SS) and Diamond Search (DS), for comparison on the standard videos Foreman, Caltrain, Stefan and Tennis.

6.1 Speed Up

Speed-up indicates the average time taken by an algorithm to find one motion vector. As Table 5 shows, the proposed method (SPWT) takes about 48 ms on average to find one MV, which is roughly 2.6 times faster than the approximately 127 ms that full search takes on most sequences.

Table 5. Average Time (ms) Taken to Calculate a MV per Block

| Video | FSA | TSS | 4SS | DS | SPWOT | SPWT |
|---|---|---|---|---|---|---|
| Foreman | 127.24 | 64.34 | 68.12 | 52.72 | 52.42 | 48.12 |
| Caltrain | 127.42 | 64.24 | 66.27 | 53.12 | 50.19 | 47.42 |
| Stefan | 12.8 | 65.02 | 64.17 | 54.84 | 52.40 | 48.12 |
| Tennis | 127.14 | 64.98 | 65.18 | 55.18 | 52.00 | 48.00 |

6.2 Coding Quality

The coding quality is indicated by the quality of the reconstructed frame, characterized by the Peak Signal-to-Noise Ratio (PSNR) given in Equation 4.

$$PSNR = 10 \log_{10} \frac{255^2}{MSE} \qquad (4)$$

Here MSE is the mean squared error between the original frames and those compensated by the motion vectors. The degradation ratio of PSNR ($D_{PSNR}$) is the ratio of the PSNR difference between FS and the modified spiral search to the PSNR of FS, as expressed in Equation 5; the same definition is applied to the standard algorithms.

$$D_{PSNR} = \frac{PSNR_{FS} - PSNR_{SPWTM}}{PSNR_{FS}} \qquad (5)$$

Table 6 lists the PSNR and $D_{PSNR}$ of the different algorithms on the four video sequences; the ratio is listed there as a percentage, and negative entries denote a PSNR below that of full search.

Table 6.
PSNR and Degraded Ratio (DPSNR) of Proposed and Other Algorithms

| Algorithm | Container PSNR | Container DPSNR | Foreman PSNR | Foreman DPSNR | Caltrain PSNR | Caltrain DPSNR | Tennis PSNR | Tennis DPSNR |
|---|---|---|---|---|---|---|---|---|
| FS | 43.18 | 0 | 31.69 | 0 | 31.51 | 0 | 35.74 | 0 |
| TSS | 43.10 | -0.18 | 29.37 | -7.32 | 30.27 | -3.92 | 30.58 | -14.52 |
| 4SS | 43.12 | -0.13 | 29.34 | -7.44 | 30.24 | -4.01 | 30.62 | -14.32 |
| DS | 43.14 | -0.09 | 31.19 | -1.59 | 31.26 | -0.79 | 33.98 | -4.92 |
| SPWOT | 43.20 | 0.04 | 31.78 | 0.02 | 31.52 | 0.00 | 35.46 | -0.08 |
| SPWT | 42.78 | -0.92 | 29.98 | -5.28 | 31.20 | -1.60 | 32.98 | -7.60 |

*SPWOT = Spiral Search Without Trace, SPWT = Spiral Search With Trace

For the slow-moving sequence Container, the PSNR values (and DPSNR ratios) of all block matching algorithms are similar. For medium-motion sequences such as Caltrain and Foreman, the fixed-pattern algorithms (TSS, 4SS and NTSS) exhibit the worst PSNR values (high DPSNR ratios), with the exception of the DS algorithm. The high-motion sequences such as Stefan and Tennis show the same degradation. Because the motion content of these sequences is complex, performance generally becomes worse for most of the algorithms. However, DS and the proposed method, with and without trace, show a lower degradation ratio and give a video quality that is almost equal to that of full search. We achieve a reduction of nearly 75 to 85% in SAD calculations with a small degradation in video quality that is well within the tolerance level.

7. Conclusion

A trace-based full-search block motion estimation method with a modified spiral search order has been simulated and implemented on an FPGA. The reduction in the computations of the motion estimation process has been achieved in two steps.
In the first step, the modified spiral search with early termination gives better efficiency than the traditional search order. The second step matches the trace and off-diagonal sum of the blocks before the SAD is computed to find the best match. This significantly reduces the number of SAD calculations needed to find the motion vector. Experimental results indicate a significant improvement in performance over conventional full search, with a computational complexity reduction of 80 to 85% at an acceptable degradation ratio. The synthesis report for a Spartan-6 FPGA shows a hardware cost of about 35k and a maximum operating clock frequency of 297.39 MHz with a power consumption of 34.956 mW. Hence, the proposed method is well suited for real-time applications.

References

[1] I. E. G. Richardson, “H.264 and MPEG-4 Video Compression – Video Coding for Next Generation Multimedia”, John Wiley & Sons, ISBN: 0-470-84837-5, (2003).
[2] I. E. G. Richardson, “Video Codec Design – Developing Image and Video Compression Systems”, John Wiley & Sons, ISBN: 0-470-84783-2, (2002).
[3] M. J. Chen, L. G. Chen, and T. D. Chiueh, “One-dimensional full search motion estimation algorithm for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, (1994) October, pp. 504–509.
[4] I. Ahmad, W. Zheng, J. Luo, and M. Liou, “A fast adaptive motion estimation algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, (2006) March, pp. 420–438.
[5] W. I. Chong, B. Jeon, and J. Jeong, “Fast motion estimation with modified diamond search for variable motion block sizes,” in Proc. Int. Conf. Image Process., vol. 3, (2003) September, pp. 371–374.
[6] R. Li, B. Zeng, and M. L. Liou, “A new three-step search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 4, (1994) August, pp. 438–442.
[7] L. M. Po and W. C.
Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, June, pp. 313–317.
[8] I. Ahmad, W. Zheng, J. Luo, and M. Liou, “A fast adaptive motion estimation algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 3, (2006) March, pp. 420–438.
[9] S. Goel, Y. Ismail, and M. A. Bayoumi, “Adaptive search window size algorithm for fast motion estimation in H.264/AVC standard,” in Proc. Midwest Symp. Circuits Syst., (2005) August, pp. 1557–1560.
[10] Z. Yang, J. Bu, C. Chen, and X. Li, “Fast predictive variable-block-size motion estimation for H.264/AVC,” in Proc. IEEE Int. Conf. Multimedia Expo, (2005) July, pp. 1–4.
[11] W. Li and E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Trans. Image Process., vol. 4, no. 1, (1995) January, pp. 105–107.
[12] X. Q. Gao, C. J. Duanmu, and C. R. Zou, “A multilevel successive elimination algorithm for block matching motion estimation,” IEEE Trans. Image Process., vol. 9, no. 3, (2000) March, pp. 501–504.
[13] M. Brünig and W. Niehsen, “Fast full-search block matching,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 2, (2001) February, pp. 241–247.
[14] S. Goel, Y. Ismail, P. Devulapalli, J. McNeely, and M. Bayoumi, “An efficient data reuse motion estimation engine,” in Proc. IEEE Workshop SIPS, (2006) October, pp. 383–386.
[15] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, “On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, (2002) January, pp. 61–72.
[16] S. Senagupta and V. S. K. Reddy, “A Fast and Efficient Predictive Block Matching Motion estimation”, IJCNSN, vol. 7, no. 12, (2007) December.
[17] S. Sundaravadivelu and S.
Jayakumar, “An Efficient Motion Estimation Algorithm using Trace Match for Fast Video Compression”, European Journal of Scientific Research, ISSN 1450-216X, vol. 53, no. 4, (2011), pp. 546-554.
[18] N. Song and T. Shimato, “A Novel Spiral Type Motion Estimation Architecture for H.264/AVC”, Journal of Semiconductor Technology and Science, vol. 10, no. 1, (2010) March.
[19] A. C. Vikram and S. R. Laddaha, “A Novel dual processing Architecture for implementation of Motion estimation Unit of H.264 AVC on FPGA”, 2009 IEEE Symposium on Industrial Electronics and Applications (ISIEA 2009), Kuala Lumpur, Malaysia.
[20] L. Kulkarni, T. M. Manu and B. S. Anami, “A Two-Step Methodology for Minimization of Computational Overhead in Block Motion Estimation”, International Journal of u- and e-Service, Science and Technology, vol. 7, no. 4, (2014), pp. 339-348.