A Fast Trace Based Spiral Search Architecture for Motion Estimation

International Journal of Communication Technology for Social Networking Services
Vol.3, No.1 (2015), pp.23-32
http://dx.doi.org/10.14257/ijctsns.2015.3.1.03
A Fast Trace Based Spiral Search Architecture for Motion
Estimation and its Implementation Using FPGA
1Manu T. M, 2Linganagoud Kulkarni and 3Basavaraj S. Anami
1,3K.L.E Institute of Technology, Hubli, India
2B.V.B College of Engineering and Technology, Hubli, India
1manutmece@yahoo.com, 2linganagoud@yahoo.co.uk, 3anami_basu@hotmail.com
Abstract
H.264/AVC offers many advanced coding tools that achieve a compression ratio up to 50%
higher than previous standards. These coding tools substantially increase the
computational complexity of Motion Estimation (ME), which consumes up to 85% of the
entire encoder's computations. In this paper, we propose a computationally efficient and
accurate model that skips some of the computations to speed up the full search block
motion estimation algorithm. Instead of directly calculating the Sum of Absolute
Differences (SAD) to exploit the motion activity between adjacent frames, in the first
step we examine the motion activity by comparing the trace and off-diagonal sum of the
current frame block with those of the candidate block in the previous frame. If the
values match exactly or are highly similar, then in the second step the SAD is calculated
to find the best match; otherwise, the candidate block is skipped. In the traditional
spiral search, the search point selection is not hardware friendly. We therefore also use
a modified spiral search order that is easy to implement in hardware: it begins at the
center of the search window and expands in a spiral fashion until the boundary of the
search window is reached or a sufficiently good match is found. Simulation results show
that trace matching and the modified spiral search reduce the computations by 80 to 85%
while maintaining good compression quality. The synthesis report for a Spartan-6 FPGA
device shows a maximum operating clock frequency of 297.39 MHz with a power consumption
of 34.96 mW.
Keywords: Motion Estimation, Trace, Spiral search, FPGA, Synthesis
1. Introduction
Because of the limited bandwidth and storage space available for high-quality multimedia
content such as video broadcasting and DVD video data, video compression has become
necessary to keep up with the ever-growing demand while maintaining the quality of the
decoded video. Typically, in a video sequence, multiple consecutive frames are similar to
each other. This redundancy, called temporal redundancy, is exploited by video
compression algorithms to achieve better compression. Motion estimation (ME) is the
process that exploits the redundancy between two consecutive frames. After the first
frame is transmitted, the next frame in the video sequence is coded only as its
difference from the previous frame. Among the many existing ME algorithms, the
full-search (FS) block-matching algorithm is a popular one. The concept of block
matching, which finds the movement of each block between the previous and current frames
as displacement vectors, called motion vectors (MVs), is depicted in Figure 1. In the
full search algorithm, the motion vector is calculated in two stages: the calculation of
the SAD for each displacement vector, followed by the selection of the smallest SAD
value.
ISSN: IJCTSNS
Copyright ⓒ 2015 SERSC
Figure 1. Block Matching Motion Estimation
Motion estimation is a computationally intensive process that consumes more than 85% of
the encoder's encoding time. Most of the available algorithms exhibit a trade-off between
quality and speed. Since ME is scene dependent, no single technique reliably produces
good visual quality. A combination of techniques is therefore needed, such as the choice
of motion starting point, motion search patterns and adaptive search control to terminate
the search, to make ME robust. Hence, there is a requirement for improved algorithms and
hardware architectures that are suitable for real-time applications. A literature survey
was carried out to identify the state-of-the-art methods.
2. Literature Survey
Block based Motion Estimation (BME) has been adopted in all the existing video coding
standards [2-5] to reduce the temporal redundancy between frames. Full search involves
the computation of the SAD at each location in the search window. For a search window of
+/- P pixels, the number of search locations is (2P+1)^2. For a search window of 32x32
and a block size of 16x16, a total of 289 locations are searched to find the best match
with the minimum SAD value. This results in significant computational complexity. Many
algorithms have been proposed to reduce the computational complexity of full search
motion estimation. Some of the popular ones are the Three Step Search (TSS [4]), the New
Three Step Search (NTSS [5]), the Four Step Search (4SS [6]), the Diamond Search (DS [7])
and the Adaptive Rood Pattern Search (ARPS [8-9]). These algorithms perform a small
square (TSS, NTSS, 4SS) or diamond shaped (DS) search around a search center and refine
the search around the best matching block. Early termination techniques based on SAD
threshold values are used to reduce the computation cost. Algorithms like ARPS employ
sophisticated search center prediction to choose the start point. Although these
algorithms reduce the computational cost well, their PSNR performance only approaches
that of the FS algorithm.
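As a point of reference for the counts quoted above, the following is a minimal sketch of exhaustive (full search) block matching in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `sad` and `full_search`, the list-of-lists frame layout and the choice to skip out-of-frame candidates are all hypothetical. With a 16x16 block, P = 8 and a block centered in a 32x32 reference area, it evaluates (2P+1)^2 = 289 candidates.

```python
from typing import List, Optional, Tuple

Block = List[List[int]]


def sad(cur: Block, ref: Block, ry: int, rx: int) -> int:
    """Sum of absolute differences between `cur` and the window of `ref`
    whose top-left corner is (ry, rx)."""
    n = len(cur)
    return sum(abs(cur[y][x] - ref[ry + y][rx + x])
               for y in range(n) for x in range(n))


def full_search(cur: Block, ref: Block, cy: int, cx: int,
                p: int) -> Tuple[Tuple[int, int], Optional[int], int]:
    """Test every displacement (dy, dx) in [-p, p] x [-p, p] around (cy, cx).

    Returns (best displacement, best SAD, number of SAD evaluations).
    Candidates that fall outside the reference frame are skipped
    (an assumption of this sketch).
    """
    n = len(cur)
    h, w = len(ref), len(ref[0])
    best_mv, best_sad, evaluated = (0, 0), None, 0
    for dy in range(-p, p + 1):          # raster order: row by row
        for dx in range(-p, p + 1):
            ry, rx = cy + dy, cx + dx
            if ry < 0 or rx < 0 or ry + n > h or rx + n > w:
                continue
            cost = sad(cur, ref, ry, rx)
            evaluated += 1
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dy, dx), cost
    return best_mv, best_sad, evaluated
```

Every candidate costs N^2 absolute differences, which is the work that the fast algorithms surveyed here try to avoid.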
Full search is still attractive for high-end applications because of its PSNR
performance, at the cost of increased computation and power consumption. Many hardware
architectures have been proposed for full search motion estimation. These architectures
compute the SAD at all locations in the search window. Popular hardware architectures are
the partial propagate SAD architecture [13], the SAD tree [14], the 1D tree architecture
[15] and the 2D architecture [15]. These architectures access the blocks in raster scan
order to maximize pixel reuse. Their main disadvantage is that they can neither start the
search at the search center of the search window of the reference frame nor apply early
termination techniques.
To address the above issues, we propose a modified spiral search order for motion
estimation in which the search step is more than one pixel, unlike the traditional
search. This results in the skipping of a certain number of blocks. To speed up the
search further, the trace and off-diagonal sum of the current frame block are matched
against those of the reference frame block to find a nearly matching block. If the values
match, the SAD is calculated to find the best match. Thus, there is a significant
reduction in SAD computations as well as in arithmetic operations, and the time required
to find the motion vector is drastically reduced.
The rest of the paper is organized as follows. Section 3 explains spiral search motion
estimation. Section 4 describes the proposed methodology. Section 5 explains the FPGA
implementation details. Section 6 presents the analysis and discussion of the method.
Finally, Section 7 draws the conclusion.
3. Spiral Search Motion Estimation
Among the motion estimation algorithms and architectures proposed so far, spiral search
based motion estimation is more efficient because the best matching blocks are
concentrated around the search center, and the probability of finding the best match
decreases with increasing distance, as shown in Figure 2(a).
Figure 2. a) Precedence of Best Match b) Traditional Search Order
The spiral search used in the H.264/AVC reference software JM18 is shown in Figure 2(b).
The numbers in the small squares indicate the search order. Number zero is the search
center, where the search starts. The search order in H.264 is based only on the
probability of the selected points and the predicted search center, which makes it
difficult to implement in hardware [18]. To overcome this issue, a modified spiral search
is proposed; it is explained in the following sections.
4. Proposed Method
The proposed method is divided into two parts: the modified spiral search order, and the
trace and off-diagonal sum (ODS) match to find a nearly matching block.
4.1 Modified Spiral Search Order
In the modified spiral block search, the macro block of the current frame is matched
against the corresponding macro block in the reference frame, with that position as the
center of the search window, as shown in Figure 3. The numbers within the small blocks
indicate the search order. The next block to search is selected by shifting more than two
pixel positions, or by N/4, up, down, left or right, because we observed that there is no
large difference between the SAD values of two blocks located one pixel away from each
other. The positional shift of the block in our method is linear, and hence accessing the
search block from memory is easier than in the traditional order.
An early termination technique is also employed, taking the SAD value at the first
position as the minimum threshold, because a good match will usually be near the starting
location. If a sufficiently good match is found, the minimum SAD is replaced with the new
value, until the local minimum is found. Before the SAD calculation, the trace [17] and
off-diagonal sum of the current and reference blocks are matched, as explained in the
next section. Table 1 gives the average number of SAD calculations taken by the full
search with different search orders. Our proposed search order takes fewer SAD
calculations than the traditional search order.
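The search order and early termination described in this section can be sketched as follows. This is only one possible reading, since the exact numbering of Figure 3 is not reproduced here: the sketch visits the center first and then square rings whose radius grows by `step` (for example N/4 = 4 for a 16x16 block), and it assumes that the search stops as soon as a complete ring brings no improvement over the current minimum. The names `spiral_offsets` and `spiral_search`, the ring traversal direction and the stopping rule are all assumptions of this illustration.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Offset = Tuple[int, int]


def spiral_offsets(p: int, step: int) -> Iterator[List[Offset]]:
    """Yield rings of (dy, dx) offsets: the centre first, then square rings
    of radius step, 2*step, ... up to p, each sampled every `step` pixels."""
    yield [(0, 0)]
    r = step
    while r <= p:
        ring: List[Offset] = []
        ring += [(-r, dx) for dx in range(-r, r, step)]   # top edge, left to right
        ring += [(dy, r) for dy in range(-r, r, step)]    # right edge, top to bottom
        ring += [(r, dx) for dx in range(r, -r, -step)]   # bottom edge, right to left
        ring += [(dy, -r) for dy in range(r, -r, -step)]  # left edge, bottom to top
        yield ring
        r += step


def spiral_search(cost: Callable[[int, int], int], p: int,
                  step: int) -> Tuple[Offset, Optional[int], int]:
    """Visit candidates ring by ring, keeping the smallest cost seen so far.

    The cost at the centre is the initial minimum. The search stops when a
    whole ring fails to improve on it (assumed local-minimum rule).
    Returns (best offset, best cost, number of cost evaluations).
    """
    best_mv, best, evaluated = (0, 0), None, 0
    for ring in spiral_offsets(p, step):
        improved = False
        for dy, dx in ring:
            c = cost(dy, dx)
            evaluated += 1
            if best is None or c < best:
                best_mv, best, improved = (dy, dx), c, True
        if not improved:
            break
    return best_mv, best, evaluated
```

Here `cost` stands for whatever block-matching measure is used at a displacement, for example the SAD. With p = 8 and step = 4 the rings contain 1, 8 and 16 candidates, so at most 25 positions are visited instead of 289.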
Figure 3. Proposed Modified Spiral Search Order
Table 1. Average Number of SAD Calculations Required per Block
Video       Raster search   Traditional spiral search   Modified spiral search
Akiyo       203             162                         128
Football    212             170                         132
Night       213             167                         121
4.2 Trace and Off Diagonal Sum Match to Find Nearly Matching Block
In matrix algebra, the sum of the main-diagonal elements of a matrix is called its trace,
and the sum of the elements on the opposite diagonal is called its off-diagonal sum [12].
For an N x N block C, they are defined by Equation (2) and Equation (3).
$$ \mathrm{Trace} = \sum_{x=0}^{N-1} C(x, x) \tag{2} $$
$$ \mathrm{Off\_diag} = \sum_{x=0}^{N-1} C(x, N-1-x) \tag{3} $$
The trace and off-diagonal sum calculation has the following characteristics for block
matching. It contains two elements of each row and column of the entire block, so no row
or column is left out of the match. The diagonal elements considered for the trace
calculation are positionally linear, and the motion of pixels is uniform [17]. Hence,
they are sufficient to find a nearly matching block. If they match, the SAD is calculated
to find the best match. The number of computations for the trace and off-diagonal sum is
just O(2n), whereas for the SAD calculation it is O(n^2) [21]. Hence, this method reduces
the arithmetic operations that the SAD calculation requires to find the match. Table 2
tabulates the average number of SAD calculations needed to calculate one MV after using
the trace and off-diagonal sum along with the spiral search. On average, 60 to 70% of the
SAD computations are skipped. Hence, by using the above two methods, a significant amount
of execution time and power is saved.
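A minimal sketch of the trace and off-diagonal sum pre-check follows. The function names and the `tol` parameter are hypothetical: the paper speaks of values that are "exactly matched or highly similar" but does not give a numeric threshold, so `tol = 0` (exact match) is used as the default here. For a 16x16 block the pre-check reads 2N = 32 pixels per block, against N^2 = 256 for a full SAD.

```python
from typing import List

Block = List[List[int]]


def trace(block: Block) -> int:
    """Sum of the main-diagonal elements C(x, x)."""
    return sum(block[i][i] for i in range(len(block)))


def off_diagonal_sum(block: Block) -> int:
    """Sum of the elements on the opposite diagonal, C(x, N-1-x)."""
    n = len(block)
    return sum(block[i][n - 1 - i] for i in range(n))


def is_candidate(cur: Block, ref: Block, tol: int = 0) -> bool:
    """True when both the trace and the off-diagonal sum of `ref` are within
    `tol` of those of `cur`; only such blocks go on to the SAD stage."""
    return (abs(trace(cur) - trace(ref)) <= tol
            and abs(off_diagonal_sum(cur) - off_diagonal_sum(ref)) <= tol)
```

A block that fails `is_candidate` is skipped without computing its SAD, which is where the saving reported in Table 2 comes from.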
Table 2. Average Number of SAD Calculations with Trace and Off-Diagonal Sum
Video       Raster search   Traditional spiral search   Modified spiral search   Raster search with TR & ODS
Akiyo       203             162                         128                      40
Football    212             170                         132                      36
Night       213             167                         121                      31
5. FPGA Implementation of Proposed Method
The top-level architecture of the modified spiral search with trace match is shown in
Figure 4. It consists of two main modules: 1) the input frame processing block and 2) the
trace and off-diagonal sum matching and SAD calculation block.
Figure 4. Block Diagram of Proposed Method
The Cfdatainput and Rfdatainput ports are used to load the current frame macro block and
the reference macro block (candidate block) from memory when the load signal is high. The
trace and off-diagonal sum of the blocks are then calculated by the trace unit. If they
match, the load signal becomes low. When the load signal is low, the motion estimation
process begins, with the even blocks processed in SAD Unit I and the odd blocks in SAD
Unit II. Each SAD unit has memory to store the calculated SAD values. The minimum SAD
value, which represents the best match, is found by comparing the SAD values of the two
units.
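The split of work between the two SAD units can be modelled in software as below. This is a behavioural sketch only, not the RTL: the function name `two_unit_min`, the use of a list index to decide even and odd blocks, and the tie-breaking toward the lower index are assumptions of the illustration.

```python
from typing import List, Tuple


def two_unit_min(sads: List[int]) -> Tuple[int, int]:
    """Model of two SAD units: even-indexed candidates go to unit I,
    odd-indexed candidates to unit II. Each unit keeps its own minimum,
    and a final comparison picks the overall best.

    Returns (index of best candidate, its SAD).
    """
    if not sads:
        raise ValueError("no SAD values to compare")
    unit1 = [(s, i) for i, s in enumerate(sads) if i % 2 == 0]  # SAD Unit I
    unit2 = [(s, i) for i, s in enumerate(sads) if i % 2 == 1]  # SAD Unit II
    # Local minimum of each non-empty unit; ties resolve to the lower index.
    local_minima = [min(unit) for unit in (unit1, unit2) if unit]
    best_sad, best_idx = min(local_minima)
    return best_idx, best_sad
```

In hardware the two units would run concurrently; the model only shows that comparing the two local minima gives the same result as a single pass over all candidates.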
5.1 Current/Reference Frame Processing Unit
Figure 5 shows the RTL schematic for current/reference frame processing and the control
signals used. The “load” signal controls the frames: when it is high, the frames are read
and stored in memory in the form of an array. A dedicated memory unit stores the pixel
values one by one, serially, on each clock so that the process is synchronous. A 3x3
current and reference block is used in the design for convenience, so nine clock cycles
are required to load these blocks. The frame counter keeps track of the number of frames
read from the input; it is two bits wide, as there will be a maximum of four neighbors to
a pixel. The memory module provides input to the MB selector unit via a 4-bit “count”
output, as shown in Figure 8. This “count” value controls the main memory unit. When the
control signal “RST” is high or the “START” signal is low, the memory points to the 4th
macro block location. Finally, a multiplexer unit decides which input block is given to
the motion estimation block; it consists of an 8-bit data line and an 8-bit
current/reference number indicator.
Figure 5. Current/Reference Frame Processing Unit
Figure 6. Current/Reference Macro Block Values
5.2 Trace Matching and SAD Calculation Unit
The motion vector and residue for the macro block values shown in Figure 4 are calculated
as follows:
• Overlap the current and reference frames.
• Match the trace and the sum of the off-diagonal elements; if they match:
• Calculate the SAD.
• Find the macro block with the minimum SAD.
• Find the residue for the block with the minimum SAD.
• Find the motion vector for the block with the minimum SAD.
Consider the case when the current block is “2” and the reference block is “8”, as shown
in Figure 6. When “8” is the reference block, it has five neighboring blocks, i.e., 9, 6,
5, 4 and 7. When block 2 is compared with these neighbors by subtraction, the block that
gives the least SAD is at location 6 [23]. Hence, for current block 2 and reference block
8, we get the motion vector (2,6) and a residue of 2. Similarly, all other locations were
considered and the simulation results were verified.
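The matching steps of this section (pre-check on trace and off-diagonal sum, SAD, minimum selection, residue and motion vector) can be put together in a short software sketch. It is a hedged illustration rather than the hardware data path: `match_block`, the dictionary of candidate blocks keyed by position, the `tol` parameter and the choice to return the residue as an element-wise difference block are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

Block = List[List[int]]
Pos = Tuple[int, int]


def trace(b: Block) -> int:
    return sum(b[i][i] for i in range(len(b)))


def off_diagonal_sum(b: Block) -> int:
    n = len(b)
    return sum(b[i][n - 1 - i] for i in range(n))


def sad(a: Block, b: Block) -> int:
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def match_block(cur: Block, candidates: Dict[Pos, Block], tol: int = 0
                ) -> Tuple[Optional[Pos], Optional[int], Optional[Block], int]:
    """Return (best position, best SAD, residue block, SADs computed).

    Candidates whose trace or off-diagonal sum differs from that of `cur`
    by more than `tol` are skipped before any SAD is computed.
    """
    t_cur, o_cur = trace(cur), off_diagonal_sum(cur)
    best_pos, best_sad, best_ref, n_sad = None, None, None, 0
    for pos, ref in candidates.items():
        if abs(trace(ref) - t_cur) > tol or abs(off_diagonal_sum(ref) - o_cur) > tol:
            continue                      # pre-check failed: no SAD for this block
        n_sad += 1
        cost = sad(cur, ref)
        if best_sad is None or cost < best_sad:
            best_pos, best_sad, best_ref = pos, cost, ref
    if best_ref is None:                  # nothing passed the pre-check
        return None, None, None, n_sad
    # Residue taken here as the element-wise difference (assumption).
    residue = [[p - q for p, q in zip(rc, rr)] for rc, rr in zip(cur, best_ref)]
    return best_pos, best_sad, residue, n_sad
```

With 8-bit pixels each residue element lies between -255 and 255, which fits a 9-bit signed number.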
At the output stage, this unit has a motion vector processor that stores the position of
the matched block and the residue. Figure 7 shows the RTL schematic of the trace and SAD
calculation unit. The motion vector is a 16-bit output that shows the relative motion of
the frames, and the residue output, a 9-bit signed number, indicates the energy.
Figure 7. RTL schematic of Trace Matching and SAD Calculation Unit
5.3 Implementation Results
A test bench was written for each individual unit to verify its functionality. The units
were then integrated into a top-level module to check the final results. Xilinx ISE 14.3
was used for simulation and Xilinx XST for synthesis. The simulation waveforms of the
current/reference frame module, the trace with SAD calculation and the motion vector are
shown in Figure 8.
Figure 8. Functional Simulation Waveforms
5.3.1 Estimation of Area and Power
Power, area and delay are three major constraints for any digital design. To evaluate
them, synthesis was done using Xilinx XST, PlanAhead was used to estimate the area and
XPower Analyzer was used for power estimation. Figure 9 and Table 3 show the area
estimation and resource utilization report. The power consumption, estimated by
considering the possible switching activity, is listed in Table 4.
Figure 9. Synthesis Report (Resource Utilization)
Table 3. Resource Utilization after Implementation
Parameter         Utilized resource    % Utilization
Slice registers   907 out of 18224     4
Slice LUT         3357 out of 9112     36
Memory            236 out of 2176      10
Bonded IOBs       45 out of 232        19
Table 4. Power Consumption Summary
Resources    Power used (mW)
Clocks       1.69
Logic        2.33
Signals      1.86
IOs          9.04
Quiescent    20.04
Total        34.96
The total power consumption is 34.96 mW per MV calculation, which is very low compared
with other implementations. The synthesis report also gives a maximum operating clock
frequency of 297.390 MHz and a critical path delay of 5.134 ns.
6. Proposed Method Analysis and Discussions
The proposed method is analyzed for two parameters: 1) speed-up (average execution time,
CPU speed only) and 2) coding quality (PSNR). The results are tabulated along with those
of other techniques, namely Full Search, Three Step Search (3SS), Four Step Search (4SS)
and Diamond Search (DS), for comparison on the standard videos Foreman, Caltrain, Stefan
and Tennis.
6.1 Speed Up
Speed-up indicates the average time taken by the algorithms to find one motion vector.
Our proposed method takes an average of 48 ms to find one MV, as shown in Table 5, and it
is 3 to 4 times faster than full search.
Table 5. Average Time (ms) Taken to Calculate an MV per Block
Video      FSA      TSS     4SS     DS      SPWOT   SPWT
Foreman    127.24   64.34   68.12   52.72   52.42   48.12
Caltrain   127.42   64.24   66.27   53.12   50.19   47.42
Stefan     12.8     65.02   64.17   54.84   52.40   48.12
Tennis     127.14   64.98   65.18   55.18   52.00   48.00
6.2 Coding Quality
The coding quality is indicated by the quality of the reconstructed frame, characterized
by the Peak Signal-to-Noise Ratio (PSNR) as in Equation 4.
$$ PSNR = 10 \log_{10}\left(\frac{255^2}{MSE}\right) \tag{4} $$
where MSE is the mean square error between the original frames and those compensated by
the motion vectors. The degradation ratio of PSNR (DPSNR) is the ratio of the PSNR
difference between FS and the modified spiral search to the PSNR of FS; the same is
applied to the standard algorithms, as expressed in Equation 5.
$$ D_{PSNR} = -\left(\frac{PSNR_{FS} - PSNR_{SPWTM}}{PSNR_{FS}}\right) \tag{5} $$
Table 6 tabulates the PSNR and DPSNR of the different algorithms on the four video
sequences.
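The two quality measures can be computed as in the sketch below. The function names are hypothetical, frames are assumed to be equally sized lists of 8-bit pixel rows, and the degradation ratio is scaled to a percentage, which is an assumption based on the magnitude of the DPSNR values listed in Table 6 (for example, PSNR values of 31.69 for FS and 29.37 for TSS on Foreman give -7.32).

```python
import math
from typing import List

Frame = List[List[int]]


def mse(original: Frame, compensated: Frame) -> float:
    """Mean square error between two frames of equal size."""
    total = sum((p - q) ** 2
                for ro, rc in zip(original, compensated)
                for p, q in zip(ro, rc))
    count = len(original) * len(original[0])
    return total / count


def psnr(original: Frame, compensated: Frame) -> float:
    """PSNR in dB for 8-bit pixels; infinite when the frames are identical."""
    m = mse(original, compensated)
    if m == 0:
        return math.inf
    return 10.0 * math.log10(255 ** 2 / m)


def dpsnr_percent(psnr_fs: float, psnr_alg: float) -> float:
    """Degradation ratio relative to full search, as a percentage.
    Negative values mean the algorithm's PSNR is lower than that of FS."""
    return -(psnr_fs - psnr_alg) / psnr_fs * 100.0
```

For instance, `dpsnr_percent(31.69, 29.37)` rounds to -7.32.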
Table 6. PSNR and Degradation Ratio (DPSNR) of Proposed and Other Algorithms
Algorithm   Container         Foreman           Caltrain          Tennis
            PSNR    DPSNR     PSNR    DPSNR     PSNR    DPSNR     PSNR    DPSNR
FS          43.18   0         31.69   0         31.51   0         35.74   0
TSS         43.10   -0.18     29.37   -7.32     30.27   -3.92     30.58   -14.52
4SS         43.12   -0.13     29.34   -7.44     30.24   -4.01     30.62   -14.32
DS          43.14   -0.09     31.19   -1.59     31.26   -0.79     33.98   -4.92
SPWOT       43.20   0.04      31.78   0.02      31.52   0.00      35.46   -0.08
SPWT        42.78   -0.92     29.98   -5.28     31.20   -1.60     32.98   -7.60
*SPWOT=Spiral Search Without trace, SPWT=Spiral search With Trace
In the case of the slow-moving sequence Container, the PSNR values (and the DPSNR ratios)
of all block matching algorithms are similar. For the medium motion content sequences,
such as Caltrain and Foreman, the fixed-pattern algorithms (TSS, 4SS and NTSS) exhibit
the worst PSNR values (high DPSNR ratios), the DS algorithm being the exception. The high
motion sequences, such as Stefan and Tennis, show the same degradation. Since the motion
content of these sequences is complex, the performance generally becomes worse for most
of the algorithms. However, DS and the proposed method, with and without trace, have
lower degradation ratios and give better video quality, almost equal to that of full
search. We have achieved a reduction of nearly 75 to 85% in SAD calculations with a small
degradation in video quality, which is well within the tolerance level.
7. Conclusion
A modified spiral search order, trace-based full search block motion estimation method
has been simulated and implemented on an FPGA. The reduction in the computations of the
motion estimation process is achieved in two steps. In the first step, the modified
spiral search with early termination gives better efficiency than the traditional search
order. In the second step, the trace and off-diagonal sum are used to match the blocks
before the SAD is computed to find the best match. This gives a significant reduction in
the number of SAD calculations needed to find the motion vector. Experimental results
indicate a significant improvement in performance compared with conventional full search,
with a computational complexity reduction of 80 to 85% at an acceptable degradation
ratio. The synthesis report for a Spartan-6 FPGA shows that the hardware cost is about
35k, with a maximum operating clock frequency of 297.39 MHz and a power consumption of
34.956 mW. Hence, the proposed modified method is well suited for real-time applications.
References
[1] I. E. G. Richardson, “H.264 and MPEG-4 Video Compression – Video Coding for Next Generation
multimedia”, John Wiley & Sons, ISBN: 0-470-84837-5, (2003).
[2] I. E. G. Richardson, “Video Codec Design – Developing Image and Video Compression Systems”, John
Wiley & Sons, ISBN: 0-470-84783-2, (2002).
[3] M. J. Chen, L. G. Chen, and T. D. Chiueh, "One-dimensional full search motion estimation algorithm for
video coding," IEEE Trans. Circuits Syst. Video Technology., vol. 4, (1994) October, pp. 504-509.
[4] I. Ahmad, W. Zheng, J. Luo, and M. Liou, "A fast adaptive motion estimation algorithm," IEEE Trans.
on Circuits and Systems for Video Technology, vol. 16, (2006) March, pp. 420-438.
[5] W. I. Chong, B. Jeon, and J. Jeong, “Fast motion estimation with modified diamond search for variable
motion block sizes,” in Proc. Int. Conf. Image Process., vol. 3, (2003) September, pp. 371–374.
[6] R. Li, B. Zeng, and M. L. Liou, “A new three-step search algorithm for block motion estimation,” IEEE
Trans. Circuits Syst. Video Technology, vol. 4, no. 4, (1994) August, pp. 438–442.
[7] L. M. Po and W. C. Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE
Trans. Circuits Syst. Video Technol., vol. 6, no. 3, June, pp. 313–317.
[8] A. W. Zheng, J. Luo, and M. Liou, “A fast adaptive motion estimation algorithm,” IEEE Trans. Circuits
Syst. Video Technol., vol. 16, no. 3, (2006) March, pp. 420–438.
[9] S. Goel, Y. Ismail, and M. A. Bayoumi, “Adaptive search window size algorithm for fast motion
estimation in H.264/AVC standard,” in Proc. Midwest Symp. Circuits Syst., (2005) August, pp.
1557–1560.
[10] Z. Yang, J. Bu, C. Chen, and X. Li, “Fast predictive variable-block-size motion estimation for
H.264/AVC,” in Proc. IEEE Int. Conf. Multimedia Expo, (2005) July, pp. 1–4.
[11] W. Li and E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Trans. Image
Process. , vol. 4, no. 1, Jan. (1995) pp. 105–107.
[12] X. Q. Gao, C. J. Duanmu, and C. R. Zou, “A multilevel successive elimination algorithm for block
matching motion estimation,” IEEE Trans. Image Process., vol. 9, no. 3, (2000) March, pp. 501–504.
[13] M. Brünig and W. Niehsen, “Fast full-search block matching,” IEEE Trans. Circuits Syst. Video
Technol. , vol. 11, no. 2, (2001) February, pp. 241–247.
[14] S. Goel, Y. Ismail, P. Devulapalli, J. McNeely, and M. Bayoumi, “An efficient data reuse motion
estimation engine,” in Proc. IEEE Workshop SIPS , (2006) October, pp. 383–386.
[15] J.-C. Tuan, T.-S. Chang, and C.-W. Jen, “On the data reuse and memory bandwidth analysis for
full-search block-matching VLSI architecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no.
1, (2002), January, pp. 61–72.
[16] S. Senagupta and V. S. K. Reddy, “A Fast and Efficient Predictive Block Matching Motion estimation”,
IJCNSN, vol. 7, no. 12, (2007) December.
[17] S. Sundaravadivelu and S. Jayakumar, “An Efficient Motion Estimation Algorithm using Trace Match
for Fast Video Compression”, European Journal of Scientific Research, ISSN 1450-216X, vol. 53, no. 4,
(2011), pp. 546-554.
[18] N. Song and T. Shimato, “A Novel Spiral Type Motion Estimation Architecture for H.264/AVC”,
Journal of semiconductor technology and science, vol. 10, no. 1, (2010), March.
[19] A. C. Vikram and S. R. Laddaha, “A Novel dual processing Architecture for implementation of Motion
estimation Unit of H.264AVC on FPGA”, 2009 IEEE Symposium on industrial Electronics and
Applications (ISIEA 2009), Kuala Lumpur, Malaysia.
[20] L. Kulkarni, TM Manu and B. S Anami, “A Two-Step Methodology for Minimization of
Computational Overhead in Block Motion Estimation”, International Journal of u-and e-Service, Science
and Technology, vol. 7, no. 4, (2014), pp. 339-348.