Speed-area optimized FPGA implementation for Full Search Block Matching

advertisement
Speed-area optimized FPGA implementation for Full Search Block Matching
Santosh Ghosh and Avishek Saha
Department of Computer Science and Engineering, IIT Kharagpur, WB, India, 721302
{santosh, avishek}@cse.iitkgp.ernet.in
Abstract
ture for FSBM was proposed in [3]. Both [13] and [1]
proposed low-power architectures based on removal of unnecessary computations. Finally, a novel low-power parallel tree FSBM architecture was proposed in [6], which
exploited the spatial data correlations within parallel candidate block searches for data sharing and thus effectively
reduces data access bandwidth and power consumption. [7]
proposed an FPGA architecture to implement parallel computation of FSBM. Systolic array and novel OnLine Arithmetic (OLA) based designs for FSBM were proposed in [8]
and [9], respectively. Customizable low-power FPGA cores
were proposed by [10]. [11] evaluated the performance of
FSBM hardware architectures [4] implemented on Xilinx
FPGA. The results show that, real-time motion estimation
for CIF (352 × 288) sequences can be achieved with 2-D
systolic arrays and moderate capacity (250 k gates) FPGA
chip. An adder-tree based 16 × 1 SAD FPGA hardware was
implemented by [17].
This paper presents an FPGA based hardware design
for Full Search Block Matching (FSBM) based Motion Estimation (ME) in video compression. The significantly
higher resolution of HDTV based applications is achieved
by using FSBM based ME. The proposed architecture uses
a modification of the Sum-of-Absolute-Differences (SAD)
computation in FSBM such that the total number of additions/subtraction operations is drastically reduced. This
successfully optimizes the conflicting design requirements
of high throughput and small silicon area. Comparison results demonstrate the superior performance of our architecture. Finally, the design of a reconfigurable block matching
hardware has been discussed.
1 Introuction
The aforementioned FSBM architectures can be divided
into two categories, namely, FPGA [7, 8, 9, 10, 11, 17] and
ASIC [4, 15, 18, 2, 3, 20, 5, 19, 13, 1, 6]. This work uses
FPGA technology to implement a high-performance ME
hardware with due consideration to (a) processing speed
and (b) silicon area. Almost all aforementioned VLSI architectures optimize any one of these parameters. The novelty
of the proposed architecture lies in its combined optimization of the aforementioned conflicting design requirements.
The proposed hardware uses an initially-split pipeline to reduce processing cycles for each MB and thus increases the
throughput. In addition, this design requires less number of
adders and only one Absolute Difference (AD) PE, which
drastically reduces the silicon area when compared to other
existing designs. The pixels of the search regions have been
organized in memory banks such that two sets of 128-bit
(16 8-bit pixels) data can be accessed in each clock cycle.
Rapid growth in High-Definition (HD) digital video applications has lead to an increased interest in portable HDquality encoder design. HD-compatible MPEG2 MP@HL
encoder uses Full Search Block Matching Algorithm (FSBMA) based Motion Estimation (ME). The ME module accounts for more than 80% of the computational complexity of a typical video encoder. Moreover, the power consumption of an FSBM-based encoder is prohibitively high,
particularly for portable implementations. Hence, efficient
ME processor cores need to be designed to realize portable
HDTV video encoders.
Parameterizable FSBM ASIC design to solve the input
bandwidth problem by using on-chip line buffers was proposed in [15]. [18] proposed a family of modular VLSI architectures which allow sequential inputs but perform parallel processing with 100 percent efficiency. A systolic mapping procedure to derive FSBM architectures was proposed
in [4]. The designs of ([2], [20]) and [5] focused on the
reduction of pin counts by sharing memory units and 2dimensional data reuse, respectively. [19] improved the
memory bandwidth by using an overlapped data flow of
search area which increased the processing element (PE)
utilization. A low-latency high-throughput tree architec-
1-4244-1258-7/07/$25.00 ©2007 IEEE
Section 2 gives an overview of FSBM-based motion estimation. Section 3 presents a brief discussion on SAD
modifications and describes the proposed FSBM hardware.
The implementation and comparative results have been presented in Section 4. Section 5 presents a reconfigurable address generator. Finally, Section 6 concludes this paper.
13
2 FSBM-based Motion Estimation
The detailed proof of the above derivation can be found
in [12]. Again, it can be posited that, if,
Motion-compensated video compression models the
pixel motion within the current picture as a translation of
those within a previous picture. The motion vector is obtained by minimizing a cost function measuring the mismatch between the current MB in current frame and the
candidate block in reference frame. SAD, the most popular
cost function, between the pixels of the current MB x(i, j)
and the search region y(i, j) can be expressed as,
N
−1
−1
N
−1 N
−1 N
x(i, j) −
y(i + u, j + v) ≥ SADmin
SAD(u, v) =
−1
N
−1 N
|x(i, j) − y(i + u, j + v)|
i=0 j=0
then,
SAD(u, v) ≥ SADmin
(1)
where, (u, v) is the displacement between these two blocks.
Thus, each search requires N 2 absolute differences and
(N 2 − 1) additions. The FSBMA exhaustively evaluates
all possible search locations and hence is optimal in terms
of reconstructed video quality and compression ratio. High
computational requirements, regular processing scheme and
simple control structures make the hardware implementation of FSBM a preferred choice.
3.2
(3)
Pipelined SAD Operator
The SAD hardware for FSBMA has been divided into
eight independent sequential steps. It computes the initial
full SAD for the first Search Location (SL) and derives the
SAD sums for subsequent SLs. Fig. 1 shows the data path
of the proposed SAD operator for N = 16.
Stages 1 to 4 of the proposed design have been split to
facilitate parallel processing. Each half-stage (from Stage 1
to Stage 4) computes the sum of 16 pixel values per clock
cycle. These partial sums are accumulated in SR and M B
registers of Stage 6. Initially, the SR and MB registers
of Stage 6 are initialized to 0. For the first SAD calculation, Stage 5 just passes the intermediate addition result
of Stage 4 to Stage 6. This can be achieved by setting the
S0 control signal of Stage 6 to 0. Thus, the SAD sum of
the candidate MB and the first SL can be computed in 6 (for
the six stages of the pipeline) + 15 (to add 16 values) = 21
cycles. Thereafter, for every subsequent SL, the right and
the left half-stages add the pixel intensities of the old and
new rows/coloumns, respectively. At this point, Stage 5
is activated by enabling the S0 control signal. This stage
differentiates the resultant sum of the two half-stages and
accumulates the result in SR register of Stage 6. Stage 7
computes the AD between the older MB sum and the newly
obtained SL sum. Finally, Stage 8 compares the new SAD
with the existing SADmin and stores the minimum SAD
sum obtained so far. Thus, at each clock cycle, the proposed pipelined architecture computes one new SAD value
and stores the minimum SAD. Hence, with a search region
size of p = 16, this hardware can search the best match for
an MB in only [(2p + 1)2 − 1] + 23 clocks = 1111 clock
cycles.
Table 1: Execution profile of a typical video encoder
SAD
ME/
DCT/
Q/IQ
VLC/ Others
MC
IDCT
VLD
72.28% 16.85% 6.17% 2.35% 1.45% 0.32%
The execution profile of a standard video encoder obtained using the GNU gprof tool has been shown in Table 2.
The table shows that motion estimation is the most computationally expensive module in a typical video encoder. In
addition, SAD computations take the maximum time due to
complex nature of absolute operation and subsequent multitude of additions.
3 Proposed FSBM Architecture
In this section we delineate our proposed speed-area optimized FSBM architecture. The first subsection briefly explains the SAD modification and the MB searching technique. The subsequent subsections describe the proposed
hardware and the memory organization.
SAD modification
This section presents a modification to SAD computation. The SAD expression in Eq. 1 can be re-written as,
N −1 N −1
−1
N
−1 N
SAD(u, v) ≥ x(i, j) −
y(i + u, j + v)
i=0 j=0
(by Eq. 2)
where SADmin denotes the current minimum SAD
value. Thus, if Eq. 3 is satisfied, then the SAD computation at the (u, v)th location may be skipped.
In addition, if X(u, v) be the sum of pixel intensities at
the (u, v)th MB location, then this sum can be derived from
X(u − 1, v) by subtracting and adding the intensity sum of
columns at specific positions. Based on this fact, [12] proposes a search strategy to efficiently derive and compute the
MB sums at successive locations. The MB search technique
used in our proposed design adopts this particular approach.
i=0 j=0
3.1
i=0 j=0
i=0 j=0
(2)
14
Pipeline Stages
+
+
+
+
+
+
+
+
+
+
+
+
+
(1)
(2)
(3)
+
+
+
+
+
+
_
0 1
SR
+
+
+
+
+
+
+
(4)
+
+
+
+
(5)
S0
+
AD
(6)
MB
+
(7)
SAD
a< b
1 0
(8)
Figure 1: Data path of different pipeline stages of the proposed SAD unit
3.3
Memory Organization
left or right the oldest column of the pervious search location is subtracted from one new column in the new search
location. This implies that, at every clock, we need to access
two 128-bit (16 × 8) data from the memory. These 128-bit
data are basically represented as a part of one column in
the search region (Fig.3), e.g., [P1,1 , P2,1 , P3,1 , ..., P16,1 ] is
one such 128-bit data, which belongs to the column 1 of the
search region. It is observed that the one of the columns
from column number 17 to 32 are accessed concurrently
with another column from rest of the columns, i.e., 1 to 16
and 33 to 48, in the pre-defined search region. Therefore,
the pixels have been organized in two different memory
banks, as shown in Fig. 2. The data in these memory banks
are organized in column major format so that the whole column can be accessed by a single memory access. The memory controller generates the right address at every clocks for
both the memory banks. The selected 384 bits (48 pixels
of a single column of Fig.3) of each bank are then multiplexed and the correct 16 pixels are passed onto the SAD
processing unit.
Our design adopts the MB scanning technique proposed
in [12]. The pixels in p = 16 search region are represented
by Pi,j where 0 ≤ i ≤ 48 and 0 ≤ j ≤ 48 (shown in
Fig. 3)). This search region has (2p + 1)2 = 332 = 1089
search locations.
1
row number
1
2
3
2
3
column number
….
….
….. …. 48
P1,1 P1,2 ………
….
….
….. …. P1,48
P2,1 P2,2 ………
….
….
….. …. P2,48
.
.
.
.
.
.
48
P48,1 P48,2 ………
….
….
When the search location is moved down from the previous position, then we need to access two set of row pixels. This is not possible by the previously organized memory banks in one clock. It is easily observed Fig. 3 that
either the first 16 pixels or the last 16 pixels of a single
row have to be accessed for this purpose. It is also to
be observed that, for the even row number, the first 16
…. …. P48,48
Figure 3: Position of Pixels in the search region
by
Initially,
sum of the first search location is computed
16 the
16
j=1
i=1 Pi,j equation. Thereafter, to move towards
15
1
1
2
3
.
.
2
3
row number
….
….
3
….
….
….. …. 16
….. …. 48
1
P 1,33 P 1,34 …..…
….
….
….. …. P 1,48
P 2,1 P 2,2 ………
….
….
….. …. P2,16
P 3,33 P 3,34 …..…
….
….
….. …. P 3,48
P1,1 P2,1 …..…... P16,1
….
P32,1
…. P48,1
2
P1,2 P2,2 …..…... P16,2
….
P32,2
…. P48,2
3
P1,3 P2,3 …..…... P16,3
….
P32,3
…. P48,3
RB1
2
3
16 P1,16 P2,16 …..…. P16,16
33 P1,33 P2,33 …..…...P16,33
….
….
P32,16
…. P48,16
P32,33
…. P48,33
….
….
….. …. P16,16
33 P33,33 P33,34 …...
….
….
….. …. P33,48
P 32,17
…. P48,17
18 P1,18 P2,18 ….... P16,18
….
P 32,18
…. P48,18
19 P1,19 P2,19 ….... P16,19
….
P 32,19
…. P48,19
….
P 32,32
…. P48,32
.
.
.
48 P1,48 P2,48 …..…...P16,48
….
P32,48
…. P48,48
(a)
48 P48,1 P48,2 ………
….
….
(b)
…. …. P48,16
(c)
1 2 3
….
….
17 P17,33 P17,34 …...
….
18 P18,1 P18,2 ……
….
19 P19,33 P19,34 …..
.
32 P32,1 P32,2 …….
….
RB2
.
….. …. 48
….
32 P1,32 P2,32 ….... P16,32
16 P16,1 P16,2 …….
.
row number
….
….
17 P1,17 P2,17 ….... P16,17
.
RB3
column number
1
2
column number
1
….
….. …. 16
….
….. …. P17,48
….
….. …. P18,16
….
….. …. P19,48
….
….. …. P32,16
(d)
Figure 2: Organization of pixels in [(a),(c)] column major/[(b),(d)] row major format that are added or subtracted during the
shift of search in left or right/down locations, respectively. (c) and (d) represent the corresponding 2nd column/row memory
banks that are independent of the 1st column/row memory banks shown in (a) and (b), respectively
(Pi,1 , Pi,2 , ... , Pi,16 when i is even) and for the odd row
the last 16 (Pi,33 , Pi,34 , ... , Pi,48 when i is odd) pixels are
accessed to handle the downward movement of the search
location. Hence, we have stored the required row values in
another two memory banks. One is 32 × 128 − bit, to store
32 such row pixel sets and the another one is 16 × 128 − bit,
to store 16 such row pixel sets. Thus, the design needs only
768 bytes of overhead memory. The organization of this
memory banks and the stored pixels are shown in Fig. 2.
In order to reduce the total number of memory accesses in FSBM-based architecture, data reuse can be performed [14] at four different levels. Our on-chip memory
bank organization technique adopts the data reuse defined
as Level A and Level B. Level A describes the locality of
data within the candidate block strip where the search locations are moving within the block strip. Level B describes
the locality among the candidate block strips, as vertically
adjacent candidate block strips are overlapped. In our design this memory organization primarily based on the usage
of Look Up Tables (LUT) in the FPGA implementation.
sized on a Xilinx Virtex IV 4vlx100ff1513 FPGA. The synthesis results show that design requires 333 CLB Slices, 416
DFFs/Latches and a total of 278 input/output pins. The area
of the implementation is 380 look-up tables (LUTs) and the
highest achievable frequency is 221.322 MHz.
The pipelined design takes 23 clock cycles to produce
the first SAD value. Thereafter, one SAD value is generated
in every cycle. A search range of p = 16 has (2p + 1)2 =
1089 search locations. So for a search range of p = 16, the
number of cycles required by our hardware to find the best
matching block is, 23 (for the first search location) + (10891) (for the remaining search locations) = 1111 cycles.
Our FPGA implementation works at a maximum frequency of 221.322MHz (4.52 ns clock cycle). Hence, the
FPGA implementation can process a MB (16x16) in 5.022
usec (1111 clock cycles per MB * 4.52 ns per clock cycle = 5.022 usec) and a 720p HDTV (1280x720) frame in
18.078 msec (3600 MBs per frame * 5.022 usecs per MB
= 18.078 msec). At this speed, the proposed hardware can
process 55.33 720p HDTV frames per second. This is a
big improvement over other approaches, where the frames
processed per second is much lower. This is evident from
Table 2. The high speed and throughput of our design is
mainly because of the modified SAD operation and the split
pipeline design of the proposed architecture.
4 Performance Analysis
This section presents the implementations results of the
proposed hardware. Subsequently, it compares the obtained
results with other exiting FPGA based designs.
4.1
4.2
Implementation Results
Performance Comparison
This subsection compares the hardware features and
performance of the proposed design with existing FPGA
architectures. No comparison has been made with available
ASIC solutions.
The proposed design has been implemented in Verilog HDL and verified with RTL simulations using Mentor
Graphics ModelSim SE. The Verilog RTL has been synthe-
16
Table 2: Comparison of hardware features and performance with N=16 and p=16
Feature-based comparison
Performance
Design
cycles
Freq
CLB Input AD
Adders
Comp HDTV ThroughT
/MB
(MHz)
Slices Ports PEs
720p
put (T)
Area
(fps)
(MBs/sec)
Loukil et al. [7]
8482
103.8
1654
48
33
33 8-bit
17
3.4
12237.7
7.4
(Altera Stratix)
Mohammad et al. [8] 25344
191.0
300
2
33
33 8-bit
34
2.09
7536.3
25.1
(Xilinx Virtex II)
Olivares et al. [9]
27481
366.8
2296
2
256
510 1-bit
1
3.71
13347.4
5.8
(Xilinx Spartan)
Roma et al. [10]
2800
76.1
29430
3
256
15 8-bit
1
7.55
27178.6
0.92
(Xilinx XCV3200E)
Ryszko et al. [11]
1584
30.0
948
16
256
16 8-bit
1
5.26
18939.4
11.9
(Xilinx XC40250)
Wong et al. [17]
45738
197.0
1699
32
16
243 8-bit
1
1.2
4307.1
2.5
(Altera Flex20KE)
Our
1111 221.322
333
256
1
16 8-bit, 8
1
55.33
199209.7 598.2
(Xilinx Virtex IV)
9-bit, 4 10bit, 3 11-bit
& 2 16-bit
stantially high throughput/area value of 598.2.
Table 4.1 compares the hardware features of the proposed and existing FPGA solutions for a macroblock (MB)
of size 16 × 16 and a search range of p = 16. As can
be seen, our design consumes less cycles per MB, has the
highest maximum operating frequency. The splitting of the
initial stage of the pipeline facilitates this high speed. The
area required in terms of CLB slices and the hardware complexity in terms of AD PEs (Absolute Difference Processing
Elements), adders and comparators are much lesser for the
proposed architecture. Modification of the SAD operation
contributes to the high speed and less area and hardware
complexity. The use of memory banks has led to higher
on-chip bandwidth. However, this has also led to the only
drawback of our design, which is the high number of input/output pins.
5 Reconfigurable Block Matching Hardware
Apart from using the full pattern, block matching can
also be performed by using N-queen decimation patterns. It
has been shown [16] that the N-queen patterns have similar
PSNR drop but yield much faster encoding performance as
compared to the full pattern, particularly for N = 4 and
N = 8. This section presents a reconfigurable hardware design to find the minimum SAD value by selecting any one of
the full-search, 8-queen or 4-queen decimation techniques.
To the best of our knowledge no similar hardware design
exists in literature.
For both 4-queen and 8-queen decimation techniques,
the pixels being processed for two consecutive SAD-based
block matching are mutually independent. This fact can be
utilized to further enhance the performance of the SAD operator discussed in section 3. Only the memory organization and the address generation at each clock will differ for
the three decimation patterns. It has been observed that the
reconfigurable address generator and SAD operator require
only 40% and 2% extra hardware cost, respectively, as compared to the already proposed full pixel architecture.
The reconfigurable address generator uses a common
datapath. Two consecutive addresses are represented by
their respective bit value differences. For each decimation technique, the bit value is toggled following some predefined patterns. Bit toggling of the 8-bit address lines are
A performance comparison of the various architectures
has been also shown in Table 4.1. In order to compare
the speed-area optimized performance of different architectures, the new performance criteria of throughput/area has
been used. Higher the throughput/area parameter of a design, more is the speed-area optimization of the architecture. The architectures have been compared in terms of
(a) number of HDTV 720p (1280x720) frames that can be
processed per second, (b) throughput or MBs processed per
second, (c) throughput/area, and (d) the I/O bandwidth. As
can be seen, the proposed design has a very high throughput and can process the maximum number of HDTV 720p
frames per second (fps). Moreover, the superior speed-area
optimization in the proposed design is exhibited by its sub-
17
controlled by their respective enable signals which are being generated by one special controller logic. This state machine based controller generates the respective enable signals depending on 2-bit decimation mode select input signals. The pipelined datapath shown in Fig. 1 can also be reconfigured according to the user specified decimation mode.
In case of 8-queen on 16 × 16 block size, 32 pixel values are
added at every clock by both halves of the pipe stages from
one to five. The resultant value is directly used to perform
absolute difference with the MB to calculate current SAD
value. The same datapath of the pipelined SAD operator
also performs the SAD calculation for 4-queen decimation.
This technique requires 64 pixels for each SAD value for
16 × 16 block size. So, the pipeline is reconfigured in a way
such that its both halves from stage one to five and stage
six are used to perform the addition of these 64 pixel values. Subsequently, it performs sum of absolute differences
to get the new SAD.
[6] S. Lin, P. Tseng, and L. Chen. Low-power parallel tree architecture for full search block-matching motion estimation.
In Proc. of Intl. Symp. Circ. and Sys., volume 2, pages 313–
316, May 2004.
[7] H. Loukil, F. Ghozzi, A. Samet, M. Ben Ayed, and N. Masmoudi. Hardware implementation of block matching algorithm with fpga technology. In Proc. Intl. Conf. on Microelectronics, pages 542–546, Dec 2004.
[8] M. Mohammadzadeh, M. Eshghi, and M. Azadfar. Parameterizable implementation of full search block matching algorithm using fpga for real-time applications. In Proc. 5th
IEEE Intl. Caracas Conf. on Dev., Circ. and Sys., Dominican
Republic, pages 200–203, Nov 2004.
[9] J. Olivares, J. Hormigo, J. Villalba, I. Benavides, and E. Zapata. Sad computation based on online arithmetic for motion
estimation. Jrnl. Microproc. and Microsys., 30:250–258, Jan
2006.
[10] N. Roma, T. Dias, and L. Sousa. Customisable core-based
architectures for real-time motion estimation on fpgas. In
Proc. of 3rd Intl. Conf. on Field Prog. Logic and Appl., pages
745–754, Sep 2003.
[11] A. Ryszko and K. Wiatr. An assesment of fpga suitability
for implementation of real-time motion estimation. In Proc.
IEEE Euromicro Symp. on DSD, pages 364–367, 2001.
[12] A. Saha and S. Ghosh. A speed-area optimization of full
search block matching with applications in high-definition
tvs (hdtv). In To appear in LNCS Proc. of High Performance
Computing (HiPC), Dec 2007.
[13] L. Sousa and N. Roma. Low-power array architectures for
motion estimation. In IEEE 3rd Workshop on Mult. Sig.
Proc., pages 679–684, 1999.
[14] J. Tuan and C. Jen. An architecture of full-search block
matching for minimum memory bandwidth requirement. In
Proceedings of the IEEE GLSVLSI, pages 152–156, Feb
1998.
[15] L. Vos and M. Stegherr. Parameterizable vlsi architectures
for the full- search block- matching algorithm. IEEE Circ.
and Sys., 36(10):1309–1316, Oct 1989.
[16] C. Wang, S. Yang, C. Liu, and T. Chiang. A hierarchical nqueen decimation lattice and hardware architecture formotion estimation. IEEE Transactions on CSVT, 14(4):429–
440, April 2004.
[17] S. Wong, V. S., and S. Cotofona. A sum of absolute differences implementation in fpga hardware. In Proc. 28th
Euromicro Conf., pages 183–188, Sep 2002.
[18] K. Yang, M. Sun, and L. Wu. A family of vlsi designs for
the motion compensation block-matching algorithm. IEEE
Circ. and Sys., 36(10):1317–1325, Oct 1989.
[19] Y. Yeh and C. Lee. Cost-effective vlsi architectures and
buffer. size optimization for full-search block matching algorithms. IEEE Tran. VLSI Sys., 7(3):345–358, Sep. 1999.
[20] H. Yeo and Y. Hu. A novel modular systolic array architecture for full-search blockmatching motion estimation. In
Proc. Intl. Conf. on Acou., Speech, and Sig. Proc., volume 5,
pages 3303–3306, 1995.
6 Conclusions
This paper has presented a FPGA based design for Full
Search Block Matching Algorithm. The novelty of this
design lies in its modified SAD calculation and in splitpipelined design for parallel processing in the initial stages
of the hardware. The macroblock search scan has also been
suitably altered to facilitate the derivation of SAD sums
from previously computed results. Compared to existing
FPGA architectures, the proposed design exhibits superior
performance in terms of high throughput and low hardware
complexity. The high frame processing rate of 55.33 fps
makes this design particularly useful in both frame and field
processing of HDTV based applications. The paper finally
hints out the reconfigurable block matching hardware that
could be useful to general purpose real time video processing unit.
References
[1] V. Do and K. Yun. A low-power vlsi architecture for fullsearch block-matching. IEEE Tran. Circ. and Sys. Video
Tech., 8(4):393–398, Aug 1998.
[2] C. Hsieh and T. Lin. Vlsi architecture for block-matching
motion estimation algorithm. IEEE Tran. Circ. and Sys.
Video Tech., 2(2):169–175, June 1992.
[3] Y. Jehng, L. Chen, and T. Chiueh. Efficient and simple vlsi
tree architecture for motion estimation algorithms. IEEE
Tran. Sig. Pro., 41(2):889–899, Feb 1993.
[4] T. Komarek and P. Pirsch. Array archtectures for block
matching algorithms. IEEE Circ. and Sys., 36(10):1301–
1308, Oct 1989.
[5] Y. Lai and L. Chen. A data-interlacing architecture with
two-dimensional data-reuse for full-search block-matching
algorithm. IEEE Tran. Circ. and Sys. Video Tech., 8(2):124–
127, April 1998.
18
Download