Nunez-Yanez et al_2011_IEEE

2967
Cogeneration of Fast Motion Estimation
Processors and Algorithms for Advanced Video
Coding
Jose L. Nunez-Yanez, Atukem Nabina, Eddie Hung, George Vafiadis

Abstract — This work presents a flexible and scalable motion
estimation processor capable of supporting the processing
requirements of high definition (HD) video in the H.264
advanced video codec and suitable for FPGA implementation.
The core can be programmed using a C-style syntax optimized to
implement fast block matching algorithms. The development
tools are used to compile the algorithm source code to the
processor instruction set and to explore the processor
configuration space. A large configuration space enables the
designer to generate different processor microarchitectures
varying the type and number of integer and fractional pel
execution units together with other functional units. All these
processor instantiations remain binary compatible so
recompilations of the motion estimation algorithm are not
required. Thanks to this optimization process it is possible to
match the processing requirements of the selected motion
estimation algorithm and options to the hardware
microarchitecture leading to a very efficient implementation.
Index Terms — Video coding, motion estimation,
reconfigurable processor, H.264, FPGA.
I. INTRODUCTION
The emergence of new advanced coding standards such as VC-1, AVS and especially H.264 with its multiple coding tools [1] has introduced new challenges during the motion
estimation process used in inter-frame prediction. While
previous standards such as MPEG-2 could only vary the search strategy, H.264 adds the freedom of using multiple
motion vector candidates, sub-pixel resolutions, multiple
reference frames, multiple partition sizes and rate-distortion
optimization as tools to optimize the inter-prediction process.
The potential complexity introduced by these tools operating
on large reference area sizes containing lengthy motion
vectors makes the full-search approach which exhaustively
tries each possible combination less attractive. A flexible,
reconfigurable and programmable motion estimation processor
such as the one proposed in this work is well poised to address
these challenges by fitting the core microarchitecture to the
inter-frame prediction tools and algorithm for the selected
H.264 encoding configuration. The concept was briefly introduced in [2] and it is further developed and improved in this work. The paper is organized as follows. Section II reviews relevant work in the field of hardware architectures for motion estimation, concentrating on reconfigurable and programmable solutions. Section III motivates the presented work by showing the effects of different motion estimation options and algorithms in high-definition video coding. Section IV presents the programming model and tools developed to explore the software/hardware design space of advanced motion estimation. Section V describes the processor microarchitecture details and Section VI analyses the complexity/performance/power of the proposed solution. Finally, Section VII concludes this paper.

Manuscript received January 30, 2009. This work was supported by the UK EPSRC under grants EP/D011639/1 and EP/E061164/1.
Jose Luis Nunez-Yanez, Atukem Nabina and George Vafiadis are with Bristol University, Department of Electronic Engineering, Bristol, UK (phone: 0117 3315128; fax: 0117 954 5206; e-mail: j.l.nunez-yanez@bristol.ac.uk, a.nabina@bristol.ac.uk, george.vafiadis@bristol.ac.uk).
Eddie Hung is with ECE at the University of British Columbia, Vancouver, Canada (e-mail: eddieh@ece.ubc.ca).
II. MOTION ESTIMATION HARDWARE REVIEW
Full-search algorithms have been the preferred option for
hardware implementations due to their regular dataflow which
makes them well suited to architectures using 1-D or 2-D
systolic array principles with simple control and high
hardware utilization. Full-search architectures implement SAD
reuse strategies that make them especially suited to support the
variable block sizes used in H.264. By combining the results
of smaller blocks into larger blocks only small increases in
gate count are required over their conventional fixed-block
counterparts with little bearing on their throughput, critical path,
or memory bandwidth. On the other hand, the hardware
requirements needed to obtain enough parallelism to check all
the possible search points in real-time are very large. This will
be even more challenging if large search ranges, rate-distortion optimization and fractional-pel search are considered. A recent example of a high-performance integer-only full-search architecture is presented in []. This work
considers a relatively large search range of 63×48 pixels and
can vary the number of pixel processing units. A configuration
using 16 pixel processing units can support 62 fps of
1920×1080 video resolution clocking at 200 MHz. Each pixel
processing unit works on a different macroblock in parallel, obtaining 41 motion vectors (all block sizes) in parallel. By working on 16 adjacent macroblocks of 16x16 pixels in parallel, data reuse can be exploited. The architecture needs around 154K LUTs implemented in a Virtex5 XCV5LX330 device.
In an effort to reduce the complexity of the motion search process, architectures for fast ME algorithms have been proposed, as seen in [9]. The challenges the designer faces in
this case include unpredictable data flow, irregular memory
access, low hardware utilization and sequential processing.
Fast ME approaches use a number of techniques to reduce the
number of search positions and this inevitably affects the
regularity of the data flow, eliminating one of the key
advantages that systolic arrays have: their inherent ability to
exploit data locality for re-use. This is evident in the work
done in [10] that compares a number of fast-motion algorithms
mapped onto a systolic array and discovers that the memory
bandwidth required does not scale at anywhere near the same
rate as the gate count. A number of architectures have been
proposed which follow the programmable approach by
offering the flexibility of not having to define the algorithm at
design time. The application specific instruction-set processor
(ASIP) presented in [11] uses a specialized data path and a
minimum instruction set similar to our own work. The
instruction set consists of only 8 instructions operating on a
RISC-like, register-register architecture designed for low-power devices. There is the flexibility to execute arbitrary block matching algorithms, and the basic SAD16 instruction
computes the difference between two sets of sixteen pixels and
in the proposed microarchitecture takes sixteen clock cycles to
complete using a single 8-bit SAD unit. The implementation
using a standard cell 0.13 μm ASIC technology shows that this
processor enables real time motion estimation for QCIF,
operating at just 12.5 MHz to achieve low power
consumption. An FPGA implementation using a Virtex-II Pro
device is also presented with a complexity of 2052 slices and a
clock of 67 MHz. In this work, scaling can be achieved by varying the width of the SADU (the ALU-equivalent for calculating SADs) but, due to its design, the maximum parallelism would be reached if the SAD for an entire row could be calculated in a single clock cycle, in a 256-bit SIMD (Single Instruction, Multiple Data) manner.
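As an illustration of the SAD16 operation discussed above, the following minimal model (the function name and data are ours, not from [11]) shows the computation that a serial 8-bit SAD unit performs one absolute-difference term per clock cycle, which is where the sixteen cycles per instruction come from:

```python
def sad16(current, reference):
    """Sum of absolute differences over sixteen 8-bit pixels.

    A serial 8-bit SAD unit, as described for the ASIP of [11],
    accumulates one |a - b| term per clock cycle, hence sixteen
    cycles per SAD16 instruction.
    """
    assert len(current) == len(reference) == 16
    return sum(abs(a - b) for a, b in zip(current, reference))

# Identical rows give a SAD of 0; a uniform offset of 1 gives 16.
row = [100] * 16
print(sad16(row, row))           # 0
print(sad16(row, [101] * 16))    # 16
```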
The programmable concept is taken a step further in [12].
This motion estimation core is also oriented to fast motion estimation implementation and supports sub-pixel interpolation and variable block sizes. The interpolation is
done on-demand using a simplified non-standard filter which
will cause a mismatch between the coder output and a
standard-compliant decoder. The core uses a technique to
terminate the calculation of the macroblock SAD when this
value is larger than some previously calculated SAD but it
does not include a Lagrangian-based RD optimization
technique [13]. Scalability is limited since a single functional
unit is available although a number of configuration options
are available to match the architecture to the motion algorithm
such as algorithm-specific instructions. The SAD instruction, comparable to our own pattern instruction, operates on a 16-pixel pair simultaneously, and 16 instructions are needed to complete a macroblock search point, taking up to 20 clock
cycles. The processor uses 2127 slices in an implementation
targeting a Virtex-II device with a maximum clock rate of 50
MHz. This implementation can sustain processing of
1024x750 frames at 30 frames per second. Xilinx has recently
developed a processor capable of supporting high definition
720p at 50 frames per second, operating at 225 MHz [14] in a
Virtex-4 device with a throughput of 200,000 macroblocks per
second. This Virtex-4 implementation uses a total of around
3000 LUTs, 30 DSP48 embedded blocks and 19 block-RAMs. The algorithm is fixed and based on a full search of a
4x3 region around 10 user-supplied initial predictors for a total
of 120 candidate positions, chosen from a search area of
112x128 pixels. The core contains a total of 32 SAD engines
which continuously compute for a given motion vector
candidate the 12 search positions that surround it.
III. THE CASE FOR FAST MOTION ESTIMATION HARDWARE
Most of the available literature indicates that full search
algorithms deliver the best performance in terms of PSNR and
bit rate compared with fast motion estimation algorithms.
However, the research done in papers such as [19-20] suggests that a well-designed fast block matching algorithm can not only speed up the motion estimation process but also improve the rate-distortion performance in state-of-the-art video coders
such as H.264. The introduction of motion vector candidates
as starting search points obtained from neighboring
macroblocks and early termination techniques tend to produce
smoother motion vectors with a smaller delta between the
predicted motion vector and the selected motion vector. This
in turn translates into fewer bits needed to code the motion vectors over a large range of macroblocks, and it can produce better results than full-search algorithms that check all the possible motion vectors available in the search range and select the one that minimizes a decision criterion such as the sum-of-absolute-differences (SAD). Additionally, the effective costing of the motion vector with a rate-distortion-optimization (RDO) Lagrangian technique is not generally considered in full-search architectures, although it can typically obtain a 10% reduction in bit rate for the same
quality. Figs. 3, 4 and 5 explore the rate-distortion performance of 1080p high definition sequences extracted from [17] with varying degrees of motion complexity (high in Crowdrun, medium in Pedestrian and low in Sunflower). The algorithm
selected is the popular hexagon-based fast search strategy for
the integer-search followed by a diamond-based search for the
fractional search, as available in x264. The search area has been increased to 112x128 pixels as used in our own core. The figures evaluate the performance of full-search at the integer-pel level as a reference. The full-search algorithm considered works in the traditional way of checking all the points and selecting the one with the lowest SAD, without any Lagrangian optimization technique. It can be seen that it performs worse than the equivalent fast-search full-pel technique without the Lagrangian. This is especially the case for the Pedestrian and Sunflower sequences, which correspond to smooth object motion.
These two sequences also show that enabling the Lagrangian
optimization in the fast motion integer-pel only option is
beneficial. This is not the case for the Crowdrun sequence that
contains more local motion components that do not benefit
from this optimization.
Fractional-pel outperforms the integer-pel modes in all the sequences. Finally, using sub-blocks offers little benefit for the Pedestrian and Sunflower sequences since the motion complexity is lower and a single larger block can capture it correctly.
From this analysis it can be concluded that different video
sequences benefit differently from the different options
available as part of motion estimation. Reconfigurable and
programmable hardware can be used to better match the motion estimation algorithm and the hardware the algorithm runs on. Fast motion estimation algorithms have not been standardized, and multiple trade-offs between algorithm complexity and quality of results can be made, making a programmable architecture beneficial. The following sections present the microarchitecture and programming model of the hardware/software solution, named LiquidMotion, developed according to the principles of configurability and programmability.
[Figure 3. High Complexity HD Motion estimation RD performance analysis (Crowdrun; PSNR vs. bit rate in Mbits/s)]

IV. INSTRUCTION SET AND PROGRAMMING MODEL
A. LiquidMotion Instruction Set Architecture
The instruction set should be able to express the inherent
parallelism available in the motion estimation algorithm in a
simple way to minimize the overheads of instruction fetch and
decode and keep the execution units of the core as busy as
possible. The number of execution units available in the proposed processor varies depending on the implementation, so it is important that binary compatibility between different hardware implementations is achieved and a program only needs to be compiled once to be executed on any implementation. The instruction set architecture consists of a total of 9 different instructions and is illustrated in Fig. 5.
There are two arithmetic instructions for integer and fractional
pattern searches, a total of 6 control instructions that change
the program flow and one mode instruction that sets the
partition mode and reference frame to be used for the
arithmetic instructions. The arithmetic instructions exploit the
most obvious form of parallelism, which is available at the search point level. For example, in a simple small diamond
pattern there are four points that can be calculated in parallel if
enough execution resources have been implemented. The
arithmetic instructions express this parallelism with two fields
that identify the number of points used by the pattern and the
position in the point memory where the offsets for that pattern
are defined. The control unit can then execute the instruction
with a parallelism level that ranges from issuing each of these
points to a different execution unit in a fully parallel hardware
configuration to issuing each point to the same execution unit in
a base hardware configuration. The same approach applies to
fractional instructions. The set mode instruction is used to
change the active partition mode and reference frame of the
core and configures the internal control logic to operate with
different address boundaries and data sources.
There are a total of 32 32-bit registers available. These
registers include the command register, motion vector
candidate registers, results registers and profiling registers.
The motion vector candidate registers are used to store motion
vectors supplied by the user from surrounding macroblocks or
from macroblocks in different frames. These candidate motion
vectors, together with their associated SAD values, are loaded
into the current motion vector and current SAD registers with
the issue of the set mode instruction and are used as the
starting point for subsequent pattern instructions.
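The pattern-instruction fields described above can be modeled with a small encoder. The bit placement below (a 4-bit opcode followed by two 8-bit fields) is our assumption, chosen so that the packed words match the hex listing of Fig. 6; the mnemonics follow that listing:

```python
# Opcodes from the ISA description: 0000 integer pattern,
# 0001 fractional pattern, 0010 conditional jump on the winner field.
OPCODES = {"chk": 0b0000, "chkfr": 0b0001, "chkjmp": 0b0010}

def encode_pattern(op, num_points, pattern_addr):
    """Pack a 20-bit pattern instruction: 4-bit opcode, then an 8-bit
    number-of-points field and an 8-bit point-memory start address.
    The exact bit order is our assumption."""
    assert op in OPCODES and num_points < 256 and pattern_addr < 256
    return (OPCODES[op] << 16) | (num_points << 8) | pattern_addr

print(f"{encode_pattern('chk', 5, 0):05x}")      # 00500 (instruction 0 in Fig. 6)
print(f"{encode_pattern('chkfr', 25, 17):05x}")  # 11911 (instruction 31 in Fig. 6)
```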
[Figure 4. Low Complexity HD Motion estimation RD performance analysis (Sunflower; PSNR vs. bit rate in Mbits/s)]

[Figure 2. Medium Complexity HD Motion estimation RD performance analysis (Pedestrian area; PSNR vs. bit rate in Mbits/s). Legend: Fractional-pel search; Fractional-pel and all blocks; Full-search; Integer-pel search; Integer-pel only with Lagrangian disabled]
[Figure 5. LiquidMotion ISA: 20-bit instruction formats (opcode, field B, field A). Integer pattern (opcode 0000) and fractional pattern (0001) instructions carry a number of points and a pattern address. The conditional jump to label (0010) jumps to immediate8 if the winner field matches the winner id, where winner id 0 means no winner in the pattern and a non-zero id identifies the winning execution unit. Unconditional jump (0011) and conditional jump on the condition bit (0100) carry an immediate8. Compare instructions set the condition bit if a register is less than (0101), equal to (0110) or greater than (0111) an immediate12. Set mode (1000) selects the motion vector candidate, reference frame and partition mode.]

[Figure 5. LiquidMotion Design Flow: high level ME algorithm code is translated by the SharpEye compiler into assembly code and by the SharpEye assembler/linker into the program and point binaries; a cycle-accurate simulator/configurator evaluates the constraints (energy/throughput/quality/area) and, once they are met, generates a configuration RTL file that, together with the component library (number and type of functional units: integer and fractional pel, Lagrangian, motion vector candidates, etc.), is processed by standard synthesis/place&route FPGA tools into a processor bitstream; otherwise a new ME algorithm or a new hardware configuration is tried.]
The core has no instructions to access external memory,
relying instead on an external DMA engine to move the reference frame data and current macroblock data before
processing starts. This external DMA engine moves an initial
7x8 macroblock search area at the beginning of each row for a
search range of 112x128 pixels. For the remaining
macroblocks in each row, only the newest column needs to be
loaded and the loading of the new macroblock column can
take place in parallel with data processing as explained in
section V.C. Once the input data is ready, processing can start
by writing to the command register. Advanced motion
estimation techniques such as the adaptive thresholds used in
the UMH [21] or PMVFAST algorithms can be implemented by modifying the program memory contents directly, inserting modified immediate field contents into the compare
instructions.
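This run-time patching of compare immediates can be sketched as follows. The word layout (4-bit opcode, 4-bit register index, 12-bit immediate) is our assumption, based on the compare formats shown in Fig. 5:

```python
CMP_LT = 0b0101  # compare: set the condition bit if less than

def encode_cmp(opcode, reg, imm12):
    """Assumed layout: 4-bit opcode, 4-bit register, 12-bit immediate."""
    return (opcode << 16) | (reg << 12) | (imm12 & 0xFFF)

def patch_threshold(program, pc, new_imm12):
    """Overwrite only the immediate field of a compare instruction,
    leaving the opcode and register bits intact, the way an adaptive
    threshold (as in UMH or PMVFAST) would be updated at run time."""
    program[pc] = (program[pc] & ~0xFFF) | (new_imm12 & 0xFFF)

prog = [encode_cmp(CMP_LT, reg=3, imm12=500)]
patch_threshold(prog, 0, 750)
print(prog[0] & 0xFFF)   # 750: new threshold in place
print(prog[0] >> 16)     # 5: opcode untouched
```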
B. LiquidMotion Programming Model and Design Flow
The processor offers a simple programming model so that a
motion estimation algorithm programmer can access the
functionality of the hardware without detailed knowledge of
the microarchitecture. The toolset is composed of a compiler, a cycle-accurate simulator and analysis functions, and it
enables the programmer to test different motion search
techniques before deciding on the one that obtains the required
quality of results in terms of rate-distortion performance and
throughput in terms of macroblocks or frames per second. At
this point the programmer can instruct the tools to generate an
RTL configuration file for the processor. Commercial
synthesis tools such as Xilinx ISE or Synplify can then be used to process this configuration file together with the LiquidMotion RTL hardware library and generate a hardware netlist and FPGA bitstream with the right number and type of execution units matching the software requirements. This design flow is illustrated in Fig. 5. This scalable
architecture can be easily programmed using an intuitive C-style language called EstimoC. EstimoC is a high-level
language, powerful enough to express a broad range of motion
estimation algorithms in a natural way. The EstimoC code is
written in the embedded editor or any other compatible editor
and is interpreted by the EstimoC compiler. The language has
a natural syntax with elements from C and with special
structures for the development of motion estimation
algorithms. Typical constructs such as for loops, if-else, while
loops, etc are supported. The algorithm designer can use these
constructs to create arbitrary block-matching motion
estimation algorithms ranging from the classical full search to
advanced algorithms such as UMH. Part of the language is
dedicated to the preprocessor and other parts are for the core
decoding unit. The preprocessor is a crucial part of the
compiler because it provides syntax facilities for the development of sophisticated algorithms. For example, EstimoC
provides two ways to specify the search patterns: using a
static pattern specification as in pattern(hexagon) {pattern
instructions} or using the dynamic pattern generation. In this
second case the programmer writes a sequence of simple
check instructions in the form check(x,y); followed by the
update; syntax element. A simple program example and a
section of the compiler output with all the loops unrolled are
shown in Fig. 6. The algorithm corresponds to a 4-point
diamond pattern followed by a full-search fractional-pel
refinement which also illustrates that it is possible to
implement exhaustive search approaches if they are required.
The example starts setting an initial step size of 8 that defines
the size of the diamond. An initial check is done at the center
2967
5
point (defined by the motion vector loaded in the motion
vector candidate register, or zero if none available) and the 4-point diamond surrounding it. This will result in a single
integer-pattern instruction with 5 points (instruction zero in
the sample code). Then a number of diamond steps are
conducted reducing the step size until the step size is smaller
than 1 that corresponds to fractional searches. Each 4-point
diamond will generate a single check instruction. Finally a
small full search is conducted with the two for loops that will
result in a single fractional instruction with a total of 25 points
(-0.5, -0.25, 0, 0.25, 0.5) for the i and j indexes (instruction 31
in the sample code). The example program also shows a
specific if-break syntax that is used to terminate the search
early as described in section C and corresponds to instruction
opcode 2 in the sample code.
S = 8; // Initial step size
check(0, 0);
check(0, S);
check(0, -S);
check(S, 0);
check(-S, 0);
update;
do
{
    S = S / 2;
    for(i = 0 to 4 step 1)
    {
        check(0, S);
        check(0, -S);
        check(S, 0);
        check(-S, 0);
        update;
        #if( WINID == 0 )
        #break;
    }
} while( S > 1);
for(i = -0.5 to 0.5 step 0.25)
    for(j = -0.5 to 0.5 step 0.25)
        check(i, j);
update;

Compiler output (integer check pattern, conditional jump and fractional check pattern instructions):
0   0 05 00   chk     NumPoints: 5   startAddr: 0
1   0 04 05   chk     NumPoints: 4   startAddr: 5
2   2 00 0B   chkjmp  WIN: 0   goto: 11
3   0 04 05   chk     NumPoints: 4   startAddr: 5
....................
11  0 04 0A   chk     NumPoints: 4   startAddr: 9
12  2 00 15   chkjmp  WIN: 0   goto: 21
....................
21  0 04 0D   chk     NumPoints: 4   startAddr: 13
22  2 00 1F   chkjmp  WIN: 0   goto: 31
....................
31  1 19 11   chkfr   NumPoints: 25  startAddr: 17

Figure 6. LiquidMotion programming example and compiler output
All of the check constructs between update constructs result
in a single integer or fractional pattern instruction. The
compiler processes this source code and generates two binary
files. The first file called program_memory contains the
program instructions themselves, the second file called
point_memory contains the x and y offsets of the basic search
pattern (e.g. [-1,0], [1,0], [0,1], [0,-1] for the diamond search)
that will be computed with the current motion vector candidate
and identify the location of each new search point to be
checked.
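The split into the two binaries can be sketched with a toy compiler pass. This simplified model (ours) appends a fresh point-memory block for every pattern, whereas the real compiler can also point several instructions at the same stored pattern to save memory:

```python
def compile_patterns(checks_between_updates):
    """Turn groups of check(x, y) offsets (one group per update
    statement) into pattern instructions plus a shared point memory,
    mirroring the program_memory / point_memory split described above."""
    point_memory, program_memory = [], []
    for group in checks_between_updates:
        start = len(point_memory)
        point_memory.extend(group)
        # (opcode 0 = integer pattern, number of points, start address)
        program_memory.append((0b0000, len(group), start))
    return program_memory, point_memory

diamond = [(-1, 0), (1, 0), (0, 1), (0, -1)]
prog, points = compile_patterns([[(0, 0)] + diamond, diamond])
print(prog[0])   # (0, 5, 0): 5 points starting at address 0
print(prog[1])   # (0, 4, 5): 4 points starting at address 5
```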
C. Early termination and Search-point duplication avoidance
implementation
Early termination is a very important feature used to speed
up execution in fast motion estimation algorithms. Typically if
a pattern fails to improve the SAD of the previous iteration,
the algorithm terminates the current search loop. To
implement this technique each completing check pattern
instruction sets a best_eu register indicating which search
point has improved upon the current cost. This register is set
to zero before each instruction starts executing so the value of
the best_eu register at the end of execution indicates if the
instruction has improved the cost value (best_eu different
from zero) and if so which search point has achieved this
improvement. The conditional jump instruction checks this
register and changes the execution flow as required. The same
hardware can be used to support a technique to avoid
searching duplicated points by coding optimized sub-patterns
in software. For example, in a hexagon search pattern the first
pattern contains six different points but subsequent patterns
will only add three new points to the search sequence. To
avoid checking the same point more than once the best_eu
register can be checked to identify the winning search point
and this information can be used by the hardware to decide
which instruction to execute next. For this optimization to
work for the hexagon case the program needs to be extended
to contain a total of one full pattern sequence and six short pattern sequences. The complexity of identifying possible
duplicated search points and avoiding them is built into the
compiler so the algorithm designer does not need to get
involved in this process and this also helps to keep the
hardware simple.
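The best_eu mechanism described above can be modeled in a few lines; the function name and cost values are ours:

```python
def run_pattern(costs, best_cost):
    """Model of a check-pattern instruction: best_eu is cleared before
    execution, then set to the 1-based id of the execution unit holding
    the running-best cost; 0 at the end means no search point improved
    the cost, which triggers early termination of the search loop."""
    best_eu = 0
    for eu_id, cost in enumerate(costs, start=1):
        if cost < best_cost:
            best_cost, best_eu = cost, eu_id
    return best_eu, best_cost

# First pattern improves (unit 2 wins); the second fails to improve,
# so a conditional jump on best_eu == 0 would leave the loop.
print(run_pattern([900, 850, 910, 880], 1000))   # (2, 850)
print(run_pattern([870, 860, 900, 990], 850))    # (0, 850)
```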
V. PROCESSOR MICROARCHITECTURE
The microarchitectures of two configurations are illustrated in Fig. 7 and Fig. 8. Fig. 7 corresponds to the base configuration with a single integer-pel execution unit, while Fig. 8 corresponds to a complex configuration with 4 integer-pel execution units, 2 fractional-pel execution units and one interpolation execution unit. One integer-pel pipeline must always be present, as shown in Fig. 7, to generate a valid processor configuration, but the other units are optional and are configured at compile time. In addition to the number of fractional and integer execution units, the hardware includes support for other motion estimation options as shown in Table 1. Notice that independent state machines are used in the control unit to support variable block sizes. The set mode instruction can be used to set the core for a particular partition. Partitions are calculated sequentially one after another.
A. Integer-pel execution units (IPEU).
Each functional unit uses a 64-bit wide word and a deep
pipeline to achieve a high throughput. All the accesses to
reference and macroblock memory are done through 64-bit
wide data buses and the SAD engine also operates on 64-bit
data in parallel. The memory is organized in 64-bit words and
typically all accesses are unaligned, since they refer to macroblocks that start at any position inside this word. By
performing 64-bit read accesses in parallel to two memory
blocks, the desired 64-bits inside the two words can be
selected inside the vector alignment unit. The number of
integer-pel execution units is configurable from a minimum of
one to a maximum of 16 and generally limited by the available
resources in the technology selected for implementation. Each
execution unit has its own copy of the point memory and
processes 64-bits of data in parallel with the rest of the
execution units. The point memories are 256x16 in size and
contain the x and y offsets of the search patterns.
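The unaligned access scheme described above can be sketched as follows, modeling the paired memory blocks as aligned 8-byte reads from a flat byte array (the modeling choice is ours):

```python
def aligned_read(mem, word_index):
    """One 64-bit (8-byte) aligned word from a byte array."""
    return mem[word_index * 8:(word_index + 1) * 8]

def unaligned_read(mem, byte_addr):
    """Vector-alignment-unit model: fetch the two aligned 64-bit words
    that straddle byte_addr in parallel, then select the desired
    8 bytes out of the 16-byte concatenation."""
    word = byte_addr // 8
    pair = aligned_read(mem, word) + aligned_read(mem, word + 1)
    offset = byte_addr % 8
    return pair[offset:offset + 8]

mem = bytes(range(64))
print(list(unaligned_read(mem, 13)))   # bytes 13..20
```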
Table 1. LiquidMotion configuration options

Configuration option                     | Options available                             | Complexity (Virtex-5 LUTs) | Complexity memory (Virtex-5 BRAMs)
Base processor with one IPEU, one        |                                               | 1464                       | 9
reference frame and 16x16 block size     |                                               |                            |
Number of integer-pel execution units    | ipeu from 1 to 16                             | +(ipeu-1)*1015             | +(ipeu-1)*4
Number of fractional-pel execution units | fpeu from 0 to 16                             | +4057 + fpeu*1104          | +1 + fpeu*9
Additional partition sizes supported     | 8x8 (8x8, 16x8, 8x16) or 4x4 (4x8, 8x4, 4x4)  | +75 / +160                 | 0
Motion vector candidates supported       | Enable or Disable                             | +132                       | 0
Lagrangian optimization                  | Enable or Disable                             | +133                       | +4
Number of additional reference frames    | 1                                             | +20                        | N/A

[Figure 7. Microarchitecture with a single execution unit: the fetch/decode/issue unit reads 20-bit instructions from program memory via 8-bit fetch addresses; 8-bit point addresses index the point memory, whose offsets are combined with the 16-bit current motion vector in the address calculator to form 12-bit reference addresses; a 128-bit reference vector read from reference memory passes through the vector alignment unit to yield the 64-bit reference vector that, together with the 64-bit current vector from macroblock memory, feeds the SAD unit, the SAD accumulator and control (16-bit current SAD, 8-bit eu id) and the SAD selector, which outputs the 16-bit best SAD and the 8-bit best eu id.]
[Figure 8. Microarchitecture with a total of six execution units: the fetch/decode/issue unit drives four integer-pel execution units, each with its own point memory, address calculator, vector alignment units and COST accumulator and control; a half-pel interpolation unit fills the half-pel reference memories with 128-bit interpolated pixels; two fractional-pel execution units combine quarter-pel interpolators, DIF units and COST accumulators; a motion vector cost unit (inputs MVP, QP and MV) contributes the MV cost, and a COST selector returns the 16-bit best cost and the 8-bit best eu id.]
For example, a typical diamond search pattern with a radius of 1 will use 4 positions in the point memory with values [-1,0], [0,-1], [1,0], [0,1]. Any pattern can be specified in this way
and multiple instructions specifying the same pattern can point
to the same position in the point memory saving memory
resources. Each integer-pel execution unit receives an
incremented address for the point memory so each of them can
compute the SAD for a different search point corresponding to
the same pattern. This means that the optimal number of
integer-pel execution units for a diamond search pattern is
four, and for the hexagon pattern six. Further optimization to
avoid searching duplicated patterns can halve the number of
search points for many regular patterns. In algorithms which
combine different search patterns, such as UMH, a
compromise can be found to optimize the hardware and
software components. This illustrates the idea that the
hardware configuration and the software motion estimation
algorithm can be optimized together to generate different
processors depending on the software algorithm to be
deployed.
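The issuing of one pattern across the configured execution units can be sketched as follows (a model of ours, not the control unit's RTL):

```python
def schedule(points, num_eus):
    """Chunk one pattern's search points across the available
    integer-pel execution units: each EU in a pass receives the next
    incremented point-memory entry, so a pattern with as many points
    as EUs completes in one pass, while the base single-EU machine
    serializes the same binary over several passes."""
    return [points[i:i + num_eus] for i in range(0, len(points), num_eus)]

diamond = [(-1, 0), (0, -1), (1, 0), (0, 1)]
print(len(schedule(diamond, 4)))   # 1 pass with four EUs
print(len(schedule(diamond, 1)))   # 4 passes on the base configuration
```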
B. Fractional-pel Execution Unit (FPEU) and Interpolation
Execution Unit (IEU).
The engine supports half and quarter pel motion estimation
thanks to a half-pel interpolator execution unit and specifically
designed fractional-pel execution units. The number of half-pel interpolation execution units is limited to one, but the
number of fractional-pel execution units can also be
configured at compile time. The IEU interpolates the 20x20 pixel area that contains the 16x16 macroblock corresponding to
the winning integer motion vector. The interpolation hardware
is cycled 3 times to calculate first the horizontal pixels then
the vertical pixels and finally the diagonal pixels. The IEU
calculates the half pels through a 6-tap filter as defined in the
H.264 standard. The IEU has a total of 8 systolic 1-D
interpolation processors with 6 processing elements each. The
objective is to balance the internal memory bandwidth with
the processing power so in each cycle a total of 8 valid pixels
are presented to one interpolator. The interpolator starts
processing these 8 pixels producing one new half-pel sample
after each clock cycle. In parallel with the completion of 1-D
interpolation of the first 8-pixel vector, the memory has
already been read another 7 times and its output assigned to
the other 7 interpolators. The data read during memory cycle 9
can then be assigned back to the first interpolator obtaining
high hardware utilization. The horizontally interpolated area
contains enough pixels for the diagonal interpolation to also
complete successfully. A total of 24 rows with 24 bytes each
are read. Each interpolator is enabled 9 times so that a total of
72 8-byte vectors are processed. Due to the effects of filling
and emptying the systolic pipeline before the half-pel samples
are available, a total of 141 clock cycles are needed to
complete half-pel horizontal interpolation. During this time,
the integer pipeline is stalled, since the memory ports for the
reference memory are in use. Once horizontal interpolation
completes, and in parallel with the calculation of the vertical
and diagonal half-pel pixels and the fractional pel motion
estimation, the processing of the next macroblock or partition
can start in the integer-pel execution units. Completion of the
vertical and diagonal pixel interpolation takes a further 170
clock cycles after which the motion estimation using the
fractional pels can start. Quarter-pel interpolation is done on-the-fly as required, simply by reading the data from two of the
four memories containing the half and full pel positions, and
averaging according to the H.264 standard. The fractional
pipeline is as fast as the integer pipeline, requiring the same
number of cycles to compute each search position as explained
in section VI.
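The half- and quarter-pel computations above follow the H.264 standard; as a minimal sketch (function names are ours), the 6-tap half-pel filter and the on-the-fly quarter-pel averaging can be written as:

```c
/* Clip to the 8-bit pixel range, as required by the H.264 standard. */
static int clip1(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* H.264 half-pel sample from six neighbouring full-pel samples E..J
 * using the standard 6-tap filter (1, -5, 20, 20, -5, 1) / 32. */
int half_pel(int E, int F, int G, int H, int I, int J) {
    int b1 = E - 5 * F + 20 * G + 20 * H - 5 * I + J;
    return clip1((b1 + 16) >> 5);
}

/* H.264 quarter-pel sample: average of two neighbouring full/half-pel
 * samples with upward rounding, computed on demand from the stored
 * full- and half-pel positions. */
int quarter_pel(int a, int b) {
    return (a + b + 1) >> 1;
}
```

This is why quarter-pel estimation needs no extra interpolation pass: each quarter-pel value is a rounded average of two already-stored samples.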
C. Reference memory organization
The implemented reference memory can accommodate a
search area with a width of 128 pixels. The limitation of the
horizontal search range to 112 pixels leaves a 16-pixel wide
memory area available to be reloaded with a new column for
the next macroblock in parallel with the processing of the
current macroblock using a shifting-window technique. The
shift window means that the reference addresses are offset so
reads are not performed on the memory area being loaded with
a new column of reference pixels for the next macroblock. The
implementation of the reference area in Xilinx Virtex-5 Block-RAMs uses a total of four Block-RAMs. Each Block-RAM is organized as 1024 words of 4 bytes each in a dual-port configuration. Fig. 8 shows a simplified view of the reference memory organization. The key feature is that the 8-pixel words that form the reference area are stored in an
interleaved organization in the BlockRAMs. For example the
first row of the first 16x16 macroblock is formed by words 0
and 1. Word 0 is stored in BRAMs 1 and 2 while word 1 is
stored in BRAMs 3 and 4 as shown on the left of Fig.8. The
least significant bit of the address is used to activate the reading
of the adequate BRAMs. Since a motion vector can point to
any location in this reference window the accesses are
generally misaligned and, for example, the last 3-bytes from
the word read in BlockRAMs 1/2 must be concatenated with
the first 5-bytes from the word read in BlockRAMs 3/4 to
form 64 bits of valid data. Notice that if, for example, the motion vector points to the middle of memory word 1, then a few bytes from memory word 2 will also be needed to form 64 bits of valid data. In this case the address must be incremented by one to access the right location for memory word 2 (the second position in BRAMs 1/2). The effect of the
memory interleaving technique is that the BlockRAMs always
have one memory port free. The free port can be used to load
new reference data for the next macroblock in parallel with the
processing of the current macroblock. This is very important
since if processing and loading of new data had to be done in sequence, performance would typically halve. The simultaneous
reading and writing means that next macroblock data is being
loaded by an external DMA engine while the current
macroblock is processed in parallel to mask the overheads
effects of limited bus bandwidth. In our prototype the bus
width is 64-bits so the DMA engine can load a new 64-bit
word in each clock cycle. A new column of 8 macroblocks
(2048 bytes) can then be loaded in 256 clock cycles.
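The misaligned-access mechanism can be sketched with a behavioural model (illustrative names, little-endian byte order assumed, not the actual RTL): two interleaved 64-bit words are read, and the byte offset of the motion vector selects which bytes of each are concatenated:

```c
#include <stdint.h>

/* Behavioural model of the vector alignment unit: given the two 64-bit
 * words read from the interleaved BRAM pairs (word n from one pair,
 * word n+1 from the other) and the byte offset of the motion vector
 * inside the first word, concatenate the valid bytes. With offset 5,
 * the last 3 bytes of w0 join the first 5 bytes of w1, as in the
 * example in the text. */
uint64_t align64(uint64_t w0, uint64_t w1, unsigned byte_offset) {
    if (byte_offset == 0)
        return w0;                  /* aligned access, no concatenation */
    return (w0 >> (8 * byte_offset)) |
           (w1 << (8 * (8 - byte_offset)));
}
```

Because one port of each BRAM always stays free, the external DMA engine can load the next column of 8 macroblocks (2048 bytes) over the 64-bit bus in 2048 / 8 = 256 cycles while the current macroblock is processed.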
[Figure 8 depicts the reference memory architecture and organization: four dual-port BRAMs of 1024x32 bits each, an effective search area of 112x128 pixels inside a 128-pixel-wide window, a 64-bit write port fed by Reference Data In, and two 64-bit read ports (one for 16x16/16x8/8x16/8x8/4x8 accesses, one for 4x4/8x4 accesses) selected by read address bit 0.]
Figure 8. Reference memory internal organization
VI. HARDWARE PERFORMANCE EVALUATION AND
IMPLEMENTATION
For the implementation we have selected the Virtex-5 LX110T device included in the XUPV5 development platform. This device offers a high level of density inside the Virtex-5 family and can be considered mainstream, being fabricated in 65 nm CMOS technology.
A. Performance/complexity analysis.
The results of implementing the processor with different numbers and types of execution units are illustrated in Table 1. The basic configuration is small, using only 2% of the available logic resources and 6% of the memory blocks.
Virtex-5 LX110T (XUP V5 board)

Configuration        Slice LUTs used /     Memory blocks used /   Critical
(execution units)    Slice LUTs available  Memory blocks avail.   path (ns)
1 IPEU / 0 FPEU      1,464/69,120 (2%)     9/148 (6%)             4.551
2 IPEU / 0 FPEU      2,479/69,120 (3%)     13/148 (8%)            4.420
3 IPEU / 0 FPEU      3,461/69,120 (5%)     18/148 (12%)           4.620
1 IPEU / 1 FPEU      6,625/69,120 (9%)     18/148 (12%)           4.695
2 IPEU / 1 FPEU      7,567/69,120 (11%)    23/148 (15%)           4.470

Table 1. Processor complexity
Each new execution unit adds around 1000 V5 LUTs and 4
embedded memory blocks to the complexity. The fractional
and integer execution units have been carefully pipelined and
all the configurations can achieve a clock rate of 200 MHz in
this part. Obtaining a performance value in terms of macroblocks per second is not as straightforward as for full-search hardware, which always computes the same number of SADs for each macroblock. In this case the amount of motion in the video sequence, the type of algorithm and the hardware implementation all affect the number of macroblocks per second that the engine can process. The cycle-accurate simulator
part of the toolset has been used to measure the performance
of the core processing the same high-definition files
introduced in section 3. The performance values obtained from
the cycle accurate simulator have been verified against a
prototype implementation of the system using the XUPV5
board. Overall, the microarchitecture always uses 33 cycles
per search point although there is an overhead of 11 clock
cycles needed to empty the integer pipeline before the best
motion vector can be found in each pattern iteration and the
next pattern started from the current winning position. The
microarchitecture stops an execution unit if the current SAD
calculation becomes larger than the cost obtained during a
previous calculation to save power but it does not try to start
the next search point earlier. The main reason why this
optimization is not used can be explained as follows. Since the
core uses multiple execution units it is very important that all
the execution units are maintained synchronized so that a
single control unit can issue the same control signals to all the
execution units. Execution units starting at different clock
cycles will invalidate this requirement.
[Figure 10. Analysis of integer-pel performance in the Sunflower sequence: frames/second (0-300) against the number of IPEUs (1-16) for the dia, hex and UMH algorithms with 16x16, 8x8 and 4x4 minimum partitions.]
Integer-pel performance is evaluated using three different fast motion estimation algorithms: diamond, hexagon and UMH (Uneven Multi-Hexagon Cross Search), each followed by an 8-point square refinement as implemented in the x.264 codec. Figs. 9 to 11 show the performance in terms of frames per second as the number of integer execution units changes for different minimum sub-partitions. The 8x8 mode considers the 16x16, 8x16, 16x8 and 8x8 partitions while the 4x4 mode considers all the partitions. As the number of partitions considered increases, performance decreases since the core must compute one partition at a time. It is not possible to reuse partition results and calculate them in parallel for the fast motion estimation algorithms considered, since each partition can follow a different search direction. It is important to notice that not all the partitions are checked: the inter-mode selection algorithm of the x.264 codec selects which sub-partitions to test. For example, if the 8x8 partition has not improved over the 16x16 partition, then 4x4 will not be considered. The figures show that more complex algorithms scale better with the available number of execution units. For example, the optimal configuration for a diamond search pattern includes four IPEUs, although in these experiments performance still increases beyond four IPEUs due to the final square refinement, whose search pattern contains eight points. It is also important to notice that a configuration with three IPEUs needs the same number of cycles as one with two IPEUs for the diamond search. The reason is that whilst the first iteration enables all three IPEUs, a second iteration is still required to complete the pattern instruction, and during it only one IPEU is enabled.
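Using the figures stated earlier (33 cycles per search point and an 11-cycle pipeline drain per pattern iteration), a simple cycle model — an illustrative approximation, not the exact microarchitecture timing — reproduces the effect that three IPEUs cost the same as two for a four-point diamond pattern:

```c
/* Approximate cycles to evaluate one n-point pattern iteration of a
 * fast search: each group of up to num_ipeu points is evaluated in
 * parallel (33 cycles per search point), and 11 extra cycles drain the
 * integer pipeline before the winning position is known. */
int pattern_cycles(int n_points, int num_ipeu) {
    int rounds = (n_points + num_ipeu - 1) / num_ipeu;  /* ceiling */
    return rounds * 33 + 11;
}
```

With four points, two and three IPEUs both need two rounds (77 cycles in this model), while four IPEUs finish in a single round (44 cycles).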
[Figure 9. Analysis of integer-pel performance in the Pedestrian sequence: frames/second (0-300) against the number of IPEUs (1-16) for the dia, hex and UMH algorithms with 16x16, 8x8 and 4x4 minimum partitions.]
[Figure 11. Analysis of integer-pel performance in the Crowdrun sequence: frames/second (0-300) against the number of IPEUs (1-16) for the same algorithms and partitions.]
Figures 9 to 11 also show that the simpler motion in video sequences such as Sunflower and Pedestrian results in higher frame rates. This could be exploited in the hardware by lowering the clock frequency to maintain a constant frame rate in a real application. The complex motion present in Crowdrun also makes the probability of selecting the smaller sub-blocks much higher and increases the impact of these sub-blocks on performance. For example, to maintain a rate of 30 frames per second on the Crowdrun sequence when all the block sizes are used, 16 IPEUs are needed, as shown in Fig. 11. Another form of parallelism not described in this paper, but certainly possible, would be a multi-core implementation. In this case some ME processors could be dedicated to particular sub-blocks and only activated if needed. This would enable further scaling of the presented
[Figure 13. Analysis of fractional-pel performance in the Sunflower sequence: frames/second (0-150) against the number of FPEUs (1-16) for the dia, hex and square algorithms with 16x16, 8x8 and 4x4 partitions.]
architecture to higher frame rates for complex algorithms. The current microarchitecture can run the integer-pel and fractional-pel searches in parallel. To obtain the same level of fractional- and integer-pel performance, each fractional-pel execution unit needs two alignment units, because quarter-pel interpolation requires two half-pel data words to be read and aligned. The complex part of executing the fractional-pel refinement is the half-pel interpolation using the standard 6-tap filter. In the current microarchitecture this interpolation must complete before the fractional-pel search can start, and the interpolator needs around 300 clock cycles to calculate the horizontal, vertical and diagonal pixels. Figs. 12 to 14 evaluate the performance of the fractional-pel searches using three fractional motion estimation algorithms: diamond, hexagon and square search. The fractional search does not require complex algorithms since the search area is limited to 20x20 pixels: the 16x16 pixel area that corresponds to the winning integer macroblock, extended by two pixels on each side. In all cases we consider a search loop formed by two half-pel checks followed by two quarter-pel checks, following the same approach as used in the x.264 codec. Also, sub-partitions are processed in a similar way to the low-complexity mode of x.264: the fractional refinement is only done on the best partition after the integer search completes. This option is taken to keep interpolation complexity low. The alternative of performing a fractional refinement over each possible partition would need a multi-core implementation, since the single interpolator available in the microarchitecture would not be able to cope. Similarly to the integer-pel search, the figures show that simpler motion sequences translate into higher performance, as expected. In this case we can also observe that the scalability of the fractional-pel search performance with the number of FPEUs is more limited than in the integer-pel case. The reason is the half-pel interpolation stage required before the search can start, which always needs a constant number of clock cycles independently of how many FPEUs are available.
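This limit can be illustrated with a rough model (an illustrative approximation reusing the ~300-cycle interpolation figure and the 33-cycles-per-point rate stated above): the constant interpolation term bounds the achievable speed-up regardless of the FPEU count.

```c
/* Rough fractional-pel refinement time: a constant ~300-cycle half-pel
 * interpolation stage, followed by the search itself, where groups of
 * up to num_fpeu points are evaluated in parallel at 33 cycles per
 * search point (the fractional pipeline matches the integer one). */
int fractional_cycles(int n_points, int num_fpeu) {
    int rounds = (n_points + num_fpeu - 1) / num_fpeu;  /* ceiling */
    return 300 + rounds * 33;
}
```

Even with 16 FPEUs the refinement cannot drop below the 300-cycle interpolation floor, so in this model 16 points go from 828 cycles with one FPEU to only 333 cycles with sixteen: far less than a 16x speed-up.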
[Figure 12. Analysis of fractional-pel performance in the Pedestrian sequence: frames/second (0-150) against the number of FPEUs (1-16) for the dia, hex and square algorithms with 16x16, 8x8 and 4x4 partitions.]
[Figure 14. Analysis of fractional-pel performance in the Crowdrun sequence: frames/second (0-150) against the number of FPEUs (1-16) for the same algorithms and partitions.]
Finally, Table 3 compares the performance and complexity figures of the base configuration of the LiquidMotion processor against the ASIP cores proposed in [11] and [12]. The figures measured on a general-purpose P4 processor with all assembly optimizations enabled are also presented as a reference, although the power consumption and cost of this general-purpose processor are not suitable for the embedded applications this work targets. These comparisons are difficult since the features of each implementation vary. For example, our base configuration does not support fractional-pel searches, and the addition of the interpolator and fractional-pel execution unit in parallel with the integer-pel execution unit increases complexity by a factor of 3. The core presented in [12] does support fractional-pel
searches, although with a non-standard interpolator, and both searches must run sequentially. Overall, Table 3 shows that our core offers a similar level of integer performance for the diamond search algorithm to the ASIP developed in [12] with one execution unit, and this can be almost doubled if the configuration instantiates two execution units, as shown in the last row. For these experiments our core was retargeted to a Virtex-II device, since this is the technology used in [11] and [12], to obtain a fair comparison. The pipeline of the proposed solution can be clocked at double the frequency, as shown in the table, and this helps to explain why our solution with a single execution unit can support 1080p HD formats while the solution presented in [12] is limited to 720p HD formats. The measurements of cycles per macroblock were obtained processing the same CIF sequences as used in [12].
Approach                 Cycles per MB    FPGA          FPGA clock    Memory
                         (Diamond         Complexity    (MHz,         (BRAMs)
                         search)          (Slices)      Virtex-II)
Intel P4 assembly        ~3,000           N/A           N/A           N/A
Dias et al. [11]         4,532            2,052         67            4 (external
                                                                      reference area)
Babionitakis et          660              2,127         50            11 (1 reference
al. [12]                                                              area of 48x48
                                                                      pixels)
Proposed with one        510              1,231         125           21 (2 reference
integer-pel                                                           areas of 112x128
execution unit                                                        pixels)
Proposed with two        287              2,051         125           38 (2 reference
integer-pel                                                           areas of 112x128
execution units                                                       pixels)

Table 3. Performance/complexity comparison
The diamond search corresponds to the implementation available in x.264, which includes up to 8 diamond iterations followed by a square refinement using a single reference frame and a single macroblock size (16x16).
B. Power analysis.
Power is a major consideration in hardware design, so it is important to investigate how power efficient the core is. Unfortunately, no power results have been reported in [10] and [11] for the FPGA implementations. In any case, most of the available literature reporting power consumption in FPGAs relies on the tools provided by the vendors. The standard approach is to use a tool such as Xilinx XPower together with a VCD activity file obtained from simulating the netlist back-annotated with timing information. This should accurately capture the logic glitches largely responsible for dynamic power consumption, together with the switching behavior of flip-flops and LUTs. Applied to our core, this flow translates into unreasonable running times, or a low level of confidence in the power results, because only a portion of the signals/logic is activated in short simulation runs. A timing simulation of around 1000 ns, containing only 100 clock cycles of activity at a 100 MHz clock rate, results in XPower needing more than 12 hours to complete the analysis on a P4 computer due to the complexity of the core. A more accurate approach involves measuring the amount of power consumed by the chip when deployed, and this is the method we have used in this work. The core is
deployed as part of a SoC using a modified Xilinx XUP V5 board with an isolated vcore power supply connected to a purpose-designed voltage regulator. For the analysis we have clock-gated the ME processor to be able to isolate the power consumed by the ME core from the power of the rest of the SoC mapped in the FPGA. The SoC uses a soft-core processor to move data from external DDR memories to the internal ME memories using the AMBA bus. In this experiment the movement of data is done first with the clock of the ME core gated, and in a second run with the clock running and the real motion estimation process executing. The difference between these two measurements corresponds to the dynamic power of the ME core. Fig. 15 shows the dynamic power of the ME cores with one and two integer-pel execution units. As expected, power increases linearly with core frequency and is proportional to the core complexity.
[Figure 15. Dynamic power of ME engines: power (mW) against clock frequency (18.75-50 MHz) for the ME 1IU and ME 2IU configurations.]
To consider static power in FPGA devices it is possible to make a distinction between the configured and unconfigured states. In the unconfigured state the bitstream has not been loaded and the FPGA fabric is set to a default low-leakage state as described in [x].
Column 1 in Table 4 shows the power measured after configuring the static region with the SoC but keeping the reconfigurable region empty. The value corresponding to frequency 0 shows the static power consumption of the FPGA. It can be observed that static power is the main cause of power consumption in the device. The second column shows the power after the region has been configured with the motion estimation core, but with both SoC and ME remaining in an idle state. Power increases from 29 mW to 78 mW depending on the clock frequency. The increase in power with the clocks is expected since, although no useful work is being done, the activation of the clocks increases the switching activity of the logic cells, digital clock managers and other logic present in the chip. The large increase in power resulting from configuring the region is, however, remarkable. This suggests that the unconfigured state is much more power efficient than the configured state, and that if a region in the FPGA is not going to be used for some time it could be unconfigured to save power. Column 3 corresponds to the region configured with the ME core but with only the SoC running. In this case the SoC processor writes reference and current macroblock data to the ME memories but does not activate the core. Finally, column 4 is equivalent to column 3, but in this case the SoC processor activates the ME core to calculate the motion vectors as defined by the motion search algorithm. A diamond search is used for these experiments, and the difference between column 3 and column 4 is the dynamic power of the running motion estimation core, which measures around 22 mW for the 50 MHz clock. The total dynamic power can be estimated by adding the power consumed by the core with the clocks running but not doing any useful work, which can be obtained from the difference between columns 2 and 1 for the 50 MHz case once the value corresponding to static power at 0 MHz has been subtracted ((576-454) - (520-425) = 27 mW). This translates into a total dynamic power of 49 mW. The mex2 table shows equivalent data when the ME processor is configured with two integer execution units, with a total dynamic power of 74 mW.
Table 4. Power analysis
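The dynamic-power bookkeeping described above can be checked with a few lines; the constants are the measured values quoted in the text (column 1: SoC only, region empty; column 2: ME configured but idle, each at 0 and 50 MHz):

```c
/* Clock-only dynamic power of the idle ME core: the increase from 0 to
 * 50 MHz with the core configured (column 2), minus the same increase
 * with the region empty (column 1), removes the static contribution. */
int clock_only_power(int col1_0MHz, int col1_50MHz,
                     int col2_0MHz, int col2_50MHz) {
    return (col2_50MHz - col2_0MHz) - (col1_50MHz - col1_0MHz);
}

/* Total dynamic power = clock overhead + active search power
 * (column 4 minus column 3 at the same frequency). */
int total_dynamic_power(int clock_power, int active_delta) {
    return clock_power + active_delta;
}
```

With the measured values 425/520 mW (column 1) and 454/576 mW (column 2), the clock overhead is 27 mW, and adding the 22 mW active-search delta gives the 49 mW total quoted in the text.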
The value of the static power consumed is very high, but this corresponds to the whole device and not just the portion of the device being used by the core. Since the core with one integer execution unit represents around 8% of the total device, we can estimate that the share of static power that corresponds to the core itself is approximately 63 mW (34 mW unconfigured + 29 mW after configuration). The mex2 version uses approximately 13% of the device, and its static power is approximately 98 mW (55 mW unconfigured + 43 mW after configuration). In both cases static power is higher than dynamic power. Techniques such as the voltage scaling and error correction approach used in [22] for motion estimation could also be applied to the execution units in this work to reduce both the static and dynamic power consumption.
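The per-core static-power estimates above can be reproduced as follows; the area fractions and measured deltas come from the text, and the proportional-to-area split is the paper's own approximation:

```c
/* Share of the whole-device unconfigured static power attributed to
 * the core by its area fraction, rounded to the nearest milliwatt. */
int static_share_mw(int device_static_mw, int area_percent) {
    return (device_static_mw * area_percent + 50) / 100;
}
```

With 425 mW of device-wide unconfigured static power, the one-IPEU core (8% of the device) gets 34 mW, plus the 29 mW configuration delta = 63 mW; the mex2 core (13%) gets 55 mW, plus 43 mW = 98 mW, matching the figures above.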
VII. CONCLUSION
The main features of the presented processor are the support of arbitrary fast motion estimation algorithms, the seamless integration of fractional- and integer-pel support, the availability of a software toolset to ease the development of new motion estimation algorithms and processors, and a scalable, configurable architecture with a number of execution units determined by the algorithm and throughput requirements. The combination of these features constitutes a significant advance over the work reviewed in section two. In contrast to traditional full-search hardware, the presented core scales well to large search ranges without linear increases in hardware resources and, consequently, power consumption. The power analysis based on measured data has shown the large effect of static power. The power values have been added to the cycle-accurate simulator part of the toolset (available at http://sharpeye.borelspace.com/), which can then be used to configure the processor according to power, performance and complexity constraints.
[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer and T. Wedi, "Video coding with H.264/AVC: tools, performance and complexity", IEEE Circuits and Systems Magazine, vol. 4, pp. 7-28, 2004.
[2] J. L. Nunez-Yanez, E. Hung and V. Chouliaras, "A configurable and programmable motion estimation processor for the H.264 video codec", International Conference on Field Programmable Logic and Applications (FPL 2008), pp. 149-154, 8-10 Sept. 2008.
[3] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh and L.-G. Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264", ISCAS, May 2003.
[4] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang and L.-G. Chen, "Analysis and architecture design of variable block-size motion estimation for H.264/AVC", IEEE TCSVT, vol. 53, no. 3, pp. 578-593, March 2006.
[5] S. Y. Yap and J. V. McCanny, "A VLSI architecture for advanced video coding motion estimation", ASAP, pp. 293-301, 24-26 June 2003.
[6] C.-Y. Kao and Y.-L. Lin, "An AMBA-compliant motion estimator for H.264 advanced video coding", IEEE International SOC Conference (ISOCC), Seoul, Korea, October 2004.
[7] B. M. Li and P. H. Leong, "Serial and parallel FPGA-based variable block size motion estimation processors", Journal of Signal Processing Systems, vol. 51, no. 1, pp. 77-98, April 2008.
[8] B.-F. Wu, H.-Y. Peng and T.-L. Yu, "Efficient hierarchical motion estimation algorithm and its VLSI architecture", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 10, pp. 1385-1398, Oct. 2008.
[9] Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen and L.-G. Chen, "Survey on block matching motion estimation algorithms and architectures with new results", The Journal of VLSI Signal Processing, vol. 42, no. 3, pp. 297-320, March 2006.
[10] S.-C. Cheng and H.-M. Hang, "A comparison of block-matching algorithms mapped to systolic-array implementation", IEEE TCSVT, vol. 7, no. 5, pp. 741-757, Oct. 1997.
[11] T. Dias, S. Momcilovic, N. Roma and L. Sousa, "Adaptive motion estimation processor for autonomous video devices", EURASIP Journal on Embedded Systems, vol. 2007, no. 1, January 2007.
[12] K. Babionitakis et al., "A real-time motion estimation FPGA architecture", Journal of Real-Time Image Processing, vol. 3, no. 1-2, pp. 3-20, March 2008.
[13] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression", IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 74-90, Nov. 1998.
[14] Information available at http://www.xilinx.com/products/ipcenter/DO-DI-H264-ME.htm
[15] S. Saponara, K. Denolf, G. Lafruit, C. Blanch and J. Bormans, "Performance and complexity co-evaluation of the advanced video coding standard for cost-effective multimedia communications", EURASIP J. Appl. Signal Process., no. 2, pp. 220-235, Feb. 2004.
[16] Information available at http://www.videolan.org/developers/x264.html
[17] 1080p HD sequences obtained from http://nsl.cs.sfu.ca/wiki/index.php/Video_Library_and_Tools#HD_Sequences_from_CBC
[18] JM reference software [available on-line]: https://bs.hhi.de/suehring/tml/download
[19] D. Alfonso, F. Rovati, D. Pau and L. Celetto, "An innovative, programmable architecture for ultra-low power motion estimation in reduced memory MPEG-4 encoder", IEEE Transactions on Consumer Electronics, vol. 48, no. 3, pp. 702-708, Aug. 2002.
[20] H.-Y. C. Tourapis and A. M. Tourapis, "Fast motion estimation within the H.264 codec", IEEE International Conference on Multimedia and Expo (ICME '03), vol. 3, pp. III-517-520, 6-9 July 2003.
[21] T. Toivonen and J. Heikkila, "Improved unsymmetric-cross multi-hexagon-grid search algorithm for fast block motion estimation", IEEE International Conference on Image Processing, pp. 2369-2372, 8-11 Oct. 2006.
[22] G. V. Varatkar and N. R. Shanbhag, "Error-resilient motion estimation architecture", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 10, pp. 1399-1412, Oct. 2008.