Cogeneration of Fast Motion Estimation Processors and Algorithms for Advanced Video Coding

Jose L. Nunez-Yanez, Atukem Nabina, Eddie Hung, George Vafiadis

Abstract — This work presents a flexible and scalable motion estimation processor capable of supporting the processing requirements of high definition (HD) video in the H.264 advanced video codec and suitable for FPGA implementation. The core can be programmed using a C-style syntax optimized for implementing fast block matching algorithms. The development tools are used to compile the algorithm source code to the processor instruction set and to explore the processor configuration space. A large configuration space enables the designer to generate different processor microarchitectures, varying the type and number of integer and fractional pel execution units together with other functional units. All these processor instantiations remain binary compatible, so recompilations of the motion estimation algorithm are not required. Thanks to this optimization process it is possible to match the processing requirements of the selected motion estimation algorithm and options to the hardware microarchitecture, leading to a very efficient implementation.

Index Terms — Video coding, motion estimation, reconfigurable processor, H.264, FPGA.

I. INTRODUCTION

The emergence of new advanced coding standards such as VC-1, AVS and especially H.264 with its multiple coding tools [1] has introduced new challenges during the motion estimation process used in inter-frame prediction. While previous standards such as MPEG-2 could only vary the search strategy, H.264 adds the freedom of using multiple motion vector candidates, sub-pixel resolutions, multiple reference frames, multiple partition sizes and rate-distortion optimization as tools to optimize the inter-prediction process.
The potential complexity introduced by these tools operating on large reference areas containing lengthy motion vectors makes the full-search approach, which exhaustively tries each possible combination, less attractive. A flexible, reconfigurable and programmable motion estimation processor such as the one proposed in this work is well poised to address these challenges by fitting the core microarchitecture to the inter-frame prediction tools and algorithm for the selected H.264 encoding configuration. The concept was briefly introduced in [2] and is further developed and improved in this work.

Manuscript received January 30, 2009. This work was supported by the UK EPSRC under grants EP/D011639/1 and EP/E061164/1. Jose Luis Nunez-Yanez, Atukem Nabina and George Vafiadis are with Bristol University, Department of Electronic Engineering, Bristol, UK (phone: 0117 3315128; fax: 0117 954 5206; e-mail: j.l.nunez-yanez@bristol.ac.uk, a.nabina@bristol.ac.uk, george.vafiadis@bristol.ac.uk). Eddie Hung is with ECE at the University of British Columbia, Vancouver, Canada (e-mail: eddieh@ece.ubc.ca).

The paper is organized as follows. Section II reviews relevant work in the field of hardware architectures for motion estimation, concentrating on reconfigurable and programmable solutions. Section III motivates the presented work, showing the effects of different motion estimation options and algorithms in high-definition video coding. Section IV presents the programming model and tools developed to explore the software/hardware design space of advanced motion estimation. Section V describes the processor microarchitecture details and Section VI analyses the complexity/performance/power of the proposed solution. Finally, Section VII concludes this paper.

II.
MOTION ESTIMATION HARDWARE REVIEW

Full-search algorithms have been the preferred option for hardware implementations due to their regular dataflow, which makes them well suited to architectures using 1-D or 2-D systolic array principles with simple control and high hardware utilization. Full-search architectures implement SAD reuse strategies that make them especially suited to support the variable block sizes used in H.264. By combining the results of smaller blocks into larger blocks, only small increases in gate count are required over their conventional fixed-block counterparts, with little bearing on their throughput, critical path, or memory bandwidth. On the other hand, the hardware requirements needed to obtain enough parallelism to check all the possible search points in real time are very large. This is even more challenging if large search ranges, rate-distortion optimization and fractional-pel search are considered. A recent example of a high-performance integer-only full-search architecture is presented in []. This work considers a relatively large search range of 63×48 pixels and can vary the number of pixel processing units. A configuration using 16 pixel processing units can support 62 fps at 1920×1080 video resolution clocking at 200 MHz. Each pixel processing unit works on a different macroblock in parallel, obtaining 41 motion vectors (all block sizes) in parallel. By working on 16 adjacent macroblocks of 16x16 pixels in parallel, data reuse can be exploited. The architecture needs around 154K LUTs implemented in a Virtex5 XCV5LX330. In an effort to reduce the complexity of the motion search process, architectures for fast ME algorithms have been proposed, as seen in [9]. The challenges the designer faces in this case include unpredictable data flow, irregular memory access, low hardware utilization and sequential processing.
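The SAD-reuse strategy mentioned above can be illustrated with a short software sketch: the sixteen 4x4 SADs of a macroblock are computed once and every larger H.264 partition cost is derived by summing them. This is an illustrative model of the general technique, not the datapath of any of the reviewed architectures; function names are ours.

```python
def sad4x4(cur, ref, bx, by):
    """SAD of one 4x4 block at block coordinates (bx, by)."""
    return sum(abs(cur[by*4 + y][bx*4 + x] - ref[by*4 + y][bx*4 + x])
               for y in range(4) for x in range(4))

def all_partition_sads(cur, ref):
    """Reuse 4x4 SADs to build 8x8, 16x8, 8x16 and 16x16 costs."""
    s4 = [[sad4x4(cur, ref, bx, by) for bx in range(4)] for by in range(4)]
    s8 = [[s4[by][bx] + s4[by][bx+1] + s4[by+1][bx] + s4[by+1][bx+1]
           for bx in (0, 2)] for by in (0, 2)]          # four 8x8 blocks
    s16x8 = [s8[0][0] + s8[0][1], s8[1][0] + s8[1][1]]  # two 16x8 blocks
    s8x16 = [s8[0][0] + s8[1][0], s8[0][1] + s8[1][1]]  # two 8x16 blocks
    s16 = s16x8[0] + s16x8[1]                           # one 16x16 block
    return s4, s8, s16x8, s8x16, s16
```

In hardware this summing tree is what allows a variable-block-size full-search engine to report all 41 motion vectors of a macroblock with only a small gate-count increase over a fixed-block design.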
Fast ME approaches use a number of techniques to reduce the number of search positions, and this inevitably affects the regularity of the data flow, eliminating one of the key advantages that systolic arrays have: their inherent ability to exploit data locality for reuse. This is evident in the work done in [10], which compares a number of fast-motion algorithms mapped onto a systolic array and discovers that the memory bandwidth required does not scale at anywhere near the same rate as the gate count. A number of architectures have been proposed which follow the programmable approach by offering the flexibility of not having to define the algorithm at design time. The application-specific instruction-set processor (ASIP) presented in [11] uses a specialized data path and a minimal instruction set similar to our own work. The instruction set consists of only 8 instructions operating on a RISC-like, register-register architecture designed for low-power devices. There is the flexibility to execute arbitrary block-matching algorithms; the basic SAD16 instruction computes the difference between two sets of sixteen pixels and, in the proposed microarchitecture, takes sixteen clock cycles to complete using a single 8-bit SAD unit. The implementation using a standard-cell 0.13 μm ASIC technology shows that this processor enables real-time motion estimation for QCIF, operating at just 12.5 MHz to achieve low power consumption. An FPGA implementation using a Virtex-II Pro device is also presented, with a complexity of 2052 slices and a clock of 67 MHz. In this work scaling can be achieved by varying the width of the SADU (ALU-equivalent for calculating SADs), but by design the maximum achievable parallelism corresponds to calculating the SAD for an entire row in a minimum of one clock cycle, in a 256-bit SIMD (Single Instruction Multiple Data) manner. The programmable concept is taken a step further in [12].
This motion estimation core is also oriented to fast motion estimation implementation and supports sub-pixel interpolation and variable block sizes. The interpolation is done on demand using a simplified non-standard filter, which will cause a mismatch between the coder output and a standard-compliant decoder. The core uses a technique to terminate the calculation of the macroblock SAD when this value is larger than some previously calculated SAD, but it does not include a Lagrangian-based RD optimization technique [13]. Scalability is limited since a single functional unit is available, although a number of configuration options are available to match the architecture to the motion algorithm, such as algorithm-specific instructions. The SAD instruction, comparable to our own pattern instruction, operates on a 16-pixel pair simultaneously, and 16 instructions are needed to complete a macroblock search point, taking up to 20 clock cycles. The processor uses 2127 slices in an implementation targeting a Virtex-II device with a maximum clock rate of 50 MHz. This implementation can sustain processing of 1024x750 frames at 30 frames per second. Xilinx has recently developed a processor capable of supporting high definition 720p at 50 frames per second, operating at 225 MHz [14] in a Virtex-4 device with a throughput of 200,000 macroblocks per second. This Virtex-4 implementation uses a total of around 3000 LUTs, 30 DSP48 embedded blocks and 19 block-RAMs. The algorithm is fixed and based on a full search of a 4x3 region around 10 user-supplied initial predictors for a total of 120 candidate positions, chosen from a search area of 112x128 pixels. The core contains a total of 32 SAD engines which continuously compute, for a given motion vector candidate, the 12 search positions that surround it.

III. THE CASE FOR FAST MOTION ESTIMATION HARDWARE
Most of the available literature indicates that full-search algorithms deliver the best performance in terms of PSNR and bit rate compared with fast motion estimation algorithms. However, the research done in papers such as [19-20] suggests that a well-designed fast block matching algorithm not only can speed up the motion estimation process but also improve the rate-distortion performance in state-of-the-art video coders such as H.264. The introduction of motion vector candidates as starting search points obtained from neighboring macroblocks, together with early termination techniques, tends to produce smoother motion vectors with a smaller delta between the predicted motion vector and the selected motion vector. This in turn translates into fewer bits needed to code the motion vectors over a large range of macroblocks, and it can produce better results than full-search algorithms that check all the possible motion vectors available in the search range and select the one that minimizes a decision criterion such as the sum of absolute differences (SAD). Additionally, the effective costing of the motion vector with a rate-distortion optimization (RDO) Lagrangian technique is not generally considered in the full-search architectures, although it can typically obtain a 10% reduction in bit rate for the same quality. Figs. 2, 3 and 4 explore the rate-distortion performance of 1080p high definition sequences extracted from [17] with varying degrees of motion complexity (high in Crowdrun, medium in Pedestrian and low in Sunflower). The algorithm selected is the popular hexagon-based fast search strategy for the integer search followed by a diamond-based search for the fractional search, as available in x264. The search area has been increased to 112x128 pixels as used in our own core. The figures evaluate the performance of full-search at the integer-pel level as a reference.
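The Lagrangian motion-vector costing discussed above can be sketched in a few lines: instead of picking the candidate with the lowest SAD, the search minimizes J = SAD + λ·R(mvd), where R approximates the bits needed to code the motion-vector delta. The bit-length model below (signed exp-Golomb) follows H.264 MV coding; in a real encoder λ would be derived from the quantization parameter, and the helper names are ours.

```python
def golomb_bits(v):
    """Bits of a signed exp-Golomb codeword, as used for mvd components."""
    code = 2 * abs(v) - 1 if v > 0 else -2 * v   # signed-to-unsigned mapping
    return 2 * (code + 1).bit_length() - 1       # 2*floor(log2(code+1)) + 1

def mv_cost(mv, pred_mv, lam):
    """Lagrangian rate term for a candidate motion vector."""
    bits = golomb_bits(mv[0] - pred_mv[0]) + golomb_bits(mv[1] - pred_mv[1])
    return lam * bits

def best_candidate(candidates, sads, pred_mv, lam):
    """Pick the candidate minimising SAD + lambda * R(mvd)."""
    return min(zip(candidates, sads),
               key=lambda cs: cs[1] + mv_cost(cs[0], pred_mv, lam))[0]
```

With λ = 0 this degenerates to plain SAD minimization (the behaviour of the non-RDO full-search architectures above); a larger λ biases the search towards motion vectors close to the predictor, which is what produces the smoother motion fields and bit-rate savings described in the text.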
The full-search algorithm considered works in the traditional manner of checking all the points and selecting the one with the lowest SAD, without any Lagrangian optimization technique. It can be seen that it performs worse than the equivalent fast-search full-pel technique without Lagrangian optimization. This is especially the case for the Pedestrian and Sunflower sequences, which correspond to smooth object motion. These two sequences also show that enabling the Lagrangian optimization in the fast-motion integer-pel-only option is beneficial. This is not the case for the Crowdrun sequence, which contains more local motion components that do not benefit from this optimization. Fractional-pel outperforms the integer-pel modes in all the sequences. Finally, using sub-blocks offers little benefit for Pedestrian and Sunflower, since the motion complexity is lower and a single larger block can capture it correctly. From this analysis it can be concluded that different video sequences benefit differently from the different options available as part of motion estimation. Reconfigurable and programmable hardware can be used to better match the motion estimation algorithm and the hardware the algorithm runs on.

[Figure 3. High Complexity HD motion estimation RD performance analysis (Crowdrun): PSNR vs. bit rate (Mbits/s)]

IV. INSTRUCTION SET AND PROGRAMMING MODEL

A. LiquidMotion Instruction Set Architecture

The instruction set should be able to express the inherent parallelism available in the motion estimation algorithm in a simple way, to minimize the overheads of instruction fetch and decode and keep the execution units of the core as busy as possible.
The number of execution units available in the proposed processor varies depending on the implementation, so it is important that binary compatibility between different hardware implementations is achieved: a program only needs to be compiled once and can be executed on any implementation. The instruction set architecture consists of a total of 9 different instructions and is illustrated in Fig. 5. There are two arithmetic instructions for integer and fractional pattern searches, a total of 6 control instructions that change the program flow, and one mode instruction that sets the partition mode and reference frame to be used by the arithmetic instructions. The arithmetic instructions exploit the most obvious form of parallelism, which is available at the search point level. For example, in a simple small diamond pattern there are four points that can be calculated in parallel if enough execution resources have been implemented. The arithmetic instructions express this parallelism with two fields that identify the number of points used by the pattern and the position in the point memory where the offsets for that pattern are defined. The control unit can then execute the instruction with a parallelism level that ranges from issuing each of these points to a different execution unit in a fully parallel hardware configuration, to issuing each point to the same execution unit in a base hardware configuration. The same approach applies to fractional instructions. The set mode instruction is used to change the active partition mode and reference frame of the core and configures the internal control logic to operate with different address boundaries and data sources. There are a total of 32 32-bit registers available. These registers include the command register, motion vector candidate registers, results registers and profiling registers.
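The point-level parallelism described above can be made concrete with a small scheduling sketch: the same (number of points, pattern address) instruction is spread over however many execution units the instantiated core happens to have, which is what keeps all configurations binary compatible. The scheduling policy shown here is our illustration, not the documented behaviour of the control unit.

```python
def schedule_pattern(num_points, num_eus):
    """Map a pattern instruction's points onto issue slots.

    Returns one list of point indices per issue slot; with num_eus == 1
    (base configuration) every point is issued serially, while with
    num_eus >= num_points the whole pattern completes in one slot.
    """
    slots = []
    for base in range(0, num_points, num_eus):
        slots.append(list(range(base, min(base + num_eus, num_points))))
    return slots
```

For instance, a 5-point small-diamond pattern needs five issue slots on a single-EU core but only two on a 4-EU core, without recompiling the program.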
The motion vector candidate registers are used to store motion vectors supplied by the user from surrounding macroblocks or from macroblocks in different frames. These candidate motion vectors, together with their associated SAD values, are loaded into the current motion vector and current SAD registers with the issue of the set mode instruction and are used as the starting point for subsequent pattern instructions.

[Figure 4. Low Complexity HD motion estimation RD performance analysis (Sunflower): PSNR vs. bit rate (Mbits/s)]

[Figure 2. Medium Complexity HD motion estimation RD performance analysis (Pedestrian area): PSNR vs. bit rate (Mbits/s). Curves: full-search, integer-pel search, integer-pel only with Lagrangian disabled, fractional-pel search, fractional-pel and all blocks]

Fast motion estimation algorithms have not been standardized, and multiple trade-offs between algorithm complexity and quality of the results can be made, making a programmable architecture beneficial. The following sections present the microarchitecture and programming model of the hardware/software solution developed according to the principles of configurability and programmability, which has been named LiquidMotion.

[Figure 5. LiquidMotion ISA. Each 20-bit instruction carries an opcode plus two fields (Field A and Field B):
0000 — integer pattern instruction (number of points, pattern address)
0001 — fractional pattern instruction (number of points, pattern address)
0010 — conditional jump to label (if winner field = winner id, jump to immediate8; winner id = 0 means no winner in pattern, otherwise it identifies the winning execution unit)
0011 — unconditional jump to label (immediate8)
0100 — conditional jump to label (if condition bit set, jump to immediate8)
0101 — compare reg, immediate12 (if less than, set condition bit)
0110 — compare reg, immediate12 (if equal, set condition bit)
0111 — compare reg, immediate12 (if greater than, set condition bit)
1000 — set mode for MV candidate/reference frame/partition (partition mode, reference frame, MVC)]

The core has no instructions to access external memory, relying instead on an external DMA engine to move the reference frame data and current macroblock data before processing starts. This external DMA engine moves an initial 7x8 macroblock search area at the beginning of each row for a search range of 112x128 pixels. For the remaining macroblocks in each row, only the newest column needs to be loaded, and the loading of the new macroblock column can take place in parallel with data processing, as explained in section V.C. Once the input data is ready, processing can start by writing to the command register.

Advanced motion estimation techniques such as the adaptive thresholds used in the UMH [21] or PMVFAST algorithms can be implemented by modifying the program memory contents directly, inserting modified immediate field contents into the compare instructions.

B. LiquidMotion Programming Model and Design Flow

The processor offers a simple programming model so that a motion estimation algorithm programmer can access the functionality of the hardware without detailed knowledge of the microarchitecture. The toolset is composed of a compiler, a cycle-accurate simulator and analysis functions, and it enables the programmer to test different motion search techniques before deciding on the one that obtains the required quality of results in terms of rate-distortion performance and throughput in terms of macroblocks or frames per second. At this point the programmer can instruct the tools to generate an RTL configuration file for the processor. Commercial synthesis tools such as Xilinx ISE or Synplify can then be used to process this configuration file together with the LiquidMotion RTL hardware library and generate a hardware netlist and FPGA bitstream with the right number and type of execution units matching the software requirements.

[Figure 5. LiquidMotion Design Flow. A new ME algorithm written in high-level code is translated by the SharpEye compiler to assembly code and by the SharpEye assembler/linker to the program and point binaries. A cycle-accurate simulator/configurator, driven by energy/throughput/quality/area constraints, iterates until the constraints are met and then generates an RTL configuration file defining the number and type of functional units (integer and fractional pel, Lagrangian, motion vector candidates, etc.). Standard synthesis/place&route FPGA tools combine this file with the component library to produce the processor bitstream.]

This design flow is illustrated in Fig. 5. This scalable architecture can be easily programmed using an intuitive C-style language called EstimoC. EstimoC is a high-level language, powerful enough to express a broad range of motion estimation algorithms in a natural way. The EstimoC code is written in the embedded editor or any other compatible editor and is interpreted by the EstimoC compiler. The language has a natural syntax with elements from C and with special structures for the development of motion estimation algorithms. Typical constructs such as for loops, if-else, while loops, etc. are supported. The algorithm designer can use these constructs to create arbitrary block-matching motion estimation algorithms, ranging from the classical full search to advanced algorithms such as UMH. Part of the language is dedicated to the preprocessor and other parts are for the core decoding unit.
The preprocessor is a crucial part of the compiler because it provides syntax facilities for the development of sophisticated algorithms. For example, EstimoC provides two ways to specify the search patterns: using a static pattern specification, as in pattern(hexagon) {pattern instructions}, or using dynamic pattern generation. In this second case the programmer writes a sequence of simple check instructions in the form check(x,y); followed by the update; syntax element. A simple program example and a section of the compiler output with all the loops unrolled are shown in Fig. 6. The algorithm corresponds to a 4-point diamond pattern followed by a full-search fractional-pel refinement, which also illustrates that it is possible to implement exhaustive search approaches if they are required. The example starts by setting an initial step size of 8 that defines the size of the diamond. An initial check is done at the center point (defined by the motion vector loaded in the motion vector candidate register, or zero if none is available) and the 4-point diamond surrounding it. This results in a single integer-pattern instruction with 5 points (instruction zero in the sample code). Then a number of diamond steps are conducted, reducing the step size until the step size is smaller than 1, which corresponds to fractional searches. Each 4-point diamond will generate a single check instruction. Finally, a small full search is conducted with the two for loops, resulting in a single fractional instruction with a total of 25 points (-0.5, -0.25, 0, 0.25, 0.5 for the i and j indexes; instruction 31 in the sample code). The example program also shows a specific if-break syntax that is used to terminate the search early, as described in section C; it corresponds to instruction opcode 2 in the sample code.
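The shrinking-diamond algorithm just described can also be expressed as a plain software reference model, which is how such programs can be validated before compilation. The sketch below is our approximation under stated assumptions: `sad` is a caller-supplied cost function, the best point is updated greedily within a pattern, and the fractional refinement stage is omitted; it is not the exact hardware semantics.

```python
def diamond_search(sad, start=(0, 0)):
    """Reference model of a shrinking 4-point diamond search with early
    termination: at each step size, up to four diamond patterns are tried,
    and the loop breaks early when no point improves the cost (the WINID == 0
    condition of the EstimoC example)."""
    best_mv, best_cost = start, sad(start)
    s = 8                                    # initial step size
    while s >= 1:
        for _ in range(4):                   # up to four diamond steps per size
            winner = None
            for dx, dy in [(0, s), (0, -s), (s, 0), (-s, 0)]:
                mv = (best_mv[0] + dx, best_mv[1] + dy)
                cost = sad(mv)
                if cost < best_cost:
                    best_mv, best_cost, winner = mv, cost, mv
            if winner is None:               # no winner in pattern: early break
                break
        s //= 2
    return best_mv, best_cost
```

Running this model with a synthetic cost surface (e.g. the L1 distance to a known optimum) is a quick way to check that a candidate search strategy converges before committing it to the hardware toolchain.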
S = 8; // Initial step size
check(0, 0); check(0, S); check(0, -S); check(S, 0); check(-S, 0);
update;
do {
    S = S / 2;
    for(i = 0 to 4 step 1) {
        check(0, S); check(0, -S); check(S, 0); check(-S, 0);
        update;
        #if( WINID == 0 ) #break;
    }
} while( S > 1);
for(i = -0.5 to 0.5 step 0.25)
    for(j = -0.5 to 0.5 step 0.25)
        check(i, j);
update;

Compiler output (integer check pattern, conditional jump and fractional check pattern instructions):
0  0 05 00  chk     NumPoints: 5   startAddr: 0
1  0 04 05  chk     NumPoints: 4   startAddr: 5
2  2 00 0B  chkjmp  WIN: 0  goto: 11
3  0 04 05  chk     NumPoints: 4   startAddr: 5
....
11 0 04 0A  chk     NumPoints: 4   startAddr: 9
12 2 00 15  chkjmp  WIN: 0  goto: 21
....
21 0 04 0D  chk     NumPoints: 4   startAddr: 13
22 2 00 1F  chkjmp  WIN: 0  goto: 31
....
31 1 19 11  chkfr   NumPoints: 25  startAddr: 17

Figure 6. LiquidMotion programming example and compiler output

All of the check constructs between update constructs result in a single integer or fractional pattern instruction. The compiler processes this source code and generates two binary files. The first file, called program_memory, contains the program instructions themselves; the second file, called point_memory, contains the x and y offsets of the basic search pattern (e.g. [-1,0], [1,0], [0,1], [0,-1] for the diamond search) that will be combined with the current motion vector candidate to identify the location of each new search point to be checked.

C. Early termination and search-point duplication avoidance implementation

Early termination is a very important feature used to speed up execution in fast motion estimation algorithms. Typically, if a pattern fails to improve the SAD of the previous iteration, the algorithm terminates the current search loop. To implement this technique, each completing check pattern instruction sets a best_eu register indicating which search point has improved upon the current cost.
This register is set to zero before each instruction starts executing, so the value of the best_eu register at the end of execution indicates whether the instruction has improved the cost value (best_eu different from zero) and, if so, which search point has achieved this improvement. The conditional jump instruction checks this register and changes the execution flow as required. The same hardware can be used to support a technique to avoid searching duplicated points by coding optimized sub-patterns in software. For example, in a hexagon search pattern the first pattern contains six different points, but subsequent patterns will only add three new points to the search sequence. To avoid checking the same point more than once, the best_eu register can be checked to identify the winning search point, and this information can be used by the hardware to decide which instruction to execute next. For this optimization to work in the hexagon case, the program needs to be extended to contain a total of one full pattern sequence and six short pattern sequences. The complexity of identifying possible duplicated search points and avoiding them is built into the compiler, so the algorithm designer does not need to get involved in this process; this also helps to keep the hardware simple.

V. PROCESSOR MICROARCHITECTURE

The microarchitectures of two configurations are illustrated in Fig. 7 and Fig. 8. Fig. 7 corresponds to the base configuration with a single integer-pel execution unit, while Fig. 8 corresponds to a complex configuration with 4 integer-pel execution units, 2 fractional-pel execution units and one interpolation execution unit. One integer-pel pipeline must always be present, as shown in Fig. 7, to generate a valid processor configuration, but the other units are optional and are configured at compile time. In addition to the number of fractional and integer execution units, the hardware includes support for other motion estimation options, as shown in Table 1.
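The winner-directed duplicate avoidance for the hexagon pattern described above can be sketched as a lookup from winning point to sub-pattern: after the first 6-point hexagon, a new hexagon centred on the winner overlaps the old one in three points, so only the winner's own offset and its two neighbours on the hexagon need checking. The offsets below are the classic large-hexagon shape; the table itself is our illustration of the compiler-generated sub-patterns, not their actual encoding.

```python
# Large hexagon offsets, listed in order around the centre.
HEX = [(2, 0), (1, 2), (-1, 2), (-2, 0), (-1, -2), (1, -2)]

def next_points(winner_idx):
    """Short 3-point sub-pattern to apply around the winning point: its own
    offset re-applied plus its two hexagon neighbours, which are exactly the
    positions the previous full hexagon has not yet visited."""
    return [HEX[(winner_idx - 1) % 6], HEX[winner_idx], HEX[(winner_idx + 1) % 6]]
```

This halves the number of SAD computations per hexagon step after the first, which is the software-side counterpart of the one-full-plus-six-short program layout described in the text.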
Notice that independent state machines are used in the control unit to support variable block sizes. The set mode instruction can be used to set the core for a particular partition. Partitions are calculated sequentially, one after another.

A. Integer-pel execution units (IPEU)

Each functional unit uses a 64-bit wide word and a deep pipeline to achieve a high throughput. All the accesses to reference and macroblock memory are done through 64-bit wide data buses, and the SAD engine also operates on 64-bit data in parallel. The memory is organized in 64-bit words and typically all accesses are unaligned, since they refer to macroblocks that start at any position inside this word. By performing 64-bit read accesses in parallel to two memory blocks, the desired 64 bits inside the two words can be selected inside the vector alignment unit. The number of integer-pel execution units is configurable from a minimum of one to a maximum of 16 and is generally limited by the available resources in the technology selected for implementation. Each execution unit has its own copy of the point memory and processes 64 bits of data in parallel with the rest of the execution units. The point memories are 256x16 in size and contain the x and y offsets of the search patterns.
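A back-of-envelope throughput model follows from the 64-bit datapath: 8 pixels are absorbed per clock, so a 16x16 macroblock (256 pixels) needs on the order of 32 accumulation cycles per search point per IPEU. The figures produced by this sketch are illustrative estimates under that assumption, ignoring pipeline fill, stalls and interpolation overlap; they are not measurements from the paper.

```python
def search_points_per_frame(width, height, points_per_mb):
    """Total search points per frame for a given average points/macroblock."""
    mbs = (width // 16) * (height // 16)
    return mbs * points_per_mb

def fps(clock_hz, points_per_mb, width=1920, height=1080, ipeus=1,
        cycles_per_point=32):
    """Idealised frame rate: assumes perfect point-level parallelism across
    the configured IPEUs and cycles_per_point = 256 pixels / 8 per cycle."""
    pts = search_points_per_frame(width, height, points_per_mb)
    cycles = pts * cycles_per_point / ipeus
    return clock_hz / cycles
```

Such a model is the kind of estimate the cycle-accurate simulator refines when deciding how many IPEUs a given fast-search algorithm needs to meet a frame-rate constraint.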
Table 1. LiquidMotion configuration options

Configuration option                        Options available                        Complexity (Virtex-5 LUTs)   Memory (Virtex-5 BRAMs)
Base processor: one IPEU, one reference
frame and 16x16 block size                  -                                        1464                         9
Number of integer-pel execution units       ipeu from 1 to 16                        +(ipeu-1)*1015               +(ipeu-1)*4
Number of fractional-pel execution units    fpeu from 0 to 16                        +4057 + fpeu*1104            +1 + fpeu*9
Additional partition sizes supported        8x8 (8x8, 16x8, 8x16): +75 LUTs, or
                                            4x4 (4x8, 8x4, 4x4): +160 LUTs                                        0
Motion vector candidates supported          Enable or Disable                        +132                         0
Lagrangian optimization                     Enable or Disable                        +133                         0
Number of additional reference frames       1                                        +20                          +4

[Figure 7. Microarchitecture with a single execution unit. The fetch/decode/issue stage reads 20-bit instructions from the program memory (8-bit fetch addresses); 8-bit point addresses index the point memory; the address calculator combines the 16-bit current motion vector with the point offsets to produce 12-bit reference addresses; the 128-bit reference vector from the reference memory passes through the vector alignment unit to yield a 64-bit reference vector, which is compared with the 64-bit current vector from the macroblock memory in the SAD unit; the SAD accumulator and control and the SAD selector produce the 16-bit best SAD and 8-bit best execution unit id.]

[Figure 8. Microarchitecture with a total of six execution units: four integer-pel execution units, each with its own point memory, address calculator, vector alignment unit, SAD engine and cost accumulator; one half-pel interpolation execution unit feeding the half-pel reference memories; and two fractional-pel execution units with quarter-pel interpolators and DIF/cost datapaths. A cost selector combines the per-unit 16-bit costs, including the motion vector cost term (from MVP, QP and MV), into the 16-bit best cost and 8-bit winner id.]
For example, a typical diamond search pattern with a radius of 1 will use 4 positions in the point memory with values [-1,0], [0,-1], [1,0], [0,1]. Any pattern can be specified in this way, and multiple instructions specifying the same pattern can point to the same position in the point memory, saving memory resources. Each integer-pel execution unit receives an incremented address for the point memory so each of them can compute the SAD for a different search point corresponding to the same pattern. This means that the optimal number of integer-pel execution units for a diamond search pattern is four, and for the hexagon pattern six. Further optimization to avoid searching duplicated patterns can halve the number of search points for many regular patterns. In algorithms which combine different search patterns, such as UMH, a compromise can be found to optimize the hardware and software components. This illustrates the idea that the hardware configuration and the software motion estimation algorithm can be optimized together to generate different processors depending on the software algorithm to be deployed.

B. Fractional-pel Execution Unit (FPEU) and Interpolation Execution Unit (IEU)

The engine supports half- and quarter-pel motion estimation thanks to a half-pel interpolator execution unit and specifically designed fractional-pel execution units. The number of half-pel interpolation execution units is limited to one, but the number of fractional-pel execution units can be configured at compile time. The IEU interpolates the 20x20 pixel area that contains the 16x16 macroblock corresponding to the winning integer motion vector. The interpolation hardware is cycled 3 times to calculate first the horizontal pixels, then the vertical pixels and finally the diagonal pixels. The IEU calculates the half pels through a 6-tap filter as defined in the H.264 standard.
The IEU has a total of 8 systolic 1-D interpolation processors with 6 processing elements each. The objective is to balance the internal memory bandwidth with the processing power, so in each cycle a total of 8 valid pixels are presented to one interpolator. The interpolator starts processing these 8 pixels, producing one new half-pel sample after each clock cycle. In parallel with the completion of the 1-D interpolation of the first 8-pixel vector, the memory has already been read another 7 times and its output assigned to the other 7 interpolators. The data read during memory cycle 9 can then be assigned back to the first interpolator, obtaining high hardware utilization. The horizontally interpolated area contains enough pixels for the diagonal interpolation to also complete successfully. A total of 24 rows of 24 bytes each are read. Each interpolator is enabled 9 times, so a total of 72 8-byte vectors are processed. Due to the effects of filling and emptying the systolic pipeline before the half-pel samples are available, a total of 141 clock cycles are needed to complete the half-pel horizontal interpolation. During this time the integer pipeline is stalled, since the memory ports of the reference memory are in use. Once horizontal interpolation completes, and in parallel with the calculation of the vertical and diagonal half-pel pixels and the fractional-pel motion estimation, the processing of the next macroblock or partition can start in the integer-pel execution units. Completion of the vertical and diagonal pixel interpolation takes a further 170 clock cycles, after which the motion estimation using the fractional pels can start. Quarter-pel interpolation is done on-the-fly as required, simply by reading the data from two of the four memories containing the half- and full-pel positions and averaging according to the H.264 standard.
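The filter the IEU implements is the standard H.264 6-tap kernel (1, -5, 20, 20, -5, 1) with rounding and clipping. A minimal software sketch of one row of that computation:

```python
# Sketch of the H.264 half-pel filter implemented by the IEU: 6-tap kernel
# (1, -5, 20, 20, -5, 1), rounding (+16), right shift by 5, clipped to the
# 8-bit pixel range, as defined in the H.264 standard.

def clip1(x):
    return max(0, min(255, x))

def half_pel_row(row):
    """Half-pel samples between row[i] and row[i+1], where 6 taps fit."""
    out = []
    for i in range(2, len(row) - 3):
        e, f, g, h, j, k = row[i - 2:i + 4]
        out.append(clip1((e - 5*f + 20*g + 20*h - 5*j + k + 16) >> 5))
    return out

# A flat area interpolates to itself (the taps sum to 32, and
# (32*100 + 16) >> 5 == 100); a linear ramp interpolates to the midpoints.
assert half_pel_row([100] * 8) == [100, 100, 100]
assert half_pel_row([0, 2, 4, 6, 8, 10, 12, 14]) == [5, 7, 9]
```

In the hardware this accumulation is spread across the 6 processing elements of each systolic 1-D interpolator, one new sample per clock cycle.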
The fractional pipeline is as fast as the integer pipeline, requiring the same number of cycles to compute each search position, as explained in section VI.

C. Reference memory organization

The implemented reference memory can accommodate a search area with a width of 128 pixels. Limiting the horizontal search range to 112 pixels leaves a 16-pixel-wide memory area available to be reloaded with a new column for the next macroblock, in parallel with the processing of the current macroblock, using a shifting-window technique. The shifting window means that the reference addresses are offset so that reads are never performed on the memory area being loaded with the new column of reference pixels. The implementation of the reference area in Xilinx Virtex-5 BlockRAMs uses a total of 4 BlockRAMs, each organized as 1024 words of 4 bytes in a dual-port configuration. Fig. 8 shows a simplified view of the reference memory organization. The key feature is that the 8-pixel words that form the reference area are stored in an interleaved organization in the BlockRAMs. For example, the first row of the first 16x16 macroblock is formed by words 0 and 1; word 0 is stored in BRAMs 1 and 2, while word 1 is stored in BRAMs 3 and 4, as shown on the left of Fig. 8. The least significant bit of the address activates the reading of the appropriate BRAMs. Since a motion vector can point to any location in this reference window, accesses are generally misaligned: for example, the last 3 bytes of the word read from BlockRAMs 1/2 must be concatenated with the first 5 bytes of the word read from BlockRAMs 3/4 to form 64 bits of valid data. Notice that if, for example, the motion vector points to the middle of memory word 1, then a few bytes from memory word 2 will also be needed to form 64 bits of valid data; in this case the address must be incremented by one to access the right location for memory word 2 (the second position in BRAMs 1/2).
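The interleaved read described above can be sketched in software. This is an illustrative model under simplifying assumptions (one 32-byte row, two bank pairs standing in for BRAMs 1/2 and 3/4), not a cycle-accurate description of the hardware:

```python
# Sketch of the interleaved reference-memory read: consecutive 8-byte
# words alternate between two BRAM pairs (1/2 and 3/4), so any misaligned
# 8-pixel access touches each pair at most once per cycle.

ROW = list(range(32))                            # one row, 4 words of 8 bytes
BANK_A = [ROW[i:i + 8] for i in range(0, 32, 16)]   # even words -> BRAMs 1/2
BANK_B = [ROW[i:i + 8] for i in range(8, 32, 16)]   # odd words  -> BRAMs 3/4

def read_misaligned(byte_addr):
    """Fetch 8 valid bytes starting at an arbitrary byte address."""
    word, offset = divmod(byte_addr, 8)
    lo = (BANK_A if word % 2 == 0 else BANK_B)[word // 2]
    nxt = word + 1                # the LSB of the word address flips banks;
    hi = (BANK_A if nxt % 2 == 0 else BANK_B)[nxt // 2]
    return (lo + hi)[offset:offset + 8]

# Reading at byte 5 concatenates the last 3 bytes of word 0 (BRAMs 1/2)
# with the first 5 bytes of word 1 (BRAMs 3/4).
assert read_misaligned(5) == [5, 6, 7, 8, 9, 10, 11, 12]
```

The vector alignment units perform the equivalent of the final slice in hardware, selecting the valid 64 bits from the two 32-bit-per-BRAM reads.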
The effect of the memory interleaving technique is that the BlockRAMs always have one memory port free. The free port can be used to load new reference data for the next macroblock in parallel with the processing of the current macroblock. This is very important since, if processing and loading of new data had to be done in sequence, performance would typically halve. The simultaneous reading and writing means that the next macroblock data is loaded by an external DMA engine while the current macroblock is processed, masking the effects of limited bus bandwidth. In our prototype the bus width is 64 bits, so the DMA engine can load a new 64-bit word in each clock cycle. A new column of 8 macroblocks (2048 bytes) can then be loaded in 256 clock cycles.

Figure 8. Reference memory internal organization [diagram omitted: four dual-port 1024x32 BRAMs with interleaved 8-pixel words, 64-bit read and write ports, and a 112x128-pixel effective search area]

VI. HARDWARE PERFORMANCE EVALUATION AND IMPLEMENTATION

For the implementation we have selected the Virtex-5 LX110T device included in the XUPV5 development platform. This device offers a high level of density inside the Virtex-5 family and can be considered mainstream, being fabricated in 65 nm CMOS technology.

A.
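One plausible realization of the shifting-window addressing, sketched below. The circular mapping and the `physical_x` helper are assumptions for illustration; the 112/128-pixel split and the 256-cycle reload figure come from the text.

```python
# Sketch of shifting-window addressing: the 128-pixel-wide physical memory
# holds a 112-pixel effective search area plus a 16-pixel column being
# refilled for the next macroblock; reads are offset so they never touch
# the column under reload. (Circular mapping is an illustrative assumption.)

WIDTH, COLUMN = 128, 16

def physical_x(logical_x, window_base):
    """Map a search-area x coordinate to a physical memory column."""
    assert 0 <= logical_x < WIDTH - COLUMN    # 112-pixel effective range
    return (window_base + logical_x) % WIDTH

base = 0
assert physical_x(111, base) == 111
base = (base + COLUMN) % WIDTH                # advance one macroblock
assert physical_x(0, base) == 16              # old columns 0..15 are now the
                                              # reload target, never read
# Reload cost: 8 macroblocks * 16x16 pixels = 2048 bytes at 8 bytes
# (64 bits) per bus cycle -> 256 clock cycles, as stated in the text.
assert 8 * 16 * 16 // 8 == 256
```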
Performance/complexity analysis

The results of implementing the processor with different numbers and types of execution units are illustrated in Table 1. The basic configuration is small, using only 2% of the available logic resources and 6% of the memory blocks.

Table 1. Processor complexity (Virtex-5 LX110T, XUPV5 board)

Configuration      Slice LUTs used/available   Memory blocks used/available   Critical path (ns)
1 IPEU / 0 FPEU    1,464/69,120 (2%)           9/148 (6%)                     4.551
2 IPEU / 0 FPEU    2,479/69,120 (3%)           13/148 (8%)                    4.420
3 IPEU / 0 FPEU    3,461/69,120 (5%)           18/148 (12%)                   4.620
1 IPEU / 1 FPEU    6,625/69,120 (9%)           18/148 (12%)                   4.695
2 IPEU / 1 FPEU    7,567/69,120 (11%)          23/148 (15%)                   4.470

Each new execution unit adds around 1,000 Virtex-5 LUTs and 4 embedded memory blocks to the complexity. The fractional and integer execution units have been carefully pipelined, and all the configurations can achieve a clock rate of 200 MHz in this part. Obtaining a performance value in terms of macroblocks per second is not as straightforward as for full-search hardware, which always computes the same number of SADs for each macroblock. In our case the amount of motion in the video sequence, the type of algorithm and the hardware configuration all vary the number of macroblocks per second that the engine can process. The cycle-accurate simulator that is part of the toolset has been used to measure the performance of the core processing the same high-definition files introduced in section III. The performance values obtained from the cycle-accurate simulator have been verified against a prototype implementation of the system using the XUPV5 board. Overall, the microarchitecture always uses 33 cycles per search point, although there is an overhead of 11 clock cycles needed to empty the integer pipeline before the best motion vector can be found in each pattern iteration and the next pattern started from the current winning position.
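A back-of-the-envelope cycle model ties these numbers together. It assumes, for illustration, that each issue slot of up to n_ipeu parallel points costs the stated 33 cycles and that each pattern iteration pays the 11-cycle pipeline-drain overhead:

```python
# Illustrative cycle model (assumptions noted above, not a cycle-accurate
# description): per-macroblock cost of a pattern search on N IPEUs.

import math

def cycles_per_mb(points_per_iter, iterations, n_ipeu,
                  per_point=33, drain=11):
    slots = math.ceil(points_per_iter / n_ipeu)   # issue slots per iteration
    return iterations * (slots * per_point + drain)

# Diamond search (4 points per iteration, e.g. 8 iterations): three IPEUs
# need exactly the same cycles as two, since ceil(4/3) == ceil(4/2) == 2
# slots, while four IPEUs cover the pattern in one slot.
assert cycles_per_mb(4, 8, 3) == cycles_per_mb(4, 8, 2)
assert cycles_per_mb(4, 8, 4) < cycles_per_mb(4, 8, 2)
```

This simple model reproduces the step-wise scalability seen in the measured results discussed below.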
The microarchitecture stops an execution unit if the current SAD calculation becomes larger than the cost obtained during a previous calculation, to save power, but it does not try to start the next search point earlier. The main reason why this optimization is not used is as follows: since the core uses multiple execution units, it is very important that all the execution units remain synchronized so that a single control unit can issue the same control signals to all of them. Execution units starting at different clock cycles would break this requirement.

Integer-pel performance is evaluated using three different fast motion estimation algorithms: diamond, hexagon and UMH (Uneven Multi-Hexagon Cross Search), all of them followed by an 8-point square refinement as implemented in the x264 codec. Figs. 9 to 11 show the performance in terms of frames per second as the number of integer execution units changes, for different minimum sub-partitions. The 8x8 mode considers the 16x16, 8x16, 16x8 and 8x8 partitions, while the 4x4 mode considers all the partitions. As the number of partitions considered increases, performance decreases, since the core must compute one partition at a time. It is not possible to reuse partition results and calculate them in parallel for the fast motion estimation algorithms considered, since each partition will follow a potentially different search direction. It is important to notice that not all the partitions are checked: the inter-mode selection algorithm of the x264 codec selects which sub-partitions to test.

Figure 10. Integer-pel performance for the Sunflower sequence (frames/second vs. number of IPEUs for dia, hex and UMH with 16x16, 8x8 and 4x4 minimum partitions) [chart omitted]
For example, if the 8x8 partition has not improved over the 16x16 partition, then 4x4 will not be considered. The figures show that the more complex algorithms exhibit better scalability with the available number of execution units. For example, the optimal configuration for a diamond search pattern includes four IPEUs, although in these experiments performance increases for configurations with more than 4 IPEUs due to the presence of the final square refinement, which includes 8 points in its search pattern. It is also important to notice that a configuration with three IPEUs needs the same number of cycles as the two-IPEU case for the diamond search. The reason is that, whilst the first iteration will enable all three IPEUs, a second iteration will still be required to complete the pattern instruction, and during it only one IPEU will be enabled.

Figure 9. Integer-pel performance for the Pedestrian sequence (frames/second vs. number of IPEUs) [chart omitted]

Figure 11. Integer-pel performance for the Crowdrun sequence (frames/second vs. number of IPEUs) [chart omitted]

Figures 9 to 11 also show that the simpler motion present in video sequences such as Sunflower and Pedestrian results in higher frame rates. This could be exploited in hardware by lowering the clock frequency to maintain a constant frame rate in a real application. Also, the complex motion present in Crowdrun makes the probability of selecting the smaller sub-blocks much higher and increases the impact of these sub-blocks on performance. For example, to maintain a frame rate of 30 frames per second over the Crowdrun sequence when all the block sizes are used, 16 IPEUs are needed, as shown in Fig. 11.
Another form of parallelism, not described in this paper but certainly possible, would be a multi-core implementation. In this case some ME processors could be dedicated to particular sub-blocks and only activated when needed. This would enable further scaling of the presented architecture to higher frame rates for complex algorithms.

Figure 13. Fractional-pel performance for the Sunflower sequence (frames/second vs. number of FPEUs for dia, hex and square with 16x16, 8x8 and 4x4 minimum partitions) [chart omitted]

The current microarchitecture can run the integer-pel and fractional-pel searches in parallel. To obtain the same level of fractional- and integer-pel performance, each fractional-pel execution unit needs two alignment units, because quarter-pel interpolation requires two half-pel data words to be read and aligned. The complex part of executing the fractional-pel refinement involves the half-pel interpolation using the standard 6-tap filter. In the current microarchitecture this interpolation needs to complete before the fractional-pel search can start, and the interpolator needs around 300 clock cycles to calculate the horizontal, vertical and diagonal pixels. Figs. 12 to 14 evaluate the performance of the fractional-pel searches using three fractional motion estimation algorithms: diamond, hexagon and square search. Fractional search does not require complex algorithms, since the search area is limited to 20x20 pixels: the 16x16-pixel area that corresponds to the winning integer macroblock, extended by two pixels on each side. In all cases we consider a search loop formed by two half-pel checks followed by two quarter-pel checks, following the same approach as used in the x264 codec.
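The limited scalability of the fractional search can be captured with a rough model. The point count and per-point cost below are illustrative assumptions, not measured values; the ~300-cycle interpolation figure comes from the text:

```python
# Rough model of why fractional-pel performance scales worse with FPEUs
# than integer-pel performance with IPEUs: the ~300-cycle half-pel
# interpolation is a fixed serial cost that parallel FPEUs cannot reduce.
# (Point count and per-point cost are illustrative assumptions.)

import math

def fractional_cycles(points, n_fpeu, interp=300, per_point=33):
    slots = math.ceil(points / n_fpeu)
    return interp + slots * per_point

one_fpeu = fractional_cycles(16, 1)    # 300 + 16*33 = 828
four_fpeu = fractional_cycles(16, 4)   # 300 +  4*33 = 432
# Quadrupling the FPEUs yields well under a 4x speed-up.
assert one_fpeu / four_fpeu < 2
```

This is the Amdahl-style effect observed in Figs. 12 to 14: the serial interpolation term dominates as the number of FPEUs grows.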
Also, sub-partitions are processed in a similar way to the low-complexity mode of x264: the fractional search refinement is only done on the best partition after the integer search completes. This option is taken to keep interpolation complexity low. The alternative of performing a fractional refinement over each possible partition would need a multi-core implementation, since the single interpolator available in the microarchitecture would not be able to cope. Similarly to the integer-pel search, the figures show that simpler-motion sequences translate into higher performance, as expected. In this case we can also observe that the scalability of the fractional-pel search performance with the number of FPEUs is more limited than in the integer-pel case. The reason is the half-pel interpolation stage required before the search can start, which always needs a constant number of clock cycles independently of how many FPEUs are available.

Figure 12. Fractional-pel performance for the Pedestrian sequence (frames/second vs. number of FPEUs) [chart omitted]

Figure 14. Fractional-pel performance for the Crowdrun sequence (frames/second vs. number of FPEUs) [chart omitted]

Finally, Table 3 compares the performance and complexity figures of the base configuration of the LiquidMotion processor against the ASIP cores proposed in [11] and [12]. The figures measured on a general-purpose P4 processor with all assembly optimizations enabled are also presented as a reference, although the power consumption and cost of this general-purpose processor are not suitable for the embedded applications this work targets. These types of comparisons are difficult since the features of each implementation vary.
For example, our base configuration does not support fractional-pel searches, and the addition of the interpolator and a fractional-pel execution unit working in parallel with the integer-pel execution unit increases complexity by a factor of 3. The core presented in [12] does support fractional-pel searches, although with a non-standard interpolator, and both searches must run sequentially. Overall, Table 3 shows that our core with one execution unit offers a similar level of integer performance for the diamond search algorithm to the ASIP developed in [12], and this can be almost doubled if the configuration instantiates two execution units, as shown in the last row. For these experiments our core was retargeted to a Virtex-II device, since this is the technology used in [11] and [12], to obtain a fair comparison. The pipeline of the proposed solution can clock at double the frequency, as shown in the table, and this helps to explain why our solution with a single execution unit can support 1080p HD formats while the solution presented in [12] is limited to 720p HD formats. The measurements of cycles per macroblock were obtained processing the same CIF sequences as used in [12].

Table 3. Performance/complexity comparison

Approach                   Cycles per MB      FPGA complexity   FPGA clock         Memory (BRAMs)                             Dynamic power
                           (diamond search)   (slices)          (MHz, Virtex-II)                                              of ME engine
Intel P4 assembly          ~3,000             N/A               N/A                N/A                                        N/A
Dias et al. [11]           4,532              2,052             67                 4 (external reference area)                100
Babionitakis et al. [12]   660                2,127             50                 11 (1 reference area of 48x48 pixels)      90
Proposed, one IPEU         510                1,231             125                21 (2 reference areas of 112x128 pixels)   —
Proposed, two IPEUs        287                2,051             125                38 (2 reference areas of 112x128 pixels)   —

The diamond search corresponds to the implementation available in x264, which includes up to 8 diamond iterations followed by a square refinement, using a single reference frame and a single macroblock size (16x16).

B. Power analysis
Power is a major consideration in hardware design, so it is important to investigate how effective the core is from a power-efficiency point of view. Unfortunately, no power results have been reported in [10] and [11] for the FPGA implementations. In any case, most of the available literature reporting power consumption in FPGAs relies on the tools provided by the vendors. The standard approach is to use a tool such as Xilinx XPower together with a VCD activity file obtained from simulating the netlist back-annotated with timing information. This should accurately capture the logic glitches largely responsible for dynamic power consumption, together with the switching behaviour of flip-flops and LUTs. Applied to our core, this flow translates into unreasonable running times, or into a low level of confidence in the power results when short simulation runs activate only a portion of the signals and logic. A timing simulation of around 1000 ns, containing only 100 clock cycles of activity at a 100 MHz clock rate, results in XPower needing more than 12 hours to complete the analysis on a P4 computer due to the complexity of the core. A more accurate approach involves measuring the power consumed by the chip when deployed, and this is the method we have used in this work. The core is deployed as part of a SoC using a modified Xilinx XUPV5 board with an isolated vcore power supply connected to a purpose-designed voltage regulator. For the analysis we have clock-gated the ME processor to be able to isolate the power consumed by the ME core from the power of the rest of the SoC mapped in the FPGA. The SoC uses a soft-core processor to move data from external DDR memories to the internal ME memories over the AMBA bus. The movement of data is done first with the clock of the ME core gated, and in a second run with the clock running and the real motion estimation process performed.
The difference between these two measurements corresponds to the dynamic power of the ME core. Fig. 15 shows the dynamic power of the ME cores with one and two integer-pel execution units. As expected, power increases linearly with core frequency and is proportional to the core complexity.

Figure 15. Dynamic power of the ME core with one (ME 1IU) and two (ME 2IU) integer-pel execution units, for clock frequencies from 18.75 to 50 MHz [chart omitted]

To consider static power in FPGA devices it is possible to make a distinction between the configured and unconfigured states. In the unconfigured state the bitstream has not been loaded and the FPGA fabric is set to a default low-leakage state, as described in [x]. Initially the reconfigurable region is left in a blank state (unconfigured). Column 1 in Table 4 shows the power measured after configuring the static region with the SoC but keeping the reconfigurable region empty. The value corresponding to frequency 0 shows the static power consumption of the FPGA; it can be observed that static power is the main cause of power consumption in the device. The second column shows the power after the region has been configured with the motion estimation core but with both the SoC and the ME core remaining in an idle state. Power increases from 29 mW to 78 mW depending on the clock frequency. The increase in power with the clocks is expected: although no useful work is being done, activating the clocks increases the switching activity of the logic cells, digital clock managers and other logic present in the chip. The large increase in power resulting from simply configuring the region is, however, remarkable. It suggests that the unconfigured state is much more power efficient than the configured state, and that if a region of the FPGA is not going to be used for some time it could be unconfigured to save power. Column 3 corresponds to the region configured with the ME core but with only the SoC running.
In this case the SoC processor writes reference and current macroblock data to the ME memories but does not activate the core. Finally, column 4 is equivalent to column 3, but here the SoC processor activates the ME core to calculate the motion vectors as defined by the motion search algorithm. A diamond search is used for these experiments, and the difference between columns 3 and 4 is the dynamic power of the running motion estimation core, which measures around 22 mW for the 50 MHz clock. The total dynamic power can be estimated by adding the power consumed by the core with the clocks running but doing no useful work, obtained from the difference between columns 2 and 1 at 50 MHz once the static power at 0 MHz has been subtracted: (576 - 454) - (520 - 425) = 27 mW. This gives a total dynamic power of 49 mW. The mex2 table shows equivalent data when the ME processor is configured with two integer execution units, with a total dynamic power of 74 mW.

Table 4. Power analysis [values omitted]

The static power consumed is very high, but this corresponds to the whole device and not just the portion of the device used by the core. Since the core with one integer execution unit represents around 8% of the total device, we can estimate that the fraction of static power corresponding to the core itself is approximately 63 mW (34 mW unconfigured plus 29 mW after configuration). The mex2 version uses approximately 13% of the device; its static power is approximately 98 mW (55 mW unconfigured plus 43 mW after configuration). In both cases static power is higher than dynamic power. Techniques such as the voltage-scaling and error-correction approach used in [22] for motion estimation could also be applied to the execution units in this work to reduce both the static and dynamic power consumption.
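The dynamic-power bookkeeping above can be restated compactly. All values (in mW) are taken from the text; the variable names are ours:

```python
# Dynamic-power accounting from the measurements in the text (mW):
# column 1 = SoC with the region unconfigured, column 2 = region
# configured with the ME core but idle, at 50 MHz and at 0 MHz (static).

col1_50, col1_0 = 520, 425   # unconfigured region
col2_50, col2_0 = 576, 454   # configured with ME core, idle

# Idle clocking cost of the configured ME core, with static power removed.
idle_dynamic = (col2_50 - col2_0) - (col1_50 - col1_0)
assert idle_dynamic == 27

running_dynamic = 22         # column 4 minus column 3 at 50 MHz (measured)
assert idle_dynamic + running_dynamic == 49   # total dynamic power, mW
```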
VII. CONCLUSION

The main features of the presented processor are the support of arbitrary fast motion estimation algorithms, the seamless integration of fractional- and integer-pel support, the availability of a software toolset that eases the development of new motion estimation algorithms and processors, and a scalable, configurable architecture whose number of execution units is determined by the algorithm and throughput requirements. The combination of these features constitutes a significant advance over the work reviewed in section II. Compared with traditional full-search hardware, the presented core scales well to large search ranges without linear increases in hardware resources and, consequently, power consumption. The power analysis based on measured data has shown the large effect of static power. The power values have been added to the cycle-accurate simulator that is part of the toolset (available at http://sharpeye.borelspace.com/), which can then be used to configure the processor according to power, performance and complexity constraints.

REFERENCES
[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer and T. Wedi, "Video coding with H.264/AVC: tools, performance and complexity," IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7-28, 2004.
[2] J. L. Nunez-Yanez, E. Hung and V. Chouliaras, "A configurable and programmable motion estimation processor for the H.264 video codec," FPL 2008, pp. 149-154, Sept. 2008.
[3] Y.-W. Huang, T.-C. Wang, B.-Y. Hsieh and L.-G. Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," ISCAS, May 2003.
[4] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang and L.-G. Chen, "Analysis and architecture design of variable block-size motion estimation for H.264/AVC," IEEE Transactions on Circuits and Systems I, vol. 53, no. 3, pp. 578-593, March 2006.
[5] S. Y. Yap and J. V. McCanny, "A VLSI architecture for advanced video coding motion estimation," ASAP, pp. 293-301, June 2003.
[6] C.-Y. Kao and Y.-L. Lin, "An AMBA-compliant motion estimator for H.264 advanced video coding," IEEE International SoC Conference (ISOCC), Seoul, Korea, October 2004.
[7] B. M. Li and P. H. Leong, "Serial and parallel FPGA-based variable block size motion estimation processors," Journal of Signal Processing Systems, vol. 51, no. 1, pp. 77-98, April 2008.
[8] B.-F. Wu, H.-Y. Peng and T.-L. Yu, "Efficient hierarchical motion estimation algorithm and its VLSI architecture," IEEE Transactions on VLSI Systems, vol. 16, no. 10, pp. 1385-1398, Oct. 2008.
[9] Y.-W. Huang, C.-Y. Chen, C.-H. Tsai, C.-F. Shen and L.-G. Chen, "Survey on block matching motion estimation algorithms and architectures with new results," Journal of VLSI Signal Processing, vol. 42, no. 3, pp. 297-320, March 2006.
[10] S.-C. Cheng and H.-M. Hang, "A comparison of block-matching algorithms mapped to systolic-array implementation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 5, pp. 741-757, Oct. 1997.
[11] T. Dias, S. Momcilovic, N. Roma and L. Sousa, "Adaptive motion estimation processor for autonomous video devices," EURASIP Journal on Embedded Systems, vol. 2007, no. 1, January 2007.
[12] K. Babionitakis et al., "A real-time motion estimation FPGA architecture," Journal of Real-Time Image Processing, vol. 3, no. 1-2, pp. 3-20, March 2008.
[13] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 74-90, Nov. 1998.
[14] Information available at http://www.xilinx.com/products/ipcenter/DODI-H264-ME.htm
[15] S. Saponara, K. Denolf, G. Lafruit, C. Blanch and J. Bormans, "Performance and complexity co-evaluation of the advanced video coding standard for cost-effective multimedia communications," EURASIP Journal on Applied Signal Processing, no. 2, pp. 220-235, Feb. 2004.
[16] Information available at http://www.videolan.org/developers/x264.html
[17] 1080p HD sequences obtained from http://nsl.cs.sfu.ca/wiki/index.php/Video_Library_and_Tools#HD_Sequences_from_CBC
[18] JM reference software [available on-line]: https://bs.hhi.de/suehring/tml/download
[19] D. Alfonso, F. Rovati, D. Pau and L. Celetto, "An innovative, programmable architecture for ultra-low power motion estimation in reduced memory MPEG-4 encoder," IEEE Transactions on Consumer Electronics, vol. 48, no. 3, pp. 702-708, Aug. 2002.
[20] H.-Y. C. Tourapis and A. M. Tourapis, "Fast motion estimation within the H.264 codec," ICME 2003, vol. 3, pp. III-517-520, July 2003.
[21] T. Toivonen and J. Heikkila, "Improved unsymmetric-cross multi-hexagon-grid search algorithm for fast block motion estimation," IEEE International Conference on Image Processing, pp. 2369-2372, Oct. 2006.
[22] G. V. Varatkar and N. R. Shanbhag, "Error-resilient motion estimation architecture," IEEE Transactions on VLSI Systems, vol. 16, no. 10, pp. 1399-1412, Oct. 2008.