FPGA Implementation of Reduced Bit Plane Motion Estimation
Shrutisagar Chandrasekaran, Abbes Amira and Faycal Bensaali
Overview September 2004 Chandrasekaran 1 MAPLD 2005/P200

Outline
- Research Objectives
- Introduction
- Reduced Bit-Plane Motion Estimation
- Proposed Architecture
- FPGA Implementations and Results
- Conclusions
- Future Work and Acknowledgments

Research Objectives
- To efficiently implement a reduced bit-plane motion estimation algorithm on FPGA using Handel-C for onboard video compression
- To develop efficient low-power architectures for image processing techniques such as Motion Estimation (ME)
- To evaluate and model the power consumption of FPGA-based designs at various levels of abstraction, and to evolve and implement strategies for low-power, energy-efficient design

Introduction
- Block Matching (BM) is a widely used Motion Estimation (ME) technique that calculates motion vectors by minimising a cost function
- Optimal prediction is obtained when a Full Search (FS) algorithm is performed
- The FS algorithm is computationally intensive and requires a large number of I/O pins and a large bandwidth for real-time ME
- An effective method for reducing the complexity of an ME architecture is to reduce the number of bit planes used for computing the motion vector
- Most of the motion information is in the 6th bit plane, and a significant amount is also available in the 7th bit plane
- The lower bit planes contain significantly less motion information, as they represent the smooth areas of the image
- Reduced bit-plane methods for ME, using a range of arithmetic units and simple Boolean operations, lead to power- and area-efficient architectures

Reduced Bit-Plane ME
Pseudo Code

    % Assumes previous_frame is padded by d pixels on every side,
    % so that each search window lies fully inside it.
    vec = [];
    for i = 1:dim:M-dim+1,
      for j = 1:dim:N-dim+1,
        ii = i+d; jj = j+d;
        % search window: the co-located block extended by d on each side
        window = previous_frame(ii-d:ii+dim+d-1, jj-d:jj+dim+d-1);
        [m,n] = size(window);
        I = 1; J = 1;
        % cost of the first candidate position
        val = sum(sum((current_frame(i:i+dim-1,j:j+dim-1) - window(I:I+dim-1,J:J+dim-1)).^2));
        % full search over every candidate position in the window
        for l = 1:m-dim+1,
          for k = 1:n-dim+1,
            val_t = sum(sum(abs(current_frame(i:i+dim-1,j:j+dim-1) - window(l:l+dim-1,k:k+dim-1)).^2));
            if val_t < val,
              I = l; J = k; val = val_t;
            end
          end
        end
        % convert the window position to a displacement relative to the block
        I = I-d-1; J = J-d-1;
        vec = [vec; I,J];
      end
    end

Where
- dim : block size
- d : border extension for the window (square window)
- vec : array of motion vectors

Proposed Architecture
[Figure: block diagram of the proposed architecture. Labels: Control Unit; bits 31 to 0; PSU: Processor Sub-Unit; bits 0 to 15; 16-bit inputs; PSU 2; Adder 0 to Adder 15 (5 bits each); Adder (9 bits); Comparator (2 bits, least input location); new-minimum registers (9 bits, 5 bits, 5 bits); Comparator; intermediate motion vectors (5 bits); final motion vectors.]

- The architecture exploits the massive parallelism available in hardware to reduce the computation time
- The search window is stored on-chip in an array of 32-bit-wide registers, the width of each register being equal to the size of the search window
- The block size is taken to be 16x16 pixels at 1 bit per pixel, and the block is stored on-chip in an array of registers
- Each Processor Sub-Unit (PSU) contains 256 Processor Elements (PEs), made up of 256 XOR gates and 16 5-bit adders, which perform the block matching in parallel and compute the SAD (Sum of Absolute Differences)
- 2 PSUs are used to cover the entire search window by means of bitwise shifts of the contents of the search window in the horizontal and vertical directions
- The intermediate values of the motion vectors are stored in an on-chip array, with one location for each PSU
- At the end, the global values of the motion vectors are obtained from the intermediate values and the outputs of the comparators
- The proposed architecture yields improved performance metrics when compared to other existing work

    Architecture   No. of PEs   Throughput         Search range
    Proposed       256          1 MV/308 cycles    [-8, 7]
    [1]            1024         1 MV/256 cycles    [-16, 15]
    [2]            256          1 MV/496 cycles    [-8, 7]
    [3]            256          1 MV/2209 cycles   [-8, 7]

[1] Y-H. Yeh and C-Y. Lee, IEEE Trans. VLSI Syst. 7, 345 (1999)
[2] T. Komarek and P. Pirsch, IEEE Trans. Circuits Syst. 36, 1301 (1989)
[3] C-H. Hsieh and T-P. Lin, IEEE Trans. Circuits Syst. Video Technol. 2, 169 (1992)

FPGA Implementations and Results
- To verify the performance of the proposed architectures, the designs have been prototyped on the Celoxica RC1000 board, which contains the Xilinx XCV2000E FPGA
- Available on-chip logic resources include
  - Slices : 19200
  - CLB Array : 80 x 120
  - Block RAM : 655,360 bits
  - Distributed RAM : 614,400 bits
- The RC1000 has 4 memory banks, which communicate with the host by means of DMA transfers

Design Flow
[Figure: design flow]

Handel-C
- Handel-C adds constructs to ANSI-C to enable DK to directly implement hardware
- Fully synthesizable HW programming language based on ANSI-C
- Implements a C algorithm direct to optimized FPGA, or outputs RTL from C
- Majority of ANSI-C constructs supported by DK
  - Control statements (if, switch, case, etc.)
  - Integer arithmetic
  - Functions
  - Pointers
  - Basic types (structures, arrays, etc.)
  - #define
  - #include
- Software-only ANSI-C constructs
  - Recursion
  - Side effects
  - Standard libraries
  - Malloc
- Additions for hardware
  - Parallelism
  - Timing
  - Interfaces
  - Clocks
  - Macro pre-processor
  - RAM/ROM
  - Shared expression
  - Communications
  - Handel-C libraries
  - FP library
  - Bit manipulation

Reduced Bit-Plane ME
[Figure: data flow on the RC1000 board. Labels: 16 pixels x 16 pixels; Bank0, Bank1, Bank2, Bank3; XCV2000E; 8x8 blocks; motion vectors.]

- The bit-plane values from the current frame are sent from the host to SRAM Bank 0, and those from the previous frame are sent as 16-bit values to SRAM Bank 1
- The motion vectors are computed by the ME core and stored in SRAM Bank 3
- The host application reads the motion vectors and generates the predicted image in real time
- The proposed architecture is area efficient, as the motion estimation is performed on a single bit plane, requiring compact logic and a greatly reduced on-chip memory size
- The architecture is efficient and compact, and can be massively parallelised, as each PE contains only simple 1-bit XOR gates
- Memory access is greatly reduced because only a single bit plane is used, saving a considerable amount of I/O power
- This, along with architectural-level optimisations including parallelism and pipelining, yields a power-efficient implementation
- The implementation is carried out on the Celoxica RC1000 board equipped with the Xilinx XCV2000E FPGA, and is also synthesised for the Xilinx QPro Virtex-II FPGA
- Results in terms of power, area and maximum frequency show that using reduced bit planes instead of full-resolution images drastically reduces the FPGA resources used

Performance metrics of the reduced bit-plane full search block matching (RBFSBM) algorithm implemented on the Virtex-E and the QPro Virtex-II FPGAs

    Performance Metric        Virtex-E   QPro Virtex-II
    Area occupied (slices)    1500       1488
    Max frequency (MHz)       43.57      76.247
    Max power (mW)            432.65     227.31
    Energy/CIF frame (mJ)     1.934      1.40
    Max throughput (FPS)      89.305     173

Conclusions
- A reduced bit-plane architecture for full search block matching has been proposed
- The proposed architecture is low power, area efficient and suitable for VLSI/FPGA implementation
- The developed architecture can be used for space applications such as onboard video compression, video conferencing, etc.

Future Work and Acknowledgments
- Develop a complete on-chip compression engine for real-time video compression, with applications ranging from onboard satellite compression to video conferencing
- Explore the effect of algorithmic, architectural and RTL-level optimisations to minimise power consumption

Acknowledgments
- Celoxica (Mr. Roger Gook) and EPSRC for supporting this work
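
Illustrative sketch (not from the original slides)
The single-bit-plane matching described under Proposed Architecture can be sketched in software. The following is a minimal Python model, not the authors' Handel-C design: each row of the search window is packed into one integer, loosely mirroring the 32-bit-wide registers, and the cost of a candidate position is the number of 1 bits in the XOR of block and candidate rows, which for 1-bit pixels equals the SAD. The function names, the bit index k (0 = least significant bit) and the MSB-first packing order are assumptions made for illustration. The defaults dim=16 and d=8 give the [-8, 7] search range listed for the proposed architecture.

```python
import random


def bit_plane_rows(pixels, k):
    """Pack bit k (0 = LSB, an assumed numbering) of each pixel row into one
    integer, with the leftmost pixel in the most significant bit."""
    rows = []
    for row in pixels:
        value = 0
        for p in row:
            value = (value << 1) | ((p >> k) & 1)
        rows.append(value)
    return rows


def match_block(block_rows, window_rows, dim=16, d=8):
    """Full search of a dim x dim one-bit block inside a (dim+2d) x (dim+2d)
    one-bit window. Returns (dy, dx, cost) for the best displacement in
    [-d, d-1]; ties keep the first candidate found."""
    mask = (1 << dim) - 1
    best = None
    for dy in range(-d, d):
        for dx in range(-d, d):
            # Leftmost pixel is the MSB, so the candidate starting at column
            # d+dx ends (d - dx) bits above the LSB of the window row.
            shift = d - dx
            cost = 0
            for r in range(dim):
                cand = (window_rows[d + dy + r] >> shift) & mask
                # XOR marks differing pixels; counting 1s gives the 1-bit SAD.
                cost += bin(block_rows[r] ^ cand).count("1")
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


if __name__ == "__main__":
    random.seed(1)
    # Hypothetical 8-bit test data: a 32x32 window of random pixels.
    window = [[random.randrange(256) for _ in range(32)] for _ in range(32)]
    # Cut the block out of the window at displacement (dy, dx) = (-2, 3),
    # so the search should recover that displacement with zero cost.
    block = [row[8 + 3:8 + 3 + 16] for row in window[8 - 2:8 - 2 + 16]]
    print(match_block(bit_plane_rows(block, 6), bit_plane_rows(window, 6)))
```

The model evaluates the 256 candidate positions one after another, whereas the slides describe hardware that computes the 256 XORs of one position in parallel and shifts the window contents between positions. It is meant only to make the cost function and search range concrete.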