ELEC692 VLSI Signal Processing Architecture
Lecture 7: VLSI Architecture for Block Matching Algorithm for Video Compression
* Part of the notes is taken from the course notes of Prof. Bing Zeng's ELEC 533

Reference
• P. Pirsch, N. Demassieux, W. Gehrke, "VLSI architectures for video compression – a survey", Proceedings of the IEEE, vol. 83, no. 2, pp. 220-246, Feb. 1995
• T. Komarek, P. Pirsch, "Array architectures for block matching algorithms", IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1301-1310, Oct. 1989

Interframe Coding/Motion Estimation of Video Sequence

Interframe Transform/Predictive Coding
• Prediction is based on a previously processed frame
• Prediction is accomplished by motion estimation (ME)
• Motion estimation is done in the spatial domain
• The 2-D DCT has to be inside the coding loop, and a 2-D IDCT is needed to convert the frequency-domain information back to the spatial domain

Motion Compensated Prediction

Block Matching Method (figure: search window)

Block Matching Criterion
• Mean Square Error (MSE)
$$\mathrm{MSE}(m, n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl( x_t(i, j) - x_{t-1}(i+m, j+n) \bigr)^2$$
• Mean Absolute Difference (MAD)
$$\mathrm{MAD}(m, n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl| x_t(i, j) - x_{t-1}(i+m, j+n) \bigr|$$
Here $x_t$ is the current frame, $x_{t-1}$ the previous frame, and $(m, n)$ the displacement of the candidate block.

Important factors for BM Motion Estimation
• Block size
– 8×8, 16×16, variable
• Size of the search window
– Depends on frame differences, speed of moving objects, resolution, etc.
• Matching criterion
– Accuracy vs. complexity, use of truncated pixels
• Search strategy
– Full search, hierarchical search, subsampling of the motion field
• Hardware considerations

Real-time processing for BMA
• Let the block size be 16×16 and the window size 32×32. Assuming CIF frames at 30 frames/s, we need
$$256\,\frac{\text{ops}}{\text{search}} \times 289\,\frac{\text{searches}}{\text{block}} \times 396\,\frac{\text{blocks}}{\text{frame}} \times 30\,\frac{\text{frames}}{\text{sec}} \approx 879\ \text{Mops/sec}$$
For CCIR 601 or HDTV, several to tens of GOPS are required, so full search has to be implemented in dedicated hardware.
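The operation count above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function name and parameters are hypothetical, CIF is taken as 352×288 pixels, the maximum displacement is taken as w = 8 (which gives the 289 = 17² candidates per block quoted above), and one "op" is counted as one absolute-difference accumulation per pixel.

```python
def full_search_ops_per_second(block=16, w=8, width=352, height=288, fps=30):
    """Rough operation count for full-search block matching (hypothetical helper).

    Assumptions: one op per pixel comparison, frame dimensions that are
    multiples of the block size, and (2w+1)^2 candidate positions per block.
    """
    ops_per_candidate = block * block                       # 256 for 16x16
    candidates_per_block = (2 * w + 1) ** 2                 # 289 for w = 8
    blocks_per_frame = (width // block) * (height // block)  # 396 for CIF
    return ops_per_candidate * candidates_per_block * blocks_per_frame * fps


print(full_search_ops_per_second())  # 878929920, i.e. about 879 Mops/sec
```

Changing `width`, `height`, and `w` to larger formats shows how quickly the count moves into the GOPS range.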
Exhaustive Search Block Matching
• A block of size N×N of the current image (reference block, denoted by X)
• is matched with all the blocks located within a search window (candidate blocks, denoted by Y).
• Maximum displacement: w
• The mean absolute difference (MAD) between the blocks is computed
• The matching distance D is given by
$$D(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \bigl| x(i, j) - y(i+m, j+n) \bigr|$$
$$v = \begin{pmatrix} m \\ n \end{pmatrix} \Bigg|_{D_{\min}}$$
where v is the motion vector.
Number of candidate blocks to be considered: $(2w+1)^2$

Algorithm to find the motion vector

    Dmin = MAXVALUE
    Vmin = (0,0)
    for m = -w to +w
      for n = -w to +w
        D(m,n) = 0
        for i = 1 to N
          for j = 1 to N
            D(m,n) = D(m,n) + |x(i,j) - y(i+m,j+n)|
          endfor
        endfor
        if D(m,n) < Dmin then
          Dmin = D(m,n)
          Vmin = (m,n)
        endif
      endfor
    endfor

Dependency graph (figure: calculating the MAD; calculating s_i(m,n) and s(m,n); calculating Dmin and v)

Dependency graph
• The BM algorithm can be described by several different dependency graphs
• Example 1 (figure)
AD = absolute difference and addition
M = minimum value computation

Dependency graph
• Example 2 (figure)

Data input
• Line scan and block scan
• Line scan
– TV lines are run through as a whole, from the upper to the lower side of the frame
• Block scan
– Square blocks of n×n pixels are run through in a block-line manner
– Well suited if the data are supplied by a memory with block-scan output
– Pixels within a block are traversed column by column
– E.g. a (3×3)-pixel block

    x(1,1) x(1,2) x(1,3)
    x(2,1) x(2,2) x(2,3)
    x(3,1) x(3,2) x(3,3)

Data are read in the order x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2), x(1,3), x(2,3), x(3,3).

Mapping BMA onto Systolic Arrays
• Decompose the algorithm into its basic operations and convert it into a form where each result is assigned to a unique variable
• Formulate it as an n-dimensional dependence graph (DG) of computation nodes and data dependence arcs.
• One straightforward mapping assigns a dedicated PE to each node of the DG and a communication link to each edge of the DG.
• A more efficient design with higher processor utilization is obtained if each PE executes the operations of multiple computation nodes
• This requires a time schedule and an assignment of multiple nodes to a single PE by projection. The PEs need to be programmable to some extent.

Mapping BMA onto Systolic Arrays
• The BMA is defined over a 4-dimensional index space (i,j,m,n)
• The BMA can be decomposed into two parts, each defined over a two-dimensional index space.
– The first one, spanned by the indices i and j, computes the sum D(m,n):
$$D_i(m, n) = \sum_{j=1}^{N} \bigl| x(i, j) - y(i+m, j+n) \bigr|$$
$$D(m, n) = \sum_{i=1}^{N} D_i(m, n)$$
– The second one, defined over m and n, performs the minimum search and the selection of the displacement vector:
$$D_{\min} = \min_{m,n} \{ D(m, n) \}, \qquad v = (m, n) \big|_{D_{\min}}$$

Transform it into a 2-D array
• The computation of D(m,n) is mapped onto a 2-D array of PEs
• The computation of v(X,Y) is mapped into time

Realistic implementation of the 2-D array
• Reduction of the cycle time
– Pipelining of the computation of D(m,n).
• I/O management
– Each AD-PE receives a new value of y(m+i,n+j) at each clock cycle.
• Transmitting these N² values from an external memory at every cycle is not feasible. We can take advantage of the fact that the values belong to the search window.
• A portion of the search window of size N(2w+N) is stored in the circuit in a 2-D bank of shift registers, able to shift in the up, down, and right directions.
• Each AD-PE has one of these registers and can obtain at each cycle the value of y(m+i,n+j) that it needs
• To update this register bank, a new column of 2w+N pixels of the search area is serially entered into the circuit and inserted into the bank of registers.
• To load a new reference block with a low I/O overhead, double buffering of x(i,j) is required: the pixels x'(i,j) of the new reference block are serially loaded during the computation of the current reference block.
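The two-part decomposition described in this section (row sums D_i(m,n) accumulated into D(m,n), followed by the minimum search over (m,n)) can be sketched in software as follows. This is an illustrative model, not the hardware data flow: the function names are hypothetical, indices are 0-based, and the search window `y` is assumed to be an (N+2w)×(N+2w) list of lists in which the zero-displacement candidate starts at offset (w, w).

```python
def row_distortion(x, y, i, m, n, w):
    """D_i(m, n): sum over j of |x(i,j) - y(i+m, j+n)| for one block row i."""
    N = len(x)
    # The +w offsets map displacements in [-w, w] to non-negative indices.
    return sum(abs(x[i][j] - y[i + m + w][j + n + w]) for j in range(N))


def full_search(x, y, w):
    """Part 1: accumulate D(m, n) from row sums. Part 2: minimum search."""
    N = len(x)
    d_min, v = None, (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = sum(row_distortion(x, y, i, m, n, w) for i in range(N))
            if d_min is None or d < d_min:  # strict '<' keeps the first minimum
                d_min, v = d, (m, n)
    return v, d_min


# Toy example: N = 4, w = 2, reference block cut out at displacement (1, -2).
y = [[r * 100 + c * c for c in range(8)] for r in range(8)]
x = [[y[i + 1 + 2][j - 2 + 2] for j in range(4)] for i in range(4)]
print(full_search(x, y, 2))  # ((1, -2), 0)
```

In the array architectures, the inner sums are what the AD-PEs evaluate in parallel, while the comparison against `d_min` corresponds to the M-PE.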
Implementation of the 2-D array (figure)

2-D array
• An alternative projection of the DG onto the (i,j)-plane provides the architecture AB2
• The current-frame data x(i,j) remain fixed in the AD PEs, so they have to be loaded into the array beforehand.
Time required = (2w+1)(2w+1)

Mapping to a 1-D array
• A more efficient design with higher processor utilization is obtained if each PE executes the operations of multiple computation nodes
• The DG is mapped to a 1-D array of PEs, which computes in parallel the partial distortion along one row.
• D(m,n) is computed in N cycles

1-D array
• Project the DG along the i-axis onto a one-dimensional signal flow graph.
• Called the AB1 array, it has the size of a block
Consecutive computation of all $(2w+1)^2$ candidate blocks per displacement vector requires $N(2w+1)^2$ time instances

Another way of mapping: search-area based
• The dependency graph for computing v(X,Y) is mapped onto a 2-D array of $(2w+1)^2$ PEs, while the dependency graph for computing D(m,n) is mapped into time
• Each PE, working in parallel with the others, keeps track of one particular distortion computation and sequentially explores the reference block.
• At each cycle, each PE receives a different value of y(m+i,n+j), and all the PEs receive the value of one pixel of the reference block, which is broadcast to the array.
• After N² cycles, each of the $(2w+1)^2$ PEs holds one value of D(m,n) corresponding to a particular displacement (m,n)
• To find the minimum distortion value, find the minimum of each column by down-shifting the D(m,n) in the PEs, then find the final minimum by left-shifting the results in the M-PEs.

2-D search-area-based architecture
Part of the search area, of size w(2w+N), needs to be stored in order to reduce I/O.

1-D search-area-based architecture
• An array of (2w+1) processing elements executes in N² cycles the computation of the distortions D(m,n) corresponding to one line (resp. column) of possible motion vectors.
• This process is repeated sequentially 2w+1 times to compute all the distortions.

Another architecture
• Requires only a sequential data input.
• Dummy data, denoted by dots, are inserted into the stream of reference data to guarantee a regular data flow without any data permutation within the array
Time required = (2w+1)(2w+1)N
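The cycle counts quoted for the different mappings can be compared side by side with a short script. This is a rough sketch only: the function name and the dictionary labels are hypothetical, the counts are per reference block, and pipeline fill, data loading, and the shifting needed for the minimum search are ignored.

```python
def cycles_per_reference_block(N, w):
    """Cycle counts per reference block, as stated in the notes (idealized)."""
    candidates = (2 * w + 1) ** 2  # number of displacement vectors
    return {
        "2-D block-based (AB2)": candidates,           # one D(m,n) per cycle
        "1-D block-based (AB1)": N * candidates,       # N cycles per D(m,n)
        "2-D search-area-based": N * N,                # all D(m,n) in parallel
        "1-D search-area-based": N * N * (2 * w + 1),  # one line per N^2 cycles
        "sequential-input array": N * candidates,      # (2w+1)(2w+1)N
    }


for name, cycles in cycles_per_reference_block(N=16, w=8).items():
    print(f"{name}: {cycles} cycles")
```

For N = 16 and w = 8, the two 2-D mappings need a few hundred cycles per block, while the 1-D mappings need several thousand, which reflects the trade-off between the number of PEs and the processing time.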