VLSI Architecture for Block Matching Algorithm for Video compression

ELEC692 VLSI Signal Processing Architecture
Lecture 7
* Part of the notes is taken from the course notes of Prof. Bing Zeng’s ELEC 533
Reference
• P. Pirsch, N. Demassieux, W. Gehrke, “VLSI architecture for Video compression – A survey”, Proceedings of the IEEE, vol. 83, no. 2, pp. 220-246, Feb. 1995
• T. Komarek, P. Pirsch, “Array Architecture for Block Matching Algorithm”, IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1301-1310, Oct. 1989
Interframe Coding/Motion
Estimation of Video Sequence
Interframe Transform/Predictive
Coding
• Prediction is based on a previously processed
frame
• Prediction is accomplished by motion estimation
(ME)
• Motion estimation is done in spatial domain
• 2-D DCT has to be inside the coding loop and a
2-D IDCT is needed to convert the frequency
domain information back to spatial domain
Motion Compensated Prediction
Block Matching Method
Search window
Block matching Criterion
• Mean Square Error (MSE)
$$\mathrm{MSE}(m, n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl( x_t(i, j) - x_{t-1}(i + m, j + n) \bigr)^2$$
• Mean Absolute Difference (MAD)
$$\mathrm{MAD}(m, n) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \bigl| x_t(i, j) - x_{t-1}(i + m, j + n) \bigr|$$
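The two criteria can be sketched in a few lines of Python. This is only an illustration: the function names `mse` and `mad` are hypothetical, and both functions assume the caller has already extracted the displaced candidate block from the previous frame, so that the two arguments are equally sized N × N lists of pixel values.

```python
def mse(block, candidate):
    """Mean square error between two equally sized N x N blocks."""
    n = len(block)
    total = sum((block[i][j] - candidate[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / n ** 2


def mad(block, candidate):
    """Mean absolute difference between two equally sized N x N blocks."""
    n = len(block)
    total = sum(abs(block[i][j] - candidate[i][j])
                for i in range(n) for j in range(n))
    return total / n ** 2
```

MAD avoids the multiplications needed by MSE, which is one reason it is attractive for hardware.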
Important factors for BM Motion
Estimation
• Block size: 8 × 8, 16 × 16, or variable
• Size of the search window
– Depends on frame differences, the speed of moving objects, resolution, etc.
• Matching criterion
– Accuracy vs complexity, use of truncated pixels
• Search strategy
– Full search, hierarchical search, subsampling of
motion field
• Hardware consideration
Real time processing for BMA
• With a block size of 16 × 16, a search window of 32 × 32, and CIF frames at 30 frames/s, full search requires
$$256\ \frac{\text{ops}}{\text{search}} \times 289\ \frac{\text{searches}}{\text{block}} \times 396\ \frac{\text{blocks}}{\text{frame}} \times 30\ \frac{\text{frames}}{\text{sec}} \approx 879\ \text{Mops/sec}$$
For CCIR 601 or HDTV, the requirement rises to several GOPS or even tens of GOPS, so full search has to be implemented in dedicated hardware.
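The operation count above can be reproduced with a short calculation. The sketch below assumes the standard CIF resolution of 352 × 288 pixels (the notes only say "CIF"), derives the maximum displacement as w = (window − block) / 2, and counts one operation per pixel comparison; the function name is hypothetical.

```python
def full_search_ops_per_second(width, height, block, window, fps):
    """Operations per second for full-search block matching."""
    w = (window - block) // 2                    # maximum displacement
    ops_per_search = block * block               # one op per pixel pair
    searches_per_block = (2 * w + 1) ** 2        # candidate positions
    blocks_per_frame = (width // block) * (height // block)
    return ops_per_search * searches_per_block * blocks_per_frame * fps


# CIF (assumed 352 x 288), 16 x 16 blocks, 32 x 32 window, 30 frames/s
cif_ops = full_search_ops_per_second(352, 288, 16, 32, 30)
print(f"{cif_ops / 1e6:.0f} Mops/sec")
```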
Exhaustive Search Block Matching
• A block of size N × N in the current image (the reference block, denoted by X)
• is matched against all the blocks located within a search window (the candidate blocks, denoted by Y).
• Maximum displacement: w
• The mean absolute difference (MAD) between the blocks is computed.
• The matching distance D is given by
$$D(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |x(i, j) - y(i + m, j + n)|$$
$$v = \begin{pmatrix} m \\ n \end{pmatrix} \Big|_{D_{\min}}$$
where v is the motion vector.
Number of candidate blocks to be considered: (2w+1)²
Algorithm to find the motion vector
Dmin = MAXVALUE
Vmin = (0,0)
for m = -w to +w
    for n = -w to +w
        D(m,n) = 0
        for i = 1 to N
            for j = 1 to N
                D(m,n) = D(m,n) + |x(i,j) - y(i+m,j+n)|
            endfor
        endfor
        if D(m,n) < Dmin then
            Dmin = D(m,n)
            Vmin = (m,n)
        endif
    endfor
endfor
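A runnable version of this search might look like the following Python sketch. It makes two assumptions that are not in the notes: indices are 0-based, and the search window `y` is passed as an (N+2w) × (N+2w) array in which the candidate block for displacement (m, n) starts at row m + w and column n + w, so that negative displacements never produce negative indices. The function name `full_search` is hypothetical.

```python
def full_search(x, y, w):
    """Exhaustive block matching.

    x: N x N reference block (list of lists).
    y: (N + 2w) x (N + 2w) search window; displacement (0, 0)
       corresponds to the block starting at y[w][w].
    Returns (motion_vector, minimum_distortion).
    """
    size = len(x)
    d_min = float("inf")
    v_min = (0, 0)
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = 0
            for i in range(size):
                for j in range(size):
                    d += abs(x[i][j] - y[i + m + w][j + n + w])
            if d < d_min:          # strict '<' keeps the first minimum
                d_min = d
                v_min = (m, n)
    return v_min, d_min
```

For example, with a 4 × 4 window holding the values 0 to 15 row by row and a 2 × 2 reference block copied from rows 2-3, columns 0-1, the search with w = 1 returns the vector (1, -1) with distortion 0.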
Dependency graph
Calculating MAD
Calculating s_i(m,n) and s(m,n)
Calculate Dmin and v
Dependency graph
• The BM algorithm can be described by several different dependency graphs
• Example 1
AD = absolute difference and addition
M = minimum value computation
Dependency graph
• Example 2
Data input
• Line scan and block scan
• Line scan
– TV lines are traversed as a whole, from the top to the bottom of the frame
• Block scan
– Square blocks of n × n pixels are traversed block line by block line
– Well suited if the data are supplied by a memory with block-scan output
– Pixels within a block are traversed column by column
– E.g., for a (3 × 3)-pixel block
x(1,1) x(1,2) x(1,3)
x(2,1) x(2,2) x(2,3)
x(3,1) x(3,2) x(3,3)
the data are read in the order
x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2), x(1,3), x(2,3), x(3,3)
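The column-by-column order inside a block can be generated with a one-line comprehension. The sketch below uses 1-based (row, column) indices to match the x(i,j) notation of the example; the function name is hypothetical.

```python
def block_scan_order(n):
    """1-based (row, col) indices of an n x n block, column by column."""
    return [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]


for i, j in block_scan_order(3):
    print(f"x({i},{j})", end=" ")
print()
```

The outer loop runs over columns and the inner loop over rows, which is what makes the traversal column-major.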
Mapping BMA onto Systolic Arrays
• Decompose the algorithm into its basic operations and convert it into a form where each result is assigned to a unique variable
• Formulate it as an n-dimensional dependence graph (DG) of computation nodes and data dependence arcs.
• One straightforward mapping assigns a dedicated PE to each node of the DG and a communication link to each edge of the DG.
• A more efficient design with higher processor utilization results if each PE executes the operations of multiple computation nodes
• This requires a time schedule and an assignment of multiple nodes to a single PE by projection. The PEs need to be programmable to some extent.
• The BMA is defined over a 4-dimensional index space (i,j,m,n)
• The BMA can be decomposed into two parts, each defined over a two-dimensional index space.
– The 1st one is spanned by the indices i, j: the summation that yields D(m,n)
$$D_i(m, n) = \sum_{j=1}^{N} |x(i, j) - y(i + m, j + n)|$$
$$D(m, n) = \sum_{i=1}^{N} D_i(m, n)$$
– The 2nd one is defined over m and n: the minimum search and the selection of the displacement vector
$$D_{\min} = \min_{m, n} \{ D(m, n) \}$$
$$v = (m, n) \big|_{D_{\min}}$$
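The two-part decomposition can be mirrored directly in code: one group of functions works over (i, j) for a fixed displacement, and a separate function searches over (m, n). The Python sketch below is an illustration under stated assumptions, not the notes' own formulation: indices are 0-based, the search window `y` has size (N+2w) × (N+2w) with displacement (m, n) stored at offset (m + w, n + w), and all function names are hypothetical.

```python
def row_distortion(x, y, w, i, m, n):
    """D_i(m, n): sum of absolute differences along row i."""
    return sum(abs(x[i][j] - y[i + m + w][j + n + w])
               for j in range(len(x)))


def distortion(x, y, w, m, n):
    """D(m, n): sum of the row distortions D_i(m, n)."""
    return sum(row_distortion(x, y, w, i, m, n) for i in range(len(x)))


def min_search(x, y, w):
    """Second part: minimum search over the (m, n) index space."""
    candidates = [(m, n) for m in range(-w, w + 1)
                  for n in range(-w, w + 1)]
    v = min(candidates, key=lambda mn: distortion(x, y, w, *mn))
    return v, distortion(x, y, w, *v)
```

Keeping the two index spaces in separate functions makes it easier to see which part is later mapped onto the PE array and which part is mapped onto time.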
Transforming it into a 2-D array
• D(m,n) is mapped onto a 2-D array of PEs
• v(X,Y) is mapped onto time
Realistic implementation of the 2-D array
• Reduction of the cycle time
– Pipelining of the computation of D(m,n).
• I/O management
– Each AD-PE receives a new value of y(m+i,n+j) at each clock cycle.
• Transmitting the N² values from an external memory is not feasible. We can take advantage of the fact that these values belong to the search window.
• A portion of the search window of size N × (2w+N) is stored in the circuit in a 2-D bank of shift registers that can shift up, down, and to the right.
• Each AD-PE has one of these registers and can obtain, at each cycle, the value of y(m+i,n+j) that it needs.
• To update this register bank, a new column of 2w+N pixels of the search area is entered serially into the circuit and inserted at the back of the registers.
• To load a new reference block with low I/O overhead, double buffering of x(i,j) is required: the pixels x’(i,j) of the new reference block are loaded serially during the computation of the current reference block.
Implementation of the 2-D array
2-D array
• An alternative projection of the DG onto the (i, j)-plane provides the architecture AB2
• The current-frame data x(i,j) remain fixed in the AD PEs, so they have to be loaded into the array beforehand.
Time required = (2w+1) × (2w+1)
Mapping to a 1-D array
• More efficient design with a higher
processor utilization if each PE executes
the operations of multiple computation
nodes
• The DG is mapped to a 1-D array of PEs that computes in parallel the partial distortion along one row.
• D(m,n) is computed in N cycles
1-D array
• Project the DG along the i-axis onto a one-dimensional signal flow graph.
• Called the AB1 array, it has the size of a block
Consecutive computation of all (2w+1)² candidate blocks per displacement vector takes N × (2w+1)² time instances
Another way of mapping: search-area based
• The dependency graph for computing v(X,Y) is mapped onto a 2-D array of (2w+1)² PEs, while the dependency graph for computing D(m,n) is mapped onto time
• Each PE, working in parallel with the others, keeps track of one particular distortion computation and sequentially explores the reference block.
• At each cycle, each PE receives a different value of y(m+i,n+j), and all the PEs receive the value of one pixel of the reference block, which is broadcast to the array.
• After N² cycles, each of the (2w+1)² PEs holds one value of D(m,n) corresponding to a particular displacement (m,n)
• To find the minimum distortion value, first find the minimum of each column by down-shifting the D(m,n) values in the PEs, then find the final minimum by left-shifting the resulting D(m,n) values in the M-PEs.
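The behaviour of this search-area based array can be imitated in software, with one accumulator per PE and one loop iteration per cycle. The Python sketch below is a behavioural model only, under several assumptions not stated in the notes: 0-based indices, a search window `y` of size (N+2w) × (N+2w) with displacement (m, n) stored at offset (m + w, n + w), and a two-stage minimum (per column, then across columns) standing in for the shifting through the PEs and M-PEs. The function name is hypothetical.

```python
def search_area_based_match(x, y, w):
    """Behavioural model of the (2w+1) x (2w+1) PE array."""
    size = len(x)
    side = 2 * w + 1
    # One accumulator per PE; PE (m, n) is stored at acc[m + w][n + w].
    acc = [[0] * side for _ in range(side)]
    cycles = 0
    for i in range(size):
        for j in range(size):
            pixel = x[i][j]                      # broadcast to all PEs
            for m in range(-w, w + 1):           # all PEs work in parallel
                for n in range(-w, w + 1):
                    acc[m + w][n + w] += abs(pixel - y[i + m + w][j + n + w])
            cycles += 1                          # one reference pixel per cycle
    # Stage 1: minimum of each column. Stage 2: minimum over the columns.
    column_minima = []
    for col in range(side):
        row = min(range(side), key=lambda r: acc[r][col])
        column_minima.append((acc[row][col], row - w, col - w))
    d_min, m, n = min(column_minima, key=lambda t: t[0])
    return (m, n), d_min, cycles
```

The returned cycle count equals N², matching the bullet above. When several displacements share the minimum distortion, this model may pick a different one than a plain raster-order search would.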
2-D search area based architecture
Part of the search area, of size w × (2w+N), must be stored in order to reduce I/O.
1-D search area based architecture
• An array of (2w+1) processing elements computes, in N² cycles, the distortions D(m,n) corresponding to one line (or one column) of possible motion vectors.
• This process is repeated sequentially 2w+1 times to compute all the distortions.
Another architecture
• Requires only a sequential data input.
• Dummy data, denoted by dots, are inserted into the stream of reference data to guarantee a regular data flow without any data permutation within the array
Time required = (2w+1) × (2w+1) × N
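The cycle counts quoted for the different arrays can be collected in one place for comparison. In the Python sketch below the cycle counts come from these notes, and the PE counts for the two search-area based arrays are also stated in the notes. The PE counts for AB2 (one AD PE per block pixel, N²) and AB1 (one PE per pixel of a block row, N) are inferred rather than stated and should be read as assumptions. M-PEs and pipeline fill time are ignored, and the function name is hypothetical.

```python
def architecture_costs(N, w):
    """Return {architecture: (number of PEs, cycles per reference block)}."""
    s = 2 * w + 1                      # candidate positions per dimension
    return {
        "AB2, 2-D block based": (N * N, s * s),        # PE count assumed
        "AB1, 1-D block based": (N, N * s * s),        # PE count assumed
        "2-D search-area based": (s * s, N * N),
        "1-D search-area based": (s, N * N * s),
    }


for name, (pes, cycles) in architecture_costs(16, 8).items():
    print(f"{name:24s} PEs = {pes:4d}  cycles = {cycles:5d}")
```

Under these assumptions the product of PE count and cycle count is the same for all four arrays, N² × (2w+1)², which is the total number of absolute-difference operations per reference block; the architectures differ in how they trade parallelism against time.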