Homework #4 Solution - University of Wisconsin

February 6, 2016
Yu Hen Hu
Department of Electrical and Computer Engineering
University of Wisconsin – Madison
ECE 734 VLSI Array Structures for Digital Signal Processing
Spring 2002
Homework #4 Solution
This homework consists of questions taken from the notes and open-ended questions. You must
do the homework by yourself; no collaboration is allowed. Late homework will receive a
5% penalty per day. There are 100 points in total. This homework is worth 10% of your overall
grade.
1. (25 points) DCT implementation
Consider DCT for JPEG/MPEG image compression applications. Two dimensional,
separable DCTs are applied to each 8 by 8 block of image f(m,n), 0 ≤ m,n ≤ 7, where -128 ≤
f(m,n) ≤ 127 is the value (8-bit 2’s complement integer) of the (m,n) pixel after level shift by
subtracting 128 from each pixel’s value. The 2D DCT coefficients are denoted by F(u,v), 0 ≤
u, v ≤ 7.
On pages 24-26 of the DSP notes (posted on the course web page), a fast 1D 8-point DCT
algorithm by Arai, Agui, and Nakajima is given. You may download the MATLAB m-file
from the course web page to experiment with it yourself. For convenience, the algorithm is listed
below, where the constant multipliers a1 to a5 are represented by 16-bit fixed point 2’s
complement numbers with 7 fractional binary digits (that is, scaled by 128).
Input: x(m), m = 0 to 7.
% a1=0.707, a2=0.541, a3=0.707, a4=1.307, a5=0.383
% scaled values of multipliers (multiplied by 128) in hexadecimal
% a1=a3=005Ah, a2= 0045h, a4=00A7h, a5=0031h
% step 1
% x(m,1) = x(m) + x(7-m), m = 0, 1, 2, 3
% x(m,1) = x(7-m) - x(m), m = 4, 5, 6, 7
% Step 2.
% x(m,2) = x(m,1) + x(3-m,1), m = 0, 1
% x(m,2) = x(3-m,1) - x(m,1), m = 2, 3
% x(4,2) = -x(4,1) - x(5,1)
% x(m,2) = x(m,1) + x(m+1,1), m = 5, 6
% x(7,2) = x(7,1)
% Step 3.
% x(0,3) = x(0,2) + x(1,2)
% x(1,3) = x(0,2) - x(1,2)
% x(2,3) = x(2,2) + x(3,2)
% x(4,3) = x(4,2) + x(6,2)
% x(m,3) = x(m,2), m = 3, 5, 6, 7
% Step 4.
% x(m,4) = x(m,3), m = 0, 1, 3, 7
% x(2,4) = x(2,3)*a1
% tmp = x(4,3) * a5
% x(4,4) = x(4,3)*a2 + tmp
% x(5,4) = x(5,3)*a3
% x(6,4) = x(6,3)*a4 + tmp
% Step 5.
% x(m,5) = x(m,4), m = 0, 1, 4, 6
% x(2,5) = x(2,4) + x(3,4)
% x(3,5) = x(3,4) - x(2,4)
% x(5,5) = x(7,4) + x(5,4)
% x(7,5) = x(7,4) + x(5,4)
% Step 6.
% x(m,6) = x(m,5), m = 0, 1, 2, 3
% x(4,6) = x(4,5) + x(7,5)
% x(5,6) = x(5,5) + x(6,5)
% x(6,6) = x(5,5) - x(6,5)
% x(7,6) = x(7,5) + x(4,5)
output: y(m) = x(m,6), m = 0 to 7
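The hexadecimal multiplier values in the listing can be cross-checked with a short Python sketch. It assumes the constants are obtained by rounding each nominal value times 128 to the nearest integer; the names NOMINAL, FRAC_BITS and to_fixed are illustrative and do not come from the course m-file.

```python
# Nominal multiplier values as given in the listing's comments.
NOMINAL = {"a1": 0.707, "a2": 0.541, "a3": 0.707, "a4": 1.307, "a5": 0.383}
FRAC_BITS = 7  # assumption: scaling by 128 = 2**7, as stated in the listing


def to_fixed(value, frac_bits=FRAC_BITS):
    """Scale a real constant by 2**frac_bits and round to the nearest integer."""
    return round(value * (1 << frac_bits))


scaled = {name: to_fixed(v) for name, v in NOMINAL.items()}

for name, q in scaled.items():
    # Show the 16-bit hex pattern and the value it actually represents.
    print(f"{name} = {q:04X}h (represents {q / (1 << FRAC_BITS):.4f})")
```

Under this assumption the sketch yields the same four hexadecimal patterns as the listing, and the printed decimal values show how much precision is lost with only 7 fractional bits.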
We will consider a dedicated hardware implementation of this algorithm. We will use four
types of components: (i) hardware multiplier, M; (ii) hardware adder, A; (iii) dedicated buses,
B, and (iv) registers R. We assume the eight 8-bit inputs can be made available
simultaneously if needed from input ports. The outputs will be stored in eight output registers.
The outputs will not be made available outside this hardware DCT module until all eight
are ready.
(a) (10 points) Derive the dependence graph of this given algorithm. Label input, output
variables, as well as intermediate variables {x(m,i); 0 ≤ m ≤ 7, 1 ≤ i ≤ 6} provided space
permits.
[Figure 1. Dependence Graph of the fast DCT algorithm. The nodes are the inputs x(0) to x(7), the intermediate variables x(m,i) and tmp, and the outputs y(0) to y(7); the constants a1 to a5 label the multiply nodes. The legend distinguishes three node types: multiply, addition, and no operation.]
(b) (10 points, dynamic range) If we want to avoid any truncation error due to finite register
length, in the worst case, how many bits, as a function of n, will be required to store each
intermediate or final result without incurring any rounding error or overflow? For
convenience, you may assume n = 8. Hint: To do this part, you may scale the five
constant multipliers a1 to a5 by 128 so that they are all represented with 16-bit integers.
Note that their values are known. For example, multiplying an m-bit
integer x by the scaled (by 128) value of a3 will produce no more than m+7 significant
bits.
Answer: Adding 2^m numbers increases the dynamic range by at most m bits.
Multiplying an m-bit number by an n-bit number gives a result of at most (m+n) bits.
We assume {x(i)} and the cosine function values are all represented as n-bit signed
binary numbers. The dynamic range can be recorded using the following table (assume n
= 8), where row k is the variable index and column m is the step number.
k\m    1    2    3    4    5   y(k)
0      9   10   10    -    -   10
1      9   10   10    -    -   10
2      9   10   10   17   18   18
3      9   10   10    -   18   18
4      9   10   10   17   17   19
5      9   10   10   17   18   20
6      9   10   10   18   19   20
7      9    9    9    -   18   19
tmp    -    -   10   16    -    -
(A dash marks an entry left blank in the table.)
Clearly at most 20 bits will be needed to store the final results without any error.
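The bit-growth rules used in this answer can be checked numerically with a small Python sketch based on interval arithmetic. The helper names (signed_bits, add, times_const) are hypothetical, and the multiplier values are the scaled-by-128 constants from the listing.

```python
def signed_bits(lo, hi):
    """Smallest two's complement width that holds every integer in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def add(r1, r2):
    """Range of a + b for a in r1 and b in r2."""
    return r1[0] + r2[0], r1[1] + r2[1]


def times_const(r, c):
    """Range of x * c for x in r and a positive integer constant c."""
    return r[0] * c, r[1] * c


x8 = (-128, 127)    # level-shifted 8-bit input
x10 = (-512, 511)   # any 10-bit signed operand entering step 4

# One addition of two 8-bit values (step 1) needs one extra bit.
step1_bits = signed_bits(*add(x8, x8))

# Worst-case width of a 10-bit operand times each scaled constant.
CONSTS = {"a1": 0x5A, "a2": 0x45, "a3": 0x5A, "a4": 0xA7, "a5": 0x31}
product_bits = {n: signed_bits(*times_const(x10, c)) for n, c in CONSTS.items()}

print("step 1:", step1_bits, "bits")
print("products of a 10-bit operand:", product_bits)
```

Because the constants are known, the product widths come out smaller than the generic (m+n)-bit bound; the widths reported for a 10-bit operand appear to line up with the step-4 and tmp entries of the table.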
(c) (5 points, CC) If the resulting DCT coefficients are to be rounded to the nearest integers, how
many bits are required to represent the results?
Answer: 20 - 7 = 13 bits, since the 7 fractional bits are dropped. In the standard, only 12 bits are used, since it is almost
impossible to find a set of coefficients that produces a result needing all 20 bits to represent
without error.
2. (25 points) This part is independent of parts (a)-(c). Assume that 16 bits are sufficient
to implement the DCT algorithm. Below is a code segment that implements the two-point DCT
operation
\[
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} =
\begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \end{bmatrix}
\]
void fsadct2_mmx (short in[2], short out[2])
{
    static __int64 xstatic1 = 0xA57E5A825A825A82;   // -f0 f0 f0 f0
    static __int64 rounding = 0x0000400000004000;
    __asm {
        mov eax, in                  // in is address of x1, x0
        mov ecx, out                 // out is address of y1, y0
        movd mm0, [eax]              // [mm0] = [xx xx x1 x0]
        pshufw mm1, mm0, 01000100b   // [mm1] = [x1 x0 x1 x0]
        pmaddwd mm1, xstatic1        // [mm1] = [x0*f0-x1*f0 x0*f0+x1*f0]
        paddd mm1, rounding          // do proper rounding
        psrad mm1, 15
        packssdw mm1, mm7
        movd [ecx], mm1              // [mm1] = [xx xx y1 y0]
    }
}
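The two 64-bit constants in this routine can be rebuilt from first principles. The Python sketch below assumes that f0 is 1/√2 in a format with 15 fractional bits (inferred from the shift count in psrad mm1, 15) and that the rounding constant is half of one unit in that format; the variable names are illustrative only.

```python
import math

Q = 15  # assumption: fractional bits, matching the later shift by 15

f0 = round((1 << Q) / math.sqrt(2))   # 2**15 / sqrt(2), rounded to an integer
neg_f0 = (-f0) & 0xFFFF               # 16-bit two's complement pattern of -f0

# Four packed 16-bit words, most significant first: [-f0 f0 f0 f0]
xstatic1 = (neg_f0 << 48) | (f0 << 32) | (f0 << 16) | f0

# Two packed 32-bit lanes, each holding the rounding offset 2**(Q-1)
half = 1 << (Q - 1)
rounding = (half << 32) | half

print(f"f0       = {f0:04X}h")
print(f"-f0      = {neg_f0:04X}h")
print(f"xstatic1 = {xstatic1:016X}h")
print(f"rounding = {rounding:016X}h")
```

Placing -f0 in the most significant word is what lets a single pmaddwd produce the sum x0*f0 + x1*f0 in the low lane and the difference x0*f0 - x1*f0 in the high lane.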
(a) (15 points) Trace this program starting from the line
movd mm0, [eax],
and specify the content of the destination register after each assembly instruction in
hexadecimal format. Assume that x1 = 017A h, x0 = 00AC h, and that all MMX registers
are initially cleared. You may use Intel document no. 24547106, IA-32 Intel
Architecture Software Developer's Manual, Vol. 2 Instruction Set Reference (available under
MMX references on the class web page), for help with the MMX instructions. For the
instruction packssdw, note that figure 3-15 in that manual is incorrect; read the
operation description on the following page instead.
Answer:
movd mm0, [eax]               [mm0] = 0000 0000 017A 00AC h
pshufw mm1, mm0, 01000100b    [mm1] = 017A 00AC 017A 00AC h
pmaddwd mm1, xstatic1         [mm1] = FFB7 2B64 00C2 734C h
paddd mm1, rounding           [mm1] = FFB7 6B64 00C2 B34C h
psrad mm1, 15                 [mm1] = FFFF FF6E 0000 0185 h
packssdw mm1, mm7             [mm1] = 0000 0000 FF6E 0185 h
movd [ecx], mm1               Results stored into memory
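The trace can be reproduced without MMX hardware by emulating each instruction on Python integers. This is a minimal sketch, not the course's code: the helper functions are hypothetical stand-ins named after the instructions they imitate, and they model only the behavior needed here (signed 16-bit multiplies, 32-bit wrap-around adds, arithmetic shifts, signed saturation).

```python
def s16(w):
    """Interpret a 16-bit pattern as a signed value."""
    return w - 0x10000 if w & 0x8000 else w


def s32(d):
    """Interpret a 32-bit pattern as a signed value."""
    return d - 0x100000000 if d & 0x80000000 else d


def lanes(q, bits):
    """Split a 64-bit value into lanes of the given width, least significant first."""
    return [(q >> (bits * i)) & ((1 << bits) - 1) for i in range(64 // bits)]


def pack(vals, bits):
    """Pack values (least significant first) into lanes, wrapping each to its width."""
    out = 0
    for i, v in enumerate(vals):
        out |= (v & ((1 << bits) - 1)) << (bits * i)
    return out


def pshufw(src, order):
    w = lanes(src, 16)
    return pack([w[(order >> (2 * i)) & 3] for i in range(4)], 16)


def pmaddwd(a, b):
    wa = [s16(w) for w in lanes(a, 16)]
    wb = [s16(w) for w in lanes(b, 16)]
    return pack([wa[0] * wb[0] + wa[1] * wb[1], wa[2] * wb[2] + wa[3] * wb[3]], 32)


def paddd(a, b):
    return pack([x + y for x, y in zip(lanes(a, 32), lanes(b, 32))], 32)


def psrad(a, n):
    return pack([s32(d) >> n for d in lanes(a, 32)], 32)


def packssdw(dst, src):
    clamp = lambda v: max(-0x8000, min(0x7FFF, v))
    return pack([clamp(s32(d)) for d in lanes(dst, 32) + lanes(src, 32)], 16)


def fmt(q):
    h = f"{q:016X}"
    return " ".join(h[i:i + 4] for i in range(0, 16, 4))


def trace(x0, x1, mm7=0):
    """Return (instruction, register value) pairs for the routine's MMX steps."""
    xstatic1 = 0xA57E5A825A825A82
    rounding = 0x0000400000004000
    log = []
    mm0 = pack([x0, x1], 16)
    log.append(("movd", mm0))
    mm1 = pshufw(mm0, 0b01000100)
    log.append(("pshufw", mm1))
    mm1 = pmaddwd(mm1, xstatic1)
    log.append(("pmaddwd", mm1))
    mm1 = paddd(mm1, rounding)
    log.append(("paddd", mm1))
    mm1 = psrad(mm1, 15)
    log.append(("psrad", mm1))
    mm1 = packssdw(mm1, mm7)
    log.append(("packssdw", mm1))
    return log


steps = trace(0x00AC, 0x017A)
for name, value in steps:
    print(f"{name:<9} {fmt(value)} h")
```

Run with x0 = 00AC h and x1 = 017A h, the sketch gives the same register contents as the answer above, and the final two words decode to y0 = 389 and y1 = -146.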
(b) (10 points) Explain the purpose of these two lines in the program, and verify that they
correctly do what they are meant to do:
paddd mm1, rounding
psrad mm1, 15


Answer: Note that 5A82h ≈ 2^15 × 1/√2. Hence, the result needs to be right-shifted by
15 bits to compensate for this scaling. Because an arithmetic right shift simply discards the
low-order bits, adding a rounding constant before the shift reduces the truncation error.
4000h = 2^14. Let y be the 32-bit intermediate result
after executing the pmaddwd instruction. By executing the instruction paddd mm1,
rounding, we compute y' = y + 2^14. After the next instruction, psrad mm1, 15, the result
becomes
\[
\lfloor y' / 2^{15} \rfloor = \lfloor (y + 2^{14}) / 2^{15} \rfloor = \lfloor y / 2^{15} + 2^{-1} \rfloor
\]
where ⌊·⌋ is the floor (truncation) operator. Adding 1/2 before taking the floor rounds y/2^15
to the nearest integer, so the two instructions realize the rounding as intended.
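The identity can also be checked numerically. The Python sketch below compares the add-then-shift sequence with an exact evaluation of floor(y/2^15 + 1/2); it relies on Python's >> flooring negative integers, which matches an arithmetic right shift, and it ignores 32-bit overflow. The function names are illustrative.

```python
from fractions import Fraction
import math

SHIFT = 15
HALF = 1 << (SHIFT - 1)   # 4000h = 2**14


def round_shift(y):
    """Model paddd (add 2**14) followed by psrad by 15 on one signed value."""
    return (y + HALF) >> SHIFT


def reference(y):
    """floor(y / 2**15 + 1/2), evaluated exactly with rational arithmetic."""
    return math.floor(Fraction(y, 1 << SHIFT) + Fraction(1, 2))


# The two pmaddwd results from the trace, plus values around the half-way points.
samples = (0x00C2734C, -0x0048D49C, 16383, 16384, -16384, -16385)
for y in samples:
    print(f"y = {y:>9}: shift only = {y >> SHIFT:>5}, "
          f"add then shift = {round_shift(y):>5}, reference = {reference(y):>5}")
```

For the positive trace value the plain shift gives 388 while the rounded result is 389, which is where the rounding constant changes the outcome.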