February 6, 2016
Yu Hen Hu
Department of Electrical and Computer Engineering
University of Wisconsin – Madison

ECE 734 VLSI Array Structures for Digital Signal Processing
Spring 2002
Homework #4 Solution

This homework consists of questions taken from the notes and open-ended questions. You must do the homework by yourself. No collaboration is allowed. Late homework will receive a 5% penalty per day. There are 100 points in total. This homework is worth 10% of your overall grade.

1. (25 points) DCT implementation

Consider the DCT for JPEG/MPEG image compression applications. Two-dimensional, separable DCTs are applied to each 8 by 8 block of the image f(m,n), 0 ≤ m, n ≤ 7, where −128 ≤ f(m,n) ≤ 127 is the value (an 8-bit 2's complement integer) of the (m,n) pixel after a level shift that subtracts 128 from each pixel's value. The 2D DCT coefficients are denoted by F(u,v), 0 ≤ u, v ≤ 7.

In pages 24-26 of the DSP note (posted on the course web page), a fast 1D 8-point DCT algorithm by Arai, Agui, and Nakajima is given. You may download the MATLAB m-file from the course web page to experiment with it yourself. For convenience, the algorithm is listed below, where the constant multipliers a1 to a5 are represented by 16-bit fixed-point 2's complement numbers with 7 fractional binary digits (that is, scaled by 128).

    Input: x(m), m = 0 to 7.

    % a1=0.707, a2=0.541, a3=0.707, a4=1.307, a5=0.383
    % scaled values of multipliers (multiplied by 128) in hexadecimal
    % a1=a3=005Ah, a2=0045h, a4=00A7h, a5=0031h

    % Step 1.
    % x(m,1) = x(m) + x(7-m),       m = 0, 1, 2, 3
    % x(m,1) = x(7-m) - x(m),       m = 4, 5, 6, 7

    % Step 2.
    % x(m,2) = x(m,1) + x(3-m,1),   m = 0, 1
    % x(m,2) = x(3-m,1) - x(m,1),   m = 2, 3
    % x(4,2) = -x(4,1) - x(5,1)
    % x(m,2) = x(m,1) + x(m+1,1),   m = 5, 6
    % x(7,2) = x(7,1)

    % Step 3.
    % x(0,3) = x(0,2) + x(1,2)
    % x(1,3) = x(0,2) - x(1,2)
    % x(2,3) = x(2,2) + x(3,2)
    % x(4,3) = x(4,2) + x(6,2)
    % x(m,3) = x(m,2),              m = 3, 5, 6, 7

    % Step 4.
    % x(m,4) = x(m,3),              m = 0, 1, 3, 7
    % x(2,4) = x(2,3)*a1
    % tmp    = x(4,3)*a5
    % x(4,4) = x(4,3)*a2 + tmp
    % x(5,4) = x(5,3)*a3
    % x(6,4) = x(6,3)*a4 + tmp

    % Step 5.
    % x(m,5) = x(m,4),              m = 0, 1, 4, 6
    % x(2,5) = x(2,4) + x(3,4)
    % x(3,5) = x(3,4) - x(2,4)
    % x(5,5) = x(7,4) + x(5,4)
    % x(7,5) = x(7,4) + x(5,4)

    % Step 6.
    % x(m,6) = x(m,5),              m = 0, 1, 2, 3
    % x(4,6) = x(4,5) + x(7,5)
    % x(5,6) = x(5,5) + x(6,5)
    % x(6,6) = x(5,5) - x(6,5)
    % x(7,6) = x(7,5) + x(4,5)

    Output: y(m) = x(m,6), m = 0 to 7

We will consider a dedicated hardware implementation of this algorithm. We will use four types of components: (i) hardware multipliers, M; (ii) hardware adders, A; (iii) dedicated buses, B; and (iv) registers, R. We assume the eight 8-bit inputs can be made available simultaneously from the input ports if needed. The outputs will be stored in eight output registers. The outputs will not be made available outside this hardware DCT module until all eight outputs are ready.

(a) (10 points) Derive the dependence graph of the given algorithm. Label the input and output variables, as well as the intermediate variables {x(m,i); 0 ≤ m ≤ 7, 1 ≤ i ≤ 6}, provided space permits.

[Figure 1. Dependence graph of the fast DCT algorithm G. Nodes are labeled with the inputs x(0) to x(7), the intermediate variables x(m,i) and tmp, the multipliers a1 to a5, and the outputs y(0) to y(7). The legend distinguishes multiply, addition, and no-operation nodes.]

(b) (10 points, dynamic range) If we want to avoid any truncation error due to finite register length, how many bits, as a function of n, are required in the worst case to store each intermediate or final result without incurring any rounding error or overflow? For convenience, you may assume n = 8.

Hint: To do this part, you may scale the five constant multipliers a1 to a5 by 128 so that they are all represented with 16-bit integers. Note that their values are known. For example, multiplying an m-bit integer x by the scaled (by 128) value of a3 gives a result with no more than m+7 significant bits.

Answer: Addition of $2^m$ numbers increases the dynamic range by at most m bits. Multiplication of an m-bit number by an n-bit number gives a result of at most (m+n) bits. We assume {x(i)} and the cosine functions are all represented with n-bit signed binary numbers. The dynamic range can be recorded in the following table (assuming n = 8), where row k is the data index, column m is the step number, and the last column gives the number of bits of the output y(k):

    k\m    1    2    3    4    5   y(k)
    0      9   10   10    -    -   10
    1      9   10   10    -    -   10
    2      9   10   10   17   18   18
    3      9   10   10    -   18   18
    4      9   10   10   17   17   19
    5      9   10   10   17   18   20
    6      9   10   10   18   19   20
    7      9    9    9    -   18   19
    tmp    -    -   10   16    -    -

Clearly, at most 20 bits are needed to store the final results without any error.

(c) (5 points, CC) If the resulting DCT coefficients are to be rounded to the nearest integers, how many bits are required to represent the results?

Answer: 20 − 7 = 13 bits. In the standard, only 12 bits are used, since it is almost impossible to find a set of coefficients that produces a result needing 20 bits to represent without error.

2. (25 points) This part is independent of parts (a)-(c). Assume that 16 bits are sufficient to implement the DCT algorithm.
Below is a code segment that implements the two-point DCT operation

\[
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix}
=
\begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \end{bmatrix}
\]

    void fsadct2_mmx (short in[2], short out[2]) {
        static __int64 xstatic1 = 0xA57E5A825A825A82;  // -f0 f0 f0 f0
        static __int64 rounding = 0x0000400000004000;
        __asm {
            mov      eax, in                 // in is address of x1, x0
            mov      ecx, out                // out is address of y1, y0
            movd     mm0, [eax]              // [mm0] = [xx xx x1 x0]
            pshufw   mm1, mm0, 01000100b     // [mm1] = [x1 x0 x1 x0]
            pmaddwd  mm1, xstatic1           // [mm1] = [x0*f0-x1*f0 x0*f0+x1*f0]
            paddd    mm1, rounding           // do proper rounding
            psrad    mm1, 15                 // arithmetic right shift of each doubleword by 15 bits
            packssdw mm1, mm7                // pack doublewords into words with signed saturation
            movd     [ecx], mm1              // [mm1] = [xx xx y1 y0]
        }
    }

(a) (15 points) Trace this program starting from the line movd mm0, [eax], and specify the content of the destination register after each assembly instruction in hexadecimal format. Assume that x1 = 017Ah, x0 = 00ACh, and that all MMX registers are initially cleared. You may use the Intel document no. 24547106, IA-32 Intel Architecture Software Developer's Manual, Vol. 2 Instruction Set Reference (available under MMX references on the class web page) to help with the MMX instructions. For the instruction packssdw, note that Figure 3-15 in that manual is incorrect; read the operation description on the following page instead.

Answer:

    movd     mm0, [eax]              [mm0] = 0000 0000 017A 00AC h
    pshufw   mm1, mm0, 01000100b     [mm1] = 017A 00AC 017A 00AC h
    pmaddwd  mm1, xstatic1           [mm1] = FFB7 2B64 00C2 734C h
    paddd    mm1, rounding           [mm1] = FFB7 6B64 00C2 B34C h
    psrad    mm1, 15                 [mm1] = FFFF FF6E 0000 0185 h
    packssdw mm1, mm7                [mm1] = 0000 0000 FF6E 0185 h
    movd     [ecx], mm1              Results stored into memory

(b) (10 points) Explain the purpose of these two lines in the program, and verify that they do what they are meant to do:

    paddd mm1, rounding
    psrad mm1, 15

Answer: Note that 5A82h ≈ $2^{15} \cdot 1/\sqrt{2}$. Hence, the result needs to be right-shifted by 15 bits to compensate for this scaling. Rounding properly when performing this arithmetic shift reduces the truncation error.
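4000h = $2^{14}$. Let y be the 32-bit intermediate result after executing the pmaddwd instruction. By executing the instruction paddd mm1, rounding, we compute y' = y + $2^{14}$. After the next instruction, psrad mm1, 15, the result becomes

\[
\left\lfloor \frac{y'}{2^{15}} \right\rfloor
= \left\lfloor \frac{y + 2^{14}}{2^{15}} \right\rfloor
= \left\lfloor \frac{y}{2^{15}} + 2^{-1} \right\rfloor
\]

where $\lfloor \cdot \rfloor$ is the floor (truncation) operator. Adding 1/2 before taking the floor rounds $y/2^{15}$ to the nearest integer, so the rounding is realized this way.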
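The trace in part (a) and the rounding argument in part (b) can be checked with a small software model. The following Python sketch is an illustration only, not the original routine: the names dct2_trace, to_s16, hex32, and saturate_s16 are hypothetical, and the model assumes that each MMX instruction acts on the two doublewords exactly as described in the answers above. The constants are taken from xstatic1 and rounding in the listing.

```python
# A software model of the MMX sequence in problem 2 (a sketch, not the original
# routine). Function and key names are hypothetical; the constants come from
# xstatic1 and rounding in the listing.

F0 = 0x5A82      # 2**15 / sqrt(2), rounded to an integer
ROUND = 0x4000   # 2**14, half of the 2**15 scale factor


def to_s16(v):
    """Interpret the low 16 bits of v as a signed 16-bit integer."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def hex32(v):
    """Format a signed value as the 8 hex digits of its 32-bit pattern."""
    return format(v & 0xFFFFFFFF, "08X")


def saturate_s16(v):
    """Clamp v to the signed 16-bit range, as packssdw does per doubleword."""
    return max(-0x8000, min(0x7FFF, v))


def dct2_trace(x0, x1):
    """Return the (high, low) doubleword pair after each arithmetic step."""
    x0, x1 = to_s16(x0), to_s16(x1)
    # pmaddwd: low doubleword = x0*f0 + x1*f0, high = x0*f0 - x1*f0.
    # For 16-bit inputs these sums stay inside the signed 32-bit range.
    mad = (x0 * F0 - x1 * F0, x0 * F0 + x1 * F0)
    # paddd: add 2**14 to each doubleword.
    add = tuple(v + ROUND for v in mad)
    # psrad: arithmetic shift; Python's >> also floors negative integers.
    shr = tuple(v >> 15 for v in add)
    # packssdw: saturate each doubleword to a signed word (y1, y0).
    y1, y0 = (saturate_s16(v) for v in shr)
    return {"pmaddwd": mad, "paddd": add, "psrad": shr, "y": (y0, y1)}


if __name__ == "__main__":
    trace = dct2_trace(0x00AC, 0x017A)
    for step in ("pmaddwd", "paddd", "psrad"):
        hi, lo = trace[step]
        print(step, hex32(hi), hex32(lo))
    print("y0, y1 =", trace["y"])
```

For x0 = 00ACh and x1 = 017Ah, this model gives y0 = 389 and y1 = -146, which agree with the hand trace (0185h and FF6Eh) and with the nearest integers to 550/√2 ≈ 388.9 and −206/√2 ≈ −145.7.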