High Performance Computer Architecture (CS60002)
Mid-Spring Semester 2011-12
OUTLINE OF SOLUTIONS

1. Answer the following. [6+8+4=18]

a. Consider a k-stage synchronous pipeline with stages S1, S2, …, Sk, and the corresponding stage delays τ1, τ2, …, τk respectively. Let tmax and tmin denote the time delays of the longest and shortest logic paths within a stage respectively. If d denotes the width of the pipeline clock pulse, and s the maximum clock skew, derive an expression for the maximum speedup of the pipelined processor over an equivalent non-pipelined processor when processing n sets of data. Clearly state any assumptions you make.

   For the non-pipelined processor, the latch delays between stages do not come into the picture. The total processing time for n sets of data will be:
      n * (τ1 + τ2 + … + τk)
   For the pipelined version, the time will be:
      (k - 1 + n) * τ
   The clock period τ here can be calculated as discussed in class (from the maximum stage delay, tmax and tmin).
   The speedup is the ratio of the two times:
      Speedup = n * (τ1 + τ2 + … + τk) / ((k - 1 + n) * τ)

b. Consider a non-linear multifunction pipeline processor consisting of five stages S1, S2, S3, S4 and S5, and two functions F1 and F2 with the corresponding stage utilizations for one complete computation as follows:
      F1 : S1 S2 S3 (S2 S4) S4 S1 S5     (requires 7 clock cycles)
      F2 : S1 (S2 S4) (S3 S5) S2 S1      (requires 5 clock cycles)
   where (Si Sj) indicates that the stages Si and Sj are being used simultaneously during the same clock cycle. For both the functions, do the following.

   i) Draw the reservation tables, and show the corresponding collision vectors.
      Drawing the reservation tables is trivial. The collision vectors will be:
         F1: (1 0 0 1 1)
         F2: (1 0 1 0)

   ii) Draw the state diagrams showing the permissible state transitions among successive initiations, and compute the corresponding values of Minimum Average Latency (MAL) and Minimum Constant Latency (MCL).
      [State diagrams. F1: single state 10011, with arcs labelled 3, 4 and 6+. F2: states 1010, 1111 and 1011, with arcs labelled 1, 3 and 5+.]

c.
A non-pipelined processor X has a clock frequency of 600 MHz and an average CPI (cycles per instruction) of 4. Processor Y, an improved successor of X, is designed with a 5-stage linear instruction pipeline. However, due to latch delay and clock skew, the clock frequency of Y is only 450 MHz.

   i) If a program containing 100 instructions is executed on both processors, what is the speedup of processor Y compared with that of processor X?
      Time for non-pipelined processor T1 = 4 x 100 / (600 x 10^6) s = 0.667 µs
      Time for pipelined processor T2 = (100 + 5 - 1) / (450 x 10^6) s = 0.231 µs
      Speedup = T1 / T2 = 2.88

   ii) Calculate the MIPS rating of each processor during the execution of this particular program.
      MIPS for non-pipelined processor = 600 / 4 = 150
      MIPS for pipelined processor = 100 x 450 / (100 + 5 - 1) = 432.7 (approximately)

2. Answer the following. [4+8=12]

a. Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30, Speedup2 = 20, Speedup3 = 10. Only one enhancement is usable at a time. If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?

      Speedup = 1 / [ (1 - (F1 + F2 + F3)) + F1/S1 + F2/S2 + F3/S3 ]
      10 = 1 / [ (1 - (0.3 + 0.3 + f)) + 0.3/30 + 0.3/20 + f/10 ]
      0.1 = 0.425 - 0.9 f
      So, f = 0.3611 = 36.11 %

b. Consider the following fragment of C code:
      for (i=0; i<=100; i++) { A[i] = B[i] + C; }
   Assume that A and B are arrays of 64-bit integers, and C and i are 64-bit integers. Assume that all data values and their addresses are initially kept in memory (at addresses 0, 5000, 8000 and 8500 for A, B, C and i respectively). For efficiency of the code generated, we decide to keep the values of C and i, and the addresses of the array variables, in registers. Write the corresponding code for the MIPS64 processor, and compute the following.
            LD    R1,R0(0)       // POINT TO A
            LD    R2,R0(5000)    // POINT TO B
            LD    R3,R0(8000)    // POINT TO C
            LD    R4,R0(8500)    // POINT TO i
            DADDI R5,R0,#101     // LOOP COUNTER
      LOOP: DADD  R7,R2,R4
            LD    R6,R7(0)
            DADD  R8,R6,R3
            DADD  R7,R1,R4
            SD    R8,R7(0)
            DADDI R4,R4,#1
            DSUBI R5,R5,#1
            BNEZ  R5,LOOP

   THERE ARE ERRORS IN THIS CODE. THE ARRAY POINTERS ARE NOT UPDATED CORRECTLY. TRY TO FIND OUT.

   i) The number of instructions executed
      5 + 8 x 101 = 813
   ii) The number of memory data references
      4 + 2 x 101 = 206
   iii) The code size in bytes
      13 x 4 = 52 bytes

3. Answer the following. [(12+3)+(5+5+5)=30]

a. Draw a schematic diagram of the 5-stage integer pipeline for the MIPS64 instruction set, and hence show the micro-operations that are carried out in the five stages. What modifications are required to implement data forwarding?

b. Consider the following code fragment of MIPS64:
      loop: LD    R1, 0(R2)      // R1 = M[0 + R2]
            DADDI R1, R1, #1     // R1 = R1 + 1
            SD    R1, 0(R2)      // M[0 + R2] = R1
            DADDI R2, R2, #8     // R2 = R2 + 8
            DSUB  R4, R3, R2     // R4 = R3 - R2
            BNEZ  R4, loop       // branch if R4 not zero

   i) Show the partial timing diagram (for one loop) of this instruction sequence for the MIPS64 pipeline without any forwarding or bypassing hardware. Assume two register reads and one register write can be carried out in one clock cycle. Also assume that the branch is handled by flushing the pipeline. If there are no structural hazards while accessing memory, how many clock cycles does this loop take to execute?

   ii) Show the partial timing diagram (for one loop) and calculate the number of clock cycles required to execute the entire loop, assuming that the normal forwarding/bypassing hardware has been implemented, and the branch is handled by predicting it as not taken.

   iii) Assume now that the branch is handled with a single-cycle delayed branch (that is, there is one branch delay slot). Try to fill up the branch delay slot by reordering (scheduling) instructions.
Again show the partial timing diagram (for one loop) and calculate the number of clock cycles required to execute the entire loop.

   THE SOLUTION TO THIS QUESTION IS NOT SHOWN. IT WAS COVERED IN CLASS.
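
For Question 1(b), the collision vectors and latency figures can be cross-checked mechanically. The following is a minimal Python sketch, not part of the original solution outline. It assumes the usual textbook procedure: forbidden latencies are the distances between two uses of the same stage, and the next state is the current state shifted right by the chosen latency and ORed with the initial collision vector. All function and variable names are illustrative. The "m+ or more" arc back to the initial state is given its smallest latency, m + 1, when averaging.

```python
from collections import deque
from fractions import Fraction

# Reservation tables from Question 1(b): entry t is the set of stages busy in cycle t.
F1 = [{"S1"}, {"S2"}, {"S3"}, {"S2", "S4"}, {"S4"}, {"S1"}, {"S5"}]
F2 = [{"S1"}, {"S2", "S4"}, {"S3", "S5"}, {"S2"}, {"S1"}]

def forbidden_latencies(table):
    """Distances between any two cycles that use a common stage."""
    return {t2 - t1
            for t1 in range(len(table))
            for t2 in range(t1 + 1, len(table))
            if table[t1] & table[t2]}

def collision_vector(forbidden):
    """Bit string C_m ... C_1, where C_i = 1 if latency i is forbidden."""
    m = max(forbidden)
    return "".join("1" if lat in forbidden else "0" for lat in range(m, 0, -1))

def state_diagram(cv):
    """Map each reachable state to {latency: next_state}; states are ints."""
    m, init = len(cv), int(cv, 2)
    arcs, queue = {}, deque([init])
    while queue:
        state = queue.popleft()
        if state in arcs:
            continue
        arcs[state] = {}
        for lat in range(1, m + 1):
            if not (state >> (lat - 1)) & 1:      # bit C_lat is 0: permissible
                nxt = (state >> lat) | init
                arcs[state][lat] = nxt
                queue.append(nxt)
        arcs[state][m + 1] = init                 # stands for "m+1 or more"
    return arcs

def minimum_average_latency(arcs):
    """Smallest mean latency over all simple cycles of the state diagram."""
    best = None
    def dfs(start, state, total, count, visited):
        nonlocal best
        for lat, nxt in arcs[state].items():
            if nxt == start:
                avg = Fraction(total + lat, count + 1)
                if best is None or avg < best:
                    best = avg
            elif nxt not in visited:
                dfs(start, nxt, total + lat, count + 1, visited | {nxt})
    for s in arcs:
        dfs(s, s, 0, 0, {s})
    return best

def minimum_constant_latency(forbidden):
    """Smallest latency none of whose multiples is forbidden."""
    lat = 1
    while any(f % lat == 0 for f in forbidden):
        lat += 1
    return lat

for name, table in (("F1", F1), ("F2", F2)):
    forb = forbidden_latencies(table)
    cv = collision_vector(forb)
    arcs = state_diagram(cv)
    states = sorted(format(s, "0{}b".format(len(cv))) for s in arcs)
    print(name, "collision vector:", cv, "states:", states,
          "MAL:", minimum_average_latency(arcs),
          "MCL:", minimum_constant_latency(forb))
```

Under these assumptions the sketch reproduces the collision vectors 10011 and 1010 given above and the states 1010, 1111 and 1011 for F2, and it reports MAL = MCL = 3 for both functions.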