CSCI4250 Exam 1 Solution

Question 1 (4 points). Consider enhancing a computer by adding vector hardware. When a computation is run in vector mode on the vector hardware, it is 100 times faster than in the normal mode. We call the percentage of the original time that could be spent using the vector mode the percentage of vectorization.

1a) What percentage of vectorization is necessary to achieve a speedup of 10? (Just set up the equation to show how to get the answer.)

Let f be the fraction of the original execution time that can run in vector mode.

Speedup = (Old Execution Time) / (New Execution Time)
10 = 1 / (Time in vector mode * 1/100 + Time in normal mode)
10 = 1 / (f/100 + (1 - f))
f = 10/11 (about 90.9%)

1b) What percentage of the enhanced computation time is spent in vector mode if a speedup of 10 is attained? (Just set up the equation to show how to get the answer.)

Percentage time T = (Time in vector mode) / (Total time)
T = (f/100) / ((1 - f) + f/100) = (1/110) / (11/110) = 1/11 (about 9.1%)

Question 2 (6 points). (It is sufficient to derive the formulas, and not necessary to calculate the final numerical results.) Suppose we build an optimizing compiler that discards 50% of the ALU instructions (but cannot reduce other instructions). Assume that the original total instruction count is 3*10^9 and that the original ALU instruction count is 2*10^9. Let the clock rate be 1 GHz, let each ALU instruction take 1 clock cycle, and let each non-ALU instruction take 2 clock cycles.

2a) Calculate the original MIPS rate and execution time.

Execution Time = Instruction Count * Cycles per instruction * Clock cycle time
               = (2*10^9 instructions * 1 cycle/instruction + 1*10^9 instructions * 2 cycles/instruction) / (10^9 cycles/s)
               = 4 s

MIPS rate = Instruction Count / (Execution Time * 10^6) = 3000 / 4 = 750 MIPS

2b) Calculate the new MIPS rate and execution time when we use the optimizing compiler.
Execution Time = Instruction Count * Cycles per instruction * Clock cycle time
               = (1*10^9 instructions * 1 cycle/instruction + 1*10^9 instructions * 2 cycles/instruction) / (10^9 cycles/s)
               = 3 s

MIPS rate = Instruction Count / (Execution Time * 10^6) = 2000 / 3 = approximately 666.7 MIPS

2c) Discuss the results in (a) and (b). Are there any contradictions?

At first, it would seem that a decrease in MIPS rate should go with an increase in execution time, which is clearly not the case here: the MIPS rate falls from 750 to about 667 while the execution time improves from 4 s to 3 s. However, the optimizing compiler eliminated only the most lightweight (1-cycle) instructions, so the instructions that take multiple cycles make up a larger share of what remains and the MIPS rate decreases. This reinforces the idea that MIPS rate is not a reliable indicator of performance.

Question 3 (6 points). Consider a new addressing mode that allows one source operand to be in memory. To reduce complexity, you restrict all memory addressing to be register indirect only (i.e., no displacement). So, “ADD R1, R2, (R3)” adds the contents of register R2 to the contents stored at address R3 in memory. However, “ADD R1, (R2), (R3)” and “ADD R1, R2, 4(R3)” are illegal instructions.

3a) Give an advantage of this new addressing mode.

This can reduce instruction count and execution time when working with arrays, because a value no longer has to be loaded into a register by a separate instruction before it is used. The loop overhead for setting up these computations is also decreased. There are several instances where the new addressing mode performs in 1 instruction an operation that used to take 2 instructions, e.g.,

    LD  R1, 0(R2)
    ADD R3, R3, R1

becomes

    ADD R3, R3, (R2)

3b) Consider “ADD R1, (R2), (R3).” Suggest a simple code sequence of two instructions to simulate this illegal instruction using the new addressing mode and only registers R1, R2, and R3.
    LD  R1, (R2)
    ADD R1, R1, (R3)

3c) Consider “ADD R1, R2, 4(R3).” Suggest a simple code sequence of two instructions to simulate this illegal instruction using the new addressing mode and only registers R1, R2 and R3.

    ADDI R1, R3, #4
    ADD  R1, R2, (R1)

Question 4 (6 points). Consider single precision IEEE 754 representation with a “truncate” policy. This is a 32-bit representation with 1 sign bit, 8 exponent bits encoded using a bias of 127 and the remaining 23 bits used to encode the fractional part of the mantissa.

4a) In class, you learned that the quantity of one-tenth has the binary representation of 0.000110011001100… Give the IEEE representation of two-tenths.

0.2 (base 10) = 2 * 0.1 (base 10)
0.1 (base 10) = 1.10011001100… * 2^-4, thus 0.2 (base 10) = 1.100110011… * 2^-3.
Thus the sign field is 0 and the exponent field is -3 + bias = -3 + 127 = 124. With the fraction truncated to 23 bits, the representation is:

    0 01111100 10011001100110011001100

4b) Give the IEEE representation of two.

2 (base 10) = 1 * 2^1
Thus the sign field is 0 and the exponent field is 1 + bias = 1 + 127 = 128, giving the representation:

    0 10000000 00000000000000000000000

4c) Give the IEEE representation of two and two-tenths, i.e., 2.2.

Align the exponents and add:

0.2 (base 10) = 1.100110011… * 2^-3 = 0.0001100110011… * 2^1
2 (base 10) = 1 * 2^1

Thus 2.2 = 2 + 0.2 = 1.0001100110011… * 2^1, giving the representation:

    0 10000000 00011001100110011001100

Question 5 (6 points). A distinguished computer architect suggested these two factors to improve performance: how fast you can crank up the clock, and how many instructions you need to perform a task.

5a) Suppose that you can double the clock rate of your processor. Explain why you may not see a performance improvement, if other parts of the machine are not changed.

Other parts of the machine may become a bottleneck for execution. In particular, the speed of transferring information from memory becomes critical, as the processor may have to stall while it waits for data to arrive from memory.

5b) You are tempted to define a new instruction for a complex task that occurs frequently.
Suppose the new instruction requires many cycles for execution. Use exceptions to explain why this new instruction may hurt performance.

This becomes very important on machines with out-of-order execution. If an instruction that takes a large number of cycles to complete is put into the pipeline, and other instructions finish and overwrite the operands it reads from, it becomes impossible to have a precise exception; that is, it is no longer possible to restart the instruction and produce the proper result. Preserving precise exceptions then means holding back or buffering the results of later instructions until the long instruction completes, which costs performance.

5c) What is a precise exception for a pipeline?

A precise exception is one in which all of the instructions prior to the faulting instruction have completed and do not need to be restarted, and the state of the machine is such that the faulting instruction and all instructions after it in the pipeline can be restarted once the exception has been handled.

Question 6 (6 points). Consider the simple 5-stage integer pipeline.

6a) What is data forwarding?

Forwarding makes a result available to subsequent instructions as soon as the computation is complete, and allows an instruction to receive this data at the beginning of its EX stage instead of reading it from the register file in ID. Thus, the ALU result and the MEM-stage result held in the pipeline registers are offered as possible source operands to the ALU.

6b) Write a MIPS assembly code to explain how data forwarding may eliminate the data hazard stall cycle.

    ADD R1, R1, 8
    ADD R2, R3, R1

Without forwarding, the new value of R1 is not written until clock cycle 5, and thus the second ADD does not enter EX until cycle 6. With forwarding, the R1 result is piped back to the ALU input and the second ADD is able to enter EX at cycle 4 with no stalls.

6c) Write a MIPS assembly code to show an example where forwarding cannot completely eliminate the data hazard stall cycles.

    LD  R1, 0(R2)
    ADD R1, R1, R1

Here the loaded data is not available until the end of the MEM stage in cycle 4, so it cannot be forwarded to the ADD’s EX stage, which would also fall in cycle 4. The ADD must stall for one cycle, and its EX stage can proceed in cycle 5.
Data cannot be forwarded backwards in time.

Question 7 (6 points). Consider the simple 5-stage integer pipeline, where a branch instruction causes a one-cycle delay.

7a) Construct an example using MIPS assembly code to show how you may schedule the branch delay slot to always eliminate the one-cycle branch delay.

This can be done if there is an independent instruction before the branch instruction. Consider the following loop:

    L1: SUB  R1, R1, -4
        ADD  R3, R3, R1
        ADD  R2, R2, R1
        BNEZ R1, L1

Here either ADD instruction can be scheduled in the branch delay slot, because the branch does not depend on either of them.

7b) There are cases where scheduling the branch delay slot may not always eliminate the one-cycle branch delay. Construct an example using MIPS assembly code to illustrate one such case. What assumption must be made to ensure that the program will work correctly?

If there are no readily available independent instructions, then the instruction at the branch target may be placed in the branch delay slot. The slot then does useful work only when the branch is taken. For this to be viable, it must be possible either to stop execution of this instruction or to remove the effects that the instruction has on the state of the machine when the branch is not taken. For example, convert:

    L1: SUB  R1, R2, R3
        ADD  R5, R5, 4
        ADD  R2, R5, R3
        BNEZ R2, L1

to:

        SUB  R1, R2, R3
    L1: ADD  R5, R5, 4
        ADD  R2, R5, R3
        BNEZ R2, L1
        SUB  R1, R2, R3

where the last SUB is in the branch delay slot.
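
As a cross-check on the numerical answers to Questions 1, 2 and 4, the arithmetic can be reproduced with a short Python sketch. It is not part of the exam solution: the function names (amdahl_speedup, exec_time_and_mips, truncated_single_fields) are illustrative assumptions, and the last function handles only positive normalized values, truncating the fraction to 23 bits as the "truncate" policy in Question 4 requires.

```python
import math
from fractions import Fraction


def amdahl_speedup(f, k):
    """Overall speedup when a fraction f of the original time runs k times faster."""
    return 1 / ((1 - f) + f / k)


# Question 1: vector mode is 100 times faster and f = 10/11.
f = Fraction(10, 11)
speedup = amdahl_speedup(f, 100)
# Share of the enhanced (new) execution time that is spent in vector mode.
vector_share = (f / 100) / ((1 - f) + f / 100)


def exec_time_and_mips(alu_count, other_count, clock_hz=10**9,
                       alu_cpi=1, other_cpi=2):
    """Return (execution time in seconds, MIPS rate) for an instruction mix."""
    cycles = alu_count * alu_cpi + other_count * other_cpi
    seconds = Fraction(cycles, clock_hz)
    mips = Fraction(alu_count + other_count, 10**6) / seconds
    return seconds, mips


# Question 2: 2*10^9 ALU + 1*10^9 other instructions, then half the ALU ones removed.
original = exec_time_and_mips(2 * 10**9, 10**9)
optimized = exec_time_and_mips(10**9, 10**9)


def truncated_single_fields(x):
    """Sign, exponent and fraction bit strings for a positive normalized value,
    with the fraction truncated (not rounded) to 23 bits."""
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    exponent = 0
    while x >= 2:          # scale down until 1 <= x < 2
        x /= 2
        exponent += 1
    while x < 1:           # scale up until 1 <= x < 2
        x *= 2
        exponent -= 1
    fraction = math.floor((x - 1) * 2**23)   # drop bits beyond the 23rd
    return "0", format(exponent + 127, "08b"), format(fraction, "023b")


print(speedup, vector_share)                      # Question 1
print(original, optimized)                        # Question 2
for value in (Fraction(1, 5), Fraction(2), Fraction(11, 5)):
    print(" ".join(truncated_single_fields(value)))   # Question 4
```

Exact rational arithmetic (fractions.Fraction) is used rather than floats so that the truncation in Question 4 is applied to the true value of 0.2 and 2.2 and not to an already-rounded binary approximation.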