CS252 QUIZ #2: 4/18/01
D. A. Patterson

Last Name _______________________    First Name _____________________

Question   Name                    Time (minutes)   Max Points   Your Points
1          David vs. Goliath       30               14
2          That’s Out of Order!    30               16
3          Who Needs Compilers?    50               18
TOTAL                              110              48

CS252 - Quiz #2, Spring 2001    Your last name: ______________________

Question #1: David vs. Goliath (14 points) [30 minutes]

The Intel Pentium III and the Transmeta Crusoe both translate 80x86 instructions into a different instruction set for execution.

a) (4 points) List the following characteristics of each of the internal instruction sets:

                                          Pentium III     Transmeta Crusoe
Registers (approximate number, size)
Instruction (approximate size, style)

b) (2 points) What is the role of interpretation in each machine?
Pentium III:
Transmeta Crusoe:

c) (2 points) What are the methods of translation in each machine?
Pentium III:
Transmeta Crusoe:

d) (2 points) In addition to performance and cost, an increasingly important consideration is power. What is the impact on power of each approach? Why?
Pentium III:
Transmeta Crusoe:

e) (2 points) Which is a better match to multithreading? Why?
Expected:
Pentium III:
Transmeta Crusoe:

Question 2: That’s Out of Order! (16 points) [30 minutes]

Using the MIPS code shown below, show the state of the Reservation stations, Reorder buffers, and floating point (FP) register status for a speculative processor implementing Tomasulo’s algorithm. Assume the following:

- Only one instruction can issue per cycle.
- The reorder buffer has 8 slots.
- The reorder buffer implements the functionality of the load buffers and store buffers.
- All function units are fully pipelined.
- There are 2 floating point multiply reservation stations.
- There are 3 floating point add reservation stations.
- There are 3 integer reservation stations, which also execute load and store instructions.
- No exceptions occur during the execution of this code.
- All integer operations require 1 execution cycle. Memory requests occur and complete in this cycle.
- All FP multiply operations require 4 execution cycles.
- All FP addition operations require 2 execution cycles.
- On a common data bus write conflict, the instruction issued earlier gets priority.
- Execution for a dependent instruction can begin on the cycle after its operand is broadcast on the common data bus.
- If any item changes from “Busy” to “Not Busy”, you should update the “Busy” column to reflect this, but you should not erase any other information in the row (unless another instruction then overwrites that information).
- Assume that all reservation stations, reorder buffers, and functional units were empty and not busy when the code shown below began execution.
- The “Value” column gets updated when the value is broadcast on the common data bus.
- Integer registers are not shown, and you do not have to show their state.

For parts a) and b), fill in the new entry only when the entry value changes; leaving the new column blank means it is unchanged. Use a dash to indicate that the new entry value is empty. For the instruction column in the reorder buffer, use the empty entry for any new instruction.

a) (7 points) Assume the tables below show the old state at the end of the cycle in which ADDI from the code below is issued. Modify the tables to show the new state at the end of the next clock cycle. Assume the execute states for the floating point instructions MULT.D F3, F1, F11 and MULT.D F4, F1, F10 are at the first and second cycle of the execution stage, respectively. (In case you mess up this version, there is an extra copy on the next page.)
    L.D    F0, 0(R1)
    MULT.D F2, F0, F12
    ADD.D  F0, F2, F1
    MULT.D F3, F1, F11
    MULT.D F4, F1, F10
    ADDI   R3, R3, 1
    SUBI   R1, R1, 8

In each table below, the values shown are the "old" entries; every field also has a blank "new" column to fill in.

Reservation stations

Name    Busy   Op       Vj    Vk    Qj    Qk    ROB dest
Add1    Y      ADD.D    F2    F1                #3
Add2
Add3
Mult1   Y      MULT.D   F1    F10               #5
Mult2   Y      MULT.D   F1    F11               #4
Int1    N      L.D      F0    R1                #1
Int2    Y      ADDI     R3    1                 #6
Int3

Reorder buffer

Entry   Busy   Instruction           State     Destination   Value
1       N      L.D F0, 0(R1)         Commit    F0            Mem[0(R1)]
2       N      MULT.D F2, F0, F12    Commit    F2            F0*F12
3       Y      ADD.D F0, F2, F1      Write     F0            F2+F1
4       Y      MULT.D F3, F1, F11    Execute   F3
5       Y      MULT.D F4, F1, F10    Execute   F4
6       Y      ADDI R3, R3, 1        Issue     R3
7
8

FP register status

Field       F0    F1    F2    F3    F4    F5    …     F6
Reorder #   3           2     4     5
Busy        Y     N     N     Y     Y     N     N     N

b) (9 points) The tables below, for a different program, show the state at the end of the cycle in which the S.D from the code below is issued. Modify the tables to show state at the end of the next three clock cycles. Assume the execute states for the floating point instructions for MULT.D F2, F1, F11 and MULT.D F0, F0, F10 are at the end of fourth and third cycle of the execution stage, respectively. (There is an extra copy on the next page.)
    L.D    F0, 0(R1)
    MULT.D F2, F1, F12
    ADD.D  F0, F2, F0
    MULT.D F2, F1, F11
    MULT.D F0, F0, F10
    ADD.D  F0, F0, F2
    S.D    F0, 0(R1)
    ADDI   R1, R1, 8
    SUBI   R2, R2, 1

In each table below, the values shown are the "old" entries; every field also has a blank "new" column to fill in.

Reservation stations

Name    Busy   Op       Vj    Vk    Qj    Qk    ROB dest
Add1    N      ADD.D    F2    F0                #3
Add2    Y      ADD.D                #5    #4    #6
Add3
Mult1   Y      MULT.D   F0    F10               #5
Mult2   Y      MULT.D   F1    F11               #4
Int1    N      L.D      R1    0                 #1
Int2    Y      S.D      R1    0                 #7
Int3

Reorder buffer

Entry   Busy   Instruction           State     Destination   Value
1       N      L.D F0, 0(R1)         Commit    F0            Mem[0(R1)]
2       N      MULT.D F2, F1, F12    Commit    F2            F1*F12
3       N      ADD.D F0, F2, F0      Commit    F0            F2+F0
4       Y      MULT.D F2, F1, F11    Execute   F2
5       Y      MULT.D F0, F0, F10    Execute   F0
6       Y      ADD.D F0, F0, F2      Issue     F0
7       Y      S.D F0, 0(R1)         Issue     F0
8

FP register status

Field       F0    F1    F2    F3    F10   F11   …     F12
Reorder #   5           4
Busy        Y     N     Y     N     N     N     N     N

Question 3: Who Needs Compilers? (18 points) [50 minutes]

In the following problem, use a simple pipelined RISC architecture with a single branch delay cycle. The architecture has pipelined functional units with the following execution cycles:

1. Floating point op: 3 cycles (7 stages total)
2. Integer op: 1 cycle (5 stages total)

The following table shows the minimum number of intervening cycles between the producer and consumer instructions to avoid stalls. Assume 0 intervening cycles for combinations not listed.

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          2
FP ALU op                      Store and move double      2
Load double                    FP ALU op                  1
Load double                    Store double               0

The following code computes a 3-tap filter. R1 contains the address of the next input to the filter, and the output overwrites the input for the iteration. R2 contains the loop counter. The tap values are contained in F10, F11, and F12.
LOOP: L.D    F0, 0(R1)     #load the filter input for the iteration
      MULT.D F2, F1, F12   #multiply elements
      ADD.D  F0, F2, F0    #add elements
      MULT.D F2, F1, F11
      MOV.D  F1, F0        #move value in F0 to F1
      MULT.D F0, F0, F10
      ADD.D  F0, F0, F2
      S.D    F0, 0(R1)     #store the result
      ADDI   R1, R1, 8     #increment pointer, 8 bytes per DW
      BNEZ   R2, LOOP      #continue till all inputs are processed
      SUBI   R2, R2, 1     #decrement element count

a) (4 points) How many cycles does the current code take for each iteration?

__________ cycles

b) (4 points) Rearrange the code without unrolling to achieve 2 fewer cycles per iteration. You can reorder and drop any line of code, but do not change any line of code. To save writing, just draw arrows on the copy of the code below to show any code movement. Show the execution clock cycle number next to each code line. Assume initialization can be adjusted.

_________ cycles

c) (2 points) Can the original code be optimized with loop unrolling and software pipelining to avoid stalls in the loop due to data dependencies? Why or why not?

Suppose the original code is modified to the following; the MOV.D instruction was removed.

LOOP: L.D    F0, 0(R1)     #load the filter input for the iteration
      MULT.D F2, F1, F12   #multiply elements
      ADD.D  F0, F2, F0    #add elements
      MULT.D F2, F1, F11
      MULT.D F0, F0, F10
      ADD.D  F0, F0, F2
      S.D    F0, 0(R1)     #store the result
      ADDI   R1, R1, 8     #increment pointer, 8 bytes per DW
      BNEZ   R2, LOOP      #continue till all inputs are processed
      SUBI   R2, R2, 1     #decrement element count

d) (2 points) Unroll the original loop twice (so that it contains 3 iterations) and schedule it to avoid stalls. Assume the second iteration has F0 renamed to F3, F1 renamed to F4, and F2 renamed to F5. Assume the third iteration has F0 renamed to F6, F1 renamed to F7, and F2 renamed to F8. Write the code on the next page.
Write the reference number when using any instruction listed below. If you need to use any instruction not listed below, write out the instruction explicitly.

d) (continued) To save writing, just write the instruction number in the table below if the instruction can be used as is from the prior page. If it is not there, write out the new instruction. If you need fewer instructions than there is space for in the table, just leave the rest blank.

Number (if instruction unchanged)    Instruction (if not on prior page)

e) (2 points) What is the effective cycle count per iteration for the unrolled loop, where an iteration refers to an iteration of the original code?

____________ cycles

f) (4 points) For a DSP processor, special instructions are provided to speed up DSP applications such as an n-tap filter. Suppose the following instructions are provided in addition. How can one use them to speed up the original 3-tap filter code at the beginning of the question? Write out the new code below, starting with the version in part a). How many cycles do the DSP instructions save? How does this compare to your answer to part b)?

LP RX, LABEL    Zero-overhead loop: repeats the code segment the number of times specified in register RX. This eliminates the branch delay.
LT FX, RY       Auto-increment load: loads MEM(RY) into register FX, then increments the base register RY to the next element.
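
When checking a Question 3 schedule by hand, it can help to mechanize the latency table. The following is a minimal Python sketch and is not part of the quiz. It assumes a single-issue, in-order pipeline that stalls only on read-after-write dependences within one iteration, and it ignores structural hazards and loop-carried dependences. The names `Instr`, `LATENCY`, `issue_cycles`, and `ORIGINAL_LOOP` are illustrative, and the "kind" labels are an assumed mapping of the opcodes onto the rows of the latency table.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str               # assembly text, for display only
    kind: str               # assumed class: "fp", "load", "store", "move", "int"
    dest: Optional[str]     # register written, if any
    srcs: Tuple[str, ...]   # registers read


# Minimum intervening cycles between producer and consumer, copied from
# the Question 3 latency table. Pairs not listed count as 0.
LATENCY = {
    ("fp", "fp"): 2,
    ("fp", "store"): 2,
    ("fp", "move"): 2,
    ("load", "fp"): 1,
    ("load", "store"): 0,
}


def issue_cycles(code):
    """Return the 1-based issue cycle of each instruction, in program order."""
    last_writer = {}   # register -> (issue cycle, kind) of its latest producer
    cycles = []
    prev = 0
    for ins in code:
        earliest = prev + 1            # in-order, one issue per cycle
        for reg in ins.srcs:
            if reg in last_writer:
                p_cycle, p_kind = last_writer[reg]
                gap = LATENCY.get((p_kind, ins.kind), 0)
                earliest = max(earliest, p_cycle + 1 + gap)
        cycles.append(earliest)
        if ins.dest is not None:
            last_writer[ins.dest] = (earliest, ins.kind)
        prev = earliest
    return cycles


# The original 3-tap filter loop from the start of Question 3.
ORIGINAL_LOOP = [
    Instr("L.D    F0, 0(R1)", "load", "F0", ("R1",)),
    Instr("MULT.D F2, F1, F12", "fp", "F2", ("F1", "F12")),
    Instr("ADD.D  F0, F2, F0", "fp", "F0", ("F2", "F0")),
    Instr("MULT.D F2, F1, F11", "fp", "F2", ("F1", "F11")),
    Instr("MOV.D  F1, F0", "move", "F1", ("F0",)),
    Instr("MULT.D F0, F0, F10", "fp", "F0", ("F0", "F10")),
    Instr("ADD.D  F0, F0, F2", "fp", "F0", ("F0", "F2")),
    Instr("S.D    F0, 0(R1)", "store", None, ("F0", "R1")),
    Instr("ADDI   R1, R1, 8", "int", "R1", ("R1",)),
    Instr("BNEZ   R2, LOOP", "int", None, ("R2",)),
    Instr("SUBI   R2, R2, 1", "int", "R2", ("R2",)),
]

if __name__ == "__main__":
    for cycle, ins in zip(issue_cycles(ORIGINAL_LOOP), ORIGINAL_LOOP):
        print(f"{cycle:2d}  {ins.text}")
```

Because SUBI fills the single branch delay slot, the issue cycle of the last instruction is the per-iteration cycle count under these assumptions. Reordering the list and rerunning the function shows the effect of a proposed schedule for part b).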