1. In the following code fragment, memory reads and writes (LW and SW) take 2 clock cycles, MUL takes 4 clock cycles and ADD takes 1 clock cycle. Assume that A holds the value 3 and B holds the value 4. Assume r1 initially holds the value 0.

      LW  r3, A        ; Load value of variable A
      LW  r2, B        ; Load value of variable B
      MUL r1, r2, r3   ; r1 = r2 * r3
      ADD r1, r1, r2   ; r1 = (r2 * r3) + r2
      SW  r1, C        ; C = (r2 * r3) + r2

   a) On a scalar processor, how many clock cycles does it take to execute this program? What is the final answer stored in C?

      2 + 2 + 4 + 1 + 2 = 11 clock cycles, C = (4 * 3) + 4 = 16

   b) On a superscalar processor with two identical functional units and a fetch-dispatch-retire policy that completely ignores data dependencies, what is the final answer stored in C? How many clock cycles are required to execute the program now?

      CC #   1           2           3               4           5           6
      FU1    LW r3,A     LW r3,A     MUL r1,r2,r3    MUL         MUL         MUL
      FU2    LW r2,B     LW r2,B     ADD r1,r1,r2    SW r1,C     SW r1,C
      r1     0           0           4               4           4           12
      r2                 4           4               4           4           4
      r3                 3           3               3           3           3
      C                                                          4           4

      Takes 6 clock cycles, C = 4. ADD is dispatched together with MUL and reads r1 = 0, so it writes r1 = 0 + 4 = 4, which SW then stores in C. MUL overwrites r1 with 12 only in cycle 6, after SW has finished.

   c) Suggest some ways in which correct execution can generally be enforced on this naïve processor.

      i)  Reorder instructions so that those not dependent on previous computations "go first".
      ii) Pad with the required number of NOP instructions.

2. Suppose that there is a CPU with a 4-stage pipeline (fetch, decode, execute, writeback) with no branch prediction, but with branching decisions becoming available at the end of the decode stage.

   a. Explain what delayed branches are, and why they are necessary in this architecture.

      - With a delayed branch, the instruction(s) following the branch is/are effectively executed before the branch itself takes effect.
      - This is caused by the fetch stage loading the instruction(s) following the branch before a branching decision is made.

   b. How many instructions are executed before the branch is taken?

      - One

        Cycle    1  2  3  4  5  6
        branch   F  D  E  W
        i1          F  D  E  W
        t1             F  D  E  W

        i1, i2 = instructions immediately after branch, t1 = instruction at target

   c. Suppose that we now duplicate the pipeline (i.e. there are now two identical pipelines).
      Discuss how this affects the number of instructions executed before a branch is taken, assuming that there are no data dependencies.

      - The number of instructions executed before the branch is taken increases to 3.

        Cycle    1  2  3  4  5  6
        branch   F  D  E  W
        i1       F  D  E  W
        i2          F  D  E  W
        i3          F  D  E  W
        t1             F  D  E  W
        t2             F  D  E  W

        i1, i2, i3 = instructions immediately after branch, t1, t2 = instructions at target

   d. In a "normal" program with "normal" data dependencies, discuss why delayed branches are bad for efficient execution in superscalar pipelines.

      - There may not be enough independent instructions to insert into the delay slots, so NOPs must be inserted instead.
      - This decreases the efficiency of the pipeline. The effect is three times as severe in the superscalar pipeline as in the scalar pipeline (3 delay slots instead of 1).

3. Consider a processor with 4 architectural registers A, B, C and D and 16 physical registers R0 to R15. A program can have 3 types of dependencies:

   a. Write-write dependency (WWD): This occurs when consecutive instructions write to the same register:
         A = A + 1;
         A = A + B;
   b. Read-write dependency (RWD): This occurs when an instruction writes into a register read by a previous instruction:
         A = B + C;
         B = D + 2;
   c. True data dependency (TDD): This occurs when an instruction depends on the results written by a previous instruction:
         A = B + C;
         D = A + 2;

   Identify the dependencies in this program and describe how the dependencies affect the dispatch of each instruction within a processor that can handle 2 instructions at a time (i.e. you only need to test for dependencies between pairwise adjacent instructions I0 and I1, I1 and I2, I2 and I3, etc.). For example, a true dependency between instructions I0 and I1 will prevent them from being executed together, because I1 would receive the wrong value in register A.

      I0: A = B + 1
      I1: C = A + D
      I2: A = B + 2
      I3: D = A - 1
      I4: C = D + 1
      I5: C = 1

      Dependency               Effect
      TDD between I0 and I1    I1 must wait for I0 to complete.
      RWD between I1 and I2    I2 cannot commit changes to A until I1 has read A.
      TDD between I2 and I3    I3 must wait for I2 to commit results to register A before reading it.
      TDD between I3 and I4    I4 must wait for I3 to commit results to D.
      WWD between I4 and I5    I5 must commit to C only after I4 has done so.

   d. Show how register renaming using R0 to R15 can remove all dependencies except true dependencies.

      Map A in I0 and I1 to R0
      Map A in I2 and I3 to R1
      Map B in I0, I2 to R2
      Map C in I1, I4 to R3
      Map C in I5 to R4
      Map D in I1, I3, I4 to R5

      I0: R0 = R2 + 1
      I1: R3 = R0 + R5
      I2: R1 = R2 + 2
      I3: R5 = R1 - 1
      I4: R3 = R5 + 1
      I5: R4 = 1
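
The pairwise-adjacent dependency test from question 3 can be checked mechanically. The following is a minimal Python sketch, not part of the original solution: the `(destination, [sources])` encoding and the names `ORIGINAL`, `RENAMED` and `adjacent_dependencies` are assumptions introduced purely for illustration. It applies the three definitions (TDD, RWD, WWD) to each adjacent pair, first on the original program and then on the renamed program given in the answer to part d.

```python
# Each instruction is encoded as (destination, [sources]).
# This encoding is a hypothetical choice for illustration only.
ORIGINAL = [
    ("A", ["B"]),         # I0: A = B + 1
    ("C", ["A", "D"]),    # I1: C = A + D
    ("A", ["B"]),         # I2: A = B + 2
    ("D", ["A"]),         # I3: D = A - 1
    ("C", ["D"]),         # I4: C = D + 1
    ("C", []),            # I5: C = 1
]

# The same program after applying the register mapping from part d.
RENAMED = [
    ("R0", ["R2"]),        # I0: R0 = R2 + 1
    ("R3", ["R0", "R5"]),  # I1: R3 = R0 + R5
    ("R1", ["R2"]),        # I2: R1 = R2 + 2
    ("R5", ["R1"]),        # I3: R5 = R1 - 1
    ("R3", ["R5"]),        # I4: R3 = R5 + 1
    ("R4", []),            # I5: R4 = 1
]


def adjacent_dependencies(program):
    """Return (i, i + 1, kind) for each dependency between adjacent instructions."""
    found = []
    for i in range(len(program) - 1):
        first_dst, first_srcs = program[i]
        second_dst, second_srcs = program[i + 1]
        if first_dst in second_srcs:
            # The second instruction reads what the first one writes.
            found.append((i, i + 1, "TDD"))
        if second_dst in first_srcs:
            # The second instruction overwrites a register the first one reads.
            found.append((i, i + 1, "RWD"))
        if first_dst == second_dst:
            # Both instructions write the same register.
            found.append((i, i + 1, "WWD"))
    return found


if __name__ == "__main__":
    for label, program in (("original", ORIGINAL), ("renamed", RENAMED)):
        print(label)
        for i, j, kind in adjacent_dependencies(program):
            print(f"  {kind} between I{i} and I{j}")
```

Under these assumptions, the original program reports the five dependencies listed in the table, while the renamed program reports only the three true dependencies (I0 and I1, I2 and I3, I3 and I4). Note that the sketch, like the question, only compares adjacent instructions; it says nothing about dependencies between instructions that are further apart.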