15-740/18-740 Computer Architecture Homework 4 Due Monday, October 17, at 12:00 PM 1. The Reorder Buffer The reorder buffer (ROB) plays a key role in ensuring in-order execution semantics are exposed to the programmer despite the underlying hardware possibly executing instructions out of program order. In this problem, you will perform the tasks of the ROB. (a) Say a program consists of the following instruction stream: Assume: EY 0x0040: LD r0 ← [0xABCD] 0x0044: ADD r1 ← r2, r3 • A LD takes 3 cycles to execute and an ADD takes 1 cycle to execute. • The LD and ADD can write their results to the ROB on the same cycle they are ready (i.e., a bypass path for both tag and data values are used in this system). • It takes one cycle for the ROB to write data ready to be committed to architectural state to the architectural register file. • The initial contents of the architectural register file is r0 = 5, r1 = 4, r2 = 3, and r3 = 2. • The value at address 0xABCD is 1. K Fill in the contents of the architectural register file and a ROB over the next six cycles for a processor which can issue 1 instruction per cycle. Here is the initial state to get you started: Cycle 0: Architectural register file r0 5 r1 4 r2 3 r3 2 Valid 0 0 ROB Done - 1 InstructionAddr - ResultRegister - ResultValue - Cycle 1: Architectural register file r0 5 r1 4 r2 3 r3 2 Cycle 2: Architectural register file r0 5 r1 4 r2 3 r3 2 Cycle 3: Architectural register file r0 5 r1 4 r2 3 r3 2 Valid 1 0 Done 0 - InstructionAddr 0x0040 - ResultRegister r0 - ResultValue - ResultRegister r0 r1 ResultValue - ResultRegister r0 r1 ResultValue 5 ROB Valid 1 1 Done 0 0 InstructionAddr 0x0040 0x0044 ROB Valid 1 1 Done 0 1 InstructionAddr 0x0040 0x0044 EY Cycle 4: Architectural register file r0 5 r1 4 r2 3 r3 2 ROB Valid 1 1 Cycle 6: Architectural register file r0 1 r1 5 r2 3 r3 2 Done 1 1 InstructionAddr 0x0040 0x0044 ResultRegister r0 r1 ResultValue 1 5 ResultRegister r1 - ResultValue 5 - ResultRegister - ResultValue - ROB Valid 1 0 K Cycle 5: Architectural register file r0 1 r1 4 r2 3 r3 2 ROB Valid 0 0 Done 1 - InstructionAddr 0x0044 - ROB Done - InstructionAddr - (b) Now assume that on a system without an ROB an exception occurs when the ADD is executed. What will the contents of the architectural register file be at the time of the exception (when it is exposed to the programmer)? Why is this bad? The contents will be the same as the initial state. This is bad because the programmer will not know if the LD executed and stored its value in architectural register r0, making programs harder to debug. 2. Out-of-Order Execution Out-of-order processors make efficient use of their functional units by executing instructions according to the flow of data between them. (a) In class, we learned about a graphical way of showing the data dependencies (edges) of instructions (vertices) using a data flow graph. For the instruction stream below, draw the corresponding data flow graph. 2 ADD MUL ADD DIV ADD r3 r4 r5 r6 r7 ← ← ← ← ← r1, r1, r8, r1, r6, r2 r3 r9 r3 r5 r1 r2 r8 r9 r3 r4 r6 r5 r7 EY (b) Of course, while a data flow graph is a helpful way of visualizing what goes on in an out-of-order processor, real machines execute instructions using multiple pipeline stages. Take the code from above and find the number of cycles it would take to execute on a processor with in-order fetch and out-of-order dispatch. Assume the following: K • It takes one cycle to insert an instruction into a reservation station and another cycle to schedule the instruction from the reservation station to an execution unit and these stages can be pipelined. • Tags are broadcast one cycle before their corresponding operation finishes executing. • A single tag broadcast bus exists and if more than one reservation station entry to the same execution unit is ready to be dispatched, the older reservation station entry will use the bus and the younger ones will have to stall. Similarly, if more than one instruction contends for writing to the ROB or architectural register file during a single cycle, the oldest one receives access and the younger ones must stall. • ADDs take 3 cycles and are pipelined, MULs take 5 cycles and are pipelined, and DIVs take 7 cycles and are pipelined. 18 cycles: ADD MUL ADD DIV ADD 1 2 3 4 5 6 7 8 9 1 0 F D D E E E R W F D D - - E E E E F D D E E E R F D D E E E E F D D - - - 1 1 1 1 1 1 1 1 1 2 3 4 5 6 7 8 E E - R E - W - W E R W - E E E R W (c) How many cycles will the program take to execute on an in-order fetch and in-order dispatch machine, under the same assumptions as above? 20 cycles: ADD MUL ADD DIV ADD 1 2 3 4 5 6 7 8 9 1 0 F D D E E E R W F D D - - E E E E F D - - D E E E F - - D D E E F D D - 1 1 1 1 1 1 1 1 1 2 1 2 3 4 5 6 7 8 9 0 E R E - R E - W - W E E E R W - - - E E E R W 3