CS316 Advanced Multi-Core Systems – Autumn 2013-2014 – C. Kozyrakis
HW2 – Superscalar Techniques and Cache Coherence
Due Tuesday 11/12/12 at 5 PM (online on the Submission portal or in the box in front of Gates 303)

Notes:
• Collaboration on this assignment is encouraged (groups of 3 students).
• This HW set provides practice for the exam. Make sure you work on or review ALL problems in this assignment.

Group: Member 1 - _______________   Member 2 - ________________   Member 3 - ________________

Problem 1: Branch Prediction [9 points]

The figure below shows the control flow of a simple program. The CFG is annotated with three different execution trace paths. For each execution trace, circle which branch predictor (bimodal, local, or gshare) will best predict the branching behavior of the given trace. More than one predictor may perform equally well on a particular trace; however, you are to use each of the three predictors exactly once in choosing the best predictors for the three traces. Assume each trace is executed many times and that every node in the CFG is a conditional branch. The branch history register for the local and gshare predictors is limited to 4 bits. Bimodal is a common name for a simple branch history table (BHT). Provide a brief explanation for each answer.

Trace 1 [CFG figure]:
Circle one:   Bimodal   Local   gshare

Trace 2 [CFG figure]:
Circle one:   Bimodal   Local   gshare

Trace 3 [CFG figure]:
Circle one:   Bimodal   Local   gshare

Problem 2: Renaming [6 points]

Consider a MIPS-like instruction set. All instructions are of the form:

    LD  DST, offset(addr)
    SD  SRC, offset(addr)
    OP  DST, SRC1, SRC2

Part A: [3 points]

Computers spend most of their time in loops, so multiple loop iterations are great places to speculatively find more work to keep CPU resources busy. Nothing is ever easy, though; the compiler emitted only one copy of that loop's code, so even though multiple iterations are handling distinct data, they will appear to use the same registers. To keep the register usage of multiple iterations from colliding, we rename their registers. The following code segment shows an example that we would like our hardware to rename.

    Loop: LD    F2, 0(Rx)
    I0:   MULTD F5, F0, F2
    I1:   DIVD  F8, F0, F2
    I2:   LD    F4, 0(Ry)
    I3:   ADDD  F6, F0, F4
    I4:   ADDD  F10, F8, F2
    I5:   SD    F4, 0(Ry)

A compiler could have simply unrolled the loop and used different registers to avoid conflicts, but if we expect our hardware to unroll the loop, it must also do the register renaming. How? Assume your hardware has a pool of temporary registers (call them T registers, and assume there are 64 of them, T0 through T63) that it can substitute for the registers designated by the compiler. The rename hardware is indexed by the source register designation, and the value in the table is the T register of the last destination that targeted that register. (Think of these table values as producers and the src registers as consumers; it doesn't much matter where the producer puts its result as long as its consumers can find it.)

Consider the code sequence above. Every time you see a destination register in the code, substitute the next available T register, beginning with T9. Then update all the src registers accordingly, so that true data dependences are maintained. Show the resulting code. (Hint: see the following sample.)

    I0: LD    T9, 0(Rx)
    I1: MULTD T10, F0, T9
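As a concrete illustration of the renaming mechanism described above, here is a minimal sketch (in Python, using a hypothetical two-instruction sequence that is not the one from this part) that walks a list of three-operand instructions, reads the rename table for each source, and allocates the next free T register for each destination:

    # Minimal sketch of the renaming mechanism described above. The rename
    # table maps an architectural register to the T register of its last
    # producer; every destination is given the next available T register.
    # Only the "OP DST, SRC1, SRC2" form is handled; the instructions below
    # are hypothetical and are not the sequence from Part A.
    def rename(instrs, first_free_t=9):
        table = {}                     # architectural register -> T register
        next_t = first_free_t
        renamed = []
        for op, dst, src1, src2 in instrs:
            # Sources are looked up before the destination is remapped, so
            # true data dependences are preserved.
            s1 = table.get(src1, src1)
            s2 = table.get(src2, src2)
            new_dst = f"T{next_t}"
            next_t += 1
            table[dst] = new_dst       # later consumers of dst now read this T
            renamed.append((op, new_dst, s1, s2))
        return renamed

    # Hypothetical usage: the second instruction consumes the first one's result.
    print(rename([("ADDD", "F2", "F0", "F1"),
                  ("MULTD", "F3", "F2", "F0")]))
    # -> [('ADDD', 'T9', 'F0', 'F1'), ('MULTD', 'T10', 'T9', 'F0')]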
Part B: [3 points]

Part A explored simple register renaming: when the hardware register renamer sees a source register, it substitutes the destination T register of the last instruction to have targeted that source register. When the rename table sees a destination register, it substitutes the next available T register for it. But superscalar designs need to handle multiple instructions per clock cycle at every stage of the machine, including register renaming. A simple scalar processor would therefore look up both src register mappings for each instruction and allocate a new destination mapping each clock cycle. Superscalar processors must be able to do that as well, but they must also ensure that any dest-to-src relationships between concurrent instructions are handled correctly. Consider the following sample code sequence:

    I0: MULTD F5, F0, F2
    I1: ADDD  F9, F5, F4
    I2: ADDD  F5, F5, F2
    I3: DIVD  F2, F9, F0

Assume that we would like to simultaneously rename the first two instructions (2-way superscalar). Further assume that the next two available T registers to be used are known at the beginning of the clock cycle in which these two instructions are being renamed.

Conceptually, what we want is for the first instruction to do its rename table lookups and then update the table per its destination's T register. The second instruction would then do exactly the same thing, and any inter-instruction dependency would thereby be handled correctly. But there is not enough time to write the T register designation into the renaming table and then look it up again for the second instruction, all in the same clock cycle. That register substitution must instead be done live, in parallel with the rename table update. Figure 2.1 shows a circuit diagram, using multiplexers and comparators, that accomplishes the necessary on-the-fly register renaming.

Your task is to show the cycle-by-cycle state of the rename table and the destination/source register mappings for every instruction of the code. Assume the table starts out with every entry equal to its index (T0 = 0, T1 = 1, ...).

[Figure 2.1: Rename table and on-the-fly register substitution logic for superscalar machines. (Note: "src" is source, "dst" is destination.)]

You only need to fill in mappings for registers that have been renamed from their starting values (e.g., no need to write in F60 = T60, but if F60 = T3 that needs to be filled in). Not all fields may be used.

Cycle 0:
    Architectural:   F__   F__   F__   F__   F__   F__   F__
    Machine:         T__   T__   T__   T__   T__   T__   T__
    Instruction I0:  dst = ____   src1 = ____   src2 = ____
    Instruction I1:  dst = ____   src1 = ____   src2 = ____

Cycle 1:
    Architectural:   F__   F__   F__   F__   F__   F__   F__
    Machine:         T__   T__   T__   T__   T__   T__   T__
    Instruction I2:  dst = ____   src1 = ____   src2 = ____
    Instruction I3:  dst = ____   src1 = ____   src2 = ____

Problem 3: Coherence [30 points]

Part A: Single processor coherence [5 points]

A processor such as the PowerPC G3, widely deployed in Apple Macintosh systems, is primarily intended for use in uniprocessor systems, and hence has a very simple MEI cache coherence protocol. MEI is the same as MESI, except that the Shared (S) state is omitted. Identify and discuss one reason why even a uniprocessor design should support cache coherence. Is the MEI protocol of the G3 adequate for this purpose? Why or why not? (Hint: think about Direct Memory Access (DMA).)
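To make the DMA hint concrete, here is a minimal sketch (a hypothetical model, not the actual G3 hardware) of the hazard that arises when a DMA engine reads main memory while the CPU still holds a modified copy of the line in its write-back cache:

    # Minimal sketch of the DMA hazard hinted at above. The CPU's write stays
    # dirty in its cache (state M), so a DMA engine that reads main memory
    # directly, without snooping the cache, observes the stale value.
    memory = {0x100: 0}          # main memory: address -> value
    cache = {}                   # CPU cache: address -> (state, value)

    def cpu_write(addr, value):
        cache[addr] = ("M", value)      # write-back cache: memory is not updated

    def dma_read_no_snoop(addr):
        return memory[addr]             # DMA goes straight to memory

    def dma_read_with_snoop(addr):
        if addr in cache and cache[addr][0] == "M":
            _, value = cache[addr]
            memory[addr] = value        # the cache supplies / writes back the data
            cache[addr] = ("E", value)  # one possible transition; protocol-dependent
        return memory[addr]

    cpu_write(0x100, 42)
    print(dma_read_no_snoop(0x100))     # 0  -> stale data without coherence
    print(dma_read_with_snoop(0x100))   # 42 -> correct once the DMA access is snooped

Whether the G3's MEI states are sufficient to handle such I/O traffic is exactly what this part asks you to argue.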
Part B: MOESIF cache coherence protocol [10 points]

Many modern systems use cache-to-cache transfers as a way to avoid the penalty of going off-chip for a memory access. The MOESIF cache coherence protocol extends the MESI protocol; the semantics of the additional states are as follows:

• The O (Owned) state indicates that the line is shared-dirty: multiple copies may exist, but the other copies are in the S state, and the cache that has the line in the O state is responsible for writing the line back if it is evicted.
• The F (Forwarding) state indicates that the line is shared-clean: multiple copies may exist in the S state, and this cache is responsible for the cache-to-cache transfer on a fill request.

Fill in the table below with the local coherence controller's response and action for every event trigger (s' refers to the next state). If nothing needs to be done, write in "Do nothing." If an event is invalid for a given state, write in "Error."

    Current state s   | Local Read | Local Write | Local Eviction | Bus Read | Bus Write | Bus Upgrade
    Invalid (I)       |            |             |                |          |           |
    Shared (S)        |            |             |                |          |           |
    Forwarding (F)    |            |             |                |          |           |
    Exclusive (E)     |            |             |                |          |           |
    Owned (O)         |            |             |                |          |           |
    Modified (M)      |            |             |                |          |           |

Part C: Snoopy Coherence [5 points]

Assuming a processor frequency of 1 GHz, a target CPI of 2, a level-2 cache miss rate of 1% per instruction, a snoop-based cache-coherent system with 32 processors, and 8-byte address messages (including command and snoop addresses), compute the inbound and outbound snoop bandwidth required at each processor node.

Part D: Memory Consistency [10 points]

Consider a simple multicore processor using a snoopy MSI cache coherence protocol. Each processor has a single, private cache that is direct-mapped with four blocks, each holding two words. The initial cache state of the system is shown below. To simplify the illustration, the cache-address tag contains the full address.

    P0:   Block   Coherency state   Address tag
          B0      I                 100
          B1      S                 108
          B2      M                 110
          B3      I                 118

    P1:   Block   Coherency state   Address tag
          B0      I                 100
          B1      M                 128
          B2      I                 110
          B3      S                 118

    P2:   Block   Coherency state   Address tag
          B0      S                 120
          B1      S                 108
          B2      I                 110
          B3      I                 118

Reads and writes experience stall cycles depending on the state of the cache line:

• CPU read and write hits generate no stall cycles.
• CPU read and write misses generate Nmemory and Ncache stall cycles if satisfied by memory and by another cache, respectively.
• CPU write hits that generate an invalidate incur Ninvalidate stall cycles.
• A write-back of a block, due to either a conflict or another processor's request to an exclusive block, incurs an additional Nwriteback stall cycles.

The exact cycle count for each event is given in the table below:

    Parameter     Cycles
    Nmemory       100
    Ncache        40
    Ninvalidate   15
    Nwriteback    10

Sequential consistency (SC) requires that all reads and writes appear to have executed in some total order. This may require the processor to stall in certain cases before committing a read or write instruction. Consider the following code sequence:

    write A
    read B

where the write A results in a cache miss and the read B results in a cache hit. Under SC, the processor must stall read B until after it can order (and thus perform) write A. Simple implementations of SC will stall the processor until the cache receives the data and can perform the write. Weaker consistency models relax the ordering constraints on reads and writes, reducing the cases in which the processor must stall.
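As an aside on the SC example above, the following minimal sketch shows one possible accounting of the stalls just described; it is not the graded answer, and which miss latency applies (Nmemory or Ncache, plus Nwriteback if a dirty block must be displaced) is exactly what the cases in the table below ask you to determine:

    # One possible accounting, consistent with the description above, for a
    # two-operation sequence in which the first operation misses and the
    # second would hit. The miss latency is a parameter; attributing the wait
    # correctly to each operation is part of the exercise.
    def stalls_before_second_op(model, first_op_miss_latency):
        if model == "SC":
            # SC: the second operation cannot be performed until the earlier
            # miss completes, so the wait shows up as stalls before it.
            return first_op_miss_latency
        if model == "TSO":
            # TSO: a missed write can sit in the write buffer while a later
            # read that hits proceeds, so no stall is charged here.
            return 0
        raise ValueError("unknown consistency model")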
The Total Store Order (TSO, or Processor Order) consistency model requires that all writes appear to occur in a total order but allows a processor's reads to pass its own writes. This allows processors to implement write buffers that hold committed writes that have not yet been ordered with respect to other processors' writes. Reads are allowed to pass (and potentially be satisfied by) the write buffer in TSO, which they could not do under SC.

Assume that one memory operation can be performed per cycle and that operations that hit in the cache or that can be satisfied by the write buffer introduce no stall cycles. Operations that miss incur the latencies stated above. How many stall cycles occur prior to each operation for both the SC and TSO consistency models for the cases listed below? Each write is shown as "write address, value." Show your work; a correct answer without any work shown will receive no credit.

    Instructions            Stall cycles (SC)   Stall cycles (TSO)

    P0: write 110, 80       _______             _______
    P0: read  108           _______             _______

    P0: write 100, 80       _______             _______
    P0: read  108           _______             _______

    P0: write 110, 80       _______             _______
    P0: write 100, 90       _______             _______

    P0: write 100, 80       _______             _______
    P0: write 110, 90       _______             _______

    P0: read  118           _______             _______
    P0: write 110, 80       _______             _______

Problem 4: Instruction Flow and Branch Prediction [30 points]

This problem investigates the effects of branches and control-flow changes on program performance for a scalar pipeline (to keep the focus on branch prediction). Branch penalties increase as the number of pipeline stages between instruction fetch and branch resolution (or condition and target resolution) increases. This effect of pipelined execution drives the need for branch prediction. This problem explores both static branch prediction in Part C and dynamic branch prediction in Part D. For this problem the base machine is a 5-stage pipeline.

[Figure: the 5-stage pipeline without dynamic branch prediction. The stages are Instruction Fetch, Instruction Decode, Execute, Memory Access, and Write Back; the next fetch address is selected between the sequential address and the branch-correction address.]

Execution assumptions:
• Unconditional branches execute in the decode stage.
• Conditional branches execute in the execute stage.
• Effective address calculation is performed in the execute stage.
• All memory access is performed in the memory access stage.
• All necessary forwarding paths exist.
• The register file is read after write.

The fetch address is a choice between the sequential address generation logic and the branch correction logic. If a mispredicted branch is being corrected, the correction address is chosen over the sequential address for the next fetch address.

Part A: Branch Penalties [2 points]

What are the branch penalties for unconditional and conditional branches?

Unconditional ______________     Conditional _______________
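The later parts of this problem ask for total cycle counts and IPC. As a bookkeeping aid only (the penalty values, the fill/drain overhead, and the per-branch outcomes are inputs here, since determining them is part of the exercise), a minimal sketch of the accounting is:

    # Minimal accounting sketch: for a scalar in-order pipeline, total cycles
    # = instructions executed + pipeline fill/drain cycles + control-dependency
    # stall cycles. One penalty value is supplied for every branch whose
    # resolution redirected the fetch stream.
    def total_cycles(instr_count, redirect_penalties, fill_drain_cycles):
        return instr_count + fill_drain_cycles + sum(redirect_penalties)

    def ipc(instr_count, cycles):
        return instr_count / cycles

    # Hypothetical usage: 20 instructions, 4 fill/drain cycles, and three
    # redirections of 2 cycles each would give 20 + 4 + 6 = 30 cycles.
    print(total_cycles(20, [2, 2, 2], 4), ipc(20, 30))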
Part B: No Branch Prediction [4 points]

This part uses the insertion sort program below. An execution trace, i.e., a sequence of executed basic blocks, is provided. A basic block is a group of consecutive instructions that are always executed together in sequence.

Example Code: Insertion Sort

    BB   Line#   Label    Assembly Instruction        Comment
    1    1       main:    addi r2, r0, ListArray      r2 <- ListArray
    1    2                addi r3, r0, ListLength     r3 <- ListLength
    1    3                add  r4, r0, r0             i = 0;
    2    4       loop1:   bge  r4, r3, end            while (i < Length) {
    3    5                add  r5, r4, r0               j = i;
    4    6       loop2:   ble  r5, r0, cont             while (j > 0) {
    5    7                addi r6, r5, -1                 k = j - 1;
    5    8                lw   r7, r5(r2)                 temp1 = ListArray[j];
    5    9                lw   r8, r6(r2)                 temp2 = ListArray[k];
    5    10               bge  r7, r8, cont               if (temp1 >= temp2) break;
    6    11               sw   r8, r5(r2)                 ListArray[j] = temp2;
    6    12               sw   r7, r6(r2)                 ListArray[k] = temp1;
    6    13               addi r5, r5, -1                 j--;
    6    14               ba   loop2                    }
    7    15      cont:    addi r4, r4, 1                i++;
    7    16               ba   loop1                  }
    8    17      end:     lw   r1, (sp)               r1 <- Return Pointer
    8    18               ba   r1

Execution Trace: Sequence of Basic Blocks Executed:

    123  45723  456  456  4723  456  45723  456  456  45728

[Hint: An alternate way to represent the same execution trace above is to use the sequence of branch instructions, both conditional and unconditional (i.e., ba), that are executed.]

1. Fill in the branch execution table with an N for not taken and a T for taken. This table records the execution pattern of each (static) branch instruction. Use the execution trace above.

Branch Execution - Assume No Branch Prediction:

    Branch Instruction No.    Branch Instruction Execution (dynamic executions of each branch)
    (i.e., Line#)             1    2    3    4    5    6    7    8    9    10
    4
    6
    10
    14
    16
    18

Use the branch execution table above to calculate the statistics requested in the following table.

Branch Execution Statistics:

    Branch Instr. No.   Times Executed   Times Taken   Times Not Taken   % Taken   % Not Taken
    4
    6
    10
    14
    16
    18

2. How many cycles does the trace take to execute (include all pipeline fill and drain cycles)? [Hint: you don't need to physically simulate the execution trace; just compute the cycle count.]

3. How many cycles are lost to control dependency stalls?

Part C: Static Branch Prediction [8 points]

Static branch prediction is a compile-time technique for influencing branch execution in order to reduce control dependency stalls. Branch opcodes are supplemented with a static prediction bit that indicates the likely direction of the branch during execution. This is done based on profiling information, such as that gathered in Part B. For this part of Problem 4, new branch opcodes are introduced:

    bget - branch greater than or equal, statically predicted taken
    bgen - branch greater than or equal, statically predicted not-taken
    blet - branch less than or equal, statically predicted taken
    blen - branch less than or equal, statically predicted not-taken

Static branch prediction information is processed in the decode stage of the 5-stage pipeline. When a branch instruction with static predict taken (e.g., bget) is decoded, the machine predicts taken. Conversely, when a branch instruction with static predict not-taken (e.g., bgen) is decoded, the machine predicts not-taken.

1. [6 points] Pretend you are the compiler: rewrite each conditional branch instruction in the original code sequence using the new conditional branch instructions with static branch prediction encoded.

2. [2 points] Assuming the same execution trace, what is the new total cycle count of the modified code sequence incorporating static branch prediction instructions? Also indicate the resulting IPC.

Part D: Dynamic Branch Prediction [16 points]

This part examines the use of a Branch Target Buffer (BTB) for performing dynamic branch prediction on the 5-stage scalar pipeline.
The branch target buffer (BTB) in the 5-stage pipeline contains 4 entries and is direct-mapped. The BTB caches all branch and jump instructions: it stores the branch fetch addresses along with the branch target addresses. If the branch fetch address "hits" in the BTB, the target address in the BTB is used if the branch is predicted taken. The BTB is updated immediately after prediction in the instruction fetch pipe stage, i.e., it is updated speculatively.

The BTB is accessed simultaneously with the instruction cache. If the BTB hits and returns a branch target address and the branch is predicted taken, then the next fetch address is the target address from the BTB. If the branch is predicted not taken or the BTB misses, then the sequential address is the next fetch address.

[Figure: the 5-stage pipeline extended with the BTB. The BTB is accessed in parallel with the instruction cache, and the next fetch address is selected among the BTB target address, the sequential address, and the branch-correction address.]

1. [6 points] Assume each entry of the BTB employs a 2-bit saturating up/down counter (initialized to the state 00) that maintains branch history. The BTB uses the following prediction algorithm: 00 - Not taken, 01 - Not taken, 10 - Taken, 11 - Taken. Fill in the table below with the state of the BTB after each dynamic branch instruction is executed. Use I[2:1] as the address of an instruction to determine which BTB entry is referenced, where I is the instruction number in its binary form (I[2:1] are its 2nd and 3rd bits, respectively). Each BTB entry column records the static branch number (Stat. Br. #), the target instruction number (Tar. Instr. #), and the history bits (Hist. bits); fill in one row per dynamic branch executed (1 through 27).

    Dynamic Branch    BTB Entry #0                         BTB Entry #1                         BTB Entry #2                         BTB Entry #3
    Executed          Stat. Br. #  Tar. Instr. #  Hist.    Stat. Br. #  Tar. Instr. #  Hist.    Stat. Br. #  Tar. Instr. #  Hist.    Stat. Br. #  Tar. Instr. #  Hist.
    1
    2
    ...
    27

2. [4 points] Fill in the table below to evaluate how accurately each branch is predicted. For each static branch (by line number) and each of its dynamic executions (1 through 10), record whether the branch was taken or not, the predicted direction, and whether the prediction was correct (Y/N).

    Branch Instr.                        Branch Instruction Execution
    Number                               1    2    3    4    5    6    7    8    9    10
    4     Taken or Not
          Pred. Direction
          Correct? (Y/N)
    6     Taken or Not
          Pred. Direction
          Correct? (Y/N)
    10    Taken or Not
          Pred. Direction
          Correct? (Y/N)
    14    Taken or Not
          Pred. Direction
          Correct? (Y/N)
    16    Taken or Not
          Pred. Direction
          Correct? (Y/N)
    18    Taken or Not
          Pred. Direction
          Correct? (Y/N)

3. [2 points] Compare and contrast the effectiveness and efficiency of the dynamic branch predictor of Part D with the static branch predictor of Part C.

4. [4 points] Argue whether the Two-Level Adaptive Branch Prediction scheme [Yeh & Patt] can improve branch prediction accuracy for the given execution trace, or for any execution of the insertion-sort routine.
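For question 1 above, the following minimal sketch shows the 2-bit saturating counter arithmetic and the I[2:1] indexing in code form. The replacement policy on a BTB conflict and the exact timing of the (speculative) update are assumptions of this sketch, not part of the problem statement, so follow the text above when filling in the table:

    # Minimal sketch of a 4-entry, direct-mapped BTB whose entries hold the
    # branch instruction number, its target, and a 2-bit saturating counter
    # (00/01 predict not taken, 10/11 predict taken, initialized to 00).
    btb = {}   # entry index -> [branch_instr_no, target_instr_no, counter]

    def entry_index(instr_no):
        return (instr_no >> 1) & 0b11          # I[2:1]: 2nd and 3rd bits

    def predict_taken(instr_no):
        entry = btb.get(entry_index(instr_no))
        return entry is not None and entry[0] == instr_no and entry[2] >= 0b10

    def update(instr_no, target_instr_no, taken):
        idx = entry_index(instr_no)
        entry = btb.get(idx)
        if entry is None or entry[0] != instr_no:
            # Assumed policy: allocate or replace the entry, counter back to 00.
            entry = [instr_no, target_instr_no, 0b00]
            btb[idx] = entry
        # Saturating up/down counter: count up on taken, down on not taken.
        if taken:
            entry[2] = min(entry[2] + 1, 0b11)
        else:
            entry[2] = max(entry[2] - 1, 0b00)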
Problem 5: Instruction Scheduling in the Metaflow Processor [25 points]

This problem will help you understand the details of scheduling and management in a micro-dataflow example. The Metaflow architecture was first implemented in the SPARC Lightning processor to help exploit instruction-level parallelism [Popescu et al., IEEE Micro, 1991]. The architecture utilizes an integrated instruction-shelving (DRIS) structure to execute instructions out of order. To exploit instruction-level parallelism while maintaining consistency, the DRIS forms a unified structure for register renaming, dependency checking, result forwarding, and the reorder buffer (ROB).

In this problem, you will explore the details of the DRIS structure and understand how it functions by tracking the state of the DRIS through a sample code trace. The DRIS entry parameters and operations are explained in the IEEE Micro 1991 paper distributed along with this assignment.

To simplify the problem, we are going to trace the execution of the Metaflow processor abstractly. In step 1, show the state of the DRIS after you have issued the instruction trace (from PC=101 to PC=111) into the DRIS, starting with entry 1. In step 2, show the state of the DRIS after all instructions that are ready to execute at the end of step 1 have been executed and their results forwarded to the appropriate locations in the DRIS. In step 3, show the state of the DRIS and the register file after all instructions that are ready to complete at the end of step 2 have been retired. In the subsequent steps, repeat the operations of steps 2 and 3 until all instructions have retired from the DRIS. You can attach extra tables if necessary.

Note: don't forget to consider the memory dataflow dependences of load and store instructions. Also, assume load and store addresses are in register-indirect format and thus do not require separate address computation. Branch instructions are predicted not taken at first, so execution proceeds down the sequential path speculatively until the branch is resolved. Assume no internal memory bypassing, i.e., loads and stores must go to memory.

    PC=
    101: mul r1, r1, r3       ; r1 <- r1 * r3
    102: add r3, r1, r2       ; r3 <- r1 + r2
    103: sub r4, r4, r2       ; r4 <- r4 - r2
    104: sw  (r4), r1         ; mem[r4] <- r1
    105: mul r1, r2, r4       ; r1 <- r2 * r4
    106: sub r3, r1, r3       ; r3 <- r1 - r3
    107: lw  r1, (r4)         ; r1 <- mem[r4]
    108: bne r1, r4, (115)    ; branch to 115 if r1 != r4
    109: sub r2, r2, r4       ; r2 <- r2 - r4
    110: mul r3, r1, r3       ; r3 <- r1 * r3
    111: add r4, r3, r1       ; r4 <- r3 + r1
         ...
    115: halt

Step 1

    Register File:   r1 = 5   r2 = 3   r3 = 2   r4 = 1

    DRIS (one row per entry, tags 1 through 15):
    Tag | Source Operand 1   | Source Operand 2   | Destination              | Dispatched | Func. Unit | Executed | Program Counter
        | Locked    Reg. ID  | Locked    Reg. ID  | Latest   Reg.   Content  |            |            |          |
    1
    2
    ...
    15

Note: Please use pens of different colors to distinguish the state values that change in each step from those that do not. For each new step, first copy the unchanged state to the new sheet and then make the changes with a different-color pen. Use {add, sub, mul, br, lw, sw} for the function unit column. In this problem, the dispatched column is effectively the same as the executed column and does not need to be filled in.

Step 2

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.

Step 3

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.

Step 4

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.
Step 5

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.

Step 6

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.

Step 7

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.

Step 8

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.

Step 9

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.

Step 10

    Register File:   r1 = __   r2 = __   r3 = __   r4 = __
    DRIS: same columns and tag rows (1 through 15) as in Step 1.
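For reference while filling in the step tables above, here is a minimal sketch of the information one DRIS entry carries, using the column names of the tables; the field semantics spelled out in the comments are this sketch's reading of the Popescu et al. paper, and the readiness check is a simplification (it ignores the memory dependences and branch speculation you are asked to consider):

    # Minimal sketch of a DRIS entry and a simplified issue-time check.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class SourceOperand:
        locked: bool = False      # still waiting on an earlier, unexecuted producer
        reg_id: int = 0           # register (or producer tag) this operand reads

    @dataclass
    class DrisEntry:
        tag: int
        src1: SourceOperand = field(default_factory=SourceOperand)
        src2: SourceOperand = field(default_factory=SourceOperand)
        dest_latest: bool = True            # assumed: youngest write to dest_reg in the DRIS
        dest_reg: int = 0
        dest_content: Optional[int] = None  # result value, filled in once executed
        dispatched: bool = False
        func_unit: str = ""                 # one of: add, sub, mul, br, lw, sw
        executed: bool = False
        pc: int = 0

    def ready_to_execute(entry: DrisEntry) -> bool:
        # Simplification: an entry may issue once neither source operand is locked.
        return (not entry.executed
                and not entry.src1.locked
                and not entry.src2.locked)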