Hardware-Based Speculation
• As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing burden.
  => Speculating on the outcome of branches and executing the program as if the guesses were correct.
• Hardware Speculation

CSCE 614 Fall 2009

3 Key Ideas of Hardware Speculation
• Dynamic Branch Prediction
  – Choose which instruction to execute.
• Speculation
  – Allow the execution of instructions before the control dependences are resolved (with the ability to undo the effect of an incorrectly speculated sequence).
• Dynamic Scheduling
  – Deal with the scheduling of different combinations of basic blocks.

Examples
• PowerPC 603/604/G3/G4
• MIPS R10000/12000
• Intel Pentium II/III/4
• Alpha 21264
• AMD K5/K6/Athlon

Hardware Speculation
• Extended Tomasulo’s algorithm
• Additional step (instruction commit) required
• Allow instructions to execute out-of-order but to force them to commit in order.
• Any irrevocable action (updating state or taking an exception) is prevented until an instruction commits.

Reorder Buffer (ROB)
• Holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits.
• Source of operands for instructions
• With speculation, the register file (or memory) is not updated until the instruction commits.

ROB Fields
• Instruction type: indicates whether the instruction is a branch, a store, or a register operation (ALU or Load).
• Destination: supplies the register number (for loads and ALU operations) or the memory address (for stores).
• Value: holds the value of the instruction result until the instruction commits.
• Ready: indicates that the instruction has completed execution, and the value is ready.
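The four ROB fields listed above can be pictured as a small record. The following is a minimal sketch in Python, not a description of any real processor: the names `InstrType`, `ROBEntry`, and `write_result` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    """Instruction type field: branch, store, or register operation."""
    BRANCH = "branch"
    STORE = "store"
    REGISTER = "register"  # ALU operation or load


@dataclass
class ROBEntry:
    """One reorder-buffer entry with the four fields from the slide."""
    instr_type: InstrType
    # Register name for loads/ALU operations, memory address for stores,
    # None for branches (an assumption of this sketch).
    destination: Optional[Union[str, int]]
    value: Optional[float] = None   # result, held until commit
    ready: bool = False             # True once execution has completed

    def write_result(self, value: float) -> None:
        """Record the result and mark the entry ready."""
        self.value = value
        self.ready = True


entry = ROBEntry(InstrType.REGISTER, "F0")
entry.write_result(3.5)
print(entry)
```

Keeping `value` and `ready` separate from the register file is what lets the result be used by later instructions before it becomes architecturally visible.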
Hardware Speculation
[Figure: pipeline stages Issue, Execute, Write result (to ROB), Commit (write to RF, MEM); structures shown: Reorder Buffer (ROB), Reservation Stations]

Basic Structure of MIPS FP Unit
• The ROB completely replaces the store buffer.
• The renaming function of the reservation stations is replaced by the ROB.

4 Steps of Execution
1. Issue (also called “dispatch”)
  - Get an instruction from the instruction queue.
  - Issue the instruction if there is an empty reservation station and an empty slot in the ROB.
  - If either all reservation stations are full or the ROB is full, instruction issue is stalled.
2. Execute
  - If one or more of the operands is not yet available, monitor the CDB while waiting for the register to be computed.
  - This step checks for RAW hazards.
  - When both operands are available at a reservation station, execute the operation.
  - Loads require two steps (effective address calculation and memory access).
  - Stores need one step (effective address calculation).
3. Write Result
  - When the result is available, write it on the CDB and from the CDB into the ROB, as well as to any reservation stations waiting for this result.
  - For stores, if the value to be stored is available, it is written into the Value field of the ROB entry for the store.
4. Commit (also called “completion” or “graduation”)
  - Normal commit: When an instruction reaches the head of the ROB and its result is present in the buffer, the processor updates the register with the result and removes the instruction from the ROB.
  - Store commit: Similar, except that memory is updated.
  - Branch with an incorrect prediction: The speculation is wrong. The ROB is flushed and execution is restarted at the correct successor of the branch.

Example
    L.D    F6, 34(R2)
    L.D    F2, 45(R3)
    MUL.D  F0, F2, F4
    SUB.D  F8, F6, F2
    DIV.D  F10, F0, F6
    ADD.D  F6, F8, F2
Consider the state when the MUL.D is ready to commit.
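The issue, write-result, and commit steps described above can be sketched as a small software model. This is a simplified illustration under stated assumptions, not the actual hardware: reservation stations and the CDB are omitted, at most one instruction commits per call, and the class `ReorderBuffer` and its method names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results may arrive out of order, commit is in order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()   # entries[0] is the head of the ROB
        self.registers = {}      # architectural register file
        self.memory = {}         # architectural memory

    def issue(self, kind, dest=None):
        """Allocate an entry; return None (stall) if the ROB is full."""
        if len(self.entries) >= self.size:
            return None
        entry = {"kind": kind, "dest": dest, "value": None,
                 "ready": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Record a completed result in the ROB entry (not in the RF)."""
        entry["value"] = value
        entry["mispredicted"] = mispredicted
        entry["ready"] = True

    def commit(self):
        """Commit the head entry if it is ready; otherwise do nothing."""
        if not self.entries or not self.entries[0]["ready"]:
            return None
        entry = self.entries.popleft()
        if entry["kind"] == "register":
            self.registers[entry["dest"]] = entry["value"]
        elif entry["kind"] == "store":
            self.memory[entry["dest"]] = entry["value"]
        elif entry["kind"] == "branch" and entry["mispredicted"]:
            self.entries.clear()   # flush all younger speculative entries
        return entry


rob = ReorderBuffer(size=4)
mul = rob.issue("register", "F0")   # stands in for MUL.D
sub = rob.issue("register", "F8")   # stands in for SUB.D
rob.write_result(sub, 2.0)          # SUB.D finishes first
stalled = rob.commit()              # head (MUL.D) not ready: nothing commits
rob.write_result(mul, 6.0)
rob.commit()
rob.commit()
print(rob.registers)
```

In this model the SUB.D result sits in the ROB until the older MUL.D has committed, which is the in-order commit property the steps describe.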
Example (p.109)
    Loop: L.D     F0, 0(R1)
          MUL.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDIU  R1, R1, #-8
          BNE     R1, R2, Loop
Assume that we have issued all the instructions in the loop twice. Assume that L.D and MUL.D from the first iteration have committed and all other instructions have completed execution. Show the contents of the ROB and the FP registers.

Hardware Speculation
• Because neither the register values nor any memory values are actually written until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted.
• Exceptions are handled by not recognizing the exception until it is ready to commit.

Hardware Speculation
• Figure 2.17 (p.113)

Multiple-Issue Processors
• Allow multiple instructions to issue in a clock cycle.
• Ideal CPI < 1
• 3 flavors
  – Statically Scheduled Superscalar
  – Dynamically Scheduled Superscalar
  – VLIW (Very Long Instruction Word)

Superscalar Processors
• Issue varying numbers of instructions per clock
  – statically scheduled
    • using compiler techniques
    • in-order execution
  – dynamically scheduled
    • Tomasulo’s algorithm
    • out-of-order execution

VLIW Processors
• Issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (EPIC: Explicitly Parallel Instruction Computers).
• Statically scheduled by the compiler.
Name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | dynamic | hardware | static | in-order execution | MIPS and ARM (embedded)
Superscalar (dynamic) | dynamic | hardware | dynamic | some out-of-order execution | None
Superscalar (speculative) | dynamic | hardware | dynamic with speculation | out-of-order execution with speculation | Pentium 4, MIPS R12K, Alpha 21264, IBM Power5
VLIW/LIW | static | primarily software | static | all hazards determined by compiler | TI C6x (embedded)
EPIC | mostly static | mostly software | mostly static | all hazards determined by compiler | Itanium

Multiple Instruction Issue with Dynamic Scheduling
• Two-issue dynamically scheduled processor
  – It can issue any pair of instructions if there are reservation stations of the right type available.
  – Extended Tomasulo’s algorithm
Note that Tomasulo’s algorithm (and hardware speculation) is used for both integer operations and FP operations.

• Two approaches to implementing multiple issue
  – Issue one instruction in half a clock cycle, so that two instructions can be processed in one clock cycle.
  – Build the logic necessary to handle two instructions at once, including any possible dependences between the instructions.
• Modern superscalar processors that issue 4 or more instructions per clock often include both approaches.

How to Handle Branches?
• Dynamically scheduled processors
  – Instructions after a branch may be fetched and issued, but not actually executed, until the branch has completed.
  – IBM 360/91
• Processors with hardware speculation
  – Can actually execute instructions based on branch prediction.

• Note that we consider loads and stores, including those to FP registers, as integer operations.
• Assume that FP adds take 3 execution cycles.
• Latency: [Figure: Execute, Write CDB]

• The throughput improvement versus a single-issue pipeline is small.
  – There is only one FP operation per iteration.
  – There is only one integer ALU for both integer ALU operations and effective address calculations.
• Larger improvements would be possible if the processor could execute more integer operations per cycle.

Multiple Issue with Speculation
• We process multiple instructions per clock, assigning reservation stations and reorder buffers to the instructions.
• To maintain throughput of greater than one instruction per cycle, a speculative processor must be able to handle multiple instruction commits per clock cycle.

Example (p.119)
    Loop: LD      R2, 0(R1)
          DADDIU  R2, R2, #1
          SD      R2, 0(R1)
          DADDIU  R1, R1, #8
          BNE     R2, R3, Loop
Consider the execution of the loop on a two-issue processor, once without speculation (dynamic scheduling/Tomasulo’s algorithm) and once with speculation. Assume that there are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation. Assume that there are 2 CDBs. Assume that up to two instructions of any type can commit per clock for a processor with speculation. Show the execution timing of the first three iterations of the loop.

High-Performance Instruction Delivery
• For multiple-issue processors (delivering 4 to 8 instructions per clock cycle)
  – Branch-target buffers
  – Integrated instruction fetch unit
  – Return address prediction

Branch-Target Buffers
• To reduce the branch penalty for the classic 5-stage pipeline, we want to know what address to fetch by the end of IF.
• Branch-target buffer: a branch-prediction cache that stores the predicted address for the next instruction after a branch.
• We access the buffer during the IF stage using the instruction address. (We don’t know what the instruction is.)

Branch-Target Buffers
[Figure: branch-target buffer; one field is marked "Optional. May be used for extra prediction state bits."]
[Figure label: Branch-Target Cache]

Branch-Target Buffers
• We only need to store the predicted-taken branches in the branch-target buffer.
  – Why?
• No branch delay if a branch-prediction entry is found and is correct.

Return Address Predictors
• Predicting indirect jumps (destination address varies at run time)
  – Procedure returns, procedure calls, case, select, etc.
  – SPEC89: 85% of indirect jumps are procedure returns.
• A small buffer of return addresses operating as a stack
  – Caches the most recent return addresses
  – Push a return address on the stack at a call
  – Pop one off at a return

Integrated Instruction Fetch Units
• A separate autonomous unit that feeds instructions to the rest of the pipeline for multiple-issue processors.
• Have several functions
  – Integrated branch prediction
  – Instruction prefetch
  – Instruction memory access and buffering
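The branch-target buffer and the return-address stack described in the slides above can be sketched in software as follows. This is an illustrative model only: a real BTB is a fixed-size tagged cache, whereas this sketch uses an unbounded dictionary, and the class names, the 4-byte fall-through, the stack depth of 8, and the replacement policy on overflow are all assumptions.

```python
class BranchTargetBuffer:
    """Maps the address of a predicted-taken branch to its predicted target."""

    def __init__(self):
        self.targets = {}   # assumption: unbounded, no tags or sets

    def predict(self, pc, fallthrough=4):
        # Looked up during IF using only the instruction address.
        # A miss is treated as "not a taken branch": fetch sequentially.
        return self.targets.get(pc, pc + fallthrough)

    def update(self, pc, taken, target):
        # Only predicted-taken branches are kept in the buffer.
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)


class ReturnAddressStack:
    """Small stack of the most recent return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # assumption: drop the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        # Pop the most recent return address, or None if the stack is empty.
        return self.stack.pop() if self.stack else None


btb = BranchTargetBuffer()
before = btb.predict(0x100)      # miss: predicts the sequential address
btb.update(0x100, taken=True, target=0x40)
after = btb.predict(0x100)       # hit: predicts the stored target

ras = ReturnAddressStack()
ras.call(0x200)
ras.call(0x300)
inner = ras.predict_return()     # most recent call returns first
outer = ras.predict_return()
```

Because a miss already yields the sequential address, a not-taken branch needs no entry, which is one way to read the "Why?" on the slide.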