Chapter 3: ILP

Three Generic Data Hazards
Assume instruction I occurs before instruction J in the program.
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "anti-dependence" by compiler writers. It results from reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages,
  – Reads are always in stage 2, and
  – Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "output dependence" by compiler writers. This also results from reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5

Branch Hazards
• Loop unrolling
• Branch prediction
  – Static
  – Dynamic

Static Branch Prediction
• Scheduling (reordering) code around a delayed branch
• Need to predict the branch statically, at compile time
• Use profile information collected from earlier runs
• The behavior of a branch is often bimodally distributed: biased toward taken or not taken
• Effectiveness depends on the frequency of branches and the accuracy of the scheme
[Chart: misprediction rates of profile-based static prediction on SPEC benchmarks, ranging from about 4% to 22%. Integer benchmarks (compress, eqntott, espresso, gcc, li) have higher branch frequency and higher misprediction rates than the FP benchmarks (doduc, ear, hydro2d, mdljdp, su2cor).]

Dynamic Branch Prediction
• Why does prediction work?
  – The underlying algorithm has regularities
  – The data being operated on has regularities
• Is dynamic branch prediction better than static branch prediction?
  – It seems to be: there is a small number of important branches in programs that have dynamic behavior
• Performance = f(accuracy, cost of misprediction)

1-Bit Branch Prediction
• Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
  – Each bit says whether or not the branch was taken last time
[Figure: a loop `for (i=0; i<100; i++) { ... }` compiled to MIPS; the loop-closing branch `bne r1, r10, L1` at address 0x40010A08 indexes the 1-bit branch history table with its low-order PC bits, and the stored T/NT bit is the prediction.]

1-Bit Bimodal Prediction (SimpleScalar term)
• For each branch, keep track of what happened last time and use that outcome as the prediction
• Changes its mind fast
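A minimal C sketch of this 1-bit scheme (the table size and PC shift are illustrative assumptions, not a specific design):

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_BITS 10                      /* index with 10 low-order PC bits */
    #define BHT_SIZE (1u << BHT_BITS)

    static uint8_t bht[BHT_SIZE];            /* one bit per entry: 1 = taken last time */

    /* Word-aligned PCs drop the bottom 2 bits before indexing. Two branches
       whose PCs share these low-order bits alias to the same entry. */
    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_SIZE - 1);
    }

    bool predict_1bit(uint32_t pc) {
        return bht[bht_index(pc)] != 0;      /* predict whatever happened last time */
    }

    void update_1bit(uint32_t pc, bool taken) {
        bht[bht_index(pc)] = taken ? 1 : 0;  /* "change mind fast" */
    }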
1-Bit Branch Prediction
• What is the drawback of using the lower bits of the PC?
  – Different branches may have the same lower-bit value and alias to the same entry
• What is the performance shortcoming of a 1-bit BHT?
  – For a loop branch, a 1-bit BHT causes two mispredictions per loop execution:
    – at the end of the loop, when it exits instead of looping as before, and
    – the first time through the loop on the next execution of the code, when it predicts exit instead of looping

2-Bit Saturating Up/Down Counter Predictor
• Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions
[Figure: state diagram of the 2-bit branch prediction scheme]

2-Bit Bimodal Prediction (SimpleScalar term)
• For each branch, maintain a 2-bit saturating counter:
  – if the branch is taken: counter = min(3, counter+1)
  – if the branch is not taken: counter = max(0, counter-1)
• If counter >= 2, predict taken; else predict not taken
• Advantage: a few atypical branch outcomes will not flip the prediction (a better measure of "the common case")
• Can easily be extended to N bits (in most processors, N = 2)

Branch History Table
• Misprediction reasons:
  – Wrong guess for that branch
  – Got the branch history of the wrong branch when indexing the table (aliasing)

Branch History Table (4096-entry, 2 bits)
• Branch-intensive benchmarks have a higher miss rate. How can we solve this problem?
  – Increase the buffer size, or
  – Increase the accuracy of the scheme

Branch History Table (increase the size?)
• Growing the table beyond a few thousand entries helps little: we need to focus on increasing the accuracy of the scheme!

Correlated Branch Prediction
• The standard 2-bit predictor uses only local information and fails to look at the global picture
• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch

Correlated Branch Prediction
• A shift register captures the local path through the program
• For each unique path, a predictor is maintained
• The prediction is based on the behavior history of each local path
• The shift register length determines the program region size

Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
• In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters
  – The plain 2-bit BHT is then a (0,2) predictor
• Example of correlated branches:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) do_something();

Correlated Branch Prediction
• Global branch history: an m-bit shift register keeping the taken/not-taken status of the last m branches
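Combining the two ideas above, a hedged C sketch of an (m,n) correlating predictor with 2-bit counters and a global history register; m, the PC bits used, and the index layout are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define M 2                               /* global history length (last M branches) */
    #define PC_BITS 10
    #define TABLE_SIZE (1u << (PC_BITS + M))  /* 2^M tables of 2^PC_BITS counters */

    static uint8_t counters[TABLE_SIZE];      /* 2-bit saturating counters, 0..3 */
    static unsigned ghr;                      /* m-bit global history shift register */

    static unsigned index_of(uint32_t pc) {
        unsigned pc_part = (pc >> 2) & ((1u << PC_BITS) - 1);
        return (ghr << PC_BITS) | pc_part;    /* history selects among the 2^M tables */
    }

    bool predict(uint32_t pc) {
        return counters[index_of(pc)] >= 2;   /* states 2 and 3 => predict taken */
    }

    void update(uint32_t pc, bool taken) {
        uint8_t *c = &counters[index_of(pc)];
        if (taken  && *c < 3) (*c)++;         /* counter = min(3, counter+1) */
        if (!taken && *c > 0) (*c)--;         /* counter = max(0, counter-1) */
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & ((1u << M) - 1);
    }

With M = 0 this degenerates to the plain (0,2) bimodal table from the previous slide.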
Accuracy of Different Schemes
[Chart: frequency of mispredictions on SPEC89 (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) predictor. Rates range from 0% to 18%; the 1024-entry (2,2) correlating predictor outperforms even the unlimited-size 2-bit BHT.]

Tournament Predictors
• A local predictor might work well for some branches or programs, while a global predictor might work well for others
• Provide one of each, and maintain another predictor to identify which predictor is best for each branch
[Figure: the branch PC feeds a local predictor and a global predictor; a tournament predictor controls a multiplexer that selects between their predictions.]

Tournament Predictors
• A multilevel branch predictor: a selector chooses between the global and local predictors of correlating branch prediction
• Uses an n-bit saturating counter to choose between the predictors
• The usual choice is between a global and a local predictor

Tournament Predictors
• The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
• A typical tournament predictor selects the global predictor about 40% of the time for the SPEC integer benchmarks
• The AMD Opteron and Phenom use tournament-style predictors

Tournament Predictors (Intel Core i7)
• Based on the predictors used in the Core Duo chip
• Combines three different predictors:
  – Two-bit
  – Global history
  – Loop exit predictor: uses a counter to predict the exact number of taken branches (the number of loop iterations) for a branch that is detected as a loop branch
• Tournament: tracks the accuracy of each predictor
• Main problem of speculation: a mispredicted branch may lead to another branch being mispredicted!

Branch Prediction
• Sophisticated techniques:
  – A "branch target buffer" to help us look up the destination
  – Correlating predictors that base the prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction is based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  – A "branch delay slot," which the compiler tries to fill with a useful instruction (makes the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!

Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls instruction fetch
• A BTB stores PCs the same way a cache stores tags
• The PC of a branch is sent to the BTB
• When a match is found, the corresponding predicted PC is returned
• If the branch is predicted taken, instruction fetch continues at the returned predicted PC
[Figure: BTB organization]

Pipeline without Branch Predictor
[Figure: during fetch the next PC is simply PC + 4; the branch target is known only after register read and compare.]
• In the 5-stage pipeline, a branch completes in two cycles
• If the branch went the wrong way, one incorrect instruction is fetched
• One stall cycle per incorrect branch

Pipeline with Branch Predictor
[Figure: the branch predictor sits in the fetch stage and supplies the next PC, in parallel with register read, compare, and branch-target computation.]
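A minimal C sketch of the BTB consulted during fetch; the direct-mapped organization and sizes are illustrative assumptions (the A8, discussed later, uses a 2-way set-associative BTB):

    #include <stdint.h>
    #include <stdbool.h>

    #define BTB_SIZE 512

    struct btb_entry {
        bool     valid;
        uint32_t branch_pc;      /* tag: PC of the branch itself */
        uint32_t predicted_pc;   /* target to fetch from if predicted taken */
    };

    static struct btb_entry btb[BTB_SIZE];

    /* Consulted in parallel with instruction fetch: if the current PC hits in
       the BTB and the direction predictor says "taken", fetch is redirected to
       the stored target with no stall; otherwise fetch falls through to PC+4. */
    uint32_t next_fetch_pc(uint32_t pc, bool predict_taken) {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
        if (e->valid && e->branch_pc == pc && predict_taken)
            return e->predicted_pc;
        return pc + 4;
    }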
Dynamic vs. Static ILP
• Static ILP:
  + The compiler finds the parallelism: no extra hardware, so higher clock speeds and lower power
  + The compiler knows what comes next, enabling a better global schedule
  – The compiler cannot react to dynamic events (e.g., cache misses)
  – It cannot reorder instructions unless you provide hardware and extra instructions to detect violations (which eats into the low-complexity/low-power argument)
  – Static branch prediction is poor: even statically scheduled processors use hardware branch predictors

Dynamic Scheduling
• Hardware rearranges instruction execution to reduce stalls
  – while maintaining data flow and exception behavior
• Advantages:
  – Handles cases where dependences are unknown at compile time
  – Simplifies the compiler
  – Allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code
  – Allows code compiled with one pipeline in mind to run efficiently on a different pipeline
• Disadvantage:
  – Complex hardware

Dynamic Scheduling
• Simple pipelining technique:
  – In-order instruction issue and execution
  – If an instruction is stalled, no later instruction can proceed
  – A dependence between two closely spaced instructions leads to a hazard
• What if there are multiple functional units?
  – Units could sit idle
  – If instruction j depends on a long-running instruction i, all instructions after j stall:
        DIV.D F0,F2,F4
        ADD.D F10,F0,F8
        SUB.D F12,F8,F14
  – This can be eliminated by not requiring instructions to execute in order

Idea
• In the classic five-stage pipeline, structural and data hazards could be checked during ID
• What do we need to allow the SUB.D above to execute without waiting for the DIV.D?
• Separate the ID process into two steps:
  – Check for hazards (issue)
  – Decode (read operands)
• In-order instruction issue (in program order)
• Each instruction begins execution as soon as its operands are available
  – Out-of-order execution (and out-of-order completion)

Out-of-Order Complications
• Introduces the possibility of WAW and WAR hazards, which do not exist in the 5-stage pipeline:
        DIV.D F0,F2,F4
        ADD.D F6,F0,F8     ; WAR on F8 with SUB.D; WAW on F6 with MUL.D
        SUB.D F8,F10,F14
        MUL.D F6,F10,F8
• Solution: register renaming

Out-of-Order Complications
• Handling exceptions:
  – Out-of-order completion must preserve exception behavior
  – Exactly those exceptions that would arise if the program were executed in strict program order actually do arise
  – Preserve exception behavior by ensuring that no instruction can generate an exception until the processor knows that the instruction raising the exception will be executed

Splitting the ID Stage
• Instruction fetch: fetch into a register or queue
• Instruction decode:
  – Issue: decode the instruction and check for structural hazards
  – Read operands: wait until there are no data hazards, then read the operands
• Execute:
  – Just as in the 5-stage pipeline's execute stage; may take multiple cycles

Hardware Requirements
• The pipeline must allow multiple instructions to be in the execution stage: multiple functional units
• Instructions pass through the issue stage in order (in-order issue)
• Instructions can be stalled and can bypass each other in the second stage (read operands)
• Instructions enter the execution stage out of order

Dynamic Scheduling using Tomasulo's Method
• A sophisticated scheme to allow out-of-order execution
• The objective is to minimize RAW hazards
  – Introduces register renaming to minimize WAR and WAW hazards
• Many variations of this technique are used in modern processors
• Common features:
  – Tracking instruction dependences so that an instruction executes as soon as (and only when) its operands are available (resolves RAW)
  – Renaming destination registers (eliminates WAR and WAW)
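A hedged C sketch of the renaming idea, using the DIV.D/ADD.D/SUB.D/MUL.D example above. Unlike the slide's minimal S/T renaming, hardware gives every destination a fresh physical register; the rename-map and free-list handling here are illustrative, not a real design:

    #include <stdio.h>

    #define NUM_ARCH 32

    static int rename_map[NUM_ARCH];   /* architectural -> physical register */
    static int next_free = NUM_ARCH;   /* naive free list: never recycled here */

    /* Sources read the CURRENT mapping; the destination gets a FRESH physical
       register, so a later writer of a register no longer conflicts with an
       earlier reader (WAR) or an earlier writer (WAW) of the same name. */
    void rename(int dst, int src1, int src2) {
        int p1 = rename_map[src1];
        int p2 = rename_map[src2];
        rename_map[dst] = next_free++;
        printf("dst F%d -> p%d, srcs p%d p%d\n", dst, rename_map[dst], p1, p2);
    }

    int main(void) {
        for (int i = 0; i < NUM_ARCH; i++) rename_map[i] = i;
        rename(6, 0, 8);    /* ADD.D F6,F0,F8   : F6 gets fresh p32 ("S")   */
        rename(8, 10, 14);  /* SUB.D F8,F10,F14 : fresh p33 breaks the WAR  */
        rename(6, 10, 8);   /* MUL.D F6,F10,F8  : reads p33, fresh p34
                               breaks the WAW on F6                          */
        return 0;
    }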
Dynamic Scheduling using Tomasulo's Method
• Register renaming example (S and T are rename registers):
    Before renaming:             After renaming:
    DIV.D F0,F2,F4               DIV.D F0,F2,F4
    ADD.D F6,F0,F8    (true)     ADD.D S,F0,F8
    S.D   F6,0(R1)               S.D   S,0(R1)
    SUB.D F8,F10,F14  (anti)     SUB.D T,F10,F14
    MUL.D F6,F10,F8   (output)   MUL.D F6,F10,T
  – The true dependence is on F0 (DIV.D -> ADD.D); the anti-dependence is on F8 (ADD.D reads it, SUB.D writes it); the output dependence is on F6 (ADD.D and MUL.D both write it)
  – Finding any later use of F8 (which must now refer to T) requires a sophisticated compiler or hardware

Dynamic Scheduling using Tomasulo's Method
[Figure: Tomasulo organization - a FIFO instruction queue feeding reservation stations in front of the functional units, with results broadcast on a common data bus.]

Steps of an Instruction
1. Issue: get the instruction from the instruction queue
  – If a reservation station is free, issue the instruction and send its operands to the reservation station
  – If an operand is not yet available, track the functional unit that will produce it
    – This step renames registers!
  – Otherwise (no free station): structural hazard, stall

Steps of an Instruction
2. Execute: operate on the operands (EX)
  – When both operands are ready, execute; if not ready, watch the Common Data Bus (CDB) for the result; when both operands are in the reservation station, execute
  – This step checks RAW hazards (and is sometimes itself called "issue")
  – Note: several instructions may become ready for the same functional unit at the same time; for FUs the choice among them is arbitrary
  – Loads and stores are more complicated: they need to maintain program order

Steps of an Instruction
3. Write result: finish execution (WB)
  – Write the result on the Common Data Bus to all awaiting functional units, and mark the reservation station available
  – A load waits for the memory unit to be available
  – A store waits for its operand and then for memory to be available

Reservation Station Fields
• Op: the operation to perform in the unit (e.g., + or -)
• Vj, Vk: the values of the source operands
  – For loads, Vk is used to hold the offset
• Qj, Qk: the reservation stations that will produce the source values
  – Qj, Qk = 0 => the value is already available in Vj, Vk
• Busy: indicates that the reservation station and its FU are busy
• A: holds information for the memory address calculation of a load/store
  – Initially the immediate value; after address calculation, the effective address
• Register result status (Qi): for each register, the number of the reservation station containing the operation whose result should be stored into this register
  – Blank when there is no pending instruction that will write that register
• Refer to D. Patterson's Tomasulo slides for worked examples.
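The field list above translates directly into a data structure; a minimal C sketch (field widths and the 32-register status array are illustrative assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    /* One reservation station entry, following the fields described above. */
    struct rs_entry {
        bool    busy;     /* station (and its FU) in use */
        int     op;       /* operation to perform, e.g. ADD, SUB */
        double  vj, vk;   /* source operand VALUES, once available */
        int     qj, qk;   /* producing RS numbers; 0 => value already in vj/vk */
        int32_t a;        /* load/store: immediate, then effective address */
    };

    /* Register result status (Qi): which RS will write each register
       (0 = none; the register file already holds the value). */
    static int reg_status[32];

    /* An instruction is ready to begin execution when both Q fields are 0. */
    static bool ready(const struct rs_entry *e) {
        return e->busy && e->qj == 0 && e->qk == 0;
    }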
Loop Unrolling
• Determine that loop unrolling is useful by finding that the loop iterations are independent
  – Determine address offsets for the different loads/stores
  – Increases program size
• Use different registers to avoid the unnecessary constraints that would be forced by reusing the same registers for different computations
  – Puts stress on the register supply
• Eliminate the extra test and branch instructions, and adjust the loop termination and iteration code

Loop Unrolling
• If a loop only has dependences within an iteration, the loop is considered parallel: multiple iterations can be executed together so long as order within each iteration is preserved
• If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as "loop-carried"

Example
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }

    for (i=1000; i>0; i=i-1)
        x[i] = x[i-3] + s;         /* S1 */

Example (dependences)
• Loop 1: no dependences
• Loop 2: S2 depends on S1 in the same iteration; S1 depends on S1 from the previous iteration; S2 depends on S2 from the previous iteration
• Loop 3: S1 depends on S2 from the previous iteration
• Loop 4: S1 depends on S1 from 3 iterations earlier
  – Referred to as a recurrence; dependence distance 3 => limited parallelism

Constructing Parallel Loops
• If the loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel:
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];        /* S1: depends on S2 from prev iteration */
        B[i+1] = C[i] + D[i];      /* S2 */
    }
  restructures to:
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];      /* S3 */
        A[i+1] = A[i+1] + B[i+1];  /* S4: depends on S3 of the SAME iteration */
    }
    B[101] = C[100] + D[100];
• Loop unrolling reduces the impact of branches on the pipeline; branch prediction is another way (see the unrolled C sketch below)
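For concreteness, the first loop above unrolled by 4 in C. Because the loop has no loop-carried dependences, the four adds in each unrolled body are independent and can be scheduled to hide functional-unit latency, with one test-and-branch per four elements (1000 is divisible by 4, so no cleanup loop is needed):

    /* Original loop: no loop-carried dependences. */
    void add_scalar(double *x, double s) {
        for (int i = 1000; i > 0; i--)
            x[i] = x[i] + s;
    }

    /* Unrolled by 4: fewer branch instructions, more independent work
       per iteration for the scheduler to overlap. */
    void add_scalar_unrolled(double *x, double s) {
        for (int i = 1000; i > 0; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }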
Dynamic Scheduling using Tomasulo's Method
• Name dependences that cannot be eliminated by a compiler can now be eliminated by hardware
• Register renaming is provided by the reservation stations, which buffer the operands of instructions waiting to issue
• A reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register
• Pending instructions designate the reservation station that will provide their input (register renaming)
• For successive writes to a register, only the last one is used to update it
• There can be more reservation stations than real registers!

Dynamic Scheduling using Tomasulo's Method
• Data structures are attached to:
  – the reservation stations
  – the load/store buffers, which act like reservation stations (they hold data or addresses coming from or going to memory)
  – the register file
• Once an instruction has issued and is waiting for a source operand, it refers to the operand by the number of the reservation station to which the producing instruction has been assigned
  – If the reservation station number is 0, the operand is already in the register file
• There is a 1-cycle latency between source and result: the effective latency between a producing instruction and a consuming instruction is at least 1 cycle longer than the latency of the functional unit producing the result

Dynamic Scheduling using Tomasulo's Method
• Loads and stores:
  – If they access different addresses, they can safely be done out of order
  – Otherwise:
    – Load then Store in program order: WAR
    – Store then Load in program order: RAW
    – Store then Store: WAW
• To detect these hazards:
  – Effective address calculation must be done in program order
  – Check the "A" field in the load/store queue before issuing the load/store instruction (a C sketch of this check appears just before the reorder buffer discussion below)

Dynamic Scheduling using Tomasulo's Method
• Advantage of distributed reservation stations and the CDB:
  – If multiple instructions are waiting for the same result, they are released concurrently by the broadcast of the result on the CDB
  – With a centralized register file, units would have to read the register file when the bus is available
• Advantage of using reservation station numbers instead of register names: eliminates WAR and WAW hazards

Dynamic Scheduling using Tomasulo's Method
• Disadvantages:
  – Complex hardware
  – The reservation stations must be associative (every station compares tags on the CDB)
  – A single CDB is a bottleneck
• Can achieve high performance if branches are predicted accurately
  – Less than 1 cycle per instruction with dual issue!

Hardware-Based Speculation
• Branch prediction reduces the stalls attributable to branches
• For a processor executing multiple instructions per cycle, just predicting the branch is not enough
  – A multiple-issue processor may need to execute a branch every clock cycle
• Exploiting more parallelism requires that we overcome the limitation of control dependence

Hardware-Based Speculation
• Greater ILP: overcome control dependence by speculating in hardware on the outcome of branches and executing the program as if the guesses were correct
  – An extension of branch prediction with dynamic scheduling
  – Speculation fetches, issues, and executes instructions as if the branch predictions were always correct; dynamic scheduling alone only fetches and issues such instructions
• Essentially a data-flow execution model: operations execute as soon as their operands are available

Hardware-Based Speculation
• Three components of hardware-based speculation:
  – Dynamic branch prediction, to choose which instructions to execute
  – Speculation, to allow execution of instructions before control dependences are resolved
    – with the ability to undo the effects of an incorrectly speculated sequence
  – Dynamic scheduling, to deal with the scheduling of different combinations of basic blocks
    – without speculation, dynamic scheduling only partially overlaps basic blocks, because it requires that a branch be resolved before actually executing any instructions in the successor basic block

Hardware-Based Speculation in Tomasulo
• The key idea:
  – allow instructions to execute out of order, but
  – force instructions to commit in order, and
  – prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits
• Hence: we must separate execution from allowing an instruction to finish, or "commit"
  – Instructions may finish execution considerably before they are ready to commit
• This additional step is called instruction commit
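As promised above, a hedged sketch of the load/store-queue "A"-field check. It conservatively assumes every busy entry is an earlier store whose address is either known or still being computed in program order; queue size and the drain discipline are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    /* One store-queue entry; "a" holds the effective address once the
       (program-ordered) address calculation has produced it. */
    struct store_entry {
        bool     busy;
        bool     addr_known;
        uint32_t a;            /* the "A" field: effective address */
    };

    #define SQ_SIZE 8

    /* A load may access memory only if no earlier, still-pending store might
       write the same address. An earlier store whose address is not yet known
       conservatively blocks the load; stores symmetrically check earlier loads
       and stores to avoid WAR and WAW through memory. */
    bool load_may_proceed(const struct store_entry sq[SQ_SIZE],
                          uint32_t load_addr) {
        for (int i = 0; i < SQ_SIZE; i++) {
            if (sq[i].busy && (!sq[i].addr_known || sq[i].a == load_addr))
                return false;  /* possible RAW through memory: wait */
        }
        return true;
    }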
Hardware-Based Speculation in Tomasulo
• When an instruction is no longer speculative, allow it to update the register file or memory
• This requires an additional set of buffers to hold the results of instructions that have finished execution but have not committed: the reorder buffer (ROB)
• The ROB is also used to pass results among instructions that may be speculated

Reorder Buffer
• In Tomasulo's algorithm, once an instruction writes its result, any subsequently issued instruction will find the result in the register file
• With speculation, the register file is not updated until the instruction commits
  – (i.e., until we know definitively that the instruction should execute)
• Thus, the ROB supplies operands in the interval between the completion of instruction execution and instruction commit
  – The ROB is a source of operands for instructions, just as the reservation stations (RS) provide operands in Tomasulo's algorithm
  – The ROB extends the architectural registers, like the RS

Reorder Buffer Structure (four fields)
• Instruction type field: indicates whether the instruction is a branch (no destination result), a store (memory address destination), or a register operation (an ALU operation or load, with a register destination)
• Destination field: supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written
• Value field: holds the value of the instruction result until the instruction commits
• Ready field: indicates that the instruction has completed execution and the value is ready

Reorder Buffer Operation
• Holds instructions in FIFO order, exactly as issued
• When instructions complete, their results are placed into the ROB
  – The ROB supplies operands to other instructions between execution-complete and commit => more registers, like the RS
  – Results are tagged with the ROB entry number instead of the reservation station number
• When instructions commit, the values at the head of the ROB are placed in the registers
• As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure: speculative Tomasulo datapath - FP op queue, reservation stations, FP adders/multipliers, and the register file, with a reorder buffer on the commit path to the registers. Where is the store queue? Its role is folded into the ROB.]

4 Steps of Speculative Tomasulo
1. Issue: get the instruction from the instruction queue
  – If a reservation station and a reorder buffer slot are free, issue the instruction and send its operands (if available, either from the ROB or from the FP registers) to the reservation station
  – Also send the reorder buffer number allocated for the result to the reservation station (to tag the result when it is placed on the CDB)

4 Steps of Speculative Tomasulo
2. Execute: operate on the operands (EX)
  – When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute
  – This step checks RAW hazards
  – Loads still require a 2-step process; instructions may take multiple cycles; for stores, this step is only the effective address calculation (it needs the base register Rs)

4 Steps of Speculative Tomasulo
3. Write result: finish execution (WB)
  – Write the result on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available
  – If the value to be stored is available, it is written into the Value field of the ROB entry for the store
  – If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the Value field of the ROB entry of the store is updated
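The four ROB fields above, as a C sketch; the busy flag, FIFO pointers, and allocation routine are implementation details assumed here, not prescribed by the slides:

    #include <stdint.h>
    #include <stdbool.h>

    enum rob_type { ROB_BRANCH, ROB_STORE, ROB_REGOP };

    struct rob_entry {
        enum rob_type type;   /* branch, store, or register operation */
        uint32_t      dest;   /* register number, or memory address for a store */
        double        value;  /* result, held here until commit */
        bool          ready;  /* execution finished, value is valid */
        bool          busy;   /* entry allocated */
    };

    #define ROB_SIZE 16
    static struct rob_entry rob[ROB_SIZE];
    static int rob_head, rob_tail;  /* issue allocates at tail, commit retires at head */

    /* Issue: allocate the next ROB entry in FIFO order; its index is the tag
       that will identify the result on the CDB. Returns -1 if the ROB is full
       (stall issue until an entry is freed). */
    int rob_allocate(enum rob_type type, uint32_t dest) {
        int slot = rob_tail;
        if (rob[slot].busy) return -1;
        rob[slot] = (struct rob_entry){ .type = type, .dest = dest,
                                        .ready = false, .busy = true };
        rob_tail = (rob_tail + 1) % ROB_SIZE;
        return slot;
    }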
4 Steps of Speculative Tomasulo
4. Commit
  a) When an instruction reaches the head of the ROB and its result is present in the buffer: update the register with the result and remove the instruction from the ROB
  b) Committing a store is similar, except that memory is updated rather than a result register
  c) If a branch with an incorrect prediction reaches the head of the ROB, the speculation was wrong: the ROB is flushed and execution restarts at the correct successor of the branch
  d) If the branch was correctly predicted, the branch is simply finished
• Once an instruction commits, its entry in the ROB is reclaimed. If the ROB fills, we simply stop issuing instructions until an entry is freed

Tomasulo with a Reorder Buffer
[Figure: three snapshots of the speculative Tomasulo datapath as LD F0,16(R2), then ADDD F10,F4,F0, then DIVD F2,F10,F6 issue. Each instruction allocates the next ROB entry in FIFO order (ROB1 = oldest, at the head; ROB7 = newest); the register result status points F0 -> ROB1, F10 -> ROB2, F2 -> ROB3, and dependent reservation stations hold ROB tags as operands: the ADDD waits on ROB1 (R(F4),ROB1) and the DIVD waits on ROB2 (ROB2,R(F6)). No entry is marked Done until its result has been written.]

Avoiding Memory Hazards
• WAW and WAR hazards through memory are eliminated with speculation, because the actual updating of memory occurs in order, when a store is at the head of the ROB; hence, no earlier load or store can still be pending
• RAW hazards through memory are avoided by two restrictions:
  1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the Address field of the load, and
  2. maintaining program order for the computation of the effective address of a load with respect to all earlier stores
• Together these restrictions ensure that any load that accesses a memory location written by an earlier store cannot perform the memory access until the store has written the data
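Before moving on to multiple issue, a sketch tying together commit steps 4(a)-(d). It extends the rob_entry sketch above with two branch-related fields (mispredicted, correct_pc); those names and the stub helpers are assumptions standing in for the rest of the machine:

    #include <stdint.h>
    #include <stdbool.h>

    enum rtype { BRANCH, STORE, REGOP };

    struct entry {
        enum rtype type;
        bool ready, busy, mispredicted;
        uint32_t dest, correct_pc;
        double value;
    };

    #define NROB 16
    static struct entry rob[NROB];
    static int head;
    static double regfile[32];

    /* hypothetical hooks into the rest of the machine */
    static void memory_write(uint32_t addr, double v) { (void)addr; (void)v; }
    static void refetch_from(uint32_t pc) { (void)pc; /* redirect fetch */ }

    /* Step 4: retire the instruction at the head of the ROB, in program order. */
    void try_commit(void) {
        struct entry *e = &rob[head];
        if (!e->busy || !e->ready)
            return;                           /* head not done: nothing commits */
        if (e->type == REGOP)
            regfile[e->dest] = e->value;      /* (a) update architectural register */
        else if (e->type == STORE)
            memory_write(e->dest, e->value);  /* (b) memory updated in order */
        else if (e->mispredicted) {           /* (c) wrong-path branch at head */
            for (int i = 0; i < NROB; i++)    /* flush the entire ROB */
                rob[i].busy = false;
            head = 0;
            refetch_from(e->correct_pc);      /* restart at correct successor */
            return;
        }                                     /* (d) correct branch: just retire */
        e->busy = false;                      /* reclaim the entry */
        head = (head + 1) % NROB;
    }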
Multi-Issue: Getting CPI Below 1
• CPI >= 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
  1. statically scheduled superscalar processors,
  2. dynamically scheduled superscalar processors, and
  3. VLIW (very long instruction word) processors (statically scheduled)
• The two types of superscalar processors issue varying numbers of instructions per clock
  – using in-order execution if they are statically scheduled, or
  – out-of-order execution if they are dynamically scheduled
• VLIW processors, in contrast, issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)

VLIW: Very Long Instruction Word
• Each "instruction" has explicit coding for multiple operations
  – In IA-64, the grouping is called a "packet"
  – In Transmeta, the grouping is called a "molecule" (with "atoms" as the operations)
  – A moderate LIW design was also used in the Cray/Tera MTA-2
• Tradeoff: instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in one long instruction word are independent => they can execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    – at 16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide
  – Needs compiling techniques that schedule across several branches (called "trace scheduling")

Thrice-Unrolled Loop that Eliminates Stalls for Scalar Pipelined Computers
    1  Loop: L.D    F0,0(R1)
    2        L.D    F6,-8(R1)
    3        L.D    F10,-16(R1)
    4        ADD.D  F4,F0,F2
    5        ADD.D  F8,F6,F2
    6        ADD.D  F12,F10,F2
    7        S.D    0(R1),F4
    8        S.D    -8(R1),F8
    9        DSUBUI R1,R1,#24
    10       BNEZ   R1,LOOP
    11       S.D    8(R1),F12    ; 8 - 24 = -16; fills the branch delay slot
• Minimum distances between instruction pairs: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles
• A single branch delay slot follows the BNEZ
• 11 clock cycles, or 3.67 clocks per iteration

Loop Unrolling in VLIW
• Same latencies: L.D to ADD.D: +1 cycle; ADD.D to S.D: +2 cycles
• The loop is unrolled 7 times to avoid the stall delays from ADD.D to S.D:

  Clock | Memory ref 1    | Memory ref 2    | FP operation 1   | FP operation 2   | Int op / branch
    1   | L.D F0,0(R1)    | L.D F6,-8(R1)   |                  |                  |
    2   | L.D F10,-16(R1) | L.D F14,-24(R1) |                  |                  |
    3   | L.D F18,-32(R1) | L.D F22,-40(R1) | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |
    4   | L.D F26,-48(R1) |                 | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |
    5   |                 |                 | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |
    6   | S.D 0(R1),F4    | S.D -8(R1),F8   | ADD.D F28,F26,F2 |                  |
    7   | S.D -16(R1),F12 | S.D -24(R1),F16 |                  |                  | DSUBUI R1,R1,#56
    8   | S.D -32(R1),F20 | S.D -40(R1),F24 |                  |                  | BNEZ R1,LOOP
    9   | S.D 8(R1),F28   |                 |                  |                  |

• 7 results in 9 clocks, or 1.3 clocks per iteration (2.8x: 1.3 vs. 3.67)
• Average: 2.5 operations per clock (23 ops in 45 slots), 51% slot efficiency
• Note: the last store uses offset 8, not -48, because it is scheduled after DSUBUI R1,R1,#56 has already decremented R1
• Note: VLIW needed more registers (15 register pairs, vs. 6 in the superscalar version)
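A hedged sketch of one instruction word for the 5-slot VLIW above: two memory references, two FP operations, and one integer/branch operation, all guaranteed independent by the compiler. The opcode encodings are placeholders, not a real ISA:

    #include <stdint.h>

    #define VLIW_NOP 0u   /* an empty slot is a no-op but still occupies its bits */

    struct vliw_word {
        uint32_t mem1, mem2;    /* memory reference slots */
        uint32_t fp1, fp2;      /* FP operation slots */
        uint32_t int_br;        /* integer ALU / branch slot */
    };

    /* Cycle 7 of the schedule above: two stores, no FP ops, and the DSUBUI.
       0xdeadbeef stands in for real instruction encodings. */
    static const struct vliw_word cycle7 = {
        .mem1   = 0xdeadbeef,   /* S.D -16(R1),F12 */
        .mem2   = 0xdeadbeef,   /* S.D -24(R1),F16 */
        .fp1    = VLIW_NOP,
        .fp2    = VLIW_NOP,
        .int_br = 0xdeadbeef,   /* DSUBUI R1,R1,#56 */
    };

Note that the two no-op slots in this word still cost encoding space, which is the code-size problem discussed next.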
Problems with 1st-Generation VLIW
• Increase in code size
  – Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  – Whenever VLIW instructions are not full, the unused functional units translate into wasted bits in the instruction encoding
• Operated in lock-step, with no hazard-detection hardware
  – A stall in any functional-unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
  – The compiler might predict functional-unit stalls, but cache stalls are hard to predict
• Binary code incompatibility
  – Pure VLIW => different numbers of functional units and different unit latencies require different versions of the code

Multiple-Issue Processors
• Exploiting ILP:
  – Unrolling simple loops
  – More importantly, being able to exploit parallelism in less structured code
• Modern processors combine:
  – Multiple issue
  – Dynamic scheduling
  – Speculation

Multiple-Issue, Dynamic, Speculative Processors
• How do you issue two instructions concurrently?
  – What happens at the reservation stations if two concurrently issued instructions have a true dependence between them?
  – Solution 1: issue the first instruction during the first half of the clock cycle and the second during the second half
    – Problem: this does not scale - can we issue 4 instructions this way?
  – Solution 2: pipeline and widen the issue logic, so that instruction issue takes multiple clock cycles
    – Problem: issue cannot be pipelined indefinitely, since new instructions are issued every clock cycle
    – Each instruction must still be assigned a reservation station, and a dependent instruction issued in the same cycle must be able to refer to the correct reservation stations for its operands (see the sketch below)
• The issue step is the bottleneck in dynamically scheduled superscalars!

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
• IA-64: an instruction set architecture with 64-bit integers
  – 128 64-bit integer registers + 128 82-bit floating-point registers
  – Not separate register files per functional unit, as in old VLIW designs
• Hardware checks dependences (interlocks => binary compatibility over time)
• Itanium was the first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
• Itanium 2 is the name of the 2nd implementation (2005)
  – 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
  – Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

Speculation vs. Dynamic Scheduling
    L.D   F6,32(R2)
    L.D   F2,44(R3)
    MUL.D F0,F2,F4
    SUB.D F8,F6,F2
    DIV.D F10,F0,F6
    ADD.D F6,F8,F2
• With speculation, no instruction after the earliest uncompleted instruction (MUL.D) is allowed to complete (commit). In contrast, with plain dynamic scheduling, the fast instructions (SUB.D and ADD.D) would also have completed.
• The ROB lets the processor dynamically execute code while maintaining a precise interrupt model:
  – If MUL.D caused an interrupt, we could simply wait until it reached the head of the ROB and take the interrupt, flushing any other pending instructions from the ROB. Because instruction commit happens in order, this yields a precise exception.
  – By contrast, in plain Tomasulo's algorithm, the SUB.D and ADD.D would complete before the MUL.D raised the exception; F8 and F6 could already be overwritten, and the interrupt would be imprecise.
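As referenced above, a sketch of dual issue when the second instruction of the pair truly depends on the first: both get tags (reservation station / ROB numbers) in the same cycle, and because the first instruction's destination is renamed before the second instruction's sources are looked up, the dependent instruction captures the first's tag instead of a stale register value. Illustrative only - real issue logic does this with combinational bypassing, not sequential code:

    #include <stdio.h>

    struct uop { int dst, src1, src2; };

    static int reg_status[32];   /* 0 = value is in the register file, else tag */
    static int next_tag = 1;

    void issue_pair(struct uop a, struct uop b) {
        int qa1 = reg_status[a.src1], qa2 = reg_status[a.src2];
        int tag_a = next_tag++;
        reg_status[a.dst] = tag_a;        /* rename a's destination FIRST... */

        int qb1 = reg_status[b.src1], qb2 = reg_status[b.src2];
        int tag_b = next_tag++;           /* ...so b sees tag_a if it reads a.dst */
        reg_status[b.dst] = tag_b;

        printf("a: tag %d waits on (%d,%d); b: tag %d waits on (%d,%d)\n",
               tag_a, qa1, qa2, tag_b, qb1, qb2);
    }

    int main(void) {
        /* MUL r3 <- r1*r2 ; ADD r5 <- r3+r4 : b must wait on a's tag */
        issue_pair((struct uop){3, 1, 2}, (struct uop){5, 3, 4});
        return 0;
    }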
ARM Cortex-A8 and Intel Core i7
• A8: multiple issue; used in the iPad, the Motorola Droid, and iPhones
• i7: multiple issue, high-end, dynamically scheduled, and speculative; used in high-end desktops and servers

ARM Cortex-A8
• Design goal: low power at a reasonably high clock rate
• Dual-issue, statically scheduled superscalar with dynamic issue detection
  – Issues one or two instructions per clock, in order
• 13-stage pipeline, fully bypassed
• Dynamic branch predictor:
  – 512-entry, 2-way set-associative branch target buffer
  – 4K-entry global history buffer
  – If the branch target buffer misses, prediction is made through the global history buffer
  – 8-entry return address stack
• (The i7, by contrast, is an aggressive 4-issue, dynamically scheduled, speculative pipeline)

ARM Cortex-A8
• The basic structure of the A8 pipeline is 13 stages: three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline
• This yields a 13-cycle branch misprediction penalty
• The instruction fetch unit tries to keep the 12-entry instruction queue filled

ARM Cortex-A8 (five-stage instruction decode)
• In the first stage, a PC produced by the fetch unit (either from the branch target buffer or from the PC incrementer) is used to retrieve an 8-byte block from the cache
• Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch
• Once in the decode queue, the scoreboard logic decides when the instructions can issue
• At issue, the register operands are read; recall that in a simple scoreboard the operands always come from the registers
• The register operands and the opcode are then sent to the instruction execution portion of the pipeline
• Multiply operations are always performed in ALU pipeline 0

ARM Cortex-A8 (CPI breakdown)
• Figure 3.39: the estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI
• eon deserves special mention: it does integer-based graphics calculations (ray tracing) and has very few cache misses; it is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck
• The estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1- and L2-generated stalls per instruction, which are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls
• Pipeline stalls include all three hazard types plus minor effects such as way misprediction

ARM Cortex-A8 vs. A9
• A9: issues 2 instructions/clock, with dynamic scheduling and speculation
• Figure 3.40: the performance ratio of the A9 to the A8, both at a 1 GHz clock with the same L1 and L2 cache sizes, shows the A9 is about 1.28 times faster
  – Both runs use a 32 KB primary cache and a 1 MB secondary cache (8-way set associative on the A8, 16-way on the A9); cache block sizes are 64 bytes on the A8 and 32 bytes on the A9
  – As noted for Figure 3.39, eon makes intensive use of integer multiply, and the combination of dynamic scheduling and a faster multiply pipeline significantly improves its performance on the A9
  – twolf experiences a small slowdown, likely because its cache behavior is worse with the A9's smaller L1 block size
Intel Core i7
• The total pipeline depth is 14 stages
• There are 48 load buffers and 32 store buffers
• The six independent functional units can each begin execution of a ready micro-op in the same cycle

Intel Core i7 (instruction fetch)
• Multilevel branch target buffer, plus a return address stack (for function returns)
• Fetches 16 bytes from the instruction cache into a predecode instruction buffer
  – Macro-op fusion: a compare followed by a branch is fused into one instruction
  – The 16 bytes are broken into individual x86 instructions and placed into an 18-entry instruction queue

Intel Core i7 (micro-op decode)
• Translates x86 instructions into micro-ops: simple operations directly executable by the pipeline
• Generates up to 4 micro-ops per cycle, placed into a 28-entry micro-op buffer
• Micro-op buffer features:
  – Loop stream detection: if a small sequence of instructions (fewer than 28) forms a loop, it is issued straight from the buffer, eliminating fetch and decode work
  – Microfusion: fuses load/ALU and ALU/store pairs so that they issue to a single reservation station

Intel Core i7 vs. ARM A8 vs. Intel Atom 230 (45 nm technology)

                 Intel i7 920             ARM A8                    Intel Atom 230
  Cores          4, each with FP          1, no FP                  1, with FP
  Clock rate     2.66 GHz                 1 GHz                     1.66 GHz
  Power          130 W                    2 W                       4 W
  Cache          3-level, all 4-way;      1-level, fully assoc.;    2-level, all 4-way;
                 128 I, 64 D, 512 L2      32 I, 32 D                16 I, 16 D, 64 L2
  Pipeline       4 ops/cycle;             2 ops/cycle; in-order,    2 ops/cycle; in-order,
                 speculative, OOO         dynamic issue             dynamic issue
  Branch pred.   Two-level                Two-level; 512-entry      Two-level
                                          BTB, 4K global history,
                                          8-entry return stack

Intel Core i7 vs. Atom 230
• Figure 3.45: the relative performance and energy efficiency for a set of single-threaded benchmarks shows that the i7 920 is 4 to over 10 times faster than the Atom 230, but about 2 times less power efficient on average!
  – Performance is shown as the i7's speed relative to the Atom; energy is shown as Energy(Atom)/Energy(i7)
  – The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point
  – Data collected by Esmaeilzadeh et al. [2011]; the SPEC benchmarks were compiled with optimization using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) HotSpot Java VM
  – Only one core is active on the i7, with the rest in deep power-saving mode; Turbo Boost is used, which increases the i7's performance advantage but slightly decreases its relative energy efficiency
Improving Performance
• Techniques to increase performance:
  – pipelining: improves clock speed and increases the number of in-flight instructions
  – hazard/stall elimination:
    – branch prediction
    – register renaming
    – out-of-order execution
    – bypassing
  – increased pipeline bandwidth

Deep Pipelining
• Increases the number of in-flight instructions
• Decreases the gap between successive independent instructions
• Increases the gap between dependent instructions
• Depending on the ILP in a program, there is an optimal pipeline depth
• Some structures are tough to pipeline, and deeper pipelines increase the cost of bypassing

Increasing Width
• It is difficult to find more than about four independent instructions per cycle
• It is difficult to fetch more than six instructions per cycle (beyond that, multiple branches must be predicted per cycle)
• Increases the number of ports per structure

Reducing Stalls in Fetch
• Better branch prediction
  – novel ways to index/update the tables and avoid aliasing
  – cascading branch predictors
• Trace cache
  – stores instructions in the common order of execution, not in sequential order
  – in Intel processors, the trace cache stores pre-decoded instructions

Reducing Stalls in Rename/Regfile
• Larger ROB / register file / issue queue
• Virtual physical registers: assign virtual register names to instructions, but assign a physical register only when the value is actually produced
• Runahead: while a long-latency instruction waits, let a thread run ahead to prefetch (this thread can deallocate resources more aggressively than a processor supporting precise execution)
• Two-level register files: values being kept around in the register file only for precise exceptions can be moved to a second level

Performance beyond Single-Thread ILP
• There can be much more natural parallelism in some applications (e.g., database or scientific codes)
• Explicit thread-level parallelism or data-level parallelism
• Thread: a process with its own instructions and data
  – A thread may be part of a parallel program of multiple processes, or it may be an independent program
  – Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Data-level parallelism: perform identical operations on data, and lots of data

Thread-Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
• Goal: use multiple instruction streams to improve
  – the throughput of computers that run many programs, and
  – the execution time of multithreaded programs
• TLP could be more cost-effective to exploit than ILP

Thread-Level Parallelism
• Motivation: a single thread leaves a processor under-utilized for most of the time; doubling the processor area barely improves single-thread performance
• Strategies for thread-level parallelism:
  – multiple threads share one large processor: reduces under-utilization and allows efficient resource allocation - Simultaneous Multi-Threading (SMT)
  – each thread executes on its own mini-processor: simple design, low interference between threads - Chip Multi-Processing (CMP)

New Approach: Multithreaded Execution
• Multithreading: multiple threads share the functional units of one processor via overlapping
  – The processor must duplicate the independent state of each thread: e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table (see the sketch below)
  – Memory is shared through the virtual memory mechanisms, which already support multiple processes
  – Hardware supports fast thread switching - much faster than a full process switch, which takes 100s to 1000s of clocks
• When to switch?
  – Alternate instructions per thread (fine grain)
  – When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
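A sketch of the duplicated per-thread state, plus the fine-grained switch policy described on the next slide; field widths, the thread count, and the selection routine are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    /* Per-thread state the hardware must duplicate to multithread one core;
       caches and functional units stay shared. */
    struct hw_thread {
        uint32_t pc;                /* separate PC per thread */
        uint64_t regs[32];          /* separate architectural register file */
        uint64_t page_table_base;   /* separate address space for independent programs */
        bool     stalled;           /* e.g., waiting on a cache miss */
    };

    #define NUM_THREADS 4
    static struct hw_thread threads[NUM_THREADS];
    static int current;

    /* Fine-grained policy: round-robin every cycle, skipping any stalled
       threads. Returns -1 if all threads are stalled. */
    int next_thread(void) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (current + i) % NUM_THREADS;
            if (!threads[t].stalled)
                return current = t;
        }
        return -1;
    }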
Fine-Grained Multithreading
• Switches between threads on each instruction, so that the execution of multiple threads is interleaved
  – Usually done in round-robin fashion, skipping any stalled threads
  – The CPU must be able to switch threads every clock
• Advantage: it can hide both short and long stalls, since instructions from other threads execute when one thread stalls
• Disadvantage: it slows down the execution of each individual thread, since a thread that is ready to execute without stalls is still delayed by instructions from other threads
• Used on Sun's Niagara

Coarse-Grained Multithreading
• Switches threads only on costly stalls, such as L2 cache misses
• Advantages:
  – Relieves the need for very fast thread switching
  – Doesn't slow down any one thread, since instructions from other threads are issued only when a thread encounters a costly stall
• Disadvantage: it is hard to overcome the throughput losses from shorter stalls, because of pipeline start-up costs
  – Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen
  – The new thread must fill the pipeline before instructions can complete
  – Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
• Used in the IBM AS/400

Simultaneous Multi-Threading
[Figure: issue slots per cycle for an 8-unit machine (M = load/store, FX = fixed point, FP = floating point, BR = branch, CC = condition codes). With one thread, many of the 8 slots go empty in each of cycles 1-9; with two threads issuing simultaneously, far more slots are filled.]

Simultaneous Multithreading (SMT)
• SMT insight: a dynamically scheduled processor already has many of the hardware mechanisms needed to support multithreading
  – A large set of virtual registers that can be used to hold the register sets of independent threads
  – Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  – Out-of-order completion allows the threads to execute out of order and achieve better utilization of the hardware
• Just add a per-thread renaming table and keep separate PCs
  – Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multithreaded Categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading machines, with the slots of threads 1-5 and idle slots marked. Only SMT fills slots from several threads in the same cycle.]

Head-to-Head ILP Competition

  Processor               Microarchitecture                    Fetch/Issue/ FU           Clock  Transistors,    Power
                                                               Execute                   (GHz)  die size
  Intel Pentium 4         Speculative, dynamically scheduled;  3/3/4        7 int, 1 FP  3.8    125 M, 122 mm2  115 W
  Extreme                 deeply pipelined; SMT
  AMD Athlon 64 FX-57     Speculative, dynamically scheduled   3/3/4        6 int, 3 FP  2.8    114 M, 115 mm2  104 W
  IBM Power5              Speculative, dynamically scheduled;  8/4/8        6 int, 2 FP  1.9    200 M, 300 mm2  80 W
  (1 CPU only)            SMT; 2 CPU cores/chip                                                 (est.)          (est.)
  Intel Itanium 2         Statically scheduled, VLIW-style     6/5/11       9 int, 2 FP  1.6    592 M, 423 mm2  130 W
Limits to ILP
• Doubling issue rates above today's 3-6 instructions per clock, say to 6-12 instructions, probably requires a processor to:
  – issue 3 or 4 data memory accesses per cycle,
  – resolve 2 or 3 branches per cycle,
  – rename and access more than 20 registers per cycle, and
  – fetch 12 to 24 instructions per cycle.
• The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate
  – E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite consuming the most power!

Limits to ILP
• Most techniques for increasing performance also increase power consumption
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
• Multiple-issue techniques are all energy inefficient:
  – Issuing multiple instructions incurs overhead in logic that grows faster than the issue rate grows
  – There is a growing gap between peak issue rates and sustained performance
• The number of transistors switching = f(peak issue rate), while performance = f(sustained rate); the growing gap between peak and sustained performance means increasing energy per unit of performance

Conclusion
• The limits to ILP (power efficiency, compilers, dependences, ...) seem to cap practical designs at 3- to 6-issue
• Explicit parallelism (data-level parallelism or thread-level parallelism) is the next step in performance
• Coarse-grained vs. fine-grained multithreading: switch only on big stalls vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an out-of-order superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• The balance of ILP and TLP is decided in the marketplace