EEF011 Computer Architecture
Chapter 3: ILP and Its Dynamic Exploitation – Branch and Beyond
吳俊興, Department of Computer Science and Information Engineering, National University of Kaohsiung
October 2004

Chapter Overview
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & the Algorithm
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of the Limitations of ILP
3.9 Limitations on ILP for Realizable Processors
3.10 The P6 Microarchitecture

3.4 Reducing Branch Costs with Dynamic Hardware Prediction

Why branches matter:
1. When we issue n instructions per clock cycle, branches arrive up to n times faster in an n-issue processor
2. Amdahl's Law => with the lower potential CPI of an n-issue processor, the relative impact of control stalls is larger

Review of the basic static (software) schemes:
• Predict taken
• Predict not taken
• Delayed branch

Dynamic Branch Prediction

Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not? The hardware can look for clues in the instructions themselves, or it can use past history; we will discuss both directions.

Key concept: a Branch History Table contains information about what a branch did the last time it was executed.

Performance = f(accuracy, cost of misprediction)

Branch Prediction Buffers (Branch History Table)

1-bit branch-prediction buffer: a small memory
– indexed by the lower portion of the address of the branch instruction (the figure shows a 1024-entry table indexed by low-order bits of the branch address, so each branch's prediction is stored at the entry its address selects)
– each entry holds one bit indicating whether the branch was recently taken or not
– fetching begins in the predicted direction; if the branch is mispredicted, the bit is inverted

Problem: for a loop branch, a 1-bit BHT causes two mispredictions per execution of the loop:
– at the end of the loop, when the branch exits instead of looping as before
– on the first iteration the next time through the loop, when it predicts exit instead of looping

2-bit Dynamic Branch Prediction

2-bit scheme: change the prediction only after mispredicting twice (Figure 3.7, p. 198)

Branch History Table Accuracy

A misprediction can occur because:
• the guess was wrong for that branch, or
• the table returned the history of a different branch that maps to the same index

Measurements (Figure 3.9):
• with a 4096-entry table, misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• a 4096-entry table is about as good as an infinite table, but 4096 entries is a lot of hardware
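As a concrete illustration of the scheme in Figure 3.7, here is a minimal C sketch of a branch history table of 2-bit saturating counters. The table size and the function names are illustrative, not from the text:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 4096   /* illustrative; matches the 4096-entry table above */

    /* 2-bit saturating counters: 0,1 predict not taken; 2,3 predict taken */
    static uint8_t bht[BHT_ENTRIES];

    static unsigned bht_index(uint32_t pc) {
        return (pc >> 2) & (BHT_ENTRIES - 1);  /* low-order bits of the word address */
    }

    bool bht_predict(uint32_t pc) {
        return bht[bht_index(pc)] >= 2;
    }

    void bht_update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[bht_index(pc)];
        if (taken  && *c < 3) (*c)++;   /* saturate at 3 (strongly taken)     */
        if (!taken && *c > 0) (*c)--;   /* saturate at 0 (strongly not taken) */
    }

A 1-bit buffer is the same structure with the counter clamped to {0, 1}; the 2-bit version changes its prediction only after two consecutive mispredictions, which avoids the double misprediction on loop branches described above.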
Correlating Branches

What if we have the code:

    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ... }

Generated MIPS code (aa in R1, bb in R2):

        DSUBUI R3, R1, #2
        BNEZ   R3, L1        ; branch b1: skip the assignment if aa != 2
        DADD   R1, R0, R0    ; aa = 0
    L1: DSUBUI R3, R2, #2
        BNEZ   R3, L2        ; branch b2: skip the assignment if bb != 2
        DADD   R2, R0, R0    ; bb = 0
    L2: DSUBU  R3, R1, R2    ; R3 = aa - bb
        BEQZ   R3, L3        ; branch b3: skip the body if aa == bb

The third branch depends on the outcome of the previous two: if b1 and b2 are both not taken, then aa = bb = 0 and b3 will be taken. So the third "if" can be somewhat predicted from the first two "ifs".

Correlating Branches (Two-Level Predictors)

Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history)
– The behavior of the recent branches then selects between, say, 4 predictions for the next branch, and only the selected prediction is updated
• (2,2) predictor: 2-bit global history, 2-bit local predictors
• (m,n) predictor: uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch
• Number of bits = 2^m × n × number of prediction entries
(In the figure, 4 bits of the branch address select the per-branch 2-bit local predictors, and the 2-bit global branch history, e.g. 01 = not taken then taken, selects which of the four predictions is used.)

Example: Multiple Consecutive Branches

    if (d == 0)    /* branch b1: not taken when d == 0 */
        d = 1;
    if (d == 1)    /* branch b2: not taken when d == 1 */
        ...

If b1 is not taken (d == 0, so d is set to 1), then b2 will also be not taken.

1-bit predictor: if d alternates between 2 and 0, every branch is mispredicted.

(1,1) predictor: a 1-bit predictor with 1 bit of correlation. Each branch has two prediction bits, one used if the last branch was not taken and one used if it was taken; the outcome of the last branch decides which of the two bits is consulted and updated.

Tournament Predictors: Adaptively Combining Local and Global Predictors

Use several levels of branch-prediction tables together with an algorithm for choosing among the multiple predictors.

Advantage: the ability to select, dynamically and on a per-branch basis, the right predictor for the right branch.

Figure 3.16 The state transition diagram for a tournament predictor has four states corresponding to which predictor to use; the 2-bit selector moves toward predictor 1 or predictor 2 on outcomes where one of them is correct and the other is incorrect.

Figure 3.17 The fraction of predictions coming from the local predictor for a tournament predictor (a local 2-bit predictor plus a 2-bit local/global selector) on the SPEC89 benchmarks.

Figure 3.18 The misprediction rate for three different predictors on SPEC89 as the total number of bits is increased.

Accuracy of Different Schemes (Figure 3.15, p. 206)

A (2,2) predictor with 1K entries often outperforms a 2-bit predictor with an unlimited number of entries.
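To make the (m,n) idea concrete, here is a minimal C sketch of a (2,2) correlating predictor: a 2-bit global history register selects one of 2^m banks of 2-bit counters. The sizes and names are illustrative, not from the text:

    #include <stdbool.h>
    #include <stdint.h>

    #define M 2            /* bits of global history: last m branch outcomes */
    #define ENTRIES 1024   /* per-history table size (illustrative)          */

    static uint8_t table[1 << M][ENTRIES];  /* 2^m banks of 2-bit counters */
    static unsigned ghr;                    /* global history register     */

    bool corr_predict(uint32_t pc) {
        return table[ghr][(pc >> 2) & (ENTRIES - 1)] >= 2;
    }

    void corr_update(uint32_t pc, bool taken) {
        uint8_t *c = &table[ghr][(pc >> 2) & (ENTRIES - 1)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = ((ghr << 1) | (taken ? 1 : 0)) & ((1 << M) - 1); /* shift in outcome */
    }

    /* Storage check, per the formula above: 2^m * n * entries
       = 4 * 2 * 1024 = 8K bits for this configuration. */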
3.5 High Performance Instruction Delivery

Predicting well is not enough for a multiple-issue pipeline:
• We expect to deliver a high-bandwidth instruction stream
• For an ideal no-penalty branch, we need to know the next fetch address by the end of the IF stage
• In the classic five-stage pipeline, however, the branch-prediction buffer is accessed only during the ID cycle

The goal here is to be able to fetch an instruction from the destination of a branch:
• You need the next address at the same time as you make the prediction
• This is difficult if you don't know where that instruction is until the branch has been decoded
• The solution is a table that remembers the destination addresses of previous branches

Three concepts: the branch-target buffer, the integrated instruction fetch unit, and handling indirect branches by predicting return addresses.

Branch Target Buffers

• Branch Target Buffer (BTB): a branch-prediction cache that stores the predicted address of the instruction to fetch after a branch
– use the address of the branch as the index, and get back the prediction AND the predicted target address
– the entry must be checked against this branch's address; only predicted-taken branches are stored
• Where the prediction becomes available:
– branch-prediction buffer: accessed during the ID cycle
– branch-target buffer: accessed during the IF stage

Figure 3.19 (p. 210): the PC of the instruction being fetched is matched against the branch-PC field of the buffer (extra state bits can hold the prediction). Yes: the instruction is a branch and is predicted taken, so the stored predicted PC is used as the next PC. No: the branch is not predicted, and fetch proceeds normally (next PC = PC + 4).

Figure 3.20 shows the steps for handling an instruction with a BTB.

An incorrect prediction costs:
• a 1-clock-cycle delay fetching the wrong instruction, plus
• restarting the fetch 1 clock cycle later
for a total penalty of 2 clock cycles.

Example

Penalties for the possible cases:

    Instruction in buffer | Prediction | Actual branch | Penalty (cycles)
    Yes                   | Taken      | Taken         | 0
    Yes                   | Taken      | Not taken     | 2
    No                    | -          | Taken         | 2
    No                    | -          | Not taken     | 0

Example on p. 211 (Figure 3.21): determine the total branch penalty for a BTB using the penalties above. Assume:
• prediction accuracy of 90%
• a buffer hit rate of 90%
• 60% of branches are taken

Branch penalty = (buffer hit rate × fraction of incorrect predictions × 2)    [case 2]
               + ((1 - buffer hit rate) × fraction of taken branches × 2)     [case 3]
Branch penalty = (90% × 10% × 2) + (10% × 60% × 2) = 0.18 + 0.12 = 0.30 clock cycles

Integrated Instruction Fetch Units

A separate, autonomous unit that feeds instructions to the rest of the pipeline:
– integrated branch prediction: the branch predictor constantly predicts branches
– instruction prefetch: the unit autonomously manages the prefetching of instructions, integrated with branch prediction
– instruction memory access and buffering: prefetch is used to hide the cost of crossing cache blocks
Prefetching and trace caches are discussed in Chapter 5.

Return Address Predictors

Predicting indirect jumps, whose destination address varies at run time:
– indirect procedure calls, procedure returns, select or case statements

Approaches:
– predict with a branch-target buffer
– a return-address stack: push the return address on the stack at each call and pop one off at each return
– multi-path fetching to reduce the misprediction penalty
• cache addresses or instructions from multiple paths in the target buffer
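Returning to the branch-target buffer described above, here is a minimal C sketch of a direct-mapped BTB that stores only predicted-taken branches, as in the scheme of Figure 3.19. The size and names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512   /* illustrative size */

    typedef struct {
        bool     valid;
        uint32_t branch_pc;     /* tag: full address of the branch          */
        uint32_t predicted_pc;  /* address to fetch next if predicted taken */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* Accessed during IF with the fetch PC; returns true on a hit. */
    bool btb_next_pc(uint32_t pc, uint32_t *next_pc) {
        BTBEntry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
        if (e->valid && e->branch_pc == pc) {  /* must match: entry is this branch */
            *next_pc = e->predicted_pc;
            return true;
        }
        *next_pc = pc + 4;   /* not predicted: fetch sequentially */
        return false;
    }

    /* Only taken branches are entered, per the scheme above. */
    void btb_update(uint32_t pc, bool taken, uint32_t target) {
        BTBEntry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
        if (taken) {
            e->valid = true; e->branch_pc = pc; e->predicted_pc = target;
        } else if (e->valid && e->branch_pc == pc) {
            e->valid = false;   /* drop an entry that mispredicted taken */
        }
    }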
3.6 Taking Advantage of More ILP with Multiple Issue

The previous techniques aim at an ideal CPI of one:
• eliminate data and control stalls

Multiple-issue processors achieve CPI < 1!!
• start more than one instruction in a given cycle: issue multiple instructions per clock cycle
Related ideas:
• vector processing: explicit coding of independent loops as operations on large vectors of numbers
• multimedia instructions being added to many processors

Two basic flavors:
• superscalar processors, and
• VLIW (very long instruction word) processors

Issuing Multiple Instructions per Cycle

Flavor I: superscalar processors issue a varying number of instructions per clock (1 to 8), and can be either
– statically scheduled (by the compiler, executing in order), or
– dynamically scheduled (by the hardware, as in Tomasulo's algorithm, executing out of order)

Example: a 4-issue statically scheduled superscalar processor
– issue packet: the group of instructions received from the fetch unit that could potentially issue in one clock cycle
– the issue unit examines each instruction in the packet, in order
– an instruction is not issued if it would cause a structural hazard or a data hazard (hardware hazard detection)
Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP PA-8000

Flavor II: VLIW (Very Long Instruction Word)

VLIW issues a fixed number of instructions, formatted either
– as one large instruction, or
– as a fixed instruction packet with the parallelism among the instructions indicated explicitly
VLIW is also known as EPIC (Explicitly Parallel Instruction Computer).
Inherently statically scheduled by the compiler (see Chapter 4)
– a fixed number of instructions (4-16) scheduled by the compiler, which places operations into wide templates
– example: Intel Architecture-64 (IA-64), a 64-bit address architecture

Summary: Issuing Multiple Instructions per Cycle (summary table not reproduced)

A Statically Scheduled Superscalar MIPS Processor

– Fetch 64 bits per clock cycle: an integer instruction on the left, an FP instruction on the right
• fetch/prefetch multiple instructions, but issue/deliver 0 to n instructions per cycle
• hardware hazard detection
In our MIPS example, we can handle 2 instructions per cycle:
• one floating-point instruction, plus
• anything else
– the 2nd instruction can issue only if the 1st instruction issues
– more FP register-file ports are needed to issue an FP load and an FP operation as a pair

    Type              Pipe stages
    Int. instruction  IF ID EX MEM WB
    FP instruction    IF ID EX MEM WB
    Int. instruction     IF ID EX MEM WB
    FP instruction       IF ID EX MEM WB
    Int. instruction        IF ID EX MEM WB
    FP instruction          IF ID EX MEM WB

• A 1-cycle load delay now delays 3 instructions in the superscalar pipeline
– the FP instruction in the right half of the same packet cannot use the loaded value, nor can the instructions in the next issue slot
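As a rough illustration of the in-order issue check for a two-instruction packet in a pipe like this, here is a C sketch. The hazard rules are deliberately simplified and the type and function names are hypothetical:

    #include <stdbool.h>

    typedef enum { CLS_INT, CLS_FP } InstrClass;

    typedef struct {
        InstrClass cls;
        int rd, rs1, rs2;   /* destination and source register numbers */
    } Instr;

    /* The 2nd instruction issues only if the 1st does, its slot is free,
       and it does not read the 1st instruction's result (illustrative
       rules only: a real pipe also distinguishes int and FP registers). */
    int num_issued(const Instr p[2]) {
        if (p[0].cls == p[1].cls)                       /* structural hazard */
            return 1;
        if (p[1].rs1 == p[0].rd || p[1].rs2 == p[0].rd) /* RAW in the packet */
            return 1;
        return 2;
    }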
Dynamic Scheduling with Tomasulo's Algorithm in a Superscalar

• How can we issue two instructions per clock and keep in-order instruction issue for Tomasulo?
– assume 1 integer + 1 floating-point instruction per issue packet
• Two approaches to removing the constraint of issuing exactly one integer and one FP instruction in a clock:
1) issue in half a clock cycle (i.e., run the issue logic at 2X the clock rate)
2) use one Tomasulo control structure for the integer registers and one for the floating-point registers
• Only loads and stores can cause dependences between the integer and FP issue streams:
– replace the load reservation stations with a load queue; operands must be read in the order the loads are fetched
– a load checks the addresses in the store queue to avoid a RAW violation (a sketch of this check appears after the example below)
– a store checks the addresses in the load queue to avoid WAR and WAW violations

Example

Latencies: integer ALU 1 cycle, load/store 2 cycles, FP add 3 cycles.
Assume 2 CDBs, 1 integer ALU, 1 FP unit, and perfect branch prediction.

Integrated ALU: one integer functional unit for both ALU operations and effective-address calculations (Figures 3.25 and 3.26)
• Execute stage: L.D and S.D perform the effective-address calculation; branches execute when the branch condition can be evaluated
• A new loop iteration is fetched and issued every 3 clock cycles
• Issue rate: 5/3 = 1.67 instructions per clock
• The loop executes in 16 clock cycles
• One CDB is enough

Separate ALUs: separate functional units for effective-address calculations and ALU operations (Figures 3.27 and 3.28)
• The loop executes in 5 fewer clock cycles (11 versus 16)
• Two CDBs are needed

Three factors limit the performance of the example pipeline:
1. Imbalance between the functional-unit structure of the pipeline and the example loop
– it is impossible to fully use the FP units; we would need fewer dependent integer operations per loop
2. The amount of overhead per loop iteration is very high
– DADDIU and BNE are 2 out of the 5 instructions
3. The control hazard causes a 1-cycle penalty on every loop iteration
– we assume that instructions following a branch cannot start execution until the branch condition has been evaluated
– accurate branch prediction alone is not sufficient
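Before moving on, here is a minimal C sketch of the load-versus-store-queue address check introduced at the start of this discussion. The queue size and names are illustrative, and a real design must also handle stores whose effective address has not yet been computed:

    #include <stdbool.h>
    #include <stdint.h>

    #define SQ_SIZE 16

    typedef struct { bool valid; uint32_t addr; } SQEntry;

    static SQEntry store_queue[SQ_SIZE];  /* earlier stores not yet committed */

    /* A load may begin its memory access only if no earlier pending store
       writes the same address. */
    bool load_may_start(uint32_t load_addr) {
        for (int i = 0; i < SQ_SIZE; i++)
            if (store_queue[i].valid && store_queue[i].addr == load_addr)
                return false;   /* possible RAW hazard through memory */
        return true;
    }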
3.7 Hardware-based Speculation

• Motivation
– prediction alone is not sufficient to expose a high amount of ILP
– overcome control dependence by speculating on the outcome of branches => execute the program as if our guesses were correct
• Dynamic scheduling vs. speculation
– dynamic scheduling: only fetch and issue instructions as if our branch predictions were always correct
– speculation: fetch, issue, and execute such instructions
• Incorrect speculation => undo
Used in the Intel Pentium II/III/4, AMD K5/K6/Athlon, PowerPC 603/604/G3/G4, MIPS R10000/R12000, and Alpha 21264.

Key Ideas

• Design
– dynamic branch prediction to choose which instructions to execute
– speculation to allow instructions to execute before the control dependences are resolved
• with the ability to undo the effects of an incorrectly speculated sequence
– dynamic scheduling to deal with the scheduling of different combinations of basic blocks
• Implementation
– allow instructions to execute out of order,
– but force them to commit in order, and prevent any irrevocable action until an instruction commits

Reorder Buffer (ROB)

• Reorder buffer
– an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed
– like the reservation stations, a source of operands for other instructions
– covers the interval between the completion of instruction execution and instruction commit
• Comparison with Tomasulo's algorithm
– with speculation, the register file is not updated until the instruction commits
– in Tomasulo's algorithm, once an instruction writes its result, any subsequently issued instruction finds the result in the register file
The ROB completely replaces the store buffers of Tomasulo's algorithm.

ROB Components

• Instruction type field
– indicates whether the instruction is
• a branch (no destination result),
• a store (a memory-address destination), or
• a register operation (an ALU operation or a load, with a register destination)
• Destination field
– supplies the register number (for loads and ALU operations) or the memory address (for stores) where the result should be written
• Value field
– holds the value of the result until the instruction commits
• Ready field
– indicates that the instruction has completed execution and the value is ready

Speculative Tomasulo's Algorithm

1. Issue (= dispatch)
• get an instruction from the instruction queue
• issue it if there is an empty reservation station and an empty slot in the ROB; mark both as in use, and send the ROB number for the result to the RS
2. Execute
• wait until all operands are available, then execute
3. Write result
• write the result on the CDB, tagged with the ROB number (for a store, write the value field of its ROB entry)
• mark the RS as available
4. Commit (three cases; see the sketch after the example below)
• normal commit: occurs when the instruction reaches the head of the ROB and its result is present in the buffer; update the register with the result and free the ROB entry
• store: like a normal commit, but memory is updated instead of a register
• mispredicted branch: the ROB is flushed and execution restarts at the correct successor of the branch

Example: Speculative Tomasulo's Algorithm

Instruction status when MUL.D is ready to commit (this snapshot corresponds to cycle 15 of the original, non-speculative algorithm). Latencies: add 2 cycles, multiply 10 cycles, divide 40 cycles.

    Instruction          Issue  Exec comp  Write result
    L.D   F6, 34(R2)       1       3            4
    L.D   F2, 45(R3)       2       4            5
    MUL.D F0, F2, F4       3      15
    SUB.D F8, F6, F2       4       7            8
    DIV.D F10, F0, F6      5
    ADD.D F6, F8, F2       6      10           11

With speculation, SUB.D and ADD.D will not commit until MUL.D commits, even though their results are available and can already be used by other instructions.
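The commit step is where the speculative algorithm differs most from plain Tomasulo, so here is a minimal C sketch of in-order commit from a circular ROB. The sizes, helper state, and names are illustrative, not from the text:

    #include <stdbool.h>

    typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGOP } RobType;

    typedef struct {
        bool    busy, ready;    /* ready: execution done, value present        */
        RobType type;
        int     dest;           /* register number, or memory address (store)  */
        long    value;
        bool    mispredicted;   /* meaningful only for branches                 */
    } RobEntry;

    #define ROB_SIZE 16
    static RobEntry rob[ROB_SIZE];
    static int head, tail;          /* head = oldest uncommitted instruction */
    static long regfile[32];
    static long mem[1024];

    /* Commit step: retire the instruction at the head of the ROB, in order. */
    void commit_one(void) {
        RobEntry *e = &rob[head];
        if (!e->busy || !e->ready) return;      /* head not finished: wait */
        if (e->type == ROB_REGOP) {
            regfile[e->dest] = e->value;        /* normal commit: update register */
        } else if (e->type == ROB_STORE) {
            mem[e->dest % 1024] = e->value;     /* store: update memory instead   */
        } else if (e->mispredicted) {           /* branch: flush on misprediction */
            for (int i = 0; i < ROB_SIZE; i++) rob[i].busy = false;
            head = tail = 0;                    /* restart at correct successor   */
            return;
        }
        e->busy = false;
        head = (head + 1) % ROB_SIZE;
    }

Because the register file and memory are touched only here, squashing a mispredicted path never requires undoing an architectural update.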
Hazards Through Memory: Load/Store RAW Hazards

Question: given a load that follows a store in program order, are the two related? (Equivalently: is there a RAW hazard between the store and the load?) E.g.:

    st  0(R2), R5
    ld  R6, 0(R3)

Can we go ahead and start the load early?
– the store address could be delayed for a long time by some calculation that leads to R2 (a divide, perhaps)
– we might want to issue/begin execution of both operations in the same cycle

The answer is that we are not allowed to start the load until we know that address 0(R2) ≠ 0(R3):
– do not let a load initiate its second step if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load
– i.e., maintain program order for the computation of a load's effective address with respect to all earlier stores

What about WAR and WAW hazards through memory?
– stores commit in order, so there is no problem

Multiple Issue with Speculation

Separate functional units for address calculation, ALU operations, and branch-condition evaluation.
• Without speculation, an L.D after a branch must wait until the branch outcome is determined
• With speculation, the L.D following the BNE can start execution early

Extended Physical Registers

Speculative Tomasulo's algorithm with a ROB:
– the architecturally visible registers are R0, ..., R31 and F0, ..., F31
– values reside in the visible register set and the reservation stations, and temporarily in the ROB

Alternative to a ROB: a larger physical set of registers with register renaming
– an extended set of physical registers holds both the architecturally visible registers and temporary values
• the extended registers replace the functions of both the ROB and the RS
• a physical register does not become an architectural register until the instruction commits
– during instruction issue, the renaming process maps the names of architectural registers to physical registers using a renaming table, allocating a new, unused register for the destination
• WAW and WAR hazards are avoided by renaming the destination register

Register Renaming versus Reorder Buffers

• Advantage: renaming simplifies instruction commit. Commit requires only two simple actions:
1. record that the mapping between an architectural register number and a physical register number is no longer speculative
2. free up any physical register that was holding the previous value of the architectural register
• Advantage: renaming simplifies instruction issue
– all results are in the extended registers, so there is no need to examine both the ROB and the register file
• Disadvantage: deallocating registers is complex. Before freeing a physical register, we must know that
– it no longer corresponds to an architectural register (rewriting an architectural register makes the renaming table point elsewhere), and
– no further uses of the physical register are outstanding as a source
• which requires examining the source register specifiers of all instructions in the functional-unit queues
In practice, 20-80 extra registers are used (Alpha, PowerPC, MIPS, Pentium, ...)
– their number limits the number of instructions in execution, since each register is used until its instruction commits
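Here is a minimal C sketch of the rename step just described: sources are read through the map, and the destination gets a freshly allocated physical register. The sizes and names are illustrative:

    #define NUM_ARCH 32
    #define NUM_PHYS 72   /* architectural registers plus extras (the text
                             cites 20-80 extra registers in practice)      */

    static int  rename_map[NUM_ARCH];   /* architectural -> physical mapping */
    static int  phys_free[NUM_PHYS];    /* 1 = register available            */

    /* At issue: map the sources, then allocate a fresh physical register
       for the destination; WAW and WAR hazards disappear because every
       write gets its own register. Returns the new physical register, or
       -1 to stall. Freeing the *old* physical register is the hard part
       noted above: it must wait until the register is no longer mapped
       and has no outstanding readers. */
    int rename_issue(int rd, int rs1, int rs2, int *ps1, int *ps2) {
        *ps1 = rename_map[rs1];
        *ps2 = rename_map[rs2];
        for (int p = 0; p < NUM_PHYS; p++) {
            if (phys_free[p]) {
                phys_free[p] = 0;
                rename_map[rd] = p;   /* later readers of rd now see p */
                return p;
            }
        }
        return -1;   /* no free physical register: stall issue */
    }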
3.8 Studies of the Limitations of ILP

• Conflicting studies of the amount of improvement available, differing in
– benchmarks (vectorized FP Fortran vs. integer C programs)
– hardware sophistication
– compiler sophistication
• Questions the studies try to answer:
– how much ILP is available with existing mechanisms as hardware budgets increase?
– do we need to invent new hardware/software mechanisms to stay on the processor performance curve?

The Ideal Hardware Model

Initial hardware model, using the MIPS compilers. Assumptions for an ideal/perfect machine:
1. Register renaming: infinite virtual/physical registers
– all WAW and WAR hazards are avoided
– an unbounded number of instructions can begin execution simultaneously
2. Branch prediction: perfect, with no mispredictions
3. Jump prediction: all jumps perfectly predicted
– together, a machine with perfect speculation and an unbounded buffer of instructions available
4. Memory-address alias analysis: all addresses are known, and a store can be moved before a load provided the addresses are not equal
5. Perfect caches: all loads and stores complete in one cycle

Assumptions 2 + 3 eliminate all control dependences; assumptions 1 + 4 eliminate all but true data dependences (RAW). All instructions have 1-cycle latency, and an unlimited number of instructions can issue per clock cycle.

Upper Limit to ILP: the Ideal Machine (Figure 3.35, p. 242)

This is the amount of parallelism available when there are no branch mispredictions and we are limited only by data dependences: 18-63 instructions per cycle for the integer programs, 75-150 for the FP programs.

Limitations on Window Size and Maximum Issue Count

• Operand dependence comparisons required to decide whether n issuing instructions have any register dependences among them:
2 × (1 + 2 + ... + (n-1)) = 2 × n(n-1)/2 = n^2 - n
• Window: the set of instructions kept in the processor and examined for simultaneous execution
number of result comparisons per cycle = maximum completion rate × window size × number of operands per instruction
• The later studies assume a 2K-instruction window and a maximum of 64 issues per clock

Figure 3.36 The effects of reducing the window size: FP (59-61) > integer (15-41)
Figure 3.37 The effect of window size on the average issue rate

Effects of Realistic Branch and Jump Prediction

How much parallelism remains when we replace perfect branch prediction with a realistic model? The possibilities:
1. Perfect: all branches perfectly predicted (as above)
2. Tournament-based branch predictor: a correlating 2-bit predictor and a non-correlating 2-bit predictor together with a selector; 8K entries for branches and 2K entries for jumps
3. Standard 2-bit predictor with 512 2-bit entries
4. Static: a predictor that uses the profile history of the program and predicts each branch always taken or always not taken
5. None: parallelism is limited to the basic block
The later studies assume a 2K window, 64 issues, and the tournament-based predictor.

Figure 3.38 The effect of branch-prediction schemes: FP (15-48) > integer (9-12)
Figure 3.39 The same data sorted by application
Figure 3.40 Branch-prediction accuracy

Effects of Finite Registers for Renaming

FP (16-45) > integer (10-15)
• Window size = 2K, maximum issue = 64 instructions, tournament-based branch predictor
• The impact on the integer programs is small, primarily because the limits on window size and branch prediction have already limited their ILP substantially
• The later studies assume 256 integer and 256 FP registers available for renaming

Effects of Imperfect Memory Alias Analysis

Models of memory alias analysis (memory disambiguation):
1. Perfect memory disambiguation
2. Global/stack perfect: global and stack references are disambiguated perfectly (the best compiler-based analysis schemes approach this), and all heap references are assumed to conflict
3. Inspection: examine the accesses to see whether they can be determined not to interfere at compile time
– e.g. Mem[R10 + 20] and Mem[R10 + 100] can never conflict (same base register with different offsets; see the sketch below)
4. None: all memory references are assumed to conflict
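A minimal C sketch of the "inspection" test (model 3 above): two base+offset references can be proven independent when they use the same base register and their byte ranges cannot overlap. The type and function names are illustrative:

    #include <stdbool.h>

    typedef struct { int base_reg, offset, size; } MemRef;

    /* True only when the two accesses provably never touch the same bytes,
       e.g. Mem[R10 + 20] vs. Mem[R10 + 100] with small access sizes. */
    bool provably_disjoint(MemRef a, MemRef b) {
        if (a.base_reg != b.base_reg)
            return false;   /* bases unrelated at compile time: assume conflict */
        return a.offset + a.size <= b.offset ||
               b.offset + b.size <= a.offset;
    }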
• Since there are no heap references in Fortran, there is no difference between perfect and global/stack-perfect analysis for the Fortran programs
• 2K window, 64 issues

Figure 3.43 The effect of alias analysis
Figure 3.44 The same data sorted by application

3.10 The P6 Microarchitecture

– The basis of the Pentium Pro, Pentium II, and Pentium III
– A dynamically scheduled processor that translates each IA-32 instruction into a series of micro-operations (uops) executed by the pipeline
• uops are similar to typical RISC instructions
• up to 3 IA-32 instructions are fetched, decoded, and translated into uops every clock cycle (a maximum of 6 uops per cycle)
• an IA-32 instruction that requires more than 4 uops is implemented by a microcoded sequence that generates the necessary uops over multiple clock cycles
The three processors differ in clock rate, cache architecture, and memory interface. The Pentium II added MMX (multimedia extensions); the Pentium III added SSE (Streaming SIMD Extensions).

P6 Microarchitecture Pipeline

• uops are executed by an out-of-order, speculative pipeline using register renaming and a ROB (similar to Section 3.7)
– up to 3 uops per clock can be renamed and dispatched to reservation stations, and up to 3 can commit
• 14 superpipelined stages:
– 8 stages for in-order instruction fetch, decode, and dispatch
• a 512-entry, two-level (correlating) branch predictor
• the decode and issue stages include 40 extended registers for register renaming; uops are dispatched to one of 20 reservation stations and one of 40 ROB entries
– 3 stages for out-of-order execution in one of 5 separate functional units (integer ALU, FP, branch, memory address, memory access), with execution latencies of 1-32 cycles
– 3 stages for instruction commit
A repeat rate of 2 means that a unit can start a new operation every other cycle.

Stalls in the Decode Cycle

Figure 3.50 The number of instructions decoded per clock (average: 0.87 instructions per cycle)
Figure 3.51 Stall cycles per instruction at decode time (I-cache misses plus lack of RS/ROB entries)

Figure 3.52 The number of micro-operations per IA-32 instruction
• most instructions take only one uop
• on average, 1.37 uops per IA-32 instruction
• except for fpppp, the integer programs typically require more uops than the FP programs

Figure 3.53 The number of misses per thousand instructions for the L1 and L2 caches
• L1 = 8KB I + 8KB D (L1 misses are largely hidden by speculative execution)
• L2 = 256KB (roughly 5 times the miss cost; L2 behavior dominates performance)

Figure 3.54 BTB miss frequency (which dominates) vs. misprediction frequency
If the BTB misses, a static prediction is used:
• backward branches are predicted taken (a 1-cycle penalty even when correctly predicted)
• forward branches are predicted not taken (no penalty when correctly predicted)
Branch misprediction costs:
• direct penalty: 10-15 cycles
• indirect penalty: the incorrectly speculated instructions, which are hard to measure
On average, about 20% of branches use the simple static predictor rule.

Instruction Commit

Figure 3.55 The fraction of issued instructions that do not commit; on average, each mispredicted branch issues 20 uops that are canceled
Figure 3.56 A breakdown of how often 0, 1, 2, or 3 uops commit in a cycle (average: 55%, 13%, 8%, 23%)

Figure 3.57 The actual CPI and the individual CPI components; the uop cycles assume that 3 uops complete every cycle and include the number of uops per instruction. The average CPI is 1.15 for the SPECint programs and 2.0 for the SPECfp programs.

AMD Athlon

Similar to the P6 microarchitecture (Pentium III), but with more resources (PIII vs. Athlon):
• Transistors: 24M vs. 37M
• Die size: 106 mm^2 vs. 117 mm^2
• Power: 30W vs. 76W
• Caches: 16K/16K/256K vs. 64K/64K/256K
• Window size: 40 vs. 72 uops
• Rename registers: 40 vs. 36 int + 36 FP
• BTB: 512 x 2 vs. 4096 x 2
• Pipeline: 10-12 stages vs. 9-11 stages
• Clock rate: 1.0 GHz vs. 1.2 GHz
• Memory bandwidth: 1.06 GB/s vs. 2.12 GB/s

Pentium 4: the NetBurst Microarchitecture

• Still translates 80x86 instructions into micro-ops
• A much deeper pipeline: 24 stages (vs. 14)
• Uses register renaming (potentially up to 128 physical registers) rather than a ROB (vs. a 40-entry ROB)
– window: 40 vs. 126
• 7 execution units (vs. 5: one more ALU and an address-computation unit)
• The P4 has a better branch predictor and more functional units
• Aggressive ALUs and data cache, operating on half clock cycles
• An 8-times-larger BTB (4096 vs. 512 entries)
• New SSE2 instructions allow 2 floating-point operations per instruction
• The instruction cache holds micro-operations instead of 80x86 instructions
– no 80x86 decode stages on a cache hit
– called the trace cache (TC)
• A faster memory bus: 400 MHz vs. 133 MHz
• Caches
– Pentium III: L1I 16KB, L1D 16KB, L2 256KB
– Pentium 4: L1I 12K uops (trace cache), L1D 8KB, L2 256KB
– block size: PIII 32B vs. P4 128B; 128 vs. 256 bits/clock
• Clock rates: Pentium III 1 GHz vs. Pentium 4 1.5 GHz

The Pentium, Pentium Pro, and Pentium 4 Pipelines
• Pentium (P5): 5 stages
• Pentium Pro, II, III (P6): 10 stages (1-cycle execute)
• Pentium 4 (NetBurst): 20 stages (decode not counted; uops come from the trace cache)

Block Diagram of the Pentium 4 Microarchitecture
• BTB = branch target buffer (branch predictor)
• I-TLB = instruction TLB; the trace cache serves as the instruction cache
• RF = register file; AGU = address generation unit
• "Double-pumped ALU" means the ALU runs at 2X the clock rate, effectively doubling the ALU functional units

Pentium 4 Die Photo
• 42M transistors (PIII: 26M)
• 217 mm^2 (PIII: 106 mm^2)
• L1 execution (trace) cache buffers 12,000 micro-ops
• 8KB data cache
• 256KB L2 cache

Benchmarks: Pentium 4 vs. PIII vs. Athlon
• SPECbase2000
– Int: P4 @ 1.5 GHz: 524; PIII @ 1 GHz: 454; AMD Athlon @ 1.2 GHz: ?
– FP: P4 @ 1.5 GHz: 549; PIII @ 1 GHz: 329; AMD Athlon @ 1.2 GHz: 304
• WorldBench 2000 benchmark (business; PC World magazine, Nov. 20, 2000; bigger is better)
– P4: 164; PIII: 167; AMD Athlon: 180
• Quake 3 Arena: P4 172, Athlon 151
• SYSmark 2000 composite: P4 209, Athlon 221
• Office productivity: P4 197, Athlon 209
• S.F. Chronicle, 11/20/00: "... the challenge for AMD now will be to argue that frequency is not the most important thing, precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."

Why Is the Pentium 4 Slower than the Pentium III?
• The instruction count is the same for x86 programs
• Clock rates: P4 > Athlon > PIII
• How can the P4 be slower? Time = instruction count × CPI × 1/clock rate
• So the average CPI of the P4 must be worse than that of the Athlon and the PIII
• Will CPI ever get below 1.0 for real programs?
Another Approach: Multithreaded Execution for Servers

• Thread: a process with its own instructions and data
– a thread may be one process of a parallel program of multiple processes, or it may be an independent program
– each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Multithreading: multiple threads share the functional units of one processor via overlapping
– the processor must duplicate the independent state of each thread, e.g. a separate copy of the register file and a separate PC
– memory is shared through the virtual-memory mechanisms
• Threads execute overlapped, often interleaved
– when one thread stalls, perhaps on a cache miss, another thread can execute, improving throughput

Summary
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & the Algorithm
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of the Limitations of ILP
3.9 Limitations on ILP for Realizable Processors
3.10 The P6 Microarchitecture