Out-of-Order Execution Scheduling
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Instruction Level Parallel Processing
• Sequential Execution Semantics
• Out-of-Order Execution
  – How it can help
  – Issues:
    • Maintaining Sequential Semantics
    • Scheduling
      – Scoreboard
    • Register Renaming
• Initially, we’ll focus on Registers, Memory later on

Sequential Semantics - Review
• Instructions appear as if they executed:
  – In the order they appear in the program
  – One after the other
[Figure: program order under Pipelining, Superscalar, and Out-of-Order execution]

Out-of-Order Execution
    do {
        sum += a[++m];
        i--;
    } while (i != 0);

    loop: add r4, r4, 1
          ld  r2, 10(r4)
          add r3, r3, r2
          sub r1, r1, 1
          bne r1, r0, loop
[Figure: fetch/decode/execute timing of add, ld, add, sub, bne, superscalar vs. out-of-order]

Sequential Semantics?
• Execution does NOT adhere to sequential semantics
[Figure: pipeline timing of add, ld, add, sub, bne with points marked "inconsistent" and "consistent"]
• To be precise: Eventually it may
• Simplest solution: Define problem away
• Not acceptable today: e.g., Virtual Memory
• Three-phase Instruction execution
  – In-Progress, Completed and Committed

Out-of-order Execution Issues
• Preserving Sequential Semantics
• Stalling Instructions w/ dependences
• Issuing Instructions when dependences are satisfied

Back to Sequential Semantics
• Instr. exec. in 3 phases:
  – In-progress, Completed, Committed
  – OOO for In-progress and Completed
  – In-order Commits
• Completed - out-of-order: “Visible only inside”
  – Results visible to subsequent instructions
  – Results not visible to outsiders
    • On interrupts completed results are discarded
• Committed - in-order: “Visible to all”
  – Results visible to subsequent instructions
  – Results visible to outsiders
    • On interrupt committed results are preserved

How Completes Help w/ Performance
    DIV R3, _, _
    ADD R1, _, _
    ADD _, R1, _
[Figure: timing of the three instructions with in-order completes vs. out-of-order completes and in-order commits]

In-order commits
[Figure: fetch/decode timing of add, ld, add, sub, bne; instructions complete out of order and commit in order]

Implementing Completes/Commits
• Key idea:
  – Maintain sufficient state around to be able to rollback when necessary
  – Roll-back:
    • Discard (aka Squash) all not committed
• One solution (conceptual):
  – Upon Complete, instruction records previous value of target register
  – Upon Discard, instruction restores target value
  – Upon Commit, nothing to do
• We will return to this shortly
• Focus on scheduling mechanisms

Out-of-Order Execution Overview
[Figure: processing phases (dispatch/dependences, inst. issue, inst. execution, inst. reorder & commit) against program form (static program, dynamic inst. stream (trace), execution window, completed instructions); instruction states In-Progress, Completed, Committed]

Out-of-Order Execution: Stages
• Fetch: get instruction from memory
• Decode/Dispatch: what is it? What are the dependences?
• Issue: Go – all dependences satisfied
• Execute: perform operation
• Complete: result available to other insts.
• Commit: result available to outsiders
• We’ll start w/ Decode/Dispatch
• Then we’ll consider Issue
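The conceptual rollback scheme from the "Implementing Completes/Commits" slide (record the previous target value upon complete, restore it upon discard, nothing extra upon commit) can be sketched as below. This is only an illustration under stated assumptions, not how a real core does it: the class name RegisterFile, the per-instruction sequence numbers, and the method names are made up for the example, and it assumes writes to the same register complete in program order so that undoing in reverse completion order is safe.

```python
class RegisterFile:
    """Toy register file that can undo completed-but-uncommitted writes."""

    def __init__(self, num_regs):
        self.values = [0] * num_regs
        # seq -> (register, previous value); dicts keep insertion order,
        # so this is also the order in which instructions completed.
        self.undo = {}

    def complete(self, seq, reg, value):
        # Upon complete: record the previous value of the target, then write.
        self.undo[seq] = (reg, self.values[reg])
        self.values[reg] = value

    def commit(self, seq):
        # Upon commit: nothing to restore, just forget the undo record.
        del self.undo[seq]

    def squash(self):
        # Upon discard: restore targets of all uncommitted instructions,
        # most recently completed first.
        for seq in reversed(list(self.undo)):
            reg, previous = self.undo[seq]
            self.values[reg] = previous
        self.undo.clear()


rf = RegisterFile(4)
rf.values[1], rf.values[3] = 5, 7
rf.complete(seq=1, reg=1, value=42)    # ADD R1 completes early (out of order)
rf.complete(seq=0, reg=3, value=100)   # DIV R3 completes later
rf.commit(seq=0)                       # DIV is oldest, so it commits first
rf.squash()                            # interrupt: the uncommitted ADD is discarded
print(rf.values)                       # [0, 5, 0, 100]
```

The committed DIV result survives the squash, while the completed-but-uncommitted ADD result is rolled back.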
OOO Scheduling
• Instruction @ Decode:
  – Do I have dependences yet to be satisfied?
  – Yes, stall until they are
  – No, clear to issue
• Wakeup Instructions Stalled:
  – Dependences satisfied
  – Allow instruction to issue
• Dependence:
  – (later instruction, earlier instruction) & type
• We’ll first consider RAW and then move on to WAW and WAR

Stalling @ Decode for RAW
• Are there unsatisfied dependences?
  – RAW: have to wait for register value
  – We don’t really care who is producing the value
  – Only whether it is available
• Can use the Register Availability Vector as in pipelining/superscalar
  – Also known as scoreboard
• At Decode
  – Reset bit corresponding to your target
  – At writeback, set it
  – Check bits of all source regs: if any is 0, stall

Issuing Instructions: Scheduling
• Determine when an instruction can issue
  – Ignore resources for the time being
• Stalled because of RAW w/ preceding instruction
• Concept:
  – Producer (write) notifies consumers (read)
• Requirements:
  – Consumers need to be able to identify producer
  – The register name is one possible link
• Mechanism
  – Consumer placed in a reservation station
  – Producer, on complete, broadcasts its identity
  – Waiting instructions observe
  – Update Operand Availability
  – Issue if all operands now available

Reservation Station
• State pertaining to an instruction
  – What registers it reads
  – Whether they are available
  – What is the destination register
  – What state is the instruction in
    • Waiting
    • Executing

Out-Of-Order Exec. Example
    loop: add r4, r4, 4
          ld  r2, 10(r4)      (4 cycles lat)
          add r3, r3, r2
          sub r1, r1, 1
          bne r1, r0, loop

Cycle 0 (initial state)
    RAV:  r1=1  r2=1  r3=1  r4=1
    op    src1   src2   tgt    status
    (no entries yet)

Out-Of-Order Exec. Example: Cycle 0
    (ld: 5 cycles lat)
    RAV:  r1=1  r2=1  r3=1  r4=0
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4/0   Rdy
• add: Ready to be executed

Cycle 1
    RAV:  r1=1  r2=0  r3=1  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Exec
    ld    r4/1   NA/1   r2     Rdy
• R4 gets produced now
• Notify those waiting for R4

Cycle 2
    RAV:  r1=1  r2=0  r3=0  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Cmtd
    ld    r4/1   NA/1   r2     Exec
    add   r3/1   r2/0   r3     Wait
• ld: Result available @ cycle 6
• add: Wait for r2

Cycle 3
    RAV:  r1=0  r2=0  r3=0  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Cmtd
    ld    r4/1   NA/1   r2     Exec
    add   r3/1   r2/0   r3     Wait
    sub   r1/1   NA/1   r1     Rdy
• ld: Result available @ cycle 6
• add: Wait for r2
• sub: No dependences

Cycle 4
    RAV:  r1=1  r2=0  r3=0  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Cmtd
    ld    r4/1   NA/1   r2     Exec
    add   r3/1   r2/0   r3     Wait
    sub   r1/1   NA/1   r1     Exec
    bne   r1/1   r0/1   NA     Rdy
• ld: Result available @ cycle 6
• add: Wait for r2
• sub: r1 produced now; notify consumers
• bne: r1 will be available next cycle

Cycle 5
    RAV:  r1=1  r2=0  r3=0  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Cmtd
    ld    r4/1   NA/1   r2     Exec
    add   r3/1   r2/0   r3     Wait
    sub   r1/1   NA/1   r1     Compl
    bne   r1/1   r0/1   NA     Exec
• ld: Result available @ cycle 6
• add: Wait for r2
• sub: Completed executing
Cycle 6
    RAV:  r1=1  r2=1  r3=0  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Cmtd
    ld    r4/1   NA/1   r2     Exec
    add   r3/1   r2/1   r3     Rdy
    sub   r1/1   NA/1   r1     Compl
    bne   r1/1   r0/1   NA     Exec
• ld: Result available @ cycle 6; notify consumers
• add: was waiting for r2, now ready
• sub: Completed executing

Cycle 7
    RAV:  r1=1  r2=1  r3=1  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Cmtd
    ld    r4/1   NA/1   r2     Cmtd
    add   r3/1   r2/1   r3     Exec
    sub   r1/1   NA/1   r1     Compl
    bne   r1/1   r0/1   NA     Compl
• add: Executing; notify consumers
• sub, bne: Completed

Cycle 8
    RAV:  r1=1  r2=1  r3=1  r4=1
    op    src1   src2   tgt    status
    add   r4/1   NA/1   r4     Cmtd
    ld    r4/1   NA/1   r2     Cmtd
    add   r3/1   r2/1   r3     Cmtd
    sub   r1/1   NA/1   r1     Cmtd
    bne   r1/1   r0/1   NA     Cmtd

Notifying Consumers
• Identity of Producer
• Uniquely Identify the Instruction
• Easily retrievable @ decode by others
  – Target Register
    • Recall we stall on WAR or WAW
  – Functional Unit
    • If not pipelined
  – Place in instruction window
  – PC? No. Why?

Name Dependences and OOO
• WAW or WAR: We need to update register but others are still using it
  – add r1, r1, 10
  – sw r1, 20(r2)
  – add r1, r3, 30
  – sub r2, r1, 40
• There is only one r1
  – sw needs to see the value of 1st add
  – sub needs to wait for 2nd add and not 1st
• Solution: Stall decode when WAW or WAR

Detecting WAW and WAR
• WAW? Look at Scoreboard
  – If bit is 0 then there is a pending write
  – Stall
• WAR? Need to know whether all preceding consumers have read the value
  – Keep a count per register
  – Increase at decode for all reads
  – Decrease on issue
• More elegant solution via register renaming
  – Soon
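The detection rules above (scoreboard bit for RAW and WAW, a per-register count of pending reads for WAR) can be sketched as a small bookkeeping class. This is a hedged illustration only: the class DecodeHazards and its method names are assumptions for the example, and timing is abstracted into explicit decode, issue, and writeback calls. It replays the add / sw / add sequence from the "Name Dependences and OOO" slide.

```python
class DecodeHazards:
    """Toy model: availability bits plus per-register pending-read counts."""

    def __init__(self, regs):
        self.available = {r: True for r in regs}    # RAV / scoreboard bits
        self.pending_reads = {r: 0 for r in regs}   # decoded, not yet issued readers

    def check(self, srcs, tgt=None):
        """Return the hazards the instruction would hit if decoded now."""
        found = set()
        if any(not self.available[s] for s in srcs):
            found.add("RAW")                        # a source is still being produced
        if tgt is not None:
            if not self.available[tgt]:
                found.add("WAW")                    # pending write to the same target
            if self.pending_reads[tgt] > 0:
                found.add("WAR")                    # earlier readers have not read yet
        return found

    def decode(self, srcs, tgt=None):
        for s in srcs:
            self.pending_reads[s] += 1              # increase at decode for all reads
        if tgt is not None:
            self.available[tgt] = False             # reset bit of the target

    def issue(self, srcs):
        for s in srcs:
            self.pending_reads[s] -= 1              # decrease on issue

    def writeback(self, tgt):
        self.available[tgt] = True                  # set bit at writeback


h = DecodeHazards(["r1", "r2", "r3"])
h.decode(["r1"], "r1")                  # add r1, r1, 10
h.issue(["r1"])
sw_hazards = h.check(["r1", "r2"])      # sw r1, 20(r2): must wait for r1
h.decode(["r1", "r2"])                  # sw waits in a reservation station
before = h.check(["r3"], "r1")          # add r1, r3, 30 while 1st add and sw are pending
h.writeback("r1")                       # 1st add produces r1
h.issue(["r1", "r2"])                   # sw reads r1 and issues
after = h.check(["r3"], "r1")           # 2nd add is now clear to decode
print(sw_hazards, before, after)
```

In this sketch the second add is held at decode until both the pending write (WAW) and the pending read by sw (WAR) have cleared.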
Window vs. Scheduler
• Window
  – Distance between oldest and youngest instruction that can co-exist inside the CPU
  – Larger window → potential for more ILP
• Scheduler
  – Number of instructions that are waiting to be issued
• Window
  – Instructions enter at Fetch
  – Exit at Commit
• Scheduler
  – Instructions enter at Decode
  – Leave at writeback
• Window >= Scheduler
  – Can be the same structure
    • In window but not in scheduler: completed instructions

Scoreboarding
• Schedule based on RAW dependences
• WAW and WAR cause stalls
  – WAW at decode
  – WAR at writeback
    • Optimization: Why is this OK?
• Implemented in the CDC 6600 in ‘64
  – 18 non-pipelined FUs
    • 4 FP: 2 mul, 1 add, 1 div
    • 7 MEM: 5 load, 2 store
    • 7 INT: add, shift, logical etc.
• Centralized Control Scheme
  – Controls all Instruction Issue
  – Detects all hazards

MIPS/DLX w/ Scoreboarding
[Figure: register file connected to the FP mul, FP mul, FP divide, FP add, and integer units, all controlled by the scoreboard]

Scoreboarding Overview
• Ignore IF and MEM for simplicity
• 4-stage execution
  – Issue
    • Check for structural hazards
    • Check for WAW hazards
    • Stall until all clear
  – ReadOp
    • Check for RAW hazards
    • Wait until all operands ready
    • Read Registers
  – Execute
    • Execute Operations
    • Notify scoreboard when complete
  – Write
    • Check for WAR hazards
    • Stall Write until all clear
• A completing instruction cannot write dest if an earlier instruction has not read dest.

Scoreboarding Optimizations/Tricks
• WAW as in original OOO
• WAR is optimized
  – Second Producer is allowed to execute up to complete
  – It is stalled there until preceding consumers complete
• No Commit
  – No precise interrupts
• Window is implemented in the scoreboard
• One entry per Functional Unit
  – Recall not pipelined
  – Instructions identified by FU id
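The hazard checks of the Issue, ReadOp, and Write stages can be sketched as three predicates over per-FU state. This is a simplified illustration, not the CDC 6600 logic itself. The field names follow the usual scoreboard notation (Fi destination, Fj/Fk sources, Rj/Rk source flags), and the code assumes Rj/Rk mean "operand is ready and has not been read yet", which is the textbook convention; the FUStatus class and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    rj: bool = False           # source 1 ready and not yet read (assumed meaning)
    rk: bool = False           # source 2 ready and not yet read (assumed meaning)


def can_issue(fu, dest, result_status):
    # Issue: no structural hazard (FU free) and no WAW (no pending writer of dest).
    return not fu.busy and dest not in result_status


def can_read_operands(fu):
    # ReadOp: RAW check, both operands must be ready.
    return fu.rj and fu.rk


def can_write(fu, all_fus):
    # Write: WAR check, no other busy FU may still need the old value of fu.fi.
    for other in all_fus:
        if other is fu or not other.busy:
            continue
        if (other.fj == fu.fi and other.rj) or (other.fk == fu.fi and other.rk):
            return False
    return True


# DIVD F10, F0, F6 has not read F6 yet; ADDD F6, F8, F2 has finished executing.
fus = {
    "Add": FUStatus(busy=True, fi="F6", fj="F8", fk="F2"),
    "Divide": FUStatus(busy=True, fi="F10", fj="F0", fk="F6", rj=False, rk=True),
}
result_status = {"F6": "Add", "F10": "Divide"}

blocked = not can_write(fus["Add"], fus.values())   # WAR on F6 stalls the write
fus["Divide"].rk = False                            # DIVD has now read F6
released = can_write(fus["Add"], fus.values())
print(blocked, released)                            # True True
```

Note how the WAR stall is taken at write rather than at decode, which is the optimization described on the slide: the second producer executes fully and only waits at the end.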
Scoreboarding Organization
• Three structures
  – Instruction Status
  – Functional Unit Status
  – Register Result Status
• Instruction Status
  – Which stage the instruction is currently in
• Functional Unit Status: scheduling
  – Busy
  – OP: Operation
  – Fi: Dest. Reg.
  – Fj, Fk: Source Regs
  – Qj, Qk: FUs producing sources
  – Rj, Rk: Ready bits for sources
• Register Result Status: dep. determination
  – Which FU will produce a register

Scoreboarding explained
• Register status reg:
  – Which FU produces the register
• Use at decode
  – Source reg match is a RAW
  – Target reg match is a WAW stall

Functional Unit Status
• Busy:
  – resource allocation
• OP:
  – what to do once issued (e.g., add, sub)
• Dest. Reg.:
  – Where to write result
  – To find WAR
• Fj, Fk: Source Regs
  – for WAR: can’t write if consumers pending for previous value of register (if FU not the same)
• Qj, Qk: FUs producing sources
  – To wait for appropriate producer
• Rj, Rk: Ready bits for sources
  – To determine when ready: all ready

Scoreboarding Example
Instruction status
    Instruction   dest  j     k     Issue  Read operands  Execution complete  Write Result
    LD            F6    34+   R2
    LD            F2    45+   R3
    MULTD         F0    F2    F4
    SUBD          F8    F6    F2
    DIVD          F10   F0    F6
    ADDD          F6    F8    F2
Functional Unit Status
    Name     Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj?)  Rk (Fk?)
    Integer  No
    Mult1    No
    Mult2    No
    Add      No
    Divide   No
Register result status
    Clock    F0  F2  F4  F6  F8  F10  F12  ...  F30
    FU       (all empty)

Example: Cycle 0
Instruction status
    Instruction   dest  j     k     Issue  Read operands  Execution complete  Write Result
    LD            F6    34+   R2    1
    LD            F2    45+   R3
    MULTD         F0    F2    F4
    SUBD          F8    F6    F2
    DIVD          F10   F0    F6
    ADDD          F6    F8    F2
Functional Unit Status
    Name     Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj?)  Rk (Fk?)
    Integer  yes   LD  F6
    Mult1    No
    Mult2    No
    Add      No
    Divide   No
Register result status
    Clock    F0  F2  F4  F6       F8  F10  F12  ...  F30
    FU                   integer

Example, contd.
• The rest you’ll find on the web site
• Go through it
• Source: Patterson
• Summary:
  – Execution proceeds in an order dictated by dependences
  – RAW, WAR and WAW force ordering
  – Tricks may be possible

Beyond Simple OoO
    A: LF   F6, 34(R2)
    B: LF   F2, 45(R3)
    C: MULF F0, F2, F4
    D: SUBF F8, F2, F6
    E: ADDF F2, F7, F4
[Figure: dependence graph over A, B, C, D, E]
• E will wait for B, C and D.
• WAR w/ C and D
• WAW w/ B
• Can we do better?

What if we had infinite registers
    Original                   With a fresh name for E's target
    A: LF   F6, 34(R2)         A: LF   F6, 34(R2)
    B: LF   F2, 45(R3)         B: LF   F2, 45(R3)
    C: MULF F0, F2, F4         C: MULF F0, F2, F4
    D: SUBF F8, F2, F6         D: SUBF F8, F2, F6
    E: ADDF F2, F7, F4         E: ADDF F9, F7, F4
• No false dependences anymore
• Since we do not reuse a name we can’t have WAW and WAR

Why we can’t have Infinite Registers
• False/Name dependences (WAR and WAW)
  – Artifact of having finite registers
• There is no such thing as infinite
• There is no such thing as large enough
  – Well there is (in a sec.)
  – Computers execute Billions of Instructions per sec. Even a multi-billion register file would soon be exhausted
• Want to exploit parallelism across several instances of the same code
  – Loops, recursive functions (most frequent part)

Yes, there is “large enough”
• At any given point there will be a finite number of instructions in the window
• If each instruction has a single register target
• If there are N instructions
• How many registers do we need?
  • N?
  • N + X?

Register Renaming
• Register Version
  – Every Write creates a new version
  – Uses read the last version
  – Need to keep a version until all uses have read it.
• Register Renaming:
  – Architectural vs. Physical Registers
    • more phys. than arch.
  – Maintain a map of arch. to phys. regs.
  – Use in-order decoding to properly identify dependences.
  – Instructions wait only for input op. availability.
  – Only last version is written to reg. file.

Register Renaming
    Original                  Renamed (tgt, src1, src2)
    A: DIVF F3, F1, F0        r1, -,  -
    B: SUBF F2, F1, F0        r2, -,  -
    C: MULF F0, F2, F4        r3, r2, -
    D: SUBF F6, F2, F3        r4, r2, r1
    E: ADDF F2, F5, F4        r5, -,  -
    F: ADDF F0, F0, F2        r6, r3, r5
Register Rename Table (mapping after each instruction; columns F0 ... F30, only changed registers shown)
    After   F0   F2   F3   F6
    A                 R1
    B            R2   R1
    C       R3   R2   R1
    D       R3   R2   R1   R4
    E       R3   R5   R1   R4
    F       R6   R5   R1   R4
• Need more physical registers than architectural
• Ignore control flow for the time being.

Register Renaming Process
• Only need to remember last producer of each architectural register
  – Vector
• At decode
  – Find the most recent producers for all source registers
  – After: declare self as most recent producer of target register
• Complication:
  – May have to retract
    • Speculative Execution, e.g., interrupts
  – Need to be able to restore the mapping state

Register Renaming Support Structures
• Register Rename Table
  – f(aR) = pR
  – one entry per architectural Register
• Free Register List
  – Lists not used Physical Registers
• At Decode
  – grab a new register from the free list
  – Change mapping in rename table
• At Commit
  – Release Register? Not… Why?
  – Could release previous version

How Many Physical Registers?
• Correctness:
  – At least as many as architectural, plus?
• Performance:
  – As many as possible
  – Not correctness
  – Recall not all instructions produce register results
    • stores and branches
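The decode-time renaming steps (look up the most recent producer of each source, then grab a free physical register and remap the target) can be sketched as follows. This is a minimal illustration under assumptions: the function name rename, the tuple encoding of instructions, and the free-list contents r1..r6 are chosen for the example, "-" stands for a source with no in-flight producer as on the slide, and releasing registers at commit is not modeled.

```python
from collections import deque


def rename(program, free_regs):
    """Rename (op, tgt, src1, src2) tuples in program order.

    Returns the renamed instructions and the final rename table.
    """
    free_list = deque(free_regs)     # free register list
    table = {}                       # rename table: architectural -> physical
    renamed = []
    for op, tgt, src1, src2 in program:
        # Sources first: find the most recent producer of each source register.
        p1 = table.get(src1, "-")
        p2 = table.get(src2, "-")
        # Then declare self as the most recent producer of the target.
        phys = free_list.popleft()
        table[tgt] = phys
        renamed.append((op, phys, p1, p2))
    return renamed, table


program = [
    ("DIVF", "F3", "F1", "F0"),   # A
    ("SUBF", "F2", "F1", "F0"),   # B
    ("MULF", "F0", "F2", "F4"),   # C
    ("SUBF", "F6", "F2", "F3"),   # D
    ("ADDF", "F2", "F5", "F4"),   # E
    ("ADDF", "F0", "F0", "F2"),   # F
]
renamed, table = rename(program, ["r1", "r2", "r3", "r4", "r5", "r6"])
for inst in renamed:
    print(inst)
print(table)
```

Reading the sources before remapping the target matters for instructions such as F, whose source F0 must resolve to the older version r3 rather than to its own new target r6.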
Dynamic Scheduling
    Original                  Renamed (tgt, src1, src2)
    A: DIVF F3, F1, F0        r1, -,  -
    B: SUBF F2, F1, F0        r2, -,  -
    C: MULF F0, F2, F4        r3, r2, -
    D: SUBF F6, F2, F3        r4, r2, r1
    E: ADDF F2, F5, F4        r5, -,  -
    F: ADDF F0, F0, F2        r6, r3, r5
• Values and Names flow together (Name + Value)
• Writeback specifies both value and name
• A waiting instruction inspects all results
• It is allowed to execute when all inputs are available

Physical Registers
• Physical register file is just one option
• What we need is separate storage
  – Consumers could keep values in their reservation station
  – Tomasulo’s next

Tomasulo’s Algorithm
• IBM 360/91 - Fast 360 for scientific code
  – Completed in 1967
  – Dynamic scheduling
  – Predates cache memories
• Pipelined FUs
  – Adder up to 3 instructions
  – Multiplier up to 2 instructions
• Tomasulo vs. Scoreboard
  – Distributed hazard detection and control
  – Results are bypassed to FUs
  – Common Data Bus (CDB) for results
    • All results visible to all instead of via a register

DLX w/ Tomasulo
• Tomasulo’s Algorithm
  – Use “tags” to identify data values
  – Reservation stations: distributed control
  – CDB broadcasts all results to all RSs
• Extend DLX as example
  – Assume multiple FUs rather than pipelined
  – Main difference is Register-Memory Insts.
    • I.e., DLX does not have them
    • But that’s really a detail :-)
• Physical Registers?
  – Not really. What we need is different storage and name for every version.
  – Here it’s the producing reservation station

Dynamic DLX
[Figure: operation stack, registers, load buffers, store buffers, reservation stations (RS) feeding the adders and multipliers, and the CDB]
Tomasulo’s Algorithm
• 3 major steps
  – Dispatch
    • Get instruction from fetch queue
    • ALU op: check for available RS
    • Load: check for available load buffer
    • If available: dispatch and copy read regs to RS or load buffer
    • If not: stall - structural hazard
  – Issue
    • If all ops are available: issue
    • If not: monitor CDB for operands
  – Complete
    • If CDB available, broadcast result
    • Else stall

Tomasulo’s Algorithm contd.
• Reservation stations
  – Handle distributed hazard detection and instruction control
• Everything receiving data gets its tag
  – 4-bit tag specifies reservation station or load buffer
  – Also which FU will produce result
• Register specifier is used to assign tags
  – Then they are discarded
  – Input register specifiers are ONLY used in dispatch. (Rename table)
• Common Data Bus:
  – value + “tag” = where this comes from
  – vs. typical bus: value + “tag” = where this goes to

Tomasulo’s Algorithm Contd.
• Reservation Stations
  – Op: Opcode
  – Qj, Qk: Tag fields (source ops)
  – Vj, Vk: Operand values (source ops)
  – Busy: Currently in use
• Register file and Store Buffer
  – Qi: Tag field
  – Busy: Currently in use
  – Vi: Value
• Load Buffers
  – Busy: Currently in use

Tomasulo’s: Understanding Speculative vs. Architectural State
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
Register file (one entry per arch. reg. name)
    Reg   Tag         Value
    r1    I have it   Value of r1
    r2    I have it   Value of r2
    r3    I have it   Value of r3
    r4    I have it   Value of r4
  Tag can be: “I have it”, “reservation station id”
Reservation Stations
    tgt (arch. name)   src1 tag   src1 value      src2 tag   src2 value
    NA                 NA         Value of Src1   NA         Value of Src2
    NA                 NA         Value of Src1   NA         Value of Src2
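The three steps (dispatch, issue, complete) and the tag fields listed above can be tied together in a toy model run on the three-instruction sequence add r1, r2, 10 / sub r4, r1, 20 / add r1, r3, 30. This is a sketch under simplifying assumptions, not the IBM 360/91 design: there is one pool of reservation stations, the second operand is always an immediate, one result is broadcast per call, and the class Tomasulo with its methods is hypothetical.

```python
class Tomasulo:
    """Toy tag-based renaming: registers hold either a value or a producer tag."""

    def __init__(self, regs, num_rs=3):
        self.reg_value = dict(regs)
        self.reg_tag = {r: None for r in regs}   # None means "I have it"
        self.rs = [None] * num_rs                # reservation stations

    def dispatch(self, op, tgt, src, imm):
        for tag, entry in enumerate(self.rs):
            if entry is None:
                break
        else:
            return None                          # no free RS: structural hazard
        qj = self.reg_tag[src]                   # producer tag, if value not yet available
        vj = self.reg_value[src] if qj is None else None
        self.rs[tag] = {"op": op, "tgt": tgt, "Qj": qj, "Vj": vj, "Vk": imm}
        self.reg_tag[tgt] = tag                  # rename target to this RS
        return tag

    def ready(self, tag):
        return self.rs[tag]["Qj"] is None        # issue only when operands are available

    def complete(self, tag):
        entry = self.rs[tag]
        assert entry["Qj"] is None, "operand still pending"
        a, b = entry["Vj"], entry["Vk"]
        result = a + b if entry["op"] == "add" else a - b
        self.rs[tag] = None
        # CDB broadcast: value + tag of the producer, seen by everyone.
        for other in self.rs:
            if other is not None and other["Qj"] == tag:
                other["Vj"], other["Qj"] = result, None
        for reg, t in self.reg_tag.items():
            if t == tag:                         # only the latest version reaches the register
                self.reg_value[reg], self.reg_tag[reg] = result, None
        return result


t = Tomasulo({"r1": 0, "r2": 5, "r3": 7, "r4": 0})
rs0 = t.dispatch("add", "r1", "r2", 10)   # r1 renamed to RS0
rs1 = t.dispatch("sub", "r4", "r1", 20)   # source r1 comes from RS0
rs2 = t.dispatch("add", "r1", "r3", 30)   # r1 renamed again, to RS2
t.complete(rs2)                           # may finish before the first add
t.complete(rs0)                           # feeds RS1, but not register r1
t.complete(rs1)
print(t.reg_value)                        # {'r1': 37, 'r2': 5, 'r3': 7, 'r4': -5}
```

Because the register file tracks only the tag of the latest producer, the first add's result reaches its consumer through the broadcast but never overwrites r1, which is how the WAW and WAR orderings disappear in this model.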
Renaming 1st Instruction
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
• Read sources (r2)
• Rename r1 to RS0
Register file
    Reg   Tag         Value
    r1    RS0         -----
    r2    I have it   Value of r2
    r3    I have it   Value of r3
    r4    I have it   Value of r4
Reservation Stations
    id    tgt   src1 tag    src1 value      src2 tag    src2 value
    RS0   r1    I have it   Value of r2     I have it   10
    RS1   NA    NA          Value of Src1   NA          Value of Src2
    RS2   NA    NA          Value of Src1   NA          Value of Src2

Renaming 2nd Instruction
• Sources: r1 in RS0, not yet available
• Rename r4 to RS1
Register file
    Reg   Tag         Value
    r1    RS0         -----
    r2    I have it   Value of r2
    r3    I have it   Value of r3
    r4    RS1         -----
Reservation Stations
    id    tgt   src1 tag    src1 value      src2 tag    src2 value
    RS0   r1    I have it   Value of r2     I have it   10
    RS1   r4    RS0         -----           I have it   20
    RS2   NA    NA          Value of Src1   NA          Value of Src2

Renaming 3rd Instruction
• Sources: r3 available
• Rename r1 to RS2
Register file
    Reg   Tag         Value
    r1    RS2         -----
    r2    I have it   Value of r2
    r3    I have it   Value of r3
    r4    RS1         -----
Reservation Stations
    id    tgt   src1 tag    src1 value      src2 tag    src2 value
    RS0   r1    I have it   Value of r2     I have it   10
    RS1   r4    RS0         -----           I have it   20
    RS2   r1    I have it   Value of r3     I have it   30

Example: cycle 0
Instruction status
    Instruction   dest  j     k     Issue  Execution complete  Write Result
    LD            F6    34+   R2
    LD            F2    45+   R3
    MULTD         F0    F2    F4
    SUBD          F8    F6    F2
    DIVD          F10   F0    F6
    ADDD          F6    F8    F2
Reservation Stations
    Time  Name   Busy  Op  Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
    0     Add1   No
    0     Add2   No
    0     Add3   No
    0     Mult1  No
    0     Mult2  No
Load buffers
    Name   Busy  Address
    Load1  No
    Load2  No
    Load3  No
Register result status
          F0  F2  F4  F6  F8  F10  ...
    FU    (all empty)

Example: cycle 1
Instruction status
    Instruction   dest  j     k     Issue  Execution complete  Write Result
    LD            F6    34+   R2    1
    LD            F2    45+   R3
    MULTD         F0    F2    F4
    SUBD          F8    F6    F2
    DIVD          F10   F0    F6
    ADDD          F6    F8    F2
Reservation Stations
    Time  Name  Busy  Op  Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
    0     A1    No
    0     A2    No
    0     A3    No
    0     M1    No
    0     M2    No
Load buffers
    Name  Busy  Address
    L1    yes   34+R2
    L2    No
    L3    No
Register result status
          F0  F2  F4  F6  F8  F10  ...
    FU                L1

Example: cycle 3
Instruction status
    Instruction   dest  j     k     Issue  Execution complete  Write Result
    LD            F6    34+   R2    1      3
    LD            F2    45+   R3    2
    MULTD         F0    F2    F4    3
    SUBD          F8    F6    F2
    DIVD          F10   F0    F6
    ADDD          F6    F8    F2
Reservation Stations
    Time  Name  Busy  Op   Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
    0     A1    No
    0     A2    No
    0     A3    No
    0     M1    Yes   Mul           R(F4)    L2
    0     M2    No
Load buffers
    Name  Busy  Address
    L1    yes   34+R2
    L2    yes   45+R3
    L3    No
Register result status
          F0  F2  F4  F6  F8  F10  ...
    FU    M1  L2      L1
• Mul is issued, vs. scoreboard
• What’s waiting for L1?

Example…
• Check the web site…
• Too much for in-class
• Summary:
  – Execution proceeds in any order that does not violate RAW dependences
  – WAR and WAW are removed

Tomasulo’s vs. Scoreboard
Tomasulo’s: Instruction status
    Instruction   dest  j     k     Issue  Execution complete  Write Result
    LD            F6    34+   R2    1      3                   4
    LD            F2    45+   R3    2      4                   5
    MULTD         F0    F2    F4    3      15                  16
    SUBD          F8    F6    F2    4      7                   8
    DIVD          F10   F0    F6    5      56                  57
    ADDD          F6    F8    F2    6      10                  11
Scoreboard: Instruction status
    Instruction   dest  j     k     Issue  Read operands  Execution complete  Write Result
    LD            F6    34+   R2    1      2              3                   4
    LD            F2    45+   R3    5      6              7                   8
    MULTD         F0    F2    F4    6      9              19                  20
    SUBD          F8    F6    F2    7      9              11                  12
    DIVD          F10   F0    F6    8      21             61                  62
    ADDD          F6    F8    F2    13     14             16                  22
• In-order issue
• Out-of-order execution
• Out-of-order completion

Tomasulo’s
• Out-of-order loads and stores?
  – What about WAW, RAW and WAR?
  – Compare all load addresses against the addresses of all preceding store buffers
  – Stall if they match
• CDB is a bottleneck
  – One write per cycle
  – Could duplicate
  – But it comes at a cost
  – Datapath + duplicated tags and control
• Complex Implementation
  – Scalability?
  – All results to all sources
  – What if we want 128 instrs?
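The load/store check described above (compare the load address against all preceding store buffers, stall on a match) reduces to a small predicate. This is a hedged sketch: the function name load_may_issue is made up, and treating a store whose address is not yet known (None) as a potential match is a conservative assumption added here, not something the slide states.

```python
def load_may_issue(load_addr, preceding_store_addrs):
    """Return True if the load matches no preceding store buffer entry.

    preceding_store_addrs holds the addresses of stores that come before the
    load in program order and are still in store buffers; None marks an
    address that has not been computed yet (assumed to conflict).
    """
    for store_addr in preceding_store_addrs:
        if store_addr is None or store_addr == load_addr:
            return False   # possible RAW through memory: stall the load
    return True


print(load_may_issue(0x100, [0x200, 0x300]))   # True
print(load_may_issue(0x100, [0x200, 0x100]))   # False
```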
Tomasulo’s
• Advantages
  – Distribution of hazard detection
  – Elimination of WAR and WAW stalls
• Common Data Bus
  – Broadcasts result to multiple instrs (+)
  – Bottleneck
• Register Renaming
  – Removes WAR and WAW hazards
  – More interesting when same code appears twice
    • Think of loops
    • More on this later
  – BUT: Associative lookups
  – RECALL: direct map is faster

In Summary
    Feature       Scoreboarding (CDC 6600)     Tomasulo’s (IBM 360)
    Structural    Stall in Issue for FU        Stall in Dispatch for RS; Stall in RS for FU
    RAW           Via Registers                From CDB
    WAR           Stall in WB                  Copy Value to RS
    WAW           Stall in Issue               Register Renaming
    Logic         Centralized                  Distributed
    Bottlenecks   No Register Bypass;          One CDB
                  Stall in issue block