CS252 Graduate Computer Architecture
Lecture 7
Dynamic Scheduling 2: Precise Interrupts
February 9th, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

2/9/2011 CS252-S11, lecture 7

Review: Scoreboard (CDC 6600)
[Figure: Registers and Memory connected to the Functional Units (FP Mult, FP Mult, FP Divide, FP Add, Integer), all under control of the SCOREBOARD]

Review: Four Stages of Scoreboard Control
• Issue: decode instructions & check for structural hazards
  – Instructions issued in program order (for hazard checking)
  – Don’t issue if structural hazard
  – Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
• Read operands: wait until operands are ready, then read them
  – All real dependencies (RAW hazards) resolved in this stage
  – No forwarding of data in this model!
• Execution: operate on operands
  – The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
• Write result: finish execution
  – Stall if WAR hazards with previous instructions:
    Example:
      DIVD F0,F2,F4
      ADDD F10,F0,F8
      SUBD F8,F8,F14
    CDC 6600 scoreboard would stall SUBD until ADDD reads operands

Review: Tomasulo Organization
[Figure: FP Op Queue and FP Registers feed the Reservation Stations (Add1, Add2, Add3 for the FP adders; Mult1, Mult2 for the FP multipliers). Load Buffers (Load1 to Load6) receive data From Mem, Store Buffers send data To Mem, and results travel on the Common Data Bus (CDB).]

Review: Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast

Review: Compare to Scoreboard Cycle 62
Instruction status (scoreboard):
Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
LD    F6     34+  R2   1      2          3          4
LD    F2     45+  R3   5      6          7          8
MULTD F0     F2   F4   6      9          19         20
SUBD  F8     F6   F2   7      9          11         12
DIVD  F10    F0   F6   8      21         61         62
ADDD  F6     F8   F2   13     14         16         22

Instruction status (Tomasulo):
Instruction  Issue  Exec Compl  Write Result
LD    F6     1      3           4
LD    F2     2      4           5
MULTD F0     3      15          16
SUBD  F8     4      7           8
DIVD  F10    5      56          57
ADDD  F6     6      10          11
• Why take longer on scoreboard/6600?
  • Structural Hazards
  • Lack of forwarding

Review: Loop Example Cycle 9
Instruction status:
ITER  Instruction  j    k    Issue  Exec Comp  Write Result
1     LD    F0     0    R1   1      9
1     MULTD F4     F0   F2   2
1     SD    F4     0    R1   3
2     LD    F0     0    R1   6
2     MULTD F4     F0   F2   7
2     SD    F4     0    R1   8

Reservation Stations:
Time  Name   Busy  Op     Vj  Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd      R(F2)  Load1
      Mult2  Yes   Multd      R(F2)  Load2

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Code:
LD    F0  0   R1
MULTD F4  F0  F2
SD    F4  0   R1
SUBI  R1  R1  #8
BNEZ  R1  Loop

Register result status (Clock 9, R1 = 72): F0 = Load2, F4 = Mult2; F2, F6, F8, F10, F12, ..., F30 empty
• Dataflow graph constructed completely in hardware
• Renaming detaches early iterations from registers

Explicit Register Renaming
• Tomasulo provides Implicit Register Renaming
  – User registers renamed to reservation station tags
• Explicit Register Renaming:
  – Use physical register file that is larger than number of registers specified by ISA
• Keep a translation table:
  – ISA register => physical register mapping
  – When register is written, replace table entry with new register from freelist.
  – Physical register becomes free when not being used by any instructions in progress.
• Pipeline can be exactly like “standard” DLX pipeline
  – IF, ID, EX, etc.
• Advantages:
  – Removes all WAR and WAW hazards
  – Like Tomasulo, good for allowing full out-of-order completion
  – Allows data to be fetched from a single register file
  – Makes speculative execution/precise interrupts easier:
    » All that needs to be “undone” for precise break point is to undo the table mappings

Question: Can we use explicit register renaming with scoreboard?
[Figure: the scoreboard organization (Registers, Memory, Functional Units: FP Mult, FP Mult, FP Divide, FP Add, Integer; SCOREBOARD) with a Rename Table added]

Scoreboard Example
Instruction status:
Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
LD    F6     34+  R2
LD    F2     45+  R3
MULTD F0     F2   F4
SUBD  F8     F6   F2
DIVD  F10    F0   F6
ADDD  F6     F8   F2
Functional unit status:
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
      Mult1   No
      Add     No
      Divide  No
Register Rename and Result: F0=P0 F2=P2 F4=P4 F6=P6 F8=P8 F10=P10 F12=P12 ... F30=P30
• Initialized Rename Table

Renamed Scoreboard 1
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1
LD    F2
MULTD F0
SUBD  F8
DIVD  F10
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    Yes   Load   P32       R2                      Yes
      Int2    No
      Mult1   No
      Add     No
      Divide  No
Register Rename and Result (Clock 1): F0=P0 F2=P2 F4=P4 F6=P32 F8=P8 F10=P10 F12=P12 ... F30=P30
• Each instruction allocates free register
• Similar to single-assignment compiler transformation

Renamed Scoreboard 2
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2
LD    F2     2
MULTD F0
SUBD  F8
DIVD  F10
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    Yes   Load   P32       R2                      Yes
      Int2    Yes   Load   P34       R3                      Yes
      Mult1   No
      Add     No
      Divide  No
Register Rename and Result (Clock 2): F0=P0 F2=P34 F4=P4 F6=P32 F8=P8 F10=P10 F12=P12 ... F30=P30

Renamed Scoreboard 3
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3
LD    F2     2      3
MULTD F0     3
SUBD  F8
DIVD  F10
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    Yes   Load   P32       R2                      Yes
      Int2    Yes   Load   P34       R3                      Yes
      Mult1   Yes   Multd  P36  P34  P4   Int2          No   Yes
      Add     No
      Divide  No
Register Rename and Result (Clock 3): F0=P36 F2=P34 F4=P4 F6=P32 F8=P8 F10=P10 F12=P12 ... F30=P30

Renamed Scoreboard 4
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4
MULTD F0     3
SUBD  F8     4
DIVD  F10
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    Yes   Load   P34       R3                      Yes
      Mult1   Yes   Multd  P36  P34  P4   Int2          No   Yes
      Add     Yes   Sub    P38  P32  P34         Int2   Yes  No
      Divide  No
Register Rename and Result (Clock 4): F0=P36 F2=P34 F4=P4 F6=P32 F8=P38 F10=P10 F12=P12 ... F30=P30

Renamed Scoreboard 5
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3
SUBD  F8     4
DIVD  F10    5
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
      Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
      Add     Yes   Sub    P38  P32  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 5): F0=P36 F2=P34 F4=P4 F6=P32 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 6
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6
DIVD  F10    5
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
10    Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
2     Add     Yes   Sub    P38  P32  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 6): F0=P36 F2=P34 F4=P4 F6=P32 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 7
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6
DIVD  F10    5
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
9     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
1     Add     Yes   Sub    P38  P32  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 7): F0=P36 F2=P34 F4=P4 F6=P32 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 8
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8
DIVD  F10    5
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
8     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
0     Add     Yes   Sub    P38  P32  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 8): F0=P36 F2=P34 F4=P4 F6=P32 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 9
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
7     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
      Add     No
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 9): F0=P36 F2=P34 F4=P4 F6=P32 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 10
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
6     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
      Add     Yes   Addd   P42  P38  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 10): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30
WAR Hazard gone!
• Notice that P32 not listed in Rename Table
  – Still live. Must not be reallocated by accident

Renamed Scoreboard 11
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10     11
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
5     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
2     Add     Yes   Addd   P42  P38  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 11): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 12
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10     11
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
4     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
1     Add     Yes   Addd   P42  P38  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 12): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 13
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10     11         13
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
3     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
0     Add     Yes   Addd   P42  P38  P34                Yes  Yes
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 13): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 14
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10     11         13         14
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
2     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
      Add     No
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 14): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 15
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10     11         13         14
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
1     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
      Add     No
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 15): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 16
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6          16
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10     11         13         14
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
0     Mult1   Yes   Multd  P36  P34  P4                 Yes  Yes
      Add     No
      Divide  Yes   Divd   P40  P36  P32  Mult1         No   Yes
Register Rename and Result (Clock 16): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 17
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6          16         17
SUBD  F8     4      6          8          9
DIVD  F10    5
ADDD  F6     10     11         13         14
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
      Mult1   No
      Add     No
      Divide  Yes   Divd   P40  P36  P32  Mult1         Yes  Yes
Register Rename and Result (Clock 17): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Renamed Scoreboard 18
Instruction  Issue  Read Oper  Exec Comp  Write Result
LD    F6     1      2          3          4
LD    F2     2      3          4          5
MULTD F0     3      6          16         17
SUBD  F8     4      6          8          9
DIVD  F10    5      18
ADDD  F6     10     11         13         14
Time  Name    Busy  Op     Fi   Fj   Fk   Qj     Qk     Rj   Rk
      Int1    No
      Int2    No
      Mult1   No
      Add     No
40    Divide  Yes   Divd   P40  P36  P32  Mult1         Yes  Yes
Register Rename and Result (Clock 18): F0=P36 F2=P34 F4=P4 F6=P42 F8=P38 F10=P40 F12=P12 ... F30=P30

Explicit Renaming Support Includes:
• Rapid access to a table of translations
• A physical register file that has more registers than specified by the ISA
• Ability to figure out which physical registers are free.
  – No free registers ⇒ stall on issue
• Thus, register renaming doesn’t require reservation stations. However:
  – Many modern architectures use explicit register renaming + Tomasulo-like reservation stations to control execution.

How many instructions can be in the pipeline?
Which features of an ISA limit the number of instructions in the pipeline?
  Number of Registers
Which features of a program limit the number of instructions in the pipeline?
  Control transfers
Out-of-order dispatch by itself does not provide any significant performance improvement!

How important is renaming?
Consider execution without it
     Instruction          latency
1    LD    F2, 34(R2)     1
2    LD    F4, 45(R3)     long
3    MULTD F6, F4, F2     3
4    SUBD  F8, F2, F2     1
5    DIVD  F4, F2, F8     4
6    ADDD  F10, F6, F4    1
[Figure: dependence graph over instructions 1 to 6]
In-order:     1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6
Out-of-order execution did not allow any significant improvement!

Little’s Law
Throughput (T) = Number in Flight (N) / Latency (L)
[Figure: instructions in flight between Issue, Execution, and WB]
Example:
• 4 floating point registers
• 8 cycles per floating point operation
⇒ maximum of ½ issue per cycle without renaming!

Instruction-level Parallelism via Renaming
     Instruction          latency
1    LD    F2, 34(R2)     1
2    LD    F4, 45(R3)     long
3    MULTD F6, F4, F2     3
4    SUBD  F8, F2, F2     1
5    DIVD  F4’, F2, F8    4
6    ADDD  F10, F6, F4’   1
[Figure: dependence graph over instructions 1 to 6, with one edge marked X]
In-order:     1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6
Any antidependence can be eliminated by renaming.
(renaming ⇒ additional storage)
Can be done either in Software or Hardware

Administrative
• Midterm I: Wednesday 3/14
  Location: 320 Soda Hall
  TIME: 2:30-5:30
  – Can have 1 sheet of 8½x11 handwritten notes, both sides
  – No microfiche of the book!
• This info is on the Lecture page
• Meet at LaVal’s afterwards for Pizza and Beverages
  – Great way for me to get to know you better
  – I’ll Buy!
• Don’t quite have new set of papers up. Will do that by tomorrow at the latest.

Data-Flow Architectures
• Basic Idea: Hardware represents direct encoding of compiler dataflow graphs:
    Input: a,b
    y := (a+b)/x
    x := (a*(a+b))+b
    output: y,x
  [Figure: dataflow graph for this example, with inputs A and B, operator nodes +, *, /, +, initial token X(0), and outputs X and Y]
• Data flows along arcs in “Tokens”.
• When two tokens arrive at compute box, box “fires” and produces new token.
• Split operations produce copies of tokens

Paper by Dennis and Misunas
[Figure: Memory made of Instruction Cell 0, Instruction Cell 1, ..., Instruction Cell n-1, each cell holding Instruction, Operand 1, Operand 2 (“Reservation Station?”); Operation Packets, Data Packets, and Operation Unit 0 through Operation Unit m-1]

What about Precise Exceptions/Interrupts?
• Both Scoreboard and Tomasulo have:
  – In-order issue, out-of-order execution, out-of-order completion
• Recall: An interrupt or exception is precise if there is a single instruction for which:
  – All instructions before that have committed their state
  – No following instructions (including the interrupting instruction) have modified any state.
• Need way to resynchronize execution with instruction stream (i.e., with issue order)
  – Easiest way is with in-order completion (i.e., reorder buffer)
  – Other Techniques (Smith paper): Future File, History Buffer

Exception Handling (In-Order Five-Stage Pipeline)
[Figure: PC, Inst. Mem, Decode (D), + (E), Data Mem (M), W. Exception sources: PC Address Exceptions, Illegal Opcode, Overflow, Data Addr Except, Asynchronous Interrupts. Exc D, Exc E, Exc M and PC D, PC E, PC M are carried down the pipe to the Commit Point, which writes Cause and EPC, selects the Handler PC, and drives Kill F Stage, Kill D Stage, Kill E Stage, Kill Writeback.]
• Hold exception flags in pipeline until commit point (M stage)
• Exceptions in earlier pipe stages override later exceptions
• Inject external interrupts at commit point (override others)
• If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage

Complex In-Order Pipeline: Precise Exceptions
[Figure: PC, Inst. Mem, Decode, GPRs, FPRs feeding parallel pipes (X1, +, X2, Data Mem, X3, ..., Xn, W; X1, X2, Fadd, X3, ..., Xn, W; X2, Fmul, X3, ..., Xn; X2, unpipelined FDiv divider, X3, ..., Xn), with the Commit Point at the end]
• Delay writeback so all operations have same latency to W stage
  – Write ports never oversubscribed (one inst. in & one inst. out every cycle)
  – Instructions commit in order, simplifies precise exception implementation
• How to prevent increased latency for single-cycle ops?
  – Bypassing
  – However: can be very expensive
• Other downside: no out-of-order execution

In-Order Commit for Precise Exceptions
[Figure: Fetch and Decode (in-order) fill the Reorder Buffer; Execute (out-of-order); Commit (in-order). On “Exception?” at commit: Kill, Kill, Kill to the earlier stages and Inject handler PC.]
• Instructions fetched and decoded into instruction reorder buffer in-order
• Execution is out-of-order (⇒ out-of-order completion)
• Commit (write-back to architectural state, i.e., regfile & memory) is in-order
Temporary storage needed to hold results before commit (shadow registers and store buffers)

What are the hardware complexities with reorder buffer (ROB)?
[Figure: FP Op Queue, Reorder Buffer with Compar network, FP Regs, Res Stations in front of two FP Adders. Reorder Table fields: Dest Reg, Result, Exceptions?, Valid, Program Counter.]
• How do you find the latest version of a register?
  – As specified by Smith paper, need associative comparison network
  – Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
• Need as many ports on ROB as register file

Four Steps of Speculative Tomasulo
1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4.
Commit: update register with reorder result
   When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)

Tomasulo With Reorder buffer:
(Each snapshot lists the occupied reorder buffer entries, newest first; ROB1 is the oldest. Reservation stations and load buffers show the destination ROB number followed by the operation.)

Snapshot 1
Entry  Dest  Value   Instruction       Done?
ROB1   F0            LD F0,10(R2)      N
Load buffers (from Memory): 1: 10+R2

Snapshot 2
Entry  Dest  Value   Instruction       Done?
ROB2   F10           ADDD F10,F4,F0    N
ROB1   F0            LD F0,10(R2)      N
Reservation stations (FP adders): 2: ADDD R(F4),ROB1
Load buffers (from Memory): 1: 10+R2

Snapshot 3
Entry  Dest  Value   Instruction       Done?
ROB3   F2            DIVD F2,F10,F6    N
ROB2   F10           ADDD F10,F4,F0    N
ROB1   F0            LD F0,10(R2)      N
Reservation stations (FP adders): 2: ADDD R(F4),ROB1
Reservation stations (FP multipliers): 3: DIVD ROB2,R(F6)
Load buffers (from Memory): 1: 10+R2

Snapshot 4
Entry  Dest  Value   Instruction       Done?
ROB6   F0            ADDD F0,F4,F6     N
ROB5   F4            LD F4,0(R3)       N
ROB4   --            BNE F2,<…>        N
ROB3   F2            DIVD F2,F10,F6    N
ROB2   F10           ADDD F10,F4,F0    N
ROB1   F0            LD F0,10(R2)      N
Reservation stations (FP adders): 2: ADDD R(F4),ROB1; 6: ADDD ROB5,R(F6)
Reservation stations (FP multipliers): 3: DIVD ROB2,R(F6)
Load buffers (from Memory): 1: 10+R2; 5: 0+R3

Snapshot 5
Entry  Dest  Value   Instruction       Done?
ROB7   --    ROB5    ST 0(R3),F4       N
ROB6   F0            ADDD F0,F4,F6     N
ROB5   F4            LD F4,0(R3)       N
ROB4   --            BNE F2,<…>        N
ROB3   F2            DIVD F2,F10,F6    N
ROB2   F10           ADDD F10,F4,F0    N
ROB1   F0            LD F0,10(R2)      N
Reservation stations (FP adders): 2: ADDD R(F4),ROB1; 6: ADDD ROB5,R(F6)
Reservation stations (FP multipliers): 3: DIVD ROB2,R(F6)
Load buffers (from Memory): 1: 10+R2; 5: 0+R3

Snapshot 6
Entry  Dest  Value   Instruction       Done?
ROB7   --    M[10]   ST 0(R3),F4       Y
ROB6   F0            ADDD F0,F4,F6     N
ROB5   F4    M[10]   LD F4,0(R3)       Y
ROB4   --            BNE F2,<…>        N
ROB3   F2            DIVD F2,F10,F6    N
ROB2   F10           ADDD F10,F4,F0    N
ROB1   F0            LD F0,10(R2)      N
Reservation stations (FP adders): 2: ADDD R(F4),ROB1; 6: ADDD M[10],R(F6)
Reservation stations (FP multipliers): 3: DIVD ROB2,R(F6)
Load buffers (from Memory): 1: 10+R2

Snapshot 7
Entry  Dest  Value   Instruction       Done?
ROB7   --    M[10]   ST 0(R3),F4       Y
ROB6   F0    <val2>  ADDD F0,F4,F6     Ex
ROB5   F4    M[10]   LD F4,0(R3)       Y
ROB4   --            BNE F2,<…>        N
ROB3   F2            DIVD F2,F10,F6    N
ROB2   F10           ADDD F10,F4,F0    N
ROB1   F0            LD F0,10(R2)      N
Reservation stations (FP adders): 2: ADDD R(F4),ROB1
Reservation stations (FP multipliers): 3: DIVD ROB2,R(F6)
Load buffers (from Memory): 1: 10+R2

Snapshot 8 (same state as snapshot 7)
What about memory hazards???

Memory Disambiguation: Sorting out RAW Hazards in memory
• Question: Given a load that follows a store in program order, are the two related?
  – (Alternatively: is there a RAW hazard between the store and the load?)
    E.g.:
      st 0(R2),R5
      ld R6,0(R3)
• Can we go ahead and start the load early?
  – Store address could be delayed for a long time by some calculation that leads to R2 (divide?).
  – We might want to issue/begin execution of both operations in same cycle.
  – Today: Answer is that we are not allowed to start load until we know that address 0(R2) ≠ 0(R3)
  – Next Week: We might guess at whether or not they are dependent (called “dependence speculation”) and use reorder buffer to fix up if we are wrong.

Hardware Support for Memory Disambiguation
• Need buffer to keep track of all outstanding stores to memory, in program order.
  – Keep track of address (when it becomes available) and value (when it becomes available)
  – FIFO ordering: will retire stores from this buffer in program order
• When issuing a load, record current head of store queue (know which stores are ahead of you).
• When have address for load, check store queue:
  – If any store prior to load is waiting for its address, stall load.
  – If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:
    » store value available ⇒ return value
    » store value not available ⇒ return ROB number of source
  – Otherwise, send out request to memory
• Actual stores commit in order, so no worry about WAR/WAW hazards through memory.

Memory Disambiguation:
Entry  Dest  Value    Instruction       Done?
ROB4   F4             LD F4,10(R3)      N
ROB3   --    R[F5]    ST 10(R3),F5      N
ROB2   F0             LD F0,32(R2)      N
ROB1   --    <val 1>  ST 0(R3),F4       Y
Load buffers (from Memory): 2: 32+R2; 4: ROB3

Relationship between precise interrupts and speculation:
• Speculation is a form of guessing
  – Branch prediction, data prediction
  – If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly
  – This is exactly same as precise exceptions!
• Branch prediction is very important!
  – Need to “take our best shot” at predicting branch direction.
  – If we issue multiple instructions per cycle, lose lots of potential instructions otherwise:
    » Consider 4 instructions per cycle
    » If take single cycle to decide on branch, waste from 4 - 7 instruction slots!
• Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
  – This is why reorder buffers appear in all new processors

Quick Recap: Explicit Register Renaming
• Make use of a physical register file that is larger than number of registers specified by ISA
• Keep a translation table:
  – ISA register => physical register mapping
  – When register is written, replace table entry with new register from freelist.
  – Physical register becomes free when not being used by any instructions in progress.
[Figure: Fetch, Decode/Rename, Execute, with the Rename Table used in Decode/Rename]

Explicit register renaming: R10000 Freelist Management
(In each snapshot the Current Map Table lists the physical register for F0, F2, ..., F30 in order. Reorder buffer rows are newest first and show the architectural destination, the physical register it was previously mapped to, the renamed instruction, and Done?.)

Snapshot 1
Current Map Table: P0 P2 P4 P6 P8 P10 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P32 P34 P36 P38 ... P60 P62
• Physical register file larger than ISA register file
• On issue, each instruction that modifies a register is allocated new physical register from freelist
• Used on: R10000, Alpha 21264, HP PA8000

Snapshot 2
Current Map Table: P32 P2 P4 P6 P8 P10 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P34 P36 P38 P40 ... P60 P62
F0   P0   LD P32,10(R2)      N
• Note that physical register P0 is “dead” (or not “live”) past the point of this load.
  – When we go to commit the load, we free up

Snapshot 3
Current Map Table: P32 P2 P4 P6 P8 P34 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P36 P38 P40 P42 ... P60 P62
F10  P10  ADDD P34,P4,P32    N
F0   P0   LD P32,10(R2)      N

Snapshot 4
Current Map Table: P32 P36 P4 P6 P8 P34 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P38 P40 P44 P48 ... P60 P62
--        BNE P36,<…>        N
F2   P2   DIVD P36,P34,P6    N
F10  P10  ADDD P34,P4,P32    N
F0   P0   LD P32,10(R2)      N
Checkpoint at BNE instruction:
  Map Table: P32 P36 P4 P6 P8 P34 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
  Freelist: P38 P40 P44 P48 ... P60 P62

Snapshot 5
Current Map Table: P40 P36 P38 P6 P8 P34 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P42 P44 P48 P50 ... P0 P10
--        ST 0(R3),P40       Y
F0   P32  ADDD P40,P38,P6    Y
F4   P4   LD P38,0(R3)       Y
--        BNE P36,<…>        N
F2   P2   DIVD P36,P34,P6    N
F10  P10  ADDD P34,P4,P32    Y
F0   P0   LD P32,10(R2)      Y
Checkpoint at BNE instruction:
  Map Table: P32 P36 P4 P6 P8 P34 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
  Freelist: P38 P40 P44 P48 ... P60 P62

Snapshot 6
Current Map Table: P32 P36 P4 P6 P8 P34 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P38 P40 P44 P48 ... P0 P10
F2   P2   DIVD P36,P34,P6    N
F10  P10  ADDD P34,P4,P32    Y
F0   P0   LD P32,10(R2)      Y
Error fixed by restoring map table and merging freelist
Checkpoint at BNE instruction:
  Map Table: P32 P36 P4 P6 P8 P34 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
  Freelist: P38 P40 P44 P48 ... P60 P62

Advantages of Explicit Renaming
• Decouples renaming from scheduling:
  – Pipeline can be exactly like “standard” DLX pipeline (perhaps with multiple operations issued per cycle)
  – Or, pipeline could be Tomasulo-like or a scoreboard, etc.
  – Standard forwarding or bypassing could be used
• Allows data to be fetched from single register file
  – No need to bypass values from reorder buffer
  – This can be important for balancing pipeline
• Many processors use a variant of this technique:
  – R10000, Alpha 21264, HP PA8000
• Another way to get precise interrupt points:
  – All that needs to be “undone” for precise break point is to undo the table mappings
  – Provides an interesting mix between reorder buffer and future file
    » Results are written immediately back to register file
    » Register names are “freed” in program order (by ROB)

Superscalar Register Renaming
• During decode, instructions allocated new physical destination register
• Source operands renamed to physical register with newest value
• Execution unit only sees physical register numbers
[Figure: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) send Read Addresses to the Rename Table and take destinations from the Register Free List; Write Ports update the mapping; Read Data yields Op, PDest, PSrc1, PSrc2 for each instruction]
Does this work?

Superscalar Register Renaming (Try #2)
[Figure: the same datapath with =? comparators between the destination of Inst 1 and the sources of Inst 2]
Must check for RAW hazards between instructions issuing in same cycle. Can be done in parallel with rename lookup.
MIPS R10K renames 4 serially-RAW-dependent insts/cycle

Summary
• DataFlow view:
  – Data triggers execution rather than instructions triggering data
• Dynamic hardware schemes can unroll loops dynamically in hardware
  – Form of limited dataflow
  – Register renaming is essential
• Explicit Renaming: more physical registers than needed by ISA.
  – Rename table: tracks current association between architectural registers and physical registers
  – Uses a translation table to perform compiler-like transformation on the fly
• Precise Interrupts:
  – Must commit things back in order
  – Reorder buffer: temporarily holds results until commit possible
  – Toss out things to achieve precise interrupt point
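
The rename table and freelist summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R10000 mechanism: the class `RenameTable`, its methods, and `demo` are hypothetical names, every even-numbered register from P32 to P62 starts on the freelist, and integer base registers such as R2 are not renamed.

```python
from collections import deque


class RenameTable:
    """Toy explicit renamer: architectural names (F0) map to physical names (P0)."""

    def __init__(self, arch_regs, free_regs):
        # Assumption: Fn starts out mapped to Pn, like an initialized rename table.
        self.table = {r: "P" + r[1:] for r in arch_regs}
        self.free = deque(free_regs)

    def rename(self, dest, srcs):
        """Return (new_phys_dest, phys_srcs, old_phys_dest) for one instruction."""
        phys_srcs = [self.table[s] for s in srcs]  # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free physical register: stall on issue")
        new = self.free.popleft()
        old = self.table[dest]  # still live; released only when this instruction commits
        self.table[dest] = new
        return new, phys_srcs, old

    def release(self, phys):
        """At commit, the previous mapping of the destination becomes free."""
        self.free.append(phys)

    def checkpoint(self):
        """Snapshot the map table and freelist, e.g. at a branch."""
        return dict(self.table), list(self.free)

    def restore(self, snapshot):
        """Undo wrong-path renames: restore the map, merge registers freed since."""
        table, free = snapshot
        freed_since = [p for p in self.free if p not in free]
        self.table = dict(table)
        self.free = deque(free + freed_since)


def demo():
    arch = ["F%d" % n for n in range(0, 32, 2)]
    rt = RenameTable(arch, ["P%d" % n for n in range(32, 64, 2)])
    rt.rename("F0", [])              # LD   F0,10(R2)
    rt.rename("F10", ["F4", "F0"])   # ADDD F10,F4,F0
    rt.rename("F2", ["F10", "F6"])   # DIVD F2,F10,F6
    snap = rt.checkpoint()           # checkpoint at the BNE
    rt.rename("F4", [])              # LD   F4,0(R3)   (speculative)
    rt.rename("F0", ["F4", "F6"])    # ADDD F0,F4,F6   (speculative)
    rt.restore(snap)                 # branch turned out to be mispredicted
    return rt
```

After `demo()`, F0 maps to P32 and F4 to P4 again, which is the "undo the table mappings" step; registers released by commits after the checkpoint stay on the freelist.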
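
In-order commit through a reorder buffer can be sketched the same way. The code below is a minimal model, assuming a hypothetical `ReorderBuffer` class with `issue`, `write_result`, and `commit` methods; it tracks only register results (no stores, no branches) and treats an exception at the head as "flush everything and report the faulting tag".

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    tag: int                    # reorder buffer number handed out at issue
    dest: Optional[str]         # architectural destination, None for branches/stores
    value: object = None
    done: bool = False
    exception: bool = False


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # oldest instruction at the left
        self.regs = {}          # committed (architectural) register state
        self.next_tag = 1

    def issue(self, dest):
        """Allocate an entry in program order; None means stall (buffer full)."""
        if len(self.entries) == self.size:
            return None
        entry = Entry(tag=self.next_tag, dest=dest)
        self.next_tag += 1
        self.entries.append(entry)
        return entry.tag

    def write_result(self, tag, value=None, exception=False):
        """Out-of-order completion: record the result, do not touch self.regs."""
        for entry in self.entries:
            if entry.tag == tag:
                entry.value, entry.exception, entry.done = value, exception, True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished instructions from the head, in order.

        Returns the tag of a faulting instruction after flushing, else None.
        """
        while self.entries and self.entries[0].done:
            head = self.entries[0]
            if head.exception:
                self.entries.clear()  # squash the faulting instruction and all younger ones
                return head.tag
            self.entries.popleft()
            if head.dest is not None:
                self.regs[head.dest] = head.value
        return None
```

Because results reach `regs` only from the head of the queue, a younger instruction that finished early never becomes visible if an older one faults, which is the precise interrupt point.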
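
The store-queue check described under Hardware Support for Memory Disambiguation can also be written out as a small function. This is a sketch under assumptions: `Store` and `resolve_load` are hypothetical names, addresses are plain integers, the list holds only the stores that precede the load in program order, and the youngest matching store is taken as the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    rob: int                       # reorder buffer number of the store
    addr: Optional[int] = None     # None until the address has been computed
    value: Optional[float] = None  # None until the store data is available


def resolve_load(load_addr, older_stores):
    """Decide what a load may do, given the stores ahead of it in program order."""
    # Any older store with an unknown address might alias the load: stall.
    if any(s.addr is None for s in older_stores):
        return ("stall", None)
    # Search from the youngest older store backwards (the associative lookup).
    for s in reversed(older_stores):
        if s.addr == load_addr:
            if s.value is not None:
                return ("forward", s.value)  # store value available: return value
            return ("wait", s.rob)           # not available: return ROB number of source
    return ("memory", load_addr)             # no match: send request to memory
```

With hypothetical register contents R3 = 100, a load from 10(R3) behind a pending store to 10(R3) in ROB3 yields `("wait", 3)`, matching the load buffer entry that names ROB3 as its source in the memory disambiguation snapshot.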