Dynamic Instruction Scheduling (Example)
High Performance Computer Architecture
http://www.dii.unisi.it/~giorgi/teaching/hpca2
Roberto Giorgi, Universita' degli Studi di Siena, C216ES01--SL

Example

loop: r3 <- mem(r4+r2)      # load b(i)
      r7 <- mem(r5+r2)      # load c(i)
      r7 <- r7 * r3         # b(i) * c(i)
      r1 <- r1 - 1          # decr. counter
      mem(r6+r2) <- r7      # store a(i)
      r2 <- r2 + 8          # bump index
      P <- loop; r1!=0      # close loop

[Figure, repeated on every cycle slide: pipeline with I-cache access (CIP, NIP), Decode, DISPATCH, ISSUE and WRITE-BACK stages; integer (A), multiplier (M, stages Mult. 1 to Mult. 4) and load/store (LS) reservation stations; Address Add, LQ, SQ and D-Cache; register file (Regs) and Common Data Bus.]

Reservation station ids: A1=1, A2=2, A3=3, M1=4, M2=5, LS1=6, LS2=7, LS3=8. In the tables, Q=0 means that the value is valid, a dash marks an empty field, and reservation stations that are not listed are free.

CYCLE 1

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    0    6    0     0     0     0
  V    -    100  0    -    1000  2000  3000  49

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  LS1  6   1     load  1000  0   0   0

CYCLE 2

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    0    6    0     0     0     7
  V    -    100  0    -    1000  2000  3000  -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  LS1  6   1     load  1000  0   0   0
  LS2  7   1     load  2000  0   0   0

CYCLE 3

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    0    6    0     0     0     4
  V    -    100  0    -    1000  2000  3000  -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  M1   4   1     mult  -     -   6   7
  LS1  6   1     load  1000  0   0   0
  LS2  7   1     load  2000  0   0   0

- The mult goes in dispatch: the tag of r7 changes from 7 to 4
- The first load completes: let's assume that it reads '13'

CYCLE 4

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    1    0    0    0     0     0     4
  V    -    -    0    13   1000  2000  3000  -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  A1   1   1     sub   100   1   0   0
  M1   4   1     mult  13    -   0   7
  LS2  7   1     load  2000  0   0   0

- The first load writes on the CDB (the value 13)
- The sub goes in dispatch
- The second load is issued
- The mult can't be issued until it gets Qj=Qk=0

CYCLE 5

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    1    0    0    0     0     0     4
  V    -    -    0    13   1000  2000  3000  -

Store queue (SQ):
  A     Q   V
  -     4   -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  A1   1   1     sub   100   1   0   0
  M1   4   1     mult  13    -   0   7
  LS1  6   1     sto   3000  0   0   0
  LS2  7   1     load  2000  0   0   0

- The second load completes: let's assume that it reads '11'
- The mult waits and the sub is issued
- The store goes in dispatch
- Simultaneously, one entry is allocated in the SQ
- The sub is going to conflict on the CDB with the load, so it will have to wait

CYCLE 6

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    1    2    0    0     0     0     4
  V    -    -    -    13   1000  2000  3000  -

Store queue (SQ):
  A     Q   V
  3000  4   -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  A1   1   1     sub   100   1   0   0
  A2   2   1     add   0     8   0   0
  M1   4   1     mult  13    11  0   0
  LS1  6   1     sto   3000  0   0   0

- The second load writes on the CDB (the value 11)
- The mult is issued, and the sub is waiting for the CDB
- The store is issued: in the SQ it gets the effective address A (3000), but it can't advance while its Q tag is nonzero
- The add goes in dispatch

CYCLE 7

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    2    0    0     0     0     4
  V    -    99   -    13   1000  2000  3000  -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  A2   2   1     add   0     8   0   0
  A3   3   1     brch  99    0   0   0
  M1   4   1     mult  13    11  0   0
  LS1  6   1     sto   3000  0   0   0

- The mult proceeds and the store waits
- The sub completes and updates R1 (and the CDB) with '99'
- The add is issued
- The branch goes in dispatch

CYCLE 8

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    0    0    0     0     0     4
  V    -    99   8    13   1000  2000  3000  -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  A3   3   1     brch  99    0   0   0
  M1   4   1     mult  13    11  0   0
  LS1  6   1     sto   3000  0   0   0

- The mult proceeds and the store waits
- The add writes on the CDB
- The branch is issued

CYCLE 9

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    0    0    0     0     0     4
  V    -    99   8    13   1000  2000  3000  -

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  M1   4   1     mult  13    11  0   0
  LS1  6   1     sto   3000  0   0   0

- The mult completes and calculates 13*11=143
- The store waits
- The branch completes

CYCLE 10

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    0    0    0     0     0     0
  V    -    99   8    13   1000  2000  3000  143

Store queue (SQ):
  A     Q   V
  3000  0   143

Reservation stations:
  RS   Id  Busy  Op    Vj    Vk  Qj  Qk
  LS1  6   1     sto   3000  0   0   0

- The mult writes on the CDB (the value '143')
- The store gets the value '143' and can finally complete

CYCLE 11

Registers:
  Reg  0    1    2    3    4     5     6     7
  Q    -    0    0    0    0     0     0     0
  V    -    99   8    13   1000  2000  3000  143

Reservation stations: all free (Busy = 0)

Tomasulo: Summary
• Reservation Stations
  • Allow "out-of-order issue" based on the availability of data (e.g.
    the sub and the add are issued without waiting for the mult)
• Register Renaming (tags)
  + Avoids the WAR and WAW hazards. Especially important when few registers are available (as originally in the IBM 360)
  + Realizes a dynamic "loop unrolling"
  - Requires relatively complex logic
• Common Data Bus
  + Broadcasts each result simultaneously to all the instructions waiting for it
  - It is a "bottleneck", but it can be replicated (at the cost of more hardware)
• The scheme does not handle "precise exceptions"

Tomasulo: hazard management summary

  Hazard                            Management method
  Structural on RS (RS finite)      Stall in the Dispatch stage (*1)
  Structural on CDB (CDB occupied)  Stall in the Issue stage (*2)
  Structural on FU (FU occupied)    Stall in the Issue stage (*3)
  RAW                               Resolved by using the tags
  WAR                               Avoided by copying operands into the RS at dispatch time
  WAW                               Avoided by register renaming (tags)

  (*1) avoidable with a larger number of RSs
  (*2) avoidable with a larger number of CDBs
  (*3) avoidable with multiple FUs (or reducible with pipelined FUs)

Reservation Station -- implementation
[Figure: reservation-station entry logic (fields Op, Vj, Qj, Vk, Qk and Busy; tag comparators, MUXes and "=0?" checks on Qj and Qk). dispatch: move to res. station; issue: move to functional unit.]
[Figure notes: the operands go to the functional unit; the CDB tag arrives 1 cycle before the CDB data; the "ready" signal goes to the issue logic and the "busy" signal to the dispatch logic.]

General organization of IBM 360/91 pipeline
• "In-order" pipeline with the following stages:
  • I-fetch, decode, address generation
• Floating point decoupled from the integer (fixed point) unit through memory buffers
• Effective-address generation done in the integer unit
• A memory pipeline for loading the data

IBM 360/91 -- Floating Point Unit
[Figure: floating-point unit of the IBM 360/91]
From: R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal, Jan. 1967, pp. 25-33
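
The tag mechanism traced in the cycle-by-cycle example (renaming at dispatch, operand capture from the CDB) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the 360/91 logic: the names Reg, Station, Machine, dispatch and broadcast are hypothetical, timing and functional-unit latencies are ignored, the initial value of r3 is an arbitrary placeholder, and the loaded values 13 and 11 are the ones assumed in the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Reg:
    q: int = 0  # tag of the producing reservation station; 0 means v is valid
    v: int = 0


@dataclass
class Station:
    op: str
    vj: Optional[int]
    vk: Optional[int]
    qj: int
    qk: int

    def ready(self) -> bool:
        # An instruction may be issued only when Qj = Qk = 0.
        return self.qj == 0 and self.qk == 0


class Machine:
    """Hypothetical toy model: registers with tags plus reservation stations."""

    def __init__(self, values: Dict[str, int]) -> None:
        self.regs = {name: Reg(0, v) for name, v in values.items()}
        self.stations: Dict[int, Station] = {}

    def _operand(self, name: str) -> Tuple[Optional[int], int]:
        # Copy the value if it is valid, otherwise remember the producer's tag.
        r = self.regs[name]
        return (r.v, 0) if r.q == 0 else (None, r.q)

    def dispatch(self, tag: int, op: str, dest: str, src_j: str, src_k: str) -> None:
        vj, qj = self._operand(src_j)
        vk, qk = self._operand(src_k)
        self.stations[tag] = Station(op, vj, vk, qj, qk)
        self.regs[dest].q = tag  # renaming: dest now waits for this station

    def broadcast(self, tag: int, value: int) -> None:
        # CDB write: every waiting station and register matching the tag captures the value.
        for rs in self.stations.values():
            if rs.qj == tag:
                rs.vj, rs.qj = value, 0
            if rs.qk == tag:
                rs.vk, rs.qk = value, 0
        for r in self.regs.values():
            if r.q == tag:
                r.v, r.q = value, 0
        self.stations.pop(tag, None)  # the producing station becomes free


# Register values as in cycle 1 (r3 = 0 is a placeholder, it is overwritten by the load).
m = Machine({"r2": 0, "r3": 0, "r4": 1000, "r5": 2000, "r7": 49})
m.dispatch(6, "load", "r3", "r4", "r2")  # LS1: r3 <- mem(r4+r2)
m.dispatch(7, "load", "r7", "r5", "r2")  # LS2: r7 <- mem(r5+r2)
m.dispatch(4, "mult", "r7", "r3", "r7")  # M1:  r7 <- r7 * r3
mult = m.stations[4]
assert (mult.qj, mult.qk) == (6, 7) and not mult.ready()

m.broadcast(6, 13)  # first load writes 13 on the CDB
m.broadcast(7, 11)  # second load writes 11 on the CDB
assert mult.ready()
m.broadcast(4, mult.vj * mult.vk)  # mult writes its result on the CDB
print(m.regs["r7"])
```

In this sketch the result of the second load (tag 7) never reaches r7, because by then r7 carries tag 4: this is how renaming removes the WAW hazard between the load and the mult.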