2 | Processors designed for low power
  | Architectural state is correct at basic block granularity rather than instruction granularity

3 | Background
  | B-Processor mechanisms
  | Results
  | Conclusion

4 | Depending on when instructions read their source operands, two pipeline designs are possible
      Operand values are read before issue
      Operand values are read after issue
      Issue: instruction is sent to a functional unit for execution
      Dispatch: instruction is inserted into the instruction scheduler

5 | Pipeline has a Data-Capture (DC) Scheduler
      [Figure: pipeline diagram; labels: Fetch, Decode and Dispatch; Read; ARF; ROB / Rename Buffer; Data-Capture Scheduler; Update; Bypass and Wake up; Execution Units]
      DC Scheduler + ARF + ROB with Data: Intel Nehalem, Intel Core

6 | Results produced by instructions are copied twice
      First to ROB, on instruction completion
      Then to ARF, on instruction commit
  | ROB + ARF consume a significant portion of the total core power
      > 10% [Brooks et al. ISCA 2000]

7 | Design mechanism(s) to reduce the power consumption of the ROB + ARF
      Reduce the number of writes to these structures

8 | Change the organization of these structures
      Ports, hierarchical organization, banking [MICRO’92, MICRO’94]
  | Reduce accesses to these structures
      Register File Caches [Yung et al., ICCD ’95]
      Reduce writes: target short-lived variables (mostly VLIW)

9 | Many instruction results within a basic block are not visible outside the basic block
      We call such values BB-Internal values
      Basic Block:
        …
        ADD R1, R2, R3
        SUB R4, R1, R6
        …   (Inst-M)
        MUL R1, R1, R4
        …   (Inst-N)
        JGZ R10
  | Values visible outside a basic block are called BB-External values
      The last value written to a register within a basic block is a BB-External value

10 | Dependency Distance (Dep-Distance): integer value defined for every instruction
       For instructions producing only BB-Internal value(s), it is the distance, in instructions, from the instruction to its last consumer
       For instructions producing BB-External value(s), it is infinite

11 | Many BB-Internal values become dead shortly after being
     produced, i.e., all consumers of a BB-Internal value are found within a short distance of the instruction producing it
       [Figure: breakdown of instructions by Dep-Distance, y-axis 0 to 100, one bar per benchmark plus AMean; categories: BB-External, Dep-Distance > 8, Dep-Distance = [5, 8], Dep-Distance = 4, Dep-Distance = 3, Dep-Distance = 2, Dep-Distance = 1; benchmarks: bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3, perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk]
       >22% of all instructions produce BB-Internal values only, and those values are consumed within 4 instructions of being produced

12 | Instruction results are broadcast over the bypass network
   | If we can guarantee that all instructions dependent on the BB-Internal values produced by an instruction have received those values from the bypass network, then we can skip writing the BB-Internal values to the operand store(s)

13 | If the results of an instruction are not being written to the operand stores (Mechanism #1), then we can stop the broadcast of results beyond the first stage of bypass

14 | Assistance of the compiler
   | Changes to ISA
   | Changes to hardware

15 | Analyze the lifetime of variables and identify the dep-distance of instructions in basic blocks

16 | Add 2 bits to the instruction encoding
       Compiler passes the dep-distance of instructions via this encoding
       Bits can be encoded in several ways
       Example encoding using powers of 2:
         Encoding   Meaning
         00         Dep-Distance is infinite
         01         1 ≤ Dep-Distance < 2^1     [1]
         10         2^1 ≤ Dep-Distance < 2^2   [2-3]
         11         2^2 ≤ Dep-Distance < 2^3   [4-7]

17 | Add a bit-mask (Presence Vector) to track the presence of instructions in the scheduler
       Bit-mask is the same size as the ROB
       Bit-mask has head and tail pointers
       First 0 (from tail) in the mask is set when a new instruction is dispatched
       First 1 (from head) in the mask is cleared when an instruction is retired

18 | When an instruction is issued, check if all dependent instructions have been dispatched
       If dep-distance is n, check if the nth
       bit from the bit for this instruction is set
       If set, then do not write to ROB and ARF
       [Figure: Presence Vector (PV) next to the scheduler; PV bits 0, 1, 1, 1, 1, …, 0, with the set bits corresponding to scheduler entries Ia, Ib, Ic, Id; labels: DD = 3, Check hit]

19 | [Figure: labels 01, 10, 11, Dep-Distance]
   | Write can be skipped when $\overline{d_1}\,d_0\,b_1 + d_1\,\overline{d_0}\,b_3 + d_1\,d_0\,b_7$ is true
       $d_1 d_0$: 2-bit encoding for the instruction
       $b_x b_{x-1} \ldots b_0$: Presence Vector
       $d_1 d_0 = 00$: must write to ROB and ARF
       $d_1 d_0 = 01$: dep-distance is 1
       $d_1 d_0 = 10$: dep-distance in [2, 3]
       $d_1 d_0 = 11$: dep-distance in [4, 7]

20 | Precise exceptions are not supported
       Many instructions will not update the architectural state as they are supposed to
       But at the end of a basic block, the architectural state matches the state obtained with regular execution
       Solution: checkpoint the RF at the end of each basic block; whenever there is an exception, roll back to the start of the basic block and execute in instruction-precise mode
       Use a lightweight RF checkpointing mechanism

21 | [Figure: ARF replaced by ARF-0 and ARF-1 (2 copies of ARF) plus Dirty and State Masks]
   | ARF becomes 2 ARFs + 1 Dirty Mask + several State Masks
       Each bit mask is the same size as the ARF
       Number of state masks is equal to the maximum number of basic blocks supported by the pipeline + 1

22 | Dirty mask
       Tracks which registers have been written by the current basic block
   | State mask
       Holds the current mapping of registers, i.e., whether the latest value of a register is in ARF-0 or in ARF-1
   | First write to a register in a basic block flips the bit in the state mask
       Register value at the end of the last basic block is untouched
       Subsequent writes to the same register use the current mapping

23 | MacSim simulator with an integrated McPAT-based tool for modeling power
   | Nehalem-like core
       4-wide, 128-entry ROB, 36-entry scheduler, 16 IRegs, 32 FRegs
       22nm

24 | Power savings for ROB + ARF
       [Figure: total power consumption for ROB + ARFs and other data stores relative to Baseline, y-axis 0 to 1, per benchmark plus Gmean; series: RFC-32, B-Processor]
       15% savings over baseline, 7% over RFC-32
       FP benchmarks: B-Processor skips writing many results, whereas the RFC mechanism writes a lot of live values to the ROB

25 | Power savings for Bypass Network
       Baseline has two levels of bypass
       [Figure: % saving in power over Baseline for the Bypass Network, y-axis 0 to 40, per benchmark plus GMean; series: B-Processor-C]
       10% savings on average

26 | ROB + ARF contribute a significant fraction of total power
       Propose a mechanism to reduce their power consumption
   | For BB-Internal values, if all dependent instructions read the value off the bypass network, then skip writes to ROB and ARF and skip broadcast beyond the first stage of bypass
   | Mechanism results in correct architectural state at basic block granularity
   | Mechanism reduces ROB + ARF power consumption by 15% and bypass power consumption by 10% relative to a conventional design

27 | Thank You!
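
Backup sketch: the 2-bit dep-distance encoding (slide 16) and the Presence Vector check (slides 17 to 19) can be illustrated in a few lines of Python. This is a minimal software model, not the hardware design. The function names, the list-of-bits Presence Vector, the circular indexing, the choice that higher indices mean younger instructions, and the fallback of distances above 7 to the "infinite" code are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# Presence Vector offset checked for each 2-bit code: the upper bound of the
# encoded range (1, 3, 7), matching the bits b1, b3, b7 on slide 19.
CHECK_OFFSET = {0b01: 1, 0b10: 3, 0b11: 7}


def encode_dep_distance(dep_distance: Optional[int]) -> int:
    """Map a dep-distance to the 2-bit code; None stands for infinite."""
    if dep_distance is None:
        return 0b00
    if dep_distance < 1:
        raise ValueError("dep-distance must be >= 1")
    if dep_distance == 1:
        return 0b01
    if dep_distance <= 3:
        return 0b10
    if dep_distance <= 7:
        return 0b11
    # Assumption: distances above 7 do not fit the encoding, so the
    # instruction is treated as if its dep-distance were infinite.
    return 0b00


def can_skip_write(code: int, presence: Sequence[int], own_index: int) -> bool:
    """Return True if the ROB/ARF write may be skipped for this instruction."""
    if code == 0b00:
        return False  # must write to ROB and ARF
    # Assumption: the vector is circular and younger instructions sit at
    # higher indices, so the checked bit is own_index + offset (mod size).
    checked = (own_index + CHECK_OFFSET[code]) % len(presence)
    return presence[checked] == 1


def skip_formula(d1: int, d0: int, b1: int, b3: int, b7: int) -> bool:
    """Boolean form: (not d1) d0 b1 + d1 (not d0) b3 + d1 d0 b7."""
    return bool((not d1 and d0 and b1) or (d1 and not d0 and b3) or (d1 and d0 and b7))


# Four dispatched instructions (Ia..Id) occupy indices 1..4.
example_presence = [0, 1, 1, 1, 1, 0, 0, 0]
print(can_skip_write(encode_dep_distance(3), example_presence, own_index=1))  # True
```

In the example, the instruction at index 1 has dep-distance 3, so the bit three positions later (index 4) is checked. That bit is set, so the write can be skipped. The same instruction at index 2 would check index 5, find it clear, and fall back to the normal write.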