Low-Power Processors with Reduced Operand Store Writes and

2
| Processors designed for low power
| Architectural state is correct at basic block granularity rather
than instruction granularity
3
| Background
| B-Processor mechanisms
| Results
| Conclusion
4
| Depending on when instructions read their source operands,
two pipeline designs are possible

Operand values are read before issue
Operand values are read after issue
Issue – instruction sent to functional unit for execution
Dispatch – instruction inserted into instruction scheduler
5
| Pipeline has a Data-Capture (DC) Scheduler
[Figure: pipeline diagram with blocks Fetch, Decode and Dispatch; Read; ARF; ROB/Rename Buffer; Data-Capture Scheduler; Update; Bypass and Wake up; Execution Units]
DC Scheduler + ARF + ROB with Data – Intel Nehalem, Intel Core
6
| Results produced by instructions are copied twice


First to ROB – on instruction completion
Then to ARF – on instruction commit
| ROB + ARF consume a significant portion of the total core
power

> 10% [Brooks et al. ISCA 2000]
7
| Design mechanism(s) to reduce the power consumption of
the ROB + ARF

reduce the number of writes to these structures
8
| Change the organization of these structures

ports, hierarchical organization, banking [MICRO’92, MICRO’94]
| Reduce accesses to these structures


Register File Caches [Yung et al, ICCD ‘95]
Reduce writes

Target short-lived variables (mostly VLIW)
9
| Many instruction results within a basic block are not visible
outside the basic block

we call such values BB-Internal values
Basic Block
  …
  ADD R1, R2, R3
  SUB R4, R1, R6
  … (Inst-M)
  MUL R1, R1, R4
  … (Inst-N)
  JGZ R10
| Values visible outside a basic block are called BB-External
values

The last value written to a register within a basic block is a BB-External value
10
| Dependency Distance (Dep-Distance) – integer value defined
for every instruction

For instructions producing BB-Internal value(s) only

it is the distance, in instructions, from the instruction to its last consumer
For instructions producing BB-External value(s)

it is infinite
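
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dep_distances` and the `(dest, sources)` tuple format are hypothetical, the first operand of each instruction is assumed to be its destination, a BB-Internal value with no consumer is given distance 0, and the elided instructions of the example block (Inst-M, Inst-N) are left out, so the distances apply to this compressed block only.

```python
import math

def dep_distances(block):
    """block: list of (dest, sources) tuples; dest is None if no register is written.

    Returns one entry per instruction: math.inf for BB-External values,
    the distance to the last consumer for BB-Internal values,
    and None for instructions that produce no register value.
    """
    result = []
    for i, (dest, _) in enumerate(block):
        if dest is None:
            result.append(None)
            continue
        last_consumer = 0
        external = True  # stays True if dest is never overwritten later in the block
        for j in range(i + 1, len(block)):
            later_dest, later_srcs = block[j]
            if dest in later_srcs:
                last_consumer = j - i  # a read counts even if j also overwrites dest
            if later_dest == dest:
                external = False  # value is dead after instruction j
                break
        result.append(math.inf if external else last_consumer)
    return result

# Example block from the slide, without the elided instructions.
block = [
    ("R1", ("R2", "R3")),  # ADD R1, R2, R3
    ("R4", ("R1", "R6")),  # SUB R4, R1, R6
    ("R1", ("R1", "R4")),  # MUL R1, R1, R4
    (None, ("R10",)),      # JGZ R10
]
print(dep_distances(block))
```

Under these assumptions only the ADD result is BB-Internal: it is read by SUB and MUL and then overwritten by MUL, so its dep-distance is 2.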
11
| Many BB-Internal values become dead shortly after being
produced

i.e., all consumers of BB-Internal value are found within a short distance
of the instruction producing the BB-Internal value
[Figure: stacked bar chart, y-axis 0 to 100, one bar per benchmark: AMean, bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3, perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk. Categories: BB-External, Dep-Distance > 8, Dep-Distance = [5, 8], Dep-Distance = 4, Dep-Distance = 3, Dep-Distance = 2, Dep-Distance = 1]
>22% of all instructions produce BB-Internal values only and those values
are consumed within 4 instructions of being produced
12
| Instruction results are broadcast over the bypass network
| If we can guarantee that all instructions dependent on the
BB-Internal values produced by an instruction have received those
values from the bypass network, then we can skip
writing the BB-Internal values to the operand store(s)
13
| If the results of an instruction are not being written to the operand
stores (Mechanism #1), then we can stop the broadcast of
results beyond the first stage of bypass
14
| Assistance of the Compiler
| Changes to ISA
| Changes to hardware
15
| Analyze the lifetimes of variables and identify the dep-distance of each instruction in a basic block
16
| Add 2 bits to the instruction encoding

Compiler passes dep-distance of instructions via this encoding
Bits can be encoded in several ways
Example encoding using powers of 2

Encoding   Meaning
00         Dep-Distance is infinite
01         1 ≤ Dep-Distance < 2^1     [1]
10         2^1 ≤ Dep-Distance < 2^2   [2-3]
11         2^2 ≤ Dep-Distance < 2^3   [4-7]
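
As a rough illustration of the example encoding, a compiler pass could map a dep-distance to the 2-bit field as sketched below. The function name `encode_dep_distance` is hypothetical, and treating distances that the table cannot express (0, or greater than 7) as "infinite" is an assumption of this sketch, chosen because code 00 always forces the write.

```python
import math

def encode_dep_distance(dist):
    """Map a dep-distance to the 2-bit code of the example encoding.

    00 -> infinite (always write), 01 -> [1], 10 -> [2-3], 11 -> [4-7].
    Distances outside 1..7 fall back to 00 (assumption: writing is always safe).
    """
    if dist is None or dist == math.inf or dist < 1 or dist > 7:
        return 0b00
    # bit_length() is 1 for 1, 2 for 2..3, 3 for 4..7, which matches the table
    return int(dist).bit_length()

for d in (1, 2, 3, 4, 7, 8, math.inf):
    print(d, format(encode_dep_distance(d), "02b"))
```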
17
| Add a bit-mask (Presence Vector) to track the presence of
instructions in Scheduler

Bit-mask of same size as ROB

Bit mask has head and tail pointers


First 0 (from tail) in mask is set when a new instruction is dispatched
First 1 (from head) in mask is cleared when an instruction is retired
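
A minimal software model of such a bit-mask is sketched below, assuming a circular buffer. The class and method names are hypothetical, and the age guard in `is_dispatched` is an assumption added here so that a lookup cannot wrap around onto an older instruction's bit; the slides do not say how the hardware handles that case.

```python
class PresenceVector:
    """Circular bit-mask with head and tail pointers (illustrative model only)."""

    def __init__(self, size):
        self.bits = [0] * size
        self.head = 0   # slot of the oldest instruction still present
        self.tail = 0   # slot the next dispatched instruction will take
        self.count = 0  # number of bits currently set

    def dispatch(self):
        """Set the first 0 from the tail and return its slot."""
        if self.count == len(self.bits):
            raise RuntimeError("presence vector full")
        slot = self.tail
        self.bits[slot] = 1
        self.tail = (slot + 1) % len(self.bits)
        self.count += 1
        return slot

    def retire(self):
        """Clear the first 1 from the head and return its slot."""
        if self.count == 0:
            raise RuntimeError("presence vector empty")
        slot = self.head
        self.bits[slot] = 0
        self.head = (slot + 1) % len(self.bits)
        self.count -= 1
        return slot

    def is_dispatched(self, slot, distance):
        """True if the instruction `distance` slots after `slot` is present."""
        age = (slot - self.head) % len(self.bits)
        if age + distance >= self.count:
            return False  # that position has not been dispatched yet
        return self.bits[(slot + distance) % len(self.bits)] == 1
```

For example, after dispatching four instructions into an empty vector, `is_dispatched(first_slot, 3)` is true and `is_dispatched(first_slot, 4)` is false.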
18
| When an instruction is issued, check if all dependent
instructions have been dispatched

If dep-distance is n, check if the nth bit from the bit for this instruction is set

If set then do not write to ROB and ARF
[Figure: Presence Vector (PV) next to the Scheduler entries; PV bits 0, 1, 1, 1, 1, …, 0 for entries –, Ia, Ib, Ic, Id, …, –; example check with DD = 3 is a hit]
19
| $\overline{d_1}\, d_0\, b_1 \;+\; d_1\, \overline{d_0}\, b_3 \;+\; d_1\, d_0\, b_7$

(the three terms correspond to Dep-Distance encodings 01, 10 and 11)

d1d0 – 2-bit encoding for the instruction
b_x b_(x-1) … b_0 – Presence Vector
d1d0 = 00: must write to ROB and ARF
d1d0 = 01: dep-distance is 1
d1d0 = 10: dep-distance in [2,3]
d1d0 = 11: dep-distance in [4,7]
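
The check can be written out as a small Boolean function. This is a sketch, not the hardware design: the name `can_skip_write` is hypothetical, and `b` is assumed to be indexed relative to the issuing instruction, so that `b[k]` is the presence bit of the instruction k positions after it. Each code tests the bit at the upper end of its range (1, 3 or 7).

```python
def can_skip_write(d1, d0, b):
    """Return True if the result need not be written to ROB and ARF.

    d1, d0: the 2-bit dep-distance code from the instruction encoding.
    b: presence bits relative to the issuing instruction (needs at least 8 entries).
    """
    term_01 = (not d1) and d0 and b[1]        # dep-distance is 1
    term_10 = d1 and (not d0) and b[3]        # dep-distance in [2,3]
    term_11 = d1 and d0 and b[7]              # dep-distance in [4,7]
    return bool(term_01 or term_10 or term_11)

# Three younger instructions dispatched so far:
b = [1, 1, 1, 1, 0, 0, 0, 0]
print(can_skip_write(0, 1, b), can_skip_write(1, 0, b), can_skip_write(1, 1, b))
```

With code 00 none of the terms can be true, so the write always happens.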
20
| Precise exceptions are not supported

Many instructions will not update the architectural state as they are
supposed to do


But at end of a basic block architectural state matches state obtained with
regular execution
Solution: checkpoint the RF at the end of each basic block; whenever there is
an exception, roll back to the start of the basic block and execute in
instruction-precise mode

Use a lightweight RF checkpointing mechanism
21
[Figure: the ARF is replaced by 2 copies of the ARF (ARF-0 and ARF-1) plus Dirty and State Masks]
| ARF → 2 ARF + 1 Dirty Mask + Several State Masks

Each bit mask is the same size as the ARF
The number of state masks is equal to the maximum number of basic blocks
supported by the pipeline + 1
22
| Dirty mask

Tracks which registers have been written by the current basic block
| State mask

Holds the current mapping of registers, i.e., whether the latest value of a register
is in ARF-0 or in ARF-1
| First write to a register in a basic block flips the bit in the
state mask


register value at end of last basic block is untouched
subsequent writes to same register use the current mapping
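
The interplay of the two ARF copies and the masks can be modelled as below. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, only one basic block is in flight (so a single saved state mask stands in for the several state masks), and a rollback simply restores the mapping saved at the end of the previous block.

```python
class CheckpointedRF:
    """Two ARF copies plus dirty and state masks (illustrative model only)."""

    def __init__(self, num_regs):
        self.arf = [[0] * num_regs, [0] * num_regs]  # ARF-0 and ARF-1
        self.state = [0] * num_regs   # which copy holds the latest value
        self.dirty = [0] * num_regs   # written by the current basic block?
        self.saved_state = list(self.state)  # mapping at the start of the block

    def read(self, reg):
        return self.arf[self.state[reg]][reg]

    def write(self, reg, value):
        if not self.dirty[reg]:
            # First write in this block: flip the mapping so the value from the
            # end of the last basic block stays untouched in the other copy.
            self.state[reg] ^= 1
            self.dirty[reg] = 1
        self.arf[self.state[reg]][reg] = value

    def end_block(self):
        """Checkpoint: remember the mapping and start a new block."""
        self.saved_state = list(self.state)
        self.dirty = [0] * len(self.dirty)

    def rollback(self):
        """Exception: restore the mapping from the start of the block."""
        self.state = list(self.saved_state)
        self.dirty = [0] * len(self.dirty)

rf = CheckpointedRF(4)
rf.write(1, 10)
rf.end_block()
rf.write(1, 20)
rf.write(1, 30)
rf.rollback()
print(rf.read(1))
```

In this model no register values are copied on a checkpoint or a rollback; only the bit masks change.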
23
| MacSim Simulator

with integrated McPAT-based tool for modeling power
| Nehalem-like core

4-wide, 128-entry ROB, 36-entry scheduler, 16 IRegs, 32 FRegs
22 nm


24
| Power savings for ROB + ARF
[Figure: bar chart comparing RFC-32 and B-Processor. Y-axis: total power consumption for ROB + ARFs and other data stores relative to Baseline (0 to 1). X-axis: Gmean, bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3, perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk]
15% savings over baseline, 7% over RFC-32
FP benchmarks – B-Processor skips writing many results, while the RFC
mechanism writes a lot of live values to the ROB

25
| Power savings for Bypass Network
baseline has two levels of bypass
[Figure: bar chart for B-Processor-C. Y-axis: % saving in power over Baseline for the Bypass Network (0 to 40). X-axis: GMean, perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, lq, h264, omnetpp, astar, xalancbmk, bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3]
10% savings on average
26
| ROB + ARF contribute a significant fraction of total power

we propose mechanisms to reduce their power consumption
| For BB-Internal values, if all dependent instructions read the
value off the bypass network, then skip the writes to ROB and ARF
and the broadcast beyond the first stage of bypass
| Mechanism results in correct architecture state at basic block
granularity
| Mechanism reduces ROB + ARF power consumption by 15%
and bypass power consumption by 10% relative to
conventional design
27
Thank You!