Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines

Addressing Instruction Fetch Bottlenecks
by Using an Instruction Register File
Stephen Hines, Gary Tyson, and David Whalley
Computer Science Dept.
Florida State University
June 8-16, 2007
Instruction Packing


Store frequently occurring instructions, as specified by the compiler, in a small, low-power Instruction Register File (IRF)
Allow multiple instruction fetches from the IRF by packing instruction references together
    Tightly packed – multiple IRF references
    Loosely packed – piggybacks an IRF reference onto an existing instruction
Facilitate parameterization of some instructions using an Immediate Table (IMM)
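
As a rough illustration of the packing idea on this slide, the following Python sketch models the IRF and IMM as small lookup tables and expands a tightly packed instruction into the instructions it references. The table contents, the `{imm}` placeholder syntax, and the name `expand_pack` are illustrative assumptions, not the encoding used by the actual hardware.

```python
# Hypothetical IRF contents: frequently occurring instructions picked by the compiler.
IRF = [
    "nop",
    "addiu $sp, $sp, {imm}",   # parameterized entry: immediate supplied by the IMM
    "lw $ra, {imm}($sp)",      # parameterized entry
    "addu $v0, $a0, $a1",
    "jr $ra",
]

# Hypothetical IMM contents: commonly used immediate values.
IMM = [0, 4, 8, -32, 32]


def expand_pack(irf_indices, imm_index=None):
    """Expand the IRF references of one packed instruction, in order."""
    expanded = []
    for idx in irf_indices:
        insn = IRF[idx]
        if "{imm}" in insn:
            if imm_index is None:
                raise ValueError("parameterized IRF entry needs an IMM index")
            insn = insn.format(imm=IMM[imm_index])
        expanded.append(insn)
    return expanded


# One fetched word stands in for three instructions.
for insn in expand_pack([3, 1, 4], imm_index=4):
    print(insn)
```

In this toy model a single instruction cache access yields three executable instructions, which is the source of the fetch savings the slide describes.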
Execution of IRF Instructions
[Figure: datapath for a packed instruction. Instruction Fetch Stage: PC, Instruction Cache, packed instruction, IF/ID latch. First Half of Instruction Decode Stage: packed instruction, IRF (insn1 to insn4), IRWP, IMM (imm3), output to the instruction decoder.]
Executing a Tightly Packed Param4c Instruction
Outline







Introduction
IRF and Instruction Packing Overview
Integrating an IRF with an L0 I-Cache
Decoupling Instruction Fetch
Experimental Evaluation
Related Work
Conclusions & Future Work
MIPS+IRF Instruction Formats
Format | Fields (width in bits)
T-type | opcode (6) | inst1 (5) | inst2 (5) | inst3 (5) | inst4/param (5) | s (1) | inst5/param (5)
R-type | opcode (6) | rs/shamt (5) | rt (5) | rd (5) | function (6) | inst (5)
I-type | opcode (6) | rs (5) | rt (5) | immediate (11) | inst (5)
J-type | opcode (6) | win (2) | immediate (24)
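
To make the field layout concrete, here is a minimal Python sketch that packs and unpacks the T-type field widths listed on this slide (6, 5, 5, 5, 5, 1, 5 bits, 32 in total). Placing the opcode in the most significant bits, and the helper names `encode_fields` and `decode_fields`, are assumptions made for illustration; the slide gives only the widths.

```python
# T-type field names and widths as listed on the slide; bit positions are assumed.
T_TYPE_FIELDS = [
    ("opcode", 6),
    ("inst1", 5),
    ("inst2", 5),
    ("inst3", 5),
    ("inst4_param", 5),
    ("s", 1),
    ("inst5_param", 5),
]


def encode_fields(values, fields):
    """Pack named field values into one word, first field most significant."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_fields(word, fields):
    """Split a word back into named fields using the same layout."""
    shift = sum(width for _, width in fields)
    decoded = {}
    for name, width in fields:
        shift -= width
        decoded[name] = (word >> shift) & ((1 << width) - 1)
    return decoded


example = {"opcode": 0x3F, "inst1": 3, "inst2": 1, "inst3": 4,
           "inst4_param": 0, "s": 1, "inst5_param": 7}
word = encode_fields(example, T_TYPE_FIELDS)
print(hex(word), decode_fields(word, T_TYPE_FIELDS))
```

The same two helpers work for the other formats if their field lists are substituted, since each format also sums to 32 bits.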
Previous Work in IRF


Register Windowing + Loop Cache (MICRO 2005)
Compiler Optimizations (CASES 2006)
    Instruction Selection
    Register Renaming
    Instruction Scheduling
Integrating an IRF with an L0 I-Cache

L0 or Filter Caches
    Small and direct-mapped
        Fast hit time
        Low energy per access
        Higher miss rate than L1
    256B L0 I-cache, 8B line size [Kin97]
        Fetch energy reduced 68%
        Cycle time increased 46%!!!
IRF reduces code size, while L0 only focuses on energy reduction at the cost of performance
IRF can alleviate the performance penalty associated with L0 cache misses, due to overlapping fetch
L0 Cache Miss Penalty
[Figure: pipeline timing diagram, cycles 1 to 9, showing Insn1 through Insn4 passing through the IF, ID, EX, M and WB stages, with instruction fetch delayed by an L0 cache miss.]
Overlapping Fetch with an IRF
[Figure: pipeline timing diagram, cycles 1 to 9, showing Insn1, Pack2a, Pack2b and Insn3. Pack2a and Pack2b share a single fetch (IFab) and then proceed separately (IDa EXa Ma WBa and IDb EXb Mb WBb), while the IF for Insn3 is in progress.]
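
The effect of overlapping fetch can be sketched with a very small timing model. The Python below is a simplified illustration, not the simulator used in the study: it assumes a scalar in-order five-stage pipeline, serialized fetches, one decode per cycle, and no back-pressure on fetch. Which instruction suffers the L0 miss, and the one-cycle miss penalty, are assumptions chosen for the example.

```python
def last_writeback_cycle(fetch_items):
    """Return the cycle in which the final operation writes back.

    fetch_items is a list of (fetch_cycles, n_ops) pairs: how many cycles the
    fetch takes (2 models a one-cycle L0 miss penalty) and how many
    instructions the fetched word expands to (2 models a pack of two).
    """
    fetch_done = 0  # cycle in which the most recent fetch completed
    last_id = 0     # cycle in which the most recent operation was decoded
    for fetch_cycles, n_ops in fetch_items:
        fetch_done += fetch_cycles
        for _ in range(n_ops):
            # Decode needs the fetch to be complete and the decoder to be free.
            last_id = max(fetch_done + 1, last_id + 1)
    return last_id + 3  # EX, M and WB follow ID


# Four separate instructions; the fetch of the last one misses in the L0.
without_irf = [(1, 1), (1, 1), (1, 1), (2, 1)]
# The middle two instructions are packed into one fetch; the next fetch misses.
with_irf = [(1, 1), (1, 2), (2, 1)]

print(last_writeback_cycle(without_irf), last_writeback_cycle(with_irf))
```

Under these assumptions the packed version finishes one cycle earlier, because the second instruction of the pack is being decoded while the missing fetch is still outstanding.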
Decoupling Instruction Fetch

Instruction bandwidth in a pipeline is usually uniform (fetch, decode, issue, commit, …)
    Artificially limits the effective design space
    Front-end throttling improves energy utilization by reducing the fetch bandwidth in areas of low ILP
IRF can provide virtual front-end throttling
    Fetch fewer instructions every cycle, but allow multiple issue of packed instructions
    Areas of high ILP are often densely packed
    Lower ILP for infrequently executed sections of code
Out-of-order Pipeline Configurations
Experimental Evaluation


MiBench embedded benchmark suite – 6 categories representing common tasks for various domains
SimpleScalar MIPS/PISA architectural simulator
    Wattch/Cacti extensions for modeling energy consumption (inactive portions of pipeline only dissipate 10% of normal energy when using cc3 clock gating)
VPO – Very Portable Optimizer targeted for SimpleScalar MIPS/PISA
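
The clock gating assumption above can be written as a one-line energy model. The sketch below is a deliberate simplification: it treats a pipeline unit as either fully active or idle in each cycle and charges idle cycles at 10% of the active energy, as stated on the slide. The function name and the example numbers are hypothetical, and any finer scaling the real power model applies to partially used units is ignored.

```python
IDLE_FRACTION = 0.10  # from the slide: inactive portions dissipate 10% of normal energy


def unit_energy(active_cycles, total_cycles, energy_per_active_cycle):
    """Energy of one pipeline unit under the simplified gating model."""
    if not 0 <= active_cycles <= total_cycles:
        raise ValueError("active_cycles must lie between 0 and total_cycles")
    idle_cycles = total_cycles - active_cycles
    return energy_per_active_cycle * (active_cycles + IDLE_FRACTION * idle_cycles)


# Hypothetical fetch unit: active in 60 of 100 cycles, 1.0 energy unit per active cycle.
print(unit_energy(60, 100, 1.0))
```

In this model, anything that lets the fetch unit sit idle more often, such as supplying several instructions from one packed fetch, lowers its energy toward the 10% floor.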
L0 Study Configuration Data
Parameter | Low-Power In-order Embedded Processor
I-Fetch Queue | 4 entries
Branch Predictor | Bimodal-128 entries, 3 cycle penalty
Fetch/Decode/Issue | Single instruction
RUU size | 8
LSQ size | 8
L1 Data Cache | 16 KB, 256 lines, 16B line, 4-way s.a., 1 cycle hit
L1 Instruction Cache | 16 KB, 256 lines, 16B line, 4-way s.a., 1 / 2 cycle hit
L0 Instruction Cache | 256 B, 32 lines, 8B line, direct mapped, 1 cycle hit
Memory Latency | 32 cycles
IRF/IMM | 4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack
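
The cache capacities in this configuration can be checked with simple arithmetic. The sketch below assumes that the "lines" figure counts sets, so that capacity is sets times line size times associativity; that reading is an assumption, chosen because it reproduces the stated sizes. The function name is hypothetical.

```python
def cache_bytes(sets, line_bytes, ways):
    """Total capacity of a set-associative cache in bytes."""
    return sets * line_bytes * ways


l0_icache = cache_bytes(sets=32, line_bytes=8, ways=1)    # direct mapped
l1_cache = cache_bytes(sets=256, line_bytes=16, ways=4)   # 4-way set associative

print(l0_icache, "bytes")          # L0 instruction cache
print(l1_cache // 1024, "KB")      # L1 instruction or data cache
```

With 8-byte lines and 4-byte instructions, each L0 line holds only two instructions, which helps explain why the L0 misses often enough for its miss penalty to matter.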
Execution Efficiency for L0 I-Caches
[Figure: bar chart. Y-axis: Normalized IPC. X-axis: Benchmark (Basicmath, Bitcount, Qsort, Susan, Jpeg, Lame, Tiff2bw, Dijkstra, Patricia, Ispell, Rsynth, Stringsearch, Blowfish, Pgp, Rijndael, Sha, Adpcm, CRC32, FFT, Gsm, Average). Series: IRF, L0, L0+IRF, 2cycle, 2cycle+IRF.]
Energy Efficiency for L0 I-Caches
[Figure: bar chart. Y-axis: Normalized Energy. X-axis: Benchmark (Basicmath, Bitcount, Qsort, Susan, Jpeg, Lame, Tiff2bw, Dijkstra, Patricia, Ispell, Rsynth, Stringsearch, Blowfish, Pgp, Rijndael, Sha, Adpcm, CRC32, FFT, Gsm, Average). Series: IRF, L0, L0+IRF, 2cycle, 2cycle+IRF.]
Decoupled Fetch Configurations
Parameter | High-end Out-of-order Embedded Processor
I-Fetch Queue | 4/8 entries
Branch Predictor | Bimodal-2048 entries, 3 cycle penalty
Fetch Width | 1/2/4
Decode/Issue/Commit Width | 1/2/3/4
RUU size | 16
LSQ size | 8
L1 Data Cache | 32 KB, 512 lines, 16B line, 4-way s.a., 1 cycle hit
L1 Instruction Cache | 32 KB, 512 lines, 16B line, 4-way s.a., 1 cycle hit
Unified L2 Cache | 256 KB, 1024 lines, 64B line, 4-way s.a., 6 cycle hit
Memory Latency | 32 cycles
IRF/IMM | 4 windows, 32-entry IRF (128 total), 32-entry IMM, 1 branch/pack
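
The fetch width and execute width labels used on the charts for this study (1|1 through 4|4) can be generated from the two width lists in this configuration. The short Python sketch below enumerates them; the rule that fetch width never exceeds execute width is inferred from the chart labels rather than stated in the slides, and the function name is hypothetical.

```python
from itertools import product

FETCH_WIDTHS = (1, 2, 4)       # "Fetch Width" row
EXECUTE_WIDTHS = (1, 2, 3, 4)  # "Decode/Issue/Commit Width" row


def configurations():
    """Fetch|execute labels for every pair with fetch width <= execute width."""
    return [f"{fetch}|{execute}"
            for fetch, execute in product(FETCH_WIDTHS, EXECUTE_WIDTHS)
            if fetch <= execute]


print(configurations())
```

Each of the resulting eight pairs is evaluated both with and without an IRF, giving sixteen design points per chart.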
Execution Efficiency for Asymmetric Pipeline Bandwidth
[Figure: bar chart. Y-axis: Normalized IPC. X-axis: Fetch Width/Execute Width Configuration (1|1, 1|2, 1|3, 1|4, 2|2, 2|3, 2|4, 4|4), each shown with and without IRF.]
Energy Efficiency for Asymmetric Pipeline Bandwidth
[Figure: bar chart. Y-axis: Normalized Energy. X-axis: Fetch Width/Execute Width Configuration (1|1, 1|2, 1|3, 1|4, 2|2, 2|3, 2|4, 4|4), each shown with and without IRF.]
Energy-Delay^2 for Asymmetric Pipeline Bandwidth
[Figure: bar chart. Y-axis: Energy-Delay^2. X-axis: Fetch Width/Execute Width Configuration (1|1, 1|2, 1|3, 1|4, 2|2, 2|3, 2|4, 4|4), each shown with and without IRF.]
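
For readers unfamiliar with the metric, energy-delay^2 multiplies energy by the square of execution time, so it rewards performance more heavily than energy alone. The Python sketch below derives a normalized value from normalized energy and normalized IPC. It assumes the same instruction count and clock period as the baseline, so that delay is inversely proportional to IPC; the input numbers are hypothetical and are not results from the study.

```python
def normalized_ed2(normalized_energy, normalized_ipc):
    """Energy-delay^2 relative to a baseline whose energy and IPC are both 1.0."""
    if normalized_ipc <= 0:
        raise ValueError("normalized_ipc must be positive")
    normalized_delay = 1.0 / normalized_ipc  # assumes equal instruction count and clock
    return normalized_energy * normalized_delay ** 2


# Hypothetical design point: 5% less energy and 10% higher IPC than the baseline.
print(round(normalized_ed2(0.95, 1.10), 3))
```

Because delay enters squared, a modest IPC gain moves this metric more than an equal percentage change in energy.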
Related Work




L-caches – subdivide instruction cache, such that one portion contains the most frequently accessed code
Loop Caches – capture simple loop behaviors and replay instructions
Zero Overhead Loop Buffers (ZOLB)
Pipeline gating / Front-end throttling – stall fetch when in areas of low IPC
Conclusions and Future Work

IRF can alleviate fetch bottlenecks from L0 I-Cache misses or branch mispredictions
    Increased IPC of L0 system by 6.75%
    Further decreased energy of L0 system by 5.78%
Decoupling fetch provides a wider spectrum of design points to be evaluated (energy/performance)
Future Topics
    Can we pack areas where L0 is likely to miss?
    IRF + encrypted or compressed I-Caches
    IRF + asymmetric frequency clustering (of pipeline backend functional units)
The End
Questions ???
Energy Consumption
[Figure: bar chart. Y-axis: Total Energy. X-axis: Benchmark Category (Automotive, Consumer, Network, Office, Security, Telecomm, Average). Series: No optimizations, Promotion, Inst Selection, Reg Re-assign, Intra-sched, Inter-sched.]
Static Code Size
[Figure: bar chart. Y-axis: Static Code Size. X-axis: Benchmark Category (Automotive, Consumer, Network, Office, Security, Telecomm, Average). Series: No optimizations, Promotion, Inst Selection, Reg Re-assign, Intra-sched, Inter-sched.]
Conclusions & Future Work





Compiler optimizations targeted specifically for IRF can further reduce energy (12.2% to 15.8%), code size (16.8% to 28.8%) and execution time
Unique transformation opportunities exist due to IRF, such as code duplication for code size reduction and predication
As processor designs become more idiosyncratic, it is increasingly important to explore the possibility of evolving existing compiler optimizations
Register targeting and loop unrolling should also be explored with instruction packing
Enhanced parameterization techniques
Instruction Redundancy


Profiled largest benchmark in each of six MiBench categories
Most frequent 32 instructions comprise 66.5% of total dynamic and 31% of total static instructions
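
A redundancy figure of this kind can be computed from any instruction listing. The Python sketch below is a minimal illustration: it counts identical instructions in a sequence and reports the fraction covered by the n most frequent ones. Applied to an execution trace it gives the dynamic figure, and applied to a program listing it gives the static one. The function name and the toy trace are hypothetical, and this is not the profiling tool used for the slide.

```python
from collections import Counter


def top_n_coverage(instructions, n=32):
    """Fraction of the sequence accounted for by its n most frequent instructions."""
    counts = Counter(instructions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Toy trace: three distinct instructions, ten executed in total.
trace = ["addu $v0, $a0, $a1"] * 6 + ["lw $t0, 0($sp)"] * 3 + ["jr $ra"]
print(top_n_coverage(trace, n=2))
```

The default of 32 matches the number of entries in one IRF window, so the result estimates how much of the instruction stream a single window could supply.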
Compilation Framework