Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures (CASES 2007)

Florida State University
Chris Zimmer, Steve Hines, Prasad Kulkarni
Gary Tyson, David Whalley
Motivation
• Embedded processors have fewer registers.
• Compiler optimizations increase register pressure.
• It is difficult to apply aggressive compiler optimizations on embedded systems.
Vector Multiply Example
• Even before aggressive optimizations, 60% of available registers are already used.
• Further optimizations like loop unrolling and software pipelining are inhibited.

int A[1000], B[1000];
void vmul() {
    int I;
    for (I=2; I < 1000; I++)
        B[I] = A[I] * B[I-2];
}

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    mul r0,r12,r1
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3
Application Configurable
Processors
• Exploit common reference patterns found in code.
• Small register files mimic these reference behaviors.
• A map table provides register redirection.
• The architecture is changed to add more registers with minimal impact on ISA support; in particular, the operand size does not increase.
Architectural Modifications
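[Figure: register specifiers pass through a map table on the way to the register file (R0 to R15). In the example, R0, R6, and R15 map to themselves and R1 maps to Q1. Customized structures: Queue Q1, Queue Q2, Queue Q3, Stack Q4, Circular Buffer Q5.]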
Software Pipelining
• Software pipelining is not often found in embedded compilers.
• Software pipelining reduces the overall cycle time of a loop.
  • Extracts iterations
  • Consumes stalls
  • Consumes registers!!
Software Pipelining Example
int A[1000], B[1000], C[1000];
void vmul() {
    int I;
    for (I=2; I < 1000; I++)
        B[I] = A[I] * C[I];
}

Stalls present when the loop is run:

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    mul r0,r12,r1
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3

[Figure: the slide marks stall cycles in this loop, including two after the loads.]

Loop with the instructions reordered:

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    mul r0,r12,r1
    cmp r3, #1000
    blt .L3
Instruction
• Goal: Minimal modification to the existing instruction set.
  • Single-cycle instruction latency
• Method: Add a single instruction to the ISA that is used to map and unmap a common register specifier into a customized register structure.

qmap <Reg Specifier> <Custom reg map information> <Custom reg specifier>
qmap r3,#4,q3
Architectural Modifications
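[Figure: register specifiers pass through a map table on the way to the register file (R0 to R15). In the example, R0, R6, and R15 map to themselves and R1 maps to Q1. Customized structures: Queue Q1, Queue Q2, Queue Q3, Destructive Queue Q4, Circular Buffer Q5.]

• An access to R0, which has no mapping in the table, would get its data from the register file.
• R1 is mapped into Q1 and would retrieve its data from there.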
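The redirection described on this slide can be sketched in C as a small software model. This is only an illustration under stated assumptions: the names (`Cpu`, `cpu_qmap`, `cpu_read`, `cpu_write`), the number of queues, the queue capacity, and the choice that a read from a mapped register removes the oldest entry are all hypothetical and are not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS   16    /* architected registers r0..r15 */
#define NUM_QUEUES 4     /* hypothetical number of customized structures */
#define QUEUE_CAP  8     /* hypothetical capacity of each queue */
#define UNMAPPED   (-1)  /* map table entry meaning "use the register file" */

typedef struct {
    int32_t data[QUEUE_CAP];
    int head;   /* index of the oldest entry */
    int count;  /* number of valid entries */
} Queue;

typedef struct {
    int32_t regs[NUM_REGS];     /* architected register file */
    int     map[NUM_REGS];      /* map table: UNMAPPED or a queue index */
    Queue   queues[NUM_QUEUES]; /* customized register structures */
} Cpu;

void cpu_init(Cpu *c) {
    for (int i = 0; i < NUM_REGS; i++) { c->regs[i] = 0; c->map[i] = UNMAPPED; }
    for (int q = 0; q < NUM_QUEUES; q++) { c->queues[q].head = 0; c->queues[q].count = 0; }
}

/* Models the effect of qmap: redirect specifier r to queue q, or unmap it. */
void cpu_qmap(Cpu *c, int r, int q) {
    assert(r >= 0 && r < NUM_REGS);
    assert(q == UNMAPPED || (q >= 0 && q < NUM_QUEUES));
    c->map[r] = q;
}

/* A write consults the map table first, then enqueues or writes the file. */
void cpu_write(Cpu *c, int r, int32_t v) {
    assert(r >= 0 && r < NUM_REGS);
    int q = c->map[r];
    if (q == UNMAPPED) { c->regs[r] = v; return; }
    Queue *qu = &c->queues[q];
    assert(qu->count < QUEUE_CAP);
    qu->data[(qu->head + qu->count) % QUEUE_CAP] = v;
    qu->count++;
}

/* A read consults the map table first; a mapped read dequeues (assumption). */
int32_t cpu_read(Cpu *c, int r) {
    assert(r >= 0 && r < NUM_REGS);
    int q = c->map[r];
    if (q == UNMAPPED) return c->regs[r];
    Queue *qu = &c->queues[q];
    assert(qu->count > 0);
    int32_t v = qu->data[qu->head];
    qu->head = (qu->head + 1) % QUEUE_CAP;
    qu->count--;
    return v;
}
```

In this model an unmapped specifier such as R0 behaves like an ordinary register, while two writes to a mapped R1 followed by two reads return the values in first-in, first-out order without touching `regs[1]`.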
Software Pipelining Example
int A[1000], B[1000], C[1000];
void vmul() {
    int I;
    for (I=2; I < 1000; I++)
        B[I] = A[I] * C[I];
}

qmap r1, ,q1
qmap r12, ,q2
qmap r0, ,q3

[Figure: example queue contents. Q1: 30, 25, 15, 5. Q2: 34, 2, 1. Q3: 30, 75, 10, 5.]

Prolog: 6 loads and 2 mults

Loop:
.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    mul r0,r12,r1
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3

Epilog: 1 multiply and 3 stores
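The prolog, loop, and epilog structure can be illustrated at the C level. The sketch below is an assumption-laden model, not the authors' generated code: `DEPTH`, `vmul_pipelined`, and the arrays `qa` and `qc` are hypothetical names, and only the loads are run ahead. With `DEPTH` set to 3 the prolog performs 6 loads, which matches the count on the slide, but the slide's schedule also runs multiplies ahead of stores, and that stage is not modeled here.

```c
#include <assert.h>

#define N     1000
#define DEPTH 3      /* hypothetical number of iterations the loads run ahead */

int A[N], B[N], C[N];

/* Reference loop from the slide. */
void vmul(void) {
    for (int i = 2; i < N; i++)
        B[i] = A[i] * C[i];
}

/* Software-pipelined sketch: the loads for iteration i+DEPTH are issued
   while iteration i is multiplied and stored. The arrays qa and qc stand in
   for the register queues that hold loaded values across iterations. */
void vmul_pipelined(void) {
    int qa[DEPTH], qc[DEPTH];
    int head = 0;                 /* slot holding the oldest loaded pair */
    int i;

    assert(N - 2 >= DEPTH);       /* the sketch needs at least DEPTH iterations */

    /* Prolog: issue the loads of the first DEPTH iterations. */
    for (int k = 0; k < DEPTH; k++) {
        qa[k] = A[2 + k];
        qc[k] = C[2 + k];
    }

    /* Steady state: consume the oldest pair, then refill its slot. */
    for (i = 2; i + DEPTH < N; i++) {
        int prod = qa[head] * qc[head];
        qa[head] = A[i + DEPTH];
        qc[head] = C[i + DEPTH];
        head = (head + 1) % DEPTH;
        B[i] = prod;
    }

    /* Epilog: drain the pairs that are still queued. */
    for (; i < N; i++) {
        B[i] = qa[head] * qc[head];
        head = (head + 1) % DEPTH;
    }
}
```

Both functions should leave the same contents in `B`; the pipelined form simply separates each load from its use by `DEPTH` iterations, which is where the extra storage that the slides place in queues comes from.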
Register Usage
Loads 8x4: Register Savings Using Register Structures

Benchmark                    AR in original loop   AR needed to pipeline   AR contained in customized structures
N Real Updates               10                    10                      6
Dot Product                  9                     9                       4
Matrix Multiply              9                     9                       4
Fir                          6                     6                       4
Mac                          10                    8                       10
Fir2Dim (3 similar loops)    10                    10                      4

Loads 16x4: Register Savings Using Register Structures

Benchmark                    AR in original loop   AR needed to pipeline   AR contained in customized structures
N Real Updates               10                    10                      6
Dot Product                  9                     9                       4
Matrix Multiply              9                     9                       4
Fir                          6                     6                       4
Mac                          10                    8                       12
Fir2Dim                      10                    10                      4

Loads 32x4: Register Savings Using Register Structures

Benchmark                    AR in original loop   AR needed to pipeline   AR contained in customized structures
N Real Updates               10                    10                      9
Dot Product                  9                     9                       8
Matrix Multiply              9                     9                       8
Fir                          6                     6                       12
Mac                          10                    8                       18
Fir2Dim                      10                    10                      8
Results: multiplies at varying latency, load latency set at four
[Chart: percent cycle reduction (axis 0 to 50), in-order issue, plotted against multiply latency (2 to 32) for Dot Product, Matrix, Fir, N Real Updates, Conv45, Mac, and Fir2Dim.]

Results: loads at varying latency, multiply latency set at four
[Chart: percent cycle reduction (axis -10 to 60), in-order issue, plotted against load latency (2 to 32) for Dot Product, Matrix, Fir, N Real Updates, Conv45, Mac, and Fir2Dim.]
Conclusions
• Customized register structures reduce register pressure.
• Software pipelining is viable in resource-constrained environments.
• Performance can be improved with minor impact to the ISA.
Extras
Reference Behaviors
Stack Reference Behavior
ldr r1,[r6,r4, lsl #4]
ldr r12,[r6,r4, lsl #8]
ldr r8,[r6,r4, lsl #12]
str r8,[r3,r4, lsl #16]
str r12,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]
Application Configurable
Architecture
• Application configurable processors are designed using a mapping table similar to the register rename table found in many out-of-order implementations.
• The map table is read during every access to the architected register file.
• This serves as a method of determining whether a register specifier refers to the original architected register file or to a customized register structure.
Application Configurable
Architecture
• The customized register files are small in size, but they efficiently manage values that would otherwise require many architected registers.
• The customized register files can mimic queues, stacks, and circular buffers.
• These structures are accessed using the same register specifier that is used to access the architected register file.
Reference Behaviors

    ldr r1,[r6,r4, lsl #4]
    ldr r12,[r6,r4, lsl #8]
    ldr r8,[r6,r4, lsl #12]
    str r8,[r3,r4, lsl #16]
    str r12,[r3,r4, lsl #20]
    str r1,[r3,r4, lsl #24]

[Figure labels: r1, R8, R12]

Stack Reference Behavior (R1)

    ldr r1,[r6,r4, lsl #4]
    ldr r1,[r6,r4, lsl #8]
    ldr r1,[r6,r4, lsl #12]
    str r1,[r3,r4, lsl #16]
    str r1,[r3,r4, lsl #20]
    str r1,[r3,r4, lsl #24]

Free up r8 and r12 for use.
Qmap Instruction

[Figure labels: q0, R1, R8, R12]

    ldr r1,[r6,r4, lsl #4]
    ldr r1,[r6,r4, lsl #8]
    ldr r1,[r6,r4, lsl #12]
    str r1,[r3,r4, lsl #16]
    str r1,[r3,r4, lsl #20]
    str r1,[r3,r4, lsl #24]

Free up r8 and r12 for use.
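Why a single specifier is enough for this pattern can be seen in a small C model. It is a sketch under assumptions: `RegStack`, `rs_write`, `rs_read`, `copy3_reversed`, and the capacity are hypothetical names and values, and the model assumes that each write to the mapped register pushes and each read pops.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_CAP 4   /* hypothetical capacity of the stack structure */

typedef struct {
    int32_t slot[STACK_CAP];
    int top;          /* number of values currently held */
} RegStack;

/* A write to the mapped register pushes a value. */
void rs_write(RegStack *s, int32_t v) {
    assert(s->top < STACK_CAP);
    s->slot[s->top++] = v;
}

/* A read from the mapped register pops the most recent value. */
int32_t rs_read(RegStack *s) {
    assert(s->top > 0);
    return s->slot[--s->top];
}

/* Three loads followed by three stores in reverse order, all through one
   specifier (r1). The original code needed r1, r12, and r8 for this. */
void copy3_reversed(const int32_t *src, int32_t *dst) {
    RegStack r1 = { {0}, 0 };
    rs_write(&r1, src[0]);   /* ldr r1, ...  (value formerly kept in r1)  */
    rs_write(&r1, src[1]);   /* ldr r1, ...  (value formerly kept in r12) */
    rs_write(&r1, src[2]);   /* ldr r1, ...  (value formerly kept in r8)  */
    dst[0] = rs_read(&r1);   /* str r1, ...  (stores the last value loaded) */
    dst[1] = rs_read(&r1);   /* str r1, ... */
    dst[2] = rs_read(&r1);   /* str r1, ...  (stores the first value loaded) */
}
```

The stores come out in the same last-in, first-out order as the original `str r8`, `str r12`, `str r1` sequence, so r8 and r12 are no longer needed.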
Modulo Scheduling
• For our work we used modulo scheduling. This requires using the dependences and latencies of the loop instructions to generate a modulo-scheduled loop.
• The prolog and epilog are then built from this schedule.
• The prolog and epilog require register renaming of loop-carried dependences to ensure a correct loop.
• Renaming in embedded processors is often not possible.
Register Renaming due to software
pipelining
• Renaming doesn't work: not enough registers.
• Rotating registers would require a significant rewrite of the embedded ISA.
• The loop-carried values can simply be mapped into a register queue that holds each value across several iterations.
Results Register Savings
• As latency grows for the instructions, more iterations of the loop are extracted to spread out the latency.
• The extra registers that would be required to perform renaming have been measured at 25% to 200% of the available registers in the ARM.
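The link between latency and register demand can be made concrete with the usual textbook rule for modulo-scheduled loops: a value that stays live for more cycles than the initiation interval overlaps with later copies of itself, so it needs one register per overlapping iteration. The helper below is a sketch of that general rule only; `copies_needed` and its parameter names are hypothetical, and the slides do not state this formula or how the reported percentages were computed.

```c
#include <assert.h>

/* Number of simultaneously live copies of one value in a modulo-scheduled
   loop: the ceiling of its lifetime divided by the initiation interval. */
int copies_needed(int lifetime_cycles, int initiation_interval) {
    assert(lifetime_cycles > 0);
    assert(initiation_interval > 0);
    return (lifetime_cycles + initiation_interval - 1) / initiation_interval;
}
```

For example, with a hypothetical initiation interval of 4 cycles, a loaded value that is live for 32 cycles would need `copies_needed(32, 4)` registers under renaming, whereas a value live for 4 cycles needs one. A register queue absorbs these copies behind a single specifier.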