Compiler and Optimization

ECE 454
Computer Systems Programming
Compiler and Optimization (II)
Ding Yuan
ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan
Content
• Compiler Basics (last lec.)
• Understanding Compiler Optimization (last lec.)
• Manual Optimization
• Advanced Optimizations
• Parallel Unrolling
• Profile-Directed Feedback
Ding Yuan, ECE454
Manual Optimization
Example: Vector Sum Function
Vector Procedures

[Figure: a vec_ptr points to a vector object holding a length field and a data array of elements data[0] … data[length-1].]
• Procedures:
vec_ptr new_vec(int len)
• Create vector of specified length
int get_vec_element(vec_ptr v, int index, int *dest)
• Retrieve vector element, store at *dest
• Return 0 if out of bounds, 1 if successful
int *get_vec_start(vec_ptr v)
• Return pointer to start of vector data
int vec_length(vec_ptr v)
• Return the number of elements in the vector
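The slides do not show the vector implementation itself; a minimal sketch consistent with the interface above (the struct layout and field names are our assumptions) might look like:

```c
#include <stdlib.h>

typedef struct {
    int length;
    int *data;
} vec_rec, *vec_ptr;

/* Create vector of specified length (zero-initialized); NULL on failure */
vec_ptr new_vec(int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v) return NULL;
    v->length = len;
    v->data = calloc(len, sizeof(int));
    if (!v->data) { free(v); return NULL; }
    return v;
}

int vec_length(vec_ptr v) { return v->length; }

/* Retrieve element; return 0 if out of bounds, 1 if successful */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->length)
        return 0;
    *dest = v->data[index];
    return 1;
}

int *get_vec_start(vec_ptr v) { return v->data; }
```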
Vector Sum Function: Original Code
void vsum(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
• What it does:
• Add all the vector elements together
• Store the resulting sum at destination location *dest
• Performance Metric: CPU Cycles per Element (CPE)
• CPE = 42.06 (Compiled -g)
After Compiler Optimization
void vsum(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

// same as vsum
void vsum1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
• Impact of compiler optimization
• vsum:
• CPE = 42.06 (Compiled -g)
• vsum1: (a C-code view of the resulting assembly instructions)
• CPE = 31.25 (compiled -O2; we compile with -O2 from now on)
• Result: better scheduling & reg alloc, lots of missed opportunities
Optimization Blocker: Procedure Calls
• Why didn’t the compiler move vec_length() out of the loop?
• The procedure may have side effects
• The function may not return the same value for given arguments
• It may depend on other parts of global state
• Why doesn’t the compiler look at the code for vec_length()?
• The linker may interpose a different version
• Unless the function is declared static
• Interprocedural optimization is not used extensively due to its cost
• Warning:
• Compiler treats procedure call as a black box
• Weak optimizations in and around them
Manual LICM
void vsum1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

void vsum2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
• CPE = 20.66
• Next opportunity: inlining get_vec_element()?
• The compiler may not have thought it worth it
• At -O2 the compiler weighs the code-size increase against the expected performance gain
Manual Inlining
void vsum2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

void vsum3i(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        int *data = get_vec_start(v);
        val = data[i];
        *dest += val;
    }
}
• Inlining: replace a function call with its body
• shouldn’t normally have to do this manually
• could also convince the compiler to do it via directives
• Inlining enables further optimizations
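One way to nudge the compiler via directives is C's `inline` plus, in GCC, the `always_inline` attribute; a sketch (function names and the GCC-specific attribute usage are our illustration, not from the slides):

```c
/* A hint: the compiler may still decline to inline */
static inline int get_elem(const int *data, int i)
{
    return data[i];
}

/* GCC-specific: force inlining even when the heuristic says no */
__attribute__((always_inline))
static inline int get_elem_forced(const int *data, int i)
{
    return data[i];
}
```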
Further LICM, val unnecessary
void vsum3i(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        int *data = get_vec_start(v);
        val = data[i];
        *dest += val;
    }
}

void vsum3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

• CPE = 6.00
• The compiler may not always inline when it should
• It may not know how important a certain loop is
• It may be blocked by a missing function definition
Reduce Unnecessary Memory Refs
void vsum3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

void vsum4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
• Don’t need to store in destination until end
• Can use local variable sum instead
• It is more likely to be allocated to a register!
• Avoids 1 memory read, 1 memory write per iteration
• CPE = 2.00
• Difficult for the compiler to promote pointer de-refs to registers
Why are pointers so hard for compilers?
• Pointer Aliasing
• Two different pointers might point to a single location
• Example:
int x = 5, y = 10;
int *p = &x;
int *q = &y;
if (one_in_a_million) {
    q = p;
}
// From here on, the compiler must assume q may equal p (i.e., point
// to x), however unlikely that is
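C99's `restrict` qualifier is one way to promise the compiler that pointers do not alias, so it may keep `*dest` in a register across the loop. A sketch (the function name is ours; behavior is undefined if the promise is broken):

```c
/* Promise: dest and data never alias, so the compiler is free to
 * accumulate in a register and store to *dest only once. */
void vsum_restrict(int * restrict dest, const int * restrict data, int n)
{
    int i;
    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += data[i];
}
```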
Minimizing the impact of Pointers
• Easy to over-use pointers in C/C++
• Since allowed to do address arithmetic
• Direct access to storage structures
• Get in habit of introducing local variables
• E.g., when accumulating within loops
• Your way of telling compiler not to worry about aliasing
Understanding Instruction-Level
Parallelism: Unrolling and
Software Pipelining
Superscalar CPU Design
[Figure: block diagram of a superscalar CPU. An instruction-control unit (fetch control, instruction cache, instruction decode, retirement unit, register file) issues operations to the execution stage, which contains multiple functional units (integer/branch, general integer, FP add, FP mult/div, load, store) connected to the data cache; operation results and register updates flow back, and branch-prediction outcomes are checked ("Prediction OK?").]
Assumed CPU Capabilities
• Multiple instructions can execute in parallel:
• 1 load
• 1 store
• 2 integer (one may be a branch)
• 1 FP addition
• 1 FP multiplication or division
• Some instructions take > 1 cycle, but can be pipelined:

Instruction                 Latency   Cycles/Issue
Load / Store                   3           1
Integer Multiply               4           1
Double/Single FP Multiply      5           2
Double/Single FP Add           3           1
Integer Divide                36          36
Double/Single FP Divide       38          38
Intel Core i7
Zooming into x86 assembly
void vsum4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

Register mapping: i => %edx, length => %esi, data => %eax, sum => %ecx

Loop code only:
.L24:
    addl (%eax,%edx,4),%ecx   # sum += data[i]  (address data + i*4)
    incl %edx                 # i++
    cmpl %esi,%edx            # compare i and length
    jl .L24                   # if i < length goto .L24
Note: addl includes a load!

addl (%eax,%edx,4),%ecx

[Figure: dataflow diagram of one iteration. The addl is split into a load (3 cycles) feeding an addl that updates %ecx (sum); incl updates %edx (i) in 1 cycle, then cmpl produces condition codes cc.1 consumed by jl.]
Visualizing the vsum Loop

.L24:
    addl (%eax,%edx,4),%ecx   # sum += data[i]  (address data + i*4)
    incl %edx                 # i++
    cmpl %esi,%edx            # compare i and length
    jl .L24                   # if i < length goto .L24

[Figure: dataflow graph of one iteration — the load feeds an addl producing %ecx.1 (sum); incl produces %edx.1 (i); cmpl/jl consume %edx.1.]

• Operations
• Vertical position denotes the time at which an operation executes
• An operation cannot begin until its operands are available
• Height denotes latency
• addl is split into a load and an addl
• E.g., x86 converts instructions to micro-ops
4 Iterations of vsum

[Figure: the per-iteration dataflow graph replicated four times (i = 0..3), with the loads of successive iterations overlapping in time; about 7 cycles cover all four iterations.]

• Unlimited resource analysis
• Assume an operation can start as soon as its operands are available
• Operations for multiple iterations overlap in time
• CPE = 1.0
• But this requires issuing 4 integer ops per cycle
• Performance on the assumed CPU:
• CPE = 2.0
• The number of parallel integer ops (2) is the bottleneck
Loop Unrolling: vsum
void vsum5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* combine 3 elements at a time */
    for (i = 0; i < limit; i += 3) {
        sum += data[i];
        sum += data[i+1];
        sum += data[i+2];
    }
    /* finish extras at the end */
    for (; i < length; i++)
        sum += data[i];
    *dest = sum;
}

• Optimization
• Combine multiple iterations into a single loop body
• Amortizes loop overhead across multiple iterations
• Finish the extras at the end
• Measured CPE = 1.33
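The same 3x unrolling over a plain array, without the vec_ptr wrapper, can be sketched as a self-contained function (the name is ours):

```c
/* 3x-unrolled sum; the second loop finishes the 0-2 leftover elements */
int sum_unroll3(const int *data, int length)
{
    int i, sum = 0;
    int limit = length - 2;     /* largest i for which i+2 is in bounds */
    for (i = 0; i < limit; i += 3) {
        sum += data[i];
        sum += data[i+1];
        sum += data[i+2];
    }
    for (; i < length; i++)     /* finish extras */
        sum += data[i];
    return sum;
}
```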
Visualizing Unrolled Loop: vsum

[Figure: dataflow graph of one unrolled iteration — three loads feed three chained addls into %ecx (producing %ecx.1a, %ecx.1b, %ecx.1c), while a single addl/cmpl/jl chain handles loop control.]

• Loads can pipeline, since they have no dependences
• Only one set of loop-control operations per three elements
Executing with Loop Unrolling: vsum

[Figure: iterations 3 (i = 6) and 4 (i = 9) of the unrolled loop overlapped in time across cycles 5-15; the loads of iteration 4 begin while iteration 3's adds are still completing.]

• Predicted performance
• Can complete one unrolled iteration (3 elements) in 3 cycles
• Should give a CPE of 1.0
Why 3 times?

[Figure: dataflow graphs comparing 3x unrolling with 2x unrolling shown across consecutive iterations.]

Unroll 2 times: what is the problem?
Vector multiply function: vprod

void vprod4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int prod = 1;
    for (i = 0; i < length; i++)
        prod *= data[i];
    *dest = prod;
}

Loop code only:
.L24:
    imull (%eax,%edx,4),%ecx  # prod *= data[i]
    incl %edx                 # i++
    cmpl %esi,%edx            # i : length
    jl .L24                   # if < goto Loop

[Figure: dataflow graph of one iteration — the load feeds an imull updating %ecx (prod); incl/cmpl/jl form the loop-control chain.]
3 Iterations of vprod

[Figure: three iterations (i = 0..2) overlapped in time across cycles 1-15; the imulls form a serial chain, each waiting 4 cycles for the previous product.]

• Unlimited resource analysis
• Assume an operation can start as soon as its operands are available
• Operations for multiple iterations overlap in time
• Performance on the CPU:
• The limiting factor becomes the latency of the integer multiplier
• Data hazard: each imull needs the previous product
• Gives a CPE of 4.0
Unrolling: Performance Summary
          Relevant FU          Unrolling Degree
          Latency        1     2     3     4     8     16
vsum      1 cycle      2.00  1.50  1.33  1.50  1.25  1.06
vprod     4 cycles     4.00
vFPsum    3 cycles     3.00
vFPprod   5 cycles     5.00

• Only helps the integer sum for our examples
• Other cases are constrained by functional-unit latencies
• The effect is nonlinear with the degree of unrolling
• Many subtle effects determine the exact scheduling of operations

Can we do better for vprod?
Visualizing vprod

• Computation (a single serial chain of multiplies):
((((((((((((1 * x0) * x1) * x2) * x3) * x4)
  * x5) * x6) * x7) * x8) * x9) * x10) * x11)
• Performance
• N elements, D cycles/operation
• N*D cycles = N * 4
Parallel Unrolling Method1: vprod
void vprod5pu1(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    int *data = get_vec_start(v);
    int x = 1;
    int i;
    for (i = 0; i < limit; i += 2) {
        x = x * (data[i] * data[i+1]);
    }
    if (i < length) // fix-up
        x *= data[i];
    *dest = x;
}

• Code version: integer product
• Optimization
• Multiply pairs of elements together, then update the product
• Performance
• CPE = 2

Assume unlimited hardware resources from now on.
[Figure: dataflow graph of one vprod5pu1 iteration — two loads feed a mull computing tmp = data[i] * data[i+1], then a second mull computes x = x * tmp; loop control (addl/cmpl/jl) runs alongside.]

• We still have the data dependency between multiplications, but now it is one dependence every two elements
Order of Operations Can Matter!

/* Combine 2 elements at a time */
for (i = 0; i < limit; i += 2) {
    x = (x * data[i]) * data[i+1];
}

• CPE = 4.00
• All multiplies performed in sequence: a single serial chain 1, x0, x1, ..., x11

/* Combine 2 elements at a time */
for (i = 0; i < limit; i += 2) {
    x = x * (data[i] * data[i+1]);
}

• CPE = 2
• Multiplies overlap: the pair products (x0 * x1), (x2 * x3), ... are independent of the running product
• The compiler can do this one for you
Parallel Unrolling Method2: vprod
void vprod6pu2(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    if (i < length) // fix-up
        x0 *= data[i];
    *dest = x0 * x1;
}

• Code version: integer product
• Optimization
• Accumulate in two different products, which can be computed simultaneously
• Combine at the end
• Performance
• CPE = 2.0
• 2X performance

The compiler won’t do this one for you.
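The two-accumulator idea over a plain array can be sketched as a standalone function (the name is ours); for integer multiplication the reassociation is exact, so the result matches the sequential product:

```c
/* Two independent products -> two parallel dependency chains */
int prod_unroll2x2(const int *data, int length)
{
    int x0 = 1, x1 = 1;
    int i;
    int limit = length - 1;     /* largest i for which i+1 is in bounds */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];          /* even-indexed elements */
        x1 *= data[i+1];        /* odd-indexed elements */
    }
    if (i < length)             /* fix-up for odd length */
        x0 *= data[i];
    return x0 * x1;             /* combine at the end */
}
```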
Method 2: vprod6pu2 visualized

for (i = 0; i < limit; i += 2) {
    x0 *= data[i];
    x1 *= data[i+1];
}
…
*dest = x0 * x1;

[Figure: two independent multiply chains — one over x0, x2, x4, ..., x10 and one over x1, x3, x5, ..., x11 — joined by a final multiply.]
Trade-off for loop unrolling
• Benefits
• Reduces loop overhead (e.g., the condition check & loop-variable increment)
• This is why the integer add (vsum5) can achieve a CPE of 1
• Enhances parallelism (often requires manually rewriting the loop body)
• Harm
• Increased code size
• Register pressure
• Needs lots of registers to hold sums/products
• IA32: only 6 usable integer registers, 8 FP registers
• When there are not enough registers, temporaries must spill onto the stack, wiping out any performance gains
Needed for Effective Parallel Unrolling:
• Mathematical
• Combining operation must be associative & commutative
• OK for integer multiplication
• Not strictly true for floating point
• OK for most applications
• Hardware
• Pipelined functional units, superscalar execution
• Lots of Registers to hold sums/products
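The floating-point caveat above can be demonstrated directly: a minimal sketch, assuming IEEE double arithmetic (the function name is ours):

```c
/* Returns 1 if the two association orders give different results.
 * At 1e16 the spacing between adjacent doubles is 2.0, so adding 1.0
 * twice rounds away both times, while adding (1.0 + 1.0) = 2.0 lands
 * exactly on a representable value. */
int fp_reassoc_differs(void)
{
    double big = 1e16, one = 1.0;
    double left  = (big + one) + one;   /* each add rounds the 1.0 away */
    double right = big + (one + one);   /* the 2.0 survives */
    return left != right;
}
```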
Summary
Summary of Optimizations
Version    Optimization        Applied to          Manual or GCC     CPE
vsum       -g                                      GCC             42.06
vsum1      -O2                                     GCC             31.25
vsum2      LICM                vec_length()        manual          20.66
vsum3      inlining + LICM     get_vec_element()   manual           6.00
vsum4      mem-ref reduction   *dest               manual           2.00
vsum5      unrolling (3x)      for loop            -funroll-loops*  1.33
vprod4     (same as vsum4)                                          4.00
vprod5pu1  par. unrolling 1    for loop            manual           2.0
vprod6pu2  par. unrolling 2    for loop            manual           2.0
Punchlines
• Before you start: will optimization be worth the effort?
• profile to estimate potential impact
• Get the most out of your compiler before going manual
• trade-off: manual optimization vs. readable/maintainable code
• Exploit compiler opts for readable/maintainable code
• have lots of functions/modularity
• debugging/tracing code (enabled by static flags)
• Limit use of pointers
• pointer-based arrays and pointer arithmetic
• function pointers and virtual functions (unfortunately)
• For super-performance-critical code:
• zoom into assembly to look for optimization opportunities
• consider the instruction-parallelism capabilities of CPU
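The "debugging/tracing code (enabled by static flags)" point can be sketched as a compile-time switch that the optimizer removes entirely when disabled (the macro and function names here are our illustration):

```c
#include <stdio.h>

#ifndef DEBUG
#define DEBUG 0   /* build with -DDEBUG=1 to enable tracing */
#endif

/* When DEBUG is 0, the if-body is dead code the compiler deletes,
 * so tracing costs nothing in production builds. */
#define dbg_print(...) \
    do { if (DEBUG) fprintf(stderr, __VA_ARGS__); } while (0)

int checked_div(int a, int b)
{
    dbg_print("checked_div(%d, %d)\n", a, b);
    return b ? a / b : 0;
}
```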
How to know what the compiler did?
run: objdump -d a.out
• prints a listing of instructions, with instruction address and encoding

void main() {
    int i = 5;
    int x = 10;
    …

(first column: instruction address; middle: encoding bytes)

08048394 <main>:
 8048394:  55                      push  %ebp
 8048395:  89 e5                   mov   %esp,%ebp
 8048397:  83 ec 10                sub   $0x10,%esp
 804839a:  c7 45 f8 05 00 00 00    movl  $0x5,-0x8(%ebp)
 80483a1:  c7 45 fc 0a 00 00 00    movl  $0xa,-0x4(%ebp)
Advanced Topics
• Profile-Directed Feedback (PDF)
• Basically feeding gprof-like measurements into compiler
• Allows compiler to make smarter choices/optimizations
• E.g., handle the "if (one_in_a_million)" case well
• Just-in-Time Compilation and Optimization (JIT)
• Performs simple version of many of the basic optimizations
• Does them dynamically at run-time, exploits online PDF
• E.g., Java JIT
• Current hot topics in compilation (more later):
• Automatic parallelization (exploiting multicores)