15-213
“The course that gives CMU its Zip!”

Code Optimization II:
Machine-Dependent Optimizations
Feb 18, 2003
Topics
• Machine-Dependent Optimizations
• Pointer code
• Unrolling
• Enabling instruction level parallelism
• Understanding Processor Operation
• Translation of instructions into operations
• Out-of-order execution of operations
• Branches and Branch Prediction
• Advice
class11.ppt
Previous Best Combining Code
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
Task
• Compute sum of all elements in vector
• Vector represented by C-style abstract data type
• Achieved CPE of 2.00
• Cycles per element
–2–
15-213, S’03
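For context, combine4 depends on the course’s vector abstract data type. A minimal self-contained sketch of that interface (the struct layout and helper bodies here are assumptions for illustration; the CS:APP code base also provides creation, access, and bounds-checking routines):

```c
#include <assert.h>

/* Hypothetical layout of the vector ADT; the real CS:APP version
   differs in details and adds allocation/bounds-checking helpers. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

int vec_length(vec_ptr v)     { return v->len; }
int *get_vec_start(vec_ptr v) { return v->data; }

/* Previous best combining code: call vec_length once outside the
   loop, read memory once per element, accumulate in a local. */
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

Summing {1, 2, 3, 4, 5} this way yields 15; the lecture’s question is how many cycles per element it costs.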
General Forms of Combining
void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}
Data Types
• Use different declarations for data_t
  • int
  • float
  • double

Operations
• Use different definitions of OP and IDENT
  • + / 0
  • * / 1
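One way to realize the data_t/OP/IDENT parameterization is with preprocessor definitions, here instantiated for integer product (the build mechanics are an assumption; the course generates one variant per data type/operation combination):

```c
#include <assert.h>

typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;   /* same hypothetical ADT layout as before */

static int vec_length(vec_ptr v)      { return v->len; }
static int *get_vec_start(vec_ptr v)  { return v->data; }

/* Instantiate the general form for integer product:
   data_t = int, OP = *, IDENT = 1 */
typedef int data_t;
#define IDENT 1
#define OP *

void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}
```

Swapping the two #defines to `IDENT 0` and `OP +` produces the summation variant from the previous slide.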
Machine Independent Opt. Results
Optimizations
• Reduce function calls and memory references within loop
Method            Integer +   Integer *   FP +    FP *
Abstract -g        42.06       41.86      41.44   160.00
Abstract -O2       31.25       33.25      31.25   143.00
Move vec_length    20.66       21.25      21.15   135.00
data access         6.00        9.00       8.00   117.00
Accum. in temp      2.00        4.00       3.00     5.00
Performance Anomaly
• Computing FP product of all elements exceptionally slow
• Very large speedup when accumulating in temporary
• Caused by quirk of IA32 floating point
  • Memory uses 64-bit format, registers use 80
  • Benchmark data caused overflow of 64 bits, but not 80
Pointer Code
void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data+length;
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}
Optimization
• Use pointers rather than array references
• CPE: 3.00 (Compiled -O2)
• Oops! We’re not making progress here!
Warning: Some compilers do a better job of optimizing
array code
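Whichever version a given compiler optimizes better, the two loops must agree on every input; a quick sanity check of that equivalence (the vector helpers are reduced to plain arrays here for self-containment):

```c
#include <assert.h>

/* Array-style sum, as in combine4 */
int sum_array(int *data, int length)
{
    int i, sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    return sum;
}

/* Pointer-style sum, as in combine4p */
int sum_pointer(int *data, int length)
{
    int *dend = data + length;   /* one past the last element */
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    return sum;
}
```

Both return 10 for {1, 2, 3, 4}; the only difference is the code the compiler generates for the inner loop.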
Pointer vs. Array Code Inner Loops
Array Code

.L24:                          # Loop:
  addl (%eax,%edx,4),%ecx      #   sum += data[i]
  incl %edx                    #   i++
  cmpl %esi,%edx               #   i:length
  jl .L24                      #   if < goto Loop

Pointer Code

.L30:                          # Loop:
  addl (%eax),%ecx             #   sum += *data
  addl $4,%eax                 #   data++
  cmpl %edx,%eax               #   data:dend
  jb .L30                      #   if < goto Loop
Performance
• Array Code: 4 instructions in 2 clock cycles
• Pointer Code: Almost same 4 instructions in 3 clock cycles
Modern CPU Design
[Block diagram: the Instruction Control Unit (fetch control, instruction decode, instruction cache, retirement unit, register file) sends addresses to the instruction cache, receives instructions, and issues operations to the Execution unit. The Execution unit contains the functional units (integer/branch, general integer, FP add, FP mult/div, load, store), which exchange addresses and data with the data cache; operation results, register updates, and prediction-OK signals flow back to instruction control.]
CPU Capabilities of Pentium III
Multiple Instructions Can Execute in Parallel
• 1 load
• 1 store
• 2 integer (one may be branch)
• 1 FP Addition
• 1 FP Multiplication or Division
Some Instructions Take > 1 Cycle, but Can be Pipelined

Instruction                 Latency   Cycles/Issue
Load / Store                   3           1
Integer Multiply               4           1
Integer Divide                36          36
Double/Single FP Multiply      5           2
Double/Single FP Add           3           1
Double/Single FP Divide       38          38
Instruction Control
[Diagram: the Instruction Control Unit (fetch control, instruction decode, instruction cache, retirement unit, register file) sends addresses to the instruction cache, receives instruction bytes, and emits operations.]
Grabs Instruction Bytes From Memory
• Based on current PC + predicted targets for predicted branches
• Hardware dynamically guesses whether branches taken/not taken and
(possibly) branch target
Translates Instructions Into Operations
• Primitive steps required to perform instruction
• Typical instruction requires 1–3 operations
Converts Register References Into Tags
• Abstract identifier linking destination of one operation with sources of later
operations
Translation Example
Version of Combine4
• Integer data, multiply operation
.L24:                          # Loop:
  imull (%eax,%edx,4),%ecx     #   t *= data[i]
  incl %edx                    #   i++
  cmpl %esi,%edx               #   i:length
  jl .L24                      #   if < goto Loop
Translation of First Iteration

.L24:
  imull (%eax,%edx,4),%ecx  →  load (%eax,%edx.0,4)  →  t.1
                               imull t.1, %ecx.0     →  %ecx.1
  incl %edx                 →  incl %edx.0           →  %edx.1
  cmpl %esi,%edx            →  cmpl %esi, %edx.1     →  cc.1
  jl .L24                   →  jl-taken cc.1
Translation Example #1
imull (%eax,%edx,4),%ecx
• Split into two operations
load (%eax,%edx.0,4)  →  t.1
imull t.1, %ecx.0     →  %ecx.1
• load reads from memory to generate temporary result t.1
• Multiply operation just operates on registers
• Operands
  • Register %eax does not change in loop. Its value will be
    retrieved from the register file during decoding
  • Register %ecx changes on every iteration. Uniquely identify
    different versions as %ecx.0, %ecx.1, %ecx.2, …
    • Register renaming
    • Values passed directly from producer to consumers
Translation Example #2
incl %edx
incl %edx.0  →  %edx.1
• Register %edx changes on each iteration. Rename as
%edx.0, %edx.1, %edx.2, …
Translation Example #3
cmpl %esi,%edx
cmpl %esi, %edx.1  →  cc.1
• Condition codes are treated similarly to registers
• Assign tag to define connection between producer and
consumer
Translation Example #4
jl .L24
jl-taken cc.1
• Instruction control unit determines destination of jump
• Predicts whether it will be taken, and the target
• Starts fetching instruction at predicted destination
• Execution unit simply checks whether or not prediction
was OK
• If not, it signals instruction control
• Instruction control then “invalidates” any operations generated
from misfetched instructions
• Begins fetching and decoding instructions at correct target
Visualizing Operations
[Dataflow diagram: %edx.0 feeds both load and incl; incl produces %edx.1, which feeds cmpl (producing cc.1) and then jl; load produces t.1, which together with %ecx.0 feeds imull, producing %ecx.1.]

load (%eax,%edx.0,4)  →  t.1
imull t.1, %ecx.0     →  %ecx.1
incl %edx.0           →  %edx.1
cmpl %esi, %edx.1     →  cc.1
jl-taken cc.1

Operations
• Vertical position denotes time at which executed
  • Cannot begin operation until operands available
• Height denotes latency

Operands
• Arcs shown only for operands that are passed within execution unit
Visualizing Operations (cont.)
[Dataflow diagram: same structure as the multiply version, but the add (latency 1) completes in a single cycle, so the accumulator chain %ecx.0 → %ecx.1 is much shorter.]

load (%eax,%edx.0,4)  →  t.1
iaddl t.1, %ecx.0     →  %ecx.1
incl %edx.0           →  %edx.1
cmpl %esi, %edx.1     →  cc.1
jl-taken cc.1

Operations
• Same as before, except that add has latency of 1
3 Iterations of Combining Product
[Pipeline diagram, cycles 1–15: operations for iterations 1–3 (i = 0, 1, 2). The incl/load/cmpl/jl operations of successive iterations overlap freely, but the imull operations form a serial chain %ecx.0 → %ecx.1 → %ecx.2 → %ecx.3, one every 4 cycles.]

Unlimited Resource Analysis
• Assume operation can start as soon as operands available
• Operations for multiple iterations overlap in time

Performance
• Limiting factor becomes latency of integer multiplier
• Gives CPE of 4.0
4 Iterations of Combining Sum
[Pipeline diagram, cycles 1–7: operations for iterations 1–4 (i = 0..3). With the unit-latency addl, each iteration completes one cycle after its predecessor, so iterations overlap fully.]

Unlimited Resource Analysis
• Requires 4 integer ops per cycle

Performance
• Can begin a new iteration on each clock cycle
• Should give CPE of 1.0
• Would require executing 4 integer operations in parallel
Combining Sum: Resource Constraints
[Pipeline diagram, cycles 6–18: iterations 4–8 (i = 3..7). With only two integer functional units, some incl/cmpl/jl operations must wait even though their operands are ready.]

• Only have two integer functional units
• Some operations delayed even though operands available
• Set priority based on program order

Performance
• Sustain CPE of 2.0
Loop Unrolling
void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i+=3) {
        sum += data[i] + data[i+1]
             + data[i+2];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}
Optimization
• Combine multiple
iterations into single
loop body
• Amortizes loop
overhead across
multiple iterations
• Finish extras at end
• Measured CPE = 1.33
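The subtle part of combine5 is the boundary: the main loop runs only while i < length-2, so data[i+2] is always in bounds, and the cleanup loop picks up the 0–2 leftover elements. A self-contained sketch of the same logic over a plain array:

```c
#include <assert.h>

/* 3-way unrolled sum with a cleanup loop for lengths
   not divisible by 3 (same structure as combine5). */
int combine5_sum(int *data, int length)
{
    int limit = length - 2;   /* last start where data[i+2] is valid */
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i += 3)
        sum += data[i] + data[i+1] + data[i+2];
    /* Finish any remaining elements */
    for (; i < length; i++)
        sum += data[i];
    return sum;
}
```

Testing lengths 7, 6, 1, and 0 exercises all remainder cases, including the empty vector (limit goes negative and both loops are skipped).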
Visualizing Unrolled Loop
• Loads can pipeline, since they don’t have dependencies
• Only one set of loop control operations

[Dataflow diagram: the three loads issue independently; each iaddl extends the accumulator chain %ecx.0c → %ecx.1a → %ecx.1b → %ecx.1c, while a single iaddl/cmpl/jl sequence handles loop control.]

load (%eax,%edx.0,4)   →  t.1a
iaddl t.1a, %ecx.0c    →  %ecx.1a
load 4(%eax,%edx.0,4)  →  t.1b
iaddl t.1b, %ecx.1a    →  %ecx.1b
load 8(%eax,%edx.0,4)  →  t.1c
iaddl t.1c, %ecx.1b    →  %ecx.1c
iaddl $3,%edx.0        →  %edx.1
cmpl %esi, %edx.1      →  cc.1
jl-taken cc.1
Executing with Loop Unrolling
[Pipeline diagram, cycles 5–15: iterations 3 and 4 (i = 6, i = 9). The three loads of an iteration pipeline, but the serial addl chain into the single accumulator limits throughput.]

Predicted Performance
• Can complete iteration in 3 cycles
• Should give CPE of 1.0

Measured Performance
• CPE of 1.33
• One iteration every 4 cycles
Effect of Unrolling
Unrolling Degree    1      2      3      4      8      16
Integer Sum         2.00   1.50   1.33   1.50   1.25   1.06
Integer Product     4.00 (unchanged at all degrees)
FP Sum              3.00 (unchanged at all degrees)
FP Product          5.00 (unchanged at all degrees)
• Only helps integer sum for our examples
• Other cases constrained by functional unit latencies
• Effect is nonlinear with degree of unrolling
• Many subtle effects determine exact scheduling of operations
Serial Computation

Computation
((((((((((((1 * x0) * x1) * x2) * x3) * x4) * x5) * x6) * x7) * x8) * x9) * x10) * x11)

[Figure: a single chain of * nodes folding 1, x0, x1, …, x11 one element at a time.]

Performance
• N elements, D cycles/operation
• N*D cycles
Parallel Loop Unrolling
void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}
Code Version
• Integer product
Optimization
• Accumulate in two
different products
• Can be performed
simultaneously
• Combine at end
Performance
• CPE = 2.0
• 2X performance
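The same boundary care applies to combine6, plus the final recombination of the two accumulators; a plain-array sketch of the identical logic:

```c
#include <assert.h>

/* 2-way parallel product: even-indexed and odd-indexed elements
   accumulate in independent products, combined at the end. */
int combine6_prod(int *data, int length)
{
    int limit = length - 1;   /* last start where data[i+1] is valid */
    int x0 = 1, x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining element */
    for (; i < length; i++)
        x0 *= data[i];
    return x0 * x1;
}
```

For {1, 2, 3, 4, 5} the even chain accumulates 1*3*5 = 15, the odd chain 2*4 = 8, and the final multiply gives 120.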
Dual Product Computation

Computation
((((((1 * x0) * x2) * x4) * x6) * x8) * x10) *
((((((1 * x1) * x3) * x5) * x7) * x9) * x11)

[Figure: two independent chains of * nodes, one over the even-indexed elements and one over the odd-indexed, joined by a final multiply.]

Performance
• N elements, D cycles/operation
• (N/2+1)*D cycles
• ~2X performance improvement
Requirements for Parallel Computation
Mathematical
• Combining operation must be associative & commutative
• OK for integer multiplication
• Not strictly true for floating point
• OK for most applications
Hardware
• Pipelined functional units
• Ability to dynamically extract parallelism from code
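The floating-point caveat is concrete: reassociating a sum (or product) can change the rounded result, which is why a compiler will not apply this transformation to FP code on its own. A minimal demonstration:

```c
#include <assert.h>

/* Floating-point addition is not associative: the two groupings
   of the same three values round differently, because a small
   addend far below one ulp of 1e20 vanishes when added to it. */
double assoc_left(double a, double b, double c)  { return (a + b) + c; }
double assoc_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1e20, b = -1e20, c = 3.14: the left grouping computes (1e20 - 1e20) + 3.14 = 3.14 exactly, while in the right grouping b + c rounds back to -1e20, so the result is 0.0.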
Visualizing Parallel Loop
• Two multiplies within loop no longer have data dependency
• Allows them to pipeline

[Dataflow diagram: the two loads feed independent imull chains, one through %ecx and one through %ebx, with a single iaddl/cmpl/jl sequence for loop control.]

load (%eax,%edx.0,4)   →  t.1a
imull t.1a, %ecx.0     →  %ecx.1
load 4(%eax,%edx.0,4)  →  t.1b
imull t.1b, %ebx.0     →  %ebx.1
iaddl $2,%edx.0        →  %edx.1
cmpl %esi, %edx.1      →  cc.1
jl-taken cc.1
Executing with Parallel Loop
[Pipeline diagram, cycles 1–16: iterations 1–3 (i = 0, 2, 4). The %ecx and %ebx multiply chains interleave, so the 4-cycle multiplier starts a new imull every 2 cycles.]

Predicted Performance
• Can keep 4-cycle multiplier busy performing two simultaneous multiplications
• Gives CPE of 2.0
Optimization Results for Combining
Method             Integer +   Integer *   FP +    FP *
Abstract -g         42.06       41.86      41.44   160.00
Abstract -O2        31.25       33.25      31.25   143.00
Move vec_length     20.66       21.25      21.15   135.00
data access          6.00        9.00       8.00   117.00
Accum. in temp       2.00        4.00       3.00     5.00
Pointer              3.00        4.00       3.00     5.00
Unroll 4             1.50        4.00       3.00     5.00
Unroll 16            1.06        4.00       3.00     5.00
2X2                  1.50        2.00       2.00     2.50
4X4                  1.50        2.00       1.50     2.50
8X4                  1.25        1.25       1.50     2.00
Theoretical Opt.     1.00        1.00       1.00     2.00
Worst : Best        39.7        33.5       27.6     80.0
Parallel Unrolling: Method #2
void combine6aa(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    int *data = get_vec_start(v);
    int x = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x *= (data[i] * data[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x *= data[i];
    }
    *dest = x;
}
Code Version
• Integer product
Optimization
• Multiply pairs of
elements together
• And then update
product
• “Tree height reduction”
Performance
• CPE = 2.5
Method #2 Computation
Computation
((((((1 * (x0 * x1)) * (x2 * x3)) * (x4 * x5)) * (x6 * x7)) * (x8 * x9)) * (x10 * x11))

[Figure: adjacent pairs x0*x1, x2*x3, … are multiplied first, and each pair product is folded into the running product.]

Performance
• N elements, D cycles/operation
• Should be (N/2+1)*D cycles, giving CPE = 2.0
• Measured CPE worse

Unrolling   CPE (measured)   CPE (theoretical)
    2            2.50              2.00
    3            1.67              1.33
    4            1.50              1.00
    6            1.78              1.00
Understanding Parallelism

/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
    x = (x * data[i]) * data[i+1];
}

• CPE = 4.00
• All multiplies performed in sequence

[Figure: left-to-right association yields one serial chain of multiplies over 1, x0, x1, …, x11.]

/* Combine 2 elements at a time */
for (i = 0; i < limit; i+=2) {
    x = x * (data[i] * data[i+1]);
}

• CPE = 2.50
• Multiplies overlap

[Figure: reassociation computes each data[i] * data[i+1] product off the critical path before folding it into x.]
Limitations of Parallel Execution
Need Lots of Registers
• To hold sums/products
• Only 6 usable integer registers
• Also needed for pointers, loop conditions
• 8 FP registers
• When not enough registers, must spill temporaries onto stack
• Wipes out any performance gains
• Not helped by renaming
• Cannot reference more operands than instruction set allows
• Major drawback of IA32 instruction set
Register Spilling Example
Example
• 8 X 8 integer product
• 7 local variables share 1 register
• Can see that locals are being stored on the
stack
• E.g., at -8(%ebp)
.L165:
imull (%eax),%ecx
movl -4(%ebp),%edi
imull 4(%eax),%edi
movl %edi,-4(%ebp)
movl -8(%ebp),%edi
imull 8(%eax),%edi
movl %edi,-8(%ebp)
movl -12(%ebp),%edi
imull 12(%eax),%edi
movl %edi,-12(%ebp)
movl -16(%ebp),%edi
imull 16(%eax),%edi
movl %edi,-16(%ebp)
…
addl $32,%eax
addl $8,%edx
cmpl -32(%ebp),%edx
jl .L165
Summary: Results for Pentium III
Method             Integer +   Integer *   FP +    FP *
Abstract -g         42.06       41.86      41.44   160.00
Abstract -O2        31.25       33.25      31.25   143.00
Move vec_length     20.66       21.25      21.15   135.00
data access          6.00        9.00       8.00   117.00
Accum. in temp       2.00        4.00       3.00     5.00
Unroll 4             1.50        4.00       3.00     5.00
Unroll 16            1.06        4.00       3.00     5.00
4X2                  1.50        2.00       1.50     2.50
8X4                  1.25        1.25       1.50     2.00
8X8                  1.88        1.88       1.75     2.00
Worst : Best        39.7        33.5       27.6     80.0
• Biggest gain doing basic optimizations
• But, last little bit helps
Results for Alpha Processor
Method             Integer +   Integer *   FP +    FP *
Abstract -g         40.14       47.14      52.07   53.71
Abstract -O2        25.08       36.05      37.37   32.02
Move vec_length     19.19       32.18      28.73   32.73
data access          6.26       12.52      13.26   13.01
Accum. in temp       1.76        9.01       8.08    8.01
Unroll 4             1.51        9.01       6.32    6.32
Unroll 16            1.25        9.01       6.33    6.22
4X2                  1.19        4.69       4.44    4.45
8X4                  1.15        4.12       2.34    2.01
8X8                  1.11        4.24       2.36    2.08
Worst : Best        36.2        11.4       22.3    26.7
• Overall trends very similar to those for Pentium III.
• Even though very different architecture and compiler
Results for Pentium 4
Method             Integer +   Integer *   FP +    FP *
Abstract -g         35.25       35.34      35.85   38.00
Abstract -O2        26.52       30.26      31.55   32.00
Move vec_length     18.00       25.71      23.36   24.25
data access          3.39       31.56      27.50   28.35
Accum. in temp       2.00       14.00       5.00    7.00
Unroll 4             1.01       14.00       5.00    7.00
Unroll 16            1.00       14.00       5.00    7.00
4X2                  1.02        7.00       2.63    3.50
8X4                  1.01        3.98       1.82    2.00
8X8                  1.63        4.50       2.42    2.31
Worst : Best        35.2         8.9       19.7    19.0
• Higher latencies (int * = 14, fp + = 5.0, fp * = 7.0)
• Clock runs at 2.0 GHz
• Not an improvement over 1.0 GHz P3 for integer *
• Avoids FP multiplication anomaly
What About Branches?
Challenge
• Instruction Control Unit must work well ahead of Exec. Unit
• To generate enough operations to keep EU busy
Executing:
80489f3:  movl  $0x1,%ecx
80489f8:  xorl  %edx,%edx
80489fa:  cmpl  %esi,%edx

Fetching & Decoding:
80489fc:  jnl   8048a25
80489fe:  movl  %esi,%esi
8048a00:  imull (%eax,%edx,4),%ecx
• When it encounters a conditional branch, it cannot reliably determine
where to continue fetching
Branch Outcomes
• When we encounter a conditional branch, we cannot determine where to
continue fetching
• Branch Taken: Transfer control to branch target
• Branch Not-Taken: Continue with next instruction in sequence
• Cannot resolve until outcome determined by branch/integer unit
80489f3:  movl  $0x1,%ecx
80489f8:  xorl  %edx,%edx
80489fa:  cmpl  %esi,%edx
80489fc:  jnl   8048a25
80489fe:  movl  %esi,%esi              ← Branch Not-Taken continues here
8048a00:  imull (%eax,%edx,4),%ecx

Branch Taken target:
8048a25:  cmpl  %edi,%edx
8048a27:  jl    8048a20
8048a29:  movl  0xc(%ebp),%eax
8048a2c:  leal  0xffffffe8(%ebp),%esp
8048a2f:  movl  %ecx,(%eax)
Branch Prediction
Idea
• Guess which way branch will go
• Begin executing instructions at predicted position
• But don’t actually modify register or memory data
80489f3:  movl  $0x1,%ecx
80489f8:  xorl  %edx,%edx
80489fa:  cmpl  %esi,%edx
80489fc:  jnl   8048a25                ← Predict Taken
. . .

Execute at predicted target:
8048a25:  cmpl  %edi,%edx
8048a27:  jl    8048a20
8048a29:  movl  0xc(%ebp),%eax
8048a2c:  leal  0xffffffe8(%ebp),%esp
8048a2f:  movl  %ecx,(%eax)
Branch Prediction Through Loop
Assume vector length = 100

Executed:
80488b1:  movl  (%ecx,%edx,4),%eax     i = 98
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                Predict Taken (OK)

80488b1:  movl  (%ecx,%edx,4),%eax     i = 99
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                Predict Taken (Oops)

Fetched:
80488b1:  movl  (%ecx,%edx,4),%eax     i = 100  (reads invalid location)
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1

80488b1:  movl  (%ecx,%edx,4),%eax     i = 101
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1
Branch Misprediction Invalidation
Assume vector length = 100

80488b1:  movl  (%ecx,%edx,4),%eax     i = 98
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                Predict Taken (OK)

80488b1:  movl  (%ecx,%edx,4),%eax     i = 99
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                Predict Taken (Oops)

Invalidated:
80488b1:  movl  (%ecx,%edx,4),%eax     i = 100
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1

80488b1:  movl  (%ecx,%edx,4),%eax     i = 101
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
Branch Misprediction Recovery
Assume vector length = 100

80488b1:  movl  (%ecx,%edx,4),%eax     i = 98
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                Predict Taken (OK)

80488b1:  movl  (%ecx,%edx,4),%eax     i = 99
80488b4:  addl  %eax,(%edi)
80488b6:  incl  %edx
80488b7:  cmpl  %esi,%edx
80488b9:  jl    80488b1                Definitely not taken
80488bb:  leal  0xffffffe8(%ebp),%esp
80488be:  popl  %ebx
80488bf:  popl  %esi
80488c0:  popl  %edi
Performance Cost
• Misprediction on Pentium III wastes ~14 clock cycles
• That’s a lot of time on a high performance processor
Avoiding Branches
On Modern Processor, Branches Very Expensive
• Unless prediction can be reliable
• When possible, best to avoid altogether
Example
• Compute maximum of two values
• 14 cycles when prediction correct
• 29 cycles when incorrect
int max(int x, int y)
{
    return (x < y) ? y : x;
}

movl 12(%ebp),%edx    # Get y
movl 8(%ebp),%eax     # rval=x
cmpl %edx,%eax        # rval:y
jge L11               # skip when >=
movl %edx,%eax        # rval=y
L11:
Avoiding Branches with Bit Tricks
• In style of Lab #1
• Use masking rather than conditionals
int bmax(int x, int y)
{
    int mask = -(x>y);
    return (mask & x) | (~mask & y);
}

• Compiler still uses conditional
• 16 cycles when predict correctly
• 32 cycles when mispredict

xorl %edx,%edx        # mask = 0
movl 8(%ebp),%eax
movl 12(%ebp),%ecx
cmpl %ecx,%eax
jle L13               # skip if x<=y
movl $-1,%edx         # mask = -1
L13:
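The mask trick works because -(x>y) is all ones when the comparison holds and all zeros otherwise, so the bitwise select picks the correct operand with no data-dependent control flow at the source level. A sketch with a quick correctness check:

```c
#include <assert.h>

/* Branch-free (at source level) max via a comparison-derived mask */
int bmax(int x, int y)
{
    int mask = -(x > y);             /* all-ones if x > y, else 0 */
    return (mask & x) | (~mask & y); /* selects x or y bitwise */
}
```

Note the tie case: when x == y the mask is 0 and y is returned, which is still the maximum.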
Avoiding Branches with Bit Tricks
• Force compiler to generate desired code
int bvmax(int x, int y)
{
    volatile int t = (x>y);
    int mask = -t;
    return (mask & x) |
           (~mask & y);
}

movl 8(%ebp),%ecx     # Get x
movl 12(%ebp),%edx    # Get y
cmpl %edx,%ecx        # x:y
setg %al              # (x>y)
movzbl %al,%eax       # Zero extend
movl %eax,-4(%ebp)    # Save as t
movl -4(%ebp),%eax    # Retrieve t
• volatile declaration forces value to be written to memory
• Compiler must therefore generate code to compute t
• Simplest way is setg/movzbl combination
• Not very elegant!
• A hack to get control over compiler
• 22 clock cycles on all data
• Better than misprediction
Conditional Move
• Added with P6 microarchitecture (PentiumPro onward)
• cmovXXl %edx, %eax
• If condition XX holds, copy %edx to %eax
• Doesn’t involve any branching
• Handled as operation within Execution Unit
movl 8(%ebp),%edx     # Get x
movl 12(%ebp),%eax    # rval=y
cmpl %edx, %eax       # rval:x
cmovll %edx,%eax      # If <, rval=x
• Current version of GCC won’t use this instruction
• Thinks it’s compiling for a 386
• Performance
• 14 cycles on all data
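The source-level pattern that lends itself to cmov is a conditional expression whose arms are simple, side-effect-free values. Whether a given GCC actually emits cmovll depends on the target flags (e.g. compiling for P6 or later), so the codegen is an assumption about the compiler, not something the C source guarantees:

```c
#include <assert.h>

/* Conditional-expression style that P6-aware compilers can lower
   to cmov: both arms are plain register values, safe to evaluate
   unconditionally, so no branch is needed. */
int max_cmov_friendly(int x, int y)
{
    return (x < y) ? y : x;
}

int clamp_nonneg(int x)
{
    return (x < 0) ? 0 : x;   /* another cmov candidate */
}
```

The same functions compiled with a conservative target (plain 386) fall back to a compare-and-jump sequence, with the misprediction costs discussed above.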
Machine-Dependent Opt. Summary
Pointer Code
• Look carefully at generated code to see whether helpful
Loop Unrolling
• Some compilers do this automatically
• Generally not as clever as what you can achieve by hand
Exposing Instruction-Level Parallelism
• Very machine dependent
Warning:
• Benefits depend heavily on particular machine
• Best if performed by compiler
• But GCC on IA32/Linux is not very good
• Do only for performance-critical parts of code
Role of Programmer
How should I write my programs, given that I have a good, optimizing
compiler?
Don’t: Smash Code into Oblivion
• Hard to read, maintain, & assure correctness
Do:
• Select best algorithm
• Write code that’s readable & maintainable
• Procedures, recursion, without built-in constant limits
• Even though these factors can slow down code
• Eliminate optimization blockers
• Allows compiler to do its job
Focus on Inner Loops
• Do detailed optimizations where code will be executed repeatedly
• Will get most performance gain here