Optimisation
D. Quigley
22/02/2008

Plan
1. A brief tour of the hopefully obvious.
   • Compiler flags
   • Cost of operations
   • Avoiding repeat computation
2. Your CPU and you. Understanding pipelining.
   – What is a pipeline?
   – Pipeline stalls
   – Loop unrolling
   – Helping the compiler
3. Organising your memory. How cache works and why you need to know.
   – Cache thrashing
   – Padding and striping
   – Reducing memory access
4. Putting it into practice.
   – Example: Integrating the 2D wave equation
Some wisdom
“There are two rules of code optimisation:
Rule 1: Don’t do it!
Rule 2 (for experts only): Don’t do it yet!”
- M.A. Jackson
"More computing sins are committed in the name of efficiency
(without necessarily achieving it) than for any other single reason including blind stupidity."
- W.A. Wulf
“A slow but correct code is infinitely more useful to your research
than a fast broken one. Please do not break your codes and then
claim that Dr Quigley told you to do it!”
- D. Quigley
Compiler flags

msseay@foxymoron:~> f90 -o my_prog.exe my_code.f90
msseay@foxymoron:~> cc -o my_prog.exe my_code.c

-O0   No optimisation – does exactly what you coded.
-O1   Eliminate redundant code, unroll loops, remove invariants from loops.
-O2   Unroll nested loops, reorder operations to prevent pipeline stalls, etc.
-O3   Array padding, function in-lining, loop reordering + others.

+ many other compiler-dependent flags controlling vectorisation, optimisation for a specific target architecture, etc. Read the manual.
Good practice:
• Develop and test your code at -O0 with a reproducible test case.
• Check the results are unchanged at -O1, -O2, -O3 etc. Compiler optimisations can compromise accuracy!
• Time your code with various compiler flags. -O3 can be slower than -O2 for many codes.
All operations are not equal

Your CPU only understands a few basic operations, e.g. add, multiply, shift, compare. These can usually be executed in a single CPU cycle.

Any other operations, e.g. divide, sqrt(), log, exp, x**y, sin, cos, must be implemented with microcode, i.e. a sequence of operations stored in the CPU firmware.

e.g. a divide operation can take from 30 to 100 CPU cycles!
Avoiding microcode

Classic example:
   y = x/2.0
is many times slower than
   y = 0.5*x

Older codes often contain things like:
   y = x*x*x
instead of
   y = x**3
or
   y = pow(x,3)
This avoids the overhead of invoking microcode. Most compilers will correct this for you, but it doesn't hurt to be sure.

   y = x**3.0
or
   y = pow(x,3.0)
is REALLY bad. This will invoke general-purpose microcode for raising a number to a non-integer power. This involves taking logs using lookup tables and is VERY slow.

   y = x**z
Is z declared as an integer? Can it be?
Simple optimisation

do i=1,n
   x(i)=2*p*i/k1
   y(i)=2*p*i/k2
end do

Avoid repeating 2*p*i (the compiler will do this for you):

do i=1,n
   t1=2*p*i
   x(i)=t1/k1
   y(i)=t1/k2
end do

Compute 2*p outside the loop (the compiler should do this for you):

t2=2*p
do i=1,n
   t1=t2*i
   x(i)=t1/k1
   y(i)=t1/k2
end do

Store k1 and k2 as inverses (the compiler might do this for you):

t1=2*p/k1
t2=2*p/k2
do i=1,n
   x(i)=t1*i
   y(i)=t2*i
end do
Simple algebra

do i = 1,n
   E(i) = A(i)/B(i) + C(i)/D(i)
end do

2 divides and 1 add per iteration ~ 60–200 cycles

do i = 1,n
   t1 = 1.0_dp/(B(i)*D(i))
   E(i) = t1*(A(i)*D(i) + C(i)*B(i))
end do

1 divide, 4 multiplies and 1 add ~ 50–120 cycles

The compiler will not do this for you.
Pipelines

Even basic operations such as add, multiply and shift actually take multiple cycles. These are divided into a series of simpler stages.

e.g. the instruction A = A + B passing through a five-stage pipeline:

[Diagram: A = A + B enters the pipeline as a new instruction and advances one stage per CPU cycle (cycles 1–5), completing after cycle 5.]

Each stage takes one CPU cycle to complete. The entire operation takes 5 cycles.

Be aware that this is a highly simplified picture. CPUs have multiple (branching) pipelines feeding multiple functional units per CPU core.
Pipelines

More stages means simpler stages, which in turn means each stage takes less time and we can clock our CPU at more cycles per second.

e.g. the 3.2 GHz Pentium 4 has a 28-stage pipeline. (This is not necessarily a good thing.)

This code is very pipeline friendly – we can start each operation before the previous one has finished:

real(kind=dp),dimension(1:1000) :: A

< some code >
do I = 1,1000
   A(I) = A(I)**2
end do
< more code >
Pipelines

e.g. with 5 stages we can have up to 5 operations in flight.

[Diagram: A1**2 enters the pipeline at cycle 1, A2**2 at cycle 2, and so on. By cycle 5 all five stages are busy; from cycle 6 onwards one instruction completes every cycle.]

Latency of 5 cycles to fill the pipeline. Subsequent repeat rate of 1 cycle. Hence the CPU effectively completes one operation per cycle. (N.B. most CPU cores actually peak at two operations per cycle or better.)
Pipeline stalls
Our 5 stage pipeline needs 5 independent operations to sustain
peak performance, otherwise the pipeline will stall.
real(kind=dp),dimension(1:1000) :: A

sum = 0.0_dp
do I = 1,1000
   sum = sum + A(I)
end do

Slow. Each increment of sum cannot begin until the result of the previous operation is known. Stalls every iteration.

Unroll the loop:

t1 = 0.0_dp
t2 = 0.0_dp
t3 = 0.0_dp
t4 = 0.0_dp
t5 = 0.0_dp
do I = 1,1000,5
   t1 = t1 + A(I)
   t2 = t2 + A(I+1)
   t3 = t3 + A(I+2)
   t4 = t4 + A(I+3)
   t5 = t5 + A(I+4)
end do
sum = t1 + t2 + t3 + t4 + t5

Pipelines now full – 5x faster.
Help the compiler!

The previous example should have been automatically unrolled by the compiler. In general the compiler will do better at this than we can. What about this one?

real(kind=dp),dimension(1:1000) :: A
integer :: J

< set J >

sum = 0.0_dp
do I = 1,1000-J
   sum = sum + A(I) + A(I+J)
end do

If J > 5 then this loop can be unrolled and efficiently pipelined. The compiler doesn't know in advance what J will be, so it can't risk unrolling it.

If J is always going to be, say, 10, let the compiler know by declaring it as a constant in C or a parameter in Fortran 90.
Help the compiler!

do I = 1,1000
   do J = 1,1000
      if ( J<I ) then
         A(J,I) = A(J,I)*B(J,I) + C
      else
         A(J,I) = A(J,I)*D(J,I) + E
      end if
   end do
end do

This branch can be avoided, as the pattern of true/false results is predetermined:

do I = 1,1000
   do J = 1,I-1
      A(J,I) = A(J,I)*B(J,I) + C
   end do
   do J = I,1000
      A(J,I) = A(J,I)*D(J,I) + E
   end do
end do

The more complex the pattern, the less likely the compiler is to spot it. Any branch which depends only on constants and/or the loop trip count is probably unnecessary.
Help the compiler!

To avoid stalling our 5-stage pipeline we must be able to see 5 operations into the future. Branches (i.e. IF or SELECT statements) make this impossible. Modern CPUs / compilers use branch prediction and speculative execution.

real(kind=dp),dimension(1:1000) :: A,C
logical,dimension(1:1000) :: B

< elements of B obtained as true or false >

do I = 1,1000
   if ( B(I) ) then
      A(I) = A(I) + C(I)
   end if
end do

Every time this branch is predicted incorrectly the pipeline will stall and we must suffer 5 cycles of latency. Prediction will never be 100% accurate.
Help the compiler!

Avoid branches wherever possible, especially within loops.

real(kind=dp),dimension(1:1000) :: A,C
real(kind=dp),dimension(1:1000) :: B

< elements of B obtained as 1.0 or 0.0 >

do I = 1,1000
   A(I) = A(I) + B(I)*C(I)
end do

Can now be pipelined.

BUT – what if a maximum of 15 elements of B are allowed to be true? The simplest "assume previous result" branch prediction would get the answer right at least 970 times. Cost of 30 pipeline stalls vs cost of 970 unnecessary multiply-add operations?
Hang on – I've got a great idea!

real(kind=dp),dimension(1:1000) :: A,B

do I = 1,1000
   ! Save time by using a Taylor expansion
   ! if B(I) is small.
   if ( B(I) < 1.0e-5_dp ) then
      A(I) = B(I) - 0.16666666_dp*B(I)**3
   else
      A(I) = sin(B(I))
   end if
end do

Every time this branch is predicted incorrectly the pipeline will stall. Is the cost of the sine operation more or less than the cost of the pipeline stall? It depends on the data…

• If the data is fairly uniform then we expect good prediction and very few pipeline stalls.
• If dominated by large values – no worse than always using the sine function.
• If dominated by small values – may well be much faster due to avoiding the sine function.
• If the data randomly alternates between small and large values then expect poor branch prediction and many pipeline stalls.
Pipelining summary

We could look at many more examples. Key point:

You understand your code and data better than the compiler, but the compiler understands the CPU better than you!

Help the compiler:
1. Move branches outside of loops.
2. Avoid unnecessary branches.
3. Don't declare constants as variables.
4. Use compiler directives (see documentation).

This often leads to longer, less transparent code.

See Dowd and Severance, "High Performance Computing", O'Reilly (1999) for more examples.
Memory hierarchy

Registers:
   Store the data the CPU is currently operating on. ~32 registers per CPU core.

L1 cache:
   Small (e.g. 32 KB) on-CPU, fast SRAM. Takes 1–3 clock cycles to serve a memory request.

L2 cache:
   Larger (e.g. 4 MB), usually also on the CPU. Takes 5–25 clock cycles to serve a memory request.

Main memory (e.g. 2 GB):
   Takes 30–300 clock cycles to serve a memory request.

Substantial gains in performance come from minimising the number of reads/writes to main memory.
Direct mapping

Whenever a value is read from main memory, an entire cache line is read into cache. Data already on that cache line is erased (or written back to memory, in write-back vs write-through caches).

[Diagram: main memory mapping onto a cache of 4 cache lines, each holding 16 64-bit words (one 64-bit word = 1 double precision number).]
Happy cache

real(kind=dp),dimension(1:32) :: A
real(kind=dp),dimension(1:32) :: B

< some code >
do I = 1,32
   A(I) = A(I)*B(I)
end do
< more code >

A and B map onto different cache lines. 2 cache lines in use at all times. 4 reads from main memory.
Cache thrashing

real(kind=dp),dimension(1:64) :: A
real(kind=dp),dimension(1:64) :: B

< some code >
do I = 1,64
   A(I) = A(I)*B(I)
end do
< more code >

A(i) and B(i) always map onto the same cache line. Only 1 cache line in use at a time. 128 reads from main memory, cf. 4 reads for a problem of half the size.
Padding?

real(kind=dp),dimension(1:64) :: A
real(kind=dp),dimension(1:16) :: C   ! padding
real(kind=dp),dimension(1:64) :: B

< some code >
do I = 1,64
   A(I) = A(I)*B(I)
end do
< more code >

A(i) and B(i) now map onto different cache lines. 2 cache lines in use at a time. 8 reads from main memory, cf. 4 reads for a problem of half the size – much better.
Striping?

type my_stripe_type
   real(kind=dp) :: A
   real(kind=dp) :: B
end type my_stripe_type

type(my_stripe_type),dimension(64) :: str

< some code >
do I = 1,64
   str(I)%A = str(I)%A*str(I)%B
end do
< more code >

[Diagram: A and B are now interleaved in memory as A B A B …, so each cache line holds 8 A–B pairs.]

1 cache line in use at a time. 8 reads from main memory.
Set association

This cache is 2-way set associative. When reading a word from memory, the cache line of the corresponding colour that was least recently used is overwritten. Most caches are at least 2-way set associative; many are 4- or 8-way.

[Diagram: main memory mapping onto a cache of 4 cache lines, each holding 16 64-bit words (one 64-bit word = 1 double precision number), with the lines grouped into two 2-way sets.]
Set association

real(kind=dp),dimension(1:64) :: A
real(kind=dp),dimension(1:64) :: B

< some code >
do I = 1,64
   A(I) = A(I)*B(I)
end do
< more code >

2 cache lines in use at a time. 8 reads from main memory. No code changes needed.
Cache thrashing

real(kind=dp),dimension(1:64) :: A
real(kind=dp),dimension(1:64) :: B
real(kind=dp),dimension(1:64) :: C

< some code >
do I = 1,64
   A(I) = A(I)*B(I) + C(I)
end do
< more code >

2 cache lines in use at a time. 192 reads from main memory. Fix with padding or striping as before.
Access patterns

real(kind=dp),dimension(1:128) :: A

< some code >
do I = 1,127,2     ! odd values
   A(I) = A(I)**2
end do
do I = 2,128,2     ! even values
   A(I) = A(I)**3
end do
< more code >

The first loop triggers 8 reads; the second loop triggers 8 reads. 16 reads total for 128 iterations.
Access patterns

real(kind=dp),dimension(1:128) :: A

< some code >
do I = 1,127,2
   A(I) = A(I)**2       ! odd values
   A(I+1) = A(I+1)**3   ! even values
end do
< more code >

Now stepping through the array with unit stride. 8 reads total for 128 iterations. Avoid non-unit stride.
2d data – F90

real(kind=dp),dimension(1:32,1:4) :: A
real(kind=dp),dimension(1:32) :: sumrow

< some code >
do I = 1,32
   sumrow(I) = 0.0_dp
   do J = 1,4
      sumrow(I) = sumrow(I) + A(I,J)
   end do
end do
< more code >

[Diagram: cache lines hold A(1,1)–A(16,1), A(17,1)–A(32,1), A(1,2)–A(16,2), …, A(17,4)–A(32,4).]

Each addition triggers a load of a new cache line. 2 lines in use at a time. 128 reads from main memory in total.
2d data – F90

real(kind=dp),dimension(1:4,1:32) :: A
real(kind=dp),dimension(1:32) :: sumrow

< some code >
do I = 1,32
   sumrow(I) = 0.0_dp
   do J = 1,4
      sumrow(I) = sumrow(I) + A(J,I)
   end do
end do
< more code >

[Diagram: cache lines hold A(1,1)–A(4,4), A(1,5)–A(4,8), …, A(1,29)–A(4,32).]

Now stepping through memory with unit stride. 8 loads from main memory in total.
Memory summary

• Use unit stride wherever possible. Cache works well with spatial and temporal locality of access.
• Try to avoid problem sizes which are multiples of a cache line, i.e. avoid powers of 2 like the plague.
• Minimise the number of passes through data. Do as much as possible with each read from main memory.
• Be VERY careful when looping through multidimensional arrays. (Warning: array storage order in C is opposite to Fortran 90.)

Be aware that many older codes (pre-1995) were written when memory reads were cheap and most machines had little or no cache memory.
An Example
Time evolution of the wave equation using finite differences on a
two-dimensional grid.