Plan

1. A brief tour of the hopefully obvious.
   • Compiler flags
   • Cost of operations
   • Avoiding repeat computation
2. Your CPU and you. Understanding pipelining.
   – What is a pipeline?
   – Pipeline stalls
   – Loop unrolling
   – Helping the compiler
3. Organising your memory. How cache works and why you need to know.
   – Cache thrashing
   – Padding and striping
   – Reducing memory access
4. Putting it into practice.
   – Example: Integrating the 2D wave equation

D. Quigley 22/02/2008 Optimisation

Some wisdom

“There are two rules of code optimisation:
 Rule 1: Don’t do it!
 Rule 2 (for experts only): Don’t do it yet!”
 - M.A. Jackson

"More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason including blind stupidity."
 - W.A. Wulf

“A slow but correct code is infinitely more useful to your research than a fast broken one. Please do not break your codes and then claim that Dr Quigley told you to do it!”
 - D. Quigley

Compiler flags

    msseay@foxymoron:~> f90 -o my_prog.exe my_code.f90
    msseay@foxymoron:~> cc -o my_prog.exe my_code.c

-O0  No optimisation: does exactly what you coded.
-O1  Eliminate redundant code, unroll loops, remove invariants from loops.
-O2  Unroll nested loops, reorder operations to prevent pipeline stalls etc.
-O3  Array padding, function in-lining, loop reordering + others.

+ many other compiler-dependent flags controlling vectorisation, optimisation for a specific target architecture etc. Read the manual.

Good practice:
• Develop and test your code at -O0 with a reproducible test case.
• Check that the result is unchanged at -O1, -O2, -O3 etc. Compiler optimisations can compromise accuracy!
• Time your code with various compiler flags. -O3 can be slower than -O2 for many codes.

All Operations are not equal

Your CPU only understands a few basic operations, e.g. add, multiply, shift, compare.
These can usually be executed in a single CPU cycle.
Any other operations, e.g. divide, sqrt(), log, exp, x^y, sin, cos, must be implemented with microcode, i.e. a sequence of operations stored in the CPU firmware.
e.g. a divide operation can take from 30 to 100 CPU cycles!

Avoiding microcode

Classic example:

    y = x/2.0

is many times slower than

    y = 0.5*x

Older codes often contain things like:

    y = x*x*x

instead of

    y = x**3     or     y = pow(x,3)

This avoids the overhead of invoking microcode. Most compilers will correct this for you, but it doesn’t hurt to be sure.

    y = x**3.0   or     y = pow(x,3.0)

is REALLY bad. This will invoke general purpose microcode for raising a number to a non-integer power. This involves taking logs using lookup tables and is VERY slow.

    y = x**z

Is z declared as an integer? Can it be?

Simple optimisation

    do i=1,n
       x(i)=2*p*i/k1
       y(i)=2*p*i/k2
    end do

Avoid repeating 2*p*i (compiler will do this for you):

    do i=1,n
       t1=2*p*i
       x(i)=t1/k1
       y(i)=t1/k2
    end do

Compute 2*p outside loop (compiler should do this for you):

    t2=2*p
    do i=1,n
       t1=t2*i
       x(i)=t1/k1
       y(i)=t1/k2
    end do

Store k1 and k2 as inverse (compiler might do this for you):

    t1=2*p/k1
    t2=2*p/k2
    do i=1,n
       x(i)=t1*i
       y(i)=t2*i
    end do

Simple Algebra

    do i = 1,n
       E(i) = A(i)/B(i) + C(i)/D(i)
    end do

2 divides and 1 add per iteration ~ 60-200 cycles.

    do i = 1,n
       t1 = 1.0_dp/(B(i)*D(i))
       E(i) = t1*(A(i)*D(i) + C(i)*B(i))
    end do

1 divide, 4 multiplies and 1 add per iteration ~ 50-120 cycles.

Compiler will not do this for you.
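As a sketch of the rearrangement on the Simple Algebra slide, the two forms can be written as module procedures and compared numerically. The module name `algebra_sketch`, the procedure names and the definition of `dp` are assumptions made for illustration; they are not part of the original slides.

```fortran
module algebra_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed definition of the dp kind
contains

  ! Straightforward form: 2 divides and 1 add per element.
  subroutine ratio_sum_naive(a, b, c, d, e)
    real(kind=dp), intent(in)  :: a(:), b(:), c(:), d(:)
    real(kind=dp), intent(out) :: e(:)
    integer :: i
    do i = 1, size(e)
       e(i) = a(i)/b(i) + c(i)/d(i)
    end do
  end subroutine ratio_sum_naive

  ! Rearranged form: 1 divide, 4 multiplies and 1 add per element.
  subroutine ratio_sum_one_divide(a, b, c, d, e)
    real(kind=dp), intent(in)  :: a(:), b(:), c(:), d(:)
    real(kind=dp), intent(out) :: e(:)
    real(kind=dp) :: t1
    integer :: i
    do i = 1, size(e)
       t1   = 1.0_dp/(b(i)*d(i))              ! the only divide
       e(i) = t1*(a(i)*d(i) + c(i)*b(i))
    end do
  end subroutine ratio_sum_one_divide

end module algebra_sketch
```

The two routines agree only to rounding error, not bit for bit, and the product `b(i)*d(i)` can overflow or underflow where the separate divides would not. Compare results with a tolerance on a reproducible test case before adopting the faster form.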
Pipelines

Even basic operations such as add, multiply, shift actually take multiple cycles. These are divided into a series of simpler stages.

e.g. Instruction A = A + B passing through a five stage pipeline.

[Figure, animated over CPU cycles 1 to 5: a new instruction A=A+B enters the first stage, advances one stage per cycle, and is completed after the fifth stage.]

Each stage takes one CPU cycle to complete. The entire operation takes 5 cycles.

Be aware that this is a highly simplified picture. CPUs have multiple (branching) pipelines feeding multiple functional units per CPU core.
Pipelines

More stages means simpler stages, which in turn means each stage takes less time and we can clock the CPU at a higher frequency.

e.g. 3.2 GHz Pentium 4 has a 28 stage pipeline. (This is not necessarily a good thing.)

    real(kind=dp),dimension(1:1000) :: A

    < some code >

    do I = 1,1000
       A(I) = A(I)**2
    end do

    < more code >

This code is very pipeline friendly. We can start each operation before the previous one is finished.

Pipelines

e.g. with 5 stages we can have up to 5 operations in flight.

[Figure, animated by CPU cycle: A1 = A1**2 enters the pipeline at cycle 1, A2 = A2**2 at cycle 2, A3 = A3**2 at cycle 3, each advancing one stage per cycle.]

Latency of 5 cycles to fill the pipeline. Subsequent repeat rate of 1 cycle. Hence the CPU effectively completes one operation per cycle.

(N.B. most CPU cores actually peak at two operations per cycle or better)
[Figure, continued: by cycle 5 the pipeline is full, holding the operations on A1 to A5; at cycle 6 A1 = A1**2 has completed and A6 = A6**2 enters.]

Pipeline stalls

Our 5 stage pipeline needs 5 independent operations to sustain peak performance, otherwise the pipeline will stall.

Slow:

    real(kind=dp),dimension(1:1000) :: A

    sum = 0.0_dp
    do I = 1,1000
       sum = sum + A(I)
    end do

Each increment of sum cannot begin until the result of the previous operation is known. Stalls every iteration.

Unroll the loop:

    real(kind=dp),dimension(1:1000) :: A

    t1 = 0.0_dp
    t2 = 0.0_dp
    t3 = 0.0_dp
    t4 = 0.0_dp
    t5 = 0.0_dp
    do I = 1,1000,5
       t1 = t1 + A(I)
       t2 = t2 + A(I+1)
       t3 = t3 + A(I+2)
       t4 = t4 + A(I+3)
       t5 = t5 + A(I+4)
    end do
    sum = t1 + t2 + t3 + t4 + t5

Pipelines now full: up to 5 times faster.

Help the compiler!

The previous example should have been automatically unrolled by the compiler.
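For reference, a hand-unrolled version of the sum from the Pipeline stalls slide that works for any array length might look like the sketch below. The module and function names and the definition of `dp` are assumptions for illustration; the accumulator is called `total` to avoid shadowing the `sum` intrinsic.

```fortran
module unroll_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed definition of the dp kind
contains

  ! Sum an array using 5 independent partial sums, so that 5 additions
  ! can be in flight at once, then mop up any leftover elements.
  function sum_unrolled5(a) result(total)
    real(kind=dp), intent(in) :: a(:)
    real(kind=dp) :: total
    real(kind=dp) :: t1, t2, t3, t4, t5
    integer :: i, n, nmain

    n     = size(a)
    nmain = n - mod(n, 5)        ! largest multiple of 5 not exceeding n

    t1 = 0.0_dp; t2 = 0.0_dp; t3 = 0.0_dp; t4 = 0.0_dp; t5 = 0.0_dp
    do i = 1, nmain, 5
       t1 = t1 + a(i)
       t2 = t2 + a(i+1)
       t3 = t3 + a(i+2)
       t4 = t4 + a(i+3)
       t5 = t5 + a(i+4)
    end do
    total = t1 + t2 + t3 + t4 + t5

    ! Remainder loop: at most 4 elements when n is not a multiple of 5.
    do i = nmain + 1, n
       total = total + a(i)
    end do
  end function sum_unrolled5

end module unroll_sketch
```

Unrolling changes the order of the additions, so the result can differ from the simple loop in the last few bits. This is one way in which optimisation, by hand or by the compiler, can alter numerical results.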
In general the compiler will do better at this than we can. What about this one?

    real(kind=dp),dimension(1:1000) :: A
    integer :: J

    < set J >

    sum = 0.0_dp
    do I = 1,1000-J
       sum = sum + A(I) + A(I+J)
    end do

If J > 5 then this loop can be unrolled and efficiently pipelined. The compiler doesn’t know in advance what J will be, so can’t risk unrolling it.

If J is always going to be, say, 10, let the compiler know by declaring it as a constant in C or a parameter in Fortran 90.

Help the compiler!

    do I = 1,1000
       do J = 1,1000
          if ( J<I ) then
             A(J,I) = A(J,I)*B(J,I) + C
          else
             A(J,I) = A(J,I)*D(J,I) + E
          end if
       end do
    end do

This branch can be avoided as the pattern of true/false results is predetermined:

    do I = 1,1000
       do J = 1,I-1
          A(J,I) = A(J,I)*B(J,I) + C
       end do
       do J = I,1000
          A(J,I) = A(J,I)*D(J,I) + E
       end do
    end do

The more complex the pattern, the less likely the compiler is to spot it. Any branch which depends only on constants and/or the loop trip count is probably unnecessary.

Help the compiler!

To avoid stalling our 5 stage pipeline we must be able to see 5 operations into the future. Branches (i.e. IF or SELECT statements) make this impossible. Modern CPUs / compilers use branch prediction and speculative execution.

    real(kind=dp),dimension(1:1000) :: A,C
    logical,dimension(1:1000) :: B

    < elements of B obtained as true or false >

    do I = 1,1000
       if ( B(I) ) then
          A(I) = A(I) + C(I)
       end if
    end do

Every time this branch is predicted incorrectly the pipeline will stall and we must suffer 5 cycles of latency. Prediction will never be 100% accurate.

Help the compiler!

Avoid branches wherever possible, especially within loops.

    real(kind=dp),dimension(1:1000) :: A,C
    real(kind=dp),dimension(1:1000) :: B

    < elements of B obtained as 1.0 or 0.0 >

    do I = 1,1000
       A(I) = A(I) + B(I)*C(I)
    end do

Can now be pipelined.

BUT: what if a maximum of 15 elements of B are allowed to be true?
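The branching and branch-free variants from the last two slides can be checked against each other before timing them. This is a minimal sketch; the module and procedure names and the definition of `dp` are assumptions for illustration.

```fortran
module branch_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed definition of the dp kind
contains

  ! Branching version: add c(i) only where the logical mask is true.
  subroutine add_where_branch(a, c, mask)
    real(kind=dp), intent(inout) :: a(:)
    real(kind=dp), intent(in)    :: c(:)
    logical,       intent(in)    :: mask(:)
    integer :: i
    do i = 1, size(a)
       if (mask(i)) a(i) = a(i) + c(i)
    end do
  end subroutine add_where_branch

  ! Branch-free version: the mask is stored as 1.0 (true) or 0.0 (false),
  ! so every iteration performs the same multiply-add.
  subroutine add_where_multiply(a, c, weight)
    real(kind=dp), intent(inout) :: a(:)
    real(kind=dp), intent(in)    :: c(:), weight(:)
    integer :: i
    do i = 1, size(a)
       a(i) = a(i) + weight(i)*c(i)
    end do
  end subroutine add_where_multiply

end module branch_sketch
```

A logical mask can be converted with the `merge` intrinsic, e.g. `weight = merge(1.0_dp, 0.0_dp, mask)`. The two versions give the same answer only if every `c(i)` is finite, since 0.0 times an infinity or NaN is NaN. Which version is faster depends on the data, so time both on a representative case.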
With at most 15 true elements, the simplest “assume previous result” branch prediction would get the answer right at least 970 times out of 1000. Cost of 30 pipeline stalls vs cost of at least 985 unnecessary multiply-add operations?

Hang on, I’ve got a great idea!

    real(kind=dp),dimension(1:1000) :: A,B

    do I = 1,1000
       ! Save time by using a Taylor expansion
       ! if B(I) is small.
       if ( abs(B(I)) < 1.0e-5_dp ) then
          A(I) = B(I) - 0.16666666_dp*B(I)**3
       else
          A(I) = sin(B(I))
       end if
    end do

Every time this branch is predicted incorrectly the pipeline will stall. Is the cost of the sine operation more or less than the cost of the pipeline stall? Depends on the data...

• If the data is fairly uniform then we expect good prediction and very few pipeline stalls.
   – If dominated by large values: no worse than always using the sine function.
   – If dominated by small values: may well be much faster due to avoiding the sine function.
• If the data randomly alternates between small and large values then expect poor branch prediction and many pipeline stalls.

Pipelining Summary

Could look at many more examples.

Key point: You understand your code and data better than the compiler, but the compiler understands the CPU better than you!

Help the compiler:
1. Move branches outside of loops.
2. Avoid unnecessary branches.
3. Don’t declare constants as variables.
4. Use compiler directives (see documentation).
5. Be aware that this often leads to longer, less transparent code.

See: Dowd and Severance, “High Performance Computing”, O’Reilly (1999) for more examples.

Memory Hierarchy

Registers: Store the data the CPU is currently operating on. ~32 registers per CPU core.
L1 cache: Small (e.g. 32 KB), on CPU, fast SRAM. Takes 1-3 clock cycles to serve a memory request.
L2 cache: Larger (e.g. 4 MB), usually also on CPU. Takes 5-25 clock cycles to serve a memory request.
Main memory (e.g. 2 GB): Takes 30-300 clock cycles to serve a memory request.
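To get a feel for what these latencies mean in practice, the average cost of a memory request can be estimated from the fraction of requests that miss each level of cache. The sketch below uses the standard average-access-time formula for a two-level cache. The module and function names are assumptions, and the miss fractions are hypothetical inputs rather than measurements.

```fortran
module memory_cost_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed definition of the dp kind
contains

  ! Average cycles per memory request for a two-level cache.
  !   t_l1, t_l2, t_mem : latency in cycles of L1, L2 and main memory
  !   miss_l1           : fraction of requests that miss L1 (hypothetical)
  !   miss_l2           : fraction of L1 misses that also miss L2 (hypothetical)
  pure function avg_access_cycles(t_l1, t_l2, t_mem, miss_l1, miss_l2) &
       result(cycles)
    real(kind=dp), intent(in) :: t_l1, t_l2, t_mem, miss_l1, miss_l2
    real(kind=dp) :: cycles
    cycles = t_l1 + miss_l1*(t_l2 + miss_l2*t_mem)
  end function avg_access_cycles

end module memory_cost_sketch
```

Taking latencies of 2, 15 and 200 cycles from within the ranges above, a code that misses L1 on 5% of requests and L2 on 20% of those averages 4.75 cycles per request. One that misses L1 half the time and always misses L2 averages 109.5 cycles.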
Substantial gains in performance come from minimising the number of reads/writes to main memory.

Direct Mapping

[Figure, animated: blocks of main memory mapping onto the lines of cache memory.]

Whenever a value is read from main memory, an entire cache line is read into cache. Data already in that cache line is overwritten (in a write-back cache, modified data is first written back to main memory; in a write-through cache, main memory is already up to date).

e.g. 4 cache lines each holding 16 64-bit words (one 64-bit word = 1 double precision number)

Happy Cache

    real(kind=dp),dimension(1:32) :: A
    real(kind=dp),dimension(1:32) :: B

    < some code >

    do I = 1,32
       A(I) = A(I)*B(I)
    end do

    < more code >

A and B map onto different cache lines. 2 cache lines in use at all times. 4 reads from main memory.
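The direct mapping used in these examples can be written as a one-line function. The sketch below assumes the toy cache from the slides (4 lines of 16 words) and zero-based word addresses; the module and function names are assumptions for illustration.

```fortran
module direct_map_sketch
  implicit none
  integer, parameter :: words_per_line = 16   ! toy cache from the slides
  integer, parameter :: n_lines        = 4
contains

  ! Cache line (0 .. n_lines-1) used by the word at zero-based address `word`.
  ! Integer division gives the memory block; the block number modulo the
  ! number of lines gives the cache line that block must occupy.
  pure function cache_line(word) result(line)
    integer, intent(in) :: word
    integer :: line
    line = mod(word/words_per_line, n_lines)
  end function cache_line

end module direct_map_sketch
```

If A(1:32) occupies words 0 to 31 and B(1:32) follows at words 32 to 63 (an assumed contiguous layout, which a real compiler need not produce), then A(I) is word I-1 and B(I) is word 32+I-1. A uses lines 0 and 1 while B uses lines 2 and 3, so they never evict each other.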
Cache Thrashing

    real(kind=dp),dimension(1:64) :: A
    real(kind=dp),dimension(1:64) :: B

    < some code >

    do I = 1,64
       A(I) = A(I)*B(I)
    end do

    < more code >

A(I) and B(I) always map onto the same cache line. Only using 1 cache line at a time. 128 reads from main memory, c.f. 4 reads for a problem of half the size.

Padding?

    real(kind=dp),dimension(1:64) :: A
    real(kind=dp),dimension(1:16) :: C
    real(kind=dp),dimension(1:64) :: B

    < some code >

    do I = 1,64
       A(I) = A(I)*B(I)
    end do

    < more code >

A(I) and B(I) now map onto different cache lines. 2 cache lines in use at a time. 8 reads from main memory, c.f. 4 reads for a problem of half the size. Much better.

Striping?

    type my_stripe_type
       real(kind=dp):: A
       real(kind=dp):: B
    end type my_stripe_type

    type(my_stripe_type),dimension(64) :: str

    < some code >

    do I = 1,64
       str(I)%A = str(I)%A*str(I)%B
    end do

    < more code >

[Figure: A and B values interleaved (A B A B ...) within each cache line.]

1 cache line in use at a time. 8 reads from main memory.

Set Association

[Figure: blocks of main memory and lines of cache memory, colour-coded by set.]

This cache is 2-way set associative. When reading a word from memory, the cache line of the corresponding colour that was least recently used is overwritten.

Most caches are at least 2-way set associative. Many are 4 or 8-way.

e.g. 4 cache lines each holding 16 64-bit words (one 64-bit word = 1 double precision number)

Set Association

    real(kind=dp),dimension(1:64) :: A
    real(kind=dp),dimension(1:64) :: B

    < some code >

    do I = 1,64
       A(I) = A(I)*B(I)
    end do

    < more code >

2 cache lines in use at a time. 8 reads from main memory. No code changes needed.
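The read counts quoted on the Cache Thrashing, Padding and Set Association slides can be reproduced with a small model of the toy cache. The sketch below is not a description of any real CPU. It assumes 4 lines of 16 words, least-recently-used replacement within a set, contiguous array layout, and that only loads fetch cache lines (stores are not modelled). All names are assumptions for illustration.

```fortran
module cache_sim_sketch
  implicit none
  integer, parameter :: words_per_line = 16   ! toy cache from the slides
  integer, parameter :: n_lines        = 4
contains

  ! Count cache line fetches for a sequence of zero-based word addresses.
  ! ways = 1 models a direct-mapped cache, ways = 2 a 2-way set-associative
  ! cache. ways must divide n_lines.
  function count_reads(addresses, ways) result(reads)
    integer, intent(in) :: addresses(:)
    integer, intent(in) :: ways
    integer :: reads
    integer :: tag(n_lines), last_used(n_lines)
    integer :: n_sets, k, j, blk, first, slot, victim

    n_sets    = n_lines/ways
    tag       = -1      ! -1 marks an empty cache line
    last_used = 0
    reads     = 0
    do k = 1, size(addresses)
       blk    = addresses(k)/words_per_line   ! memory block holding this word
       first  = mod(blk, n_sets)*ways + 1     ! first cache line of its set
       slot   = 0
       victim = first
       do j = first, first + ways - 1
          if (tag(j) == blk) slot = j                        ! hit
          if (last_used(j) < last_used(victim)) victim = j   ! least recently used
       end do
       if (slot == 0) then                    ! miss: fetch line from memory
          reads     = reads + 1
          slot      = victim
          tag(slot) = blk
       end if
       last_used(slot) = k
    end do
  end function count_reads

  ! Addresses loaded by "do I = 1,n ; A(I) = A(I)*B(I)" when A starts at
  ! word base_a and B at word base_b.
  function loop_addresses(n, base_a, base_b) result(addr)
    integer, intent(in) :: n, base_a, base_b
    integer :: addr(2*n)
    integer :: i
    do i = 1, n
       addr(2*i-1) = base_a + i - 1
       addr(2*i)   = base_b + i - 1
    end do
  end function loop_addresses

end module cache_sim_sketch
```

With this model `count_reads(loop_addresses(64, 0, 64), 1)` gives the 128 reads of the thrashing case. Moving B to word 80 (16 words of padding) gives 8, and keeping B at word 64 but setting `ways = 2` also gives 8.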
Cache Thrashing

    real(kind=dp),dimension(1:64) :: A
    real(kind=dp),dimension(1:64) :: B
    real(kind=dp),dimension(1:64) :: C

    < some code >

    do I = 1,64
       A(I) = A(I)*B(I) + C(I)
    end do

    < more code >

2 cache lines in use at a time. 192 reads from main memory. Fix with padding or striping as before.

Access patterns

    real(kind=dp),dimension(1:128) :: A

    < some code >

    do I = 1,127,2
       A(I) = A(I)**2      ! odd values
    end do

    do I = 2,128,2
       A(I) = A(I)**3      ! even values
    end do

    < more code >

First loop triggers 8 reads. Second loop triggers 8 reads. 16 reads total for 128 iterations.

Access patterns

    real(kind=dp),dimension(1:128) :: A

    < some code >

    do I = 1,127,2
       A(I)   = A(I)**2      ! odd values
       A(I+1) = A(I+1)**3    ! even values
    end do

    < more code >

Now stepping through the array with unit stride. 8 reads total for the same 128 element updates. Avoid non-unit stride.

2d data (F90)

Memory layout, one cache-line-sized block per entry:
    A(1,1) to A(16,1)    A(17,1) to A(32,1)
    A(1,2) to A(16,2)    A(17,2) to A(32,2)
    A(1,3) to A(16,3)    A(17,3) to A(32,3)
    A(1,4) to A(16,4)    A(17,4) to A(32,4)

    real(kind=dp),dimension(1:32,1:4) :: A
    real(kind=dp),dimension(1:32) :: sumrow

    < some code >

    do I = 1,32
       sumrow(I) = 0.0_dp
       do J = 1,4
          sumrow(I) = sumrow(I) + A(I,J)
       end do
    end do

    < more code >

Each addition triggers the load of a new cache line. 2 lines in use at a time. 128 reads from main memory total.

2d data (F90)

Memory layout, one cache-line-sized block per entry:
    A(1,1) to A(4,4)      A(1,5) to A(4,8)
    A(1,9) to A(4,12)     A(1,13) to A(4,16)
    A(1,17) to A(4,20)    A(1,21) to A(4,24)
    A(1,25) to A(4,28)    A(1,29) to A(4,32)

    real(kind=dp),dimension(1:4,1:32) :: A
    real(kind=dp),dimension(1:32) :: sumrow

    < some code >

    do I = 1,32
       sumrow(I) = 0.0_dp
       do J = 1,4
          sumrow(I) = sumrow(I) + A(J,I)
       end do
    end do

    < more code >

Now stepping through memory with unit stride. 8 loads from main memory total.

Memory Summary

• Use unit stride wherever possible.
Cache works well with spatial and temporal locality of access.
• Try to avoid problem sizes which are multiples of the cache size, i.e. avoid powers of 2 like the plague.
• Minimise the number of passes through data. Do as much as possible with each read from main memory.
• Be VERY careful when looping through multidimensional arrays. (Warning: array storage order in C is opposite to Fortran 90.)

Be aware that many older codes (pre 1995) were written when memory reads were cheap and most machines had little or no cache memory.

An Example

Time evolution of the wave equation using finite differences on a two-dimensional grid.
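The scheme used for this example is not given here. As an illustration only, the sketch below assumes the standard second-order (leapfrog) finite-difference update on a square grid with equal spacing in both directions, with boundary values held fixed. The module and argument names are assumptions. The point of interest is the loop order: the inner loop runs over the first array index, giving unit stride in Fortran.

```fortran
module wave_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed definition of the dp kind
contains

  ! Advance the 2D wave equation by one time step.
  !   u_prev, u : field at the previous and current time step
  !   u_next    : field at the next time step (output)
  !   c2        : (c*dt/dx)**2, with wave speed c, time step dt, spacing dx
  subroutine wave_step(u_prev, u, u_next, c2)
    real(kind=dp), intent(in)  :: u_prev(:,:), u(:,:)
    real(kind=dp), intent(out) :: u_next(:,:)
    real(kind=dp), intent(in)  :: c2
    integer :: i, j, nx, ny

    nx = size(u, 1)
    ny = size(u, 2)

    ! Boundary values are simply held fixed (assumed boundary condition).
    u_next(:, 1)  = u(:, 1)
    u_next(:, ny) = u(:, ny)
    u_next(1, :)  = u(1, :)
    u_next(nx, :) = u(nx, :)

    do j = 2, ny - 1          ! outer loop over the second (slow) index
       do i = 2, nx - 1       ! inner loop over the first index: unit stride
          u_next(i, j) = 2.0_dp*u(i, j) - u_prev(i, j)                &
               + c2*( u(i+1, j) + u(i-1, j) + u(i, j+1) + u(i, j-1)   &
                      - 4.0_dp*u(i, j) )
       end do
    end do
  end subroutine wave_step

end module wave_sketch
```

Swapping the two loops gives the same answer but strides through memory nx words at a time. For this explicit scheme the time step must be chosen so that `c2` does not exceed 0.5, otherwise the solution is unstable.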