MMM Optimizations
● MMM – matrix-matrix multiplication
● Reference: How to Write Fast Numerical Code: A Small Introduction, Srinivas Chellappa, Franz Franchetti, and Markus Püschel – Carnegie Mellon University

Parallelism
● Task of achieving highest performance for an implementation lies to a great extent with the programmer
● Platform dependent
  – Needs to be repeated for every new hardware release

Automatic Performance Tuning
● Active research area
● Adaptive libraries
  – Strategy determined at runtime from the platform's memory hierarchy
● Source code generators
  – Produce implementations of an algorithm from scratch
  – Generate many different variants of the algorithm & select the fastest one
  – Example: ATLAS – Automatically Tuned Linear Algebra Software

Reuse
● Measures how often a given input value is used in a computation during the algorithm
● High degree of reuse – algorithm may perform better with the memory hierarchy
  – Number of computations dominates data transfers from memory to CPU
  – Said to be CPU bound
● Low degree of reuse
  – Number of data transfers from memory to CPU is high compared to the number of computations
  – Said to be memory bound

Matrix-Matrix Multiplication
● Generally consider C = C + AB
● If A & B are n×n matrices
  – n^3 multiplications
  – n^3 additions
  – O(n^3) floating point operations
● Reuse: O(n^3/n^2) = O(n)
● Better implementations exist

Memory Hierarchy Locality
● Temporal memory locality
  – A memory location that is referenced will likely be referenced again in the near future
● Spatial memory locality
  – Likelihood of referencing a memory location is higher if a nearby location was recently referenced
● High performance software needs to take advantage of locality

Registers
● Register spill – register contents written to lower levels of memory – expensive
● Compiler optimizations

Cache Memory
● Cache miss
  – Data not in cache – fetched from memory
  – Need to minimize the number of cache misses
● Divided into cache lines (blocks)
  – Data moved in & out of cache in chunks of the cache line size
  – Take advantage of spatial locality

Design Principles
● Once data is brought into cache, it should be reused as much as possible before it is evicted
● Programs need to be designed to perform computations on neighboring data before the cache line is evicted

Cache Analysis
● Consider a simple direct-mapped 16-byte data cache with 2 cache lines, each of size 8 bytes
  – 2 floats per line
  – Assume data is cache-aligned (the data in question starts at the beginning of a cache line)
● Consider the following code fragment

    float X[8];
    for(int j=0; j<2; j++)
      for(int i=0; i<8; i++)
        access(X[i]);

● 8 cache hits & 8 cache misses

Cache Analysis
● Stride of 2

    float X[8];
    for(int j=0; j<2; j++) {
      for(int i=0; i<7; i+=2)
        access(X[i]);
      for(int i=1; i<8; i+=2)
        access(X[i]);
    }

● 0 hits & 16 misses

Cache Analysis
● Different code fragment – same logical result

    float X[8];
    for(i=0; i<2; i++)
      for(k=0; k<2; k++)
        for(j=0; j<4; j++)
          access(X[j + 4*i]);

● 12 hits & 4 misses

CPU Features
● Most processors contain pipelined superscalar out-of-order cores with multiple execution units
  – Pipelining – different parts of a processor work simultaneously on different components of different instructions
  – Superscalar cores – can retire more than 1 instruction per processor clock cycle
  – Out-of-order – reschedule the instruction sequence around dependencies

Peak Performance
● Theoretical peak performance – theoretical rate at which a processor can perform floating point operations
● Measured in FLOPS

Using Compilers
● Use appropriate compiler flags, language extensions, and monitoring & analysis of the compiler's output

Variable Declarations
● C assigns variables to different storage classes by default
● Overridden by a storage class specifier
  – extern – shared by different source files
  – static – exists as long as the program executes
  – auto – allocated on the stack
  – register – compiler allocates space directly in registers

Variable Declarations
● Qualifiers – specify variable attributes
  – const – constant
  – volatile – values may be influenced by sources external to the compiler's knowledge
  – restrict – tells the compiler that the memory address will be accessed only through a specific pointer
    ● No aliasing
● Memory alignment – request that variables be aligned to cache line boundaries or virtual memory pages

Inline Assembly
● Can include assembly code
● Language can provide intrinsics – ability to access special machine instructions

Compiler Flags
● C standards – version of C
● Architecture specifications
● Optimization levels
● Specialized compiler options

Compiler Output
● Optimization reports – inform what optimizations are possible or why certain attempted optimizations did not work

Performance Optimization
● Finding hotspots (most frequently executed code regions)
● Timing hotspots
● Analyzing measured runtimes

Finding the Hotspots
● Profiling tools
  – GNU gprof
  – Intel VTune
  – Valgrind

Gprof Example

    #include <stdio.h>

    float function1() {
      float retval = 0;
      for (int i = 1; i < 1000000; i++)
        retval += (1.0f / i);        /* 1/i would be integer division */
      return retval;
    }

    float function2() {
      float retval = 0;
      for (int i = 1; i < 10000000; i++)
        retval += (1.0f / (i + 1));
      return retval;
    }

    void function3() {
      return;
    }

    int main() {
      printf("Result: %.2f\n", function1());
      printf("Result: %.2f\n", function2());
      if (1 == 2)
        function3();                 /* never called */
      return 0;
    }

Gprof Example
● Compile & link (library after the source file so linking succeeds):
  – gcc -O0 -g -pg -o ourProgram ourProgram.c -lm
● Execute:
  – ./ourProgram
● Run gprof:
  – gprof ourProgram gmon.out > profile.txt

Timing a Hotspot
● Read current time
● Execute hotspot a number of times
● Read current time
● Calculate the time per iteration

Known Problems
● Too few iterations of the function to be timed are executed between the two time stamp readings, and the resulting timing is inaccurate due to poor timer resolution.
● Too many iterations are executed between the two time stamp readings, and the resulting timing is affected by system events.
● The machine is under load and the load has side effects on the measured program.
Known Problems
● Multiple timing jobs are executed concurrently, and they interfere with one another.
● Data alignment of input and output triggers cache problems.
● Virtual-to-physical memory translation makes timing irreproducible.
● The time stamp counter overflows and either triggers an interrupt or produces a meaningless value.

Known Problems
● Reading the time stamp counters requires hundred(s) of cycles, which itself affects the timing.
● The linking order of object files changes the locality of static constants and this produces cache interference.
● The machine has not been rebooted in a long time and the operating system state causes problems.

Known Problems
● The control flow in the numerical kernel being timed is data-dependent and the test data is not representative.
● The kernel is in-place (e.g., the input is a vector x and the output is written back to x), and the norm of the output is larger than the norm of the input. Repetitive application of the kernel leads to exponential growth of the norm and finally triggers floating-point exceptions, which interfere with the timing.

Known Problems
● The transform is timed with a zero vector, and the operating system is "smart" and responds to a request for a large zero-vector dynamic memory allocation by returning a special zero-valued copy-on-write virtual memory page. Read accesses to this "page" are much faster than accesses to a page that is actually allocated, since this page is a special one maintained by the operating system for efficiency.

Analyzing Measured Runtime
● 2 basic questions:
  – What is the limiting resource?
  – How efficient is the implementation with respect to the limiting resource?
● Normalization
  – Asymptotic or exact operations count
● Relative performance
  – Comparing measured performance to theoretical peak performance

Optimization for Memory Hierarchy
● Performance-conscious programming
● Optimizations for cache
● Optimizations for registers & CPU
● Parameter-based performance tuning

Performance-Conscious Programming
● Object-oriented programming – avoided for performance-critical parts of a program
  – Late binding
● Languages not compiled to native machine code – avoided
● Use 1-dimensional arrays whenever possible
  – Linearize higher-dimensional arrays

Performance-Conscious Programming
● Avoid complicated struct & union data types
  – Prefer multiple arrays over an array of structs
● Dynamically generated data structures should be avoided if the algorithm can be implemented using arrays
● While loops & loops with complicated termination conditions should be avoided
  – Use for loops with loop counters & loop bounds known at compile time
● Selection structures should be avoided in hotspots & inner loops
● Macros instead of small functions

Cache Optimization
● Cache & TLB – goal is to reuse data as much as possible
● Optimization methods
  – Blocking – working on chunks of data that fit into cache
  – Loop merging – merging consecutive loops that traverse the same data
  – Buffering – copying into contiguous temporary buffers

Blocking
● Perform computation in blocks that operate on a subset of the input data – memory locality
● Tiling – split a loop into smaller loops
● Sometimes recursion

Loop Merging
● For sequential loops – if all operations of the first loop do not need to be executed before the second loop, it may be possible to merge them into 1 loop

Buffering
● Logically close data elements can be stored far apart from each other
  – e.g., elements in a column of a matrix that is stored in row-major order
● Copy logically close data elements into a contiguous buffer

CPU & Register Optimization
● Optimization goals for a modern CPU
  – Have inner loops with adequately large loop bodies
  – Have many independent operations inside an
inner loop body
  – Use automatic variables whenever possible
  – Reuse loaded data elements to the extent possible
  – Avoid math library function calls inside an inner loop

Blocking
● Partitions data into chunks on which the computation can be performed within the register set

    for(i=0; i<8; i++) {
      y[2*i]   = x[2*i] + x[2*i+1];
      y[2*i+1] = x[2*i] - x[2*i+1];
    }

● Blocked into

    for(i1=0; i1<4; i1++)
      for(i2=0; i2<2; i2++) {
        y[4*i1+2*i2]   = x[4*i1+2*i2] + x[4*i1+2*i2+1];
        y[4*i1+2*i2+1] = x[4*i1+2*i2] - x[4*i1+2*i2+1];
      }

Unrolling & Scheduling
● Unrolling produces larger basic blocks
  – Decreases the number of conditional branches
  – Increases the number of operations in a basic block
  – Easier to determine data dependencies
  – Easier to do rescheduling

    for(i1=0; i1<4; i1++) {
      y[4*i1]   = x[4*i1]   + x[4*i1+1];
      y[4*i1+1] = x[4*i1]   - x[4*i1+1];
      y[4*i1+2] = x[4*i1+2] + x[4*i1+3];
      y[4*i1+3] = x[4*i1+2] - x[4*i1+3];
    }

Scalar Replacement
● Pointer analysis – complicated
● Replace arrays that are fully inside the scope of the innermost loop by 1 automatic scalar variable per array element

    double t[2];
    for(i=0; i<8; i++) {
      t[0] = x[2*i] + x[2*i+1];
      t[1] = x[2*i] - x[2*i+1];
      y[2*i]   = t[0] * D[2*i];
      y[2*i+1] = t[1] * D[2*i+1];
    }

● becomes

    double t0, t1;
    for(i=0; i<8; i++) {
      t0 = x[2*i] + x[2*i+1];
      t1 = x[2*i] - x[2*i+1];
      y[2*i]   = t0 * D[2*i];
      y[2*i+1] = t1 * D[2*i+1];
    }

● Automatic variables can be stored in registers – arrays stay in memory

Precomputation of Constants
● All constants known ahead of time should be precomputed at compile time or initialization time & stored in a data array

    for(i=0; i<8; i++)
      y[i] = x[i] * sin(M_PI * i / 8);

● becomes

    static double D[8];

    void init() {
      for(int i=0; i<8; i++)
        D[i] = sin(M_PI * i / 8);
    }

    ... // in the kernel
    for(i=0; i<8; i++)
      y[i] = x[i] * D[i];
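Bringing the notes full circle, the cache-blocking idea applied to the MMM kernel C = C + AB from the opening slides gives the classic tiled triple loop. A minimal sketch with illustrative sizes N = 4 and NB = 2; in practice NB is a tuning parameter (the kind ATLAS searches over), chosen so that the working set of three NB×NB tiles fits in cache:

```c
#include <stdio.h>

#define N  4   /* matrix dimension (illustrative) */
#define NB 2   /* block size; tuned so tiles of A, B, C fit in cache together */

/* naive C = C + A*B: O(n^3) operations, reuse O(n) */
void mmm_naive(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* blocked C = C + A*B: the three inner loops work on one NB x NB tile,
   so each loaded tile is reused before it is evicted from cache */
void mmm_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i0 = 0; i0 < N; i0 += NB)
        for (int j0 = 0; j0 < N; j0 += NB)
            for (int k0 = 0; k0 < N; k0 += NB)
                for (int i = i0; i < i0 + NB; i++)
                    for (int j = j0; j < j0 + NB; j++)
                        for (int k = k0; k < k0 + NB; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Both variants perform the same 2n^3 floating point operations; blocking only reorders them so that data brought into cache is reused before eviction, per the design principles above.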