CS61: Systems Programming and Machine Organization
Harvard University, Fall 2009
Lecture 14: Cache Performance Measurement and Optimization
Prof. Matt Welsh
October 20, 2009

Topics for today
● Cache performance metrics
● Discovering your cache's size and performance
● The "Memory Mountain"
● Matrix multiply, six ways
● Blocked matrix multiplication

Cache Performance Metrics

Miss Rate
● Fraction of memory references not found in the cache (# misses / # references)
● Typical numbers:
  ● 3-10% for L1
  ● Can be quite small (e.g., < 1%) for L2, depending on size and locality

Hit Time
● Time to deliver a line in the cache to the processor (includes the time to determine whether the line is in the cache)
● Typical numbers:
  ● 1-2 clock cycles for L1
  ● 5-20 clock cycles for L2

Miss Penalty
● Additional time required because of a miss
● Typically 50-200 cycles for main memory

Writing Cache Friendly Code
● Repeated references to variables are good (temporal locality)
● Stride-1 reference patterns are good (spatial locality)
● Examples below assume a cold cache, 4-byte words, and 4-word cache blocks

  int sum_array_rows(int a[M][N])
  {
      int i, j, sum = 0;

      for (i = 0; i < M; i++)
          for (j = 0; j < N; j++)
              sum += a[i][j];
      return sum;
  }

  Miss rate = 1/4 = 25%

  int sum_array_cols(int a[M][N])
  {
      int i, j, sum = 0;

      for (j = 0; j < N; j++)
          for (i = 0; i < M; i++)
              sum += a[i][j];
      return sum;
  }

  Miss rate = 100%

Determining cache characteristics
● Say I gave you a machine and didn't tell you anything about its cache size or speeds. How would you figure these values out?
● Idea: write a program to measure the cache's behavior and performance.
  ● The program needs to perform memory accesses with different locality patterns.
● Simple approach:
  ● Allocate an array of size W words.
  ● Loop over the array with stride S and measure the speed of the memory accesses.
  ● Vary W and S to estimate the cache characteristics.
  [Diagram: an array of W = 32 words accessed with stride S = 4 words.]
● What happens as you vary W and S?
  ● Changing W varies the total amount of memory accessed by the program. As W grows larger than one level of the cache, the performance of the program will drop.
  ● Changing S varies the spatial locality of each access. If S is less than the size of a cache line, sequential accesses will be fast; if S is greater than the size of a cache line, sequential accesses will be slower.
● See the end of the lecture notes for an example C program that does this.

Varying Working Set
● Keep the stride constant at S = 1, and vary W from 1 KB to 8 MB.
● Shows the size and read throughput of the different cache levels and of main memory.
[Figure: read throughput (MB/s, up to about 1200) vs. working set size from 1 KB to 8 MB. The curve shows distinct L1 cache, L2 cache, and main memory regions. Annotations ask: why the dropoff in performance here? why is this dropoff not so drastic?]
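As a rough illustration of how such a curve can be read off programmatically (this helper is my own sketch, not part of the lecture code; the sizes[] and mbps[] arrays stand in for hypothetical measurement results), one could estimate the L1 size as the largest working set whose throughput is still close to the peak plateau:

  /* Hypothetical post-processing sketch. sizes[] holds working-set sizes in
   * bytes, mbps[] the measured stride-1 read throughput for each size, and n
   * the number of samples. Returns a rough L1 size estimate: the largest
   * working set whose throughput is within 20% of the best observed. */
  static int estimate_l1_size(const int sizes[], const double mbps[], int n)
  {
      double peak = 0.0;
      int i, best = 0;

      for (i = 0; i < n; i++)
          if (mbps[i] > peak)
              peak = mbps[i];

      for (i = 0; i < n; i++)
          if (mbps[i] >= 0.8 * peak && sizes[i] > best)
              best = sizes[i];

      return best;
  }

The 20% threshold is an arbitrary choice; the point is only that the knees in the measured curve mark the cache boundaries.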
Varying stride
● Keep the working set constant at W = 256 KB, and vary the stride from 1 to 16 words.
● Shows the cache block size.
[Figure: read throughput (MB/s, axis up to about 800) vs. stride in words (s1 to s16). Throughput drops as the stride grows until, beyond the block size, there is one access per cache line. Annotation asks: why the gradual drop in performance?]

The Memory Mountain
● Pentium III, 550 MHz; 16 KB L1 cache; 512 KB L2 cache.
[Figure: read throughput (MB/s, up to about 1200) as a function of working set size (2 KB to 8 MB) and stride (s1 to s15). The surface shows ridges of temporal locality and slopes of spatial locality; the L1 dropoff occurs at 16 KB and the L2 dropoff at 512 KB, with the main memory region beyond.]

X86-64 Memory Mountain
● Pentium Nocona Xeon x86-64, 3.2 GHz; 12 Kuop on-chip L1 trace cache; 16 KB on-chip L1 d-cache; 1 MB off-chip unified L2 cache.
[Figure: read throughput (MB/s, up to about 6000) vs. working set size (4 KB to 128 MB) and stride (s1 to s29), showing L1, L2, and memory regions with ridges of temporal locality and slopes of spatial locality.]

Opteron Memory Mountain
● AMD Opteron, 2 GHz.
[Figure: read throughput (MB/s, up to about 3000) vs. working set size (4 KB to 128 MB) and stride (s1 to s29), showing L1, L2, and memory regions.]

Matrix Multiplication Example
● Matrix multiplication is heavily used in numeric and scientific applications.
● It's also a nice example of a program that is highly sensitive to cache effects.
● Multiply two N x N matrices:
  ● O(N^3) total operations
  ● Read N values for each source element
  ● Sum up N values for each destination element

  /* ijk */
  for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
          sum = 0.0;                    /* variable sum held in a register */
          for (k = 0; k < n; k++)
              sum += a[i][k] * b[k][j];
          c[i][j] = sum;
      }
  }

Example: computing one element of the product. The (0,0) entry is the dot product of row 0 of A and column 0 of B: 4*3 + 2*2 + 7*5 = 51.

  [ 4 2 7 ]   [ 3 0 1 ]   [ 51 . . ]
  [ 1 8 2 ] x [ 2 4 5 ] = [  . . . ]
  [ 6 0 1 ]   [ 5 9 1 ]   [  . . . ]

Miss Rate Analysis for Matrix Multiply
Assume:
● Line size = 32 B (big enough for four 64-bit "double" values)
● Matrix dimension (N) is very large
● Cache is not even big enough to hold multiple rows
Analysis method:
● Look at the access pattern of the inner loop: row i of A, column j of B, and element (i,j) of C.

Layout of C Arrays in Memory (review)
● C arrays are allocated in row-major order: each row occupies contiguous memory locations.
● Stepping through the columns in one row:
    for (i = 0; i < N; i++) sum += a[0][i];
  ● Accesses successive elements.
  ● Compulsory miss rate = (8 bytes per double) / (block size of cache).
● Stepping through the rows in one column:
    for (i = 0; i < n; i++) sum += a[i][0];
  ● Accesses distant elements -- no spatial locality!
  ● Compulsory miss rate = 100%.

Matrix Multiplication (ijk)

  /* ijk */
  for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
          sum = 0.0;
          for (k = 0; k < n; k++)
              sum += a[i][k] * b[k][j];
          c[i][j] = sum;
      }
  }

● Inner loop accesses: row i of A (row-wise), column j of B (column-wise), element (i,j) of C (fixed).
● Misses per inner-loop iteration: A = 0.25, B = 1.0, C = 0.0.
● Assuming a cache line size of 32 bytes, the compulsory miss rate for row-wise access is 8 bytes per double / 32 bytes = 1/4 = 0.25.

Matrix Multiplication (jik)

  /* jik */
  for (j = 0; j < n; j++) {
      for (i = 0; i < n; i++) {
          sum = 0.0;
          for (k = 0; k < n; k++)
              sum += a[i][k] * b[k][j];
          c[i][j] = sum;
      }
  }

● Inner loop accesses: A row-wise, B column-wise, C fixed.
● Misses per inner-loop iteration: A = 0.25, B = 1.0, C = 0.0.
● Same as ijk -- we just swapped i and j.
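The lecture shows only the loop nests; to actually time one of the variants you need a small driver around it. The following is a minimal sketch of such a driver for the ijk version (the matrix size N, the initialization values, and the use of clock() are my own choices, not from the notes):

  #include <stdio.h>
  #include <time.h>

  #define N 400    /* hypothetical matrix size; not specified in the notes */

  static double a[N][N], b[N][N], c[N][N];

  int main(void)
  {
      int i, j, k;
      double sum;
      clock_t start, end;

      /* Fill the source matrices with arbitrary values. */
      for (i = 0; i < N; i++) {
          for (j = 0; j < N; j++) {
              a[i][j] = i + j;
              b[i][j] = i - j;
          }
      }

      start = clock();

      /* The ijk loop nest from the lecture. */
      for (i = 0; i < N; i++) {
          for (j = 0; j < N; j++) {
              sum = 0.0;
              for (k = 0; k < N; k++)
                  sum += a[i][k] * b[k][j];
              c[i][j] = sum;
          }
      }

      end = clock();

      /* Print one result element so the computation isn't optimized away. */
      printf("ijk: %.2f seconds (c[0][0] = %f)\n",
             (double)(end - start) / CLOCKS_PER_SEC, c[0][0]);
      return 0;
  }

Swapping in any of the other five loop orderings below gives one way to reproduce the per-ordering comparison shown later in the lecture.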
Matrix Multiplication (kij)

  /* kij */
  for (k = 0; k < n; k++) {
      for (i = 0; i < n; i++) {
          r = a[i][k];
          for (j = 0; j < n; j++)
              c[i][j] += r * b[k][j];
      }
  }

● Inner loop accesses: element (i,k) of A (fixed), row k of B (row-wise), row i of C (row-wise).
● Misses per inner-loop iteration: A = 0.0, B = 0.25, C = 0.25.
● Now we suffer 0.25 compulsory misses per iteration for the B and C accesses.
● We also need to store the "temporary" result c[i][j] back on each innermost-loop iteration!

Matrix Multiplication (ikj)

  /* ikj */
  for (i = 0; i < n; i++) {
      for (k = 0; k < n; k++) {
          r = a[i][k];
          for (j = 0; j < n; j++)
              c[i][j] += r * b[k][j];
      }
  }

● Inner loop accesses: A fixed, B row-wise, C row-wise.
● Misses per inner-loop iteration: A = 0.0, B = 0.25, C = 0.25.
● Same as kij.

Matrix Multiplication (jki)

  /* jki */
  for (j = 0; j < n; j++) {
      for (k = 0; k < n; k++) {
          r = b[k][j];
          for (i = 0; i < n; i++)
              c[i][j] += a[i][k] * r;
      }
  }

● Inner loop accesses: column k of A (column-wise), element (k,j) of B (fixed), column j of C (column-wise).
● Misses per inner-loop iteration: A = 1.0, B = 0.0, C = 1.0.

Matrix Multiplication (kji)

  /* kji */
  for (k = 0; k < n; k++) {
      for (j = 0; j < n; j++) {
          r = b[k][j];
          for (i = 0; i < n; i++)
              c[i][j] += a[i][k] * r;
      }
  }

● Inner loop accesses: A column-wise, B fixed, C column-wise.
● Misses per inner-loop iteration: A = 1.0, B = 0.0, C = 1.0.
● Same as jki.

Summary of Matrix Multiplication
Grouping the six loop orderings (shown above) by their inner-loop behavior:
● ijk or jik: 2 loads, 0 stores; misses/iter = 1.25
● kij or ikj: 2 loads, 1 store; misses/iter = 0.5
● jki or kji: 2 loads, 1 store; misses/iter = 2.0

Pentium Matrix Multiply Performance
[Figure: cycles per inner-loop iteration vs. array size n, for n from 25 to 400, for all six versions. The jki/kji pair (3 memory accesses, 2 misses/iter) is slowest, the kij/ikj pair (3 memory accesses, 0.5 misses/iter) is in the middle, and the ijk/jik pair (2 memory accesses, 1.25 misses/iter) is fastest.]
● Versions with the same number of memory accesses and the same miss rate perform about the same.
● Lower misses/iter tends to do better.
● The ijk and jik versions are fastest, even though they have a higher miss rate than the kij and ikj versions.

Using blocking to improve locality
Blocked matrix multiplication:
● Break each matrix into smaller blocks and perform independent multiplications on each block.
● Improves locality by operating on one block at a time.
● Best if each block can fit in the cache!

Example: break each matrix into four sub-blocks:

  [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
  [ A21 A22 ] x [ B21 B22 ] = [ C21 C22 ]

Key idea: the sub-blocks (i.e., the Axy) can be treated just like scalars:

  C11 = A11*B11 + A12*B21     C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21     C22 = A21*B12 + A22*B22
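To see concretely that sub-blocks really do behave like scalars, the small check below (my own illustration, not part of the lecture) multiplies two 4 x 4 matrices both directly and via the four block equations above, using 2 x 2 sub-blocks, and asserts that the results agree:

  #include <assert.h>
  #include <stdio.h>

  #define N     4    /* hypothetical full matrix size */
  #define BSIZE 2    /* each matrix splits into 2 x 2 sub-blocks of size BSIZE x BSIZE */

  /* C[bi][bj] += A[bi][bk] * B[bk][bj], where the indices select sub-blocks. */
  static void block_mul_add(double c[N][N], double a[N][N], double b[N][N],
                            int bi, int bj, int bk)
  {
      int i, j, k;

      for (i = 0; i < BSIZE; i++)
          for (j = 0; j < BSIZE; j++)
              for (k = 0; k < BSIZE; k++)
                  c[bi*BSIZE + i][bj*BSIZE + j] +=
                      a[bi*BSIZE + i][bk*BSIZE + k] * b[bk*BSIZE + k][bj*BSIZE + j];
  }

  int main(void)
  {
      double a[N][N], b[N][N], direct[N][N] = {{0}}, blocked[N][N] = {{0}};
      int i, j, k, bi, bj, bk;

      /* Small integer-valued inputs, so the double arithmetic is exact. */
      for (i = 0; i < N; i++) {
          for (j = 0; j < N; j++) {
              a[i][j] = i + 2 * j;
              b[i][j] = i - j;
          }
      }

      /* Ordinary triple loop. */
      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
              for (k = 0; k < N; k++)
                  direct[i][j] += a[i][k] * b[k][j];

      /* Blocked version: Cxy = Ax1*B1y + Ax2*B2y, treating sub-blocks like scalars. */
      for (bi = 0; bi < N/BSIZE; bi++)
          for (bj = 0; bj < N/BSIZE; bj++)
              for (bk = 0; bk < N/BSIZE; bk++)
                  block_mul_add(blocked, a, b, bi, bj, bk);

      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
              assert(direct[i][j] == blocked[i][j]);

      printf("direct and blocked results match\n");
      return 0;
  }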
Blocked Matrix Multiply (bijk)

  for (jj = 0; jj < n; jj += bsize) {
      /* Zero the columns of C touched by this block of columns. */
      for (i = 0; i < n; i++)
          for (j = jj; j < min(jj+bsize, n); j++)
              c[i][j] = 0.0;
      for (kk = 0; kk < n; kk += bsize) {
          for (i = 0; i < n; i++) {
              for (j = jj; j < min(jj+bsize, n); j++) {
                  sum = 0.0;
                  for (k = kk; k < min(kk+bsize, n); k++)
                      sum += a[i][k] * b[k][j];
                  c[i][j] += sum;
              }
          }
      }
  }

Blocked matrix multiply operation

Step 1: Pick the location of the block in matrix B.
● The block spans rows [kk ... kk+bsize] and columns [jj ... jj+bsize] of B, and slides across matrix B left-to-right, top-to-bottom, by "bsize" units at a time.

Step 2: Slide a row sliver across matrix A.
● Hold the block in matrix B fixed.
● The row sliver slides from top to bottom across matrix A.
● The row sliver spans columns [kk ... kk+bsize].

Step 3: Slide a column sliver across the block in matrix B.
● The row sliver in matrix A stays fixed.
● The column sliver slides from left to right across the block.
● The column sliver spans rows [kk ... kk+bsize] in matrix B.

Step 4: Iterate over the row and column slivers together.
● Compute the dot product of the two vectors of length "bsize".
● The dot product is added to the contents of cell (i,j) in matrix C.

Locality properties
What is the locality of this algorithm?
● We iterate over all elements of the block N times (once for each row sliver in A).
● The row sliver in matrix A is accessed "bsize" times (once for each column sliver in B).
● If the block and the row sliver fit in the cache, performance should rock!

Pentium Blocked Matrix Multiply Performance
● Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik).
● Performance is relatively insensitive to array size.
[Figure: cycles per iteration vs. array size n, for n from 25 to 400, comparing kji, jki, kij, ikj, jik, and ijk with bijk (bsize = 25) and bikj (bsize = 25). The blocked versions are the fastest and stay nearly flat as n grows.]
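A quick back-of-the-envelope check (my own, not from the slides) of why bsize = 25 is a sensible choice for the 16 KB L1 cache of the Pentium III measured earlier: the inner loops touch roughly one bsize x bsize block of B plus one row sliver of A,

  block of B:      25 * 25 doubles * 8 bytes = 5,000 bytes (about 5 KB)
  row sliver of A:       25 doubles * 8 bytes =   200 bytes

so both fit comfortably within 16 KB, which is exactly what the locality argument above requires.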
Cache performance test program

  /* The test function */
  void test(int elems, int stride)
  {
      int i, result = 0;
      volatile int sink;

      for (i = 0; i < elems; i += stride)
          result += data[i];
      sink = result;    /* So the compiler doesn't optimize away the loop */
  }

  /* Run test(elems, stride) and return read throughput (MB/s) */
  double run(int size, int stride)
  {
      uint64_t start_cycles, end_cycles, diff;
      int elems = size / sizeof(int);

      test(elems, stride);                        /* Warm up the cache */
      start_cycles = get_cpu_cycle_counter();     /* Read CPU cycle counter */
      test(elems, stride);                        /* Run test */
      end_cycles = get_cpu_cycle_counter();       /* Read CPU cycle counter again */
      diff = end_cycles - start_cycles;           /* Compute time */
      return (size / stride) / (diff / CPU_MHZ);  /* Convert cycles to MB/s */
  }

Cache performance main routine

  #define CPU_MHZ   (2.8 * 1024.0 * 1024.0)   /* e.g., 2.8 GHz */
  #define MINBYTES  (1 << 10)                 /* Working set size ranges from 1 KB ... */
  #define MAXBYTES  (1 << 23)                 /* ... up to 8 MB */
  #define MAXSTRIDE 16                        /* Strides range from 1 to 16 */
  #define MAXELEMS  (MAXBYTES / sizeof(int))

  int data[MAXELEMS];     /* The array we'll be traversing */

  int main()
  {
      int size;     /* Working set size (in bytes) */
      int stride;   /* Stride (in array elements) */

      init_data(data, MAXELEMS);   /* Initialize each element in data to 1 */

      for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
          for (stride = 1; stride <= MAXSTRIDE; stride++)
              printf("%.1f\t", run(size, stride));
          printf("\n");
      }
      exit(0);
  }
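The two helpers referenced above, init_data() and get_cpu_cycle_counter(), are not defined in the notes. A minimal sketch of what they might look like on an x86 machine with GCC is shown below; the rdtsc-based cycle counter is my assumption about how the counter is read. Note also that to build the two excerpts as a single file, data[] and these helpers must be declared before test() and run() use them.

  #include <stdint.h>

  /* Set every element of the traversal array to 1. */
  void init_data(int *data, int n)
  {
      int i;

      for (i = 0; i < n; i++)
          data[i] = 1;
  }

  /* Read the CPU's time-stamp counter (x86 "rdtsc" instruction, GCC inline asm). */
  uint64_t get_cpu_cycle_counter(void)
  {
      uint32_t lo, hi;

      __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
      return ((uint64_t)hi << 32) | lo;
  }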