15-213: Memory System Performance
March 21, 2000 (class19.ppt)

Topics
• Impact of cache parameters
• Impact of memory reference patterns
  – memory mountain range
  – matrix multiply

Basic Cache Organization
• Address space: N = 2^n bytes; an address is n = t + s + b bits, split into a tag (t bits), a set index (s bits), and a block offset (b bits)
• Cache: C = S x E x B bytes, organized as S = 2^s sets with E lines per set
• Each cache line (cache block) holds a valid bit (1 bit), a tag (t bits), and B = 2^b bytes of data (the line size)

Multi-Level Caches
Options: separate data and instruction caches, or a unified cache
• On the processor chip: registers, TLB, L1 Dcache, L1 Icache; off chip: L2 cache, main memory, disk

                 regs     L1 caches   L2 cache      memory        disk
    size:        200 B    8-64 KB     1-4 MB SRAM   128 MB DRAM   9 GB
    speed:       2 ns     2 ns        4 ns          60 ns         8 ms
    $/Mbyte:                          $100/MB       $1.50/MB      $0.05/MB
    line size:   8 B      32 B        32 B          8 KB

• Moving down the hierarchy: larger, slower, cheaper
• Also: larger line size, higher associativity, more likely to write back

Cache Performance Metrics
Miss Rate
• fraction of memory references not found in the cache (misses / references)
• Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
• time to deliver a line in the cache to the processor (includes the time to determine whether the line is in the cache)
• Typical numbers: 1 clock cycle for L1; 3-8 clock cycles for L2
Miss Penalty
• additional time required because of a miss
  – typically 25-100 cycles for main memory

Impact of Cache and Line Size
Cache Size
• impact on miss rate: larger is better
• impact on hit time: smaller is faster
Line Size
• impact on miss rate:
  – big lines can help exploit spatial locality (if it exists)
  – however, for a given cache size, bigger lines mean there are fewer of them (which can hurt the miss rate)
• impact on miss penalty:
  – given a fixed amount of bandwidth, larger lines mean longer transfer times (and hence larger miss penalties)

Impact of Associativity
• Direct mapped, set associative, or fully associative?
Total Cache Size (tags + data):
• higher associativity requires more tag bits and LRU state-machine bits
• additional read/write logic and multiplexers (MUXs)
Miss Rate:
• higher associativity (generally) decreases the miss rate
Hit Time:
• higher associativity increases hit time; direct mapped is the fastest
Miss Penalty:
• higher associativity may require additional delays to select a victim
  – in practice, this decision is often overlapped with other parts of the miss

Impact of Write Strategy
• Write through or write back?
Advantages of Write Through:
• read misses are cheaper
  – why? an evicted line is never dirty, so a miss never has to wait for a write-back
• simpler to implement
  – uses a write buffer to pipeline writes
Advantages of Write Back:
• reduced traffic to memory
  – especially if the bus is used to connect multiple processors or I/O devices
• individual writes are performed at the processor rate
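To make the write-policy distinction concrete, here is a minimal sketch (illustrative only, not the lecture's code; all names such as cache_line and store_write_back are made up) of a toy one-line cache over a simulated memory, contrasting when main memory gets updated under the two policies.

  /* Hedged sketch: toy one-line cache over a simulated memory,
   * showing when memory is updated under write-through vs. write-back. */
  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  #define LINE_BYTES 32

  static uint8_t memory[1024];          /* simulated main memory */

  typedef struct {
      int valid, dirty;                 /* dirty is used only by write-back */
      uint32_t base;                    /* line-aligned address of the cached data */
      uint8_t data[LINE_BYTES];
  } cache_line;

  /* Write-through: update the cached line AND memory on every store.
   * A real design would queue the memory write in a write buffer. */
  static void store_write_through(cache_line *ln, uint32_t addr, uint8_t v)
  {
      ln->data[addr - ln->base] = v;
      memory[addr] = v;                 /* memory updated immediately */
  }

  /* Write-back: update only the cached line and mark it dirty;
   * memory is updated later, when the line is evicted. */
  static void store_write_back(cache_line *ln, uint32_t addr, uint8_t v)
  {
      ln->data[addr - ln->base] = v;
      ln->dirty = 1;
  }

  /* Eviction must flush a dirty line back to memory. */
  static void evict(cache_line *ln)
  {
      if (ln->valid && ln->dirty)
          memcpy(&memory[ln->base], ln->data, LINE_BYTES);
      ln->valid = ln->dirty = 0;
  }

  int main(void)
  {
      cache_line ln = { 1, 0, 64, {0} };            /* caches bytes 64..95 */
      store_write_back(&ln, 70, 0xab);
      printf("write-back, before eviction: memory[70] = 0x%02x\n", memory[70]);
      evict(&ln);                                   /* dirty line written back */
      printf("write-back, after eviction:  memory[70] = 0x%02x\n", memory[70]);
      (void)store_write_through;                    /* write-through path shown above */
      return 0;
  }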
Qualitative Cache Performance Model
Compulsory (aka "Cold") Misses:
• first access to a memory line (which is not in the cache already)
  – since lines are only brought into the cache on demand, this is guaranteed to be a cache miss
• changing the cache size or configuration does not help
Capacity Misses:
• active portion of memory exceeds the cache size
• the only thing that really helps is increasing the cache size
Conflict Misses:
• active portion of the address space fits in the cache, but too many lines map to the same cache entry
• increased associativity and better replacement policies can potentially help

Measuring Memory Bandwidth

  int data[MAXSIZE];

  int test(int size, int stride)
  {
      int i, result = 0;
      int wsize = size / sizeof(int);
      for (i = 0; i < wsize; i += stride)
          result += data[i];
      return result;
  }

• size is given in bytes, stride in words (4-byte ints)

Measuring Memory Bandwidth (cont.)
Measurement
• Time repeated calls to test and report throughput in MB/s (a possible timing harness is sketched below, after the Alpha line-size measurements)
  – if size is sufficiently small, the array can be held in cache
Characteristics of Computation
• stresses the read bandwidth of the system
• increasing the stride yields decreased spatial locality
  – on average each touched cache block of B bytes gets about B/(stride*4) accesses, i.e. a miss rate of about stride*4/B per access, until it bottoms out at one access per block
• increasing size increases the size of the "working set"

Alpha Memory Mountain Range
DEC Alpha 21164, 466 MHz; 8 KB (L1), 96 KB (L2), 2 MB (L3)
[Surface plot: read throughput in MB/s (0-600) vs. data-set size (.5k-16m bytes) and stride (s1-s15 words). The plateaus are labeled L1 Resident, L2 Resident, L3 Resident, and Main Memory Resident.]

Effects Seen in Mountain Range
Cache Capacity
• see sudden drops as the working-set size increases
Cache Block Effects
• performance degrades as the stride increases
  – less spatial locality
• levels off
  – when a single access per line is reached

Alpha Cache Sizes
[Plot: MB/s vs. data-set size for stride = 16.]
Ranges:
  .5k - 8k     running in L1 (high overhead for small data sets)
  16k - 64k    running in L2
  128k         indistinct cutoff (since the L2 cache is 96 KB)
  256k - 2m    running in L3
  4m - 16m     running in main memory

Alpha Line Size Effects
[Plot: MB/s vs. stride (s1-s16) for data-set sizes 8k, 32k, 1024k, and 16m.]
Observed Phenomenon
• each doubling of the stride halves the number of accesses per block
• until the point where there is just 1 access per block
• the line size shows up at the transition from the downward slope to a horizontal line
  – sometimes indistinct

Alpha Line Sizes
[Plot: MB/s vs. stride (s1-s16) for data-set sizes 8k, 32k, 1024k, and 16m.]
Measurements:
  8k      entire array L1 resident; effectively flat (except for overhead)
  32k     shows that the L1 line size = 32 B
  1024k   shows that the L2 line size = 32 B
  16m     L3 line size = 64 B?
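Before moving on to the Xeon measurements, here is a hedged sketch of the kind of measurement driver described above: it times repeated calls to test and converts the result to MB/s. Only test() and data[] come from the slide (with the missing declaration of i added); the MAXSIZE value, the use of clock(), the repetition counts, and the MB/s convention (bytes actually read divided by elapsed time) are illustrative assumptions, not the lecture's actual harness.

  /* Hedged sketch of a possible timing harness (not the lecture's code). */
  #include <stdio.h>
  #include <time.h>

  #define MAXSIZE (4 * 1024 * 1024)     /* 4M ints = 16 MB, matching the largest data set */

  int data[MAXSIZE];

  int test(int size, int stride)        /* size in bytes, stride in words */
  {
      int i, result = 0;
      int wsize = size / sizeof(int);
      for (i = 0; i < wsize; i += stride)
          result += data[i];
      return result;
  }

  double read_mbps(int size, int stride, int reps)
  {
      volatile int sink = 0;            /* keep the calls from being optimized away */
      clock_t start = clock();
      for (int r = 0; r < reps; r++)
          sink += test(size, stride);
      double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
      /* one 4-byte int is read every `stride` words: about size/stride bytes per call;
       * reps should be large enough that secs is well above the clock() resolution */
      double bytes = (double)reps * ((double)size / stride);
      return bytes / (secs * 1e6);
  }

  int main(void)
  {
      printf("size 32 KB, stride 1:  %.0f MB/s\n", read_mbps(32 * 1024, 1, 20000));
      printf("size 16 MB, stride 16: %.0f MB/s\n", read_mbps(16 * 1024 * 1024, 16, 20));
      return 0;
  }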
Xeon Memory Mountain Range
Pentium III Xeon, 550 MHz; 16 KB (L1), 512 KB (L2)
[Surface plot: read throughput in MB/s (0-1200) vs. data-set size (.5k-16m bytes) and stride (s1-s15 words). The plateaus are labeled L1 Resident, L2 Resident, and Main Memory Resident.]

Xeon Cache Sizes
[Plot: MB/s vs. data-set size for stride = 16.]
Ranges:
  .5k - 16k    running in L1 (overhead at the high end)
  32k - 256k   running in L2
  512k         running in main memory (but L2 is supposed to be 512 KB!)
  1m - 16m     running in main memory

Xeon Line Sizes
[Plot: MB/s vs. stride (s1-s16) for data-set sizes 4k, 256k, and 16m.]
Measurements:
  4k     entire array L1 resident; effectively flat (except for overhead)
  256k   shows that the L1 line size = 32 B
  16m    shows that the L2 line size = 32 B

Interactions Between Program & Cache
Major Cache Effects to Consider
• total cache size
  – try to keep heavily used data in the cache closest to the processor
• line size
  – exploit spatial locality
Example Application: matrix multiply (ijk version; the variable sum is held in a register)

  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

• multiply N x N matrices
• O(N^3) total operations
• accesses
  – N reads per source element
  – N values summed per destination
    » but may be able to hold in a register

Matrix Mult. Performance: Sparc20
[Plot: double-precision MFLOPS (0-20) vs. matrix size n (50-200) for the six loop orderings ikj, kij, ijk, jik, jki, kji.]
• as matrices grow in size, they eventually exceed cache capacity
• different loop orderings give different performance
  – cache effects
  – whether or not we can accumulate partial sums in registers

Miss Rate Analysis for Matrix Multiply
Assume:
• line size = 32 B (big enough for 4 64-bit words)
• matrix dimension (N) is very large
  – approximate 1/N as 0.0
• cache is not even big enough to hold multiple rows
Analysis Method:
• look at the access pattern of the inner loop
[Figure: A is indexed by (i,k), B by (k,j), and C by (i,j); the analysis tracks how the inner loop steps through each.]

Layout of Arrays in Memory
C arrays are allocated in row-major order
• each row in contiguous memory locations
Stepping through columns in one row:

  for (i = 0; i < N; i++)
    sum += a[0][i];

• accesses successive elements
• if the line size (B) > 8 bytes, exploit spatial locality
  – compulsory miss rate = 8 bytes / B
Stepping through rows in one column:

  for (i = 0; i < N; i++)
    sum += a[i][0];

• accesses distant elements
• no spatial locality!
  – compulsory miss rate = 1 (i.e., 100%)
Memory layout (a 256 x 256 array of doubles starting at address 0x80000):

  0x80000   a[0][0]
  0x80008   a[0][1]
  0x80010   a[0][2]
  0x80018   a[0][3]
    ...
  0x807F8   a[0][255]
  0x80800   a[1][0]
  0x80808   a[1][1]
  0x80810   a[1][2]
  0x80818   a[1][3]
    ...
  0x80FF8   a[1][255]
    ...
  0xFFFF8   a[255][255]
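The effect of the two traversal orders is easy to observe directly. Below is a minimal, self-contained sketch (the array size N and the use of clock() are illustrative choices, not from the lecture) that times a row-wise and a column-wise sum over the same row-major array.

  /* Hedged sketch: row-wise vs. column-wise sums over a row-major C array. */
  #include <stdio.h>
  #include <time.h>

  #define N 2048

  static double a[N][N];

  double sum_row_major(void)           /* steps through columns within a row */
  {
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              sum += a[i][j];          /* successive addresses: good spatial locality */
      return sum;
  }

  double sum_column_major(void)        /* steps through rows within a column */
  {
      double sum = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              sum += a[i][j];          /* stride of N*8 bytes: no spatial locality */
      return sum;
  }

  int main(void)
  {
      clock_t t0 = clock();
      double s1 = sum_row_major();
      clock_t t1 = clock();
      double s2 = sum_column_major();
      clock_t t2 = clock();
      printf("row-wise:    %.3f s (sum %g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
      printf("column-wise: %.3f s (sum %g)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
      return 0;
  }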
Matrix Multiplication (ijk)

  /* ijk */
  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

Inner loop: A accessed row-wise along (i,*); B accessed column-wise along (*,j); C fixed at (i,j)
Misses per inner loop iteration:  A 0.25   B 1.0   C 0.0

Matrix Multiplication (jik)

  /* jik */
  for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

Inner loop: A accessed row-wise along (i,*); B accessed column-wise along (*,j); C fixed at (i,j)
Misses per inner loop iteration:  A 0.25   B 1.0   C 0.0

Matrix Multiplication (kij)

  /* kij */
  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }

Inner loop: A fixed at (i,k); B accessed row-wise along (k,*); C accessed row-wise along (i,*)
Misses per inner loop iteration:  A 0.0   B 0.25   C 0.25

Matrix Multiplication (ikj)

  /* ikj */
  for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }

Inner loop: A fixed at (i,k); B accessed row-wise along (k,*); C accessed row-wise along (i,*)
Misses per inner loop iteration:  A 0.0   B 0.25   C 0.25

Matrix Multiplication (jki)

  /* jki */
  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }

Inner loop: A accessed column-wise along (*,k); B fixed at (k,j); C accessed column-wise along (*,j)
Misses per inner loop iteration:  A 1.0   B 0.0   C 1.0

Matrix Multiplication (kji)

  /* kji */
  for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }

Inner loop: A accessed column-wise along (*,k); B fixed at (k,j); C accessed column-wise along (*,j)
Misses per inner loop iteration:  A 1.0   B 0.0   C 1.0

Summary of Matrix Multiplication
ijk (& jik):
• 2 loads, 0 stores
• misses/iter = 1.25

  for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
      sum = 0.0;
      for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
      c[i][j] = sum;
    }
  }

kij (& ikj):
• 2 loads, 1 store
• misses/iter = 0.5

  for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
      r = a[i][k];
      for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
    }
  }

jki (& kji):
• 2 loads, 1 store
• misses/iter = 2.0

  for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
      r = b[k][j];
      for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
    }
  }

Matrix Mult. Performance: DEC5000
[Plot: double-precision MFLOPS (0-3) vs. matrix size n (50-200). kij/ikj (misses/iter = 0.5) are fastest, ijk/jik (misses/iter = 1.25) are next, and jki/kji (misses/iter = 2.0) are slowest.]

Matrix Mult. Performance: Sparc20
[Plot: double-precision MFLOPS (0-20) vs. matrix size n (50-200), same three groups. Annotation: at small sizes, multiple columns of B fit in the cache.]

Matrix Mult. Performance: Alpha 21164
[Plot: double-precision MFLOPS (0-160) vs. matrix size n. Annotations mark where the working set becomes too big for the L1 cache and then too big for the L2 cache. Groupings as above: kij/ikj (misses/iter = 0.5), ijk/jik (1.25), jki/kji (2.0).]
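The MFLOPS numbers in these plots can be reproduced with a simple timing harness. The sketch below is illustrative only (the fixed N, clock()-based timing, and the 2*n^3 flop count are assumptions, not the lecture's benchmark); it times the ijk ordering, and any of the other five loop nests can be dropped into its place.

  /* Hedged sketch of a matrix-multiply timing harness (illustrative). */
  #include <stdio.h>
  #include <time.h>

  #define N 500                        /* matrix size; the plots sweep n up to about 500 */

  static double a[N][N], b[N][N], c[N][N];

  static void mm_ijk(int n)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              double sum = 0.0;
              for (int k = 0; k < n; k++)
                  sum += a[i][k] * b[k][j];
              c[i][j] = sum;
          }
  }

  int main(void)
  {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }

      clock_t t0 = clock();
      mm_ijk(N);
      double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

      /* 2*n^3 double-precision flops: one multiply and one add per inner iteration */
      printf("n = %d, ijk: %.1f MFLOPS (c[0][0] = %g)\n",
             N, 2.0 * N * N * N / secs / 1e6, c[0][0]);
      return 0;
  }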
Matrix Mult. Performance: Pentium III Xeon
[Plot: double-precision MFLOPS (0-90) vs. matrix size n (25-500). kij/ikj and ijk/jik (misses/iter = 0.5 or 1.25) cluster together near the top; jki/kji (misses/iter = 2.0) are well below.]

Blocked Matrix Multiplication
• "Block" (in this context) does not mean "cache block"; instead, it means a sub-block within the matrix
Example: N = 8; sub-block size = 4

  [A11 A12]   [B11 B12]   [C11 C12]
  [A21 A22] x [B21 B22] = [C21 C22]

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars:

  C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
  C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

Blocked Matrix Multiply (bijk)

  for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
      for (j=jj; j < min(jj+bsize,n); j++)
        c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
      for (i=0; i<n; i++) {
        for (j=jj; j < min(jj+bsize,n); j++) {
          sum = 0.0;
          for (k=kk; k < min(kk+bsize,n); k++) {
            sum += a[i][k] * b[k][j];
          }
          c[i][j] += sum;
        }
      }
    }
  }

Blocked Matrix Multiply Analysis
• the innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
• the loop over i steps through n row slivers of A & C, using the same block of B

  for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
      sum = 0.0;
      for (k=kk; k < min(kk+bsize,n); k++) {
        sum += a[i][k] * b[k][j];
      }
      c[i][j] += sum;
    }
  }

[Figure: each row sliver of A is accessed bsize times, the block of B is reused n times in succession, and successive elements of the row sliver of C are updated.]

Blocked Matrix Mult. Perf: DEC5000
[Plot: double-precision MFLOPS (0-3) vs. matrix size n (50-200) for bijk, bikj, ikj, and ijk.]

Blocked Matrix Mult. Perf: Sparc20
[Plot: double-precision MFLOPS (0-20) vs. matrix size n (50-200) for bijk, bikj, ikj, and ijk.]

Blocked Matrix Mult. Perf: Alpha 21164
[Plot: double-precision MFLOPS (0-160) vs. matrix size n (50-500) for bijk, bikj, ijk, and ikj.]

Blocked Matrix Mult.: Xeon
[Plot: double-precision MFLOPS (0-90) vs. matrix size n (25-500) for ijk, ikj, bijk, and bikj.]
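The plots above include a bikj variant, but the lecture shows code only for bijk. The sketch below is one plausible way to write a blocked ikj-style multiply; its loop structure, the fixed N, and the BSIZE value are assumptions for illustration, not the code that was actually measured.

  /* Hedged sketch: a possible bikj-style blocked multiply (assumed structure,
   * not the lecture's measured code). */
  #include <stdio.h>

  #define N 500
  #define BSIZE 25        /* illustrative; an 8*BSIZE*BSIZE = 5 KB block of B stays hot */

  static double a[N][N], b[N][N], c[N][N];

  static void bikj(int n, int bsize)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              c[i][j] = 0.0;

      /* Block the k and j dimensions: a bsize x bsize block of B is reused
       * while every row sliver of A and C streams past it. */
      for (int kk = 0; kk < n; kk += bsize)
          for (int jj = 0; jj < n; jj += bsize)
              for (int i = 0; i < n; i++)
                  for (int k = kk; k < kk + bsize && k < n; k++) {
                      double r = a[i][k];
                      for (int j = jj; j < jj + bsize && j < n; j++)
                          c[i][j] += r * b[k][j];
                  }
  }

  int main(void)
  {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 1.0; }
      bikj(N, BSIZE);
      printf("c[0][0] = %g (expected %d)\n", c[0][0], N);   /* quick sanity check */
      return 0;
  }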