CS 152 Computer Architecture & Engineering
Section 7, Spring 2010
Andrew Waterman
University of California, Berkeley

Mystery Die
• RISC II: 41K transistors, 4 micron NMOS @ 12 MHz
• 2.2x faster than VAX-11/780 (1500 TTL chips @ 5 MHz)

Agenda
• Quiz 2 Post-Mortem
  – Mean: 53.1
  – Standard Deviation: 9.0

Quiz 2, Q1
• N=1024. Store/Load miss rate for 4KB 2-way cache w/LRU replacement?
• LRU => no conflicts between loads/stores
• Loads are unit-stride with no reuse
  – All load misses are compulsory => load miss rate is 1/8 (one miss per 8-word line)
• All stores miss because of capacity misses => store miss rate is 1

    for(i = 0; i < N; i++)
      for(j = 0; j < N; j++)
        B[j*N+i] = A[i*N+j];

Quiz 2, Q1
• What about FIFO replacement?
• Stores and loads could now conflict. When?
  – Stores always use set (i/8) % 64
  – Loads always use set (j/8) % 64
  – Conflicts occur when these are equal

Quiz 2, Q1
• Is Write-Allocate a good idea for this code?
• No: on every store miss, 32 bytes of data are read into the cache and then discarded

Quiz 2, Q1
• Is Write-Back a good idea for this code?
• With Write-Allocate, bad: 32 bytes are written back for each 4-byte store (64 bytes of traffic in total, counting the 32-byte line fill)
• Otherwise, OK, except that the Write-Through alternative had a write buffer, which dramatically reduces the miss penalty

Quiz 2, Q1
• If the cache were fully associative, how could we improve the code's performance?
• Block the transpose
• FA makes this easier; lots of solutions
• Here's one; let BLK = 8 (words in cache line)

    for(i = 0; i < N; i+=BLK)
      for(j = 0; j < N; j++)
        for(k = 0; k < BLK; k++)
          B[j*N+(i+k)] = A[(i+k)*N+j];

Quiz 2, Q1
• What about TLB misses?
• 4KB pages, 1024-entry DM TLB
• Compulsory misses first
  – 2 matrices * (1024^2 words)/(1024 words/page) = 2048
• Now consider some iteration 0 ≤ i < N-1 (A_i is the page holding row i of A; likewise B_k)
• After iteration i, TLB[i] = A_i, and TLB[k] = B_k, k≠i
• During iteration i+1, store to B_i will miss (TLB[i] still holds A_i)
• Then store to B_{i+1} will miss, kicking out A_{i+1}
• Next load to A_{i+1} will miss
• 3 conflicts/iteration * 1024 iterations = 3072 conflict misses
• 3072 + 2048 = 5120 misses total

Quiz 2, Q2
• Basic idea of microtags: SA caches put tag check on critical path (data-out)
• Reduce critical path by using subset of tag to select way
• In this cache, microtag check -> data out remains critical path, but 1/6 faster

Quiz 2, Q2
• AMAT = hit time + miss rate * miss penalty
• Hit time is not multiplied by hit rate
  – You have to pay the hit time even on a miss

Quiz 2, Q2
• Microtag uniqueness affects conflict misses
  – Conflict misses increase compared to 4-way SA
  – But still much better than DM
• Otherwise, why would we build a microtagged cache?
  Just use DM

Quiz 2, Q2
• Aliasing question was unintentionally tricky: microtags are a red herring
• The aliasing problem is just the same as for any virtually-indexed physically-tagged cache with index+offset ≤ page offset
• Aliases always map to the same set, which would be fine for DM, but with SA they can live in different ways
• Simple fix: on a miss, you already have the physical tag and all physical tags in the set
  – Iff there’s a match, there’s an alias

Quiz 2, Q3
• 2x associativity, capacity & line size constant
  – Increases hit time due to data-out muxing
  – Reduces conflict misses
• Halving line size (associativity & #sets constant)
  – Reduces hit time (capacity down)
  – Increases miss rate (same reason)
  – Reduces miss penalty (shorter lines, less to fetch)

Quiz 2, Q3
• Physical -> virtual cache
  – Hit time reduced (only real reason to do this is to remove TLB from hit path)
  – Effect on miss rate ambiguous
    • More misses for aliases
    • More misses for context switches w/o ASIDs
    • Fewer misses due to address space contiguity
  – Increased miss penalty because TLB lookup is moved to miss path, and for anti-aliasing

Quiz 2, Q3
• Write buffer
  – Reduces both store miss penalty and hit time
• HW prefetching
  – HW isn’t on hit path, so no effect on hit time
  – Reduces miss rate (main reason)
    • Prefetch buffer hit considered “slow hit”, not miss
  – Reduces miss penalty (prefetches can be in-flight when miss occurs)
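
The Quiz 2, Q1 miss-rate argument (loads miss 1/8 of the time, every store misses) can be checked with a small simulation. What follows is a minimal sketch in C, not the quiz's reference solution. It assumes the 4KB 2-way cache has 32-byte lines and 4-byte words, as the Q1 slides imply, that stores allocate on a miss, and that A starts at byte address 0 with B placed immediately after it. The names cache_t, cache_access, and simulate_transpose are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed geometry: 4KB, 2-way, 32-byte lines => 64 sets. */
enum { N = 1024, WORD_BYTES = 4, LINE_BYTES = 32, NUM_SETS = 64, NUM_WAYS = 2 };

typedef enum { POLICY_LRU, POLICY_FIFO } policy_t;

typedef struct {
    uint64_t line[NUM_SETS][NUM_WAYS];   /* cached line address (serves as the tag) */
    uint64_t stamp[NUM_SETS][NUM_WAYS];  /* last-use time (LRU) or fill time (FIFO) */
    int      valid[NUM_SETS][NUM_WAYS];
    uint64_t clock;                      /* advances once per access */
    policy_t policy;
} cache_t;

typedef struct {
    uint64_t load_misses;
    uint64_t store_misses;
} miss_counts_t;

/* Access one byte address. Returns 1 on a miss, 0 on a hit. */
static int cache_access(cache_t *c, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    int set = (int)(line % NUM_SETS);
    int victim = 0;

    c->clock++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (c->valid[set][w] && c->line[set][w] == line) {
            if (c->policy == POLICY_LRU)
                c->stamp[set][w] = c->clock;  /* FIFO does not reorder on a hit */
            return 0;
        }
    }

    /* Miss: take the first invalid way, otherwise the way with the oldest stamp. */
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!c->valid[set][w]) { victim = w; break; }
        if (c->stamp[set][w] < c->stamp[set][victim]) victim = w;
    }
    c->valid[set][victim] = 1;
    c->line[set][victim]  = line;
    c->stamp[set][victim] = c->clock;
    return 1;
}

/* Run the Q1 transpose loop nest through the cache model and count misses. */
static miss_counts_t simulate_transpose(policy_t policy)
{
    cache_t c;
    miss_counts_t r = { 0, 0 };
    const uint64_t a_base = 0;                                /* assumption */
    const uint64_t b_base = (uint64_t)N * N * WORD_BYTES;     /* assumption: B follows A */

    assert(NUM_SETS * NUM_WAYS * LINE_BYTES == 4096);
    memset(&c, 0, sizeof c);
    c.policy = policy;

    for (uint64_t i = 0; i < N; i++) {
        for (uint64_t j = 0; j < N; j++) {
            /* B[j*N+i] = A[i*N+j]: the load happens before the store. */
            r.load_misses  += (uint64_t)cache_access(&c, a_base + WORD_BYTES * (i * N + j));
            r.store_misses += (uint64_t)cache_access(&c, b_base + WORD_BYTES * (j * N + i));
        }
    }
    return r;
}
```

Calling simulate_transpose(POLICY_LRU) and dividing each count by N*N gives the load and store miss rates under this model. Passing POLICY_FIFO instead lets you observe the extra load misses from the load/store set conflicts discussed on the FIFO slide.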