CDA 5155 Caches

New Topic: Memory Systems
24. Cache 101 – review of undergraduate material
25. Associativity and other organization issues
26. Advanced designs and interactions with pipelines
27. Tomorrow's cache design (power/performance)
28. Advances in memory design
29. Virtual memory (and how to do it fast)

Cache Design 101: the memory pyramid
• Registers: 100s of bytes, 1-cycle access (early in the pipeline)
• L1 cache: several KB, 1-3 cycle access
• L2 cache: ½-32 MB, 6-15 cycle access
• Memory: 128 MB - a few GB, 50-300 cycle access
• Disk: many GB, millions of cycles per access!

Terminology
• Temporal locality:
– If memory location X is accessed, it is more likely to be re-accessed in the near future than some random location Y.
– Caches exploit temporal locality by placing a memory element that has been referenced into the cache.
• Spatial locality:
– If memory location X is accessed, locations near X are more likely to be accessed in the near future than some random location Y.
– Caches exploit spatial locality by allocating an entire cache line of data (including data near the referenced location).
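Both kinds of locality can be made concrete with a small simulation. The sketch below is our own illustration (the function and parameter names are not from the slides): a fully associative LRU cache of whole blocks, fed a sequential byte scan versus a scan that jumps a full block at a time. The sequential scan misses only once per block, which is exactly the spatial-locality payoff.

```python
# Hypothetical sketch (names are ours, not from the slides): a tiny model of
# spatial locality. A cache line holds BLOCK bytes, so a sequential scan of
# an array misses only once per BLOCK accesses, while a stride-BLOCK scan
# misses on every access.
BLOCK = 32          # assumed line size in bytes (slides: typically 32-128)

def misses(addresses, num_lines=64):
    """Count misses for a fully associative LRU cache of num_lines blocks."""
    cache = []                      # block numbers in LRU order: front = oldest
    miss_count = 0
    for addr in addresses:
        block = addr // BLOCK       # spatial locality: the whole block is cached
        if block in cache:
            cache.remove(block)     # refresh LRU position (temporal locality)
        else:
            miss_count += 1
            if len(cache) == num_lines:
                cache.pop(0)        # evict the least recently used block
        cache.append(block)         # most recently used goes to the back
    return miss_count

sequential = list(range(4096))                    # byte-by-byte scan of 4 KB
strided = list(range(0, 4096 * BLOCK, BLOCK))     # one access per block
print(misses(sequential))   # 4096/32 = 128 misses: spatial locality pays off
print(misses(strided))      # 4096 misses: every access touches a new block
```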
A simple cache
• Fully associative: 2 cache lines, a 3-bit tag field, 2-byte blocks; each line has a valid (V) bit and LRU state.
• Memory: byte i holds the value 100 + 10*i (so M[0] = 100, M[1] = 110, ..., M[15] = 250).
• Program:
Ld R1 M[ 1 ]
Ld R2 M[ 5 ]
Ld R3 M[ 1 ]
Ld R3 M[ 4 ]
Ld R2 M[ 0 ]

Walking through the accesses:
1st access, Ld R1 M[1] (Addr: 0001): miss; allocate block {100, 110} with tag 000. R1 = 110. Misses: 1, Hits: 0.
2nd access, Ld R2 M[5] (Addr: 0101): miss; allocate block {140, 150} with tag 010 in the second line. R2 = 150. Misses: 2, Hits: 0.
3rd access, Ld R3 M[1] (Addr: 0001): hit in the first line. R3 = 110. Misses: 2, Hits: 1.
4th access, Ld R3 M[4] (Addr: 0100): hit, since address 4 is in the same block as address 5. R3 = 140. Misses: 2, Hits: 2.
5th access, Ld R2 M[0] (Addr: 0000): hit, since address 0 is in the same block as address 1. R2 = 100. Misses: 2, Hits: 3.

Basic Cache Organization
• Decide on the block size
– How? Simulate lots of different block sizes and see which one gives the best performance.
– Most systems use a block size between 32 bytes and 128 bytes.
– Longer blocks reduce overhead by:
• Reducing the number of tags
• Reducing the size of each tag
– The address divides into a tag and a block offset.

What About Stores?
• Where should you write the result of a store?
– If that memory location is in the cache?
• Send it to the cache.
• Should we also send it to memory right away? (write-through policy)
• Or wait until we kick the block out? (write-back policy)
– If it is not in the cache?
• Allocate the line (put it in the cache)? (write-allocate policy)
• Write it directly to memory without allocation?
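The trace above can be reproduced with a short simulation. This is a minimal sketch of our own (not code from the slides), assuming the configuration stated there: 2 lines, 2-byte blocks, fully associative, LRU replacement, with memory byte i holding 100 + 10*i.

```python
# A minimal sketch of the "simple cache" example: 2 cache lines, 2-byte
# blocks, fully associative with LRU replacement.
memory = [100 + 10 * i for i in range(16)]

lines = []          # each entry: [tag, [byte0, byte1]]; front = LRU
misses = hits = 0

def load(addr):
    """Return memory[addr] through the cache, updating LRU state."""
    global misses, hits
    tag = addr >> 1                     # 3-bit tag; low bit is the block offset
    for entry in lines:
        if entry[0] == tag:             # hit: refresh LRU position
            hits += 1
            lines.remove(entry)
            lines.append(entry)
            return entry[1][addr & 1]
    misses += 1                         # miss: allocate the whole 2-byte block
    if len(lines) == 2:
        lines.pop(0)                    # evict the least recently used line
    lines.append([tag, memory[tag * 2: tag * 2 + 2]])
    return lines[-1][1][addr & 1]

R1 = load(1)    # miss: allocates block {100, 110}
R2 = load(5)    # miss: allocates block {140, 150}
R3 = load(1)    # hit
R3 = load(4)    # hit (same block as address 5)
R2 = load(0)    # hit (same block as address 1)
print(misses, hits)   # prints "2 3", matching the slides' final counters
```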
(no write-allocate policy)

Handling Stores (write-through)
• Assume a write-allocate policy; same cache organization as before: 2 lines, 2-byte blocks, LRU replacement.
• Memory contents: M[0..15] = 78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225.
• Program:
Ld R1 M[ 1 ]
Ld R2 M[ 7 ]
St R2 M[ 0 ]
St R1 M[ 5 ]
Ld R2 M[ 10 ]

write-through, reference by reference:
REF 1, Ld R1 M[1]: miss; allocate block {78, 29} with tag 0. R1 = 29. Misses: 1, Hits: 0.
REF 2, Ld R2 M[7]: miss; allocate block {162, 173} with tag 3. R2 = 173. Misses: 2, Hits: 0.
REF 3, St R2 M[0]: hit; write 173 into the cached block and, write-through, into memory (M[0] = 173). Misses: 2, Hits: 1.
REF 4, St R1 M[5]: miss; write-allocate fetches block {71, 150} with tag 2 (evicting the LRU line, tag 3), then writes 29 into the cache and into memory (M[5] = 29). Misses: 3, Hits: 1.
REF 5, Ld R2 M[10]: miss; allocate block {33, 28} with tag 5, evicting the LRU line (tag 0). R2 = 33. Misses: 4, Hits: 1.

How many memory references?
• Each miss reads a block (2 bytes in this cache).
• Each store writes a byte.
• Total reads: 8 bytes. Total writes: 2 bytes.
• But caches generally miss less than 20% of the time.

Write-through vs. write-back
• We can also design the cache to NOT write all stores to memory immediately:
– We can keep the most current copy in the cache and update memory only when that data is evicted from the cache (a write-back policy).
• Do we need to write back all evicted lines?
– No, only blocks that have been stored into.
» Keep a "dirty bit": reset it when the line is allocated, and set it when the block is stored into. If a block is "dirty" when evicted, write its data back into memory.
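The write-through traffic totals can be checked with a sketch of our own (not the slides' code), assuming the stated configuration: write-through with write-allocate, 2 lines, 2-byte blocks, LRU. Counting bytes of memory traffic reproduces the totals above: 8 bytes read, 2 bytes written.

```python
# Write-through, write-allocate cache: every store updates memory at once.
memory = [78, 29, 120, 123, 71, 150, 162, 173,
          18, 21, 33, 28, 19, 200, 210, 225]

lines = []                  # [tag, [byte0, byte1]] entries; front = LRU
misses = hits = bytes_read = bytes_written = 0

def lookup(addr):
    """Find (or allocate) the line holding addr; return the line."""
    global misses, hits, bytes_read
    tag = addr >> 1
    for entry in lines:
        if entry[0] == tag:
            hits += 1
            lines.remove(entry); lines.append(entry)
            return entry
    misses += 1
    bytes_read += 2                     # a miss reads the whole 2-byte block
    if len(lines) == 2:
        lines.pop(0)                    # eviction is free: memory is current
    lines.append([tag, memory[tag * 2: tag * 2 + 2]])
    return lines[-1]

def ld(addr):
    return lookup(addr)[1][addr & 1]

def st(addr, value):
    global bytes_written
    lookup(addr)[1][addr & 1] = value   # write-allocate: fetch on store miss
    memory[addr] = value                # write-through: memory updated at once
    bytes_written += 1

R1 = ld(1); R2 = ld(7)      # REF 1-2: two misses
st(0, R2); st(5, R1)        # REF 3-4: one store hit, one store miss
R2 = ld(10)                 # REF 5: miss, evicts the LRU line
print(misses, hits, bytes_read, bytes_written)   # prints "4 1 8 2"
```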
Handling Stores (write-back)
• Same cache and program as the write-through example, but each line now also carries a dirty (d) bit.

write-back, reference by reference:
REF 1, Ld R1 M[1]: miss; allocate block {78, 29} with tag 0, clean. R1 = 29. Misses: 1, Hits: 0.
REF 2, Ld R2 M[7]: miss; allocate block {162, 173} with tag 3, clean. R2 = 173. Misses: 2, Hits: 0.
REF 3, St R2 M[0]: hit; write 173 into the cached block and set its dirty bit. Memory is NOT updated. Misses: 2, Hits: 1.
REF 4, St R1 M[5]: miss; evict the LRU line (tag 3, clean, so no write-back), fetch block {71, 150} with tag 2, write 29 into it, and set its dirty bit. Misses: 3, Hits: 1.
REF 5, Ld R2 M[10]: miss; evict the LRU line (tag 0, dirty), writing its block back to memory (M[0] = 173, M[1] = 29), then allocate block {33, 28} with tag 5. R2 = 33. Misses: 4, Hits: 1.

How many memory references?
• Each miss reads a block (2 bytes in this cache).
• Each evicted dirty cache line writes back a block.
• Total reads: 8 bytes. Total writes: 4 bytes (after the final eviction of the remaining dirty line).
Choose write-back or write-through?
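The same trace under write-back can be sketched as follows (our own code, not the slides'): stores only mark the line dirty, and memory is written when a dirty line is evicted. Flushing the cache at the end accounts for the "after final eviction" total of 4 bytes written.

```python
# Write-back cache with a dirty bit per line; 2 lines, 2-byte blocks, LRU.
memory = [78, 29, 120, 123, 71, 150, 162, 173,
          18, 21, 33, 28, 19, 200, 210, 225]

lines = []                  # [tag, dirty, [byte0, byte1]] entries; front = LRU
misses = hits = bytes_read = bytes_written = 0

def evict(entry):
    """Write a dirty line back to memory; clean evictions cost nothing."""
    global bytes_written
    if entry[1]:
        memory[entry[0] * 2: entry[0] * 2 + 2] = entry[2]
        bytes_written += 2              # a dirty eviction writes the whole block

def lookup(addr):
    global misses, hits, bytes_read
    tag = addr >> 1
    for entry in lines:
        if entry[0] == tag:
            hits += 1
            lines.remove(entry); lines.append(entry)
            return entry
    misses += 1
    bytes_read += 2                     # a miss reads the whole 2-byte block
    if len(lines) == 2:
        evict(lines.pop(0))
    lines.append([tag, False, memory[tag * 2: tag * 2 + 2]])
    return lines[-1]

def ld(addr):
    return lookup(addr)[2][addr & 1]

def st(addr, value):
    entry = lookup(addr)
    entry[2][addr & 1] = value
    entry[1] = True                     # mark dirty; no memory write yet

R1 = ld(1); R2 = ld(7)      # REF 1-2: misses for blocks {78,29} and {162,173}
st(0, R2); st(5, R1)        # REF 3-4: hit then miss; both lines end up dirty
R2 = ld(10)                 # REF 5: evicting the dirty tag-0 line writes it back
for entry in list(lines):   # final eviction: flush any remaining dirty lines
    evict(entry)
print(misses, hits, bytes_read, bytes_written)   # prints "4 1 8 4"
```

The miss and hit counts match write-through exactly; only the write traffic differs, which is the trade-off the closing question asks about.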