CMP 301A Computer Architecture 1, Lecture 2

Outline
• Direct-mapped caches: read and write policies
• Measuring cache performance
• Improving cache performance
• Enhancing main memory performance
• Flexible placement of blocks: associativity
• Multilevel caches

Read and Write Policies
A cache read is much easier to handle than a cache write, which is why an instruction cache is much easier to design than a data cache. The cache-write question: how do we keep the data in the cache and in memory consistent? Two write options:
• Write through: write to the cache and to memory at the same time. Isn't memory too slow for this?
• Write back: write to the cache only; write the cache block to memory when that block is replaced on a cache miss.
  • Needs a "dirty" bit for each cache block
  • Greatly reduces the memory bandwidth requirement
  • Control can be complex
(A sketch contrasting the two policies appears at the end of this section.)

Write Buffer for Write Through
[Figure: Processor → Cache → DRAM, with a Write Buffer between the cache and memory.]
A write buffer is needed between the cache and memory:
• The processor writes data into the cache and into the write buffer.
• The memory controller writes the contents of the buffer to memory.
The write buffer is just a FIFO, typically 4 entries deep. It works fine as long as the store frequency (with respect to time) is much less than 1 / DRAM write cycle. The memory system designer's nightmare is a store frequency that approaches 1 / DRAM write cycle: the write buffer saturates. A further problem: the write buffer may hold the updated value of a location needed by a read miss!

Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the rest of the block (bytes 2, 3, ..., 31)?
• Yes: write allocate
• No: write not allocate
[Figure: a 1 KB direct-mapped cache with 32-byte blocks. The address splits into a Cache Tag (bits 31–10), a Cache Index (bits 9–5, e.g. 0x00) that selects one of 32 entries, and a Byte Select (bits 4–0, e.g. 0x00); each entry holds a valid bit, a tag, and bytes 0–31 of its block.]

Measuring Cache Performance
Impact of cache misses on performance. Suppose a processor executes with:
• Clock rate = 1 GHz (1 ns per cycle)
• Ideal (no misses) CPI = 1.1
• Instruction mix: 50% arithmetic/logic, 30% load/store, 20% control
Suppose that 10% of data-memory operations incur a 100-cycle miss penalty, and 1% of instruction fetches incur the same penalty. Then:

CPI_average = CPI_ideal + average memory stalls per instruction
            = 1.1 + (0.30 data ops/instr × 0.10 misses/data op × 100 cycles/miss)
                  + (1.0 inst fetches/instr × 0.01 misses/fetch × 100 cycles/miss)
            = 1.1 + 3.0 + 1.0 = 5.1 cycles/instr

So the processor is stalled waiting for memory (5.1 − 1.1)/5.1 ≈ 78% of the time!
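The slide's CPI arithmetic can be checked in a few lines of C. All the numeric values below come from the example above; only the variable names are mine:

    #include <stdio.h>

    int main(void) {
        double cpi_ideal = 1.1;
        double data_ops  = 0.30;   /* data-memory ops per instruction (ld/st) */
        double inst_ops  = 1.0;    /* one instruction fetch per instruction   */
        double data_mr   = 0.10;   /* data miss rate          */
        double inst_mr   = 0.01;   /* instruction miss rate   */
        double penalty   = 100.0;  /* miss penalty, in cycles */

        double stalls = data_ops * data_mr * penalty    /* 3.0 cycles/instr */
                      + inst_ops * inst_mr * penalty;   /* 1.0 cycles/instr */
        double cpi = cpi_ideal + stalls;                /* 5.1 cycles/instr */

        printf("CPI = %.1f, stalled %.0f%% of the time\n",
               cpi, 100.0 * stalls / cpi);              /* 5.1, 78% */
        return 0;
    }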
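Returning to the write policies from earlier in this section: below is a minimal C sketch of the two policies on a store. This is not the lecture's design; the line_t structure, the 4-entry write buffer, and all function names are hypothetical, and DRAM is a stub. Write through pushes every store through the FIFO write buffer; write back only sets the dirty bit and defers the memory write until eviction.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        bool     valid;
        bool     dirty;              /* used only by write-back */
        uint32_t tag;
        uint8_t  data[32];           /* 32-byte block, as in the slides */
    } line_t;

    /* Hypothetical 4-entry FIFO write buffer (the slide's typical depth). */
    static struct { uint32_t addr; uint16_t val; } wbuf[4];
    static int wbuf_n = 0;

    static void wbuf_push(uint32_t addr, uint16_t val) {
        if (wbuf_n == 4) return;     /* saturated: a real CPU would stall here */
        wbuf[wbuf_n].addr = addr;
        wbuf[wbuf_n].val  = val;
        wbuf_n++;
    }

    static void dram_write_block(uint32_t addr, const uint8_t *block) {
        (void)addr; (void)block;     /* stand-in for a slow DRAM write */
    }

    /* Write through: update the cache AND enqueue the store for memory.
       (On a write miss, write-allocate would first fetch the rest of the
       block into the line; write-not-allocate would skip the cache.) */
    static void store_write_through(line_t *l, uint32_t addr, uint16_t v) {
        memcpy(&l->data[addr & 31], &v, sizeof v);  /* assumes aligned store */
        wbuf_push(addr, v);          /* memory sees every store, via the FIFO */
    }

    /* Write back: update the cache only and mark the block dirty. */
    static void store_write_back(line_t *l, uint32_t addr, uint16_t v) {
        memcpy(&l->data[addr & 31], &v, sizeof v);
        l->dirty = true;
    }

    /* On replacement, write-back must flush a dirty block to memory first. */
    static void evict(line_t *l, uint32_t block_addr) {
        if (l->valid && l->dirty)
            dram_write_block(block_addr, l->data);
        l->valid = false;
    }

    int main(void) {
        line_t line = { .valid = true };
        store_write_through(&line, 0x02, 0xBEEF);   /* cache + write buffer */
        store_write_back(&line, 0x04, 0xCAFE);      /* cache only, dirty = 1 */
        evict(&line, 0x0);                          /* dirty block -> DRAM */
        printf("write-buffer entries pending: %d\n", wbuf_n);  /* prints 1 */
        return 0;
    }

Note how write back touches memory once per eviction while write through touches it on every store, which is the bandwidth saving the slide mentions.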
Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
For example, a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give AMAT = 1 + 0.05 × 100 = 6 cycles. To improve performance:
• reduce the hit time
• reduce the miss rate
• reduce the miss penalty

Enhancing Main Memory Performance
• Increasing memory and bus width: transfer more words every clock cycle. Isn't that too much wiring?
• Using an interleaved memory organization: reduces access time with less wiring.
• Double Data Rate (DDR) DRAMs.

Flexible Placement of Blocks: Associativity
[Figure: a memory of 32 blocks (numbered 0–31) feeding an 8-block cache drawn three ways: fully associative, 2-way set associative (4 sets, numbered 0–3), and direct mapped.] Memory block 12 can be placed:
• Fully associative: anywhere in the cache
• 2-way set associative: anywhere in set 0 (12 mod 4)
• Direct mapped: only into cache block 4 (12 mod 8)
(These mappings are computed in a short sketch at the end of this section.)

A Two-Way Set-Associative Cache
N-way set associative means N entries for each cache index: N direct-mapped caches operating in parallel. Example, a two-way set-associative cache:
• The cache index selects a "set" from the cache.
• The two tags in the set are compared in parallel.
• Data is selected based on the tag comparison result.
[Figure: two-way set-associative cache. The cache index selects one set; each way's valid bit and tag are compared against the address tag in parallel, the compare outputs (Sel0, Sel1) drive a mux that picks the hitting way's cache block, and their OR produces the Hit signal.]

And Yet Another Extreme Example: Fully Associative
A fully associative cache pushes the set-associative idea to its limit:
• Forget about the cache index: compare the cache tags of all cache entries in parallel.
• Example: with 32-byte blocks, we need N 27-bit comparators.
• By definition, conflict misses = 0 for a fully associative cache.
[Figure: fully associative cache with 32-byte blocks. The address splits into a 27-bit Cache Tag (bits 31–5) and a Byte Select (bits 4–0, e.g. 0x01); every valid entry's tag is compared at once.]

Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
• Random
• Least Recently Used (LRU)
  • LRU cache state must be updated on every access
  • A true implementation is only feasible for small sets (2-way)
• First-In, First-Out (FIFO), a.k.a. round-robin
  • Used in highly associative caches
• Not Most Recently Used (NMRU)
  • FIFO with an exception for the most recently used block or blocks
Replacement only happens on misses. (A two-way lookup with LRU replacement is sketched below.)
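To make the block-12 placement example concrete, here is a minimal C sketch that computes where a memory block may live under each organization. The cache size (8 blocks) and block number (12) come from the figure; everything else is illustrative:

    #include <stdio.h>

    int main(void) {
        int block = 12;                 /* memory block number from the figure */
        int cache_blocks = 8;           /* 8-block cache */
        int ways = 2;                   /* for the set-associative case */
        int sets = cache_blocks / ways; /* 8 / 2 = 4 sets */

        /* Direct mapped: exactly one legal slot, block mod #cache blocks. */
        printf("direct mapped  : block %d\n", block % cache_blocks);   /* 4 */

        /* 2-way set associative: either way of set (block mod #sets). */
        printf("2-way set assoc: set %d, either way\n", block % sets); /* 0 */

        /* Fully associative: any block in the cache. */
        printf("fully assoc    : any of %d blocks\n", cache_blocks);
        return 0;
    }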
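And to tie the two-way lookup to LRU replacement, a minimal C sketch. The geometry (32 sets of 2 ways with 32-byte blocks, i.e. an assumed 2 KB cache) and all names are hypothetical; real hardware compares both tags in parallel and moves data as well, while this sketch tracks tags only. Note that with two ways a single LRU bit per set suffices, which is why a true LRU implementation is feasible for small sets:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SETS 32   /* 32 sets x 2 ways x 32-byte blocks = 2 KB (assumed) */

    typedef struct { bool valid; uint32_t tag; } way_t;

    typedef struct {
        way_t way[2];
        int   lru;    /* index of the least-recently-used way: one bit */
    } set_t;

    static set_t cache[SETS];

    /* Returns true on a hit. On a miss, installs the tag in the LRU way. */
    static bool access_cache(uint32_t addr) {
        uint32_t index = (addr >> 5) % SETS; /* 5 byte-select bits, 5 index bits */
        uint32_t tag   = addr >> 10;         /* remaining high bits */
        set_t *s = &cache[index];

        for (int w = 0; w < 2; w++) {        /* hardware does this in parallel */
            if (s->way[w].valid && s->way[w].tag == tag) {
                s->lru = 1 - w;              /* the other way becomes LRU */
                return true;                 /* hit */
            }
        }
        /* Miss: replacement only happens here, and the victim is the LRU way. */
        int victim = s->lru;
        s->way[victim].valid = true;
        s->way[victim].tag   = tag;
        s->lru = 1 - victim;                 /* the filled way is now MRU */
        return false;
    }

    int main(void) {
        printf("%d %d %d\n", access_cache(0x1000), access_cache(0x1000),
               access_cache(0x2000));
        /* prints 0 1 0: cold miss, then a hit, then a miss on a new tag */
        return 0;
    }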