CS 152 Computer Architecture and Engineering
Lecture 11 -- Cache II
2014-2-25
John Lazzaro (not a prof -- "John" is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/

Today: Caches, Part Two ...

Locality: Why caches work.
Cache misses and performance: How do we size the cache?
Short Break.
Practical cache design: A state machine and a controller.
Write buffers and caches. Victim caches and pre-fetch buffers.
Also: IBM mainframe cache.

Why We Use Caches

Programs with locality cache well ...

[Figure: memory address (one dot per access) plotted against time, with a region of bad temporal locality and a region of good spatial locality marked. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)]

Q. Point out the bad locality behavior ...

The caching algorithm in one slide

Temporal locality: Keep the most recently accessed data closer to the processor.
Spatial locality: Move contiguous blocks in the address space to upper levels.

Caching terminology

Hit: Data appears in an upper-level block (ex: Blk X).
Hit Rate: The fraction of memory accesses found in the upper level.
Miss: Data must be retrieved from the lower level (ex: Blk Y).
Miss Rate: 1 - Hit Rate.
Hit Time: Time to access the upper level, including the hit/miss check.
Miss Penalty: Time to replace a block in the upper level and deliver it to the CPU.
Note that Hit Time << Miss Penalty.

Cache Design Example

CPU address space: An array of "blocks"

With 32-byte blocks, a 32-bit memory address splits into a block number ("Which block?", 27 bits, address bits 31:5) and a byte number within the block (5 bits, address bits 4:0). Blocks are numbered 0 through 2^27 - 1. The job of a cache is to hold a "popular" subset of blocks.

One Approach: Fully Associative Cache

Ideal, but expensive ... The address splits into a Cache Tag (27 bits, the full block number) and a Byte Select (5 bits, ex: 0x04). The cache data array holds 4 blocks, each storing Byte 31 ... Byte 1, Byte 0 alongside its block # ("tag") and a valid bit. Every stored tag is compared against the address tag in parallel; a match with the valid bit set asserts Hit, and the byte(s) of the "hit" cache line are returned.

Building a cache with one comparator

Blocks of a certain color may only appear in one line of the cache. The 32-bit address now splits three ways: Which block? (25 bits, address bits 31:7), Color (2 bits, address bits 6:5 -- the cache index), and Byte # (5 bits, address bits 4:0).

Example: A Direct Mapped Cache

The address splits into a Cache Tag (25 bits), an Index (2 bits, ex: 0x01), and a Byte Select (5 bits, ex: 0x00). The Index selects one line of the cache tag and data arrays; the stored tag is compared against the address tag, and a match with the valid bit set asserts Hit and returns the byte(s) of the "hit" cache line. PowerPC 970: 64K direct-mapped Level-1 I-cache. (A sketch of this lookup appears below.)

The limits of direct-mapped caches ...

What if both regions have the same block color?

[Figure: memory address (one dot per access) vs. time. Hatfield & Gerald (1971), as above.]
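To make the field arithmetic concrete, here is a minimal C sketch of the example's 4-line direct-mapped cache (25-bit tag, 2-bit index, 5-bit byte select). The structure and names are illustrative, not from the lecture.

    /* 4-line direct-mapped cache with 32-byte blocks. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_LINES 4                          /* 2-bit index */

    typedef struct {
        bool     valid;
        uint32_t tag;                            /* address bits 31:7 */
        uint8_t  data[32];                       /* one 32-byte block */
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Returns true on a hit and delivers the byte; on a miss, real
       hardware would start a block fetch from the lower level. */
    bool cache_read_byte(uint32_t addr, uint8_t *out)
    {
        uint32_t byte_sel = addr & 0x1F;         /* bits 4:0  */
        uint32_t index    = (addr >> 5) & 0x3;   /* bits 6:5  */
        uint32_t tag      = addr >> 7;           /* bits 31:7 */

        cache_line_t *line = &cache[index];
        if (line->valid && line->tag == tag) {   /* the one comparator */
            *out = line->data[byte_sel];
            return true;
        }
        return false;                            /* miss */
    }

A fully associative version would drop the index and compare the 27-bit tag of every line in parallel; the direct-mapped design trades that comparator array for the 2-bit index.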
Hybrid Design: Set Associative Cache

"N-way" set associative -- N is the number of blocks held for each color. The address splits into a Cache Tag (26 bits), an Index (2 bits, ex: 0x01), and a Byte Select (4 bits). The Index selects one set: two valid/tag arrays and two data arrays are read in parallel, two comparators check the stored tags against the address tag, and Hit Left or Hit Right returns the bytes of the "hit" set member. The cache block is halved to 16 bytes to keep the number of cached bits constant. PowerPC 970: 32K 2-way.

The benefits of set-associativity ...

What if both regions have the same block color?

[Figure: memory address (one dot per access) vs. time. Hatfield & Gerald (1971), as above.]

Q. What costs (over direct-mapped) do we pay for this benefit?

Recall: Branch predictor (direct-mapped)

The address of the BNEZ instruction (ex: 0b011[..]010[..]100) splits into an 18-bit address tag (0b011[...]01) and 12 bits of index into the 4096-entry Branch Target Buffer (BTB), which holds the 18-bit address tag and the target address (PC + 4 + Loop), and the Branch History Table (BHT), which holds "Taken" or "Not Taken". With 4096 colors, odds are low that 2 active branches have the same color. If branches "clash", they take turns kicking each other out.

Key ideas about caches ...

Program locality is why building a memory hierarchy makes sense.
Latency toolkit: hierarchy design, bit-wise parallelism, pipelining.
In practice: how many rows, how many columns, how many arrays.
Cache operation: compare tags, detect hits, select bytes.

Cache Misses & Performance

Recall: Caching terminology -- hit, hit rate, miss, miss rate = 1 - hit rate, hit time (includes the hit/miss check), miss penalty; Hit Time << Miss Penalty.

Recall: The Performance Equation

What factors make different programs have different CPIs? Cache behavior varies. Instruction mix varies. Branch prediction varies.

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

We need all three terms, and only these terms, to compute CPU Time! "CPI" -- the average number of clock cycles per instruction for the program.

Recall: CPI as a tool to guide design

Machine CPI (throughput, not latency) weights the cycle count of each instruction class by that class's share of the program's instruction mix:

CPI = (5 x 30 + 1 x 20 + 2 x 20 + 2 x 10 + 2 x 20) / 100 = 2.7 cycles/instruction

This shows where the program spends its time.

AMAT: Average Memory Access Time

The last slide assumed a constant memory access time. True CPI depends on the Average Memory Access Time (AMAT) for instructions and data:

AMAT = Hit Time + (Miss Rate x Miss Penalty)

Goal: Reduce AMAT. True CPI = Ideal CPI + Memory Stall Cycles. See Appendix B.2 of CA-AQA for details. Beware! Improving one term may hurt other terms, and increase AMAT!
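The two formulas compose mechanically; the C sketch below evaluates them. All numbers in the example are hypothetical, chosen only to exercise the arithmetic, not taken from the lecture.

    /* AMAT and CPI-with-memory-stalls, per the formulas above. */
    #include <stdio.h>

    /* AMAT = Hit Time + (Miss Rate x Miss Penalty) */
    double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    /* True CPI = Ideal CPI + Memory Stall Cycles per instruction */
    double true_cpi(double ideal_cpi, double accesses_per_instr,
                    double miss_rate, double miss_penalty)
    {
        return ideal_cpi + accesses_per_instr * miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Hypothetical: 1-cycle hit, 5% miss rate, 50-cycle penalty */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0));   /* 3.50 */
        /* Ideal CPI 2.7, 1.3 memory accesses per instruction    */
        printf("CPI  = %.2f\n", true_cpi(2.7, 1.3, 0.05, 50.0)); /* 5.95 */
        return 0;
    }

Note how the 50-cycle miss penalty more than doubles the 2.7 ideal CPI: this is the "beware" on the slide, and why reducing AMAT is the design goal.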
One type of cache miss: Conflict Miss

N blocks of the same color are in use at once, but the cache can only hold M < N of them.

Solution: Increase M (associativity). The miss rate improvement is roughly equivalent to doubling the cache size.

[Figure: miss rate vs. cache size (KB), for direct-mapped up through fully-associative organizations.]

Other solutions:
Increase the number of cache lines (# blocks in the cache). Q. Why does this help? A. It reduces the odds of a conflict.
Add a small "victim cache" that holds blocks recently removed from the cache. More on victim caches soon ...

If hit time increases, AMAT may go up! AMAT = Hit Time + (Miss Rate x Miss Penalty).

Other causes of cache misses ...

Capacity Misses: The cache cannot contain all blocks accessed by the program. Solution: Increase the size of the cache.
Compulsory Misses: The first access of a block by a program. Mostly unavoidable. Solution: Prefetch blocks (via hardware or software).

[Figure: miss rates (absolute and relative) vs. cache size (KB), broken down by miss type.]

Also "Coherency Misses": other processes update memory.

Thinking about cache miss types ...

What kind of misses happen in a fully associative cache of infinite size? A. Compulsory misses. Each block must be brought into the cache once.
In addition, what kind of misses happen in a finite-sized fully associative cache? A. Capacity misses. The program may use more blocks than can fit in the cache.
In addition, what kind of misses happen in a set-associative or direct-mapped cache? A. Conflict misses.
(All questions assume the replacement policy used is considered "optimal".)

Separate instruction and data caches?

Compare 2^k bytes of separate I & D caches to a 2^(k+1)-byte unified cache; arrows mark the crossover in misses per 1000 instructions.

[Figure B.6 from CA-AQA: misses per 1000 instructions for a 2-way set-associative cache with 64-byte blocks, DEC Alpha data.]

Note the extraordinary effectiveness of large instruction caches ...

Break

Practical Cache Design

(Time machine back to the FPGA-oriented 2006 CS 152 ...)

Cache Design: Datapath + Control

Datapath for performance, control for correctness. Most design errors come from incorrect specification of state machine behavior! A state machine drives control signals to the Tags and Blocks memories (Addr, Din, Dout for each), to the CPU interface, and to the lower-level memory. Red text will highlight state machine requirements.

Recall: State Machine Design ...

A traffic-light controller: Rst == 1 forces state RYG = 100 (red); each Change == 1 steps the light, 100 -> 001 -> 010 -> 100 (red, green, yellow). Cache controller state machines are like this, but with more states, and perhaps several connected machines ...

Issue #1: Control for CPU interface ...

For reads, your state machine must: (1) sense REQ, (2) latch Addr, (3) create Wait, (4) put Data Out on the bus. This is one example interface between the small, fast cache and the large, slow lower level ... there are other possibilities.

Issue #2: Cache Block Replacement

After a cache read miss, if there are no empty cache blocks, which block should be removed from the cache? The Least Recently Used (LRU) block? Appealing, but hard to implement. A randomly chosen block? Easy to implement, but how well does it work?

Miss rate for a 2-way set-associative cache:

    Cache Size   Random   LRU
    16 KB        5.7%     5.2%
    64 KB        2.0%     1.9%
    256 KB       1.17%    1.15%

Part of your state machine decides which block to replace. Also, try other LRU approximations. (For 2-way caches, exact LRU is cheap, as the sketch below shows.)
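The "hard to implement" caveat softens for 2-way set-associative caches, where exact LRU needs only one bit per set. A hypothetical C sketch; the structure and names are illustrative, not from the lecture.

    /* One LRU bit per set suffices for 2-way set-associative. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 64

    typedef struct {
        bool     valid[2];
        uint32_t tag[2];
        uint8_t  lru;      /* index of the least-recently-used way */
    } cache_set_t;

    static cache_set_t sets[NUM_SETS];

    /* Call on every access that hits in `way` of `set`. */
    void touch(cache_set_t *set, int way)
    {
        set->lru = 1 - way;            /* the other way is now LRU */
    }

    /* Call on a miss: pick the victim way to replace. */
    int choose_victim(cache_set_t *set)
    {
        if (!set->valid[0]) return 0;  /* prefer an empty way */
        if (!set->valid[1]) return 1;
        return set->lru;               /* otherwise evict the LRU way */
    }

At higher associativity the bookkeeping grows quickly, which is why real designs use approximations such as tree pseudo-LRU, or random replacement as in the table above.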
Issue #3: High performance block fetch

Inside the DRAM: a 12-bit row address feeds a decoder that selects one of 4096 rows; each row spans 2048 columns, each column 4 bits deep, for 33,554,432 usable bits (the tester found the good bits in a bigger array). With proper memory layout, one row access delivers an entire cache block to the sense amps -- 8192 bits -- from which the requested bits are selected and sent off the chip.

Two state machine challenges: (1) Bring in the word requested by the CPU with the lowest latency. (2) Bring in the rest of the cache block ASAP.

Issue #3 (continued): DRAM Burst Reads

One request ... many returns. The DRAM can be set up so that a request for an N-byte region starts its burst at an arbitrary offset k within the region (critical word first). State machine challenges: (1) setting up the correct block-read mode, (2) delivering the correct word directly to the CPU, (3) putting all the words in the cache in the right places.

Writes and Caches

Issue #4: When to write to lower level ...

Write-Through Policy: Data written to the cache block is also written to lower-level memory.
Write-Back Policy: Write data only to the cache; update the lower level when a block falls out of the cache.

                                            Write-Through   Write-Back
    Do read misses produce writes?          No              Yes
    Do repeated writes make it
    to the lower level?                     Yes             No

Related issue: Do writes to blocks not in the cache get put in the cache ("write-allocate") or not? State machine design: (1) Write-back puts most of the write logic in the cache-miss machine. (2) Write-through isolates writing in its own state machine.

Issue #5: Write-back DRAM Burst Writes

One command ... many bytes written. State machine challenges: (1) putting the cache block into the correct DRAM location, (2) what if a read or write wants to use the DRAM before the burst is complete? Must stall ...

If we choose write-through ...

Write-Through Policy: Data written to the cache block is also written to lower-level memory. Read misses produce no writes, and repeated writes all make it to the lower level. State machine design issue: handling writes without stalling the machine until the written word is safely in the lower level (DRAM).

Issue #6: Avoid write-through write stalls

Solution: add a "write buffer" to the cache datapath, between the cache and lower-level memory. It holds data awaiting write-through to the lower level.

Q. Why a write buffer? A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register? A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffer contents.

On reads, the state machine checks both the cache and the write buffer -- what if the word was removed from the cache before the lower-level write completed? On writes, the state machine stalls on a full write buffer.

Write buffer logic for a LW instruction: cache state machines must be designed so that this algorithm always yields correct memory semantics.

Write buffer logic for a SW instruction: LW + SW require complex state machine logic ... plus, the state machine needs to manage two buses ... plus the write buffer FIFO logic ... (A software model of these rules follows.)
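The LW/SW rules can be captured in a small software model. This sketch models only the write buffer and DRAM (the cache is omitted to keep it short), uses hypothetical names, and picks one of the several correct RAW policies named above: scan the buffer on every load and let the newest matching write win.

    /* Write buffer between CPU and DRAM, write-through style. */
    #include <stdint.h>
    #include <stdio.h>

    #define MEM_WORDS 256
    #define WB_DEPTH  4

    static uint32_t dram[MEM_WORDS];

    typedef struct { uint32_t addr, data; } wb_entry_t;
    static wb_entry_t wb[WB_DEPTH];
    static int wb_head, wb_count;                  /* FIFO state */

    /* Drain one entry to DRAM (hardware: whenever the bus is free). */
    static void wb_drain_one(void)
    {
        if (wb_count == 0) return;
        wb_entry_t e = wb[wb_head];
        dram[e.addr] = e.data;
        wb_head = (wb_head + 1) % WB_DEPTH;
        wb_count--;
    }

    /* SW: enqueue the write; a full buffer stalls the CPU. */
    void store_word(uint32_t addr, uint32_t data)
    {
        while (wb_count == WB_DEPTH) wb_drain_one();   /* "stall" */
        wb[(wb_head + wb_count) % WB_DEPTH] = (wb_entry_t){addr, data};
        wb_count++;
    }

    /* LW: a newer buffered write must win over the DRAM copy (RAW). */
    uint32_t load_word(uint32_t addr)
    {
        uint32_t data = dram[addr];
        for (int i = 0; i < wb_count; i++) {
            wb_entry_t *e = &wb[(wb_head + i) % WB_DEPTH];
            if (e->addr == addr) data = e->data;       /* later = newer */
        }
        return data;
    }

    int main(void)
    {
        store_word(7, 0xDEADBEEF);
        /* Served from the buffer before the write reaches DRAM: */
        printf("%x\n", (unsigned)load_word(7));
        return 0;
    }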
Issue #7: Optimizing the hit time ...

Hit time is directly tied to the clock rate of the CPU. If left unchecked, it increases as cache size and associativity increase. Note that the XScale pipelines both its instruction and data caches, adding stages to the CPU pipeline. State machine design issue: pipelining the cache control!

Common bug: When to write to the cache?

A1. If no write-allocate ... only when the address is already in the cache. Issue: Must check the tag before writing, or else you may overwrite the wrong address! Options: Stall and do the tag check, or pipeline the check.
A2. Always: we do allocate on write.

Key ideas about caches ...

The cache design spectrum: from direct-mapped to fully associative.
AMAT (Ave. Memory Access Time) = Hit Time + (Miss Rate x Miss Penalty).
Cache misses: conflict, capacity, compulsory, and coherency.
Cache design bugs usually come from cache specification errors.

Influential memory systems paper by Norm Jouppi (then at DEC, now at Google), 1990. We look at it in depth ...

The baseline memory system ... that the enhancements are trying to improve. In 1990: one thousand 16Mb DRAM chips! A small software benchmark suite shows how the memory system falls short.

[Figure: the green line shows how a system with perfect caching performs; the lower curve is actual performance.]

First idea: Miss cache

Miss Cache: Small. Fully-associative. LRU replacement. Checked in parallel with the L1 cache. If the L1 and miss cache both miss, the data block returned by the next-lower cache is placed in both the L1 and the miss cache.

Second idea: Victim cache

Victim Cache: Small. Fully-associative. LRU replacement. Checked in parallel with the L1 cache. If the L1 and victim cache both miss, the data block removed from the L1 cache is placed into the victim cache.

[Figure: % of conflict misses removed, plotted vs. number of {miss, victim} cache entries. Each symbol is a benchmark; {solid, dashed} lines are the L1 {I, D} caches.]

Third idea: Streaming prefetch buffer (FIFO)

Prefetch buffer: A small FIFO of cache lines and tags. Check the head of the FIFO in parallel with the L1 cache access. If both miss, fetch missed block "k" for the L1, clear the FIFO, then prefetch blocks "k+1", "k+2", ... into the FIFO tail.

Fourth idea: Multi-way streaming prefetch buffer

Multi-way buffer: a 4-FIFO version of the original design. Allows block streams for 4 misses to proceed in parallel. If an access misses the L1 and all FIFOs, clear the LRU FIFO.

[Figure: % of all misses removed, plotted vs. number of prefetches that follow a miss, single-way vs. multi-way buffers. Each symbol is a benchmark; {solid, dashed} lines are the L1 {I, D} caches.]

The complete memory system: the baseline plus a 4-way data victim cache, a 4-way data stream buffer, and an instruction stream buffer.

[Figure: the purple line shows the performance of the enhanced system; the green line shows perfect caching; the original baseline sits below both.]
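A tag-only C model of the victim-cache lookup path may help fix the idea. The slide text covers only the insert-on-miss case; the swap-on-victim-hit policy below is the one from Jouppi's paper, and all sizes and names are illustrative.

    /* Small fully-associative victim cache beside a direct-mapped L1. */
    #include <stdint.h>
    #include <stdbool.h>

    #define L1_LINES   256   /* direct-mapped L1 */
    #define VC_ENTRIES 4     /* small, fully associative, LRU */

    static bool     l1_valid[L1_LINES];
    static uint32_t l1_tag[L1_LINES];

    typedef struct { bool valid; uint32_t block; unsigned last_use; } vc_t;
    static vc_t     vc[VC_ENTRIES];
    static unsigned now;     /* access counter for LRU */

    static void vc_insert(uint32_t block)
    {
        int lru = 0;
        for (int i = 1; i < VC_ENTRIES; i++)      /* find empty or LRU */
            if (!vc[i].valid || vc[i].last_use < vc[lru].last_use)
                lru = i;
        vc[lru] = (vc_t){ true, block, now };
    }

    /* `block` is the address with the byte offset stripped off.
       Returns true if either structure hits; data movement omitted. */
    bool access(uint32_t block)
    {
        unsigned idx = block % L1_LINES;          /* block "color" */
        uint32_t tag = block / L1_LINES;
        now++;

        if (l1_valid[idx] && l1_tag[idx] == tag)
            return true;                          /* L1 hit */

        for (int i = 0; i < VC_ENTRIES; i++) {
            if (vc[i].valid && vc[i].block == block) {
                vc[i].last_use = now;             /* victim hit: swap */
                if (l1_valid[idx])
                    vc[i].block = idx + l1_tag[idx] * L1_LINES;
                else
                    vc[i].valid = false;
                l1_valid[idx] = true;
                l1_tag[idx] = tag;
                return true;
            }
        }

        if (l1_valid[idx])                        /* full miss: the    */
            vc_insert(idx + l1_tag[idx] * L1_LINES); /* evictee -> VC  */
        l1_valid[idx] = true;                     /* fetched block -> L1 */
        l1_tag[idx] = tag;
        return false;
    }

Because the victim cache is checked in parallel with the L1, a conflict between two blocks of the same color costs a swap rather than a full trip to the lower level.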
Captive Model

IBM owns fabs, and designs CPU chips, for use in its server products. IBM's most important server line is the z-Series, compatible with the System/360 mainframe architecture introduced in 1964.

Who uses IBM mainframes? You do! Imagine a TeleBears migration to a new platform ... that goes wrong ... and wipes out a semester!

IBM Mainframe

A 32nm process, 2.75 billion transistors, 6 cores, but ... most of the die is dedicated to the memory system. Most of the power is spent in the 5.5 GHz cores ... and chips today are power limited ...

How to use 1.35 billion new transistors?

[Photos shown to scale.]
45 nm chip (2012): 5.2 GHz, 1.4B transistors.
32 nm chip (2014): 5.5 GHz, 2.75B transistors.

50% more cores + memory system scale-up ... using faster transistors to meet the power budget ...

L2 operand hit time halved. How? Splitting the instruction and data L2s simplified the design. Also: a new L2-CPU floorplan + logic optimizations. Branch predictor array sizes rival the L1s ...

On Thursday: Virtual memory ...

Have fun in section!