CS 61C: Great Ideas in Computer Architecture
Caches
Instructor: David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp12
Spring 2012, Lecture #11

New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz"
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
Harness parallelism & achieve high performance
(Diagram: hardware levels from smart phone and warehouse-scale computer down through computer, core, instruction unit(s), functional unit(s) computing A0+B0 ... A3+B3, cache memory, and logic gates; today's lecture covers the cache.)

Review
• Time (seconds/program) is the measure of performance:
  Seconds/Program = Instructions/Program × Clock cycles/Instruction × Seconds/Clock cycle
• Benchmarks stand in for real workloads as a standardized measure of relative performance
• Power is of increasing concern, and is being added to benchmarks
• Time measurement is via clock cycles, machine specific
• Profiling tools are a way to see where your program spends its time
• Don't optimize prematurely!

Agenda
• Memory Hierarchy Analogy
• Memory Hierarchy Overview
• Administrivia
• Fully Associative, N-Way Set Associative, Direct Mapped Caches
• Cache Performance
• Multilevel Caches

Conventional Wisdom (CW) in Computer Architecture
• Old CW: power cheap, transistors expensive
• New CW: "Power wall" - power expensive, transistors cheap
  - Can put more on chip than can turn on
• Old: multiplies slow, memory access fast
• New: "Memory wall" - memory slow, multiplies fast
  - 200 clocks to memory, 4 clocks for an FP multiply

Big Idea: Memory Hierarchy
(Diagram: pyramid with the processor at the top and Levels 1, 2, 3, ..., n below; increasing distance from the processor means decreasing speed and increasing size of memory at each level.)
As we move to deeper levels, the latency goes up and the price per bit goes down. Why? (Student Roulette)

Library Analogy
• Writing a report based on books on reserve
  - E.g., works of J.D. Salinger
• Go to the library to get a reserved book and place it on a desk in the library
• If you need more, check them out and keep them on the desk
  - But don't return earlier books, since you might need them
• You hope this collection of ~10 books on the desk is enough to write the report, despite 10 being only 0.00001% of the books in the UC Berkeley libraries

Principle of Locality
• Principle of Locality: programs access a small portion of the address space at any instant of time
• What program structures lead to locality in code? (Student Roulette)
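One answer, as a sketch that is not from the original slides: in the C loop below, `sum` and `i` are reused on every iteration (temporal locality), and the array elements are read from consecutive addresses (spatial locality).

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    int sum = 0;

    for (int i = 0; i < N; i++)   /* fill the array */
        a[i] = i;

    /* "sum" and "i" are touched every iteration: temporal locality.
       a[0], a[1], ... sit at consecutive addresses: spatial locality,
       so one cache block refill serves several iterations. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}
```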
How does hardware exploit the principle of locality?
• Offer a hierarchy of memories where
  - the level closest to the processor is fastest (and most expensive per bit, so smallest)
  - the level furthest from the processor is largest (and least expensive per bit, so slowest)
• Goal is to create the illusion of a memory almost as fast as the fastest memory and almost as large as the biggest memory in the hierarchy

A Cache
• Processor requests 32-bit words
• Cache controller checks the address from the CPU to see if the requested word is in the cache
• If not, go to memory and load the word into the cache, kicking out some other word
  - "Bus" is the name for the wires connecting the processor to memory
• Speedup: a cache access typically takes 1 or 2 clock cycles, vs. 100-200 to DRAM
(Diagram: on-chip CPU and cache, connected by a bus carrying 32-bit data and 32-bit addresses per cycle to DRAM memory.)

Anatomy of a Cache
• Operations:
  1. Cache hit
  2. Cache miss
  3. Refill cache from memory
• Cache needs an Address Tag to decide whether the processor address is a cache hit or a cache miss
(Diagram: processor sends a 32-bit address and exchanges 32-bit data with the cache, which holds Tag and Data entries and in turn exchanges 32-bit addresses and data with memory.)

Hardware Cost of Cache
• Need a 32-bit tag for every 32 bits of data
• Optimization: 1 tag for 4 (or more) words
  - A group of words is called a "cache block"
  - ¼ the number of tags
• Also can make the address tag 2 bits narrower, since the block is 4X larger
(Diagram: with 128-bit data blocks, one Tag in the cache covers four 32-bit words.)

Big Idea: Locality
• Temporal locality (locality in time)
  - Go back to the same book on the desktop multiple times
  - If a memory location is referenced, then it will tend to be referenced again soon
• Spatial locality (locality in space)
  - When you go to the book shelf, pick up multiple books on J.D. Salinger, since the library stores related books together
  - If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon

Principle of Locality
• Principle of Locality: programs access a small portion of the address space at any instant of time
• What program structures lead to temporal and spatial locality in code?
• In data? (Student Roulette)
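For the data question, an illustration of my own rather than the deck's: the two traversals below compute the same sum, but the first matches C's row-major memory layout and so has far better spatial locality.

```c
#include <stdio.h>

#define N 512

static double m[N][N];            /* C stores rows contiguously */

int main(void) {
    double sum = 0.0;

    /* Row-major traversal: the inner loop walks consecutive
       addresses, so most accesses hit in a block that was just
       fetched (good spatial locality). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];

    /* Column-major traversal: each access jumps N * 8 bytes, so
       nearly every access touches a different cache block (poor
       spatial locality), even though it computes the same sum. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```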
Administrivia
• Lab #6 posted
• Hundreds of students using GitHub successfully
  - Will lose 2 points if GSIs need to regrade due to Git mistakes
• Project #2, Part 2 due Sunday @ 11:59:59
• No homework this week!
• Midterm in 2 weeks:
  - TA Review: Su, Mar 4, starting 2 PM, 2050 VLSB
  - Exam: Tu, Mar 6, 6:40-9:40 PM, 2050 VLSB (room change)
  - TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
  - Covers everything through lecture Tue Feb 28
  - Closed book; can bring one sheet of notes, both sides
  - Copy of Green Card will be supplied
  - No phones, calculators, ...; just bring pencils & eraser

Project 2, Part 1 Scores
Avg: 10.7 pts with 244 submissions

61C in the News
Australian and American physicists have built a working transistor from a single phosphorus atom embedded in a silicon crystal. "It shows that Moore's Law can be scaled toward atomic scales in silicon." ... Currently, the smallest dimension in state-of-the-art computers made by Intel is 22 nm, less than 100 atoms in diameter.
Moore's Law refers to technology improvements by the semiconductor industry that have doubled the number of transistors on a silicon chip roughly every 18 months for the past half-century. That has led to accelerating increases in performance and declining prices.
"Physicists Create a Working Transistor From a Single Atom," by John Markoff, New York Times, February 20, 2012

Agenda
• Memory Hierarchy Analogy
• Memory Hierarchy Overview
• Administrivia
• Fully Associative, N-Way Set Associative, Direct Mapped Caches
• Cache Performance
• Multilevel Caches

Hardware Cost of Cache
• Need to compare every tag to the processor address
• Comparators are expensive
• Optimization: 2 sets => ½ the comparators
• 1 address bit selects which set
(Diagram: the processor address is checked against the tags of Set 0 and Set 1 in the cache, each set holding 128-bit data blocks.)

Processor Address Fields Used by Cache Controller
• Block Offset: byte address within block
• Index: selects which set
• Tag: remaining portion of processor address

  | Tag | Index | Block offset |

• Size of Index = log2(number of sets)
• Size of Tag = address size - size of Index - log2(number of bytes/block)

What is the limit to the number of sets?
• Can save more comparators if we have more than 2 sets
• Limit: as many sets as cache blocks
• Called a "Direct Mapped" design

  | Tag | Index | Block offset |

One More Detail: Valid Bit
• When a new program starts, the cache does not have valid information for this program
• Need an indicator of whether this tag entry is valid for this program
• Add a "valid bit" to the cache entry
  - 0 => cache miss, even if by chance address = tag
  - 1 => cache hit if processor address = tag

Direct Mapped Cache Example
• One-word blocks, cache size = 1K words (or 4KB)
• 32-bit address split: Tag = bits 31-12 (20 bits), Index = bits 11-2 (10 bits), Byte offset = bits 1-0
• The Index selects one of the 1024 entries (0-1023); each entry holds a Valid bit, a 20-bit Tag, and 32 bits of Data
• Valid bit ensures something useful is in the cache for this index
• Comparator compares the stored Tag with the upper part of the address to see if it is a hit
• Read data from the cache instead of memory on a hit
• What kind of locality are we taking advantage of? (Student Roulette)

Cache Terms
• Hit rate: fraction of accesses that hit in the cache
• Miss rate: 1 - hit rate
• Miss penalty: time to replace a block from the lower level in the memory hierarchy into the cache
• Hit time: time to access cache memory (including tag comparison)

Mapping a 6-bit Memory Address
• Note: $ = cache
• Bits 1-0: byte offset within block - the low-order two bits, ignore! (nothing is smaller than a block)
• Bits 3-2: index - which cache block the memory block maps to (memory block modulo the number of blocks in the cache)
• Bits 5-4: tag - which memory block is in a given cache block (the top two bits)
• In this example, block size is 4 bytes / 1 word (it could be multi-word)
• Memory and cache blocks are the same size; the block is the unit of transfer between memory and cache
• # memory blocks >> # cache blocks
  - 16 memory blocks / 16 words / 64 bytes / 6 bits to address all bytes
  - 4 cache blocks, 4 bytes (1 word) per block
  - 4 memory blocks map to each cache block
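A small sketch (not from the slides; the macro names are my own) showing how a cache controller would carve this 6-bit address into tag, index, and byte offset with shifts and masks:

```c
#include <stdio.h>

/* Field widths for the 6-bit example above: 4-byte blocks and a
   4-block direct-mapped cache. */
#define OFFSET_BITS 2   /* log2(4 bytes per block)  */
#define INDEX_BITS  2   /* log2(4 cache blocks)     */

int main(void) {
    /* Walk a few 6-bit addresses and split each into its fields. */
    for (unsigned addr = 0; addr < 64; addr += 20) {
        unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
        unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr %2u: tag %u, index %u, byte offset %u\n",
               addr, tag, index, offset);
    }
    return 0;
}
```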
Caching: A Simple First Example
• Main memory: 16 one-word blocks, addresses 0000xx through 1111xx; the two low-order bits define the byte in the block (32-bit words)
• Cache: 4 entries (index 00, 01, 10, 11), each holding a Valid bit, a Tag, and Data
• Q: Where in the cache is the memory block?
  - Use the next 2 low-order memory address bits - the index - to determine which cache block (i.e., modulo the number of blocks in the cache)
  - (block address) modulo (# of blocks in the cache)
• Q: Is the memory block in the cache?
  - Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache (provided the Valid bit is 1)

Multiword Block Direct Mapped Cache
• Four words/block, cache size = 1K words
• 32-bit address split: Tag = bits 31-12 (20 bits), Index = bits 11-4 (8 bits), Block offset = bits 3-2, Byte offset = bits 1-0
• The Index selects one of 256 entries (0-255); the Block offset then selects one of the four 32-bit words in the block
• What kind of locality are we taking advantage of? (Student Roulette)

Cache Names for Each Organization
• "Fully Associative": block can go anywhere
  - First design in lecture
  - Note: no Index field, but 1 comparator/block
• "Direct Mapped": block goes in one place
  - Note: only 1 comparator
  - Number of sets = number of blocks
• "N-way Set Associative": N places for a block
  - Number of sets = number of blocks / N
  - Fully associative: N = number of blocks
  - Direct mapped: N = 1

Range of Set-Associative Caches
• For a fixed-size cache, each increase by a factor of 2 in associativity doubles the number of blocks per set (i.e., the number of "ways") and halves the number of sets
  - Decreases the size of the Index by 1 bit and increases the size of the Tag by 1 bit

  <- more associativity (more ways)   | Tag | Index | Block offset |

For S sets, N ways, B blocks, which statements hold?
A) The cache has B tags
B) Size of Index = log2(B)
C) B = N x S
D) The cache needs N comparators
☐ A only
☐ A and B only
☐ A, B, and C only
☐

Measuring Cache Performance
• Assuming cache hit costs are included as part of the normal CPU execution cycle, then
  CPU time = IC × CPI × CC
           = IC × (CPI_ideal + Memory-stall cycles) × CC
  where CPI_stall = CPI_ideal + Memory-stall cycles
• A simple model for memory-stall cycles:
  Memory-stall cycles = accesses/program × miss rate × miss penalty
• Will talk about writes and write misses next lecture, where it's a little more complicated

Average Memory Access Time (AMAT)
• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache:
  AMAT = Time for a hit + Miss rate × Miss penalty

Question: Given a 200 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache hit time of 1 clock cycle, what is the AMAT?
☐ ≤ 200 psec
☐ 400 psec
☐ 600 psec
☐
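A quick check of the arithmetic, as a minimal sketch rather than course code, plugging the question's numbers into AMAT = hit time + miss rate × miss penalty:

```c
#include <stdio.h>

int main(void) {
    /* Numbers from the question above. */
    double clock_ps    = 200.0;  /* clock period in picoseconds */
    double hit_cycles  = 1.0;    /* cache hit time              */
    double miss_rate   = 0.02;   /* misses per access           */
    double penalty_cyc = 50.0;   /* miss penalty in cycles      */

    /* AMAT = Time for a hit + Miss rate x Miss penalty */
    double amat_cycles = hit_cycles + miss_rate * penalty_cyc;  /* 2 cycles */
    printf("AMAT = %.0f cycles = %.0f psec\n",
           amat_cycles, amat_cycles * clock_ps);                /* 400 psec */
    return 0;
}
```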
Impacts of Cache Performance
• Relative cache ($) penalty increases as processor performance improves (faster clock rate and/or lower CPI)
  - Memory speed is unlikely to improve as fast as processor cycle time. When calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss
  - The lower the CPI_ideal, the more pronounced the impact of stalls
• Processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates:
  - Memory-stall cycles = 2% × 100 + 36% × 4% × 100 = 3.44
  - So CPI_stalls = 2 + 3.44 = 5.44
  - More than twice the CPI_ideal!
• What if the CPI_ideal is reduced to 1? What if the D$ miss rate went up by 1%? (Student Roulette)

How to Reduce Miss Penalty?
• Could there be locality in the misses from a cache?
• Use multiple cache levels!
• With Moore's Law, have more room on the die for bigger L1 caches and for a second-level (L2) cache
• And in some cases even an L3 cache!

Typical Memory Hierarchy
(Diagram: on-chip components are the control, datapath, RegFile, and the instruction and data caches; the second-level cache (SRAM), main memory (DRAM), and secondary memory (disk or flash) sit off-chip.)

  Level:           RegFile   I$/D$    L2$ (SRAM)   Main memory (DRAM)   Secondary memory (disk or flash)
  Speed (cycles):  ½'s       1's      10's         100's                1,000,000's
  Size (bytes):    100's     10K's    M's          G's                  T's
  Cost/bit:        highest   <------------------------------------->    lowest

• Principle of locality + memory hierarchy presents the programmer with ≈ as much memory as is available in the cheapest technology, at ≈ the speed offered by the fastest technology

Memory Hierarchy Technologies
• Caches use SRAM (Static RAM) for speed and technology compatibility
  - Fast (typical access times of 0.5 to 2.5 ns)
  - Low density (6-transistor cells), higher power, expensive ($2000 to $4000 per GB today)
  - Static: content will last as long as power is on
• Main memory uses DRAM (Dynamic RAM) for size (density)
  - Slower (typical access times of 50 to 70 ns)
  - High density (1-transistor cells), lower power, cheaper ($20 to $40 per GB today)
  - Dynamic: needs to be "refreshed" regularly (~every 8 ms)
    - Refresh consumes 1% to 2% of the active cycles of the DRAM

For an L1 cache, AMAT = Time for a hit + Miss rate × Miss penalty. What is the AMAT for an L2 cache?
☐ Time for L2 hit + L2 miss rate × L2 miss penalty
☐ Time for L1 hit + L1 miss rate × L2 miss rate × miss penalty
☐ Time for L1 hit + L1 miss rate × (Time for L2 hit + L2 miss rate × miss penalty)
☐

Local vs. Global Miss Rates
• Local miss rate: the fraction of references to one level of a cache that miss
  - Local miss rate L2$ = L2$ misses / L1$ misses
• Global miss rate: the fraction of references that miss in all levels of a multilevel cache
  - L2$ local miss rate >> the global miss rate
  - Global miss rate = L2$ misses / total accesses
    = (L2$ misses / L1$ misses) × (L1$ misses / total accesses)
    = local miss rate L2$ × local miss rate L1$
• AMAT = Time for a hit + Miss rate × Miss penalty
• AMAT = Time for an L1$ hit + (local) miss rate L1$ × (Time for an L2$ hit + (local) miss rate L2$ × L2$ miss penalty)

Reducing Cache Miss Rates
• E.g., CPI_ideal of 2, 100-cycle miss penalty (to main memory), 25-cycle miss penalty (to L2$), 36% load/stores, a 2% (4%) L1 I$ (D$) miss rate, and add a 0.5% L2$ miss rate
  - CPI_stalls = 2 + 0.02×25 + 0.36×0.04×25 + 0.005×100 + 0.36×0.005×100 = 3.54 (vs. 5.44 with no L2$)
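The same numbers plugged into a short C sketch (variable names are my own, not course code) reproduce both CPI values:

```c
#include <stdio.h>

int main(void) {
    /* Parameters from the "Reducing Cache Miss Rates" example. */
    double cpi_ideal   = 2.0;
    double mem_penalty = 100.0;  /* miss penalty to main memory (cycles) */
    double l2_penalty  = 25.0;   /* miss penalty to L2$ (cycles)         */
    double ldst_frac   = 0.36;   /* fraction of load/store instructions */
    double i_miss      = 0.02;   /* L1 I$ miss rate                      */
    double d_miss      = 0.04;   /* L1 D$ miss rate                      */
    double l2_miss     = 0.005;  /* global L2$ miss rate                 */

    /* Without L2$: every L1 miss pays the full trip to memory. */
    double cpi_no_l2 = cpi_ideal
                     + i_miss * mem_penalty
                     + ldst_frac * d_miss * mem_penalty;         /* 5.44 */

    /* With L2$: every L1 miss pays 25 cycles to L2; the 0.5% of
       accesses that also miss in L2 pay 100 more cycles. */
    double cpi_l2 = cpi_ideal
                  + i_miss * l2_penalty
                  + ldst_frac * d_miss * l2_penalty
                  + l2_miss * mem_penalty
                  + ldst_frac * l2_miss * mem_penalty;           /* 3.54 */

    printf("CPI without L2$: %.2f\n", cpi_no_l2);
    printf("CPI with    L2$: %.2f\n", cpi_l2);
    return 0;
}
```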
Multilevel Cache Design Considerations
• Different design considerations for L1$ and L2$
  - L1$ focuses on minimizing hit time for a shorter clock cycle: smaller $ with smaller block sizes
  - L2$(s) focus on reducing miss rate to reduce the penalty of long main-memory access times: larger $ with larger block sizes
• Miss penalty of L1$ is significantly reduced by the presence of L2$, so L1$ can be smaller/faster but with a higher miss rate
• For the L2$, hit time is less important than miss rate
  - L2$ hit time determines L1$'s miss penalty

Review so far
• Principle of Locality for libraries / computer memory
• Hierarchy of memories (speed/size/cost per bit) to exploit locality
• Cache: copy of data from a lower level in the memory hierarchy
• Direct mapped: find a block in the cache using the Tag field and a Valid bit for a hit
• Larger caches reduce miss rate via temporal and spatial locality, but can increase hit time
• Multilevel caches help with miss penalty
• AMAT helps balance hit time, miss rate, and miss penalty
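To tie the review together, here is a minimal hit/miss simulation of a direct-mapped lookup, exercising the Tag field, the Valid bit, and the Index. It is a sketch of my own, using the parameters of the earlier Direct Mapped Cache Example slide (1K one-word blocks), not code from the course:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* A minimal direct-mapped cache model: 1024 one-word blocks, as in
   the "Direct Mapped Cache Example" slide. Tracks hits and misses
   only; no data is actually stored. */
#define OFFSET_BITS 2                      /* 4-byte (one-word) blocks */
#define INDEX_BITS  10                     /* 1K blocks                */
#define NUM_BLOCKS  (1u << INDEX_BITS)

struct line { uint32_t tag; int valid; };

static struct line cache[NUM_BLOCKS];
static long hits, misses;

static void access(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    if (cache[index].valid && cache[index].tag == tag) {
        hits++;                            /* Valid bit set, tag matches */
    } else {
        misses++;                          /* refill block from memory   */
        cache[index].valid = 1;
        cache[index].tag   = tag;
    }
}

int main(void) {
    memset(cache, 0, sizeof cache);        /* all Valid bits start at 0 */

    /* First pass over 1K words: cold misses on every access, since
       each one-word block holds no neighbors. */
    for (uint32_t a = 0; a < 4096; a += 4)
        access(a);
    /* Second pass over the same addresses: temporal locality, all hits. */
    for (uint32_t a = 0; a < 4096; a += 4)
        access(a);

    printf("hits %ld, misses %ld\n", hits, misses);
    return 0;
}
```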