CS510 Computer Architecture, Lecture 12: Storage Hierarchy

Who Cares about Memory Hierarchy?
• Processor only thus far in the course
  – CPU cost/performance, ISA, pipelined execution
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance improves about 35%/year in the early 1980s and about 55%/year thereafter, while DRAM performance improves only about 7%/year, opening an ever-widening CPU-DRAM gap.]
• 1980: no cache in microprocessors
• 1995: two-level caches; 60% of the transistors on the Alpha 21164 microprocessor are devoted to cache

General Principles
• Locality
  – Temporal locality: a referenced item tends to be referenced again soon
  – Spatial locality: items near a referenced item tend to be referenced soon
• Locality + "smaller hardware is faster" = memory hierarchy
  – Levels: each level is smaller, faster, and more expensive per byte than the level below it
  – Inclusive: data found in the top level is also found in the levels below
• Definitions
  – Upper level: the level closer to the processor
  – Block: the minimum unit of information that is either present or not present in the upper level
  – Address = block frame address + block offset address
  – Hit time: time to access the upper level, including the time to determine hit or miss

Measures
• Hit rate: fraction of accesses found in that level
  – Usually so high that we talk about the miss rate (or fault rate) instead
  – Miss-rate fallacy: miss rate relates to average memory access time the way MIPS relates to CPU performance; it is easy to mistake it for the performance measure itself
• Average memory access time = Hit time + Miss rate x Miss penalty (in ns or clocks)
• Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
  – Access time: time to access the lower level (a function of lower-level latency)
  – Transfer time: time to transfer the block (a function of the bandwidth between the upper and lower levels and of the block size)

Block Size vs. Measures
Increasing the block size generally increases the miss penalty and decreases the miss rate.
[Figure: curves plotted against block size. Miss penalty (access time + transfer time) grows with block size; miss rate shrinks with block size; their product, Miss Penalty x Miss Rate, gives the average memory access time (AMAT), which reaches a minimum at an intermediate block size.]

Implications for CPU
• Fast hit check, since every memory access needs it
  – A hit is the common case
• Unpredictable memory access time
  – 10s of clock cycles: wait
  – 1000s of clock cycles:
    • Interrupt, switch, and do something else
    • New style: multithreaded execution
• How to handle a miss? (10s of cycles => HW, 1000s of cycles => SW)

4 Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Q1. Block Placement: Where Can a Block Be Placed in the Upper Level?
• Placement policies: Fully Associative (FA), Direct Mapped (DM), 2-way Set Associative (SA)
  – Set-associative mapping: set = (block number) modulo (number of sets), as worked out in the sketch below
[Figure: memory block 12 placed in an 8-block cache. Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into cache block 4 (12 mod 8 = 4). 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4 = 0), with the cache organized as sets 0-3 of two blocks each. Memory block frame addresses shown run from 0 through 15 and beyond.]
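To make the AMAT formula and the block-placement example above concrete, here is a minimal C sketch. The hit time, miss rate, and miss penalty passed to amat() are made-up illustrative values, not numbers from the lecture; the placement part reproduces the slide's block-12-in-an-8-block-cache example.

```c
#include <stdio.h>

/* Average memory access time: AMAT = hit time + miss rate * miss penalty.
 * Times are in clock cycles; the arguments used below are assumed,
 * illustrative values only. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Assumed example: 1-cycle hit, 5% miss rate, 40-cycle miss penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 40.0));

    /* Block placement for memory block 12 in an 8-block cache,
     * matching the Q1 slide. */
    int block = 12, cache_blocks = 8, ways = 2;
    int sets = cache_blocks / ways;

    printf("Direct mapped:     block %d -> cache block %d (12 mod 8)\n",
           block, block % cache_blocks);
    printf("%d-way set assoc.:  block %d -> set %d (12 mod %d)\n",
           ways, block, block % sets, sets);
    printf("Fully associative: block %d -> any of the %d cache blocks\n",
           block, cache_blocks);
    return 0;
}
```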
Q2. Block Identification: How to Find a Block in the Upper Level?
• Tag on each block
  – No need to check the index or the block offset
• Increasing associativity shrinks the index and expands the tag

  Block Address = | Tag | Index | Block Offset |
  – Fully associative (FAM): no index field
  – Direct mapped (DM): large index field

Q3. Block Replacement: Which Block Should Be Replaced on a Miss?
• Easy for direct mapped: only one candidate
• Set associative or fully associative:
  – Random
  – LRU (least recently used)

Miss rates by cache size, associativity, and replacement policy:

                  2-way             4-way             8-way
  Cache Size    LRU     Random    LRU     Random    LRU     Random
  16 KB         5.18%   5.69%     4.67%   5.29%     4.39%   4.96%
  64 KB         1.88%   2.01%     1.54%   1.66%     1.39%   1.53%
  256 KB        1.15%   1.17%     1.13%   1.13%     1.12%   1.12%

Q4. Write Strategy: What Happens on a Write?
• DLX: stores are 9% and loads are 26% of instructions in integer programs
  – Stores are therefore:
    • 9% / (100% + 26% + 9%), about 7%, of the overall memory traffic
    • 9% / (26% + 9%), about 25%, of the data cache traffic
  – Reads are the majority, so make the common case fast: optimize caches for reads
  – But high-performance designs cannot neglect the speed of writes

Q4. What Happens on a Write? (continued)
• Write Through: the information is written both to the block in the cache and to the block in the lower-level memory.
• Write Back: the information is written only to the block in the cache. The modified cache block (dirty block) is written to main memory only when it is replaced.
  – Requires tracking whether each block is clean or dirty
• Pros and cons of each:
  – WT: read misses never cause writes to memory (no dirty block to write back on replacement)
  – WB: repeated writes to the same block cause no repeated writes to memory
• WT needs to be combined with write buffers so that the CPU does not wait for the lower-level memory

Q4. What Happens on a Write Miss?
• Write Allocate (fetch on write): the block is brought into the cache and then written
• No-Write Allocate (write around): the write goes only to the lower level, bypassing the cache
• WB caches generally use write allocate, while WT caches often use no-write allocate
(A small sketch of the address split and a write-back, write-allocate store is given after the summary.)

Summary
• The CPU-memory gap is a major obstacle to performance, for both HW and SW
• Take advantage of program behavior: locality
• Execution time of a program is still the only reliable performance measure
• The 4 questions of memory hierarchy design:
  – Block placement
  – Block identification
  – Block replacement
  – Write strategy
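As a concrete illustration of the block identification (tag / index / offset) and write-strategy slides above, here is a minimal C sketch of a store hitting a direct-mapped, write-back, write-allocate cache. The geometry (64-byte blocks, 64 sets) and all names are illustrative assumptions, not parameters given in the lecture.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed geometry: 64-byte blocks and 64 sets of a direct-mapped cache,
 * so an address splits into  tag | 6-bit index | 6-bit block offset. */
#define OFFSET_BITS 6
#define INDEX_BITS  6
#define NUM_SETS    (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    bool     dirty;   /* needed only by a write-back cache */
    uint32_t tag;
} Line;

static Line cache[NUM_SETS];

/* Split an address into tag / index / offset, as on the
 * Block Identification slide. */
static void split(uint32_t addr, uint32_t *tag, uint32_t *index, uint32_t *offset)
{
    *offset = addr & ((1u << OFFSET_BITS) - 1);
    *index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);
}

/* Handle a store with write-back + write-allocate: on a miss the block is
 * fetched and installed, and the write only marks it dirty; the dirty block
 * would be written to memory when it is later replaced.  A write-through
 * cache would instead also write memory on every store, typically behind
 * a write buffer. */
static void write_access(uint32_t addr)
{
    uint32_t tag, index, offset;
    split(addr, &tag, &index, &offset);
    Line *l = &cache[index];

    if (!l->valid || l->tag != tag) {   /* write miss */
        /* write-allocate: (fetch the block from memory here), then install it */
        l->valid = true;
        l->tag   = tag;
    }
    l->dirty = true;                    /* write-back: defer the memory write */
    printf("store 0x%08x -> set %u, tag 0x%x, offset %u\n",
           addr, index, tag, offset);
}

int main(void)
{
    write_access(0x00001234);   /* miss: allocate the block, mark it dirty */
    write_access(0x00001238);   /* hit in the same block */
    return 0;
}
```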