CS5100 Advanced Computer Architecture
Memory Hierarchy Design
Prof. Chung-Ta King, Department of Computer Science, National Tsing Hua University, Taiwan
(Slides are from the textbook, Prof. Hsien-Hsin Lee, and Prof. Yasun Hsu)

About This Lecture
• Goal: to review the basics of memory hierarchy design and basic cache optimization techniques
• Outline:
  - Memory hierarchy design (Sec. 2.1, Appendix B.1): motivation, basic concepts
  - Basic cache organization and performance (Appendix B.2)
  - Basic cache optimizations (Appendix B.3)

Memory Performance Gap
(figure: processor vs. memory performance over time)

Motivation for Memory Hierarchy
• Programmers want unlimited amounts of memory with low latency, but fast memory is more expensive than slower memory
• Solution: a small fast memory + a big slow memory looks like a big, fast memory
(figure: a small & fast memory (MC) placed in front of a big & slow main memory (MM))

Basic Idea of Memory Hierarchy
• The entire addressable memory space is available in the largest, slowest memory
• Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor

Model of Memory Hierarchy
• Classic hierarchy: register file and L1 instruction/data caches (SRAM), then an L2 cache, then main memory (DRAM), then disk
• Modern hierarchies add more levels: L3 and L4 caches (SRAM, eDRAM, or emerging memories) between L2 and DRAM

Model of Memory Hierarchy (typical latency and bandwidth)
  Registers (CPU chip):      1 cycle,            3-10 words/cycle
  Level 1 cache (CPU chip):  1-3 cycles,         1-2 words/cycle
  Level 2 cache (CPU chip):  5-10 cycles,        1 word/cycle
  DRAM (separate chips):     30-100 cycles,      0.5 words/cycle
  Disk (mechanical):         10^6-10^7 cycles,   0.01 words/cycle
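Not from the slides, but as a concrete companion to the latency table above: a minimal pointer-chasing sketch in C that exposes these steps empirically. The working-set sizes, iteration count, and the use of POSIX clock_gettime are my own assumptions; compile with optimization (e.g. cc -O2) and expect the ns/access figure to jump as the working set outgrows each cache level.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time one dependent load in a working set of n_elems pointers. */
static void ns_per_access(size_t n_elems, long iters)
{
    size_t *next = malloc(n_elems * sizeof *next);
    size_t i, p = 0;
    struct timespec t0, t1;

    if (!next) { fprintf(stderr, "malloc failed\n"); return; }

    /* Sattolo's algorithm: one random cycle over all elements, so every load
     * depends on the previous one and prefetchers cannot hide the latency. */
    for (i = 0; i < n_elems; i++) next[i] = i;
    for (i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long k = 0; k < iters; k++) p = next[p];   /* the pointer chase */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Printing p keeps the chase loop from being optimized away. */
    printf("%8zu KB : %6.1f ns/access (check: %zu)\n",
           n_elems * sizeof(size_t) / 1024, ns / iters, p);
    free(next);
}

int main(void)
{
    /* Working sets from 16 KB (fits in L1) up to 64 MB (spills to DRAM). */
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4)
        ns_per_access(kb * 1024 / sizeof(size_t), 20 * 1000 * 1000);
    return 0;
}
```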
Memory Hierarchy Design
• Memory hierarchy design becomes even more crucial with recent multi-core processors
• Aggregate peak bandwidth grows with the number of cores:
  - an Intel Core i7 generates two data references per core per clock
  - four cores and a 3.2 GHz clock: 25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s!
  - DRAM bandwidth is only about 6% of this (25 GB/s)
• Requires:
  - multi-port, pipelined caches
  - two levels of cache per core
  - a shared third-level cache on chip

Performance and Power
• High-end microprocessors have >10 MB of on-chip cache, which consumes a large share of the area and power budget
(figure: Intel Core i7 die with per-core L1 I$/D$ and L2 caches and a shared L3)

Underlying Principles
• Locality of reference: a program accesses a relatively small portion of the address space at any instant of time; locality is a program property
  - temporal locality
  - spatial locality
(figure: memory locations touched over time, showing distinct code, stack, and array access patterns)

But, Program Behavior Matters
• Locality depends on the type of program
• Some programs 'behave' well, e.g. a small loop operating on data on the stack (Towers of Hanoi)
• Some programs don't

Memory Hierarchy Basics
• Hit: the data appears in some block (page) in the upper level
  - hit rate: % of memory accesses found in the upper level
  - hit time: time to access the upper level
• Miss: the data is not in the upper level
  - miss rate = 1 - hit rate
  - miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit time << miss penalty
(figure: processor, upper-level memory holding block X, lower-level memory holding block Y; a miss in the upper level fetches the block from the lower level)

Memory Hierarchy Performance
• Average memory access time between levels = hit time + miss rate x miss penalty
• Miss penalty: time to fetch a block from the lower memory level
  - access time: time to reach the lower level; a function of latency
  - transfer time: a function of the bandwidth between levels
    • transfer one "cache line/block" at a time
    • transfer at the size of the memory-bus width

4 Questions for Memory Hierarchy
• Block placement: where can a block go in the upper level?
  - fully associative, direct mapped, set-associative
• Block identification: how is a block found in the upper level?
  - search for a matching address tag; address = block address (tag + index) + block offset
  - a valid bit marks entries that actually hold data; the tag/index boundary depends on the associativity
• Block replacement: which block to replace on a miss?
  - easy for direct mapped; random/LRU for associative caches
• Write strategy: what happens on a write?
  - write through vs. write back
  - write allocate vs. no-write allocate

Outline
• Memory hierarchy design (Sec. 2.1, Appendix B.1)
  - Motivation, basic concepts
• Basic cache organization and performance (Appendix B.2)
• Basic cache optimizations (Appendix B.3)

Cache on CPU Performance
• CPU time = (CPU execution cycles + memory stall cycles) x clock cycle time
• Memory stall cycles = memory accesses x miss rate x miss penalty
  - assumptions: memory stalls are due to cache misses, hit cycles are included in the CPU execution cycles, and reads and writes have the same miss rate and penalty
• CPU time = instruction count x (CPI_ideal + memory accesses/instruction x miss rate x miss penalty) x clock cycle time
• Cache design becomes more important for CPUs with lower CPI and higher clock rates

Cache Basics
• When a word is not found in the cache, a miss occurs:
  - fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  - the lower level may be another cache or the main memory
  - also fetch the other words contained within the block
  - place the block into the cache, in any location within its set, determined by the address
• Set = block address MOD number of sets (see the address-splitting sketch below)
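As a small illustration of the "block address MOD number of sets" rule, and of the 1 KB direct-mapped example that follows, here is a hedged sketch of how an address is split into tag, index, and offset. The variable names and the sample address 0x1234 are just illustrative; the geometry matches the upcoming slide.

```c
#include <stdio.h>

int main(void)
{
    const unsigned block_size = 32;               /* 32-byte blocks -> 5 offset bits */
    const unsigned assoc      = 1;                /* direct mapped                   */
    const unsigned num_sets   = 1024 / (block_size * assoc);  /* 1 KB -> 32 sets, 5 index bits */

    unsigned addr = 0x1234;                       /* an example reference            */

    unsigned offset    = addr % block_size;       /* byte within the block           */
    unsigned block_adr = addr / block_size;       /* block address                   */
    unsigned index     = block_adr % num_sets;    /* set = block address MOD sets    */
    unsigned tag       = block_adr / num_sets;    /* remaining high-order bits       */

    printf("addr 0x%x -> tag 0x%x, index %u, offset %u\n", addr, tag, index, offset);
    /* prints: addr 0x1234 -> tag 0x4, index 17, offset 20 */
    return 0;
}
```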
Cache Organization
(figure: a direct-mapped cache with lines indexed 000 to 111; memory blocks whose addresses end in the same index bits, e.g. 00100, 01100, 10100, and 11100, all map to the same cache line, 100)

Example: 1KB DM Cache, 32-byte Blocks
• The lowest M bits are the offset (block size = 2^M = 32)
• Index = log2(# of sets)
• Will set-associativity increase or reduce the number of index bits?
(figure: a 32-bit address split into tag (bits 31-10, e.g. 0x01), index (bits 9-5, e.g. 0x00), and offset (bits 4-0); 32 sets, each with a valid bit, a cache tag, and 32 bytes of cache data, byte 0 through byte 1023 across the array)

Example: Set-Associative Cache
(figure: an N-way set-associative cache with S sets; the tag portion of the address is compared against the N tags of the selected set, and a multiplexer delivers the hit data)
• Costly comparators are shared across sets, at the price of an extra MUX delay

Comparison of Cache Organization
• Cache misses as a metric; causes of cache misses:
  - Compulsory: first access to a block
  - Capacity: a block is discarded due to the limited cache size and later retrieved
  - Conflict: a block is discarded due to a conflict in its set and later retrieved
  - Coherence: in multicore systems (next semester)
(figure: a processor issuing a reference stream (0x1234, ...) to caches of different sizes and associativities)

Comparison of Cache Organization
• Which cache organization can reduce compulsory misses?
• Which cache organization can reduce capacity misses?
• Which cache organization can reduce conflict misses?
• Does reducing cache misses equal improving cache performance?

Comparison of Cache Organizations
• Comparisons in terms of space and time:
  - Can accesses to the tag and data arrays be done in parallel?
  - How many comparators and multiplexers are needed? How about the wiring in the IC layout?
  - How many bits of storage are needed in the tag and data arrays? (The index bits need not be stored!)

Speed of Direct-Mapped Cache
• Tag and data access happen in parallel, so cache access is faster!
(figure: the index field selects one entry of the tag array and one entry of the data array simultaneously; the offset selects the word)

Overhead of Fully Associative Cache
(figure: the tag is compared against every stored tag in parallel (associative search); a multiplexer / rotate-and-mask network then extracts the addressed data)

Overhead of Fully Associative Cache
(figure: every line has its own tag comparator feeding the read-data selection logic)
• Additional circuitry compared to DM caches, and more extensive than in SA caches
• Makes FA caches slower to access than either DM or SA caches of comparable size

Other Design Issues: Write Hit Policy
• Write through: update the next level on every write
  - the cache is always clean
  - lots of traffic to the next level (mostly writes)
• Write back: write to the cache and mark the block dirty; update main memory on eviction
  - more complex eviction and coherence
• Reservation problem: reads use the tag and data arrays at the same time, while writes use the tag array first and then the data array; how can the cache be pipelined to allow one read/write per cycle?
(A toy comparison of write-through and write-back traffic follows below.)
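The following toy model is my own sketch, not code from the slides. It contrasts the two write-hit policies on a tiny direct-mapped cache: write-through sends every store to the next level, while write-back only writes a block back when a dirty block is evicted. The trace and geometry are made up, and for simplicity the toy write-allocates on every miss, whereas write-through is typically paired with no-write allocate (next slide).

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NUM_SETS   8      /* direct mapped: one block per set */
#define BLOCK_SIZE 16     /* bytes */

struct line { int valid, dirty; uint32_t tag; };

/* Returns the number of write transactions seen by the next memory level. */
static long run(const uint32_t *addrs, const char *ops, int n, int write_back)
{
    struct line cache[NUM_SETS];
    long mem_writes = 0;
    memset(cache, 0, sizeof cache);

    for (int i = 0; i < n; i++) {
        uint32_t blk = addrs[i] / BLOCK_SIZE;
        uint32_t set = blk % NUM_SETS;
        uint32_t tag = blk / NUM_SETS;
        struct line *l = &cache[set];

        int hit = l->valid && l->tag == tag;
        if (!hit) {                          /* miss: allocate on read and write  */
            if (write_back && l->valid && l->dirty)
                mem_writes++;                /* evicted dirty block written back  */
            l->valid = 1; l->tag = tag; l->dirty = 0;
        }
        if (ops[i] == 'W') {
            if (write_back) l->dirty = 1;    /* defer the memory update           */
            else            mem_writes++;    /* write through: update memory now  */
        }
    }
    return mem_writes;   /* dirty blocks still resident are not flushed (toy) */
}

int main(void)
{
    /* Repeated stores to the same two blocks: write back coalesces them. */
    uint32_t addrs[] = { 0, 4, 8, 12, 128, 132, 0, 4, 128, 132 };
    char     ops[]   = "WWWWWWWWWW";
    int n = sizeof addrs / sizeof addrs[0];
    printf("write-through memory writes: %ld\n", run(addrs, ops, n, 0));
    printf("write-back    memory writes: %ld\n", run(addrs, ops, n, 1));
    return 0;
}
```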
Write Buffer for Write Through
• A write buffer sits between the cache and memory
  - processor: writes data into the cache and the write buffer
  - memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO
  - typical number of entries: 4
  - works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle
(figure: processor -> cache, with a write buffer between the cache and DRAM)

Write Miss Policy
• Write allocate: allocate a new block on each write miss
  - fetch on write: fetch the entire block, then write the word into the block
  - no-fetch: allocate the block but don't fetch it (requires valid bits per word, complex eviction)
• Write no-allocate: write around the cache; typically used with write through
  - write invalidate (instead of update)
• Sometimes a read no-allocate is also useful, e.g. for irregular accesses on a machine with a large block size

Contents of the Cache
• A cache may hold certain types of information
• Instruction cache: read-only (mostly), very good spatial locality
  - sequential instruction stream and loops
  - but sometimes lots of subroutine calls
• Data cache: read/write, with a different spatial locality pattern
• Stack cache: captures a window of memory addresses including the top-of-stack
• Unified cache: holds everything - data + instructions

Other Design Issues: Replacement Policy
• On a cache miss we need to evict a block to make room for the new line
• In an n-way set-associative cache, we have n choices of which block to evict
• Which block gets evicted?
• Optimal cache replacement policy [Belady, IBM Systems Journal, 1966]:
  - evict the block with the longest reuse distance, i.e. the block whose next reference is farthest in the future
  - requires knowledge of the future! Can't build it, but can approximate it

Least Recently Used
• For a 2-way set-associative cache
  - a single bit per set indicates LRU/MRU
  - set/clear it on each access
• For associativity > 2, LRU is difficult/expensive in hardware
  - need to maintain a history of the access order
  - timestamps? How many bits? Must find the minimum timestamp on each eviction; expensive for HW to compare them on every access
  - sorted list / linked list to keep track of the access order? Re-sorting the blocks on every access is HW expensive

Pseudo LRU
• Tree-based: 3 bits for 4-way (O(N) bits in general)
  - value 1: the MRU block is in the left subtree
  - value 0: the MRU block is in the right subtree
• Advantages: less hardware than LRU, faster than LRU
(figure: a binary tree over ways A, B, C, D: an AB/CD bit (L0) at the root, an A/B bit (L1) over ways A and B, and a C/D bit (L2) over ways C and D)

Pseudo LRU
(figure: the same PLRU tree: AB/CD bit (L0), A/B bit (L1), C/D bit (L2))
• If L2-L1-L0 = 000 and there is a hit in Way B, what is the new updated L2-L1-L0?
• If L2-L1-L0 = 001 and a way needs to be replaced, which way would be chosen?

  LRU update algorithm (on an access):
    Way hit   L2 (C/D)   L1 (A/B)   L0 (AB/CD)
    Way A     ---        1          1
    Way B     ---        0          1
    Way C     1          ---        0
    Way D     0          ---        0

  Replacement decision (on a miss):
    L2 (C/D)   L1 (A/B)   L0 (AB/CD)   Way to replace
    X          0          0            Way A
    X          1          0            Way B
    0          X          1            Way C
    1          X          1            Way D
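A compact sketch, my own C rendering of the update and replacement tables above rather than code from the slides, of tree-based pseudo-LRU for one 4-way set; it also answers the two questions on the slide.

```c
#include <stdio.h>

/* One 4-way set; ways 0..3 stand for Ways A..D. Bit meanings follow the slide:
 * L0 = AB/CD, L1 = A/B, L2 = C/D; value 1 means the MRU side is the A/B
 * subtree (for L0), Way A (for L1), or Way C (for L2). */
struct plru { unsigned l0 : 1, l1 : 1, l2 : 1; };

/* LRU update algorithm: point the bits toward the way just accessed. */
static void plru_touch(struct plru *s, int way)
{
    switch (way) {
    case 0: s->l0 = 1; s->l1 = 1; break;   /* Way A: L1 = 1, L0 = 1, L2 unchanged */
    case 1: s->l0 = 1; s->l1 = 0; break;   /* Way B: L1 = 0, L0 = 1 */
    case 2: s->l0 = 0; s->l2 = 1; break;   /* Way C: L2 = 1, L0 = 0 */
    case 3: s->l0 = 0; s->l2 = 0; break;   /* Way D: L2 = 0, L0 = 0 */
    }
}

/* Replacement decision: walk away from the MRU side at each tree node. */
static int plru_victim(const struct plru *s)
{
    if (s->l0 == 0)                 /* MRU is on the C/D side -> evict from A/B */
        return s->l1 ? 1 : 0;       /* L1 = 1 -> Way B, L1 = 0 -> Way A */
    else                            /* MRU is on the A/B side -> evict from C/D */
        return s->l2 ? 3 : 2;       /* L2 = 1 -> Way D, L2 = 0 -> Way C */
}

int main(void)
{
    struct plru s = { 0, 0, 0 };                       /* L2-L1-L0 = 000 */
    plru_touch(&s, 1);                                 /* hit in Way B   */
    printf("after a hit on Way B: L2-L1-L0 = %u%u%u\n",
           (unsigned)s.l2, (unsigned)s.l1, (unsigned)s.l0);

    s = (struct plru){ .l0 = 1, .l1 = 0, .l2 = 0 };    /* L2-L1-L0 = 001 */
    printf("victim for L2-L1-L0 = 001: Way %c\n", 'A' + plru_victim(&s));
    return 0;
}
```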
Not Recently Used (NRU)
• A "used bit" is associated with every block
  - initially, the used bits of all blocks are reset to 0
  - whenever a block is accessed, either on a hit or a miss, its used bit is set to 1
  - if, on an access, all the other used bits in the set are 1, they are reset to 0, except the bit of the block that is accessed
  - a replacement pointer is used to look for a victim block: a block can be replaced if its used bit is 0; rotate the pointer until a block with used bit = 0 is found
• Simple
• Used by Intel Itanium, Sparc T2

Yet Another NRU
• Use R(eferenced) and M(odified) bits
  - 0: not referenced / not modified
  - 1: referenced / modified
• Classify blocks into
  - C0: R=0, M=0
  - C1: R=0, M=1
  - C2: R=1, M=0
  - C3: R=1, M=1
• Choose the victim from the lowest class (C3 > C2 > C1 > C0)
• Periodically clear the R and M bits

Traditional Cache Replacement Policies
• LRU and pseudo-LRU focus on victim selection
  - they may work fine for blocks with high temporal locality or for workloads smaller than the cache size
• However, thrashing may occur for a working set greater than the cache size
• Divide cache replacement into two parts
  - victim selection policy: which block to evict
  - insertion policy: where to insert the incoming block

LRU Insertion Policy (LIP) [Qureshi et al., ISCA 2007]
(figure: the incoming block is inserted at the LRU position of the recency stack instead of the MRU position)

LRU Insertion Policy (LIP)
(figure: the same recency stack, from MRU to LRU)
• A useless block is evicted at the next eviction
• A useful block is moved to the MRU position when it is reused

Bimodal Insertion Policy (BIP) [Qureshi et al., ISCA 2007]
• LIP may not age older blocks
• BIP is similar to LIP, except that BIP places some percentage of incoming blocks into the MRU position
• Let e = bimodal throttle parameter:
    if (rand() < e)
        insert at MRU position;   // as in traditional LRU insertion
    else
        insert at LRU position;   // as in LIP
• Promote a block to MRU if it is reused
(A toy comparison of LRU, LIP, and BIP insertion follows below.)
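To see why the insertion policy matters, here is a toy single-set simulation, my own sketch based on the BIP pseudocode above, that replays a cyclic sweep over five blocks in a 4-way set: plain LRU insertion thrashes, while LIP/BIP retain part of the working set. The epsilon value of 1/32 and the trace are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

#define WAYS 4

enum policy { POL_LRU, POL_LIP, POL_BIP };

/* Simulate one set whose recency order is kept in stack[]: stack[0] is the
 * MRU way, stack[WAYS-1] the LRU way. Returns the number of hits. */
static int hits_for(enum policy pol, double epsilon, int n_refs)
{
    int stack[WAYS], i, hits = 0;
    for (i = 0; i < WAYS; i++) stack[i] = i;   /* warm start: blocks 0-3 resident  */

    for (int r = 0; r < n_refs; r++) {
        int blk = r % 5;                       /* cyclic sweep over 5 blocks > 4 ways */
        int pos = -1;
        for (i = 0; i < WAYS; i++) if (stack[i] == blk) pos = i;

        if (pos >= 0) {                        /* hit: promote to MRU (all policies)  */
            hits++;
            for (i = pos; i > 0; i--) stack[i] = stack[i - 1];
            stack[0] = blk;
        } else {                               /* miss: evict the LRU way, then insert */
            int at_mru = (pol == POL_LRU) ||
                         (pol == POL_BIP && (double)rand() / RAND_MAX < epsilon);
            if (at_mru) {                      /* traditional LRU insertion            */
                for (i = WAYS - 1; i > 0; i--) stack[i] = stack[i - 1];
                stack[0] = blk;
            } else {                           /* LIP (and usually BIP): LRU position  */
                stack[WAYS - 1] = blk;
            }
        }
    }
    return hits;
}

int main(void)
{
    int n = 10000;
    printf("LRU hits: %d / %d\n", hits_for(POL_LRU, 0.0,      n), n);
    printf("LIP hits: %d / %d\n", hits_for(POL_LIP, 0.0,      n), n);
    printf("BIP hits: %d / %d\n", hits_for(POL_BIP, 1.0 / 32, n), n);
    return 0;
}
```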
Outline
• Memory hierarchy design (Sec. 2.1, Appendix B.1)
  - Motivation, basic concepts
• Basic cache organization and performance (Appendix B.2)
• Basic cache optimizations (Appendix B.3)

Improving Cache Performance
• Recall: average memory access time = hit time + miss rate x miss penalty
• Improve cache performance by
  - reducing the miss rate
  - reducing the miss penalty
  - reducing the time to hit in the cache

Six Basic Cache Optimizations
• Larger block size
  - exploits spatial locality; reduces compulsory misses
  - increases capacity and conflict misses, increases the miss penalty, and increases false sharing
• Larger total cache capacity
  - reduces capacity misses
  - increases hit time and power consumption
• Higher associativity
  - reduces conflict misses
  - increases hit time and power consumption

Six Basic Cache Optimizations
• More cache levels
• Giving priority to read misses over writes
• Avoiding address translation in cache indexing

Reducing Miss Penalty
• Average Memory Access Time (AMAT) = hit time + miss rate * miss penalty = Thit(L1) + Miss%(L1) * T(memory)
• Example:
  - cache hit = 1 cycle
  - miss rate = 10% = 0.1
  - miss penalty = 300 cycles
  - AMAT = 1 + 0.1 * 300 = 31 cycles
(figure: L1$ with a 1-cycle hit time backed directly by main memory (DRAM) with a 300-cycle miss penalty)

Multi-level Cache
• AMAT = Thit(L1) + Miss%(L1) * (Thit(L2) + Miss%(L2) * (Thit(L3) + Miss%(L3) * T(memory)))
• Example: a 14.7x speed-up!
  - miss rate L1 = 10%, Thit(L1) = 1 cycle
  - miss rate L2 = 5%, Thit(L2) = 10 cycles
  - miss rate L3 = 1%, Thit(L3) = 20 cycles
  - T(memory) = 300 cycles
  - AMAT = 2.115 cycles
(figure: on-die L1 (1 cyc), L2 (10 cyc), and L3 (20 cyc) caches backed by main memory (DRAM, 300 cyc))

Priority to Read Miss over Write
• For write-through caches with a write buffer: the write buffer may hold the updated value needed on a read miss
    SW 512(R0),R3   ; M[512] <- R3
    LW R1,1024(R0)  ; R1 <- M[1024]
    LW R2,512(R0)   ; R2 <- M[512]
  (direct-mapped, write-through; 512 and 1024 map to the same block)
  - if the read miss waits until the write buffer is empty, the miss penalty increases
  - better solution: check the write-buffer contents before the read; if there are no conflicts, let the memory access continue
• For write-back caches: a read miss may have to write the dirty block to memory, then do the read
  - better: move the dirty block to the write buffer, do the read, then do the write
  - the CPU stalls less since it restarts as soon as the read is done

Avoid Address Translation
• Virtually addressed cache: the cache is accessed with the virtual address, avoiding translation on a hit
(figure: a physical cache (processor -> TLB -> cache, PA) versus a virtual cache (processor -> cache, VA))
• Every time a process is switched, logically the cache must be flushed; otherwise we get false hits
  - cost: time to flush + "compulsory" misses
  - alternative: add a process identifier tag

Avoid Address Translation
• Aliases (synonyms): two different virtual addresses mapped to the same physical address
  - two different cache entries would hold data for the same physical address
  - on an update, all cache entries with the same physical address must be updated, or memory becomes inconsistent
• Solutions to aliases:
  - HW guarantees that every cache block has a unique physical address
  - SW guarantees that aliases share the same lower n bits of their addresses; as long as these bits cover the index field and the cache is direct mapped, aliases map to a unique cache location (called page coloring)
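A small sketch, my own rather than anything from the slides, of the page-coloring condition just described: a cache that is indexed before translation is alias-free only if its index and offset bits fit within the page offset; otherwise the OS must constrain the extra "color" bits (or the associativity must be raised, as the next slide notes). The example geometries are assumptions.

```c
#include <stdio.h>

static unsigned log2u(unsigned x) { unsigned n = 0; while (x > 1) { x >>= 1; n++; } return n; }

static void check(unsigned cache_bytes, unsigned assoc, unsigned block, unsigned page)
{
    unsigned set_bits    = log2u(cache_bytes / (assoc * block));
    unsigned offset_bits = log2u(block);
    unsigned page_bits   = log2u(page);
    unsigned index_span  = set_bits + offset_bits;   /* address bits used before translation */

    printf("%u KB %u-way, %u B blocks, %u KB pages: ",
           cache_bytes / 1024, assoc, block, page / 1024);
    if (index_span <= page_bits)
        printf("index fits in the page offset -> no aliasing possible\n");
    else
        printf("%u index bit(s) above the page offset -> %u page colors needed "
               "(or raise associativity)\n",
               index_span - page_bits, 1u << (index_span - page_bits));
}

int main(void)
{
    check(32 * 1024, 8, 64, 4096);   /* 32 KB, 8-way: 12 bits, fits a 4 KB page     */
    check(32 * 1024, 1, 64, 4096);   /* 32 KB, direct mapped: 15 bits, needs colors */
    return 0;
}
```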
Avoid Address Translation
• Index with the physical part of the address (the low-order bits, which are untranslated) and tag with the physical address
  - the set can be read in parallel with the TLB translation, and the physical tag is then compared
  - what if the number of sets is greater than the number of blocks per page?
    • use higher associativity
    • use page coloring
(figure: the untranslated part of the address indexes the cache ($) while the TLB translates the page number; the physical tag from the TLB is compared against the stored tag)

Recap
• The memory hierarchy exploits program locality to reduce AMAT
• Types of caches
  - direct mapped, set-associative, fully associative
• Cache policies
  - cache replacement: victim selection, insertion policy
  - write through vs. write back
  - write allocate vs. no-write allocate
• Basic cache optimization techniques
  - for reducing the miss rate, reducing the miss penalty, and improving the hit time
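To close, a quick check of the AMAT formulas used in this lecture; this is my own worked verification, reproducing the 31-cycle single-level and 2.115-cycle multi-level numbers from the "Reducing Miss Penalty" and "Multi-level Cache" slides.

```c
#include <stdio.h>

int main(void)
{
    double t_mem = 300.0;

    /* Single level: AMAT = Thit(L1) + Miss%(L1) * T(memory) */
    double amat1 = 1.0 + 0.10 * t_mem;

    /* Three levels, using the (local) miss rates from the slide. */
    double amat3 = 1.0 + 0.10 * (10.0 + 0.05 * (20.0 + 0.01 * t_mem));

    printf("single-level AMAT = %.3f cycles\n", amat1);      /* 31.000 */
    printf("three-level  AMAT = %.3f cycles\n", amat3);      /* 2.115  */
    printf("speed-up          = %.1fx\n", amat1 / amat3);    /* ~14.7  */
    return 0;
}
```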