Computer Architecture: Memory Hierarchy
Memory/Storage Architecture Lab

Technology Trends (DRAM)

  Year | Capacity | $/GB
  1980 | 64 Kbit  | $1,500,000
  1983 | 256 Kbit | $500,000
  1985 | 1 Mbit   | $200,000
  1989 | 4 Mbit   | $50,000
  1992 | 16 Mbit  | $15,000
  1996 | 64 Mbit  | $10,000
  1998 | 128 Mbit | $4,000
  2000 | 256 Mbit | $1,000
  2004 | 512 Mbit | $250
  2007 | 1 Gbit   | $50

(Chart of DRAM access times Trac and Tcac over the same years not reproduced.)

Memory Hierarchy

"Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
  - Burks, Goldstine, and von Neumann, 1946

(Figure: levels 1 through n below the CPU; speed and bandwidth increase toward the CPU, while cost per byte decreases and capacity grows away from it.)

Memory Technology (Big Picture)
- The processor (control + datapath) is backed by a chain of successively larger memories.
- The memory closest to the processor is the fastest, smallest, and highest-cost per byte; the farthest is the slowest, biggest, and lowest-cost per byte.

Memory Technology (Real-world Realization)

  Level             | Technology           | Speed    | Size   | Managed by
  Registers         | on-chip              | < 1 ns   | ~100 B | Compiler
  Cache             | SRAM (on/off-chip)   | < 5 ns   | KB-MB  | Hardware
  Main memory       | DRAM                 | 50-70 ns | MB-GB  | OS
  Secondary storage | Disk                 | 5-20 ms  | GB-TB  | OS

Memory Hierarchy
- An optimization resulting from a perfect match between memory technology and two types of program locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
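To make the two localities concrete, here is a toy Python sketch (not from the slides; the addresses and names are made up) that records the address stream of a simple summation loop:

```python
# Toy address trace for the loop "for i in range(4): total += a[i]".
# Hypothetical layout: 'total' at 0x100, array 'a' at 0x1000, 4-byte words.
TOTAL_ADDR = 0x100
A_BASE = 0x1000
WORD = 4

trace = []
for i in range(4):
    trace.append(A_BASE + i * WORD)  # load a[i]: consecutive addresses
    trace.append(TOTAL_ADDR)         # read/update total: same address

# Temporal locality: the same address (total) recurs throughout the trace.
temporal_hits = trace.count(TOTAL_ADDR)
# Spatial locality: successive a[i] addresses differ by only one word.
a_refs = [addr for addr in trace if addr >= A_BASE]
strides = [b - a for a, b in zip(a_refs, a_refs[1:])]

print(temporal_hits)  # 4
print(strides)        # [4, 4, 4]
```

Multi-word cache blocks turn the unit-stride references to a[] into hits (spatial locality), while any cache turns the repeated references to total into hits (temporal locality).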
Goal: to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.

Temporal and Spatial Localities
(Figure: memory-reference trace showing temporal and spatial clustering. Source: Glass & Cao, 1997 ACM SIGMETRICS.)

Memory Hierarchy Terminology
- Hit: the accessed data is found in the upper level.
  - Hit rate = fraction of accesses found in the upper level.
  - Hit time = time to access the upper level.
- Miss: the accessed data is found only in a lower level.
  - The processor waits until the data is fetched from the next level, then restarts/continues the access.
  - Miss rate = 1 - hit rate.
  - Miss penalty = time to get the block from the lower level + time to place it in the upper level.
- Data are transferred between levels in units of blocks.
- Hit time << miss penalty, so the average memory access time is far below the worst-case access time:
  - Average memory access time = hit time + miss rate x miss penalty

(CPU) Cache
- Upper level: SRAM (small, fast, expensive); lower level: DRAM (large, slow, cheap).
- Goal: to provide a "virtual" memory technology that has the access time of SRAM with the size and cost of DRAM.
- Additional benefits:
  - Reduces the memory bandwidth consumed by the processor, leaving more memory bandwidth for I/O.
  - No need to change the ISA.

Direct-mapped Cache
- Each memory block is mapped to a single cache block, determined by:
  - memory block address mod number of cache blocks

Direct-Mapped Cache Example
- Consider a direct-mapped cache with 4-byte blocks (1 word per block) and a total capacity of 4 KB (1024 blocks).
  - The 2 lowest address bits specify the byte within a block.
  - The next 10 address bits specify the block's index within the cache.
  - The 20 highest address bits are the unique tag for this memory block.
  - A valid bit specifies whether the block is an accurate copy of memory.
- With one-word blocks, this cache exploits only temporal locality.
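The bit-level breakdown above can be sketched in a few lines of Python (a minimal illustration; split_address is a made-up helper matching the example's 2/10/20-bit split):

```python
# Decompose a 32-bit byte address for the example direct-mapped cache:
# 2 offset bits (4-byte block), 10 index bits (1024 blocks), 20 tag bits.
def split_address(addr):
    offset = addr & 0x3            # lowest 2 bits: byte within the block
    index = (addr >> 2) & 0x3FF    # next 10 bits: cache block index
    tag = addr >> 12               # highest 20 bits: unique tag
    return tag, index, offset

print(split_address(8188))   # (1, 1023, 0)
print(split_address(16384))  # (4, 0, 0): collides with address 0's index
```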
On Cache Read
- On a cache hit, the CPU proceeds normally.
- On a cache miss (handled completely by hardware):
  - Stall the CPU pipeline.
  - Fetch the missed block from the next level of the hierarchy.
  - Instruction cache miss: restart the instruction fetch.
  - Data cache miss: complete the data access.

On Cache Write
- Write-through:
  - Always write the data into both the cache and main memory.
  - Simple, but slow, and it increases memory traffic (requires a write buffer).
- Write-back:
  - Write the data into the cache only, and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer).
  - Fast, but complex to implement, and it causes a consistency problem.

Write Allocation
- What should happen on a write miss?
- Alternatives for write-through:
  - Allocate on miss: fetch the block.
  - Write around: don't fetch the block, since programs often write a whole block before reading it (e.g., initialization).
- For write-back: usually fetch the block.

Memory Reference Sequence
- Consider the following sequence of memory references for the previous direct-mapped cache: 0, 4, 8188, 0, 16384, 0

  Index    | Valid | Tag  | Data
  0 - 1023 | 0     | XXXX | XXXX

  Cache initially empty.

After Reference 1 (address 0)
- Address = 00000000000000000000 | 0000000000 | 00  (tag | index | offset)  -> Miss

  Index    | Valid | Tag                  | Data
  0        | 1     | 00000000000000000000 | Memory bytes 0-3 (copy)
  1 - 1023 | 0     | XXXX                 | XXXX

  Cache miss; place the block at index 0.

After Reference 2 (address 4)
- Address = 00000000000000000000 | 0000000001 | 00  -> Miss

  Index    | Valid | Tag                  | Data
  0        | 1     | 00000000000000000000 | Memory bytes 0-3 (copy)
  1        | 1     | 00000000000000000000 |
Memory bytes 4-7 (copy)
  2 - 1023 | 0     | XXXX                 | XXXX

  Cache miss; place the block at index 1.

After Reference 3 (address 8188)
- Address = 00000000000000000001 | 1111111111 | 00  -> Miss

  Index    | Valid | Tag                  | Data
  0        | 1     | 00000000000000000000 | Memory bytes 0-3 (copy)
  1        | 1     | 00000000000000000000 | Memory bytes 4-7 (copy)
  2 - 1022 | 0     | XXXX                 | XXXX
  1023     | 1     | 00000000000000000001 | Memory bytes 8188-8191 (copy)

  Cache miss; place the block at index 1023.

After Reference 4 (address 0)
- Address = 00000000000000000000 | 0000000000 | 00  -> Hit
  Cache hit to the block at index 0; contents unchanged.

After Reference 5 (address 16384)
- Address = 00000000000000000100 | 0000000000 | 00  [same index as address 0!]  -> Miss

  Index    | Valid | Tag                  | Data
  0        | 1     | 00000000000000000100 | Memory bytes 16384-16387 (copy)
  1        | 1     | 00000000000000000000 | Memory bytes 4-7 (copy)
  2 - 1022 | 0     | XXXX                 | XXXX
  1023     | 1     | 00000000000000000001 | Memory bytes 8188-8191 (copy)

  Cache miss; replace the block at index 0.

After Reference 6 (address 0)
- Address = 00000000000000000000 | 0000000000 | 00  [same index!]  -> Miss
  Index    | Valid | Tag                  | Data
  0        | 1     | 00000000000000000000 | Memory bytes 0-3 (copy)
  1        | 1     | 00000000000000000000 | Memory bytes 4-7 (copy)
  2 - 1022 | 0     | XXXX                 | XXXX
  1023     | 1     | 00000000000000000001 | Memory bytes 8188-8191 (copy)

  Cache miss; replace the block at index 0 again (evicting tag 00000000000000000100).
- Total: 1 hit and 5 misses.

Exploiting Spatial Locality: Block Size Larger than One Word
- Example: a 16 KB direct-mapped cache with 256 blocks of 64 B (16 words) each.
(Figure not reproduced.)

Miss Rate vs. Block Size
(Figure not reproduced.)

Set-Associative Caches
- Allow multiple blocks per index to improve hit rates.
- An n-way set-associative cache allows up to n conflicting references to be cached:
  - n is the number of cache blocks in each set.
  - n comparisons are needed to search all blocks in the set in parallel.
  - When there is a conflict, a policy must decide which block is replaced (this was easy for direct-mapped caches: there is only one candidate!).
- A fully-associative cache is a single (very large!) set that allows a memory block to be placed in any cache block.
- Direct-mapped caches are essentially 1-way set-associative caches.
- For a fixed cache capacity, higher associativity leads to higher hit rates, because more combinations of memory blocks can be present in the cache.
- Set associativity optimizes cache contents, but at what cost?
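The 1-hit, 5-miss walkthrough above can be reproduced with a few lines of Python (a sketch of the example's direct-mapped cache with 1024 four-byte blocks; the variable names are made up):

```python
# Minimal direct-mapped cache model: 1024 blocks of 4 bytes (4 KB total),
# matching the example. Each entry stores the resident tag (None = invalid).
NUM_BLOCKS = 1024
BLOCK_SIZE = 4

cache = [None] * NUM_BLOCKS
results = []
for addr in [0, 4, 8188, 0, 16384, 0]:
    block = addr // BLOCK_SIZE
    index = block % NUM_BLOCKS          # block address mod number of blocks
    tag = block // NUM_BLOCKS
    if cache[index] == tag:
        results.append("hit")
    else:
        results.append("miss")
        cache[index] = tag              # fetch block, evicting any occupant

print(results)  # ['miss', 'miss', 'miss', 'hit', 'miss', 'miss']
```

Addresses 0 and 16384 map to the same index, so they keep evicting each other even though the rest of the cache is empty.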
Cache Organization Spectrum
(Figure not reproduced.)

Implementation of Set Associative Cache
(Figure not reproduced.)

Cache Organization Example
- An eight-block cache can be organized as:
  - One-way set associative (direct mapped): 8 sets (blocks 0-7), one tag/data entry per set.
  - Two-way set associative: 4 sets (0-3), two tag/data entries per set.
  - Four-way set associative: 2 sets (0-1), four tag/data entries per set.
  - Eight-way set associative (fully associative): 1 set with eight tag/data entries.

Cache Block Replacement Policy
- Direct-mapped caches:
  - No replacement policy is needed, since each memory block can be placed in only one cache block.
- N-way set-associative caches:
  - Each memory block can be placed in any of the n cache blocks in the mapped set.
  - Least Recently Used (LRU) replacement is typically used to choose a block to replace within the mapped set: LRU replaces the block that has not been used for the longest time.

Miss Rate vs.
Set Associativity
(Figure not reproduced.)

Memory Reference Sequence (2-way)
- Look again at the same sequence of memory references, now for a 2-way set-associative cache with a block size of two words (8 bytes): 0, 4, 8188, 0, 16384, 0
- This sequence had 5 misses and 1 hit in the direct-mapped cache of the same capacity.

  Set     | Way | Valid | Tag  | Data
  0 - 255 | 0,1 | 0     | XXXX | XXXX

  Cache initially empty.

After Reference 1 (address 0)
- Address = 000000000000000000000 | 00000000 | 000  (tag | set | offset)  -> Miss

  Set | Way | Valid | Tag                   | Data
  0   | 0   | 1     | 000000000000000000000 | Memory bytes 0-7 (copy)
  (all other blocks invalid)

  Cache miss; place in the first block of set 0.

After Reference 2 (address 4)
- Address = 000000000000000000000 | 00000000 | 100  -> Hit
  Cache hit to the first block in set 0 (address 4 falls in the same 8-byte block as address 0); contents unchanged.

After Reference 3 (address 8188)
- Address = 000000000000000000011 | 11111111 | 100  -> Miss

  Set | Way | Valid | Tag                   | Data
  0   | 0   | 1     | 000000000000000000000 | Memory bytes 0-7 (copy)
  255 | 0   | 1     | 000000000000000000011 | Memory bytes 8184-8191 (copy)
  (all other blocks invalid)

  Cache miss; place in the first block of set 255.

After Reference 4 (address 0)
- The sequence continues on the
2-way set-associative cache: 0, 4, 8188, 0, 16384, 0
- Address = 000000000000000000000 | 00000000 | 000  -> Hit
  Cache hit to the first block in set 0; contents unchanged.

After Reference 5 (address 16384)
- Address = 000000000000000001000 | 00000000 | 000  -> Miss

  Set | Way | Valid | Tag                   | Data
  0   | 0   | 1     | 000000000000000000000 | Memory bytes 0-7 (copy)
  0   | 1   | 1     | 000000000000000001000 | Memory bytes 16384-16391 (copy)
  255 | 0   | 1     | 000000000000000000011 | Memory bytes 8184-8191 (copy)
  (all other blocks invalid)

  Cache miss; place in the second block of set 0 (no replacement needed).

After Reference 6 (address 0)
- Address = 000000000000000000000 | 00000000 | 000  -> Hit
  Cache hit to the first block in set 0; contents unchanged.
- Total: 3 hits and 3 misses.

Improving Cache Performance
- Cache performance is determined by:
  - Average memory access time = hit time + (miss rate x miss penalty)
- Decrease hit time:
  - Make the cache smaller, but the miss rate increases.
  - Use direct mapping, but the miss rate increases.
- Decrease miss rate:
  - Make the cache larger, but this can increase hit time.
  - Add associativity, but this can increase hit time.
  - Increase the block size, but this increases the miss penalty.
- Decrease miss penalty:
  - Reduce the transfer-time component of the miss penalty.
  - Add another level of cache.
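The payoff of adding another cache level can be seen by applying the average-memory-access-time formula twice; the numbers below are assumed for illustration only (they are not from the slides):

```python
# Two-level AMAT sketch with illustrative (assumed) parameters:
# L1: 1-cycle hit, 5% miss rate; L2: 10-cycle hit, 20% local miss rate;
# main memory: 100-cycle access.
l1_hit, l1_miss_rate = 1, 0.05
l2_hit, l2_miss_rate = 10, 0.20
mem_penalty = 100

# AMAT = hit time + miss rate x miss penalty, applied recursively:
# the L2 AMAT serves as the L1 miss penalty.
l2_amat = l2_hit + l2_miss_rate * mem_penalty   # 30.0 cycles
amat = l1_hit + l1_miss_rate * l2_amat          # 2.5 cycles

print(l2_amat, amat)
```

Without the L2, the same L1 would see the full 100-cycle penalty on every miss (AMAT = 1 + 0.05 x 100 = 6 cycles), so the extra level cuts the average access time substantially under these assumptions.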
Current Cache Organizations
- Intel Nehalem:
  - L1 I-cache (per core): 32 KB, 64-byte blocks, 4-way, approx. LRU replacement, hit time n/a.
  - L1 D-cache (per core): 32 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a.
  - L2 unified cache (per core): 256 KB, 64-byte blocks, 8-way, approx. LRU replacement, write-back/allocate, hit time n/a.
  - L3 unified cache (shared): 8 MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a.
- AMD Opteron X4:
  - L1 I-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles.
  - L1 D-cache (per core): 32 KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles.
  - L2 unified cache (per core): 512 KB, 64-byte blocks, 16-way, approx. LRU replacement, write-back/allocate, hit time n/a.
  - L3 unified cache (shared): 2 MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles.
- n/a: data not available.

Cache Coherence Problem
- Suppose two CPU cores share a physical address space, with write-through caches:

  Time step | Event               | CPU A's cache | CPU B's cache | Memory
  0         |                     |               |               | 0
  1         | CPU A reads X       | 0             |               | 0
  2         | CPU B reads X       | 0             | 0             | 0
  3         | CPU A writes 1 to X | 1             | 0             | 1

  After step 3, CPU B's cache still holds the stale value 0.

Snoopy Protocols
- Write-invalidate protocol: on a write to shared data, an invalidate is sent on the bus; all caches snoop and invalidate any copies.
- Write-broadcast protocol: on a write to shared data, the new value is broadcast on the bus; processors snoop and update their copies.
- Write serialization: the bus serializes requests; the bus is the single point of arbitration.

Write-invalidate Protocol
- A cache gets exclusive access to a block when the block is to be written:
  - It broadcasts an invalidate message on the bus.
  - A subsequent read in another cache misses, and the owning cache supplies the updated value.

  CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory
                      |                  |               |               | 0
  CPU A reads X       | Cache miss for X | 0             |               | 0
  CPU B reads X       | Cache miss for X | 0             | 0             | 0
  CPU A writes 1 to X | Invalidate for X | 1             |               | 0
  CPU B reads X       | Cache miss for X | 1             | 1             | 1
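The write-invalidate table above can be mimicked with a toy Python model (a sketch only: one shared location, two one-entry caches, no real bus; the function names are made up):

```python
# Toy write-invalidate protocol for one location "X" shared by two caches.
memory = {"X": 0}
caches = {"A": {}, "B": {}}

def read(cpu, var):
    cache = caches[cpu]
    if var not in cache:                      # cache miss -> bus request
        # If another cache holds the up-to-date copy, it supplies the
        # value and memory is updated; otherwise memory supplies it.
        for other, oc in caches.items():
            if other != cpu and var in oc:
                memory[var] = oc[var]
        cache[var] = memory[var]
    return cache[var]

def write(cpu, var, value):
    # Broadcast an invalidate: every other cache drops its copy.
    for other, oc in caches.items():
        if other != cpu:
            oc.pop(var, None)
    caches[cpu][var] = value                  # write-back: memory stays stale

read("A", "X")          # A misses, caches 0
read("B", "X")          # B misses, caches 0
write("A", "X", 1)      # A writes 1; B's copy is invalidated
print(read("B", "X"))   # B misses again, A supplies 1 -> prints 1
print(memory["X"])      # memory updated to 1 by the supplying read
```

Unlike the write-through scenario in the coherence-problem table, B never sees the stale 0 here, because the invalidate forced its next read back onto the bus.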
Summary
- Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality:
  - Temporal locality
  - Spatial locality
- The goal is to provide a "virtual" memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory.
- Cache memory is an instance of a memory hierarchy:
  - It exploits both temporal and spatial locality.
  - Direct-mapped caches are simple and fast but have higher miss rates.
  - Set-associative caches have lower miss rates but are more complex and slower.
  - Multilevel caches are becoming increasingly popular.
  - Cache coherence protocols ensure consistency among multiple caches.
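As a closing check on the direct-mapped vs. set-associative comparison in the summary, a small LRU simulator (a sketch, parameterized like the 2-way example: 256 sets, 2 ways, 8-byte blocks) reproduces the 3-hit result for the same reference sequence:

```python
from collections import OrderedDict

# Sketch of a 2-way set-associative cache with LRU replacement:
# 256 sets x 2 ways x 8-byte blocks (4 KB), as in the example slides.
NUM_SETS, WAYS, BLOCK_SIZE = 256, 2, 8

sets = [OrderedDict() for _ in range(NUM_SETS)]  # tag -> None, in LRU order
results = []
for addr in [0, 4, 8188, 0, 16384, 0]:
    block = addr // BLOCK_SIZE
    index = block % NUM_SETS
    tag = block // NUM_SETS
    s = sets[index]
    if tag in s:
        results.append("hit")
        s.move_to_end(tag)          # mark as most recently used
    else:
        results.append("miss")
        if len(s) == WAYS:
            s.popitem(last=False)   # evict the least recently used block
        s[tag] = None

print(results)  # ['miss', 'hit', 'miss', 'hit', 'miss', 'hit']
```

The wider blocks turn the access to address 4 into a hit (spatial locality), and the second way lets addresses 0 and 16384 coexist in set 0, so the same sequence that scored 1 hit in the direct-mapped cache scores 3 hits here.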