CMPE 421 Parallel Computer Architecture
PART 4: Caching with Associativity

Fully Associative Cache: Reducing Cache Misses by More Flexible Placement of Blocks
- Instead of direct mapping, we allow any memory block to be placed in any cache slot.
- Many different memory addresses map to each index.
- Any available entry can be used to store a memory element.
- Remember: direct mapped caches are more rigid; a block goes exactly where its index says, even if the rest of the cache is empty.
- In a fully associative cache, nothing gets "thrown out" until the cache is completely full.
- It is harder to check for a hit (hit time will increase).
- Requires much more hardware (a comparator for each cache slot).
- Each tag is a complete block address (no index bits are used).

Fully Associative Cache
- Must compare the tags of all entries in parallel to find the desired block (if there is a hit), whereas a direct mapped cache only needs to look in one place (see the lookup sketch below).
- No conflict misses, only capacity misses.
- Practical only for caches with a small number of blocks, since searching all entries increases the hardware cost.

Fully Associative Cache
[Figure: fully associative cache organization.]

Direct Mapped vs. Fully Associative
[Figure: a direct mapped cache indexed 0-15 (V, Tag, Data per entry) next to a fully associative cache with no index. Direct mapped: each address has only one possible location; Address = Tag | Index | Block offset. Fully associative: Address = Tag | Block offset.]

Trade-off
- Fully associative is much more flexible, so the miss rate will be lower.
- Direct mapped requires less hardware (cheaper) and will also be faster: a trade-off of miss rate vs. hit time.
- Therefore we can look for a compromise between a direct mapped cache and a fully associative cache.
- We can provide more flexibility without going to a fully associative placement policy: for each memory location, provide a small number of cache slots that can hold the memory element.
- This is much more flexible than direct mapped, but requires less hardware than fully associative.
- SOLUTION: the set associative cache.
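To make the hardware-cost point concrete, here is a minimal C sketch contrasting the two lookups: a direct mapped cache checks exactly one tag, while a fully associative cache must check every tag. It is an illustration only; the names, the cache size, and the one-word blocks are assumptions, not from the slides. Hardware performs the fully associative comparisons in parallel with one comparator per slot; software can only loop, which is why the cost grows with the number of slots.

    /* Illustrative sketch, assuming one-word blocks and a tiny cache. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SLOTS 8            /* assumed number of cache slots */

    typedef struct {
        bool     valid;
        uint32_t tag;              /* fully associative: tag = whole block address */
    } Slot;

    /* Direct mapped: the block address selects exactly one slot. */
    bool dm_hit(Slot cache[NUM_SLOTS], uint32_t block_addr)
    {
        uint32_t index = block_addr % NUM_SLOTS;   /* index field            */
        uint32_t tag   = block_addr / NUM_SLOTS;   /* bits above the index   */
        return cache[index].valid && cache[index].tag == tag;
    }

    /* Fully associative: any slot may hold the block, so check them all. */
    bool fa_hit(Slot cache[NUM_SLOTS], uint32_t block_addr)
    {
        for (int i = 0; i < NUM_SLOTS; i++)        /* one comparator per slot */
            if (cache[i].valid && cache[i].tag == block_addr)
                return true;
        return false;
    }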
Set Associative Cache
- A fixed number of locations where each block can be placed.
- N-way set associative means there are N places (slots) where each block can be placed.
- Divide the cache into a number of sets; each set holds N "ways" (N-way set associative).
- Therefore a memory block maps to a unique set (specified by the index field) and can be placed in any way of that set, so there are N choices for where a memory block can go.
- In a set associative cache: set = (Block address) modulo (Number of sets in the cache).
- Remember that in a direct mapped cache the position of a memory block is given by (Block address) modulo (Number of cache blocks).

A Compromise
[Figure: a 2-way and a 4-way set associative cache (V, Tag, Data per way). 2-way: each address has two possible locations with the same index; one fewer index bit, so half the indexes. 4-way: each address has four possible locations with the same index; two fewer index bits, so a quarter of the indexes. In both cases, Address = Tag | Index | Block offset.]

Range of Set Associative Caches
- Address fields: Tag | Index | Block offset | Byte offset.
- The index is the set number and determines which set the block can be placed in; the tag is used for the compare; the block offset selects the word in the block.
- Decreasing associativity moves toward direct mapped (only one way): more sets, smaller tags.
- Increasing associativity moves toward fully associative (only one set): the tag is all the bits except the block and byte offsets.

Range of Set Associative Caches
- For a fixed-size cache, each increase by a factor of two in associativity:
  - doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets,
  - decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.

Set Associative Cache (example)
[Figure: a two-set, 2-way set associative cache of one-word blocks next to a 16-word main memory (addresses 0000xx through 1111xx). The two low-order bits define the byte in the 32-bit word.]
- Q1: How do we find it? Use the next low-order memory address bit to determine which cache set: (block address) modulo (# of sets in the cache).
- Q2: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell whether the memory block is in the cache.
- The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.

Set Associative Cache Organization
FIGURE 7.17: The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the element that matches to drive the data outputs.

Set Associative Cache Organization
- This is called a 4-way set associative cache because there are four cache entries for each cache index. Essentially, you have four direct mapped caches working in parallel.
- This is how it works: the cache index selects a set from the cache. The four tags in the set are compared in parallel with the upper bits of the memory address.
- If no tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we select the data from the way where the tag match occurred (a lookup sketch follows below).
- This is simple enough. What are its disadvantages?
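The following C sketch walks through the lookup just described: the block address is split into a set index and a tag, the index selects a set, and the tags of all N ways in that set are compared (in hardware, in parallel with N comparators, with the matching way driving the data multiplexor). The parameter values and names are assumptions for illustration; one-word blocks and power-of-two sizes are assumed.

    /* Illustrative N-way set associative lookup (assumed sizes). */
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_WAYS 4                 /* 4-way set associative              */
    #define NUM_SETS 2                 /* total blocks = NUM_WAYS * NUM_SETS */

    typedef struct {
        bool     valid;
        uint32_t tag;
    } Way;

    typedef struct {
        Way way[NUM_WAYS];
    } Set;

    bool sa_lookup(Set cache[NUM_SETS], uint32_t block_addr, int *hit_way)
    {
        uint32_t set_index = block_addr % NUM_SETS;   /* which set            */
        uint32_t tag       = block_addr / NUM_SETS;   /* bits above the index */
        Set *s = &cache[set_index];

        for (int w = 0; w < NUM_WAYS; w++) {          /* N tag comparisons    */
            if (s->way[w].valid && s->way[w].tag == tag) {
                *hit_way = w;                         /* selects the data mux */
                return true;                          /* hit                  */
            }
        }
        return false;                                 /* miss                 */
    }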
N-way Set Associative Cache versus Direct Mapped Cache
- An N-way set associative cache will also be slower than a direct mapped cache because of:
  - N comparators vs. 1,
  - extra MUX delay for the data,
  - the data arriving AFTER the hit/miss decision and set selection.
- In a direct mapped cache, the cache block is available BEFORE the hit/miss decision: it is possible to assume a hit and continue, and recover later if it was a miss.

Remember the Example for Direct Mapping (ping-pong effect)
- Consider the main memory word reference string 0 4 0 4 0 4 0 4.
- Start with an empty cache; all blocks are initially marked not valid.
- Words 0 and 4 map to the same cache block (same index, different tags), so each access evicts the other: miss (load 0), miss (replace with 4), miss (replace with 0), and so on.
- Result: 8 requests, 8 misses.
- This ping-pong effect is due to conflict misses: two memory locations map into the same cache block.

Solution: Use a Set Associative Cache
- Same reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks marked not valid).
- In a 2-way set associative cache, words 0 (tag 000) and 4 (tag 010) map to the same set but occupy different ways, so after the two initial misses every access hits.
- Result: 8 requests, 2 misses.
- This solves the ping-pong effect of the direct mapped cache, since two memory locations that map into the same cache set can now co-exist.

Set Associative Example
- 10-bit addresses with fields Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits).
- For the same total capacity, the direct mapped cache uses a 3-bit index and 3-bit tag, the 2-way cache a 2-bit index and 4-bit tag, and the 4-way cache a 1-bit index and 5-bit tag.
- Reference string and outcomes for the three organizations:

  Address       Direct mapped   2-way set assoc.   4-way set assoc.
  0100111000    Miss            Miss               Miss
  1100110100    Miss            Miss               Miss
  0100111100    Miss            Hit                Hit
  0110110000    Miss            Miss               Miss
  1100111000    Miss            Miss               Hit

- Direct mapped: 5 misses; 2-way: 4 misses; 4-way: 3 misses.

New Performance Numbers
- Miss rates for the DEC 3100 (a MIPS machine) with separate 64KB instruction/data caches:

  Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined miss rate
  gcc         Direct          2.0%                    1.7%             1.9%
  gcc         2-way           1.6%                    1.4%             1.5%
  gcc         4-way           1.6%                    1.4%             1.5%
  spice       Direct          0.3%                    0.6%             0.4%
  spice       2-way           0.3%                    0.6%             0.4%
  spice       4-way           0.3%                    0.6%             0.4%

Benefits of Set Associative Caches
- The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.
- [Figure: miss rate (0-12%) vs. associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB. Data from Hennessy & Patterson, Computer Architecture, 2003.]
- The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).

Benefits of Set Associative Caches
- As the cache size grows, the relative improvement from associativity increases only slightly.
- Since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases, and the absolute improvement in miss rate from associativity shrinks significantly.

Cache Block Replacement Policy
- Decides which block to replace when a new entry is brought in.
- Random replacement: the hardware randomly selects a cache entry and throws it out.
- First In First Out (FIFO): equally fair (and equally unfair) to all frames.
- Least Recently Used (LRU): use the idea of temporal locality to replace the entry that has not been accessed recently.
  - Additional bit(s) are required in each cache entry to track access order; they must be updated on every access and scanned on every replacement.
  - For a two-way set associative cache, one bit per set suffices for LRU replacement (see the sketch below).
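As a small illustration of the one-bit LRU just mentioned, the sketch below keeps a single lru field per 2-way set: after every access the other way becomes the least recently used, and on a miss the way the bit points to is the victim. The structure and function names are my own; this is a software model, not the hardware implementation.

    /* Illustrative true-LRU bookkeeping for one 2-way set (one LRU bit per set). */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid[2];
        uint32_t tag[2];
        int      lru;      /* index (0 or 1) of the least recently used way */
    } TwoWaySet;

    /* Returns the way holding 'tag' after the access, replacing the LRU way on a miss. */
    int access_set(TwoWaySet *s, uint32_t tag, bool *hit)
    {
        int way;
        if (s->valid[0] && s->tag[0] == tag)      { way = 0; *hit = true; }
        else if (s->valid[1] && s->tag[1] == tag) { way = 1; *hit = true; }
        else {                                      /* miss: evict the LRU way */
            way = s->lru;
            s->valid[way] = true;
            s->tag[way]   = tag;
            *hit = false;
        }
        s->lru = 1 - way;   /* the other way is now the least recently used */
        return way;
    }

Run on the ping-pong reference string 0 4 0 4 0 4 0 4, the two blocks settle into the two ways after the initial misses and every later access hits, matching the 8-requests/2-misses result above.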
- A common approach is to use a pseudo-LRU strategy.
- Example of a simple "pseudo" least recently used implementation, assuming 64 fully associative entries:
  - A hardware replacement pointer points to one cache entry.
  - Whenever an access is made to the entry the pointer points to, move the pointer to the next entry.
  - Otherwise, do not move the pointer.

Sources of Cache Misses
                    Direct mapped   N-way set associative   Fully associative
  Cache size        Big             Medium                  Small
  Compulsory miss   Same            Same                    Same
  Conflict miss     High            Medium                  Zero
  Capacity miss     Low(er)         Medium                  High

Designing a Cache
  Design change            Effect on miss rate               Negative performance effect
  Increase size            Decreases capacity misses         May increase access time
  Increase associativity   Decreases conflict misses         May increase access time
  Increase block size      May decrease compulsory misses    May increase miss penalty;
                                                             may increase capacity misses
- Note: if you are running billions of instructions, compulsory misses are insignificant.

Key Cache Design Parameters
                               L1 typical     L2 typical
  Total size (blocks)          250 to 2000    4,000 to 250,000
  Total size (KB)              16 to 64       500 to 8,000
  Block size (B)               32 to 64       32 to 128
  Miss penalty (clocks)        10 to 25       100 to 1,000
  Miss rate (global for L2)    2% to 5%       0.1% to 2%

Two Machines' Cache Parameters
                     Intel P4                                 AMD Opteron
  L1 organization    Split I$ and D$                          Split I$ and D$
  L1 cache size      8KB for D$, 96KB for trace cache (~I$)   64KB for each of I$ and D$
  L1 block size      64 bytes                                 64 bytes
  L1 associativity   4-way set assoc.                         2-way set assoc.
  L1 replacement     ~LRU                                     LRU
  L1 write policy    write-through                            write-back
  L2 organization    Unified                                  Unified
  L2 cache size      512KB                                    1024KB (1MB)
  L2 block size      128 bytes                                64 bytes
  L2 associativity   8-way set assoc.                         16-way set assoc.
  L2 replacement     ~LRU                                     ~LRU
  L2 write policy    write-back                               write-back

Where Can a Block Be Placed/Found?
                      # of sets                                Blocks per set
  Direct mapped       # of blocks in cache                     1
  Set associative     (# of blocks in cache) / associativity   Associativity (typically 2 to 16)
  Fully associative   1                                        # of blocks in cache

                      Location method                          # of comparisons
  Direct mapped       Index                                    1
  Set associative     Index the set; compare the set's tags    Degree of associativity
  Fully associative   Compare all blocks' tags                 # of blocks

Multilevel Caches
- A two-level cache structure allows the primary cache (L1) to focus on reducing hit time, to yield a shorter clock cycle.
- The second-level cache (L2) focuses on reducing the penalty of long memory access times.
- Compared to the cache of a single-cache machine, L1 on a multilevel-cache machine is usually smaller, has a smaller block size, and has a higher miss rate.
- Compared to the cache of a single-cache machine, L2 is often larger, with a larger block size. The access time of L2 is less critical than that of the cache of a single-cache machine (see the worked example below).
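To see why this division of labor pays off, a rough average-memory-access-time calculation is sketched below. The slides do not give this formula or these exact numbers; the values are assumptions picked from the "typical" ranges in the Key Cache Design Parameters table (a fast L1 with a few percent miss rate, an L2 reached in about 10 clocks, and memory about 100 clocks away), and the L2 miss rate is treated as a global rate, as in that table.

    /* Back-of-the-envelope AMAT estimate for a two-level cache (assumed values). */
    #include <stdio.h>

    int main(void)
    {
        double l1_hit_time  = 1.0;    /* clocks, assumed                            */
        double l1_miss_rate = 0.05;   /* 5%, from the L1 "typical" miss-rate range   */
        double l2_hit_time  = 10.0;   /* clocks, from the L1 miss-penalty range      */
        double l2_miss_rate = 0.01;   /* 1% global, from the L2 "typical" range      */
        double mem_penalty  = 100.0;  /* clocks, from the L2 miss-penalty range      */

        /* Every access pays the L1 hit time; L1 misses go to L2;
         * (global) L2 misses go all the way to main memory.       */
        double amat = l1_hit_time
                    + l1_miss_rate * l2_hit_time
                    + l2_miss_rate * mem_penalty;

        printf("AMAT = %.2f clocks\n", amat);   /* 1 + 0.05*10 + 0.01*100 = 2.5 */
        return 0;
    }

With these assumed numbers, the L2 turns what would be a 100-clock penalty on every L1 miss into an average access time of about 2.5 clocks, which is the point of letting L1 stay small and fast while L2 absorbs most of the memory penalty.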