CS 1104 Help Session I: Caches
Colin Tan, ctank@comp.nus.edu.sg
S15-04-15

Topics for Today
• Session I - Caches, Session II - Virtual Memory
• Basically we will pick off important topics and elaborate on them. There will also be simple problems to solve
– Please have your calculator, paper and pen ready!
• Help session notes are available at: http://www.comp.nus.edu.sg/~ctank

Memory Hierarchy
• Motivation
– Not all memory is created equal
• Cheap memory => slow
• Fast memory => expensive
– DRAM: 70 ns access time, $5/MByte
– SRAM: 8 ns access time, $100/MByte
• The purpose of the memory hierarchy is that it allows us to have huge amounts of cheap memory that operates close to the speed of expensive but fast memory.

How Do We Do That?
• We can achieve this by using locality.
– Locality is the behavior of programs such that:
• Data or instructions that were recently accessed are likely to be accessed again in the near future (Temporal Locality)
– The variable x in this code fragment has temporal locality:
for (int i = 0; i < 100; i++) y = y + x;
• The neighbors of data or instructions that were recently accessed are likely to be accessed as well (Spatial Locality)
– Program execution tends to be sequential, so if an instruction was just executed, it is likely that the instruction next to it will also be executed.

Caches
• Locality means that only small portions of main memory will be used in the near future.
• We can create a small memory system with fast but expensive devices to store these portions of main memory.
• We can then access these portions of memory very quickly.
• This is the concept behind caches.

How Do Caches Help?
• The average memory access time (AMAT) is given by:
AMAT = Tcache + miss_rate x Tmemory
– Tcache = time to read the cache (8 ns for an SRAM cache)
– Tmemory = time to read main memory (70 ns for DRAM)
– miss_rate = probability of not finding what we want in the cache
• Because of locality, miss_rate is very small
– Typically about 3% to 5%.
• On average, the time to access memory is very close to that of the cache (8 ns) rather than that of main memory (70 ns).

How Do Caches Help?
• Yet at the same time, we have the benefit of being able to have large amounts of memory
– This is because most of our memory is cheap DRAM!
• The net effect is that caches allow us to have huge amounts of cheap, slow memory, yet have the access times of fast, expensive memory.

Basics of Addressing
• Memory Addresses
– Memory is a huge array of data.
– Just like with arrays, we must specify the index of the piece of data that we want, so that we can get it out of the array.
– In memory systems, this index is called an "Address".

Where Do Addresses Come From?
• For instruction fetches, the address of the instruction (i.e. the location in memory where the instruction resides) comes from the Program Counter.
• For data accesses, the address comes from the ALU stage of the pipeline whenever we do a lw or sw operation.
• In the MIPS architecture, addresses are 32-bit numbers.

Cache Architecture
• Caches consist of blocks (or lines). Each block stores data from memory.
• The block allocation problem:
– Given data from an address A, how do we decide which block of cache its data should go to?

The Block Allocation Problem
• 3 possible solutions:
– Data from each address A goes to a fixed block.
• Direct mapped cache
– Data from each address A can go to any block.
• Fully associative cache
– Data from address A goes to a fixed set of blocks.
• Data may be put into any block within that set.
• Set associative cache
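Before diving into cache organization, it may help to plug numbers into the AMAT formula from the "How Do Caches Help?" slide above. This short C sketch uses the SRAM/DRAM timings quoted in these notes; the function name amat is purely illustrative.

    #include <stdio.h>

    /* Average Memory Access Time, as defined in the slides:
     * AMAT = Tcache + miss_rate * Tmemory                    */
    static double amat(double t_cache_ns, double miss_rate, double t_memory_ns) {
        return t_cache_ns + miss_rate * t_memory_ns;
    }

    int main(void) {
        /* Numbers from the slides: 8 ns SRAM cache, 70 ns DRAM,
         * miss rates typically around 3% to 5%.                */
        printf("AMAT at 3%% misses: %.1f ns\n", amat(8.0, 0.03, 70.0)); /* 10.1 */
        printf("AMAT at 5%% misses: %.1f ns\n", amat(8.0, 0.05, 70.0)); /* 11.5 */
        return 0;
    }

Even at a 5% miss rate, the average access time (11.5 ns) is much closer to the 8 ns cache than to the 70 ns DRAM, which is the whole point of the hierarchy.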
Direct Mapped Caches
• The value of a portion of the memory address is used to decide which block to send the data to:
Address A: | Tag | Block Index | Block Offset | Byte Offset |
• The Block Index portion is used to decide which block data from this address should go to.

Example
• The number of bits in the block index is log2(N), where N is the total number of blocks.
• For a 4-block cache, the block index portion of the address will be 2 bits, and these 2 bits can take on the values 00, 01, 10 or 11.
• The exact value of these 2 bits determines which block the data for that address will go to.

Example
• The value of the two block index bits determines which block the data goes to, following the scheme shown below:
(Diagram: a 4-block cache, with the blocks indexed 00, 01, 10 and 11.)

Solving Direct-Mapped Cache Problems
• Question 7.7
• Basic formula: Blk_Addr = floor(word_address / words_per_block) mod N
– N here is the total number of blocks in the cache.
– This is the mathematical version of taking the value of the Block Index bits from the address.

A Complication: Multiple-Word Blocks
• Single-word blocks do not support spatial locality
– Spatial locality: the likelihood of accessing the neighbors of a piece of data that was just accessed is high.
– But with single-word blocks, none of the neighbors are in the cache!
• All accesses to neighbors that were not accessed before will miss!

An Example
• Question 7.8

Accessing Individual Words
• In our example, each block has 4 words.
• But we always access memory 1 word at a time! (e.g. lw)
• Use the Block Offset to specify which of the 4 words in a block we want to read:
Address A: | Tag | Block Index | Block Offset | Byte Offset |

The Block Offset
• Number of block offset bits = log2(M), where M is the number of words per block.
• For our example, M = 4, so the number of block offset bits is 2.
• These two bits can take on the values 00, 01, 10 and 11.
• Note that for single-word blocks, the number of block offset bits is log2(1), which is 0. I.e. there are no block offset bits for single-word blocks.
• These values determine exactly which word within a block address A refers to.

The Block Offset
(Diagram: a 4-block cache with 4 words per block; each block, indexed 00 to 11, holds Word 00, Word 01, Word 10 and Word 11.)
• The value of the 2 block offset bits (see previous slide) determines whether our address A refers to Word 00, Word 01, Word 10 or Word 11.

Who Am I? Purpose of the Tag
• Many different addresses may map to the same block, e.g. (Block Index portions shown in the middle):
01000 00010010 00000000 00
01010 00010010 00000000 00
11011 00010010 00000000 00
• All 3 addresses are different, but all map to the block with index 00010010.

Disambiguation
• We need a way to disambiguate the situation
– Otherwise how do we know that the data in block x actually comes from address A, and not from another address A' that has the same block index bit value?
• The portion of address A to the left of the Block Index can be used for disambiguation.
• This portion is called the tag, and the tag for address A is stored in the cache together with address A's data.

The Tag
(Diagram: each of the 4 blocks now stores a Tag field alongside Word 00, Word 01, Word 10 and Word 11.)
• When we access the cache, the Tag and Block Index portions of address A are extracted.
• The Block Index portion tells the cache controller which block of cache to look at.
• The Tag portion is compared against the tag stored in the block. If the tags match, we have a cache hit, and the data is read from the cache.
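To tie the Tag, Block Index and offset fields together, here is a small C sketch that decomposes an address for the example cache above (4 blocks, 4 words per block, 4 bytes per word). The sample address and variable names are made up for illustration; the shift amounts simply follow the field widths worked out in these slides.

    #include <stdio.h>

    int main(void) {
        unsigned addr = 0x4C;                   /* example byte address  */

        unsigned byte_off  =  addr       & 0x3; /* bits 1..0             */
        unsigned block_off = (addr >> 2) & 0x3; /* bits 3..2             */
        unsigned index     = (addr >> 4) & 0x3; /* bits 5..4             */
        unsigned tag       =  addr >> 6;        /* remaining 26 bits     */

        printf("tag=%u index=%u block_off=%u byte_off=%u\n",
               tag, index, block_off, byte_off);

        /* The same block index via the slides' formula:
         * Blk_Addr = floor(word_address / words_per_block) mod N        */
        unsigned word_address = addr / 4;       /* byte -> word address  */
        printf("Blk_Addr = %u\n", (word_address / 4) % 4);
        return 0;
    }

Both computations print block index 0 for this address: extracting the Block Index bits and applying the floor/mod formula are two views of the same operation.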
Accessing Individual Bytes
Address A: | Tag | Block Index | Block Offset | Byte Offset |
• MIPS addresses are byte addresses: they actually index individual bytes rather than words.
• Each MIPS word consists of 4 bytes.
• The byte offset tells us exactly which byte within a word we are referring to.

Advantages & Disadvantages of Direct Mapped Caches
• Advantages:
– Simple to implement
– Fast performance
• Less time to detect a cache hit => less time to get data from the cache => faster performance
• Disadvantages:
– Poor temporal locality
• Many addresses may map to the same block.
• The next time address A is accessed, its data may have been replaced by the contents of address A'.

Improving Temporal Locality: The Fully Associative Cache
• In the fully associative cache, data from an address A can go to any block in the cache.
– In practice, data goes into the first available cache block.
– When the cache is full, a replacement policy is invoked to choose which block of cache to throw out.

Example
• Question 7.22

Searching the Cache
• In the fully associative cache, an address A is split into the following parts:
Address A: | Tag | Block Offset | Byte Offset |
• The cache controller searches the entire cache to see if it can find a block with the same tag value as the tag portion of A.
• If it can find such a block, we have a cache hit, and the controller reads the data from the cache.

Advantages and Disadvantages: Fully Associative Cache
• Advantages:
– Good temporal locality properties
• Flexible block placement allows smart replacement policies, such that blocks that are likely to be referenced again are not replaced, e.g. LRU, LFU.
• Disadvantages:
– Complex and too expensive for large caches
• Each block needs a comparator to check the tag.
• With 8192 blocks, we need 8192 comparators!

A Compromise: Set Associative Caches
• Represents a compromise between direct-mapped and fully associative caches.
• The cache is divided into sets of blocks.
• An address A is mapped directly to a set, using a scheme similar to that of direct mapped caches.
• Once the set has been determined, the data from A may be stored in any block within the set - fully associative within a set!

Set Associative Cache
• An n-way set associative cache has n blocks per set.
• For example, for a 16-block cache that is implemented as a 2-way set associative cache, each set has 2 blocks, and we have a total of 8 sets.

Set Associative Cache
(Diagram: 8 sets, Set 000 through Set 111, each containing Block 0 and Block 1.)
• An address A is divided into:
Address A: | Tag | Set Index | Block Offset | Byte Offset |

Accessing a Set Associative Cache
Address A: | Tag | Set Index | Block Offset | Byte Offset |
• The Set Index portion of address A is extracted. This is used to index the sets (i.e. if the Set Index portion is 010, then this address is mapped to Set 010).
• The Tag portion of A is extracted and compared against the tags stored in Block 0 and Block 1 of Set 010.

Accessing a Set Associative Cache
• If a match is made either in Block 0 or Block 1 of Set 010, then we have a cache hit, and the data for A is read from the cache block.
• If we have a miss, then the data for A is fetched from main memory and placed in the first available block in Set 010.
• If no blocks are available, a replacement policy is invoked to choose a block to replace.

Example
• Question 7.20
• Basic formula: Set_Addr = floor(word_address / words_per_block) mod N
– Here N is the number of sets, NOT the number of blocks!
– This is the mathematical version of taking the value of the Set Index bits from the address.
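Since N being the number of sets (not blocks) is a common slip, here is a minimal C sketch of the mapping, assuming the 16-block, 2-way (i.e. 8-set) example cache above with 1-word blocks for simplicity; the function name set_index is illustrative.

    #include <stdio.h>

    /* Set = floor(word_address / words_per_block) mod N,
     * where N is the number of SETS, not blocks.          */
    static unsigned set_index(unsigned word_address,
                              unsigned words_per_block,
                              unsigned num_sets) {
        return (word_address / words_per_block) % num_sets;
    }

    int main(void) {
        unsigned addrs[] = { 0, 8, 16, 24 };
        for (int i = 0; i < 4; i++)
            printf("word address %2u -> set %u\n",
                   addrs[i], set_index(addrs[i], 1, 8));
        return 0;
    }

All four addresses land in set 0; with 2 blocks per set, two of them can be resident at once before the replacement policy has to evict anything.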
Multi-Block Sets vs. Multi-Word Blocks?
• Confusion often arises between multi-block sets (i.e. n-way set associativity) and multi-word blocks.
• Each block in a set can itself have multiple words, like the blocks in Question 7.8. Each block also has its own tag.

Advantages and Disadvantages: Set Associative Cache
• Advantages:
– Almost as simple to build as a direct-mapped cache.
– Only n comparators are needed for an n-way set associative cache. For a 2-way set-associative cache, only 2 comparators are needed to compare tags.
– Supports temporal locality by having full associativity within a set.

Advantages and Disadvantages: Set Associative Cache
• Disadvantages:
– Not as good as a fully-associative cache at supporting temporal locality.
– For LRU schemes, because of the small associativity, it is actually possible to have a 0% hit rate for temporally local data.
– E.g. if our accesses are A1 A2 A3 A1 A2 A3, and A1, A2 and A3 all map to the same 2-way set, then the hit rate is 0%, as subsequent accesses replace previous accesses under the LRU scheme.

Writing to Cache
• Remember that data in the cache is merely a copy of data in main memory.
• When data stored in a cache block is modified (e.g. when doing a sw to address A), the copy in the cache becomes inconsistent with the copy in memory.
• We need a way to maintain consistency.

Memory/Cache Consistency: 2 Solutions
• Write-through cache
– In the write-through cache, consistency between cache data and memory data is maintained by updating both main memory and the cache.
– This is very slow
• We must wait for both the cache and memory writes to complete before the CPU can proceed.
• Memory writes are very slow!

Memory/Cache Consistency: 2 Solutions
• Write-back cache
– Only the cache copy of the data is updated.
– When the data in a block is updated, a special flag called the dirty bit is set to indicate that the cache copy is now inconsistent with the memory copy.
– If the block is chosen for replacement (either by the replacement policy or because another address A' maps to the same block), then the memory copy is updated if the dirty bit is set.
– If the dirty bit is not set, the block is simply replaced.

Nitty-Gritty: Use of Addresses by the Cache
• Addresses are used to access the cache.
• For a Direct Mapped Cache:
Address A: | Tag | Block Index | Block Offset | Byte Offset |

Nitty-Gritty: Use of Addresses by the Cache
• Direct Mapped Cache
Address A: | Tag | Block Index | Block Offset | Byte Offset |
– Number of byte offset bits: B = log2(number of bytes per word)
• On MIPS this is usually 2 bits
– Number of block offset bits: W = log2(number of words per block)
• 0 bits for 1-word blocks
– Number of block index bits: I = log2(number of blocks)
– Number of tag bits = address_length - B - W - I
• address_length is 32 bits on MIPS

Nitty-Gritty: Use of Addresses by the Cache
• Fully Associative Cache
Address A: | Tag | Block Offset | Byte Offset |
– Number of byte offset bits: B = log2(number of bytes per word)
• On MIPS this is usually 2 bits
– Number of block offset bits: W = log2(number of words per block)
• 0 bits for 1-word blocks
– Number of tag bits = address_length - B - W
• address_length is 32 bits on MIPS
• Note that there are no index bits for fully associative caches.

Nitty-Gritty: Use of Addresses by the Cache
• Set-Associative Cache
Address A: | Tag | Set Index | Block Offset | Byte Offset |
– Number of byte offset bits: B = log2(number of bytes per word)
• On MIPS this is usually 2 bits
– Number of block offset bits: W = log2(number of words per block)
• 0 bits for 1-word blocks
– Number of set index bits: S = log2(number of sets)
– Number of tag bits = address_length - B - W - S
• address_length is 32 bits on MIPS
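All three breakdowns use the same arithmetic, so before the worked example that follows, here is a rough C calculator for the field widths. It assumes every parameter is a power of two, and treats a direct mapped cache as 1-way and a fully associative cache as num_blocks-way (which correctly makes the index field 0 bits wide); the parameter values shown are arbitrary.

    #include <stdio.h>

    /* log2 for exact powers of two (cache parameters here always are). */
    static unsigned ilog2(unsigned x) {
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        const unsigned address_length = 32;   /* MIPS addresses          */
        unsigned bytes_per_word  = 4;
        unsigned words_per_block = 4;         /* example parameters      */
        unsigned num_blocks      = 4096;
        unsigned ways            = 2;         /* 1 = direct mapped;      */
                                              /* num_blocks = fully assoc. */

        unsigned B = ilog2(bytes_per_word);     /* byte offset bits      */
        unsigned W = ilog2(words_per_block);    /* block offset bits     */
        unsigned I = ilog2(num_blocks / ways);  /* block/set index bits  */
        unsigned tag = address_length - B - W - I;

        printf("B=%u W=%u index=%u tag=%u\n", B, W, I, tag);
        return 0;
    }

For the values shown (a 2-way cache of 4096 four-word blocks), this prints B=2 W=2 index=11 tag=17.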
Example
• A cache built for the MIPS architecture has a total size of 128 KB. Find the number of tag, set index, block index, block offset, and byte offset bits (as applicable) in a given address A for each of the following cache architectures:
– Direct mapped, 1 word per block
– Direct mapped, 8 words per block
– Fully associative, 2 words per block
– 2-way set associative, 4 words per block

Example
• Basic things you first need to work out:
– What types of information do I need to determine for each cache architecture?
• E.g. for set-associative, you need to determine the byte offset, block offset, set index and tag bits.
– What is the cache size in terms of words?
– What is the total number of blocks we would have, or the total number of sets?
• This gives us the number of index bits.
– Any other important information?

Nitty-Gritty: Cache Housekeeping Flags
• Other than the data and tag bits, cache blocks need to store housekeeping flags.
• The dirty bit (D) we saw earlier is an example.
• Other bits include:
– Valid bit (V)
• When a cache first starts up, the tag and data bits are random. It is possible to get a cache hit because the tag from an address happens to match the random number in the tag field of a block.
• But the data is random and invalid!
• The valid bit is initially off, and is set only when valid data is written to a block.

Nitty-Gritty: Cache Housekeeping Flags
– Use bit (U)
• This is used by the LRU replacement algorithm to determine which block is least recently used.
• It is present only in fully-associative and set-associative caches that use LRU replacement policies.
• A cache block may thus look as complicated as this:
| V | D | U | Tag | Word 00 | Word 01 | Word 10 | Word 11 |

Total Number of Bits in a Cache?
• When we speak of cache size, we normally refer to how many bytes of main memory data the cache can hold
– E.g. a 64 KB cache can hold up to 64 KB of main memory data.
• But as we can see, data is not the only thing stored in a cache block.
– We also have the tag and the housekeeping flags!
• Thus the total number of bits needed to implement a cache can be much bigger than the specified cache size!

Example
• We want to implement a 256 KB write-back cache on the MIPS architecture. The cache will be 4-way set associative, with 4-word blocks. The LRU replacement policy will be used. Find the total number of bits of SRAM required to implement this cache.

Example
• Analysis:
– What housekeeping flags will be needed?
– What is the size of the data portion of each block?
– What is the number of blocks?
– What is the number of sets?
– What is the number of tag bits?
• What is the number of byte offset, block offset and set index bits required?
• Based on this analysis, you should be able to get the answer. One possible working is sketched below.
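Here is one possible working in C. It assumes one LRU use bit per block, matching the V/D/U block diagram earlier; a real 4-way LRU scheme needs more state per block, so treat the housekeeping bit count as an assumption rather than the definitive answer.

    #include <stdio.h>

    int main(void) {
        /* 256 KB write-back cache, 4-way set associative, 4-word blocks. */
        unsigned data_bits  = 4 * 32;                   /* 128 bits/block */
        unsigned num_blocks = (256 * 1024) / (4 * 4);   /* 16384 blocks   */
        unsigned num_sets   = num_blocks / 4;           /* 4096 sets      */

        /* Address split: 2 byte offset + 2 block offset + 12 set index,
         * so the tag is 32 - 2 - 2 - 12 = 16 bits.                       */
        unsigned tag_bits = 16;

        /* Per block: valid + dirty (write-back) + one assumed LRU U bit. */
        unsigned flag_bits = 1 + 1 + 1;

        unsigned per_block = data_bits + tag_bits + flag_bits;  /* 147    */
        printf("%u sets, %u blocks x %u bits = %u bits of SRAM\n",
               num_sets, num_blocks, per_block, num_blocks * per_block);
        return 0;
    }

Under these assumptions the total is 16384 x 147 = 2,408,448 bits, comfortably more than the 2,097,152 bits (256 KB) of data the cache actually holds.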
Summary
• Caches
– Make use of locality to make it possible for small amounts of fast, expensive memory to hold a copy of the main memory data that is likely to be accessed soon.
– Allow fast access to huge amounts of memory.
• Cache types
– Direct mapped
• Simple, fast
• Poor temporal locality
– Fully associative
• Flexibility of block placement allows smart replacement algorithms that promote temporal locality.
• Expensive, slow.
– Set associative
• Simpler to build than fully associative, yet gives good temporal locality through flexible placement of blocks (just like fully-associative).
• Limited associativity can sometimes give poor performance.

Summary
• Writing policies
– Write-through
• Simple to implement
• Slow
– Write-back
• Fast
• More difficult to implement
• Housekeeping flags
– Extra information is needed for the running of the cache.
• Total cache size vs. cache size
– Not the same thing!