Ch7 MEMORY HIERARCHY

OVERVIEW

Accessing memory has always been one of the primary bottlenecks in computing. Fast memory is expensive, and only very high-performance, expensive computers are built with very large, fast memories. The rest of us live with slow (50 ns access, these days) DRAM memory.

Consider: today's machines may have a clock cycle of 1 ns, but a memory access requires 50 ns. Obviously, a disconnect in performance. If we had to live with a simple memory of this type, we'd have a performance differential (assuming that ¼ of the instructions are loads and stores and all instructions have to do a fetch):

Time(50 ns memory) / Time(1.0 ns memory) = (0.75*50 + 0.25*100) / (0.75*1.0 + 0.25*2.0) = 62.5 / 1.25 = 50

(Loads and stores need two accesses, a fetch plus a data access, hence the 100 ns and 2.0 ns terms.)

Obviously, this is a problem; thus the necessity of better memory performance. A number of techniques are used to greatly improve memory performance. Most important is, of course, cache memory.

What is cache memory? A special fast memory that stores RECENTLY ACCESSED locations, and locations adjacent to those.
1. Level 1 cache memory is integrated with the CPU and runs at full speed.
2. It is NOT random access; instead, it is structured to store recent program and data locations.
3. As long as the required instructions and data are in the cache, a hit occurs, and memory access does not slow down the CPU (possibly a 2-cycle Load/Store).
4. If the information is NOT in the cache, a miss occurs, and the CPU has to stop and await the slow process of memory being uploaded.
5. Because the amount of space available on the CPU die for cache memory is small (32 Kbytes-128 Kbytes), often a LEVEL 2 cache is integrated onto the computer's motherboard. This is a cache memory that is larger than the Level 1 cache (256 Kbytes-4 Mbytes), but necessarily slower because it is off-chip; usually 10-25 ns access.

Why does cache memory work? If locations were accessed totally randomly, one would not expect many hits. However, the execution of a program is anything but random:

Principle of locality:
Temporal locality: if an address is referenced, it is likely to be accessed again soon. Programs tend to consist of loops, so the first time through a loop may be slow because all the instructions need to be uploaded from memory. Data is often accessed within the loop too, so Load/Store addresses tend to be the same time after time.
Spatial locality: if an address is referenced, it is likely that close-by locations will also be referenced. Loops consist of adjacent instructions. Data: related variables tend to be spaced closely together (arrays, structs, variable definitions).

These principles suggest that it is indeed worthwhile to use a cache, and that rather than uploading one instruction or data location at a time, it may be useful to upload several. WHY? Memory (DRAM) requires substantial overhead to get the first location out; subsequent locations can be accessed much faster.

[Figure: DRAM interface. The address is applied in two halves, latched by RAS and then CAS; the DATA IN and DATA OUT paths are 1, 2, or 4 bits wide.]

This is true because two cycles are needed to latch the address (half the address at a time), but today's DRAMs are pipelined so that only the first address needs to be given; all subsequent locations just require a little control handshaking.

Consequently, cache memory operates on BLOCKS of adjacent bytes. A whole block is loaded from DRAM (or the Level 2 cache) at a time. Block sizes tend to be 16-64 bytes (4-16 words). Thus, the primary purpose of cache memory is to improve the average access time of instruction and data fetches.

Today, faster memory is also available (SDRAM), at a premium price; access times are down to 8 ns, allowing full-speed 133 MHz memory bus operation.
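To make the locality discussion concrete, here is a minimal C sketch (illustrative only; the array, its size, and the stride are my own choices, not from the text). The first loop re-executes the same few instructions (temporal locality) while touching adjacent array elements (spatial locality), so after the first miss in each block it runs almost entirely out of the cache. The second loop strides by 64 bytes, so nearly every reference lands in a new block and misses.

#include <stdio.h>

#define N (1 << 20)            /* 1M ints: much larger than a typical L1 cache */

static int a[N];

int main(void)
{
    long sum = 0;

    /* Good locality: adjacent elements, same loop instructions re-executed.
       One block fill from DRAM services several consecutive accesses.       */
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Poor spatial locality: a stride of 16 ints (64 bytes) moves to a new
       block on every access, so almost every reference is a miss.           */
    for (int i = 0; i < N; i += 16)
        sum += a[i];

    printf("%ld\n", sum);
    return 0;
}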
VIRTUAL MEMORY AND PAGING

But there are other aspects to the memory hierarchy system. Consider a user of, say, a UNIX system on which a number of other users are operating. That user wants to be certain that:
1) No other user interferes with his/her program execution.
2) Multiple programs can run concurrently on the machine, even if they all start at address 0.
3) The machine will not run out of memory, even if a program requires 10 times as many locations as are available in DRAM!

The memory hierarchy of today's computers ensures that these are all true.

Memory protection hardware compares accessed locations to the range of addresses allowed to the currently-running (in-CPU) program, and causes an exception if the access is not legal (a memory access violation). Active programs (called processes) share CPU resources: periodically an external timer generates an interrupt, and the O.S. decides whether to give a different process CPU time. This happens about every 10 ms.

Virtual memory allows the second need to be met. Addresses accessed by the CPU are translated, so the target address may be in one of many different locations in DRAM. This allows all programs to start at location 0, as long as the mapping is different for each program. Locations in memory are organized into blocks of adjacent locations called PAGES; each page can be arbitrarily mapped.

Demand paging allows writing memory blocks that have not been recently used to the hard disk drive, temporarily caching them at a lower level! This frees up DRAM memory for other uses. When a program needs a location that has been paged out to disk, the hardware detects the problem and causes an exception, and the operating system takes over and retrieves the page.

CACHE MEMORY (7.2)

Consider a RAM memory read operation. The system responds to an applied address by outputting the bits stored at that location. A cache memory must work differently: it stores recently used locations somewhere in the cache, and must retrieve the information by address.

A cache is organized into LINES. Each line is comprised of:
1) One block of data (or instructions)
2) Address information that allows determining whether the requested CPU address is located in the line
3) A few housekeeping control bits.

Most systems today have a UNIFIED CACHE that contains both instructions and data, but a few have separate caches for data and instructions to allow simultaneous instruction fetch and Load/Store operations.

Questions that must be addressed in considering caches are:
1) Where can a block be placed in the cache?
2) How is a block found if it is in the cache?
3) Which block should be replaced on a cache miss?
4) What happens on a write?

We'll consider three different cache architectures. Let's assume that each line contains 32 bytes, so a 32-bit address breaks down as:

ADDR: A31-A5 = block number, A4-A2 = word in block, A1-A0 = byte within word
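As a concrete illustration, here is a minimal C sketch (my own helper, not from the text) that splits a 32-bit address into those three fields for a 32-byte line.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr = 0x12345678;                    /* an arbitrary example address     */

    uint32_t block_number  = addr >> 5;            /* A31-A5: which 32-byte block      */
    uint32_t word_in_block = (addr >> 2) & 0x7;    /* A4-A2:  which of the 8 words     */
    uint32_t byte_in_word  = addr & 0x3;           /* A1-A0:  which byte within a word */

    printf("block %u, word %u, byte %u\n",
           (unsigned)block_number, (unsigned)word_in_block, (unsigned)byte_in_word);
    return 0;
}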
Architecture 1: FULLY-ASSOCIATIVE CACHE

This is the hardest system to implement, but the most general, and so it is pretty much a theoretical construct. A block can be placed ANYWHERE in the FA cache. This means that the full block address must be stored in the cache. The hardest part is that an LRU (least-recently-used) algorithm must be implemented to determine which block should be replaced if there is a miss. This takes a lot of slow circuitry!

[Figure: fully-associative cache block diagram. Each line stores a tag (address bits A31-A5), a data block, and status bits; every line has its own comparator against the CPU address, a match raises the hit signal, and bits A4-A2 select the word passed through a MUX to DATA OUT.]

What happens: each line has an independent address comparator that tries to match the CPU address with the line's stored address. If one matches, we have a hit, UNLESS the line isn't in use, in which case we would have a spurious match. One of the status bits must therefore be "line valid". On a hit, the appropriate word in the line is passed via the mux to the CPU (on a read). We'll talk about writes later. On a read miss, the LRU hardware decides which line to replace, and the CPU is halted until the requested block is uploaded from DRAM or the L2 cache. The block's address is copied in along with the data, and the line-valid flag is set.

Architecture 2: DIRECT-MAPPED CACHE

This is the opposite case from the above: each block can go to only one line in the cache, resulting in the possibility that lines are overwritten frequently. But it is very simple structurally. To reduce the chance that nearby data or instructions overwrite each other's line in the cache, the lower CPU address bits are used to index into the cache. The line used by a block is found by taking the block address modulo the number of lines in the cache. If the number of lines is a power of 2, you just take the appropriate low-order address bits as the index, and the remaining upper bits form the tag.

Note that in a unified cache, if data and instructions happen to map to the same index, they will interfere with each other. Likewise, a long loop that has twice as many instructions as the cache capacity will also render the cache useless, because instructions will constantly overwrite each other in the cache.

[Figure: direct-mapped cache. Address bits A16-A5 index one of 4096 lines (000-FFF); each line stores a tag (A31-A17), a 32-byte data block, and status bits; a single comparator checks the stored tag against the CPU's A31-A17, and A4-A2 selects the word for DATA OUT.]

In this case, there is only ONE comparator, and no LRU hardware. So simple! Yet the failings of this architecture have led to an intermediate approach.

Architecture 3: SET-ASSOCIATIVE CACHE

This is an intermediate approach. The cache is divided into 2-8 direct-mapped banks (ways). A block may reside in any one of the ways, at its direct-mapped index; the group of lines that share an index (one per way) is called a set. A read results in one comparison per way. If one hits, data is taken from that way. If all miss, a crude LRU algorithm is used to determine which way's line to replace. A 4-way set-associative cache means that each set holds 4 lines; the total cache contains some number of sets.

[Figure: 4-way set-associative cache. Four copies of the direct-mapped structure (labeled SET 1 through SET 4 in the original diagram), each indexed by A16-A5, each with its own tag (A31-A17), status bits, and comparator feeding DATA OUT; some additional control circuitry is needed, plus an LRU implementation, which is sometimes just random replacement.]

Let's reconsider our questions:
1) Where can a block be placed in the cache? FA: anywhere. DM: one place. SA: one location in each way (any way within the indexed set).
2) How is a block found if it is in the cache? FA: one of N comparators. DM: one comparator. SA: one comparator per way (a few).
3) Which block should be replaced on a cache miss? FA: LRU algorithm. DM: no choice. SA: LRU or random. A common approximate-LRU scheme uses a status bit for each line that is periodically reset; when a hit occurs, the bit is set again. After a while, only the non-accessed lines have their bits reset, and those can be replaced.
4) What happens on a write? In other words, what happens when the program updates data that must eventually be stored to memory? This is a bit tricky. In some cases (especially with a DM cache), we may want to avoid placing written data in the cache, since it could evict instructions or data that are still needed; a write does not guarantee a subsequent read! But DRAM writes DO have a time penalty.
Thus, there are a number of policies:

On a HIT:
Write-through (or store-through) – information is written BOTH to the block in the cache AND to the lower-level memory.
Write-back (copy-back) – information is written ONLY to the block in the cache. A "dirty bit" in the status field is set to indicate that a write to memory still needs to be done.

On a MISS:
Write allocate (fetch on write) – the block is loaded into the cache on a write miss, and then one of the above actions is taken.
No-write allocate (write around) – the block is modified at the lower level only; the cache is not updated.

How can we quantify cache performance?

1) CPUtime = (CPUclockCycles + MemoryStallCycles) * ClockCycleTime
2) MemoryStallCycles = ReadStallCycles + WriteStallCycles = Instructions/program * Misses/instruction * MissPenalty
3) ReadStallCycles = Reads/program * ReadMissRate * ReadMissPenalty

Write stall cycles are more complex because they depend on the write policy. If write-allocate is used, a write miss involves a read-miss-like penalty. Otherwise, the penalty may be minimal (or nonexistent): just the time to transfer the data into the write buffer. If the write buffer's write to memory doesn't complete before the next read miss, however, there will be additional stalls on the read.

4) WriteStallCycles = (Writes/program * WriteMissRate * WriteMissPenalty) + WriteBufferStalls

IMPROVING PERFORMANCE

To improve performance, what can we do? Look at the components of Expression 3:

A) Reduce Reads/Program – these can be reduced by making the compiler smarter. The compiler must choose whether a variable remains in memory or is kept in a register within a window of the program. Good strategies lead to fewer loads and stores.

B) Reduce ReadMissRate – by:
- a larger cache overall
- larger associativity in the cache (4-way vs. 2-way set-associative)
- optimizing the block size (larger is better, up to a point)
- compiler optimizations (e.g., ordering of variables within blocks)

C) Reduce ReadMissPenalty – by:
- faster main memory (figure below)
- a wider memory bus (more bits uploaded at once) (figure below)
- additional cache level(s) – an example follows
- PREFETCH instructions and/or data – a special instruction indicates to the memory system that a certain address will be read soon, giving the memory system time to load it (while the CPU is busy with preceding code!)

D) Write optimizations – these can also be used to improve performance:
- Defer main memory writes in a WRITE BUFFER. A memory write operation is delayed until there is no read pending. Note that read addresses must be compared with the addresses in the write buffer!
- Use write-back strategies.

The miss penalty can be improved by increasing either the bandwidth or the data width of the path between DRAM memory and the cache (example on page 561):

[Figure: CPU, L1 cache, L2 cache, and DRAM connected by 32-bit or 128-bit buses, comparing normal bandwidth, interleaved memory access (about 4x faster), and a high-bandwidth wide path (about 4x faster).]

SAMPLE PROBLEM USING LEVEL 2 CACHE

A system has a CPU with an average CPI of 1.5 and a clock speed of 1.0 GHz. The main memory block transfer time is 150 ns. The miss rate for a (data-intensive) program is 10%. How much faster will the machine be if a secondary cache is used that transfers a block in 20 ns, with a 2% miss rate?

Consider the system without the Level 2 cache:
MainMemoryMissPenalty = 150 ns / 1 ns clock cycle = 150 clock cycles
EffectiveCPI = BaseCPI + MemoryStalls/instruction = 1.5 + 0.1 * 150 = 16.5
**** According to Amdahl's law, improving the CPU won't help much! ****

With the second-level cache:
Level2MissPenalty = 20 ns / 1 ns clock = 20 clock cycles
TotalCPI = base + L2 stalls + main memory stalls = 1.5 + 0.1*20 + 0.02*150 = 1.5 + 2 + 3 = 6.5

Still somewhat far from optimal! Performance ratio = 16.5 / 6.5 = 2.54. We definitely want the Level 2 cache in there, but we also want a reduced miss rate!
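The arithmetic in this example is easy to script. The following C sketch (the function effective_cpi() and the variable names are mine, not from the text) reproduces the CPIs of 16.5 and 6.5 and the 2.54 speedup.

#include <stdio.h>

/* Effective CPI = base CPI + sum of (miss rate * miss penalty) terms. */
static double effective_cpi(double base_cpi,
                            double l1_miss_rate, double l1_miss_penalty,
                            double l2_miss_rate, double l2_miss_penalty)
{
    return base_cpi
         + l1_miss_rate * l1_miss_penalty    /* stalls per instruction on L1 misses */
         + l2_miss_rate * l2_miss_penalty;   /* stalls per instruction on L2 misses */
}

int main(void)
{
    double clock_ns    = 1.0;                  /* 1.0 GHz -> 1 ns cycle   */
    double mem_penalty = 150.0 / clock_ns;     /* main memory: 150 cycles */
    double l2_penalty  = 20.0  / clock_ns;     /* L2 transfer: 20 cycles  */

    /* Without L2: every L1 miss (10%) goes to main memory. */
    double cpi_no_l2 = effective_cpi(1.5, 0.10, mem_penalty, 0.0, 0.0);

    /* With L2: L1 misses (10%) pay the L2 penalty; 2% still go to main memory. */
    double cpi_l2 = effective_cpi(1.5, 0.10, l2_penalty, 0.02, mem_penalty);

    printf("CPI without L2 = %.1f\n", cpi_no_l2);           /* 16.5 */
    printf("CPI with    L2 = %.1f\n", cpi_l2);              /* 6.5  */
    printf("Speedup        = %.2f\n", cpi_no_l2 / cpi_l2);  /* 2.54 */
    return 0;
}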
Problems:

1) (7.1) Describe the general characteristics of a program that would exhibit very little temporal and spatial locality with regard to data accesses.

Solution: A program with no arrays or structures traversed repeatedly; its memory accesses jump around, and each location is accessed only once.

2) (7.7) Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assuming a direct-mapped cache with 16 one-word blocks that are initially empty, label each reference in the list as a hit or a miss and show the final contents of the cache.

Solution:
1 – Miss
4 – Miss
8 – Miss
5 – Miss
20 – Miss
17 – Miss
19 – Miss
56 – Miss
9 – Miss
11 – Miss
4 – Miss
43 – Miss
5 – Hit
6 – Miss
9 – Hit
17 – Hit

Final contents (slot: value): slot 1: 17, slot 3: 19, slot 4: 4, slot 5: 5, slot 6: 6, slot 8: 56, slot 9: 9, slot 11: 43; slots 0, 2, 7, 10, and 12-15 are empty.

3) (7.8) Using the same references as Problem 2, show the hits and misses and final cache contents for a direct-mapped cache with four-word blocks and a total size of 16 words.

Solution: What this means is that we still have the same total cache space, but since we have four-word blocks and only 16 total words, we can only have 4 slots. The hits and misses:
1 – Miss
4 – Miss
8 – Miss
5 – Hit
20 – Miss
17 – Miss
19 – Hit
56 – Miss
9 – Miss
11 – Hit
4 – Miss
43 – Miss
5 – Hit
6 – Hit
9 – Miss
17 – Hit

Final contents (each slot holds a four-word block):
slot 0: words 16-19
slot 1: words 4-7
slot 2: words 8-11
slot 3: empty

4) (7.11) Consider a memory hierarchy using one of the three organizations for main memory shown in Figure 7.13 (a: one-word-wide memory, b: four-word-wide memory, c: four-word interleaved memory). Assume that the cache block size is 16 words. If the main memory latency for a new access is 10 cycles and the transfer time is 1 cycle, what are the miss penalties for each of these organizations?

Solution:
For a (one-word-wide memory): this requires 16 main memory accesses to retrieve a cache block, and words are transferred 1 at a time, so: 1 + 16*10 + 16*1 = 177 clock cycles.
For b (four-word-wide memory): this requires 4 main memory accesses to retrieve a cache block, and words are transferred 4 at a time: 1 + 4*10 + 4*1 = 45 clock cycles.
For c (four-word interleaved memory): this requires 4 main memory accesses to retrieve a cache block, and words are transferred 1 at a time: 1 + 4*10 + 16*1 = 57 clock cycles.

5) (7.12) Suppose a processor with a 16-word block size has an effective miss rate per instruction of 0.5%. Assume that the CPI without cache misses is 1.2. Using the memories from Problem 4, how much faster is this processor when using the wide memory than the narrow or interleaved memory?

Solution: First we have to calculate the effective CPI for each memory:
a: Effective CPI = 1.2 + 0.005*177 = 2.085
b: Effective CPI = 1.2 + 0.005*45 = 1.425
c: Effective CPI = 1.2 + 0.005*57 = 1.485
Since it is the same software and CPU, we can compare CPIs directly.
Wide over narrow = 2.085 / 1.425 = 1.46
Wide over interleaved = 1.485 / 1.425 = 1.04
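As an aside before the remaining problems: traces like those in Problems 2 and 3 are easy to check mechanically. Here is a minimal C sketch of a direct-mapped cache simulator (the structure and names are mine, not from the text). With the parameters shown it reproduces the hit/miss sequence of Problem 2; changing WORDS_PER_BLOCK to 4 and NUM_SLOTS to 4 reproduces Problem 3.

#include <stdio.h>

#define NUM_SLOTS       16   /* number of cache lines                 */
#define WORDS_PER_BLOCK  1   /* words per block (1 => Problem 2)      */

int main(void)
{
    int refs[] = {1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17};
    int n = sizeof refs / sizeof refs[0];

    int tag[NUM_SLOTS];          /* stores the full block number, for simplicity */
    int valid[NUM_SLOTS] = {0};

    for (int i = 0; i < n; i++) {
        int block = refs[i] / WORDS_PER_BLOCK;   /* block number        */
        int slot  = block % NUM_SLOTS;           /* direct-mapped index */

        if (valid[slot] && tag[slot] == block) {
            printf("%2d - Hit\n", refs[i]);
        } else {
            printf("%2d - Miss\n", refs[i]);
            valid[slot] = 1;
            tag[slot] = block;                   /* load the missing block */
        }
    }

    /* Final contents: print the first word address of each resident block. */
    for (int s = 0; s < NUM_SLOTS; s++)
        if (valid[s])
            printf("slot %2d: block starting at word %d\n", s, tag[s] * WORDS_PER_BLOCK);
    return 0;
}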
6) (7.20) Using the following series of references, 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17, show the hits and misses and final cache contents for a two-way set-associative cache with one-word blocks and a total size of 16 words.

Solution: Two-way set-associative means each set holds two one-word blocks; with 16 words total, we therefore have 8 sets.
1 – Miss
4 – Miss
8 – Miss
5 – Miss
20 – Miss
17 – Miss
19 – Miss
56 – Miss
9 – Miss
11 – Miss
4 – Hit
43 – Miss
5 – Hit
6 – Miss
9 – Hit
17 – Hit

Final state (by set):
set 0: 56, 8
set 1: 17, 9
set 2: empty
set 3: 11, 43
set 4: 20, 4
set 5: 5
set 6: 6
set 7: empty

7) (7.38) If all misses are classified into one of three categories: compulsory (blocks not there since we are just starting), capacity (blocks not there since they couldn't all fit, so some had to be replaced), or conflict (misses in a direct-mapped or set-associative cache that wouldn't occur in a fully-associative one), which misses are likely to be reduced when a program is rewritten so as to require less memory? How about if the clock rate of the machine that the program is running on is increased? How about if the associativity of the existing cache is increased?

Solution:
Less memory: will likely reduce compulsory and capacity misses. Maybe conflict misses too; tough to tell.
Increased clock rate: nothing. Clock rate doesn't affect the hit rate.
Increased associativity: should reduce conflict misses.

Homework:

1) (7.5) Describe the general characteristics of a program that would exhibit very high amounts of temporal locality, but very little spatial locality, with regard to instruction fetches.

2) (7.9) Compute the total number of bits required to implement the cache in Figure 7.10. This number is different from the size of the cache, which usually refers to the number of bytes of data stored in the cache. The number of bits needed to implement the cache represents the total amount of memory needed for storing all of the data, tags, and valid bits. (A sketch of the bookkeeping appears after this list.)

3) (7.13) Cache C1 is direct-mapped with 16 one-word blocks. Cache C2 is direct-mapped with 4 four-word blocks. Assume that the miss penalty for C1 is 8 clock cycles and for C2 is 11 clock cycles. Assuming that the caches are initially empty, find a reference string for which C2 has a lower miss rate but spends more cycles on cache misses than C1. Use word addresses.

4) (7.21) Using the series of references 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17, show the hits and misses and final cache contents for a fully-associative cache with one-word blocks and a total size of 16 words. Assume least-recently-used replacement.
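For homework problem 2, the bookkeeping is as stated in the problem: each line stores its data block, its tag, and a valid bit. The C sketch below is a generic helper of my own (the parameters of the cache in Figure 7.10 are in the textbook, not here); it shows the calculation for a direct-mapped cache with 32-bit byte addresses, assuming power-of-two sizes. The example call uses the 4096-line, 32-byte-block cache sketched earlier in these notes.

#include <stdio.h>

/* Total storage (in bits) to implement a direct-mapped cache:
   each line holds the data block, a tag, and a valid bit.      */
static long cache_bits(long num_lines, long block_bytes, int addr_bits)
{
    int index_bits  = 0;
    int offset_bits = 0;

    for (long n = num_lines;   n > 1; n >>= 1) index_bits++;   /* log2(num_lines)   */
    for (long b = block_bytes; b > 1; b >>= 1) offset_bits++;   /* log2(block_bytes) */

    int tag_bits = addr_bits - index_bits - offset_bits;

    return num_lines * (block_bytes * 8 + tag_bits + 1);        /* data + tag + valid */
}

int main(void)
{
    /* Example: 4096 lines of 32 bytes each, 32-bit byte addresses. */
    printf("%ld bits\n", cache_bits(4096, 32, 32));
    return 0;
}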