CS 161 Ch 7: Memory Hierarchy
LECTURE 22
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan

Improving Caches
° In general, we want to minimize Average Access Time:
    = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
  (recall Hit Time << Miss Penalty)
° So far, we have looked at ways to reduce the miss rate:
  • Larger block size
  • Larger cache
  • Higher associativity
° What else can we do to reduce the miss penalty? Add a second-level (L2) cache.

Current Memory Hierarchy
Processor (regs and L1 cache alongside the datapath and control) -> L2 Cache -> Main Memory -> Secondary Memory

               Regs     L1 Cache   L2 Cache   Main Memory   Secondary Memory
Speed (ns):    0.5      2          6          100           10,000,000
Size (MB):     0.0005   0.05       1-4        100-1000      100,000
Cost ($/MB):   -        $100       $30        $1            $0.05
Technology:    Regs     SRAM       SRAM       DRAM          Disk

How do we calculate the miss penalty?
° Access time = L1 hit time x L1 hit rate + L1 miss penalty x L1 miss rate
° We simply take the L1 miss penalty to be the access time of the L2 cache
° Access time = L1 hit time x L1 hit rate
              + (L2 hit time x L2 hit rate + L2 miss penalty x L2 miss rate) x L1 miss rate

Do the numbers for L2 Cache
° Assumptions:
  • L1 hit time = 1 cycle, L1 hit rate = 90%
  • L2 hit time (also the L1 miss penalty) = 4 cycles, L2 miss penalty = 100 cycles, L2 hit rate = 90%
° Access time = L1 hit time x L1 hit rate
              + (L2 hit time x L2 hit rate + L2 miss penalty x (1 - L2 hit rate)) x L1 miss rate
  = 1 x 0.9 + (4 x 0.9 + 100 x 0.1) x (1 - 0.9)
  = 0.9 + 13.6 x 0.1
  = 2.26 clock cycles

What would it be without the L2 cache?
° Assume the L1 miss penalty would then be 100 clock cycles
° 1 x 0.9 + 100 x 0.1 = 10.9 clock cycles, vs. 2.26 with the L2
° So we gain a clear benefit from having the second, larger cache before main memory (see the sketches below)
° Today's L1 cache sizes: 16 KB-64 KB; an L2 cache may be 512 KB to 4096 KB

An Example (p. 576)
Q: Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 500 MHz. The main memory access time is 200 ns. Suppose the miss rate per instruction is 5%. What is the revised CPI? How much faster will the machine run if we add a secondary cache (with a 20-ns access time) that reduces the miss rate to memory to 2%? Assume the same access time for hit or miss.
A: At 500 MHz, one clock cycle is 2 ns, so the miss penalty to main memory = 200 ns = 100 cycles.
Total CPI = Base CPI + Memory-stall cycles per instruction.
Hence, revised CPI = 1.0 + 5% x 100 = 6.0.
When an L2 cache with a 20-ns (10-cycle) access time is added, the miss rate to memory is reduced to 2%. So, of the 5% of references that miss in L1, 3% hit in L2 and 2% miss. The CPI is reduced to 1.0 + 5% x (10 + 40% x 100) = 3.5.
Thus, the machine with the secondary cache is faster by 6.0/3.5 ≈ 1.7.

The Three Cs in Memory Hierarchy
° Cache misses fall into three classes:
  - Compulsory misses: caused by the first access to a block; a small component, fixed independent of cache size.
  - Capacity misses: occur because the cache, given its limited size, cannot contain all the blocks; reduced by increasing cache size.
  - Conflict misses: occur because multiple blocks compete for the same block frame or set in the cache (also called collision misses); reduced by increasing associativity.
° See Fig. 7.30 for performance.
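A minimal Python sketch of the two-level access-time formula above, using the numbers from the "Do the numbers for L2 Cache" slide; the function and variable names are illustrative, not from the text:

```python
# Average access time for a two-level cache hierarchy (all names illustrative).
def avg_access_time(l1_hit_time, l1_hit_rate,
                    l2_hit_time, l2_hit_rate, l2_miss_penalty):
    """L1 misses are serviced by L2; L2 misses go on to main memory."""
    l1_miss_penalty = (l2_hit_time * l2_hit_rate
                       + l2_miss_penalty * (1 - l2_hit_rate))
    return l1_hit_time * l1_hit_rate + l1_miss_penalty * (1 - l1_hit_rate)

# Slide numbers: 1-cycle L1 hit at 90%, 4-cycle L2 at 90%, 100-cycle memory.
with_l2 = avg_access_time(1, 0.90, 4, 0.90, 100)
without_l2 = 1 * 0.90 + 100 * (1 - 0.90)  # L1 misses go straight to memory
print(f"with L2:    {with_l2:.2f} cycles")     # 2.26
print(f"without L2: {without_l2:.2f} cycles")  # 10.90
```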
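The worked example on p. 576 can be checked the same way; this sketch simply replays the slide's arithmetic, and the variable names are assumptions made for illustration:

```python
# CPI with and without a secondary cache (500 MHz clock => 2 ns/cycle).
cycle_ns = 2.0
mem_penalty = 200 / cycle_ns   # 200 ns main memory access -> 100 cycles
l2_access = 20 / cycle_ns      # 20 ns L2 access           -> 10 cycles

base_cpi = 1.0
l1_miss_rate = 0.05            # misses per instruction
mem_miss_rate = 0.02           # fraction still missing after L2

# Without L2, every L1 miss pays the full main-memory penalty.
cpi_no_l2 = base_cpi + l1_miss_rate * mem_penalty                  # 6.0

# With L2, every L1 miss pays the L2 access; the 40% of L1 misses
# (2% / 5%) that also miss in L2 pay the memory penalty as well.
cpi_with_l2 = base_cpi + l1_miss_rate * (
    l2_access + (mem_miss_rate / l1_miss_rate) * mem_penalty)      # 3.5

print(cpi_no_l2, cpi_with_l2, round(cpi_no_l2 / cpi_with_l2, 2))   # 6.0 3.5 1.71
```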
Appendix B: Memory Technology - SRAM
° SRAM (Static Random Access Memory) is the technology used for caches.
° The internal design of a 4x2 SRAM built from D flip-flops is shown in Fig. B.23. How many transistors does each D-FF have?
  - A row is selected by the row decoder, which decodes the most-significant bits of the address.
  - The outputs for each bit are connected through tri-state buffers to the column decoder.
  - The circuit can be extended to many chips with chip-select signals (Fig. B.21).
° A different organization is shown in Fig. B.24.
° Synchronous SRAM or DRAM: can transfer a burst of data given a starting address and a burst length; well suited to transferring a block of data from main memory to the cache.

Figs. B.23 and B.24 (Appendix B)
° Read the SRAM material on pp. B-26 through B-30.
° Study Figs. B.21, B.22, B.23, and B.24.

Dynamic Random Access Memory - DRAM
° DRAM organization is similar to SRAM, except that each DRAM bit is built from a pass transistor and a capacitor, as shown in Fig. B.25.
° Fewer transistors per bit give high density, but the stored charge slowly leaks off the capacitor.
° The capacitor must be recharged, or refreshed, giving rise to a high cycle time.
° DRAM uses a two-level decoder, as shown in Fig. B.26. Note that 2048 bits are accessed per row, but only one bit is used.
° Page-mode DRAM: why not use the rest of these bits, changing only the column address, without another row access?
° Nibble-mode RAM: provides 4 bits (a nibble) for every row access.

Figs. B.25 and B.26 (Appendix B)
° Read the DRAM material on pp. B-31 to B-33.
° Understand Figs. B.25 and B.26.

Main Memory Organizations (Fig. 7.13)
[Figure 7.13: three organizations. (a) One-word-wide memory: CPU, cache, bus, and memory are all one word wide. (b) Wide memory: a wide bus and wide memory, with a multiplexor between the cache and the CPU. (c) Interleaved memory: four one-word-wide banks (bank 0 through bank 3) sharing a one-word bus. In every case, DRAM access time >> bus transfer time.]

Memory Access Time Example
° Assume it takes 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to send a word of data.
° With a cache block of 4 words and one-word-wide DRAM (Fig. 7.13a), miss penalty = 1 + 4x15 + 4x1 = 65 cycles.
° With main memory and a bus that are 2 words wide (Fig. 7.13b), miss penalty = 1 + 2x15 + 2x1 = 33 cycles. For a 4-word-wide memory, the miss penalty is 17 cycles. This organization is expensive due to the wide bus and control circuits.
° With interleaved memory of 4 banks and the same one-word bus (Fig. 7.13c), miss penalty = 1 + 1x15 + 4x1 = 20 cycles. The memory controller must supply consecutive addresses to different memory banks (see the sketches below). Interleaving is universally adopted in high-performance computers.
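To make the three miss penalties concrete, here is a minimal Python sketch of the arithmetic above; the timing constants come from the slide's assumptions, while the function names are purely illustrative:

```python
# Miss penalty for a 4-word block under the organizations of Fig. 7.13.
ADDR, DRAM, XFER = 1, 15, 1   # cycles: send address, DRAM access, bus transfer
BLOCK_WORDS = 4

def one_word_wide():
    # (a) Each word needs its own DRAM access and its own bus transfer.
    return ADDR + BLOCK_WORDS * (DRAM + XFER)                 # 65 cycles

def wide_memory(width_words):
    # (b) Fetch width_words per access over a correspondingly wide bus.
    accesses = BLOCK_WORDS // width_words
    return ADDR + accesses * (DRAM + XFER)                    # 33 or 17 cycles

def interleaved():
    # (c) Four banks, one per word of the block: all DRAM accesses
    # overlap, then the words return one at a time over the one-word bus.
    return ADDR + DRAM + BLOCK_WORDS * XFER                   # 20 cycles

print(one_word_wide(), wide_memory(2), wide_memory(4), interleaved())
```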
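And as a sketch of how the controller can supply consecutive addresses to different banks, the snippet below assumes simple word interleaving, where the low-order address bits select the bank; this particular mapping is an assumption for illustration, not something the slides specify:

```python
BANKS = 4  # assumed word-interleaved organization, as in Fig. 7.13c

def bank_of(word_addr):
    return word_addr % BANKS    # low-order bits choose the bank

def row_in_bank(word_addr):
    return word_addr // BANKS   # remaining bits choose the location in the bank

# A 4-word block starting at a 4-word-aligned address touches all four
# banks exactly once, so the four DRAM accesses can overlap.
for w in range(8, 12):          # hypothetical block at word address 8
    print(f"word {w} -> bank {bank_of(w)}, row {row_in_bank(w)}")
```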