Computer Systems Organization: Lecture 9
Ankur Srivastava, University of Maryland, College Park
Based on slides from Mary Jane Irwin (www.cse.psu.edu/~mji), www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, UCB]
ENEE350

Review: Major Components of a Computer
- Processor (Control, Datapath)
- Memory
- Devices (Input, Output)

Processor-Memory Performance Gap
[Figure: performance vs. year, 1980-2004. Processor performance ("Moore's Law") grows ~55%/year (2X/1.5 yrs); DRAM performance grows ~7%/year (2X/10 yrs). The processor-memory performance gap grows ~50%/year.]

The Memory Hierarchy Goal
Fact: Large memories are slow, and fast memories are small.
How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)?
- With hierarchy
- With parallelism

A Typical Memory Hierarchy
By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
[Figure: on-chip components (Control, Datapath, RegFile, ITLB, DTLB, instruction and data caches, eDRAM), a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk). Speed ranges from ~1/2 cycle at the registers through 1's, 10's, and 100's of cycles to 10,000's at disk; size from 100's of bytes through K's and M's to G's-T's; cost from highest (on-chip) to lowest (disk).]

Characteristics of the Memory Hierarchy
- Processor <-> L1$: 4-8 bytes (word)
- L1$ <-> L2$: 8-32 bytes (block)
- L2$ <-> Main Memory: 1 to 4 blocks
- Main Memory <-> Secondary Memory: 1,024+ bytes (disk sector = page)
Access time and (relative) size increase with distance from the processor.
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

Memory Hierarchy Technologies
Caches use SRAM for speed and technology compatibility
- Low density (6-transistor cells), high power, expensive, fast
- Static: content will last "forever" (until power is turned off)
[Figure: a 2M x 16 SRAM with Address, Chip select, Output enable, and Write enable inputs, 16-bit Din[15-0], and 16-bit Dout[15-0].]
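The hierarchy's "large, cheap, and fast" illusion can be quantified with a back-of-the-envelope average memory access time (AMAT) calculation. The sketch below is illustrative only: the hit rates and latencies are hypothetical values, not numbers from these slides.

```python
# Sketch: average memory access time (AMAT) across a two-level hierarchy.
# All hit rates and latencies below are hypothetical illustration values.

def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (cycles)."""
    return hit_time + miss_rate * miss_penalty

# Assume: L1 hits in 1 cycle and misses 5% of the time;
# L2 hits in 10 cycles and misses 10% of the time; DRAM costs 100 cycles.
l2_amat = amat(10, 0.10, 100)     # 10 + 0.10 * 100 = 20 cycles
l1_amat = amat(1, 0.05, l2_amat)  # 1 + 0.05 * 20 = 2 cycles
print(l1_amat)  # 2.0 -- near L1 speed despite 100-cycle DRAM
```

Under these assumed rates, the average access costs about 2 cycles even though a DRAM access costs 100, which is the point of the hierarchy.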
Main Memory uses DRAM for size (density)
- High density (1-transistor cells), low power, cheap, slow
- Dynamic: needs to be "refreshed" regularly (~every 8 ms), which costs 1% to 2% of the active cycles of the DRAM
- Addresses are divided into 2 halves (row and column):
  - RAS, or Row Access Strobe, triggers the row decoder
  - CAS, or Column Access Strobe, triggers the column selector

Classical RAM Organization (~Square)
[Figure: the row address drives a row decoder that selects one word (row) line of the RAM cell array; the bit (data) lines feed a column selector & I/O circuits, which the column address uses to pick the data bit or word.]
Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell.
One memory row holds a block of data, so the column address selects the requested bit or word from that block.

Classical DRAM Organization (~Square Planes)
[Figure: multiple square planes, each a RAM cell array with its own row decoder, bit (data) lines, and column selector & I/O circuits; each plane contributes one data bit.]
Each intersection represents a 1-T DRAM cell.
The column address selects the requested bit from the row in each plane.

Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
[Figure: on-chip CPU and cache connected by a bus to off-chip memory.]
One-word-wide organization (one-word-wide bus and one-word-wide memory). Assume:
- 32-bit data & 32-bit addresses per cycle
1. 1 clock cycle to send the address
2. 25 clock cycles for DRAM to be read
3.
1 clock cycle to return a word of data

Memory-Bus to Cache bandwidth = the number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.

One-Word-Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
- ___ cycle to send the address
- ___ cycles to read DRAM
- ___ cycle to return the data
- ___ total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is ___ bytes per clock.

One-Word-Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
- 1 cycle to send the address
- 25 cycles to read DRAM
- 1 cycle to return the data
- 27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/27 = 0.148 bytes per clock.

One-Word-Wide Memory Organization, con't
What if the block size is four words?
- ___ cycle to send the 1st address
- ___ cycles to read DRAM
- ___ cycles to return the last data word
- ___ total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is ___ bytes per clock.

One-Word-Wide Memory Organization, con't
What if the block size is four words?
- 1 cycle to send the 1st address
- 4 x 25 = 100 cycles to read DRAM (four sequential 25-cycle reads)
- 1 cycle to return the last data word
- 102 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/102 = 0.157 bytes per clock.

Interleaved Memory Organization
For a block size of four words:
- ___ cycle to send the 1st address
- ___ cycles to read DRAM
- ___ cycles to return the last data word
- ___ total clock cycles miss penalty
[Figure: on-chip CPU and cache connected by a bus to four memory banks (bank 0 through bank 3); the four 25-cycle DRAM reads proceed in parallel across the banks.]
Number of bytes transferred per clock cycle (bandwidth) for a single miss is ___ bytes per clock.

Interleaved Memory Organization
For a block size of four words:
- 1 cycle to send the 1st address
- 25 cycles to read DRAM (the four bank reads overlap)
- 4 cycles to return the 4 words
- 30 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/30 = 0.533 bytes per clock.

DRAM Memory System Summary
It's important to match the cache characteristics
- caches access one block at a time (usually more than one word)
with the DRAM characteristics
- use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
and with the memory-bus characteristics
- make sure the memory bus can support the DRAM access rates and patterns
with the goal of increasing the Memory-Bus to Cache bandwidth.
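The three miss-penalty and bandwidth calculations above can be reproduced with a short script. This is a sketch using the cycle counts assumed on the slides (1 cycle to send an address, 25 cycles per DRAM read, 1 bus cycle per returned word, 4-byte words):

```python
# Sketch of the miss-penalty and bandwidth arithmetic from the slides.
ADDR, DRAM, BUS = 1, 25, 1   # cycles: send address, read DRAM, return one word
WORD, BLOCK = 4, 4           # 4-byte words, 4-word blocks

# One-word block, one-word-wide memory: 1 + 25 + 1 = 27 cycles.
one_word_block = ADDR + DRAM + BUS

# Four-word block, one-word-wide memory: the four 25-cycle reads are
# sequential, and (as counted on the slide) only the last word's bus
# transfer adds latency: 1 + 4*25 + 1 = 102 cycles.
four_word_narrow = ADDR + BLOCK * DRAM + BUS

# Four-word block, four interleaved banks: the reads overlap across
# banks, then the words return one per bus cycle: 1 + 25 + 4 = 30 cycles.
four_word_interleaved = ADDR + DRAM + BLOCK * BUS

for name, cycles, nbytes in [("1-word block", one_word_block, WORD),
                             ("4-word, narrow", four_word_narrow, BLOCK * WORD),
                             ("4-word, interleaved", four_word_interleaved, BLOCK * WORD)]:
    print(f"{name}: {cycles} cycles, {nbytes / cycles:.3f} bytes/clock")
# 1-word block: 27 cycles, 0.148 bytes/clock
# 4-word, narrow: 102 cycles, 0.157 bytes/clock
# 4-word, interleaved: 30 cycles, 0.533 bytes/clock
```

The comparison makes the summary's point concrete: interleaving raises delivered bandwidth more than 3x over the narrow four-word case for the same DRAM parts, because the slow row reads overlap.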