Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores
Aniruddha N. Udipi, Naveen Muralimanohar*, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs

Why a complete DRAM redesign?
• JEDEC SDRAM Standard, June 1994
• Design goals over time: high density, low cost-per-bit, energy efficient
• Time for DRAM’s own “right-hand turn”
• Rethink design for modern constraints
[Figure: cost-per-bit over time. Courtesy: http://www.iiasa.ac.at]

Memory Trends
• Energy
  – Large scale systems attribute 25-40% of total power to the memory subsystem
  – Capital acquisition costs = operating costs over 3 years
  – Energy is a first-order design constraint
• Access patterns
  – Increasing socket, core, and thread counts
  – Final memory request stream extremely random
  – Cannot design for locality

[Figure: Memory Trends. Row buffer hit rate (%) for 1, 4, and 16 cores]

[Figure: Memory Trends. Percentage of row fetches by use count (1, 2, 3, >3)]

Memory Trends
• Access patterns
  – What is the exact overfetch degree?
• DRAM Reliability
  – Critical apps require chipkill-level reliability
  – Building fault-tolerance out of unreliable components is expensive
  – Schroeder et al., SIGMETRICS 2009

Related Work
• Overfetch
  – Ahn et al. (SC ’09), Ware et al. (ICCD ’06), Sudan et al. (ASPLOS ’10)
• DRAM Low-power modes
  – Hur et al. (HPCA ’08), Fan et al. (ISLPED ’01), Pandey et al. (HPCA ’06)
• DRAM Redesign
  – Loh (ISCA ’08), Beamer et al.
    (ISCA ’10)
• Chipkill mechanisms
  – Yoon and Erez (ASPLOS ’10)

Executive Summary
• Rethink DRAM design for modern constraints
  – Low locality, reduced energy consumption, optimize TCO
• Selective Bitline Activation (SBA)
  – Minimal design changes
  – Considerable dynamic energy reductions for small latency and area penalties
• Single Subarray Access (SSA)
  – Significant changes to memory interface
  – Large dynamic and static energy savings
• Chipkill-level reliability
  – Reduced energy and storage overheads for reliability

Outline
• DRAM systems overview
• Selective Bitline Activation (SBA)
• Single Subarray Access (SSA)
• Chipkill-level reliability
• Conclusion

Basic Organization
[Figure: DRAM hierarchy, with labels: Array, 1/8th of the row buffer, One word of data output, DRAM chip or device, Bank, Rank, DIMM, Memory bus or channel, On-chip Memory Controller]

Basic DRAM Operation
[Figure: four DRAM chips, one bank shown in each chip, with labels: RAS, CAS, Row Buffer, Cache Line]

Selective Bitline Activation
• Activate only those bitlines corresponding to the requested cache line, reducing dynamic energy
  – Some area overhead, depending on access granularity
  – We pick a granularity of 16 cache lines, for 12.5% area overhead
• Requires no changes to the interface and minimal control changes

Key Idea
• Incandescent light bulb ($0.30, 60W)
  – Low purchase cost
  – High operating cost
  – Commodity
• Energy-efficient light bulb ($3.00, 13W)
  – Higher purchase cost
  – Much lower operating cost
  – Value-addition
• It’s worth a small increase in capital costs to gain large reductions in operating costs
• And not 10X, just 15-20%!
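
The light bulb comparison on the Key Idea slide can be worked through numerically. The sketch below takes the purchase prices and wattages from the slide; the electricity price (`ASSUMED_USD_PER_KWH`) and the function names are assumptions introduced only for illustration, so the break-even point shifts with whatever price is plugged in.

```python
# Purchase prices and wattages come from the slide.
INCANDESCENT = {"price_usd": 0.30, "watts": 60}
EFFICIENT = {"price_usd": 3.00, "watts": 13}

# Assumption, not from the slides: a flat electricity price.
ASSUMED_USD_PER_KWH = 0.10


def total_cost(bulb, hours, usd_per_kwh=ASSUMED_USD_PER_KWH):
    """Capital cost plus operating cost after `hours` of use."""
    energy_kwh = bulb["watts"] * hours / 1000.0
    return bulb["price_usd"] + energy_kwh * usd_per_kwh


def break_even_hours(cheap, efficient, usd_per_kwh=ASSUMED_USD_PER_KWH):
    """Hours of use after which the efficient bulb's lower operating
    cost has paid back its higher purchase price."""
    extra_capital = efficient["price_usd"] - cheap["price_usd"]
    saved_per_hour = (cheap["watts"] - efficient["watts"]) / 1000.0 * usd_per_kwh
    return extra_capital / saved_per_hour


if __name__ == "__main__":
    print(f"break-even after about {break_even_hours(INCANDESCENT, EFFICIENT):.0f} hours")
    for hours in (100, 1000, 5000):
        print(hours,
              round(total_cost(INCANDESCENT, hours), 2),
              round(total_cost(EFFICIENT, hours), 2))
```

The same capital-versus-operating trade-off is what the talk applies to DRAM: a modest increase in cost-per-bit can be acceptable if it lowers energy costs over the system's lifetime.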

Wishlist of features
• Eliminate overfetch
  – Disregard locality
• Increase opportunities for power-down
• Increase parallelism
• Enable efficient reliability mechanisms

SSA Architecture
[Figure: one DRAM chip on a DIMM, with labels: ADDR/CMD bus, 64 Bytes, Subarray, Bitlines, Bank, Row buffer, Global Interconnect to I/O, DATA bus split into eight 8-bit links, Memory Controller]

SSA Basics
• Entire DRAM chip divided into small subarrays
• Width of each subarray is exactly one cache line
• Fetch entire cache line from a single subarray in a single DRAM chip (SSA)
• Groups of subarrays combined into “banks” to keep peripheral circuit overheads low
• Close page policy and “posted-RAS”, similar to SBA
• Data bus to processor essentially split into 8 narrow buses

SSA Operation
[Figure: four DRAM chips, each divided into subarrays; the address selects one subarray in one chip, which supplies the whole cache line; the other chips are in sleep mode (or serving other parallel accesses)]

SSA Impact
• Energy reduction
  – Dynamic: fewer bitlines activated
  – Static: smaller activation footprint, more and longer spells of inactivity, better power down
• Latency impact
  – Limited pins per cache line: serialization latency
  – Higher bank-level parallelism: shorter queuing delays
• Area increase
  – More peripheral circuitry and I/O at finer granularities: area overhead (< 5%)

Methodology
• Simics based simulator
  – ‘ooo-micro-arch’ and ‘trans-staller’
• FCFS/FR-FCFS scheduling policies
• Address mapping and DRAM models from DRAMSim
• DRAM data from Micron datasheets
• Area/Energy numbers from heavily modified CACTI 6.5
• PARSEC/NAS/STREAM benchmarks
• 8 single-threaded OOO cores, 32 KB L1, 2 MB L2
• 2GHz processor, 400MHz DRAM

Dynamic Energy Reduction
[Figure: relative DRAM energy consumption for Baseline Open Row, Baseline Close Row, SBA, and SSA]
• Moving to close page policy: 73% energy increase on average
• Compared to open page: 3X reduction with SBA, 6.4X with SSA
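
The SSA slides above describe fetching a whole cache line from one subarray of one chip, over one 8-bit slice of the data bus. A minimal sketch of what that implies follows. The 64-byte line and the eight 8-bit links are read off the SSA Architecture slide; the bank, subarray, and row counts, the interleaving order, and the names `SsaLocation`, `map_line`, and `transfers_per_line` are hypothetical choices made only for illustration, not the authors' address mapping.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64   # from the SSA Architecture slide
CHIPS_PER_DIMM = 8      # eight 8-bit links on the data bus, per the slide
LINK_BITS = 8

# Hypothetical organization, chosen only to make the sketch concrete.
BANKS_PER_CHIP = 8
SUBARRAYS_PER_BANK = 32
ROWS_PER_SUBARRAY = 512


@dataclass(frozen=True)
class SsaLocation:
    chip: int
    bank: int
    subarray: int
    row: int


def map_line(physical_addr: int) -> SsaLocation:
    """Map a byte address to the single chip/subarray holding its cache line.

    Consecutive cache lines are interleaved across chips first (an
    assumption), so that neighbouring lines can be fetched in parallel.
    """
    line = physical_addr // CACHE_LINE_BYTES
    chip, line = line % CHIPS_PER_DIMM, line // CHIPS_PER_DIMM
    bank, line = line % BANKS_PER_CHIP, line // BANKS_PER_CHIP
    subarray, line = line % SUBARRAYS_PER_BANK, line // SUBARRAYS_PER_BANK
    row = line % ROWS_PER_SUBARRAY
    return SsaLocation(chip, bank, subarray, row)


def transfers_per_line(bus_bits: int) -> int:
    """Number of bus transfers needed to move one cache line."""
    return CACHE_LINE_BYTES * 8 // bus_bits


if __name__ == "__main__":
    print(map_line(0), map_line(64), map_line(64 * 8))
    # A 64-bit bus spread over all chips versus one 8-bit link per chip.
    print(transfers_per_line(CHIPS_PER_DIMM * LINK_BITS), transfers_per_line(LINK_BITS))
```

The transfer counts illustrate the serialization latency listed under SSA Impact: one narrow link needs eight times as many transfers per line, while the other seven chips remain free to serve other requests or to power down.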

Contributors to energy consumption
[Figure: stacked breakdown (0-100%) for Baseline (open page, FR-FCFS), Baseline (closed row, FCFS), SBA, and SSA; components: Termination Resistors, Global Interconnect, Bitlines, Decoder + Wordline + Senseamps]
• 64 cache lines in baseline
• 16 cache lines in SBA
• 1 cache line in SSA

Static Energy – Power down modes
• Current DRAM chips already support several low-power modes
• Consider the low-overhead power down mode: 5.5X lower energy, 3 cycle wakeup time
• For a constant 5% latency increase
  – 17% low-power operation in the baseline
  – 80% low-power operation in SSA

Latency Characteristics
[Figure: latency in cycles for Baseline Open Page, Baseline Close Page, SBA, and SSA]
• Impact of Open/Close page policy: 17% decrease (10/12) or 28% increase (2/12)
• Posted-RAS adds about 10%
• Serialization/Queuing delay balance in SSA: 30% decrease (6/12) or 40% increase (6/12)

Contributors to Latency
[Figure: stacked breakdown (0-100%) for Baseline (open page, FR-FCFS), Baseline (closed row, FCFS), SBA, and SSA; components: Data Transfer, DRAM Core Access, Rank Switching delay (ODT), Command/Addr Transfer, Queuing Delay]

DRAM Reliability
• Many server applications require chipkill-level reliability: tolerating the failure of an entire DRAM chip
• One example of existing systems
  – 64-bit word requires 8-bit ECC
  – Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field, which is unrecoverable!
  – Reading 72 chips: significant overfetch!
• Chipkill even more of a concern for SSA since entire cache line comes from a single chip

Proposed Solution
[Figure: DIMM of DRAM devices; each row below is one stripe across the devices]
L0 C   L1 C   L2 C   L3 C   L4 C   L5 C   L6 C   L7 C   P0 C
L9 C   L10 C  L11 C  L12 C  L13 C  L14 C  L15 C  P1 C   L8 C
...
P7 C   L56 C  L57 C  L58 C  L59 C  L60 C  L61 C  L62 C  L63 C
L – Cache Line, C – Local Checksum, P – Global Parity
Approach similar to RAID-5

Chipkill design
• Two-tier error protection
• Tier-1 protection: self-contained error detection
  – 8-bit checksum per cache line: 1.625% storage overhead
  – Every cache line read is now slightly longer
• Tier-2 protection: global error correction
  – RAID-like striped parity across 8+1 chips
  – 12.5% storage overhead
• Error-free access (common case)
  – 1 chip reads
  – 2 chip writes: leads to some bank contention
  – 12% IPC degradation
• Erroneous access
  – 9 chip operation

Key Contributions
• Redesign of DRAM microarchitecture
• Substantial chip access energy savings (up to 6X)
• Overall, performance is a wash
• Minor area impact (12% with SBA, 4.5% with SSA)
• Two-tier chipkill-level reliability with minimal energy and storage overheads

Now is the time for new architectures
• Take into account modern constraints
• Energy far more critical today than before
• Cost-per-bit perhaps less important: optimize TCO
  – Operating costs over 3 years = capital acquisition costs
• Memory reliability is important for many server applications
• Memory system’s “right-hand turn” is long overdue
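
The two-tier scheme on the Chipkill design slide can be sketched in a few lines. This is a simplified model under stated assumptions, not the authors' implementation: the slides specify only an 8-bit checksum per cache line and RAID-like parity across 8+1 chips, so the particular checksum function (`checksum8`), the `Stripe` class, and the choice to model a failed chip as returning all zeros are hypothetical. The model also keeps the checksums intact when a chip fails, which a real device would not guarantee.

```python
from functools import reduce

LINE_BYTES = 64
DATA_CHIPS = 8  # plus one parity chip per stripe, as on the slide


def checksum8(line: bytes) -> int:
    """Hypothetical tier-1 checksum: inverted byte sum, truncated to 8 bits."""
    return (~sum(line)) & 0xFF


def xor_lines(lines):
    """Byte-wise XOR of equally sized cache lines (tier-2 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*lines))


class Stripe:
    """Eight cache lines, one per chip, plus their XOR parity line."""

    def __init__(self, lines):
        assert len(lines) == DATA_CHIPS
        assert all(len(line) == LINE_BYTES for line in lines)
        self.data = [bytes(line) for line in lines]
        self.sums = [checksum8(line) for line in self.data]
        self.parity = xor_lines(self.data)

    def read(self, chip):
        """Return (line, chips_touched)."""
        line = self.data[chip]
        if checksum8(line) == self.sums[chip]:
            return line, 1  # common case: checksum passes, one chip read
        # Checksum mismatch: rebuild the line from the other 7 lines + parity.
        others = [self.data[i] for i in range(DATA_CHIPS) if i != chip]
        return xor_lines(others + [self.parity]), DATA_CHIPS + 1

    def write(self, chip, new_line):
        """Update the data chip and the parity chip; returns chips touched."""
        new_line = bytes(new_line)
        self.parity = xor_lines([self.parity, self.data[chip], new_line])
        self.data[chip] = new_line
        self.sums[chip] = checksum8(new_line)
        return 2


if __name__ == "__main__":
    lines = [bytes([i] * LINE_BYTES) for i in range(DATA_CHIPS)]
    stripe = Stripe(lines)
    print(stripe.read(3)[1])              # error-free read
    stripe.data[3] = bytes(LINE_BYTES)    # model chip 3 failing (all zeros)
    recovered, touched = stripe.read(3)
    print(recovered == lines[3], touched)
```

The chip counts returned by `read` and `write` mirror the slide: one chip for an error-free read, two for a write, and nine when the checksum flags an error and the line has to be reconstructed from parity.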