A Performance Comparison of Contemporary DRAM Architectures
Vinodh Cuppu, Bruce Jacob (University of Maryland); Brian Davis, Trevor Mudge (University of Michigan)

About the Authors: Trevor Mudge
• Professor of EE and CS at the University of Michigan
• Ph.D.: University of Illinois
• Research:
  • Computer systems design
  • Parallel processing
  • Computer-aided design
  • Impact of technology on computer architecture

About the Authors: Brian Davis
• Professor of Electrical & Computer Engineering at Michigan Technological University
• Ph.D.: University of Michigan, Nov 2000
• M.S. in Computer Engineering, University of Michigan, Nov 1991
• Research:
  • New types of hardware description language, specifically to enable more systematic methods for designing powerful DRAM architectures

About the Authors: Bruce Jacob
• Professor of Electrical & Computer Engineering and the Institute for Advanced Computer Studies at the University of Maryland
• Ph.D.: University of Michigan, 1997
• M.S. in CS & E, University of Michigan, Nov 1995
• A.B. in Mathematics, cum laude, Harvard University, 1988
• Current research:
  • Energy usage and voltage scaling in embedded systems

About the Authors: Vinodh Cuppu
• Digital IC logic designer at XtremeSpectrum, Inc.
• M.S. in Electrical & Computer Engineering, University of Maryland, Aug 2000
• B.E. in Electronics & Communication Engineering, University of Madras, India, May 1997
• Research:
  • Has published several well-regarded papers on DRAM and continues to model DRAM in different environments, specifically to see whether it could be used in embedded applications

Abstract
In response to the growing gap between processor speed and main-memory access time, many new DRAM architectures have been created. This paper tests the performance of a representative set of these architectures to see how they all respond to this trend. The architectures tested are:
• Fast Page Mode
• Extended Data Out
• Synchronous Link
• Synchronous
• Rambus
• Enhanced Synchronous
• Direct Rambus

Conventional DRAM (series of diagram slides)

Questions
1. What is the effect of improvements in DRAM technology on the memory latency and bandwidth problems?
2. Where is time spent in the primary memory system? What is the performance benefit of exploiting the page mode of contemporary DRAM?
3. How much locality is there in the address stream that reaches the primary memory system?

Observations
1. There is a one-time tradeoff between cost, bandwidth and latency…
  • multiple DRAMs on the same bus with bus optimizations (|request| ~> |transfer|); anything better requires a faster bus and core
2. future bus technologies will expose row access time as the primary performance bottleneck…
  • widening buses present a clearer view of locality, so row hits are vital
3. …a bus cannot halve the latency of a bus half as wide
  • even though the best latencies are seen from buses as wide as the L2 cache, they aren't quite cost effective
4. …critical-word-first does not mix well with burst mode
  • a burst that starts at the critical word returns the rest of the block out of address order and is likely to deliver unneeded data
5. …the refresh mechanism used can significantly alter the average memory access time
  • refresh can add wait cycles to both row and column access

Architectures: Fast Page Mode
• Holds the row open after the first column access, in the optimistic hope that the next access will be to a different column of the same row.

Architectures: Extended Data Out
• An added data latch holds the column data immediately after sensing. This allows another transaction, or a refresh, to begin as soon as the column access is done.
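To make the row-buffer behavior described above concrete, here is a minimal sketch of an open-page policy. The timing values and the OpenPageBank model are illustrative assumptions, not the parameters or code used in the paper's simulations.

```python
# Minimal open-page (Fast Page Mode style) latency model.
# All timing values (in bus cycles) are illustrative assumptions,
# not the parameters used in the paper.
T_RP, T_RCD, T_CAS = 3, 3, 3        # precharge, row activate, column access

class OpenPageBank:
    def __init__(self):
        self.open_row = None         # row currently held in the sense amps

    def access(self, row):
        if self.open_row == row:     # row hit: only a column access is needed
            return T_CAS
        # row miss: precharge the old row (if any), then activate the new one
        latency = (T_RP if self.open_row is not None else 0) + T_RCD + T_CAS
        self.open_row = row          # optimistic policy: keep the new row open
        return latency

bank = OpenPageBank()
for r in [5, 5, 5, 9, 5]:
    print(r, bank.access(r))         # repeated rows cost T_CAS, row switches cost more
```

A close-page (pessimistic) policy would instead precharge after every access, trading the row-hit savings for a shorter worst case; the FPM1/FPM2 variants discussed later make exactly this tradeoff.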
Architectures: Synchronous DRAM
• Often has a programmable buffer, so it can return data over multiple cycles per request, making data available every clock cycle.
• Transmits relative to the clock, making timing strobes from the memory controller unnecessary.

Architectures: Enhanced SDRAM and Synchronous Link DRAM
Enhanced SDRAM:
• faster internal timing
• SRAM row caches added to allow EDO-like behavior, namely the ability to satisfy requests for the cached row while freeing the bank up to do other things

Synchronous Link DRAM:
• an open architecture, standardized through the IEEE
• uses a packetized split request/response protocol
• most significantly, it can support multiple concurrent transactions (if they reference distinct banks)

Architectures: Rambus DRAM
• Uses a multiplexed address/data bus, so it limits communication to once every 4 cycles.
• Transmits on both the rising and falling clock edges, reaching a theoretical maximum of 600 megabytes per second.
• Due to its internal division into banks, up to 4 rows can remain open at once.

Architectures: Direct Rambus DRAM
• A faster core and transmission on both clock edges yield a theoretical maximum bandwidth of 1.6 gigabytes per second.
• Divided into 16 banks employing 17 half-row buffers, each shared between a pair of adjacent banks; this limits the number of banks that can process transactions in parallel but also reduces the die size.
• Uses a 3-byte-wide channel as opposed to Rambus' single-byte-wide channel, sending instructions over one byte and data over the other two.
• Most importantly, Direct Rambus does not multiplex its bus, and its internal structures are arranged so that it can service up to 3 transactions at the same time.

Methodology: Basis
• Extensions were written for SimpleScalar, an aggressive out-of-order processor simulator, so that it models the DRAM architectures described.
• A lot of the memory access time is overlapped with instruction execution in SimpleScalar, so two extra simulations were run: one where bus transmission is instantaneous and another where memory operation is instantaneous. The following formulae are then applied to the results:
  Tp = time spent processing, Tl = memory latency stalls, To = overlapped memory access time,
  Tu = execution time with instantaneous bandwidth, Tm = total memory access time,
  Tb = memory bandwidth stalls, T = total real execution time
  • Tl = Tu − Tp
  • Tb = T − Tu
  • To = Tp − (T − Tm)
Memory access time can now be separated into the different categories of stalls plus the amount of time during which memory access was overlapped with processing.
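This decomposition can be illustrated with a minimal sketch; the cycle counts below are made-up numbers chosen only to show how the formulas combine, not results from the paper.

```python
# Decompose total execution time into processing, latency stalls,
# bandwidth stalls, and overlapped memory time, following the formulas
# above. The cycle counts are made-up illustrative numbers.
T  = 1_000_000   # real execution time (cycles)
Tp =   550_000   # execution time with a perfect (single-cycle) memory system
Tu =   800_000   # execution time with instantaneous bus transmission
Tm =   600_000   # total time spent in memory accesses during the real run

Tl = Tu - Tp           # stalls attributable to memory latency
Tb = T - Tu            # stalls attributable to limited memory bandwidth
To = Tp - (T - Tm)     # memory time hidden under (overlapped with) processing

print(f"latency stalls   = {Tl}")
print(f"bandwidth stalls = {Tb}")
print(f"overlapped time  = {To}")
assert Tp + Tl + Tb == T   # processing plus stalls account for all real cycles
```

By construction, Tp + Tl + Tb equals the real execution time T, which is why the results can be reported as a stacked breakdown of processing, latency stalls, and bandwidth stalls.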
Methodology: Simulated architecture
• Timing information for the DRAM parts was taken from technical reports.
• The simulated L2 cache was run at speeds of 100 ns, 10 ns and 1 ns, with the CPU speed scaled to match (CPU speed = 10x L2 speed).
• Simulated architecture:
  Processor: eight-way superscalar, out of order
  Caches:
    L1: lockup-free split (64K/64K), 2-way set associative, 64-byte line size
    L2: unified 1 MB, 4-way set associative, 128-byte line size, write-back, lockup-free but allowing only one outstanding request at a time
• This represents a common workstation of the time (1999).

Methodology: Balancing the architectures
Interleaving:
• Since the request size is 8 times the transfer size in the simulated organization chosen, DRAM access is a pipelined operation, and the other DRAMs would gain an unfair advantage over FPM and EDO DRAM, which are not interleaved. The authors therefore separately modeled interleaved versions that keep the memory data bus as full as possible; these versions are labeled FPM3 and EDO2.
• FPM1 is 'pessimistic': it closes the accessed row and precharges immediately.
• FPM2 is 'optimistic': it holds the accessed row open and delays precharge.
Bus structure:
• SLDRAM, RDRAM, and DRDRAM all use narrower, higher-speed buses and are simulated serially on a single-width channel. This incurs a little extra latency, since the simulated memory controller has to coalesce the bus packets into properly sized blocks to send over the common, wider bus used for the rest of the simulations. To ameliorate this, transfer time over the narrow channel is taken to be instantaneous.

Preliminary Results: Refresh Handling
• DRAM refresh can affect performance dramatically.
• All of the DRAMs except Rambus have a 64 ms refresh time.
• Rambus has a 33 ms refresh time and can refresh internal banks individually rather than an entire matrix at a time.
• This is the basis for observation 5.
• Since the time-interspersed scheme (spreading row refreshes evenly over the refresh period rather than refreshing in one burst) is so much better, it was used for all the DRAMs. This puts all the architectures on a more even footing.

Results: Total execution time
• Interleaved DRAMs do much better (FPM3 & EDO2).
• Pessimistic FPM1 does better than optimistic FPM2, since refresh takes a little longer than row access.
• Are newer DRAMs having trouble keeping up with CPU speed?
• Is memory bandwidth really the biggest contributor to DRAM slowdown? A lot has been done to increase memory bandwidth, but what about latency?

Results: Performance breakdown
• FPM is the slowest.
• Interleaving helps, as does the pessimistic strategy.
• EDO uses basically the same technology as FPM but is faster due to its better architecture.
• SDRAM is faster still, and ESDRAM is better yet, since it tweaks the timing and adds an SRAM cache to improve concurrency.
• SLDRAM and Rambus have higher access times than SDRAM and ESDRAM due to bus packing (coalescing their narrow-channel packets onto the wider bus).
• SLDRAM and RDRAM make twice as many data transfers as DRDRAM, and if "…they had been organized… to put them on an even footing with DRDRAM… their latencies would be 20 to 30% lower."
• "The parallel-channel results demonstrate the failure of a 100MHz 128-bit bus to keep up with today's fastest parts."

Results: Parallel-channel DRAM and bandwidth
• The parallel-bus architectures (SLDRAM, RDRAM and DRDRAM) have a much larger proportion of their access time tied up in bus transmission time.
• Speeding up the bus would make these run faster, and bus speeds have in fact increased about fourfold since this paper was written.
• What is the effect, though?
• With bus transmission time decreased, latency becomes the largest proportional slowdown… and fixing it is much harder.

Conclusion: Questions answered
1. Effect of DRAM improvements?
• The bandwidth problem is being addressed, as newer architectures support multiple concurrent transactions, multiple concurrent accesses and/or multiple bus channels.
• Latency is not being addressed and will become more of a problem.
2. Where is time spent?
• Most time is spent in bus transmission, which needs to be improved.
How much does page mode help?
• There is a significant degree of locality in the addresses accessed, so DRAMs that are internally multi-banked (and so have more than one row buffer) seem to do better… Page mode, then, is useful, but other factors can get in the way (FPM1 vs FPM2).

Conclusion: Questions answered
3. How much locality is there in the address stream?
• Quite a lot, actually, but the effect doesn't scale well with large buffers.
(Figure: Hits in Victim-Row FIFO Buffer for FPM DRAM)
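As an illustration of how such locality might be counted, here is a minimal sketch that measures hits in an N-entry FIFO of recently touched DRAM rows. The 2 KB row size, the FIFO policy, and the toy address stream are arbitrary assumptions, not the paper's victim-row buffer implementation or its data.

```python
from collections import deque

# Count row-buffer locality in an address stream: an access "hits" if its
# DRAM row was touched recently enough to still sit in an N-entry FIFO.
# The 2 KB row size and the toy address stream are arbitrary assumptions.
ROW_BYTES = 2048

def victim_fifo_hit_rate(addresses, depth):
    fifo = deque(maxlen=depth)          # most recently touched rows
    hits = 0
    for addr in addresses:
        row = addr // ROW_BYTES
        if row in fifo:
            hits += 1
        else:
            fifo.append(row)            # evicts the oldest row when full
    return hits / len(addresses)

stream = [0x1000, 0x1040, 0x1080, 0x9000, 0x10c0, 0x9040, 0x20000]
for depth in (1, 2, 4, 8):
    print(depth, victim_fifo_hit_rate(stream, depth))
```

In this toy stream, depths beyond two add no further hits, loosely mirroring the observation that the benefit does not scale well with larger buffers.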
Conclusion: Brass Tacks
• Bandwidth is a major slowdown for modern DRAM, but today we know this is readily fixable, since we now have system buses running at 400 MHz. According to the results in this paper, DRAM latency is now the big problem, and it is neither easy nor cheap to fix. This paper's prophecy has come to pass.
• The box of tricks is getting empty… Techniques like interleaving, multiple transactions, multiple channels and other such bandwidth-dependent speedups are becoming harder to find.
• We need to devise ways to improve latency now:
  • a different technology?
  • better exploitation of locality (address prediction)?
  • more internal division?
  • multilevel internal caches?
  • a different type of storage matrix?

The…. End?

Epilogue: The bleeding edge
One of the newest made-for-mainstream memories: GDDR3
• Micron delivered the first samples of GDDR3 to Nvidia and ATI on August 8, 2003.
• The word is that ATI and Nvidia will have new top-of-the-line graphics cards out, exploiting the new GDDR3 DRAM, by Q4 2003 or Q1 2004.
• So, what does it do better?
  • 0.11-micron process (!)
  • Variable write latency
  • On-die termination
  • 6.4 GB/s data rate
  • Posted CAS
  • operating voltage is half that of GDDR2

Epilogue: Revolutionary or evolutionary?
Warning… high data rate, ready skepticism…
• On-die termination: damps out reflections caused by signals hitting their terminals… safe.
• 0.11-micron process: fits more logic into a small area and drives it with less power… safe.
• Low operating voltage: a pleasant side effect of the small process size… safe.
• 6.4 GB/s data rate, variable write latency, posted CAS, and a clock rate of up to 800 MHz… danger!

Epilogue: Revolutionary or evolutionary?
Posted CAS: adds latency cycles to the column select so that READ/ACT (row select) commands don't collide with column selects, allowing faster internal clocking.
Variable write latency: adds latency to the write operation so that it doesn't corrupt a read operation (RAW dependency). Write latency = CAS latency + AL (posted CAS) − 1; for example, CL = 5 with AL = 3 gives a write latency of 7 cycles. Not entirely as bad as it sounds (thanks to interleaving), so it might only triple write latency instead of sextupling it!
High DRAM clock: makes all this additive latency necessary. Architects are now intentionally adding latency.

Epilogue: Verdict? Evolutionary… at best
• GDDR3 is sending us in what appears to be the wrong direction, pumping up bandwidth at the expense of latency by raising the clock speed and adding tweaks to keep the data consistent.
• Bandwidth may be preferable to latency in a graphics processor, to deliver higher frame rates, but the GDDR2 architecture acted as a roadmap for the DDR2 primary-memory architecture.
• Are there any low-latency primary memory architectures in development… and if so, do any of them have a chance of surviving in the market?
• Ask me after December 8th.

The End (and this time I mean it)