Class Presentation:
A Performance Comparison of
Contemporary DRAM Architectures
Vinodh Cuppu, Bruce Jacob
Brian Davis, Trevor Mudge
University of Maryland
University of Michigan
About the Authors
Trevor Mudge
•Professor of EE and CS at University of
Michigan
•Ph.D: University of Illinois
•Research:
•Computer systems design
•Parallel processing
•Computer-aided design
•Impact of technology on computer
architecture
About the Authors
Brian Davis
•Professor of E & C Engineering at
Michigan Technological University
•Ph.D: University of Michigan, Nov 2000
•M.S. in CE at University of Michigan, Nov
1991
•Research:
•New types of Hardware Description
Language; specifically to enable more
systematic methods for designing
powerful DRAM architectures.
About the Authors
Bruce Jacob
•Professor of E & C Engineering and the
Institute for Advanced Computer Studies
at University of Maryland
•Ph.D: University of Michigan, 1997
•M.S. in CS & E at University of Michigan,
Nov 1995
•A.B. in Math, cum laude at Harvard
University, 1988
•Current Research:
•Energy usage and voltage scaling in
embedded systems
About the Authors
Vinodh Cuppu
•Digital IC Logic Designer at
Xtremespectrum, Inc.
•M.S. in E & C Engineering at University of
Maryland, Aug 2000
•B.E. in E & Communication Engineering at
University of Madras, India, May 1997
•Research:
•Has published many well-regarded
papers on DRAM and continues to
model DRAM in different environments,
specifically to evaluate its suitability for
embedded applications
Abstract
In response to the growing gap between processor speed and main
memory access time, many new DRAM architectures have been created.
This paper tests the performance of a representative set of these
architectures to see how well each responds to this trend.
The architectures tested are:
• Fast Page Mode
•Extended Data Out
• Synchronous Link
• Synchronous
• Rambus
• Enhanced Synchronous
• Direct Rambus
Conventional DRAM
Questions
1. What is the effect of improvements in DRAM technology on
the memory latency and bandwidth problems?
2. Where is time spent in the primary memory system? What is
the performance benefit of exploiting the page mode of
contemporary DRAM?
3. How much locality is there in the address stream that reaches
the primary memory system?
Observations
1. There is a one-time tradeoff between cost, bandwidth and latency…
• multiple DRAMs on same bus with bus optimizations (|request| ~> |transfer|)
• anything better requires a faster bus and a faster DRAM core
2. future bus technologies will expose row access time as the
primary performance bottleneck…
• widening buses present a clearer view of locality, so row hits are vital
3. buses …cannot halve the latency of a bus half as wide
• even though the best latencies are seen from buses as wide as the L2
cache, they aren’t quite cost effective
4. ...critical word first does not mix well with burst mode
• burst mode is likely to deliver unneeded data when the starting block is out of
address order
5. …the refresh mechanism used can significantly alter the average
memory access time
• can add wait cycles to row and column access
Architectures: Fast Page Mode
• Holds row open after
first column is sent, in
optimistic hope that the
next access will be for
a different column in
the same row.
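A minimal sketch (hypothetical timings, not taken from the paper) of the tradeoff this policy makes: an open-page policy pays only a column access on a row hit but precharge + row + column on a miss, while a close-page policy always pays row + column.

# Minimal sketch: open-page (hold the row open, FPM-style) vs. close-page
# (precharge immediately after each access). All timings are hypothetical.

T_PRECHARGE = 20   # ns, hypothetical precharge time
T_ROW       = 30   # ns, hypothetical row access (RAS) time
T_COL       = 30   # ns, hypothetical column access (CAS) time

def open_page_latency(row_hit_rate):
    """Row hit: pay only the column access.
    Row miss: pay precharge + row + column, since the wrong row was left open."""
    hit  = T_COL
    miss = T_PRECHARGE + T_ROW + T_COL
    return row_hit_rate * hit + (1 - row_hit_rate) * miss

def closed_page_latency():
    """Every access pays row + column; precharge was already done off the critical path."""
    return T_ROW + T_COL

for hit_rate in (0.2, 0.5, 0.8):
    print(f"hit rate {hit_rate:.0%}: open-page {open_page_latency(hit_rate):.0f} ns, "
          f"closed-page {closed_page_latency():.0f} ns")

With low row-hit rates the close-page policy wins; this is the same distinction the authors later model as FPM1 (pessimistic) vs FPM2 (optimistic).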
Architectures: Extended Data Out
• Added data latch holds column
data immediately after sensing. This
allows another transaction or a
refresh to begin as soon as the
column access is done
Architectures: Synchronous DRAM
• Often has a
programmable
buffer so it can
return data
over multiple
cycles per
request,
making data
available
every clock
cycle.
• Transmits on clock cycles, making timing
strobes from the memory controller
unnecessary.
Architectures: Enhanced SDRAM and
Synchronous Link DRAM
Enhanced SDRAM
• faster internal timing
• SRAM row caches added to allow EDO-like behavior, namely the
ability to satisfy requests for the cached row while freeing the bank
up to do other things.
Synchronous Link DRAM
• open architecture standard, developed through the IEEE
• uses a packetized split request/response protocol
• most significantly, it can support multiple concurrent transactions (if
they reference unique banks)
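A small sketch of the "unique banks" constraint on concurrency; the bank count and address-to-bank mapping here are hypothetical, not SLDRAM's actual scheme.

# Sketch: two transactions can overlap only if they map to different internal
# banks. Hypothetical 8-bank mapping by row address bits.

N_BANKS  = 8
ROW_SIZE = 2048  # bytes per row, hypothetical

def bank_of(addr):
    return (addr // ROW_SIZE) % N_BANKS

def can_overlap(addr_a, addr_b):
    return bank_of(addr_a) != bank_of(addr_b)

print(can_overlap(0x0000, 0x0800))   # different banks -> True, can proceed in parallel
print(can_overlap(0x0000, 0x4000))   # same bank -> False, must serialize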
Architectures: Rambus DRAM
• Uses a multiplexed address/data bus, so a new transaction can begin only once every 4 cycles.
• Transmits on both the rising and falling clock edges, reaching a theoretical maximum of
600 Megabytes per second.
• Due to internal division of banks, up to 4 rows can remain open at once.
Architectures: Direct Rambus DRAM
•Faster core and transmission on both clock edges yields a
theoretical maximum bandwidth of 1.6 Gigabytes per second.
• Divided into 16 banks, employing 17 half-row buffers shared
between adjacent banks, limiting the number of banks that can process
transactions in parallel but also reducing die area and cost.
• Uses a 3-byte-wide channel as opposed to Rambus' single-byte-wide
channel, sending instructions over one byte width and data over the
other two.
• Most importantly, Direct Rambus does not multiplex its bus and
has its internal structures arranged in such a manner that it can
service up to 3 transactions at the same time.
Methodology: Basis
• Extensions written for SimpleScalar, an aggressive out-of-order processor
simulator, so that it would model the DRAM architectures described.
• Much of the memory access time is overlapped with instruction execution in
SimpleScalar, so two extra simulations were run: one where bus transmission
is instantaneous and another where the entire memory operation is instantaneous.
The following formulae were then applied to the results:
Tp = time processing, Tl = memory latency stalls, To = overlapped mem. access,
Tu = exec. time with instantaneous bandwidth, Tm = total mem. access time,
Tb = memory bandwidth stalls, T = total real execution time
• Tl = Tu – Tp
• Tb = T – Tu
• To = Tp – (T – Tm)
Now memory access time can be
separated out into different categories of
stalls and the amount of time bandwidth
and latency were overlapped.
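A small sketch of how the three simulation runs combine into this breakdown; only the formulae come from the slide, the input numbers are made up.

# Execution-time decomposition used in the study. Inputs are hypothetical;
# in the paper they come from three SimpleScalar runs:
#   T  = real run, Tu = run with instantaneous bus transmission (infinite bandwidth),
#   Tp = run with a perfect (instantaneous) memory system,
#   Tm = total memory access time observed in the real run.

def breakdown(T, Tu, Tp, Tm):
    Tl = Tu - Tp          # stalls due to memory latency
    Tb = T - Tu           # stalls due to limited memory bandwidth
    To = Tp - (T - Tm)    # memory access time overlapped with execution
    return {"processing": Tp, "latency stalls": Tl,
            "bandwidth stalls": Tb, "overlapped": To}

# Example with made-up cycle counts:
print(breakdown(T=1_000, Tu=850, Tp=600, Tm=500))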
Methodology: Simulated architecture
• Timing information for DRAM parts was found in technical reports.
• Ran the simulated L2 cache at speeds of 100ns, 10ns and 1ns, scaling the
CPU speed to match (CPU speed = 10x L2 speed).
• Simulated architecture:
Processor: eight-way superscalar, out of order
Caches: L1: Lockup-free split (64K/64K), 2-way set
associative with 64-byte linesizes
L2: unified 1MB, 4-way set associative with a 128-byte
linesize and write back, lockup-free, but only allows one
outstanding request at a time
• This represents a common workstation
of the time (1999).
Methodology: Balancing the architectures
Interleaving:
• Since the request size is 8 times the transfer size in the simulated organization
chosen, DRAM access is a pipelined operation. The other DRAMs would gain an
unfair advantage over FPM and EDO DRAM, since those two are not interleaved. The
authors therefore separately modeled interleaved versions that keep the memory data
bus as busy as possible. These versions are labeled FPM3 and EDO2.
•FPM1 is ‘pessimistic’, it closes the accessed row and precharges immediately.
•FPM2 is ‘optimistic’, it holds the accessed row open and delays precharge.
Bus Structure:
• SLDRAM, RDRAM, and DRDRAM all use narrower, higher-speed buses and
are simulated in series on a single-width bus. This incurs extra latency, since the
simulated memory controller has to coalesce bus packets into properly sized blocks
to send over the wider common bus used for the rest of the simulations. To
ameliorate this, transfer time over the narrow channel is taken to be instantaneous.
Preliminary Results: Refresh Handling
• DRAM refresh can affect
performance dramatically
• All DRAMs but Rambus
have 64ms refresh time
• Rambus has a 33ms
refresh time and can
refresh internal banks
individually rather than an
entire matrix at a time.
• This is the basis for
observation 5.
• Since the time-interspersed scheme is so much better, it
was used for all the DRAMs. This puts all the architectures
on a more even footing.
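A back-of-the-envelope sketch (all parameters hypothetical, not the paper's) of why the time-interspersed scheme wins: the DRAM is busy refreshing for the same total fraction of time either way, but each blocking interval is far shorter, so a randomly arriving request waits far less.

# Compare bulk refresh (whole array in one burst) against interspersed refresh
# (one row at a time). Hypothetical parameters.

REFRESH_PERIOD = 64e-3   # s, refresh every 64 ms
N_ROWS         = 4096    # hypothetical number of rows to refresh each period
T_ROW_REFRESH  = 100e-9  # s, hypothetical time to refresh one row

def expected_wait(burst_rows):
    """Expected extra wait for a randomly arriving request when refresh is done
    in bursts of `burst_rows` rows. The busy fraction is identical in both
    schemes; only the length of each blocking interval changes."""
    burst_len     = burst_rows * T_ROW_REFRESH
    bursts        = N_ROWS / burst_rows
    busy_fraction = bursts * burst_len / REFRESH_PERIOD
    return busy_fraction * (burst_len / 2)   # on average, wait half a burst

print("bulk refresh        :", expected_wait(N_ROWS))  # one long burst per period
print("interspersed (1 row):", expected_wait(1))        # many tiny bursts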
Results: Total Execution time
• Interleaved DRAMs do
much better (FPM3 &
EDO2)
• Pessimistic FPM1 does
better than Optimistic
FPM2 since refresh takes
a little longer than row
access.
• Are newer DRAMs
having trouble keeping up
with CPU speed?
• Is memory bandwidth
really the biggest
contributor to DRAM
slowdown?
A lot has been done to increase memory bandwidth, but what about latency?
Results: Performance breakdown
• FPM is the slowest
• Interleaving is good, as is pessimistic
strategy
• EDO uses basically the same
technology as FPM, but is faster due
to better architecture
• SDRAM is faster still and ESDRAM
is even better since it tweaks timing
and adds a SRAM cache to improve
concurrency
• SLDRAM and Rambus have
higher access time compared with
SDRAM and ESDRAM due to bus
packing
• SLDRAM and RDRAM make twice
as many data transfers as
DRDRAM, and if “…they had been
organized… to put them on an even
footing with DRDRAM… their
latencies would be 20 to 30%
lower.”
“The parallel-channel results demonstrate the
failure of a 100MHz 128-bit bus to keep up with
today’s fastest parts.”
Results: Parallel channel DRAM and bandwidth
• The parallel bus
architectures (SLDRAM,
RDRAM and DRDRAM)
have a much larger
proportion of their access
time tied up in Bus
Transmission Time.
• Speeding up the bus
would make these run
faster, and bus speeds
have in fact increased
roughly fourfold since
this paper was written.
• What is the effect though?
• With Bus Transmission Time decreased,
latency becomes the largest proportional
slowdown… and fixing it is much harder.
Conclusion: Questions answered
1. Effect of DRAM improvements?
• Bandwidth problem is being addressed, as newer architectures support
multiple concurrent transactions, multiple concurrent accesses and/or
multiple bus channels.
• Latency is not being addressed and will become more of a problem.
2. Where is time spent?
• Most time is spent in bus transmission, which needs to be improved.
How much does Page Mode help?
• There is a significant degree of locality in addresses accessed, so
DRAMs that are internally multi-banked (so they have more than one row
buffer) seem to do better… Page Mode then is useful, but other factors can
get in the way. (FPM1 vs FPM2)
Conclusion: Questions answered
3. How much locality is there in the address stream?
• Quite a lot, actually, but the effect doesn’t scale well with large buffers.
(chart: Hits in Victim-Row FIFO Buffer for FPM DRAM)
Conclusion: Brass Tacks
• Bandwidth is a major slowdown for modern DRAM, but today we know that
this is easily fixable, as we now have system buses of 400MHz. According to the
results in this paper, DRAM latency is now the big problem, and it isn't very easy
or cheap to fix. This paper's prophecy has come to pass.
• The box of tricks is getting empty… Techniques like interleaving, multiple
transactions, multiple channels and other such bandwidth-dependent speedups
are becoming harder to find.
• Need to devise ways to improve latency now:
• different technology?
• better exploitation of locality (address prediction) ?
• more internal division?
• multilevel internal caches?
• a different type of storage matrix?
The…. End?
Epilogue: The bleeding edge
One of the newest made-for-mainstream memories, GDDR3
• Micron delivered the first samples of GDDR3 to Nvidia and ATI on August 8,
2003
• The word is that ATI and Nvidia will have new top of the line graphics cards
out, exploiting the new GDDR3 DRAM, by Q4 2003 or Q1 2004.
• So, what does it do better?
• .11 micron process (!)
• Variable Write Latency
• On-Die Termination
• 6.4 GB/s data rate
• Posted CAS
• operating voltage is half that of
GDDR2
Epilogue: Revolutionary or evolutionary?
Warning… high data rate, ready skepticism…
• On-Die Termination: damps out reflections caused by signals hitting the
ends of their lines… safe.
• .11 micron process: fit more logic in a small area and drive it with less
power… safe
• Low operating voltage: pleasant effect of small process size… safe
• 6.4 GB/s data rate, Variable Write Latency, Posted CAS, and clock rate of up
to 800 MHz… danger!
Epilogue: Revolutionary or evolutionary?
Posted CAS:
Adds latency cycles (additive latency, AL) to the Column Select so that ACT
(Row Select) commands don't collide with Column Selects, allowing faster internal clocking.
Variable Write Latency:
Adds latency to the Write operation so that it doesn't corrupt a Read operation
(read-after-write dependency). Write latency = CAS latency + AL (Posted CAS) − 1
(see the worked example at the end of this slide). Not entirely as bad as it sounds
(due to interleaving), so it might only triple Write latency instead of sextupling it!
High DRAM clock:
Makes all this additive latency necessary. Now
architects are intentionally adding latency.
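A short worked example of the write-latency formula quoted above; the CAS latency and AL values are hypothetical, chosen only to show the arithmetic.

# Worked example of the write-latency rule above (hypothetical values).
# With posted CAS: write latency = CAS latency + additive latency (AL) - 1.

CAS_LATENCY = 5   # cycles, hypothetical
AL          = 4   # cycles of additive (posted-CAS) latency, hypothetical

write_latency = CAS_LATENCY + AL - 1
print(f"write latency = {CAS_LATENCY} + {AL} - 1 = {write_latency} cycles")
# Versus the fixed 1-cycle write latency of first-generation DDR, that is a big
# jump in cycles, though interleaving hides part of it in practice.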
Epilogue: Verdict?
Evolutionary… at best
• GDDR3 appears to be sending us in the wrong direction, pumping up
bandwidth at the expense of latency by upping the clock speed and adding
tweaks to make sure the data stays consistent.
• Bandwidth may be preferable over latency in a graphics processor to deliver
increased frame rate, but the GDDR2 architecture acted as a roadmap for the
DDR2 primary memory architecture.
• Are there any low-latency primary memory
architectures in development… and if so do any of
them have a chance at survival in the market?
• Ask me after December 8th.
The End
(and this time I mean it)