An introduction to SDRAM and memory controllers
5kk73
Outline
► Part 1: DRAM and controller basics
– DRAM architecture and operation
– Timing constraints
– DRAM controller
► Part 2: DRAMs in embedded systems
– Challenges in sharing DRAMs
– Real-time guarantees with DRAMs
– Future DRAM architectures and controllers
Memory device
► “A device that preserves information for retrieval” - Web definition
Semiconductor memories
► “Semiconductor memory is an electronic data storage device, often used as computer memory, implemented on a semiconductor-based integrated circuit” - Wikipedia definition
► The main characteristics of semiconductor memory are low cost, high density (bits per chip), and ease of use
Semiconductor memory types
► RAM (Random Access Memory)
– DRAM (Dynamic RAM)
• Synchronous DRAM (SDRAM)
– SRAM (Static RAM)
► ROM (Read Only Memory)
– Mask ROM, Programmable ROM (PROM), EPROM (Erasable PROM), UV-EPROM (Ultraviolet EPROM)
► NVRAM (Non-Volatile RAM) or Flash memory
Memory hierarchy
[Figure: memory hierarchy pyramid – registers, L1 cache, L2 cache, off-chip memory, secondary memory (hard disk); capacity grows and access speed drops with distance from the processor]
Memory hierarchy
Module             Memory type used   Access time    Capacity   Managed by
Registers          SRAM               1 cycle        ~500 B     Software/compiler
L1 Cache           SRAM               1-3 cycles     ~64 KB     Hardware
L2 Cache           SRAM               5-10 cycles    1-10 MB    Hardware
Off-chip memory    DRAM               ~100 cycles    ~10 GB     Software/OS
Secondary memory   Disk drive         ~1000 cycles   ~1 TB      Software/OS
Credits: J.Leverich, Stanford
SRAM vs DRAM
► Static Random Access Memory
– Bitlines driven by transistors
– Fast (~10x)
– Large (~6-10x): 6 transistors per bit cell
► Dynamic Random Access Memory
– A bit is stored as charge on a capacitor: 1 transistor and 1 capacitor vs. 6 transistors
– Bit cell loses charge over time (read operation and circuit leakage)
– Must periodically refresh – hence the name Dynamic RAM
Credits: J.Leverich, Stanford
SRAM vs DRAM: Summary
► SRAM is preferable for register files and L1/L2 caches
– Fast access
– No refreshes
– Simpler manufacturing (compatible with logic process)
– Lower density (6 transistors per cell)
– Higher cost
► DRAM is preferable for stand-alone memory chips
– Much higher capacity
– Higher density
– Lower cost
– DRAM is the main focus in this lecture!
Credits: J.Leverich, Stanford
DRAM: Internal architecture
[Figure: address register feeding a row decoder (MS bits) and a column decoder (LS bits); a memory array per bank with sense amplifiers (row buffer); banks 1-4, each with its own row buffer]
► Bit cells are arranged to form a memory array
► Multiple arrays are organized as different banks
– Typical numbers of banks are 4, 8 and 16
► Sense amplifiers raise the voltage level on the bitlines to read the data out
Credits: J.Leverich, Stanford
DRAM: Read access sequence
► Decode row address & drive wordlines
► Selected bits drive bitlines
– Entire row read
► Amplify row data
► Decode column address & select subset of row
► Send to output
► Precharge bitlines for next access
Credits: J.Leverich, Stanford
DRAM: Memory access protocol
[Figure: 2^n row x 2^n column memory array; n address pins are multiplexed between the row address (latched with RAS) and the column address (latched with CAS); the column decoder narrows 2^m bits down to the data pins]
► To reduce pin count, row and column share the same address pins
– RAS = Row Address Strobe
– CAS = Column Address Strobe
► Data is accessed by issuing memory commands
► 5 basic commands
– ACTIVATE
– READ
– WRITE
– PRECHARGE
– REFRESH
Credits: J.Leverich, Stanford
DRAM: Basic operation
[Figure: the row decoder loads a row into the row buffer; the column decoder selects a column from the buffer]
► (Row 0, Column 0) → ACTIVATE Row 0, READ Column 0
► (Row 0, Column 1) → READ Column 1 (row buffer HIT!)
► (Row 0, Column 10) → READ Column 10 (row buffer HIT!)
► (Row 1, Column 0) → PRECHARGE Row 0, ACTIVATE Row 1, READ Column 0 (row buffer MISS!)
Credits: J.Leverich, Stanford
DRAM: Basic operation (Summary)
► Access to an “open row”
– No need to issue ACTIVATE command
– READ/WRITE will access the row buffer
► Access to a “closed row” (see the sketch below)
– If another row is already active, issue PRECHARGE first
– Issue ACTIVATE to open a new row
– READ/WRITE will access the row buffer
– Optional: PRECHARGE after READ/WRITEs finished
• If PRECHARGE issued → closed-page policy
• If not → open-page policy
Credits: J.Leverich, Stanford
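To make these cases concrete, here is a minimal sketch of the command sequence a controller could emit for a single request; the serve() function and the emit() stand-in are illustrative assumptions of my own, and all timing constraints are ignored:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { ACTIVATE, READ, WRITE, PRECHARGE } command;

    /* Stand-in: a real back-end would enqueue the command for the device. */
    static void emit(command c)
    {
        static const char *name[] = { "ACTIVATE", "READ", "WRITE", "PRECHARGE" };
        printf("%s\n", name[c]);
    }

    /* Serve one request to a bank whose open row is *open_row (-1 = bank
     * precharged). close_page selects the closed-page policy. */
    static void serve(int row, bool is_write, int *open_row, bool close_page)
    {
        if (*open_row != row) {            /* closed row: row buffer miss */
            if (*open_row != -1)
                emit(PRECHARGE);           /* another row is active: close it */
            emit(ACTIVATE);                /* open the requested row */
            *open_row = row;
        }
        emit(is_write ? WRITE : READ);     /* open row: access the row buffer */
        if (close_page) {                  /* closed-page: precharge eagerly */
            emit(PRECHARGE);
            *open_row = -1;
        }
    }

Replaying the example above (row 0 three times, then row 1) with close_page = false reproduces the HIT/HIT/MISS pattern of the previous slide.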
DRAM: Burst access
► Each READ/WRITE command can transfer multiple words (8 in DDR3)
► Observe the number of words transferred in a single clock cycle
– Double Data Rate (DDR): one word on each clock edge
Credits: J.Leverich, Stanford
DRAM: Banks
► DRAM chips can consist of multiple banks
– Address = (Bank x, Row y, Column z)
► Banks operate independently, but share command, address and data pins
– Each bank can have a different row active
– Can overlap ACTIVATE and PRECHARGE latencies (i.e., READ to bank 0 while ACTIVATING bank 1) → bank-level parallelism
[Figure: Bank 0 with Row 0 active and Bank 1 with Row 1 active, each with its own row buffer]
Credits: J.Leverich, Stanford
DRAM: Bank-level parallelism
► Enables DRAM accesses to different banks in parallel
– Reduces memory access latency and improves efficiency!
Credits: J.Leverich, Stanford
2Gb x8 DDR3 Chip [Micron]
► Observe the bank organization
Credits: J.Leverich, Stanford
2Gb x8 DDR3 Chip [Micron]
► Observe the row width, the bi-directional data bus and the 64-to-8 data-path
Credits: J.Leverich, Stanford
DDR3 SDRAM: Current standard
► Introduced in 2007
► SDRAM → Synchronous DRAM (clocked)
– DDR = Double Data Rate
• Data transferred on both clock edges
• 400 MHz clock = 800 MT/s (see the bandwidth check below)
– x4, x8, x16 datapath widths
– Minimum burst length of 8
– 8 banks
– 1Gb, 2Gb, 4Gb capacity
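As a quick sanity check on these figures, peak bandwidth follows directly from the clock frequency, the two transfers per cycle, and the datapath width. A minimal sketch; the DDR3-800 x16 configuration is just an illustrative pick:

    #include <stdio.h>

    /* Peak bandwidth in bytes/s: DDR moves one datapath-wide word per clock edge. */
    static double peak_bandwidth(double clock_hz, int datapath_bits)
    {
        double transfers_per_s = 2.0 * clock_hz;        /* both clock edges */
        return transfers_per_s * (datapath_bits / 8.0); /* bytes per transfer */
    }

    int main(void)
    {
        /* DDR3-800: 400 MHz clock, x16 datapath -> 800 MT/s x 2 B = 1.6 GB/s */
        printf("%.1f GB/s\n", peak_bandwidth(400e6, 16) / 1e9);
        return 0;
    }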
DRAM: Timing Constraints
[Timing diagram: command bus carries ACT, NOPs, RD, NOPs, PRE; data D1..Dn follows RD after tRL; tRCD separates ACT and RD, tRAS spans from ACT to PRE, tRP follows PRE]
– tRCD = Row to Column command delay
• Time taken by the charge stored in the capacitor cells to reach the sense amps
– tRAS = Time between RAS and data restoration in the DRAM array (minimum time a row must be open)
– tRP = Time to precharge the DRAM array
► Memory controller must respect the physical device characteristics!
DRAM: Timing Constraints
► There are a bunch of other timing constraints…
– tCCD = Time between column commands
– tWTR = Write-to-read delay (bus turnaround time)
– tCAS = Time between column command and data out
– tWR = Time from end of last write to PRECHARGE
– tFAW = Four-ACTIVATE window (limits current surge)
• Maximum number of ACTIVATEs in this window is limited to four
– tRC = tRAS + tRP = Row “cycle” time
• Minimum time between accesses to different rows
► Timing constraints make performance analysis and memory controller design difficult! (see the bookkeeping sketch below)
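A minimal sketch of how a back-end might track a few of these constraints per bank; the three parameters and the cycle-based bookkeeping are illustrative simplifications of a full DDR3 timing model:

    /* Illustrative per-bank timing bookkeeping, all values in clock cycles;
     * real datasheet values would be loaded into dram_timings at start-up. */
    typedef struct {
        int tRCD; /* ACTIVATE -> READ/WRITE delay */
        int tRAS; /* minimum time a row must stay open */
        int tRP;  /* PRECHARGE -> next ACTIVATE delay */
    } dram_timings;

    typedef struct {
        long last_activate;  /* cycle of the last ACTIVATE to this bank */
        long last_precharge; /* cycle of the last PRECHARGE to this bank */
    } bank_state;

    /* Earliest cycles at which the next commands become legal for this bank. */
    static long earliest_read(const bank_state *b, const dram_timings *t)
    { return b->last_activate + t->tRCD; }

    static long earliest_precharge(const bank_state *b, const dram_timings *t)
    { return b->last_activate + t->tRAS; }

    static long earliest_activate(const bank_state *b, const dram_timings *t)
    { return b->last_precharge + t->tRP; }

A command generator would only issue a command once the current cycle reaches the corresponding bound (and, in a real design, the many cross-bank bounds such as tFAW as well).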
DRAM controller
[Figure: front-end (request scheduler) and back-end (memory map, command generator) between the request stream and the DRAM, driving address and command signals]
► Request scheduler decides which memory request is selected
► Memory map translates logical address → physical address
• Logical address = incoming address
• Physical address = (Bank, Row, Column)
► Command generator issues memory commands respecting the physical device characteristics
Request scheduler
► Many algorithms exist to determine how to schedule memory requests (see the sketch after this list)
– Prefer requests targeting open rows
• Increases the number of row buffer hits
– Prefer read after read and write after write
• Minimizes bus turnaround
– Always prefer reads, since reads are blocking and writes are often posted
• Reduces processor stall cycles
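A minimal sketch of how these preferences might be folded into a single priority score, loosely in the spirit of first-ready scheduling; the request fields and the weights are illustrative assumptions, not a scheduler from the literature:

    #include <stdbool.h>

    typedef struct {
        int  bank, row;
        bool is_read;
        long arrival;  /* issue order, for oldest-first tie-breaking */
    } request;

    /* Higher score = scheduled first. open_row[b] holds the active row of
     * bank b (-1 if precharged); last_was_read tracks the bus direction. */
    static int score(const request *r, const int *open_row, bool last_was_read)
    {
        int s = 0;
        if (open_row[r->bank] == r->row) s += 4; /* row buffer hit */
        if (r->is_read == last_was_read) s += 2; /* avoids bus turnaround */
        if (r->is_read)                  s += 1; /* reads block the processor */
        return s;                                /* break ties by lowest arrival */
    }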
Memory map
► Memory map decodes logical address to physical address
– Physical address is (bank, row, column)
– Decoding is done by slicing the bits in the logical address
– Example: logical address 0x10FF00 → physical address (2, 510, 128)
► Several memory mapping schemes exist
– Continuous, bank-interleaved
Continuous memory map
► Map sequential addresses to the columns in a row
► Switch bank when all columns in a row are visited
► Switch row when all banks are visited
Bank-interleaved memory map
► Maps bursts to different banks in an interleaving fashion
► Active row in a bank is not changed until all columns are visited
Memory map generalization
► Continuous and interleaved memory maps are just 2 possible memory mapping schemes
– In the most general case, an arbitrary set of bits out of the logical address could be used for the row, column and bank address, respectively
► Example memory map (1 burst per bank, 2 banks interleaving, 8 words per burst), bit 26 down to bit 0:
– RRR RRRR RRRR RRBB CCCC CCCB CCCW
– R = row, BB = bank offset, B = bank-interleaving bit, C = column, W = offset within a 16-bit column; the lowest C bits and W together form the burst offset
► Example memory: 16-bit DDR3-1600, 64 MB, 8 banks, 8K rows/bank, 1024 columns/row, 16 bits/column
► The slicing can be done in different ways – the choice affects memory efficiency! (decoded in the sketch below)
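A minimal sketch of how this particular slicing could be decoded in a controller; the field positions follow the bit layout above (LSB first: 1 W bit, 3 C bits, 1 B bit, 7 C bits, 2 B bits, 13 R bits), and the struct and names are my own illustrative choices:

    #include <stdint.h>

    typedef struct { unsigned bank, row, column; } phys_addr;

    static phys_addr decode(uint32_t logical)
    {
        unsigned burst_word = (logical >> 1)  & 0x7;  /* 3 C bits: word in burst */
        unsigned bank_il    = (logical >> 4)  & 0x1;  /* 1 B bit: 2-bank interleave */
        unsigned col_hi     = (logical >> 5)  & 0x7F; /* 7 C bits */
        unsigned bank_off   = (logical >> 12) & 0x3;  /* 2 B bits: bank offset */
        phys_addr p;
        p.row    = (logical >> 14) & 0x1FFF;          /* 13 R bits: 8K rows */
        p.bank   = (bank_off << 1) | bank_il;         /* 3 bits: 8 banks */
        p.column = (col_hi << 3) | burst_word;        /* 10 bits: 1024 columns */
        return p;                                     /* bit 0 (W) stays within a column */
    }

With this layout, consecutive bursts alternate between two banks (the low B bit), which is exactly the interleaving the slide describes.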
Command generator
► Decides on the selection of memory requests
► Generates SDRAM commands without violating timing constraints
Command generator
► Different page policies determine which command to schedule
– Close-page policy: close rows as soon as possible to activate a new one faster, i.e., not to waste time PRECHARGing the open row of the previous request
– Open-page policy: keep rows open as long as possible to benefit from locality, i.e., assuming the next request will target the same open row
Open page or Close page?
[Figure: same access sequence as the “DRAM: Basic operation” slide]
► (Row 0, Column 0) → ACTIVATE Row 0, READ Column 0
► (Row 0, Column 1) → READ Column 1 (row buffer HIT!)
► (Row 0, Column 10) → READ Column 10 (row buffer HIT!)
► (Row 1, Column 0) → PRECHARGE Row 0, ACTIVATE Row 1, READ Column 0 (row buffer MISS!)
Credits: J.Leverich, Stanford
A modern DRAM controller [Altera]
Image: Altera
Conclusions (Part 1)
► SDRAM is used as off-chip high-volume storage
– Cheaper, but slower than SRAM
► DRAM timing constraints make it hard to design a memory controller
► Selection of memory map and command/request scheduling algorithms impacts memory access time and/or efficiency
Outline
► Part 1: DRAM and controller basics
– DRAM architecture and operation
– Timing constraints
– DRAM controller
► Part 2: DRAMs in embedded systems
– Challenges in sharing DRAMs
– Real-time guarantees with DRAMs
– Future DRAM architectures and controllers
Trends in embedded systems
► Embedded systems get increasingly complex
– Increasingly complex applications (more functionality)
– Growing number of applications integrated in a device
– Requires increased system performance without increasing power
► The case of a generic car manufacturer
– Typical number of ECUs in a car in 2000 → 20
– Number of ECUs in an Audi A8 Sedan → over 80
System-on-Chip (SoC)
► The resulting complex contemporary platforms are heterogeneous multi-processor systems
– Resources in the system are shared to reduce cost
SoC: Video and audio processing system
► DRAM is typically used as shared main memory for cost reasons
[Figure: video engine, audio processor, host CPU, DMA controller, GPU, input processor and LCD controller sharing a DRAM through an interconnect and a memory controller]
A.B. Soares et al., “Development of a SoC for Digital Television Set-Top Box: Architecture and System Integration Issues”, International Journal of Reconfigurable Computing, Volume 2013
Set-top box architecture [Philips]
DRAM controller architecture
[Figure: clients 1..n connect through a bus to the DRAM controller, where an arbiter feeds the memory map and command generator in front of the DRAM]
► The arbiter grants memory access to one of the memory clients at a time (a round-robin sketch follows below)
– Examples: Round-Robin, Time Division Multiplexing (TDM) and priority-based arbiters
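A minimal sketch of the round-robin option; the pending-bitmask interface is an illustrative assumption:

    /* Grant the next pending client after the one served last (round-robin).
     * pending has one bit per client; returns the client id, or -1 if idle. */
    static int rr_arbitrate(unsigned pending, int n_clients, int last_granted)
    {
        for (int i = 1; i <= n_clients; i++) {
            int c = (last_granted + i) % n_clients;
            if (pending & (1u << c))
                return c;
        }
        return -1;
    }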
DRAM controller for real-time systems
► Clients in real-time systems have requirements on latency/bandwidth
– A fixed set of memory access parameters (burst size, page policy, etc.) in the back-end bounds the transaction execution time
– Predictable arbiters, such as TDM with fixed time slots or Round-Robin, bound the response time (see the bound sketched below)
[Figure: clients reach the DRAM back-end through an interconnect and arbiter; the back-end bounds execution time, the arbiter bounds response time]
B. Akesson et al., “Predator: A Predictable SDRAM Memory Controller”, CODES+ISSS, 2007
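As a deliberately simplified illustration of why such arbiters give bounds: under TDM, a client owning one fixed-duration slot in a frame of F slots can, at worst, just miss its own slot. A sketch under those assumptions:

    /* Worst-case response time (cycles) of a single-slot TDM client:
     * wait for the other F-1 slots, then use its own slot. */
    static long tdm_worst_case_response(int frame_slots, long slot_cycles)
    {
        return (long)(frame_slots - 1) * slot_cycles   /* worst-case waiting */
             + slot_cycles;                            /* own service slot   */
    }

The bound holds regardless of what the other clients do, which is exactly what a real-time guarantee requires.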
DRAMs in the market
Family    Generation   Datapath width (bits)   Frequency range (MHz)
DDR       DDR          16                      100-200
DDR       DDR2         16                      200-400
DDR       DDR3         16                      400-1066
DDR       DDR4         16                      800-1600
LPDDR     LPDDR        16 and 32               133-208
LPDDR     LPDDR2       16 and 32               333-533
LPDDR     LPDDR3       16 and 32               667-800
WIDE IO   SDR          128                     200-266
► Observe the increase in operating frequency with every generation
DRAMs: Bandwidth vs clock frequency
[Chart: peak bandwidth (GB/s, 0-18) versus max operating frequency (MHz, 0-1800) for DDR, DDR2, DDR3, DDR4, LPDDR, LPDDR2, LPDDR3 and WIDE IO SDR; WIDE IO SDR sits highest at the lowest frequency]
► WIDE IO gives much higher bandwidth at lower frequency
– Low power consumption
Multi-channel DRAM: WIDE IO
► Bandwidth demands of future embedded systems > 10 GB/s
– Memory power consumption scales up with memory operating frequency → “Go parallel”
► Multi-channel memories
– Each channel is an independent memory module with dedicated data and control path
– WIDE IO DRAM: 4 channels, each with a 128-bit IO
Multi-channel DRAM controller
[Figure: each memory client passes through an Atomizer and a Channel Selector (CS); per-channel arbiters and back-ends serve Channel 1 and Channel 2; Sequence Generators configure the CS]
► The Atomizer chops the incoming requests into a number of service units (sketched below)
► The Channel Selector (CS) routes the service units to the different memory channels according to the configuration in the Sequence Generators
M.D. Gomony et al., “Architecture and Optimal Configuration of a Real-Time Multi-Channel Memory Controller”, DATE, 2012
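A minimal sketch of the Atomizer idea: chop a request into fixed-size service units that the CS can then route independently; the 64-byte atom size and the struct layout are illustrative assumptions, not the configuration from the paper:

    #include <stdint.h>

    enum { SERVICE_UNIT_BYTES = 64 }; /* illustrative atom size */

    typedef struct { uint32_t addr, bytes; } mem_request;
    typedef struct { uint32_t addr, bytes; } service_unit;

    /* Split req into service units of at most SERVICE_UNIT_BYTES each;
     * returns how many units were written to out[]. */
    static int atomize(mem_request req, service_unit *out, int max_units)
    {
        int n = 0;
        while (req.bytes > 0 && n < max_units) {
            uint32_t chunk = req.bytes < SERVICE_UNIT_BYTES
                           ? req.bytes : SERVICE_UNIT_BYTES;
            out[n].addr  = req.addr;
            out[n].bytes = chunk;
            req.addr  += chunk;
            req.bytes -= chunk;
            n++;
        }
        return n;
    }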
Multi-channel DRAM controller
► Multi-channel memories allow memory requests to be interleaved across multiple memory channels
– Reduces access latency
[Figure: a request from memory client 1 is atomized and its service units are routed to the back-ends of both Channel 1 and Channel 2]
Wide IO memory controller [Cadence]
Image: Cadence
Future DRAM: HMC
► Hybrid Memory Cube (HMC)
– 16 memory channels
► What does the memory controller for HMC look like?
Image: Micron, HMC
Conclusions (Part 2)
► DRAMs are shared in multi-processor SoCs to reduce cost and to enable communication between the processing elements
► Sharing DRAMs between multiple memory clients can be done using different arbitration algorithms
► Predictable arbitration and back-ends provide real-time guarantees on latency and bandwidth to real-time clients
► Multi-channel DRAMs allow a memory request to be interleaved across memory channels
Questions?
m.d.gomony@tue.nl
References
► B. Jacob et al., Memory Systems: Cache, DRAM, Disk, Morgan Kaufmann, 2007
► B. Akesson et al., “Predator: A Predictable SDRAM Memory Controller”, CODES+ISSS, 2007
► M.D. Gomony et al., “Architecture and Optimal Configuration of a Real-Time Multi-Channel Memory Controller”, DATE, 2012
► http://hybridmemorycube.org/