CS210_305_10
CS.305
Computer Architecture
<local.cis.strath.ac.uk/teaching/ug/classes/CS.305>
Memory: Structures
Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005,
and from slides kindly made available by Dr Mary Jane Irwin, Penn State University.
Review: Major Components of a Computer
[Diagram: Processor (Control + Datapath), Memory, and Devices (Input and Output)]
Processor-Memory Performance Gap
[Graph: relative performance, 1980-2004, log scale. µProc improves 55%/year (2X/1.5yr, "Moore's Law"); DRAM improves 7%/year (2X/10yrs). The processor-memory performance gap grows ~50%/year.]
The “Memory Wall”

Logic vs DRAM speed gap continues to grow
[Graph: clocks per instruction (core) vs clocks per DRAM access (memory), log scale, from VAX/1980 through PPro/1996 to 2010+; the two curves diverge steadily.]
Impact of Memory Performance

- Suppose a processor executes with
  - ideal CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
  and that 10% of data memory operations miss with a 50 cycle miss penalty

  CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles) + ( 0.30 (data mem ops/instr)
        x 0.10 (miss/data mem op) x 50 (cycles/miss) )
      = 1.1 cycles + 1.5 cycles = 2.6

  so 58% of the time the processor is stalled waiting for memory!

- A 1% instruction miss rate would add an additional 0.5 to the CPI!

[Pie chart: ideal CPI 1.1, instruction-miss stalls 0.5, data-miss stalls 1.5]
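The stall arithmetic above is easy to check mechanically. A small sketch (Python; the figures are the slide's assumptions, the variable names are ours):

```python
# CPI with data-cache miss stalls, using the slide's assumptions:
# ideal CPI 1.1, 30% loads/stores, 10% data miss rate, 50-cycle penalty.
ideal_cpi = 1.1
ld_st_frac = 0.30      # data memory ops per instruction
data_miss_rate = 0.10  # misses per data memory op
miss_penalty = 50      # cycles per miss

data_stalls = ld_st_frac * data_miss_rate * miss_penalty   # 1.5 cycles/instr
cpi = ideal_cpi + data_stalls                              # 2.6 cycles/instr
stall_fraction = data_stalls / cpi                         # ~0.58

# A 1% instruction miss rate would add 0.01 * 50 = 0.5 more cycles/instr.
instr_stalls = 0.01 * miss_penalty

print(f"CPI = {cpi:.1f}, stalled {stall_fraction:.0%} of the time, "
      f"+{instr_stalls} CPI from a 1% instruction miss rate")
```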
The Memory Hierarchy Goal
- Fact: large memories are slow (and low cost per bit), and fast memories are small (and high cost per bit)
- How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
  - With hierarchy
  - With parallelism
A Typical Memory Hierarchy
- By taking advantage of the principle of locality
  - Can present the user with as much memory as is available in the cheapest technology
  - at the speed offered by the fastest technology

[Diagram: on-chip components (Control; Datapath with RegFile; Instr and Data Caches; ITLB and DTLB; eDRAM), then a Second Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk)]

Speed (%cycles): ½'s, 1's, 10's, 100's, 1,000's
Size (bytes): 100's, K's, 10K's, M's, G's to T's
Cost: highest at the top of the hierarchy, lowest at the bottom
Characteristics of the Memory Hierarchy
[Diagram: unit transferred and (relative) size of the memory at each level, with increasing distance from the processor in access time — Processor to L1$: 4-8 bytes (word); L1$ to L2$: 8-32 bytes (block); L2$ to Main Memory: 1 to 4 blocks; Main Memory to Secondary Memory: 1,024+ bytes (disk sector = page)]

- Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory
Memory Hierarchy Technologies
- Caches use SRAM for speed and technology compatibility
  - Low density (6-transistor cells), high power, expensive, fast
  - Static: content will last "forever" (until power is turned off)

[Diagram: SRAM 2M x 16 chip — 21-bit address; chip select, output enable, and write enable controls; 16-bit Din[15-0] and Dout[15-0]]

- Main memory uses DRAM for size (density)
  - High density (1-transistor cells), low power, cheap, slow
  - Dynamic: needs to be "refreshed" regularly (~ every 8 ms), costing 1% to 2% of the DRAM's active cycles
  - Addresses divided into 2 halves (row and column)
    - RAS, or Row Access Strobe, triggering the row decoder
    - CAS, or Column Access Strobe, triggering the column selector
Memory Performance Metrics
- Latency: time to access one word
  - Access time: time between the request and when the data is available (or written)
  - Cycle time: time between requests
  - Usually cycle time > access time
  - Typical read access times for SRAMs in 2004 are 2 to 4 ns for the fastest parts, 8 to 20 ns for the typical largest parts
- Bandwidth: how much data from the memory can be supplied to the processor per unit time
  - width of the data channel x the rate at which it can be used
- Size: DRAM/SRAM ratio of 4 to 8
- Cost/Cycle time: SRAM/DRAM ratio of 8 to 16
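The bandwidth definition (width x rate) can be made concrete with a small sketch. The channel width and clock rate below are hypothetical figures for illustration, not from the slide:

```python
# Bandwidth = width of the data channel x the rate at which it can be used.
# Hypothetical example: a 64-bit channel, one transfer per 200 MHz clock.
width_bytes = 8            # 64-bit data channel
rate_hz = 200e6            # usable transfers per second
bandwidth = width_bytes * rate_hz

print(bandwidth / 1e6, "MB/s")   # 1600.0 MB/s
```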
Classical RAM Organization (~Square)
[Diagram: a ~square RAM cell array. The row decoder, driven by the row address, asserts one word (row) line; the bit (data) lines feed a column selector & I/O circuits, driven by the column address, which produce the data bit or word. Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell.]

- One memory row holds a block of data, so the column address selects the requested bit or word from that block
Classical DRAM Org’n (~Square Planes)
[Diagram: as on the previous slide, but the RAM cell array of 1-T DRAM cells is replicated as a stack of planes. The row decoder (row address) and column selector & I/O circuits (column address) are shared, and each plane contributes one data bit.]

- The column address selects the requested bit from the row in each plane
Classical DRAM Operation
DRAM organization:
- N rows x N columns x M bits (M bit planes)
- Read or write M bits at a time
- Each M-bit access requires a RAS / CAS cycle

[Timing diagram: for each M-bit access, RAS is asserted with the row address, then CAS with the column address; the 2nd access repeats the full row address / column address sequence. The cycle time spans a complete RAS/CAS pair.]
Page Mode DRAM Operation
Page mode DRAM:
- An N x M SRAM is added to save a row
- After a row is read into the SRAM "register"
  - Only CAS is needed to access other M-bit words on that row
  - RAS remains asserted while CAS is toggled

[Timing diagram: the 1st M-bit access uses a full row address + column address; the 2nd, 3rd, and 4th M-bit accesses on the same row need only successive column addresses while RAS stays asserted, shortening the cycle time.]
Synchronous DRAM (SDRAM) Operation
After a row is read into the SRAM register:
- CAS inputs the starting "burst" address along with a burst length
- A burst of data is transferred from a series of sequential addresses within that row (the column address is simply incremented for each word)
  - A clock controls the transfer of successive words in the burst (300 MHz in 2004)

[Timing diagram: the 1st M-bit access uses a row address + column address; the 2nd, 3rd, and 4th words of the burst follow on successive clock edges with no further addresses.]
Other DRAM Architectures
- Double Data Rate SDRAMs (DDR-SDRAMs, and DDR-SRAMs)
  - Double data rate because they transfer data on both the rising and the falling edge of the clock
  - The most widely used form of SDRAMs
- DDR2-SDRAMs
  <http://tinyurl.com/yyw9k9> ->
  http://www.corsairmemory.com/corsair/products/tech/memory_basics/153707/main.swf
DRAM Memory Latency & Bandwidth
(Patterson, CACM Vol 47, #10, 2004)

                  DRAM   Page   FastPage  FastPage  Synch   DDR
                         DRAM   DRAM      DRAM      DRAM    SDRAM
Module width      16b    16b    32b       64b       64b     64b
Year              1980   1983   1986      1993      1997    2000
Mb/chip           0.06   0.25   1         16        64      256
Die size (mm2)    35     45     70        130       170     204
Pins/chip         16     16     18        20        54      66
BWidth (MB/s)     13     40     160       267       640     1600
Latency (nsec)    225    170    125       75        62      52

- In the time that the memory-to-processor bandwidth doubles, the memory latency improves by a factor of only 1.2 to 1.4
- To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks
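The 1.2 to 1.4 figure can be sanity-checked from the table's endpoints (a sketch in Python; using the 1980 DRAM and 2000 DDR SDRAM columns):

```python
import math

# 1980 DRAM vs 2000 DDR SDRAM, from the table above.
bw0, bw1 = 13, 1600      # bandwidth, MB/s
lat0, lat1 = 225, 52     # latency, ns

doublings = math.log2(bw1 / bw0)        # ~6.9 bandwidth doublings
latency_gain = lat0 / lat1              # ~4.3x latency improvement overall
per_doubling = latency_gain ** (1 / doublings)

# ~1.23x latency improvement per bandwidth doubling, inside the
# slide's 1.2-to-1.4 range.
print(round(per_doubling, 2))
```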
Memory Systems that Support Caches
- The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways

[Diagram: on-chip CPU and cache connected by a bus (32-bit data & 32-bit addr per cycle) to memory — the one word wide organization (one word wide bus and one word wide memory)]

- Assume
  - 1 clock cycle to send the address
  - 25 clock cycles for DRAM cycle time, 8 clock cycles access time
  - 1 clock cycle to return a word of data
- Memory-bus to cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per clock cycle
One Word Wide Memory Organization
- If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
     1  cycle to send address
    25  cycles to read DRAM
     1  cycle to return data
    27  total clock cycles miss penalty
- Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  4/27 = 0.148 bytes per clock
One Word Wide Memory Organization, cont'd

- What if the block size is four words?
     1            cycle to send 1st address
     4 x 25 = 100 cycles to read DRAM
     1            cycle to return last data word
   102            total clock cycles miss penalty

[Timing: four back-to-back 25-cycle DRAM accesses]

- Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  (4 x 4)/102 = 0.157 bytes per clock
One Word Wide Memory Organization, cont'd

- What if the block size is four words and a fast page mode DRAM is used?
     1             cycle to send 1st address
    25 + 3x8 = 49  cycles to read DRAM
     1             cycle to return last data word
    51             total clock cycles miss penalty

[Timing: one 25-cycle access for the first word, then three 8-cycle page-mode accesses on the same row]

- Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  (4 x 4)/51 = 0.314 bytes per clock
Interleaved Memory Organization
- For a block size of four words:
     1            cycle to send 1st address
    25 + 3 = 28   cycles to read DRAM
     1            cycle to return last data word
    30            total clock cycles miss penalty

[Diagram: four memory banks (bank 0 to bank 3) on the bus; the four 25-cycle accesses overlap, each bank starting one cycle after the previous, so successive words arrive one cycle apart]

- Number of bytes transferred per clock cycle (bandwidth) for a single miss is
  (4 x 4)/30 = 0.533 bytes per clock
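The four miss-penalty calculations worked through on the preceding slides can be summarized in one sketch (Python; the figures are the slides' assumptions: 1 cycle to send an address, 25-cycle DRAM cycle time, 8-cycle page-mode access, 1 cycle to return a word over a one-word bus):

```python
# Miss penalty and bandwidth for the four memory organizations
# discussed in the slides, for a four-word block (one-word block
# shown first for comparison).
BLOCK_WORDS = 4
WORD_BYTES = 4

penalties = {
    "one-word block":    1 + 25 + 1,                           # 27 cycles
    "plain DRAM":        1 + BLOCK_WORDS * 25 + 1,             # 102 cycles
    "fast page mode":    1 + 25 + (BLOCK_WORDS - 1) * 8 + 1,   # 51 cycles
    "4-way interleaved": 1 + 25 + (BLOCK_WORDS - 1) + 1,       # 30 cycles
}

for name, cycles in penalties.items():
    words = 1 if name == "one-word block" else BLOCK_WORDS
    bw = words * WORD_BYTES / cycles
    print(f"{name}: {cycles} cycles, {bw:.3f} bytes/clock")
```

Interleaving wins because the three extra words cost one cycle each on the bus instead of a full DRAM cycle each.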
DRAM Memory System Summary
- It's important to match the cache characteristics
  - caches access one block at a time (usually more than one word)
- with the DRAM characteristics
  - use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
- and with the memory-bus characteristics
  - make sure the memory bus can support the DRAM access rates and patterns
- with the goal of increasing the memory-bus to cache bandwidth