AMA-L14-Prefetching

Lecture 14: DRAM and Prefetching
• DRAM = Dynamic RAM
• SRAM: 6T per bit
– built with normal high-speed CMOS technology
• DRAM: 1T per bit
– built with special DRAM process optimized for density
[Figure: SRAM cell and DRAM cell schematics, each showing its wordline and bitline(s) b]
• You can use a “dead” transistor gate as the storage capacitor:
– but this wastes area because we now have two transistors
– and the “dummy” transistor may need to be bigger to hold enough charge
• There are other advanced structures
[Figure: “Trench Cell” cross-section, labeled Cell Plate Si, Cap Insulator, Refilling Poly, Storage Node Poly, Si Substrate, and Field Oxide]
DRAM figures on this slide and the previous one were taken from
Prof. Nikolic’s EECS141/2003 lecture notes, UC Berkeley
[Figure: DRAM organization: Row Address into the Row Decoder, Memory Cell Array, Sense Amps, Row Buffer, Column Address into the Column Decoder, Data Bus]
• High-level organization is very similar to SRAM
– cells are only single-ended
• changes precharging and sensing circuits
• makes reads destructive: contents are erased after reading
– row buffer
• read lots of bits all at once, and then parcel them out based on different column addresses
– similar to reading a full cache line, but only accessing one word at a time
• “Fast-Page Mode” (FPM) DRAM organizes the DRAM row to contain bits for a complete page
– row address held constant, and then fast read from different locations from the same page
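The row-buffer behaviour above can be sketched in C as a rough model. The single bank, the row size of 1024 column addresses, and the names `bank_t`, `COLS_PER_ROW`, and `bank_access` are assumptions made for illustration, not part of the lecture. An access to the row that is already open is served from the row buffer; any other row has to be read into the row buffer first.

```c
#include <stdint.h>

#define COLS_PER_ROW 1024u   /* assumed row size, in column addresses */

typedef struct {
    int      row_open;   /* 0 until the first row has been activated */
    uint32_t open_row;   /* row currently held in the row buffer */
} bank_t;

/* Returns 1 on a row-buffer hit (column access only) and 0 when a new
   row has to be read into the row buffer first. */
static int bank_access(bank_t *bank, uint32_t addr)
{
    uint32_t row = addr / COLS_PER_ROW;

    if (bank->row_open && bank->open_row == row)
        return 1;                 /* same row: fast column access */
    bank->row_open = 1;           /* different row: activate it */
    bank->open_row = row;
    return 0;
}
```

Consecutive accesses that keep the row address constant hit in the row buffer, which is the case fast-page mode is built around.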
[Figure: bitline voltage and storage cell voltage (between 0 and Vdd) while reading a 0 and a 1, with “Wordline Enabled” and “Sense Amp Enabled” marked]
After read of 0 or 1, cell contains something close to 1/2
• So after a read, the contents of the DRAM cell are
gone
• The values are stored in the row buffer
• Write them back into the cells for the next read in
the future
[Figure: DRAM cells, Sense Amps, Row Buffer]
• Over time, the DRAM cell will lose its contents even if it’s not accessed
– This is why it’s called “dynamic”
– Contrast to SRAM which is “static” in that once written, it maintains its value forever (so long as power remains on)
[Figure: DRAM cell holding a 0 or 1, labeled “Gate Leakage”]
• All DRAM rows need to be regularly read and re-written (refreshed)
Accesses are asynchronous: triggered by RAS and CAS signals, which can in theory occur at arbitrary times (subject to DRAM timing constraints)
Double-Data Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock
Command frequency does not change
[Figure: DDR timing diagram with the Burst Length marked]
Timing figures taken from “A Performance Comparison of Contemporary
DRAM Architectures” by Cuppu, Jacob, Davis and Mudge
• Significant wire delay just getting from the CPU to the memory controller
• More wire delay getting to the memory chips (plus the return trip…)
• Width/Speed varies depending on memory type
[Figure: Memory Controller with Read Queue, Write Queue, and Response Queue to/from the CPU, and a Scheduler and Buffer issuing Commands and Data to Bank 0 and Bank 1]
Like a write-combining buffer, the scheduler may coalesce multiple accesses together, or reorder them to reduce the number of row accesses
• Access latency dominated by wire delay
– mostly in the wordline and bitlines/sense
– PCB traces between chips
• Process technology improvements provide smaller
and faster transistors
– DRAM density doubles at about the same rate as Moore’s
Law
– DRAM latency improves very slowly because wire delay has
not improved as fast as logic delay
• CPUs
– frequency has increased at about 60% per year
• DRAM
– end-to-end latency has decreased only about 10% per year
• Number of cycles for memory access keeps increasing
– A.K.A. the memory wall
– Note: absolute latency of memory is decreasing
• Just not nearly as fast as the CPU
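As a rough worked example of the rates above, the C sketch below compounds them year by year. The function name `memory_wall_growth` is made up for illustration, and the 60% and 10% figures are simply the ones quoted on the slide. Each year the clock gets 1.6x faster while the latency shrinks to 0.9x, so a memory access costs about 1.6 × 0.9 = 1.44x more cycles per year.

```c
/* Relative growth, after `years` years, of the number of CPU cycles one
   DRAM access takes: the clock gets 60% faster each year while DRAM
   latency shrinks by 10% each year (rates taken from the slide). */
static double memory_wall_growth(int years)
{
    double growth = 1.0;

    for (int i = 0; i < years; i++)
        growth *= 1.60 * 0.90;   /* 1.44x more cycles per access per year */
    return growth;
}
```

Under these assumed rates, ten years gives roughly a 38x increase in cycles per access, even though the absolute latency has dropped.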
• Caching
– reduces average memory instruction latency by avoiding
DRAM altogether
• Limitations
– Capacity
• programs keep increasing in size
– Compulsory misses
• Clock FSB faster
– DRAM chips may not be able to keep up
• Latency dominated by wire delay
– Bandwidth may be improved (DDR vs. regular) but latency
doesn’t change much
• Instead of 2 cycles for row access, may take 3 cycles at a faster bus
speed
• Doesn’t address latency of the memory access
• Memory controller can run at CPU speed instead of FSB clock speed
• All on same chip: no slow PCB wires to drive
• Disadvantage: memory type is now tied to the CPU implementation
• If memory takes a long time, start accessing earlier
[Figure: timelines of Load and Prefetch requests through L1, L2, and DRAM until the Data returns. Captions: “Total Load-to-Use Latency”, “Much improved Load-to-Use Latency”, “Somewhat improved Latency”]
• May cause resource contention due to extra cache/DRAM activity
[Figure: three versions of a code fragment with basic blocks A, B, and C, built from the instructions R1 = R1 - 1, R1 = [R2] (the cache-missing load, shown in red), and R3 = R1+4; one version moves the load earlier, and another adds R0 = [R2]]
• Reordering can mess up your code
• Hopefully the load miss is serviced by the time we get to the consumer
• Using a prefetch instruction (or load to $zero) can help to avoid problems with data dependencies
• Pros:
– can leverage compiler level information
– no hardware modifications
• Cons:
– prefetch instructions increase code footprint
• may cause more I$ misses, code alignment issues
– hard to hoist prefetches early enough to cover main
memory latency
• If memory latency is 100 cycles, and the CPU can sustain 2 instructions per cycle, then the load needs to be moved 200 instructions earlier in the code
– aggressive hoisting leads to many useless prefetches
• control flow may go somewhere else (like block B in previous slide)
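A minimal C sketch of software prefetching in a simple loop, assuming a GCC- or Clang-compatible compiler that provides the `__builtin_prefetch` builtin. The function name `sum_with_prefetch` and the distance of 64 elements are illustrative assumptions; a real prefetch distance would have to be tuned to the memory latency, as in the 200-instruction estimate above.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 64   /* assumed; tune to the memory latency */

/* Sums an array while requesting the element PREFETCH_DISTANCE
   iterations ahead, so its cache line may arrive before it is needed. */
static long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)   /* stay inside the array */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint and does not change the result; the extra instruction in the loop body is the code-footprint cost listed under the cons.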
[Figure: CPU, HW Prefetcher, and DRAM]
• Hardware monitors miss traffic to DRAM
• Depending on prefetch algorithm/miss patterns, prefetcher injects additional memory requests
• Cannot be overly aggressive since prefetches may contend for memory bandwidth, and may pollute the cache (evict other useful cache lines)
• Very simple: if a request for cache line X goes to
DRAM, also request X+1
– assumes spatial locality
• often a good assumption
– low chance of tying up the memory bus for too long
• FPM DRAM already will have the correct page open for the request
for X, so X+1 will likely be available in the row buffer
• Can optimize by doing Next-Line-Unless-Crossing-A-Page-Boundary prefetching
• Obvious extension
– fetch the next N lines:
• X+1, X+2, …, X+N
• Need to carefully tune N
– larger N may make it:
• more likely to prefetch something useful
• more likely to evict something useful
• more likely to stall a useful load due to bus contention
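Next-N-line prefetching, combined with the page-boundary cutoff from the next-line variant, might look like the C sketch below. The name `next_n_lines` and the page size of 64 cache lines (4 KB pages with 64-byte lines) are assumptions for illustration only.

```c
#include <stdint.h>

#define LINES_PER_PAGE 64u   /* assumed: 4 KB pages, 64-byte lines */

/* On a miss to cache line `x`, writes up to `n` prefetch candidates
   (x+1 .. x+n) into `out`, stopping at the page boundary.
   Returns how many candidates were written. */
static unsigned next_n_lines(uint64_t x, unsigned n, uint64_t *out)
{
    uint64_t page = x / LINES_PER_PAGE;
    unsigned count = 0;

    for (unsigned i = 1; i <= n; i++) {
        if ((x + i) / LINES_PER_PAGE != page)
            break;               /* do not cross into the next page */
        out[count++] = x + i;
    }
    return count;
}
```

With n = 1 this reduces to plain next-line prefetching; a larger n issues more requests per miss, which is where the tuning trade-off above comes from.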
Figures from Jouppi “Improving Direct-Mapped Cache Performance by the
Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA’90
• Can independently track multiple “intertwined”
sequences/streams of accesses
• Separate buffers prevent prefetch streams from
polluting cache until line is used at least once
– similar effect to filter/promotion caches
• Can extend to “Quasi-Sequential” Stream buffer
– add comparator to all entries, and skip-ahead (partial flush)
if hit on a non-head entry
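A simplified C sketch of multi-way stream buffers follows. It tracks only the line address at the head of each buffer (the entries behind it are sequential, so they are implied) and does not model the data itself. The number of buffers, the LRU replacement, and the names `stream_t`, `stream_buffers_t`, and `stream_lookup` are assumptions made for this sketch rather than details from the slides.

```c
#include <stdint.h>

#define NUM_STREAMS 4   /* assumed number of stream buffers */

typedef struct {
    int      valid;
    uint64_t head;        /* line address at the head of the FIFO */
    unsigned last_used;   /* timestamp for LRU replacement */
} stream_t;

typedef struct {
    stream_t stream[NUM_STREAMS];
    unsigned clock;
} stream_buffers_t;

/* Called on a cache miss to `line`. Returns 1 if the head of some stream
   buffer supplies the line, 0 if the miss has to go to memory. */
static int stream_lookup(stream_buffers_t *sb, uint64_t line)
{
    int victim = 0;

    sb->clock++;
    for (int i = 0; i < NUM_STREAMS; i++) {
        stream_t *s = &sb->stream[i];
        if (s->valid && s->head == line) {
            s->head = line + 1;          /* pop the head; the stream advances */
            s->last_used = sb->clock;
            return 1;
        }
    }
    /* No head matched: pick an invalid buffer, else the least recently used. */
    for (int i = 1; i < NUM_STREAMS; i++) {
        const stream_t *s = &sb->stream[i];
        const stream_t *v = &sb->stream[victim];
        if (v->valid && (!s->valid || s->last_used < v->last_used))
            victim = i;
    }
    sb->stream[victim].valid = 1;
    sb->stream[victim].head = line + 1;  /* start prefetching after the miss */
    sb->stream[victim].last_used = sb->clock;
    return 0;
}
```

Two interleaved sequential streams each claim their own buffer, so neither one flushes the other.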
Column traversal of a matrix (layout in linear memory)
If the array starts at address A, we are accessing the kth column, each element is B bytes large, and each row of the matrix occupies N bytes, then the addresses accessed are:
A+Bk, A+Bk+N, A+Bk+2N, A+Bk+3N, …
Or, if you miss on address X, prefetch X+N
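The address pattern can be written as a one-line C helper. This is only a sketch: it takes N to be the distance in bytes between consecutive rows, and the function name `column_elem_addr` and its parameter names are made up for illustration.

```c
#include <stdint.h>

/* Address of the element in row `i` of column `k`, for a row-major matrix
   starting at address `a`, with `b`-byte elements and `n` bytes per row. */
static uint64_t column_elem_addr(uint64_t a, uint64_t b, uint64_t n,
                                 uint64_t k, uint64_t i)
{
    return a + b * k + i * n;
}
```

Successive rows of the same column are always `n` bytes apart, which is the constant stride a stride prefetcher can pick up.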
• Like Next-N-Line prefetching, need to limit how far
ahead stride is allowed to go
– previous example: no point in prefetching past the end of
the array
• How can you tell the difference between:
– A[i] → A[i+1]
– X → Y
– Typically only do stride prefetch if same stride observed at least a few times
What if we’re doing Y = A + X?
Miss traffic now looks like:
A+Bk, X+Bk, Y+Bk, A+Bk+N, X+Bk+N, Y+Bk+N, A+Bk+2N, X+Bk+2N, Y+Bk+2N, …
Differences between successive misses: (X-A), (Y-X), (A+N-Y), …
No detectable stride!
Stride table indexed by the PC of the memory instruction:
Tag (PC)   Instruction        Addr       Stride  Count
0x409A34   Load R1 = 0[R2]    A+Bk+3N    N       2
0x409A50   Load R3 = 0[R4]    X+Bk+3N    N       2
<program is here>
0x409A5C   Store R5 = 0[R6]   Y+Bk+2N    N       1
If seen same stride enough times (count > q): Prefetch Addr + Stride, here A+Bk+4N
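A PC-indexed stride table of this kind could be sketched in C as follows. The table size, the direct-mapped indexing by PC, the threshold q = 1, and the names `stride_entry_t`, `stride_table`, and `stride_observe` are all assumptions for illustration; the slide does not specify them.

```c
#include <stdint.h>

#define TABLE_SIZE 64   /* assumed number of entries */
#define THRESHOLD  1    /* q: prefetch once count > q (assumed value) */

typedef struct {
    uint64_t tag;      /* PC of the load or store */
    uint64_t addr;     /* last address it accessed */
    int64_t  stride;   /* last observed stride */
    int      count;    /* how often that stride has repeated */
    int      valid;
} stride_entry_t;

static stride_entry_t stride_table[TABLE_SIZE];

/* Records an access by instruction `pc` to `addr`. Returns 1 and stores
   the predicted address in `*prefetch` once the stride is confirmed. */
static int stride_observe(uint64_t pc, uint64_t addr, uint64_t *prefetch)
{
    stride_entry_t *e = &stride_table[(pc >> 2) % TABLE_SIZE];

    if (!e->valid || e->tag != pc) {     /* new instruction: (re)allocate */
        e->valid = 1;
        e->tag = pc;
        e->addr = addr;
        e->stride = 0;
        e->count = 0;
        return 0;
    }
    int64_t stride = (int64_t)(addr - e->addr);
    if (stride == e->stride) {
        e->count++;                      /* same stride seen again */
    } else {
        e->stride = stride;              /* new stride: start counting over */
        e->count = 0;
    }
    e->addr = addr;
    if (e->count > THRESHOLD && e->stride != 0) {
        *prefetch = addr + (uint64_t)e->stride;
        return 1;
    }
    return 0;
}
```

Because each instruction has its own entry, the interleaved A, X, and Y accesses no longer hide one another's stride.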
[Figure: linked list with nodes A to F, whose actual memory layout does not follow the traversal order]
Linked-List Traversal (no chance for stride to get this right)
[Figure: linked-list nodes A to F in memory, next to a table listing “What to Prefetch Next” for each missed node; the entry for the last node is “?”]
Similar to history-based branch predictors: Last time I saw X, Y happened
– Ex 1: X = taken branch, Y = not-taken
– Ex 2: X = Missed A, Y = Missed B
• Like branch predictors, longer history enables
learning more complex patterns
– and increases training time
DFS traversal: ABDBEBACFCGCA
[Figure: tree with root A; A’s children are B and C, B’s children are D and E, C’s children are F and G]
Prefetch prediction table:
Last two misses   Prefetch
AB                D
BD                B
DB                E
BE                B
EB                A
BA                C
AC                F
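A prediction table keyed on the last two misses can be sketched in C like this. Node names are the letters 'A' to 'G' from the example tree; the type `corr_prefetcher_t` and the function `corr_miss` are hypothetical names introduced for this sketch.

```c
#define NUM_NODES 7   /* nodes 'A'..'G' of the example tree */

typedef struct {
    char next[NUM_NODES][NUM_NODES];  /* next[prev][cur] = predicted miss, 0 if unknown */
    char prev, cur;                   /* last two misses, 0 if none yet */
} corr_prefetcher_t;

/* Trains on the miss `m` and returns the predicted next miss (0 if none). */
static char corr_miss(corr_prefetcher_t *p, char m)
{
    if (p->prev && p->cur)            /* learn: after (prev, cur) came m */
        p->next[p->prev - 'A'][p->cur - 'A'] = m;
    p->prev = p->cur;
    p->cur = m;
    if (!p->prev)
        return 0;                     /* not enough history yet */
    return p->next[p->prev - 'A'][p->cur - 'A'];
}
```

With only one miss of history, a miss on B could be followed by D, E, or A; the two-miss key separates those cases, at the cost of needing a full pass over the traversal before any prediction is available.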
• Alternative to explicitly remembering the patterns is
to remember multiple next-states
[Figure: tree with root A; A’s children are B and C, B’s children are D and E, C’s children are F and G]
Miss   Next-states
A      B, C
B      D, E, A
C      F, G, A
D      B
E      B
F      C
G      C
Miss to DRAM, then the cache line comes back
Scan for anything that looks like a pointer (is it within the heap range?)
Value        Pointer?
1            Nope
4128         Nope
900120230    Maybe!
900120758    Maybe!
Go ahead and prefetch these
struct bintree_node_t {
    int data1;
    int data2;
    struct bintree_node_t * left;
    struct bintree_node_t * right;
};
This allows you to walk the tree (or other pointer-based data structures which are typically hard to prefetch)
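The scan step might be sketched in C as below. The function name `scan_for_pointers`, the 64-bit word size, and the heap bounds passed in by the caller are assumptions for illustration; how real hardware learns the heap range is not covered here.

```c
#include <stddef.h>
#include <stdint.h>

/* Scans the `n` words of a returned cache line and copies every value that
   falls inside [heap_lo, heap_hi) into `out` as a prefetch candidate.
   Returns the number of candidates found. */
static size_t scan_for_pointers(const uint64_t *line, size_t n,
                                uint64_t heap_lo, uint64_t heap_hi,
                                uint64_t *out)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (line[i] >= heap_lo && line[i] < heap_hi)
            out[found++] = line[i];   /* looks like a pointer: "Maybe!" */
    }
    return found;
}
```

For the example line, with an assumed heap range of 900000000 to 901000000, the two data fields are rejected and the two child pointers become prefetch candidates. A data value that happens to fall inside the heap range would also be prefetched, which is why the answer is only “Maybe!”.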
• Don’t necessarily need extra hardware to store
patterns
• Prefetch speed is slower:
[Figure: timelines. Stride Prefetcher: the DRAM latencies for X, X+N, and X+2N overlap. Pointer Prefetching: the DRAM latencies for A, B, and C are serialized, because each pointer is only known once the previous line has returned]
See “Pointer-Cache Assisted Prefetching” by Collins et al. MICRO-2002
for reducing this serialization effect.
[Figure: the Load PC indexes a value predictor used for the address only; the predicted address is sent to L1, L2, and DRAM]
• Takes advantage of value locality
• Mispredictions are less painful
– Normal VPred misprediction causes pipeline flush
– Misprediction of address just causes spurious memory accesses
• compare to simply increasing LLC size
• complex prefetcher vs. simpler with slightly larger
cache
• metrics: performance, power, area, bus utilization
– key is balancing prefetch aggressiveness with resource
utilization (reduce pollution, cache port contention, DRAM
bus contention)
• Prefetching can be done at any level of the cache
hierarchy
• Prefetching algorithm may vary as well
– depends on why you’re having misses
• capacity, conflict or compulsory
– may make capacity misses worse
– simpler technique (victim cache) may be better for conflict
– has better chance than other techniques for compulsory
• behaviors vary by cache level, I$ vs. D$