Lecture 14: DRAM and Prefetching

• DRAM = Dynamic RAM
• SRAM: 6T per bit
  – built with the normal high-speed CMOS logic technology
• DRAM: 1T per bit
  – built with a special DRAM process optimized for density

(Figure: SRAM cell vs. DRAM cell. Each cell is selected by a wordline; the SRAM cell connects to a complementary bitline pair, while the DRAM cell is a single access transistor plus storage capacitor on one bitline b.)

• One way to build the storage capacitor is to use a "dead" transistor gate:
  – but this wastes area, because we now have two transistors per cell
  – and the "dummy" transistor may need to be bigger to hold enough charge
• There are other, more advanced structures
(Figure: "Trench Cell" cross-section, labeled with the cell plate Si, capacitor insulator, refilling poly, storage-node poly, Si substrate, and field oxide. DRAM figures on this slide and the previous one were taken from Prof. Nikolic's EECS 141 (2003) lecture notes, UC Berkeley.)

(Figure: overall DRAM organization. The row address feeds a row decoder that selects one row of the memory cell array; the sense amps capture that row into the row buffer; the column address feeds a column decoder that selects which piece of the row buffer drives the data bus.)

• The high-level organization is very similar to SRAM
  – but cells are only single-ended
    • changes the precharging and sensing circuits
    • makes reads destructive: contents are erased by reading
  – row buffer
    • read lots of bits all at once, then parcel them out based on different column addresses
    • similar to reading a full cache line, but only accessing one word at a time
• "Fast-Page Mode" (FPM) DRAM organizes the DRAM row to contain the bits for a complete page
  – the row address is held constant, and then different locations from the same page can be read quickly (see the latency sketch below)

(Figure: bitline and storage-cell voltage waveforms as the wordline and then the sense amp are enabled; after a read of either a 0 or a 1, the storage cell is left at something close to ½ Vdd.)

• So after a read, the contents of the DRAM cell are gone
• The values are, however, held in the row buffer
• Write them back into the cells so they are there for the next read in the future
(Figure: the sense amps / row buffer writing the row back into the DRAM cells.)

• Fairly gradually, a DRAM cell will lose its contents even if it is never accessed
  – this is why it's called "dynamic"
  – contrast with SRAM, which is "static" in that once written, it maintains its value forever (so long as power remains on)
  – the stored charge leaks away over time (gate leakage)
• So all DRAM rows need to be regularly read and re-written (refreshed)

• DRAM accesses are asynchronous: they are triggered by the RAS and CAS signals, which can in theory occur at arbitrary times (subject to the DRAM timing constraints)

• Double-Data-Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock
  – the command frequency does not change
(Figure: read timing showing the command, clock, and data burst length. Timing figures taken from "A Performance Comparison of Contemporary DRAM Architectures" by Cuppu, Jacob, Davis, and Mudge.)

• There is more wire delay in getting out to the memory chips
  – significant wire delay just getting from the CPU to the memory controller
  – the width/speed of the path varies depending on the memory type
  – (plus the return trip…)

• The memory controller queues requests from the CPU in read and write queues, and returns data through a response queue
• Like a write-combining buffer, the scheduler may coalesce multiple accesses together, or re-order them to reduce the number of row accesses
(Figure: memory controller organization. Commands and data to/from the CPU feed the read, write, and response queues; a scheduler and buffer sit between those queues and the DRAM banks, shown here as Bank 0 and Bank 1.)

• Access latency is dominated by wire delay
  – mostly in the wordlines and the bitlines/sense amps
  – plus the PCB traces between chips
• Process technology improvements provide smaller and faster transistors
  – DRAM density doubles at about the same rate as Moore's Law
  – DRAM latency improves very slowly, because wire delay has not improved as fast as logic delay
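To tie the row-buffer, FPM, and memory-controller pieces above together, here is a minimal C sketch of how access latency might be estimated for a row-buffer hit versus a miss. None of this is from the slides: the timing names and values (T_RP, T_RCD, T_CL), the address-to-bank/row split, and the bank count are all illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative timing parameters in bus cycles (the names follow the usual
 * tRP/tRCD/CL conventions, but these particular values are made up). */
#define T_RP   3   /* precharge: close the currently open row                */
#define T_RCD  3   /* activate:  read a row into the sense amps / row buffer */
#define T_CL   3   /* column access: select one word out of the row buffer   */

typedef struct {
    bool     row_open;   /* does this bank have a row in its row buffer? */
    uint32_t open_row;
} bank_state_t;

/* Assumed address layout: column bits low, then 3 bank bits, then row bits. */
static uint32_t bank_of(uint64_t addr) { return (uint32_t)(addr >> 13) & 0x7; }
static uint32_t row_of(uint64_t addr)  { return (uint32_t)(addr >> 16); }

/* A row-buffer hit (the fast-page-mode case) pays only the column access;
 * a miss also pays an activate, plus a precharge if another row is open. */
static int access_latency(bank_state_t *banks, uint64_t addr)
{
    bank_state_t *b = &banks[bank_of(addr)];
    uint32_t row = row_of(addr);

    if (b->row_open && b->open_row == row)
        return T_CL;

    int lat = (b->row_open ? T_RP : 0) + T_RCD + T_CL;
    b->row_open = true;
    b->open_row = row;
    return lat;
}

int main(void)
{
    bank_state_t banks[8] = {{0}};
    printf("bank idle, row miss: %d cycles\n", access_latency(banks, 0x12345678));
    printf("same row, row hit:   %d cycles\n", access_latency(banks, 0x12345698));
    printf("same bank, new row:  %d cycles\n", access_latency(banks, 0x22345678));
    return 0;
}

The second access lands in the already-open row, so it only pays the column-access time; the third access maps to the same bank but a different row, so it pays the full precharge + activate + column-access sequence.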
• CPUs
  – frequency has increased at about 60% per year
• DRAM
  – end-to-end latency has decreased by only about 10% per year
• So the number of CPU cycles per memory access keeps increasing
  – a.k.a. the "memory wall"
  – note: the absolute latency of memory is still decreasing
    • just not nearly as fast as the CPU is getting faster

• Caching
  – reduces average memory instruction latency by avoiding DRAM altogether
• Limitations
  – capacity
    • programs keep increasing in size
  – compulsory misses

• Clock the FSB faster?
  – the DRAM chips may not be able to keep up
• Latency is dominated by wire delay
  – bandwidth may be improved (DDR vs. regular), but latency doesn't change much
    • instead of 2 cycles for a row access, it may take 3 cycles at the faster bus speed
• So a faster bus doesn't address the latency of the memory access itself

• Move the memory controller on-chip: it can then run at the CPU speed instead of the FSB clock speed
  – all on the same chip: no slow PCB wires to drive
  – disadvantage: the memory type is now tied to the CPU implementation

• Prefetching: if memory takes a long time, start accessing it earlier
(Figure: timelines of a load missing through the L1 and L2 out to DRAM. Without prefetching, the program sees the total load-to-use latency; with a prefetch issued well before the load, the load-to-use latency is much improved; the extra cache/DRAM activity may cause resource contention, in which case the latency is only somewhat improved.)

• Software prefetching by reordering the code can mess things up
  – original code: A: R1 = R1 - 1; B: R1 = [R2]; C: R3 = R1 + 4, where the load in B is the cache-missing instruction
  – hoisting B's load above A starts the miss earlier (hopefully it is serviced by the time we get to the consumer in C), but A then overwrites R1, so C no longer sees the loaded value
  – instead, issue an early prefetch: a prefetch instruction, or a load into a dead register (e.g., R0 = [R2], or a load to $zero), followed by the original A, B, C; this avoids the data-dependence problem

• Pros of software prefetching:
  – can leverage compiler-level information
  – no hardware modifications
• Cons:
  – prefetch instructions increase the code footprint
    • may cause more I$ misses and code-alignment issues
  – hard to hoist prefetches early enough to cover the main-memory latency
    • if memory is 100 cycles away and the CPU can sustain 2 instructions per cycle, then the prefetch needs to be hoisted 200 instructions earlier in the code
  – aggressive hoisting leads to many useless prefetches
    • control flow may go somewhere else (e.g., around block B in the example above)

• Hardware prefetching: hardware at the memory interface monitors the miss traffic going to DRAM
  – depending on the prefetch algorithm and the observed miss patterns, the prefetcher injects additional memory requests
  – it cannot be overly aggressive, since prefetches may contend for memory bandwidth and may pollute the cache (evicting other useful cache lines)
(Figure: the hardware prefetcher sits between the CPU and DRAM, watching misses and issuing its own requests.)

• Next-Line prefetching is very simple: if a request for cache line X goes to DRAM, also request X+1
  – assumes spatial locality
    • often a good assumption
  – low chance of tying up the memory bus for too long
    • FPM DRAM will already have the correct page open from the request for X, so X+1 will likely be available in the row buffer
• Can optimize by doing Next-Line-Unless-Crossing-A-Page-Boundary prefetching

• Next-N-Line prefetching is the obvious extension: fetch the next N lines, X+1, X+2, …, X+N
• Need to carefully tune N; a larger N makes it:
  – more likely to prefetch something useful
  – more likely to evict something useful
  – more likely to stall a useful load due to bus contention
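A minimal sketch of the Next-N-Line idea, including the page-boundary cutoff mentioned above. The line size, page size, prefetch degree N, and the issue_prefetch hook are assumptions made up for illustration.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u    /* cache-line size in bytes (assumed)  */
#define PAGE_SIZE 4096u  /* page size in bytes (assumed)        */
#define DEGREE    4u     /* N: how many lines ahead to prefetch */

/* Stand-in for handing a request to the memory controller. */
static void issue_prefetch(uint64_t addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Next-N-Line prefetching: on a demand miss to the line holding miss_addr,
 * also request the next N lines, but stop at a page boundary (the
 * Next-Line-Unless-Crossing-A-Page-Boundary variant), so the prefetches stay
 * within the page the DRAM already has open. */
static void on_demand_miss(uint64_t miss_addr)
{
    uint64_t line = miss_addr & ~(uint64_t)(LINE_SIZE - 1);
    uint64_t page = miss_addr & ~(uint64_t)(PAGE_SIZE - 1);

    for (unsigned i = 1; i <= DEGREE; i++) {
        uint64_t next = line + (uint64_t)i * LINE_SIZE;
        if ((next & ~(uint64_t)(PAGE_SIZE - 1)) != page)
            break;                 /* would leave the current page */
        issue_prefetch(next);
    }
}

int main(void)
{
    on_demand_miss(0x10000040);    /* middle of a page: prefetches 4 lines */
    on_demand_miss(0x10000FC0);    /* last line of a page: prefetches none */
    return 0;
}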
(Figures: stream-buffer organization, taken from Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," ISCA '90.)

• Stream buffers can independently track multiple "intertwined" sequences/streams of accesses
• The separate buffers prevent prefetched data from polluting the cache until a line has been used at least once
  – similar effect to filter/promotion caches
• Can extend to a "Quasi-Sequential" stream buffer
  – add a comparator to all entries, and skip ahead (partial flush) if there is a hit on a non-head entry

• Stride example: a column traversal of a matrix stored in linear memory
  – if the array starts at address A, we are accessing the kth column, each element is B bytes large, and each row of the matrix occupies N bytes, then the addresses accessed are A+Bk, A+Bk+N, A+Bk+2N, A+Bk+3N, …
  – or, put simply: if you miss on address X, prefetch X+N

• Like Next-N-Line prefetching, we need to limit how far ahead the stride is allowed to go
  – in the previous example, there is no point in prefetching past the end of the array
• How can you tell the difference between a genuine strided stream (A[i] followed by A[i+1]) and two unrelated accesses X and Y that just happen to be some distance apart?
  – typically, only do the stride prefetch once the same stride has been observed at least a few times

• What if we're doing Y = A + X (traversing three arrays at once)? The miss traffic now looks like:
  A+Bk, X+Bk, Y+Bk, A+Bk+N, X+Bk+N, Y+Bk+N, A+Bk+2N, X+Bk+2N, Y+Bk+2N, …
  – viewed as a single stream, the consecutive deltas are (X-A), (Y-X), (A+N-Y), (X-A), …: no detectable stride!
• Solution: track a stride per load/store instruction, in a table tagged by the instruction's PC

(Table: stride-prediction table state, with execution currently between the second load and the store.

  Tag (PC)   Instruction         Addr        Stride   Count
  0x409A34   Load  R1 = 0[R2]    A+Bk+3N     N        2
  0x409A50   Load  R3 = 0[R4]    X+Bk+3N     N        2
  0x409A5C   Store R5 = 0[R6]    Y+Bk+2N     N        1

If the same stride has been seen enough times (count > q), prefetch the next address, e.g. A+Bk+4N for the first entry.)

• Linked-list traversal: there is no chance for a stride prefetcher to get this right
(Figure: a linked list A → B → C → D → E → F whose nodes are scattered across the actual memory layout.)

• What should we prefetch next? Similar to history-based branch predictors: "last time I saw X, Y happened"
  – Ex 1: X = taken branch, Y = not-taken branch
  – Ex 2: X = missed on A, Y = missed on B
(Figure: a recorded miss sequence and the "What to Prefetch Next" table learned from it; the current miss looks up the table to pick the next prefetch.)

• Like branch predictors, a longer history enables learning more complex patterns
  – and increases the training time
• Example: a DFS traversal of a tree touches the nodes A B D B E B A C F C G C A
(Figure: the tree (A at the root, with children B and C; D and E under B; F and G under C) and the prefetch prediction table learned from the traversal using a two-miss history: A,B → D; B,D → B; D,B → E; B,E → B; E,B → A; B,A → C; A,C → F; …)

• An alternative to explicitly remembering the history patterns is to remember multiple next-states per miss
(Figure: for the same tree, each node maps to the set of nodes that have followed it: A → B, C; B → D, E, A; C → F, G, A; D → B; …)

• Pointer prefetching: when a missed cache line comes back from DRAM, scan it for anything that looks like a pointer (is the value within the heap address range?) and go ahead and prefetch those
  – example: a returned line holds the values 1, 4128, 900120230, and 900120758; the first two do not look like heap addresses (nope), the last two might (maybe!), so prefetch them
  – consider a tree node such as:

struct bintree_node_t {
    int data1;
    int data2;
    struct bintree_node_t *left;
    struct bintree_node_t *right;
};

  – scanning a fetched node's line picks up the left and right pointers, which allows you to walk the tree (or other pointer-based data structures, which are typically hard to prefetch)

• Don't necessarily need extra hardware to store patterns
• But the prefetching is slower:
  – a stride prefetcher can overlap the DRAM latencies of X, X+N, and X+2N
  – pointer prefetching must wait a full DRAM latency for A before it can prefetch B, and for B before C: the DRAM latencies are serialized
  – see "Pointer-Cache Assisted Prefetching" by Collins et al., MICRO-2002, for reducing this serialization effect
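A minimal sketch of the line-scanning step of pointer prefetching described above. The line size, the heap bounds, and the issue_prefetch hook are illustrative assumptions; a real implementation would likely need additional filtering beyond this simple range check.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define LINE_BYTES 64          /* cache-line size (assumed) */

/* Stand-in for handing a request to the memory controller. */
static void issue_prefetch(uint64_t addr)
{
    printf("prefetch %llu\n", (unsigned long long)addr);
}

/* Pointer prefetching: when a missed line comes back from DRAM, scan its
 * pointer-sized words; anything that falls inside the heap range "looks like
 * a pointer", so go ahead and prefetch it. For a line holding a
 * bintree_node_t, the left/right fields pass the test, while small integers
 * like data1/data2 normally do not. */
static void on_fill(const uint64_t *line, uint64_t heap_lo, uint64_t heap_hi)
{
    for (size_t i = 0; i < LINE_BYTES / sizeof(uint64_t); i++) {
        uint64_t v = line[i];
        if (v >= heap_lo && v < heap_hi)
            issue_prefetch(v);   /* maybe a pointer: prefetch it        */
        /* else: nope, doesn't look like a heap address                 */
    }
}

int main(void)
{
    /* Values from the slide's example line: only the last two fall inside
     * the (made-up) heap range and get prefetched. */
    uint64_t line[LINE_BYTES / sizeof(uint64_t)] = { 1, 4128, 900120230, 900120758 };
    on_fill(line, 900000000u, 1000000000u);
    return 0;
}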
• Another option: use the load PC to index a value predictor, but use the predicted value only as an address to prefetch
  – a normal value-prediction misprediction causes a pipeline flush
  – mispredicting a prefetch address just causes some spurious memory accesses
• Takes advantage of value locality
• Mispredictions are less painful
  – (a small sketch of this idea appears at the end of these notes)
(Figure: the load PC indexes the value predictor, and the predicted address is prefetched down through the L1, L2, and DRAM.)

• To evaluate a prefetcher, compare it against simply increasing the LLC size
  – i.e., a complex prefetcher vs. a simpler design with a slightly larger cache
• Metrics: performance, power, area, bus utilization
  – the key is balancing prefetch aggressiveness against resource utilization (reduce pollution, cache-port contention, and DRAM bus contention)

• Prefetching can be done at any level of the cache hierarchy
• The prefetching algorithm may vary as well
  – it depends on why you're having misses: capacity, conflict, or compulsory
    • prefetching may make capacity misses worse
    • a simpler technique (a victim cache) may be better for conflict misses
    • prefetching has a better chance than other techniques against compulsory misses
  – behaviors vary by cache level, and for the I$ vs. the D$
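As a closing sketch, here is one way the PC-indexed stride table from the stride-prefetching slides and the "predict the address, use it only for prefetching" idea above could be combined. The table size, confidence threshold, index hash, and issue_prefetch hook are all assumptions, not anything specified in the lecture.

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE     256   /* number of predictor entries (assumed)      */
#define CONF_THRESHOLD 2     /* the "count > q" test from the stride table */

/* Stand-in for handing a request to the memory controller. */
static void issue_prefetch(uint64_t addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

typedef struct {
    uint64_t tag;        /* PC of the load that owns this entry      */
    uint64_t last_addr;  /* last address that load accessed          */
    int64_t  stride;     /* last observed address delta              */
    int      count;      /* how many times that delta repeated       */
} pred_entry_t;

static pred_entry_t table[TABLE_SIZE];

/* Train on every executed load and, once the same stride has repeated enough
 * times, prefetch the predicted next address. Because the prediction is used
 * only for prefetching, a wrong guess never flushes the pipeline; it just
 * creates a spurious memory access. */
static void on_load(uint64_t pc, uint64_t addr)
{
    pred_entry_t *e = &table[(pc >> 2) % TABLE_SIZE];

    if (e->tag != pc) {                       /* different load: retrain */
        e->tag = pc; e->last_addr = addr; e->stride = 0; e->count = 0;
        return;
    }

    int64_t delta = (int64_t)(addr - e->last_addr);
    e->count = (delta == e->stride) ? e->count + 1 : 0;
    e->stride = delta;
    e->last_addr = addr;

    if (e->count > CONF_THRESHOLD)
        issue_prefetch(addr + (uint64_t)e->stride);
}

int main(void)
{
    /* One load (PC 0x409A34) walking down a column with a 0x400-byte stride:
     * after the stride repeats enough times, prefetches start being issued. */
    for (uint64_t a = 0x800000; a < 0x802000; a += 0x400)
        on_load(0x409A34, a);
    return 0;
}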