An introduction to SDRAM and memory controllers
5kk73

Outline
► Part 1: DRAM and controller basics
  – DRAM architecture and operation
  – Timing constraints
  – DRAM controller
► Part 2: DRAMs in embedded systems
  – Challenges in sharing DRAMs
  – Real-time guarantees with DRAMs
  – Future DRAM architectures and controllers

Memory device
► "A device that preserves information for retrieval" - Web definition

Semiconductor memories
► "Semiconductor memory is an electronic data storage device, often used as computer memory, implemented on a semiconductor-based integrated circuit" - Wikipedia definition
► The main characteristics of semiconductor memory are low cost, high density (bits per chip), and ease of use

Semiconductor memory types
► RAM (Random Access Memory)
  – DRAM (Dynamic RAM)
    • Synchronous DRAM (SDRAM)
  – SRAM (Static RAM)
► ROM (Read Only Memory)
  – Mask ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Ultraviolet EPROM (UV-EPROM)
► NVRAM (Non-Volatile RAM) or Flash memory

Memory hierarchy
[Figure: the memory hierarchy from CPU registers through the L1 and L2 caches and off-chip memory down to secondary memory (hard disk); size (capacity) and distance from the processor increase downwards, access speed increases upwards]

Memory hierarchy
Module           | Memory type used | Access time  | Capacity | Managed by
Registers        | SRAM             | 1 cycle      | ~500 B   | Software/compiler
L1 Cache         | SRAM             | 1-3 cycles   | ~64 KB   | Hardware
L2 Cache         | SRAM             | 5-10 cycles  | 1-10 MB  | Hardware
Off-chip memory  | DRAM             | ~100 cycles  | ~10 GB   | Software/OS
Secondary memory | Disk drive       | ~1000 cycles | ~1 TB    | Software/OS
(Credits: J. Leverich, Stanford)

SRAM vs DRAM
► Static Random Access Memory (SRAM)
  – Bitlines driven by transistors
  – Fast (~10x)
  – Large cell (~6-10x): 6 transistors per bit
► Dynamic Random Access Memory (DRAM)
  – A bit is stored as charge on a capacitor: 1 transistor and 1 capacitor vs. 6 transistors
  – The bit cell loses its charge over time (read operation and circuit leakage)
  – Must be periodically refreshed - hence the name Dynamic RAM
(Credits: J. Leverich, Stanford)

SRAM vs DRAM: Summary
► SRAM is preferable for register files and L1/L2 caches
  – Fast access
  – No refreshes
  – Simpler manufacturing (compatible with logic process)
  – Lower density (6 transistors per cell)
  – Higher cost
► DRAM is preferable for stand-alone memory chips
  – Much higher capacity
  – Higher density
  – Lower cost
  – DRAM is the main focus in this lecture!
(Credits: J. Leverich, Stanford)
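Because every DRAM cell leaks charge, the whole array must be refreshed periodically, and this costs memory cycles. The back-of-the-envelope sketch below makes that cost concrete. It is a minimal Python sketch: the 64 ms retention window and 8192 REFRESH commands per window are typical DDR3 values, and the per-refresh busy time tRFC is an assumed figure for a 2 Gb-class part, not a number from these slides.

```python
# Rough DRAM refresh overhead estimate (assumed, typical DDR3-class numbers).
RETENTION_MS = 64        # all rows must be refreshed within ~64 ms (typical DDR3)
REFRESH_COMMANDS = 8192  # REFRESH commands issued per retention window
T_RFC_NS = 160           # time one REFRESH occupies the device (assumed, ~2 Gb part)

busy_ns = REFRESH_COMMANDS * T_RFC_NS
window_ns = RETENTION_MS * 1_000_000
overhead = busy_ns / window_ns

print(f"Refresh keeps the DRAM busy ~{overhead:.1%} of the time")
# -> roughly 2% of all memory cycles are spent on refresh
```

With these assumed numbers, refresh consumes on the order of 2% of the device's time, which is why refresh is a nuisance rather than a show-stopper, but still something the memory controller must schedule.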
DRAM: Internal architecture
[Figure: a DRAM device with multiple banks (Bank 1..4); each bank has a row decoder, a memory array, sense amplifiers (row buffer) and a column decoder; the MS address bits select the row, the LS bits select the column]
► Bit cells are arranged to form a memory array
► Multiple arrays are organized as different banks
  – Typical numbers of banks are 4, 8 and 16
► Sense amplifiers raise the voltage level on the bitlines to read the data out
(Credits: J. Leverich, Stanford)

DRAM: Read access sequence
► Decode row address & drive word-lines
► Selected bits drive bit-lines
  – The entire row is read
► Amplify row data
► Decode column address & select a subset of the row
► Send to output
► Precharge bit-lines for the next access
(Credits: J. Leverich, Stanford)

DRAM: Memory access protocol
[Figure: the n-bit row address (latched with RAS) and the m-bit column address (latched with CAS) share the same address pins in front of a 2^n x 2^m memory array]
► To reduce pin count, row and column share the same address pins
  – RAS = Row Address Strobe
  – CAS = Column Address Strobe
► Data is accessed by issuing memory commands
► 5 basic commands
  – ACTIVATE
  – READ
  – WRITE
  – PRECHARGE
  – REFRESH
(Credits: J. Leverich, Stanford)

DRAM: Basic operation
Address            | Commands issued
(Row 0, Column 0)  | ACTIVATE Row 0, READ Column 0
(Row 0, Column 1)  | READ Column 1                                     (row buffer HIT)
(Row 0, Column 10) | READ Column 10                                    (row buffer HIT)
(Row 1, Column 0)  | PRECHARGE Row 0, ACTIVATE Row 1, READ Column 0    (row buffer MISS)
(Credits: J. Leverich, Stanford)

DRAM: Basic operation (Summary)
► Access to an "open row"
  – No need to issue an ACTIVATE command
  – READ/WRITE will access the row buffer
► Access to a "closed row"
  – If another row is already active, issue PRECHARGE first
  – Issue ACTIVATE to open the new row
  – READ/WRITE will access the row buffer
  – Optional: PRECHARGE after the READ/WRITEs have finished
    • If PRECHARGE is issued → closed-page policy
    • If not → open-page policy
(Credits: J. Leverich, Stanford)

DRAM: Burst access
► Each READ/WRITE command transfers multiple words (a burst of 8 in DDR3)
► Two words are transferred per clock cycle, one on each clock edge
  – Double Data Rate (DDR)
(Credits: J. Leverich, Stanford)

DRAM: Banks
► DRAM chips consist of multiple banks
  – Address = (Bank x, Row y, Column z)
► Banks operate independently, but share command, address and data pins
  – Each bank can have a different row active
  – Can overlap ACTIVATE and PRECHARGE latencies (e.g. READ from bank 0 while ACTIVATING bank 1) → bank-level parallelism
(Credits: J. Leverich, Stanford)

DRAM: Bank-level parallelism
► Enables DRAM accesses to different banks to proceed in parallel
  – Reduces memory access latency and improves efficiency!
(Credits: J. Leverich, Stanford)
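To make the open-row behaviour concrete, the sketch below replays the request stream from the "DRAM: Basic operation" slide against a controller that keeps one open row per bank. It is a minimal Python sketch of the protocol as described on these slides (the command names and the request trace come from the slides); it ignores all timing constraints and every other detail of a real controller.

```python
# Minimal open-page model: one "open row" is tracked per bank.
# Emits the command sequence for each request, ignoring all timing.
open_row = {}  # bank -> currently open row (row buffer contents)

def access(bank, row, col):
    cmds = []
    if open_row.get(bank) == row:             # row buffer HIT
        cmds.append(f"READ      bank {bank} col {col}")
    else:                                     # row buffer MISS
        if bank in open_row:                  # another row is open: close it first
            cmds.append(f"PRECHARGE bank {bank}")
        cmds.append(f"ACTIVATE  bank {bank} row {row}")
        open_row[bank] = row
        cmds.append(f"READ      bank {bank} col {col}")
    return cmds

# Request trace from the "DRAM: Basic operation" slide (all to bank 0)
trace = [(0, 0, 0), (0, 0, 1), (0, 0, 10), (0, 1, 0)]
for bank, row, col in trace:
    for cmd in access(bank, row, col):
        print(cmd)
```

Running this reproduces the slide's sequence: the first three requests hit row 0, while the last one forces a PRECHARGE and a fresh ACTIVATE. Because open_row is tracked per bank, requests to a second bank would not disturb the open row of bank 0, which is exactly the bank-level parallelism argument above.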
2Gb x8 DDR3 chip [Micron]
[Figure: block diagram of a 2 Gb x8 DDR3 device; observe the organization into 8 banks]
(Credits: J. Leverich, Stanford)

2Gb x8 DDR3 chip [Micron]
[Figure: datapath detail of the same device; observe the row width, the bi-directional data bus and the 64-to-8 data-path]
(Credits: J. Leverich, Stanford)

DDR3 SDRAM: Current standard
► Introduced in 2007
► SDRAM = Synchronous DRAM (clocked)
  – DDR = Double Data Rate
    • Data transferred on both clock edges
  – 400 MHz clock = 800 MT/s
  – x4, x8, x16 datapath widths
  – Minimum burst length of 8
  – 8 banks
  – 1 Gb, 2 Gb, 4 Gb capacity

DRAM: Timing constraints
[Figure: command/data timing diagram with ACT, RD and PRE commands separated by NOPs and data words D1..Dn on the data bus; tRCD, tRAS, tRL and tRP are annotated]
► tRCD = Row-to-column command delay
  – Time taken by the charge stored in the capacitor cells to reach the sense amps
► tRAS = Time between RAS and data restoration in the DRAM array
  – Minimum time a row must remain open
► tRP = Time to precharge the DRAM array
► The memory controller must respect the physical device characteristics!

DRAM: Timing constraints
► There are a number of other timing constraints…
  – tCCD = Time between column commands
  – tWTR = Write-to-read delay (bus turnaround time)
  – tCAS = Time between a column command and data out
  – tWR = Time from the end of the last write to PRECHARGE
  – tFAW = Four-ACTIVATE window (limits current surge)
    • At most four ACTIVATEs may be issued within this window
  – tRC = tRAS + tRP = Row "cycle" time
    • Minimum time between accesses to different rows
► Timing constraints make performance analysis and memory controller design difficult!

DRAM controller
► A DRAM controller consists of a front-end (request scheduler) and a back-end (memory map and command generator) in front of the DRAM
► The request scheduler decides which memory request is served next
► The memory map translates a logical address into a physical address
  – Logical address = incoming address
  – Physical address = (Bank, Row, Column)
► The command generator issues memory commands while respecting the physical device characteristics

Request scheduler
► Many algorithms exist to determine how to schedule memory requests
  – Prefer requests targeting open rows
    • Increases the number of row buffer hits
  – Prefer read after read and write after write
    • Minimizes bus turnaround
  – Always prefer reads, since reads are blocking and writes are often posted
    • Reduces processor stall cycles

Memory map
► The memory map decodes a logical address into a physical address
  – The physical address is (bank, row, column)
  – Decoding is done by slicing the bits of the logical address
  – Example: logical address 0x10FF00 → physical address (2, 510, 128)
► Several memory mapping schemes exist
  – Continuous, bank-interleaved

Continuous memory map
► Maps sequential addresses to the columns of a row
► Switches bank when all columns in the row have been visited
► Switches row when all banks have been visited

Bank-interleaved memory map
► Maps bursts to different banks in an interleaving fashion
► The active row in a bank is not changed until all its columns have been visited

Memory map generalization
► Continuous and bank-interleaved are just 2 possible memory mapping schemes
  – In the most general case, an arbitrary set of bits of the logical address can be used for the row, column and bank address, respectively
► Example memory map (1 burst per bank, 2 banks interleaving, 8 words per burst):
  – Logical address, bit 26 down to bit 0:  RRR RRRR RRRR RRBB CCCC CCCB CCCW
    • Fields: Row (R), Bank-offset and Bank-interleaving bits (B), Column (C), Burst-size (low-order bits)
► Example memory: 16-bit DDR3-1600, 64 MB, 8 banks, 8K rows/bank, 1024 columns/row, 16 bits/column
► The mapping can be done in different ways – the choice affects memory efficiency!
  (A minimal address-decoding sketch follows below)
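As a concrete illustration of bit slicing, the sketch below decodes a logical byte address into (bank, row, column) under two different layouts, one continuous and one bank-interleaved. The exact field positions are illustrative assumptions loosely modelled on the example above (3 bank bits, 13 row bits, 10 column bits, 8-word bursts), not the authoritative layout of any particular controller or device.

```python
# Hypothetical bit-sliced memory maps (illustrative assumptions, not a specific
# device's layout): 8 banks, 8K rows, 1024 columns, 16-bit words, bursts of
# 8 words (16 bytes).
BURST_BYTES = 16

def decode_continuous(addr):
    """row | bank | column | burst offset: fill a whole row, then switch bank."""
    a = addr // BURST_BYTES          # address in units of bursts
    col = (a & 0x7F) * 8             # 128 bursts per row -> starting column of the burst
    a >>= 7
    bank = a & 0x7                   # 3 bank bits
    a >>= 3
    row = a & 0x1FFF                 # 13 row bits
    return bank, row, col

def decode_interleaved(addr):
    """row | column | bank | burst offset: consecutive bursts hit different banks."""
    a = addr // BURST_BYTES
    bank = a & 0x7                   # low bits -> bank: interleave across banks
    a >>= 3
    col = (a & 0x7F) * 8
    a >>= 7
    row = a & 0x1FFF
    return bank, row, col

for a in range(0, 64, BURST_BYTES):  # four consecutive bursts
    print(hex(a), decode_continuous(a), decode_interleaved(a))
```

With the continuous layout, four consecutive bursts stay in bank 0 and walk along a row; with the interleaved layout they spread over banks 0-3, which is what allows the ACTIVATE/PRECHARGE latency of one bank to be hidden behind transfers of another.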
Command generator
► Decides the selection of memory requests
► Generates the SDRAM commands without violating the timing constraints
  – (A minimal sketch of this timing bookkeeping follows the SoC slides below)

Command generator
► Different page policies determine which command to schedule
  – Close-page policy: close rows as soon as possible so a new row can be activated faster, i.e. no time is wasted precharging the open row of the previous request
  – Open-page policy: keep rows open as long as possible to benefit from locality, i.e. assuming that the next request will target the same open row

Open page or close page?
Address            | Commands issued
(Row 0, Column 0)  | ACTIVATE Row 0, READ Column 0
(Row 0, Column 1)  | READ Column 1                                     (row buffer HIT)
(Row 0, Column 10) | READ Column 10                                    (row buffer HIT)
(Row 1, Column 0)  | PRECHARGE Row 0, ACTIVATE Row 1, READ Column 0    (row buffer MISS)
(Credits: J. Leverich, Stanford)

A modern DRAM controller [Altera]
[Figure: block diagram of a modern DRAM controller. Image: Altera]

Conclusions (Part 1)
► SDRAM is used as off-chip high-volume storage
  – Cheaper, but slower than SRAM
► DRAM timing constraints make it hard to design a memory controller
► The choice of memory map and command/request scheduling algorithms impacts memory access time and/or efficiency

Outline
► Part 1: DRAM and controller basics
  – DRAM architecture and operation
  – Timing constraints
  – DRAM controller
► Part 2: DRAMs in embedded systems
  – Challenges in sharing DRAMs
  – Real-time guarantees with DRAMs
  – Future DRAM architectures and controllers

Trends in embedded systems
► Embedded systems get increasingly complex
  – Increasingly complex applications (more functionality)
  – Growing number of applications integrated in a device
  – Requires increased system performance without increasing power
► The case of a generic car manufacturer
  – Typical number of ECUs in a car in 2000: about 20
  – Number of ECUs in an Audi A8 sedan: over 80

System-on-Chip (SoC)
► The resulting complex contemporary platforms are heterogeneous multi-processor systems
  – Resources in the system are shared to reduce cost

SoC: Video and audio processing system
► DRAM is typically used as shared main memory for cost reasons
[Figure: video engine, audio processor, host CPU, DMA controller, GPU, input processor and LCD controller sharing a DRAM through an interconnect and a memory controller]
(A. B. Soares et al., "Development of a SoC for Digital Television Set-Top Box: Architecture and System Integration Issues", International Journal of Reconfigurable Computing, Volume 2013)
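As referenced on the command generator slide, here is a minimal sketch of the kind of bookkeeping a back-end does to avoid violating timing constraints: per bank, it tracks the earliest cycle at which each command type may be issued, using only the tRCD, tRAS and tRP constraints from Part 1. The cycle counts are assumed example values, not taken from any datasheet, and a real command generator handles many more constraints.

```python
# Minimal single-bank command-generator sketch: enforce tRCD, tRAS and tRP
# by tracking the earliest cycle at which each command type may be issued.
# Timing values are assumed example numbers (in clock cycles).
tRCD, tRAS, tRP = 5, 15, 5

class Bank:
    def __init__(self):
        self.ready_act = 0    # earliest cycle for the next ACTIVATE
        self.ready_rw = 0     # earliest cycle for the next READ/WRITE
        self.ready_pre = 0    # earliest cycle for the next PRECHARGE

    def activate(self, now):
        t = max(now, self.ready_act)
        self.ready_rw = t + tRCD      # column command must wait tRCD
        self.ready_pre = t + tRAS     # row must stay open at least tRAS
        return t

    def read(self, now):
        return max(now, self.ready_rw)

    def precharge(self, now):
        t = max(now, self.ready_pre)
        self.ready_act = t + tRP      # next ACTIVATE must wait tRP
        return t

bank, cycle = Bank(), 0
for cmd in ["ACT", "RD", "PRE", "ACT", "RD"]:
    issue = {"ACT": bank.activate, "RD": bank.read, "PRE": bank.precharge}[cmd](cycle)
    print(f"cycle {issue:3d}: {cmd}")
    cycle = issue + 1                 # at most one command per cycle
```

The gap between the two ACTIVATEs comes out as tRAS + tRP = tRC, the row cycle time from Part 1; a real back-end tracks this per bank and adds the remaining constraints (tCCD, tWTR, tFAW, and so on).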
Set-top box architecture [Philips]
[Figure: block diagram of a Philips set-top box SoC]

DRAM controller architecture
[Figure: memory clients 1..n connected through a bus to the DRAM controller, which contains the arbiter, the memory map and the command generator in front of the DRAM]
► The arbiter grants memory access to one of the memory clients at a time
  – Examples: Round-Robin, Time Division Multiplexing (TDM) and priority-based arbiters

DRAM controller for real-time systems
► Clients in real-time systems have requirements on latency/bandwidth
  – A fixed set of memory access parameters (burst size, page policy, etc.) in the back-end bounds the transaction execution time
  – Predictable arbiters, such as TDM (fixed time slots) and Round-Robin, bound the response time
    • (A minimal TDM arbiter sketch follows the Part 2 conclusions below)
[Figure: clients connected to the DRAM back-end through an interconnect; the arbiter bounds the response time, the back-end bounds the execution time]
(B. Akesson et al., "Predator: A Predictable SDRAM Memory Controller", CODES+ISSS, 2007)

DRAMs in the market
Family  | Generation | Datapath width (bits) | Frequency range (MHz)
DDR     | DDR        | 16                    | 100-200
        | DDR2       | 16                    | 200-400
        | DDR3       | 16                    | 400-1066
        | DDR4       | 16                    | 800-1600
LPDDR   | LPDDR      | 16 and 32             | 133-208
        | LPDDR2     | 16 and 32             | 333-533
        | LPDDR3     | 16 and 32             | 667-800
WIDE IO | SDR        | 128                   | 200-266
► Observe the increase in operating frequency with every generation

DRAMs: Bandwidth vs clock frequency
[Figure: peak bandwidth (GB/s) versus maximum operating frequency (MHz) for the DDR, LPDDR and WIDE IO families]
► WIDE IO gives much higher bandwidth at a lower frequency
  – Low power consumption

Multi-channel DRAM: WIDE IO
► Bandwidth demands of future embedded systems exceed 10 GB/s
  – Memory power consumption scales up with the memory operating frequency → "go parallel"
► Multi-channel memories
  – Each channel is an independent memory module with a dedicated data and control path
  – WIDE IO DRAM: 4 channels, each with a 128-bit IO interface

Multi-channel DRAM controller
[Figure: memory clients feed an Atomizer and a Channel Selector (CS); per-channel sequence generators, arbiters and back-ends drive channels 1 and 2]
► The Atomizer chops the incoming requests into a number of service units
► The Channel Selector (CS) routes the service units to the different memory channels according to the configuration in the Sequence Generators
(M. D. Gomony et al., "Architecture and Optimal Configuration of a Real-Time Multi-Channel Memory Controller", DATE, 2012)

Multi-channel DRAM controller
► Multi-channel memories allow memory requests to be interleaved across multiple memory channels
  – Reduces access latency

Wide IO memory controller [Cadence]
[Figure: Wide IO memory controller block diagram. Image: Cadence]

Future DRAM: HMC
► Hybrid Memory Cube (HMC)
  – 16 memory channels
► How does a memory controller for HMC look?
[Image: Micron, HMC]

Conclusions (Part 2)
► DRAMs are shared in multi-processor SoCs to reduce cost and to enable communication between the processing elements
► Sharing a DRAM between multiple memory clients can be done using different arbitration algorithms
► Predictable arbitration and a predictable back-end provide real-time guarantees on latency and bandwidth to real-time clients
► Multi-channel DRAMs allow a memory request to be interleaved across memory channels
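As referenced on the real-time controller slide, the sketch below shows why a TDM arbiter bounds the response time: every client owns fixed slots in a repeating frame, so the worst-case wait until its next slot depends only on the frame layout, not on what the other clients do. This is a minimal Python sketch; the frame and slot assignment are assumed example values, not the Predator configuration.

```python
# Minimal TDM arbiter sketch: a fixed frame of slots is repeated forever.
# A client's worst-case response time is bounded by the frame layout alone,
# no matter how the other clients behave. The slot table is an assumed example.
FRAME = ["A", "B", "A", "C"]   # slot owners within one frame (assumed)

def grant(slot_index):
    """Client granted in a given (global) slot number."""
    return FRAME[slot_index % len(FRAME)]

def worst_case_wait(client):
    """Max distance (in slots) from an arrival at a slot boundary to the client's next slot."""
    owned = [i for i, c in enumerate(FRAME) if c == client]
    n = len(FRAME)
    return max(min((o - a) % n for o in owned) for a in range(n))

for c in ["A", "B", "C"]:
    print(c, "waits at most", worst_case_wait(c), "slots")
```

Client A, with two slots per frame, waits at most one slot; B and C wait at most three. The same reasoning, combined with a back-end that bounds the execution time of each slot, is what yields the latency and bandwidth guarantees mentioned in the conclusions.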
Questions?
m.d.gomony@tue.nl

References
► B. Jacob et al., "Memory Systems: Cache, DRAM, Disk", Morgan Kaufmann, 2007
► B. Akesson et al., "Predator: A Predictable SDRAM Memory Controller", CODES+ISSS, 2007
► M. D. Gomony et al., "Architecture and Optimal Configuration of a Real-Time Multi-Channel Memory Controller", DATE, 2012
► http://hybridmemorycube.org/