A Case for Refresh Pausing in DRAM Memory Systems Prashant Nair Chia-Chen Chou Moinuddin Qureshi 1 Introduction • Dynamic Random Access Memory (DRAM) used as main memory • DRAM stores data as charge on capacitor DRAM Chip DRAM cells leak data! 1 Leakage DRAM is a volatile memory Charge leaks quickly 2 Refresh: Restoring Data in DRAM DRAM maintains data by Refresh operations DRAM Chip Refresh Refresh Refresh Refresh Charge on cells restored JEDEC specified DRAM retention time: 64ms (< 85 C) 32ms (> 85 C) Time between Refresh ≤ Retention Time DRAM relies on Refresh for data integrity 3 Refresh: A Growing Problem Time spent in Refresh proportional to number of Rows Increasing memory capacity More time spent in Refresh ~36% ~18% 2.8% 1Gb 5.1% 2Gb 7.7% 4Gb 9% 8Gb 16Gb 32Gb Chip Density The time for doing Refresh is increasing with chip density 4 Refresh Blocks Reads Memory unavailable for Read/Write during Refresh A B time No Refresh A B Wait B Serviced REFRESH Interference due to Refresh time Refresh blocks reads Higher read latency 5 Impact of Refresh Performance 60% Performance Loss Increase in Read Latency Read Latency 50% 40% 30% 20% 10% 0% 8Gb 16Gb 32Gb 40% 35% 30% 25% 20% 15% 10% 5% 0% 8Gb 16Gb 32Gb Impact of Refresh is significant, and increasing Our Goal: Reduce the Read Latency impact of Refresh 6 Outline Introduction & Motivation Refresh Operation: Background Refresh Pausing Evaluation Alternative Proposals Summary 7 Refresh Operation Row 1 Row 2 Row 3 Row 4 Row 5 A DRAM Bank Refresh Refresh Row n-1 Refresh Row n Refresh operates on a Row granularity 8 Refresh Modes • Burst Mode: Refresh 64ms Memory unavailable until all rows finish refresh • Distributed Mode: Refresh 8K refresh pulses in 64ms 64ms Distributed mode reduces contention from Refresh 9 Refresh Bundle Every pulse refreshes a ‘Bundle of rows’ Chip Size 512 Mb 1Gb 2Gb 4Gb or 8Gb (Twin 4Gb die) Rows in a Refresh bundle (per bank) 1 2 4 8 Refresh Bundle currently have upto 8 rows, and increasing 10 The Latency Wall of Refresh TRFC is the time to do refresh for every refresh pulse available TRFC unavailable 8Gb available TRFC unavailable 16Gb available TRFC unavailable 32Gb Current 8Gb chips have TRFC of 350ns >> read latency High TRFC Read waits for refresh for long time 11 Outline Introduction & Motivation Refresh Operation: Background Refresh Pausing Evaluation Alternative Proposals Summary 12 Refresh Pausing Insight: Make Refresh Operations Interruptible A Refresh B Baseline system time Request B arrives A Refresh B Refresh (Cont.) Refresh Pausing Interrupted time Request B arrives Pausing at arbitrary point can cause data loss Pausing Refresh reduces wait time for Reads 13 Refresh Pausing: When to Pause? Bank Refresh Pulse (4 rows in a bundle) X Chip With Refresh Without Refresh Pausing Pausing Rows Pause Read X d c b a Row Buffer Refresh Pausing at Row boundary to service read 14 Refresh Pausing: Interface Details • Memory Controller generates a Refresh Enable (RE) signal • Pausing requires ‘active low’ detection of RE • One way communication only RE Pause 1 0 Memory Controller Refresh Enable (RE) to DRAM Resume 15 Refresh Pausing: Track a Paused Row • Row Address Counter increments the addresses • Stop the increment using a simple AND gate • Active Low Refresh Enable as ‘Refresh Pause’ DRAM Address Generator Refresh Bundle Addresses EN Row Address Counter RE Incrementer 16 Refresh Pausing: Memory Scheduler • Scheduler schedules: Read, Write, and Refresh • Responsible for Pausing Refresh for Read • Keeps track of refresh time done before Pause Scheduler Read Queue Bus Processor DRAM Write Queue Memory Controller Refresh Enable 17 Forced Refresh • Pausing can delay Refresh Reads/Writes Forced Refresh Refresh Pulses Refresh Issued Refresh Not Issued • JEDEC allows delay of up-to 8 pending refresh • If 8 pending refresh, then issue ‘Forced Refresh’ • Forced Refresh cannot be Paused Forced Refresh for data integrity 18 Outline Introduction & Motivation Refresh Operation: Background Refresh Pausing Evaluation Alternative Proposals Summary 19 Experimental Setup • Simulator: uSIMM from Memory Scheduling Championship (MSC) • Workloads: MSC Suite COMMERCIAL(5), PARSEC(9), BIOBENCH(2) and SPEC(2) • Configuration: Number of Cores Last Level Cache DRAM (DDR3) Channels, Ranks, Banks Refresh (Baseline) 4 1MB 8 Chips/Rank, 8Gb/Chip 4,2,8 Distributed (JEDEC) • Results presented for temperature > 85C (paper also has <85C) 20 Normalized Read Latency Results: Read Latency Normalized Read Latency Refresh Pausing No Refresh 1.00 7% 0.95 0.90 0.85 0.80 0.75 COMMERCIAL SPEC PARSEC BIOBENCH GMEAN - Refresh Pausing gives ~7% read latency reduction for an 8Gb chip 21 Results: Performance Performance Comparison Refresh Pausing No Refresh Speedup 1.12 1.10 1.08 1.06 1.04 1.02 COMMERCIAL SPEC PARSEC BIOBENCH GMEAN - Refresh Pausing gives ~5% performance improvement for an 8Gb chip 22 Results: Impact of Chip Density Impact of Density on RefreshNo Pausing Refresh Pausing Refresh Speedup 1.4 1.3 1.2 1.1 1.0 8Gb 16Gb 32Gb Refresh Pausing more effective as chips density increases 23 Outline Introduction & Motivation Refresh Operation: Background Refresh Pausing Evaluation Alternative Proposals Summary 24 Elastic Refresh for Scheduling Refresh [MICRO’10] • Elastic Refresh waits for idle period before issuing a refresh • Estimates average inter-arrival time of memory request Request A No Refreshes A Request A With Refreshes A Request B 3 units 4 units A Request B 7 units Wait time B Refresh Request A Elastic Refresh B Refresh time Request B B time The “Wait and Watch” policy can increase wait times 25 Comparison with Elastic Refresh Comparision of Elastic Refresh Elastic Refresh Refresh Pausing No Refresh 1.15 Speedup 1.10 1.05 1.00 0.95 0.90 COMMERCIAL SPEC PARSEC BIOBENCH GMEAN Refresh Pausing outperforms Elastic Refresh 26 DDR4 proposals: x2 and x4 modes Reduce bundles size and have more bundles DDR3 Distributed Mode TRFC TRFC TREFI DDR4 x2 Mode TRFC/2 TREFI/2 TRFC/2 TREFI/2 TRFC/2 TRFC/2 TREFI/2 • In x2 mode, TREFI is reduced by 2 (x4 mode by 4) • In x2 mode TRFC is reduced by 2 (x4 mode by 4) Fine Grained Refresh to reduce contention of Refresh 27 Comparison with DDR4 1.40 1.35 Speedup 1.30 1.25 1.20 1.15 1.10 1.05 1.00 DDR4 x2 DDR4 x4 Pausing 16Gb No DDR4 x2 DDR4 x4 Pausing No Refresh Refresh 32Gb DDR4 modes (x2 and x4) useful but not enough 28 Outline Introduction & Motivation Refresh Operation: Background Refresh Pausing Evaluation Alternative Proposals Summary 29 Summary • DRAM relies on Refresh for data integrity • Time for Refresh increases with chip density • Refresh blocks read, increases read latency • Refresh Pausing: make Refresh Interruptible • Pausing provides 5% improvement for 8Gb, increases with higher density • Applicable also to DDR4 (fine grained refresh) 30 THANK YOU 31 Refresh+Read • Reads operate on a rank • Refreshes may also operate on the same rank • DRAMs serve only a single request at a time Rank Reads Refresh Scheduler Read Queue 32 Refresh Row Bundle Refresh Row Bundle REFRESH TREFI Row n Row 1 TRFC TREC REFRESH • TRFC : Time to refresh one bundle of rows • TREC : Current Recovery Time • TREFI : Time until next bundle refresh Larger refresh-row bundle implies larger TRFC 33 DRAM Organization Hierarchically organized as Channels, Ranks and Banks Banks Rank 2 Chip Rank 1 READ Channel Rows 34 Refresh Modes Burst and Distributed Mode Chips Refresh Rank Bank Refresh Rows Distributed Burst Mode Mode Distributed mode: Only a few rows in all banks refresh; In burst mode, all rows in all banks refresh simultaneously 35 refresh is distributed in time Transactions in DRAMs Three transactions of concern – Reads – Writes – Refreshes Refresh Write Processor Read DRAM Bus Mismanagement of requests leads to collisions! A scheduler is needed to manage requests to DRAM 36 Temperature Sensitivity of Refresh Pausing Temperature Sensitivity 40.00% 30.00% 20.00% 8Gb 16Gb 10.00% 32Gb 0.00% <85C <85C >85C No Refresh Refresh Pausing No Refresh >85C Refresh Pausing - Upto 22% increase in speedup for future chips The savings of Refresh Pausing is higher while operating at high temperatures 37 Auto and Self Refresh • Special Refresh Modes for DRAMs • Auto Refresh – Internal Counter issues pulses in distributed fashion (CBR and RAS only) • Self Refresh – DRAM is internally refreshed at a power optimized rate (Activity == 0) Self Refresh Modes are only used when DRAMs stay idle 38 Mitigating Penalty • Pause a refresh bundle at row granularity • TRPC = row cycle time + current recovery time • Current recovery time is small for individual rows • Thus refreshes can be made interruptible a. b. Maximum Refresh penalty without pausing is TRFC Maximum Refresh penalty with pausing is to TRPC 39