MICRO-43
Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory

Jeffrey Stuecheli (1,2), Dimitris Kaseridis (1), Hillery C. Hunter (3) & Lizy K. John (1)
(1) ECE Department, The University of Texas at Austin
(2) IBM Corp., Austin
(3) IBM Thomas J. Watson Research Center
Laboratory for Computer Architecture, 12/7/2010

Overview/Summary
– Refresh overhead is increasing with device density
– Due to the nature of this increase, performance is suffering
– Current refresh scheduling methods are ineffective at hiding these delays
– We propose a more sophisticated mitigation method: Elastic Refresh scheduling

Background: Basic DRAM/Refresh Info
– Each bit is stored on a capacitor with a single read transistor
– Leakage: the cell loses charge over time
– Refresh: rewrite each cell on a periodic basis
– DDR3:
  • Temperature-dependent refresh requirement: 64 ms at 85°C, 32 ms at 95°C
  • The DRAM device contains an internal address counter
  • JEDEC simply specifies the time interval (tREFI, the REFresh Interval time):
    tREFI = 64 ms / 8192 = 7.8 µs (3.9 µs at 95°C)

Background: Transition to Denser Devices
– The 7.8 µs interval is based on 8K rows per bank
– DRAM device density doubles roughly every 2 years
– With one refresh per row, tREFI would halve each generation
– Instead, multiple rows are refreshed with each command
– Current-delivery constraints force tRFC to increase with denser devices
[Figure: a 95 nm 512 Mbit device vs. a 42 nm 2 Gbit device]

Background: "Stacked" Refresh Operations in a Single Command
[Figure: example of multiple rows refreshed by one command. Source: Micron TN-47-16, "Designing for High-Density DDR2 Memory"]

Background: tRFC Growth with DRAM Density
– In the most basic terms, tRFC should scale linearly with density
  • Based strictly on the current needed to charge the capacitance (~fixed charge per bit)
– This has been reflected in the DDR3 spec, with the exception of 8 Gbit
– Net: even if DRAM vendors can slow the growth, the delay is large today

  DRAM density   Refresh completion time (tRFC)
  512 Mbit       90 ns
  1 Gbit         110 ns
  2 Gbit         160 ns
  4 Gbit         300 ns
  8 Gbit         350 ns

Motivation: Slowdown Effects Observed in Simulation
– Simics/GEMS: 4 cores, two 1333 MHz channels, 2 DDR3 ranks per channel
[Chart: per-benchmark IPC degradation over no-refresh (0–30% scale) for 2 Gbit, 4 Gbit, and 8 Gbit devices, across the SPEC CPU2006 integer and floating-point suites with geometric means; degradation grows with device density]

Motivation: Why It Is So Bad
[Timing diagram: worst-case refresh hit. A DRAM read that normally takes 26 ns arrives just behind a refresh and waits out tRFC, completing in 326 ns]

  DRAM capacity   tRFC     Bandwidth overhead (95°C, per rank)   Latency overhead (95°C)
  512 Mb          90 ns    2.7%                                  1.4 ns
  1 Gb            110 ns   3.3%                                  2.1 ns
  2 Gb            160 ns   5.0%                                  4.9 ns
  4 Gb            300 ns   7.7%                                  11.5 ns
  8 Gb            350 ns   9.0%                                  15.7 ns

Motivation: Postponing Refresh Operations
– Each cell needs to be refreshed every 64 ms; refresh command spacing is based around an average rate
– As such, cell failure will not occur if no refresh is sent when tREFI expires
– The current DDR3 spec allows the controller to fall eight tREFI intervals behind (the backlog count), as sketched below
  • The cell refresh rate is elongated by at most 0.1% (8 in 8K)
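The following is a minimal sketch, not taken from the talk, of the per-rank bookkeeping this postpone rule implies. Only the 8-interval limit and the tREFI period come from the DDR3 spec; the class and method names are illustrative.

```python
# Per-rank refresh backlog bookkeeping under the DDR3 postpone rule.
# Only MAX_BACKLOG (8 intervals) and tREFI come from the spec; the rest
# of the structure is an illustrative assumption.

TREFI_NS = 3900      # tREFI at 95 C (7800 ns at 85 C)
MAX_BACKLOG = 8      # DDR3 allows falling at most 8 tREFI intervals behind


class RefreshBacklog:
    def __init__(self):
        self.backlog = 0  # number of refreshes currently owed to this rank

    def trefi_tick(self):
        """Called once per elapsed tREFI: one more refresh is now owed."""
        # A real controller must never let this exceed MAX_BACKLOG; the clamp
        # stands in for the forced, high-priority refresh that keeps the
        # average rate at one refresh per tREFI.
        self.backlog = min(self.backlog + 1, MAX_BACKLOG)

    def refresh_sent(self):
        """Called when a REF command is issued to this rank."""
        if self.backlog > 0:
            self.backlog -= 1

    def at_limit(self):
        """True when the next refresh can no longer be postponed."""
        return self.backlog >= MAX_BACKLOG
```

The scheduling policies on the following slides differ only in when they choose to drain this counter.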
Motivation: Current Approaches
– Demand Refresh (DR)
  • The most basic policy: sends refresh operations as high-priority operations every tREFI period
– Defer Until Empty (DUE)
  • Utilizes the DRAM's ability to postpone refreshes
  • Refresh operations are postponed until no reads are queued, or the maximum backlog count has been reached
– Why these policies are ineffective
  • DR: does nothing to hide refreshes
  • DUE: too aggressive in sending refresh operations; in many cases it does not take advantage of the backlog

Elastic Refresh
– Exploit:
  • The non-uniform request distribution
  • Refresh overhead only has to fit in free cycles
– Initially not aggressive; converges with DUE as the refresh backlog grows
– Latency-sensitive workloads are often lower bandwidth
– Decreases the probability of reads conflicting with refreshes

Elastic Refresh: Idle Delay Function
– Introduce a refresh-backlog-dependent idle threshold
– With a low backlog, there is no reason to send a refresh command
– With a bursty request stream, the probability of a future request decreases with idle time
– As the backlog grows, decrease this delay threshold
[Figure: idle delay threshold vs. refresh backlog (1–8): a constant region at low backlog, a proportional (decreasing) region, and a high-priority region at the maximum backlog]

Elastic Refresh: Tuning the Idle Delay Function
– The optimal shape of the IDF is workload dependent
– The IDF can be controlled with the parameters listed below (a code sketch combining them follows this section)
– Our system contains hardware to determine "good" parameters: Max Delay and Proportional Slope

  Parameter             Units                              Description
  Max Delay             Memory clocks                      Sets the delay in the constant region
  Proportional Slope    Memory clocks per postponed step   Sets the slope of the proportional region
  High Priority Pivot   Postponed step                     Point where the idle delay goes to zero

Elastic Refresh: Max Delay Circuit
– Circuit used to collect the average rank idle period
– Conceptually, given an exponential-type distribution, the average can be used to locate the tail
– The calculated average is used as Max Delay
– Circuit function:
  • Accumulate idle delay over 1024 events
  • Average calculated by taking a concatenation (bit-select) of the accumulator
[Circuit diagram: a 14-bit current idle count, gated by "DRAM read sent", feeds a 20-bit delay accumulator alongside a 10-bit operation count; the accumulator's upper bits form the 10-bit Max Delay passed to the idle delay function]

Elastic Refresh: Proportional Slope Circuit
– Conceptually, the proportional region acts to gracefully transition to high priority while utilizing the full postponed range
– The circuit works to balance utilization across the postponed range (High/Low counts)
– An integral/PI-type controller adjusts the slope to balance the High and Low counts
[Circuit diagram: the postponed count is compared against a threshold and increments a Low or High counter (both halved on carry-out); a PI controller with weights w(p) and w(i) produces the Proportional Slope passed to the idle delay function]

Elastic Refresh: Hardware Cost
– Trivial integration into DUE-based policies
  • The structure replaces the "empty" indication of DUE
– Logic size
  • ~100 latch bits for the static policy
  • ~80 additional latch bits for the dynamic policy
– Logic cycle time
  • Low frequency compared to ALU functions in the processor core
  • Infrequent updates could enable pipelined control
[Block diagram: refresh scheduler within the memory controller: request input interface, input queue, per-rank queues (x N), per-bank queues (x 8), refresh queue, and tREFI counter feeding the output to the DRAM I/O drivers]
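To make the preceding slides concrete, here is a minimal sketch, not the authors' hardware, of the idle delay function and the resulting per-rank scheduling decision. The function and parameter names (idle_delay_threshold, max_delay, prop_slope, hp_pivot) are illustrative, and the exact blend of the constant and proportional regions is an assumption based on the figure described above.

```python
# Sketch of the Elastic Refresh idle delay function (IDF) and scheduling
# decision. MAX_BACKLOG, the three IDF parameters, and the constant /
# proportional / high-priority regions come from the slides; everything
# else (names, exact region blending) is an illustrative assumption.

MAX_BACKLOG = 8  # DDR3 postponed-refresh limit


def idle_delay_threshold(backlog, max_delay, prop_slope, hp_pivot=MAX_BACKLOG):
    """Idle time (in memory clocks) a rank must sit unused before a
    postponed refresh is issued, as a function of the refresh backlog."""
    if backlog >= hp_pivot:
        return 0  # high-priority region: the refresh can no longer wait
    # Proportional region: the threshold shrinks by prop_slope per postponed
    # step; capping at max_delay yields the constant region at low backlog.
    return min(max_delay, prop_slope * (hp_pivot - backlog))


def should_issue_refresh(backlog, reads_queued, rank_idle_clocks,
                         max_delay, prop_slope, hp_pivot=MAX_BACKLOG):
    """Elastic Refresh decision for one rank.

    Demand Refresh would ignore the queue state entirely; Defer Until Empty
    would fire as soon as reads_queued == 0. Elastic Refresh additionally
    waits out the backlog-dependent idle threshold."""
    if backlog == 0:
        return False        # nothing owed
    if backlog >= hp_pivot:
        return True         # forced, high-priority refresh
    if reads_queued > 0:
        return False        # never displace pending reads
    return rank_idle_clocks >= idle_delay_threshold(
        backlog, max_delay, prop_slope, hp_pivot)
```

In the dynamic policy, max_delay would be supplied by the averaging circuit and prop_slope by the PI controller described above; with max_delay forced to zero the decision degenerates to DUE.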
Simulation Methodology
– Simics extended with the GEMS model
  • 1-, 4-, and 8-core CMPs
  • First-Ready, First-Come-First-Served (FR-FCFS) memory controller policy
  • DDR3 1333 MHz 8-8-8 memory, 2 memory controllers, 2 ranks per controller
  • tRFC = 550 ns, tREFI = 3.9 µs at 95°C (an estimate for a 16 Gbit device)
  • Refresh policies: Demand Refresh (DR), Defer Until Empty (DUE), and the Elastic Refresh policies
– SPEC CPU2006 workloads

Results
[Charts: per-benchmark IPC improvement (1.0–1.4 scale) for Fixed Delay (FD) and Dynamic Delay (DD) Elastic Refresh, on 1-core and 8-core configurations, across the SPEC CPU2006 integer and floating-point suites with geometric means]

Related Work
– B. Bhat and F. Mueller, "Making DRAM refresh predictable," Euromicro Conference on Real-Time Systems (ECRTS), 2010.
– M. Ghosh and H.-H. S. Lee, "Smart Refresh: An enhanced memory controller design for reducing energy in conventional and 3D die-stacked DRAMs," MICRO-40, 2007.
– T. Kirihata et al., "An 800 MHz embedded DRAM with a concurrent refresh mode," IEEE ISSCC Digest of Technical Papers, Feb. 2004.

Conclusions
– The significant degradation caused by refresh can be mitigated with low-overhead mechanisms
– Commodity DRAM is cost driven
  • Elastic Refresh requires no DRAM changes
– Future work:
  • Coordinate refresh with other structures on the CMP
  • Investigate refresh for future DRAM devices (DDR4), e.g., dynamically select how many rows to refresh

Thank You, Questions?
Laboratory for Computer Architecture, The University of Texas at Austin
IBM Austin
IBM T. J. Watson Research Center