Versatile Refresh: Low Complexity Refresh Scheduling for High-throughput Multi-banked eDRAM
Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu
(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu
ACM Sigmetrics/Performance 2012

What Is Embedded DRAM?
1. 2nd Most Common Embedded Memory
• Consists of a 1 Transistor, 1 Capacitor (1T1C) cell
• 2X-3X denser than SRAM
• 2X-4X slower than SRAM
2. Supported by Key ASIC and IP Vendors
• IBM, TSMC, NEC, Mosys, ST
3. Used in a Number of Applications
• Servers, Networking, Storage, Gaming, Mobile
4. Industry Examples
• IBM's P7
• Sony Playstations, Nintendo GameCube, Wii
• Apple iPhone, Microsoft Zune HD, Xbox 360
• Cisco Catalyst 3K-10K
[Figure: eDRAM 1T1C memory cell, labeled Select, Storage Capacitor, Data]

Problem: eDRAM Refresh Causes Memory Bandwidth Loss
DRAM Capacitor has Finite Retention Time (W = Tref)
Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
Example: R = 64 rows
[Figure: one bank with rows 1..R, a Refresh Port, and an R/W Port]
All 64 rows will lose data in 4050 cycles!
Solution: Periodic Refresh (Reserve Refresh Cycles for Every Cell in Memory)
Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%

Trend: Higher Density Multi-banked Macros (Mb/mm2)
(1) More Rows are Packed Together and Need to be Refreshed
(2) More Banks are Packed Together and Need to be Refreshed
(3) Smaller Capacitor with Lower Geometry → Smaller W
(4) Smaller W with Higher Temperature
(5) Low Clock Speed Mode Decreases ‘Clock-time’ to Refresh
[Figure: memory banks 1..B, each with rows 1..R, behind shared Refresh and R/W ports 1..M; shared circuitry to conserve area]
Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)
Does not Scale with Larger Macros, Geometry & Low Power Modes

Examples of Periodic Refresh with Multi-banked Macros
M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time
M = 1 (Ports)

Clock Mode           W (@ 100C)             R (Rows)  B (Banks)  Periodic Refresh Loss
Normal (250 MHz)     18.2 us = 4050 cycles  64        8          12.6%
Normal (250 MHz)     18.2 us = 4050 cycles  128       8          25%
Low Power (150 MHz)  18.2 us = 2699 cycles  64        8          18.9%
Low Power (150 MHz)  18.2 us = 2699 cycles  128       8          38%

The Problem is Only Getting Worse Over Time …

Vendor Solution: Concurrent Refresh
[Figure: memory banks 1..B, each with rows 1..R, with a Concurrent Refresh Port alongside R/W Ports 1..M]
Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed
++T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. IEEE Journal of Solid-State Circuits, 40(6):1377-1387, June 2005.

How is Concurrent Refresh Used Today?
[Figure: memory banks 1..B with per-bank row pointers (RP), the next concurrently accessed bank, a Refresh Pointer, and a Deficit Register with Count]
Deficit Register Tracks Non-refreshed Bank(s)
Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern
• Concurrent Refresh Overhead is Proportional to 1 bank
• Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles = ~1.58%
Our Observation: This is Incorrect!
In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler

Goals of Our Work: An Industry Outlook
Design a Concurrent Refresh Scheduler that can
1. Provide Deterministic Memory Performance Guarantees
− Maximize Memory Throughput (Optimality)
2. Be Universally Applicable
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of cell retention time W++, and Clock speed
3. Maximize Memory Burst Tolerance
4. Have Low Implementation Overhead
++Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM

Problem Formulation
We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.
[Figure: two timelines of refresh slots. Fixed TDM Constraint: refreshes fall inside fixed Refresh Windows 1-4. Sliding Window Constraint: refreshes fall inside any refresh window.]
Sliding Window Constraint Supports X idle cycles in any (t, t+Y)
Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles

Key Performance Metrics
Refresh Overhead = X / Y
• Memory bandwidth wasted on refresh
Burst Tolerance = Y - X
• Maximum number of consecutive memory accesses without interruption for refresh
We’ll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1

Our Solution: Versatile Refresh Algorithm
[Figure: memory banks 1..B with per-bank row pointers (RP1..RPB), a Max Register, a Deficit Register with Pointer and Count, and a Refresh Pointer to the next bank]
Bank with deficit has priority for refresh.
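As a minimal sketch of the two constraints on the Problem Formulation slide, the Python below checks whether a schedule of idle timeslots supplies X idles in every Y consecutive slots (sliding window) or only in each fixed block of Y slots (fixed TDM), and computes the two metrics from the Key Performance Metrics slide. The function names and the list-of-booleans representation are assumptions made for illustration; they do not come from the slides.

```python
from typing import Sequence


def satisfies_sliding_window(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if every window of y consecutive timeslots holds at least x idle slots."""
    if y <= 0 or x < 0 or x > y:
        raise ValueError("need 0 <= x <= y and y > 0")
    if len(idle) < y:
        return True  # no complete window to check
    count = sum(idle[:y])  # idle slots in the first window
    if count < x:
        return False
    for t in range(y, len(idle)):
        # Slide the window one slot: add the new slot, drop the oldest one.
        count += idle[t] - idle[t - y]
        if count < x:
            return False
    return True


def satisfies_fixed_tdm(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if each fixed block of y timeslots (starting at 0, y, 2y, ...) has at least x idles."""
    return all(sum(idle[s:s + y]) >= x for s in range(0, len(idle) - y + 1, y))


def refresh_overhead(x: int, y: int) -> float:
    """Refresh Overhead = X / Y."""
    return x / y


def burst_tolerance(x: int, y: int) -> int:
    """Burst Tolerance = Y - X."""
    return y - x


if __name__ == "__main__":
    # One idle at the start of the first block and one at the end of the second.
    pattern = [True, False, False, False, False, False, False, True]
    print(satisfies_fixed_tdm(pattern, 1, 4))
    print(satisfies_sliding_window(pattern, 1, 4))
```

The example pattern passes the fixed TDM check but fails the sliding-window check, because slots 1 through 6 contain a run of six accesses with no idle. The sliding-window form is therefore the stricter guarantee, which is why a scheduler can rely on it regardless of where a window starts.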
Maximum Allowed Deficit Register Controls Burst Tolerance (Y)

Necessary Refresh Overhead for any Algorithm: Intuition, X=1
• At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
• An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
• A total of BR inequalities to ensure cells are refreshed in time
Interestingly, only two of these inequalities matter
• The one corresponding to the oldest cell
• The one corresponding to the oldest “youngest cell in each bank”

Necessary Refresh Overhead for any Algorithm: Derivation, X=1
How much can the adversary age the oldest cell?
• Current age is at least BR-1
• Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
$OV_{nec} = \frac{1}{Y} \ge \frac{1}{W - BR + 1}$
How much can the adversary age the oldest “youngest cell in each bank”?
• Current age is at least B-1
• Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
$OV_{nec} = \frac{1}{Y} \ge \frac{R}{W - (B - 1)}$

Optimality for Versatile Refresh Overhead: Results, X=1
Necessity: Result for any Algorithm
$OV_{nec} \ge \max\left\{ \frac{1}{W - BR + 1}, \frac{R}{W - (B - 1)} \right\} > \frac{R}{W}$
Sufficiency: Result for VR Algorithm (with parameter X):
$OV_{suf} \le \max\left\{ \frac{X}{W - BR}, \frac{R}{W - BX - 1} \right\}$
Nearly Optimal Refresh For X=1

Performance Guarantees of Versatile Refresh Algorithm
[Figure: Worst-case Refresh Overhead (X/Y), from 0 to 1, versus Cell Retention Time (W). Overhead axis marks: R/W and 1/B. W axis marks: RB and Wc = RB + B-1. Annotations: “Bad” Region with High Overhead; Increasing X; Near-optimal Refresh Overhead for X = 1; Refresh Overhead ~ R/W, for W large.]

Why Would We Ever Use Large X?
Because of Burst Tolerance (large X → large Y - X)
• If memory accesses are bursty, refreshes can be hidden
There is a Critical Value of X for Max Burst Tolerance
$X_c = \min\left\{ R, \frac{W}{B} - R \right\}$
Example: B = 16, R = 128, W = 2500

Calculations for Customer ASIC++

                              Theoretical   In Practice
Total Bandwidth               375 MHz       375 MHz
Versatile Refresh Formula     6825 > 257N   6825 > 257N
Versatile Refresh Constraint  1 in 26.55    1 in 26
Data-path                     360 MHz       360 MHz
Refresh                       14.12 MHz     14.42 MHz
Extra Bandwidth for CPU       0.88 MHz      0.58 MHz

R = 1024, B = 6 Banks, W = 18.2us @ 100C = 6825 cycles @ 375MHz
(++Note that these numbers have been sanitized)

Versatile Refresh Enhancement
Enhancement:
• No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
• Any idle slot is a no-conflict slot, but not vice versa.
• For VR, no-conflict slots are as good as idle slots.
Observation:
• This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns

Fully Enhanced Versatile Refresh Algorithm
[Figure: the VR scheduler (memory banks 1..B, row pointers RP1..RPB, Max Register, Deficit Register with Pointer and Count, Refresh Pointer to the next bank) connected to an Enforcer Module (User Logic) that supplies X idles in Y timeslots and receives no-conflict feedback; repeated for multiple memory ports (M)]

Simulation: Synthetic Statistical Workload
[Figure: Refresh Overhead (%), 0 to 9, versus alpha (0.001, 0.01, 0.1, 0.25, 0.5, 1) for Periodic Refresh, VR (X = 4), and VR (X = 128)]
• VR with X = 4: Min worst-case overhead (best for adversarial)
• VR with X = 128: Max burst tolerance (best for benign)
Parameter Alpha Controls Degree of Temporal Locality
• alpha ~ 0 → always read from bank 1 (adversarial)
• alpha ~ 1 → read from random banks (benign)
Refresh Overhead has Disappeared Completely!

Conclusion
With Versatile Refresh A Designer Can …
1. Exactly Calculate Available Memory Bandwidth
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of Temperature, W = Tref and Clock speed
2. Achieve Optimal Worst-case Memory Bandwidth
3. Design for Large Burst Tolerance
4. Potentially Eliminate Back-pressure
− Simplify associated complex design and verification
5. Maximize Best-case Memory Bandwidth
6. Avail of a Formally Verified VR Controller
− On a suitably reduced memory instance
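
To make the "exactly calculate" claim concrete, the following Python sketch evaluates three quantities that appear earlier in the deck: the periodic refresh loss RB/W, the necessary overhead bound for X = 1, and the critical values Wc and Xc. It is an illustration under stated assumptions and not the authors' tool: the function names are hypothetical, W is taken in clock cycles, and the formulas are used as read from the slides.

```python
def periodic_refresh_loss(rows: int, banks: int, w_cycles: float) -> float:
    """Bandwidth fraction lost to periodic refresh, approximately R*B / W."""
    return rows * banks / w_cycles


def necessary_overhead_x1(rows: int, banks: int, w_cycles: float) -> float:
    """Lower bound on refresh overhead 1/Y for any scheduler when X = 1."""
    slack = w_cycles - banks * rows + 1  # from (BR - 1) + Y <= W
    if slack <= 0:
        raise ValueError("retention time W must be at least R*B cycles")
    oldest_cell = 1 / slack
    # From (B - 1) + Y*R <= W, the oldest "youngest cell in each bank".
    oldest_youngest = rows / (w_cycles - (banks - 1))
    return max(oldest_cell, oldest_youngest)


def critical_retention(rows: int, banks: int) -> int:
    """Wc = RB + B - 1, where the two necessary bounds meet at 1/B."""
    return rows * banks + banks - 1


def critical_x(rows: int, banks: int, w_cycles: float) -> float:
    """Xc = min(R, W/B - R), the X value the deck gives for maximum burst tolerance."""
    return min(rows, w_cycles / banks - rows)


if __name__ == "__main__":
    # (R, B, W in cycles) rows taken from the periodic refresh examples table.
    for rows, banks, w in [(64, 8, 4050), (128, 8, 4050), (64, 8, 2699), (128, 8, 2699)]:
        loss = periodic_refresh_loss(rows, banks, w)
        bound = necessary_overhead_x1(rows, banks, w)
        print(f"R={rows} B={banks} W={w}: periodic {loss:.1%}, lower bound {bound:.2%}")
    print("Xc for B=16, R=128, W=2500:", critical_x(128, 16, 2500))
```

For R = 64, B = 8, W = 4050 the periodic loss is about 12.6%, while the necessary bound is about 1.58%, which is the gap a concurrent refresh scheduler aims to close. At W = Wc both terms of the bound equal 1/B, matching the marked point on the performance-guarantee plot.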