Slide 1: Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis
School of Computing, University of Utah
ASPLOS 2010

Slide 2: DRAM Memory Constraints
• Modern machines spend nearly 25%-40% of total system power on memory.
• Some commercial servers already have larger power budgets for memory than for the CPU.
• Main memory access is one of the largest performance bottlenecks.
We address both the performance and the power concerns of DRAM memory accesses.

Slide 3: DRAM Access Mechanism
[Figure: DRAM organization, showing the CPU, Memory Controller, memory channel/bus, DIMM, Rank, DRAM chip or device, Bank, and Row]
• The CPU makes a memory request and the Memory Controller converts it to the appropriate DRAM commands.
• Accesses to a DRAM device begin by selecting a bank, then a row.
• Many bits within the row are read from the cell array into the row buffer.
• A few bits (the column) are then selected from the row buffer and output from the device: one word of data, 1/8th of the data needed to service a single CPU request!

Slide 4: DRAM Access Inefficiencies - I
• Over-fetch due to large row-buffers.
  • 8 KB is read into the row buffer for a 64-byte cache line.
  • Row-buffer utilization for a single request is < 1%.
• Why are row buffers so large?
  • Large arrays minimize cost-per-bit.
  • Striping a cache line across multiple chips (arrays) improves data transfer bandwidth.

Slide 5: DRAM Access Inefficiencies - II
• Open-page policy
  • Row buffers are kept open in the hope that subsequent requests will be row-buffer hits.
• FR-FCFS request scheduling (First-Ready FCFS)
  • The memory controller schedules requests to open row-buffers first.

                   Access Latency   Access Energy
  Row-buffer Hit   ~75 cycles       ~18 nJ
  Row-buffer Miss  ~225 cycles      ~38 nJ

• Diminishing locality in multi-cores.

Slide 6: DRAM Row-buffer Hit-rates
[Figure: row-buffer hit-rates vs. core count]
With increasing core counts, DRAM row-buffer hit-rates fall.

Slide 7: Key Observation
[Figure: cache block access pattern within OS pages]
For heavily accessed pages in a given time interval, accesses usually touch only a few cache blocks.

Slide 8: Outline
• DRAM Basics.
• Motivation.
• Basic Idea.
• Software-Only Implementation (ROPS).
• Hardware Implementation (HAM).
• Results.

Slide 9: Basic Idea
Gather all the heavily accessed chunks of independent OS pages and map them to the same DRAM row, in a reserved DRAM region.
[Figure: 4 KB OS pages split into 1 KB micro-pages; the hottest micro-pages are placed in the reserved region of DRAM memory, the coldest micro-pages stay in place]

Slide 10: Basic Idea
• Identifying "hot" micro-pages.
  • Memory-controller counters and an OS daemon.
• Reserved rows in DRAM for hot micro-pages.
  • Simplifies book-keeping overheads.
  • 4 MB capacity loss from a 4 GB system (< 0.1%).
• Epoch-based schemes.
  • Expose the epoch length to the OS for flexibility.

Slide 11: Software-Only Implementation (ROPS)
Reduced OS Page size (ROPS)
[Figure: in the baseline, virtual address X is translated by the TLB to physical address Y in the 4 GB main memory; with ROPS, hot micro-pages translate to physical address Z in the 4 MB reserved DRAM region, while cold micro-pages keep physical address Y]
• Shrink the OS page size to 1 KB.
• Every epoch (a sketch of this per-epoch flow follows this slide):
  1. Migrate hot micro-pages.
     • TLB shoot-down and page-table update.
  2. Promote cold micro-pages to a superpage.
     • Page table/TLB updated.
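Below is a minimal sketch, in C, of the epoch mechanism described on the "Basic Idea" and ROPS slides: per-micro-page access counters maintained at the memory controller, and an end-of-epoch routine run by the OS daemon that picks the hottest 1 KB micro-pages for the 4 MB reserved region. All names, constants, and data structures here (count_access, end_of_epoch, migrate_micro_page, the linear counter table) are illustrative assumptions, not the authors' implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MICRO_PAGE_SIZE 1024u                    /* 1 KB micro-pages (slide 9)       */
#define RESERVED_BYTES  (4u << 20)               /* 4 MB reserved DRAM region        */
#define RESERVED_SLOTS  (RESERVED_BYTES / MICRO_PAGE_SIZE)

struct counter {                                 /* one entry per touched micro-page */
    uint64_t micro_page;                         /* micro-page number (addr / 1 KB)  */
    uint64_t accesses;                           /* accesses seen in this epoch      */
};

/* Hypothetical stand-in for the actual migration: the DRAM copy plus the
 * page-table/TLB update (ROPS) or mapping-table update (HAM). */
static void migrate_micro_page(uint64_t micro_page)
{
    (void)micro_page;
}

/* Called on every DRAM access by the memory-controller model. */
void count_access(struct counter *tab, size_t n, uint64_t phys_addr)
{
    uint64_t mp = phys_addr / MICRO_PAGE_SIZE;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].micro_page == mp) {
            tab[i].accesses++;
            return;
        }
    }
    /* Table insertion and overflow handling are omitted in this sketch. */
}

/* Sort helper: hottest micro-pages first. */
static int hotter(const void *a, const void *b)
{
    const struct counter *x = a, *y = b;
    return (y->accesses > x->accesses) - (y->accesses < x->accesses);
}

/* Run at the end of each epoch: migrate the hottest micro-pages into the
 * reserved rows, then reset the counters for the next epoch. */
void end_of_epoch(struct counter *tab, size_t n)
{
    qsort(tab, n, sizeof *tab, hotter);
    size_t hot = (n < RESERVED_SLOTS) ? n : RESERVED_SLOTS;
    for (size_t i = 0; i < hot; i++)
        migrate_micro_page(tab[i].micro_page);
    memset(tab, 0, n * sizeof *tab);
}
```

In this sketch "hot" simply means the top-N micro-pages by access count; the slides only say that memory-controller counters and an OS daemon identify hot micro-pages, so any ranking or thresholding policy could be substituted.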
Slide 12: Software-Only Implementation (ROPS)
• Reduced OS Page Size (ROPS).
  • Throughout the system, reduce the page size to 1 KB.
• Migrate hot micro-pages via DRAM-copy.
  • Hot micro-pages live in the same row-buffer in the reserved DRAM region.
• Mitigate the reduction in TLB reach by promoting cold micro-pages to 4 KB superpages.
  • Superpage creation is facilitated by "reservation-based" page allocation.
  • Allocate four 1 KB micro-pages to contiguous DRAM frames.
  • This lets contiguous virtual addresses be placed in contiguous physical addresses, which makes superpage creation easy.

Slide 13: Hardware Implementation (HAM)
Hardware Assisted Migration (HAM)
[Figure: in the baseline, a memory request to physical address X goes directly to page A in the 4 GB main memory; with HAM, a Mapping Table translates old address X to new address Y in the 4 MB reserved DRAM region]

Slide 14: Hardware Implementation (HAM)
• Hardware Assisted Migration (HAM).
• A new level of address indirection: place data wherever you want in the DRAM.
• Maintain a Mapping Table (MT): it preserves the old physical addresses of migrated micro-pages (a sketch of the lookup appears at the end of the deck).
• DRAM-copy hot micro-pages to the reserved rows.
• Populate/update the MT every epoch.

Slide 15: Results: Schemes Evaluated and Simulation Parameters
Schemes evaluated:
• Baseline.
• Oracle/Profiled: best-effort estimate of the expected benefit in the next epoch, based on a prior profile run.
• Epoch-based ROPS and HAM.
Epoch lengths evaluated: 5M, 10M, 50M, and 100M cycles. Trends are similar; performance is best with 5M and 10M.

Simulation parameters:
• Simics simulation platform.
• DRAMSim-based DRAM timing.
• DRAM timing and energy figures from Micron datasheets.

  CPU                          4-core out-of-order CMP, 2 GHz
  L1 Inst. and Data Cache      Private, 32 KB/2-way, 1-cycle access
  L2 Unified Cache             Shared, 128 KB/8-way, 10-cycle access
  Total DRAM Capacity          4 GB
  DIMM Configuration           8 DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
  Active Row-Buffers per DIMM  4
  DIMM-Level Row-Buffer Size   8 KB

Slide 16: Results: Accesses to Micro-Pages in Reserved Rows in an Epoch
[Figure: percentage of total accesses that go to micro-pages in the reserved rows, and total number of 4 KB pages touched in an epoch]

Slide 17: Results: 5M-cycle Epoch; ROPS, HAM, and ORACLE
[Figure: percent change in performance]
• Applications with room for improvement show an average performance improvement of 9%.
• Hardware-assisted migration offers better returns due to lower TLB-management overheads.
• Apart from performance gains, our schemes also save energy at the same time!

Slide 18: Results: ROPS, HAM, and ORACLE
[Figure: energy consumption of the DRAM sub-system, shown as % reduction in DRAM energy]

Slide 19: Conclusions
• On average, for applications with room for improvement and with our best performing scheme:
  • Average performance ↑ 9% (max. 18%).
  • Average memory energy consumption ↓ 18% (max. 62%).
  • Average row-buffer utilization ↑ 38%.
• Hardware-assisted migration offers better returns due to fewer overheads from TLB shoot-downs and misses.
• Future work: co-locate hot micro-pages that are accessed around the same time.

Slide 20: That's all for today … Questions?
http://www.cs.utah.edu/arch-research
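For reference, here is a minimal sketch, in C, of the Mapping Table indirection described on the HAM slides: before a request is issued to DRAM, the memory controller checks whether its micro-page has been migrated and, if so, redirects it into the 4 MB reserved region. The table layout, sizes, and function names (mt_install, mt_translate) are illustrative assumptions; the slides do not specify the MT organization.

```c
#include <stdint.h>
#include <stdbool.h>

#define MICRO_PAGE_SIZE 1024u                    /* 1 KB micro-pages             */
#define MT_ENTRIES      4096u                    /* 4 MB reserved / 1 KB         */

struct mt_entry {
    bool     valid;
    uint64_t old_micro_page;                     /* original physical micro-page */
    uint64_t new_micro_page;                     /* slot in the reserved rows    */
};

static struct mt_entry mapping_table[MT_ENTRIES];

/* Populated/updated once per epoch when a hot micro-page is DRAM-copied
 * into the reserved region. */
void mt_install(unsigned slot, uint64_t old_mp, uint64_t new_mp)
{
    mapping_table[slot] = (struct mt_entry){ true, old_mp, new_mp };
}

/* Consulted on every memory request: a migrated micro-page is redirected,
 * anything else passes through unchanged. */
uint64_t mt_translate(uint64_t phys_addr)
{
    uint64_t mp     = phys_addr / MICRO_PAGE_SIZE;
    uint64_t offset = phys_addr % MICRO_PAGE_SIZE;
    for (unsigned i = 0; i < MT_ENTRIES; i++) {
        if (mapping_table[i].valid && mapping_table[i].old_micro_page == mp)
            return mapping_table[i].new_micro_page * MICRO_PAGE_SIZE + offset;
    }
    return phys_addr;
}
```

The linear scan is only for readability; a hardware MT of this size would presumably be an indexed or associative structure so the lookup stays off the critical path of every access.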