Prefetch-Aware DRAM Controllers

Chang Joo Lee, Onur Mutlu*, Veynu Narasiman, Yale N. Patt
Electrical and Computer Engineering, The University of Texas at Austin
*Microsoft Research and Carnegie Mellon University

Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion

Modern DRAM Systems
- A DRAM bank consists of rows and columns of DRAM cells, with a row buffer in each bank
- Non-uniform access latency:
  - Row-hit: the data is already in the row buffer
  - Row-conflict: the data is not in the row buffer, so the DRAM cells must be accessed
  - Row-hit latency < row-conflict latency
- Prioritizing row-hit accesses increases DRAM throughput [Rixner et al., ISCA 2000]

Problems of Prefetch Handling
- How should prefetches be scheduled relative to demand requests?
  - Demand-first: always prioritizes demand requests over prefetch requests
  - Demand-prefetch-equal: always treats them the same
- Neither policy performs best, because neither takes into account both:
  1. The non-uniform access latency of DRAM systems
  2. The usefulness of prefetches

When Prefetches are Useful
- Example: the processor needs Y, X, and Z; the demand for Y maps to row B, while the prefetches for X and Z map to row A (the currently open row)
- Demand-first serves Y, X, Z: 2 row-conflicts and 1 row-hit
- Demand-pref-equal serves X, Z, Y: 2 row-hits and 1 row-conflict, saving cycles
- When prefetches are useful, demand-pref-equal outperforms demand-first

When Prefetches are Useless
- Example: the processor needs ONLY Y; the prefetches for X and Z are useless
- Demand-first serves Y immediately, saving cycles
- When prefetches are useless, demand-first outperforms demand-pref-equal
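The effect in the two examples above can be reproduced with a toy model. This is a sketch only: the row-hit/row-conflict latencies, the request encoding, and the scheduler function are simplified assumptions for illustration, not the paper's simulator.

```python
# Toy model of the slides' example: one DRAM bank with one open row.
# Assumed illustrative latencies: row-hit = 1 cycle, row-conflict = 3 cycles.
HIT, CONFLICT = 1, 3

def pick_next(queue, open_row, demand_first):
    """Index of the next request under FR-FCFS, optionally demands-first."""
    def key(item):
        age, (row, is_demand) = item
        return (
            (0 if is_demand else 1) if demand_first else 0,  # demands first?
            0 if row == open_row else 1,                     # then row-hits
            age,                                             # then oldest
        )
    return min(enumerate(queue), key=key)[0]

def total_cycles(requests, open_row, demand_first):
    """Serve every request; return the total DRAM service cycles."""
    queue, cycles = list(requests), 0
    while queue:
        row, _ = queue.pop(pick_next(queue, open_row, demand_first))
        cycles += HIT if row == open_row else CONFLICT
        open_row = row
    return cycles

# Processor needs Y (demand, row B); prefetches X and Z target row A (open).
reqs = [("A", False), ("B", True), ("A", False)]  # X, Y, Z in arrival order

print(total_cycles(reqs, "A", demand_first=True))   # Y, X, Z -> 7 cycles
print(total_cycles(reqs, "A", demand_first=False))  # X, Z, Y -> 5 cycles
```

With these assumed latencies, demand-first pays 2 row-conflicts and 1 row-hit (7 cycles), while demand-pref-equal pays 2 row-hits and 1 row-conflict (5 cycles), matching the slide's useful-prefetch example.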
Demand-first vs. Demand-pref-equal
- IPC normalized to no prefetching, stream prefetcher enabled
- [Chart: per-benchmark IPC for the two policies across benchmarks including leslie3d, libquantum, swim, bwaves, milc, art, galgel, and ammp; demand-pref-equal is better for some benchmarks, demand-first for others]
- Useless prefetches waste off-chip bandwidth and queue resources, and cause cache pollution
- Goal 1: Adaptively schedule prefetches based on prefetch usefulness
- Goal 2: Eliminate useless prefetches

Goals
1. Maximize the benefits of prefetching: increase DRAM throughput by adaptively scheduling requests based on prefetch usefulness → increase the timeliness of useful prefetches
2. Minimize the harm of prefetching: adaptively delay the service of useless prefetches and remove them → increase the efficiency of resource utilization
Together, these achieve higher performance and efficiency

Mechanism

Prefetch-Aware DRAM Controllers (PADC)
- Adaptive Prefetch Scheduling (APS): prioritizes prefetch and demand requests based on prefetch accuracy estimation
- Adaptive Prefetch Dropping (APD): cancels likely-useless prefetches from the memory request buffer based on prefetch accuracy
- Both operate on the memory request buffer and are driven by a prefetch accuracy estimate from each core

Prefetch Accuracy Estimation
- Prefetch accuracy = #prefetches used / #prefetches sent
- Hardware support:
  - A prefetch bit per L2 cache line and per MSHR entry, indicating whether the request is a prefetch or a demand
  - A prefetch-sent counter, a prefetch-used counter, and a prefetch accuracy register per core
- Accuracy is estimated every 100K cycles

Adaptive Prefetch Scheduling (APS)
- Adaptively changes the priority of prefetch requests: low prefetch accuracy → prioritize demands from that core; high prefetch accuracy → treat demands and prefetches equally
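The per-core accuracy bookkeeping that drives this decision can be sketched as follows. The counter set and the 100K-cycle interval come from the slides; the class structure and all names are assumptions.

```python
class PrefetchAccuracyEstimator:
    """Per-core prefetch accuracy, re-estimated every interval (100K cycles)."""
    INTERVAL = 100_000

    def __init__(self):
        self.sent = 0        # prefetch-sent counter
        self.used = 0        # prefetch-used counter
        self.accuracy = 0.0  # prefetch accuracy register

    def on_prefetch_sent(self):
        self.sent += 1

    def on_prefetch_used(self):
        # Called when a demand hits a cache line whose prefetch bit is set.
        self.used += 1

    def on_cycle(self, cycle):
        # At each interval boundary, latch used/sent and reset the counters.
        if cycle % self.INTERVAL == 0 and self.sent:
            self.accuracy = self.used / self.sent
            self.sent = self.used = 0

est = PrefetchAccuracyEstimator()
for _ in range(10):
    est.on_prefetch_sent()
for _ in range(8):
    est.on_prefetch_used()
est.on_cycle(100_000)
print(est.accuracy)  # 0.8
```

Resetting the counters each interval is one plausible choice; the hardware could equally keep decaying averages. The accuracy register is what APS and APD consult below.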
- In a CMP system, APS additionally prioritizes demand requests from a core that has many useless prefetches, to avoid starving demand requests from cores with low prefetch accuracy → improves system performance

Adaptive Prefetch Scheduling (APS): request classes
- Critical requests: all demand requests, plus prefetch requests from cores whose prefetch accuracy ≥ the promotion threshold
- Urgent requests: demand requests from cores whose prefetch accuracy < the promotion threshold

Adaptive Prefetch Scheduling (APS): prioritization
- Each memory request buffer entry carries the priority fields C, RH, U, FCFS
- Prioritization order:
  1. Critical request (C)
  2. Row-hit request (RH)
  3. Urgent request (U)
  4. Oldest request (FCFS)

Adaptive Prefetch Dropping (APD)
- Proactively drops old prefetches from the memory request buffer based on the prefetch accuracy estimate
- Old prefetch requests are likely useless, because APS already prioritizes demand requests when prefetch accuracy is low
- A prefetch that is hit by a demand request is promoted to a demand
- Dropping old, useless prefetches saves resources (bandwidth, queue entries, cache space), which useful requests can then use

Adaptive Prefetch Dropping (APD): dropping rule
- Each memory request buffer entry carries drop information: a prefetch bit (P), a core ID field (ID), and an age field (AGE)
- A prefetch request is dropped when its AGE exceeds the drop threshold
- The drop threshold is determined dynamically from the prefetch accuracy estimate: lower accuracy → lower threshold

Hardware Cost for a 4-core CMP
  Component                      Cost (bits)
  Prefetch accuracy estimation        33,056
  APS                                    128
  APD                                  1,536
  Total                               34,720
- Total storage: 34,720 bits (~4.25KB); ~4KB of this is the prefetch bits in the cache lines
- If prefetch bits are already implemented, only ~228B of extra storage is needed
- The logic is not on the critical path; scheduling and dropping decisions are made every DRAM bus cycle

Experimental Evaluation

Simulation Methodology
- x86 cycle-accurate simulator
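The APS prioritization order and the APD dropping rule described above can be sketched together. The priority order (C, RH, U, FCFS), the 85% promotion threshold, and the drop-threshold bands come from the slides; the data structures and function names are assumptions.

```python
# Sketch of PADC's two per-entry decisions in the memory request buffer.
PROMOTION_THRESHOLD = 0.85  # from the PADC configuration

def aps_priority(req, accuracy, open_row):
    """APS priority tuple: critical, row-hit, urgent, then oldest (FCFS).
    Lower tuples are served first."""
    acc = accuracy[req["core"]]
    critical = req["is_demand"] or acc >= PROMOTION_THRESHOLD
    row_hit = req["row"] == open_row
    urgent = req["is_demand"] and acc < PROMOTION_THRESHOLD
    return (0 if critical else 1,
            0 if row_hit else 1,
            0 if urgent else 1,
            req["arrival"])  # earlier arrival = older = higher priority

def apd_drop_threshold(acc):
    """Dynamic APD drop threshold in core cycles: lower accuracy, lower threshold."""
    if acc < 0.10:
        return 100
    if acc < 0.30:
        return 1_500
    if acc < 0.70:
        return 50_000
    return 100_000

def apd_should_drop(req, accuracy):
    """Drop a prefetch whose AGE exceeds the accuracy-based drop threshold."""
    return (not req["is_demand"]) and \
        req["age"] > apd_drop_threshold(accuracy[req["core"]])

# Example: core 0 has an accurate prefetcher, core 1 an inaccurate one.
acc = {0: 0.90, 1: 0.05}
pref0 = {"core": 0, "is_demand": False, "row": 7, "age": 2_000, "arrival": 10}
pref1 = {"core": 1, "is_demand": False, "row": 7, "age": 2_000, "arrival": 11}
dem1 = {"core": 1, "is_demand": True, "row": 3, "age": 50, "arrival": 12}

print(apd_should_drop(pref0, acc))  # False: threshold is 100,000 cycles
print(apd_should_drop(pref1, acc))  # True: threshold is only 100 cycles

best = min([pref1, dem1, pref0], key=lambda r: aps_priority(r, acc, open_row=7))
# best is pref0: critical (accurate core) and a row-hit, so it beats the
# row-conflict demand from core 1
```

Note how the same accuracy estimate drives both decisions: core 0's prefetch is treated like a demand and scheduled as a row-hit, while core 1's equally old prefetch is dropped.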
- Baseline processor configuration:
  - Per core: 4-wide issue, out-of-order, 256-entry ROB; stream prefetcher (lookahead, prefetch degree 4, prefetch distance 64)
  - Shared: 512KB, 8-way unified L2 cache (1MB for the single-core processor); on-chip, demand-first FR-FCFS memory controller; 64/128/256 L2 MSHRs and memory request buffer entries for the 1-, 4-, and 8-core systems; DDR3-1333, 15-15-15ns, 4KB row buffer
- PADC configuration:
  - Promotion threshold: 85%
  - Drop threshold (core cycles), set by estimated prefetch accuracy:
      Prefetch accuracy (%)    0-10    10-30    30-70    70-100
      Threshold (core cycles)   100    1,500   50,000   100,000

Workloads for Evaluation
- Single-core processor: all 55 SPEC 2000/2006 benchmarks (single-threaded); 38 are prefetch-sensitive, 17 prefetch-insensitive
- CMP: multiprogrammed workloads randomly chosen from the 55 benchmarks: 32 workloads for the 4-core CMP, 21 for the 8-core CMP

Performance of PADC
- [Chart: IPC normalized to demand-first for no-pref, demand-first, demand-pref-equal, and PADC. PADC improves average performance over demand-first by 4.3% (single-core), 8.2% (4-core CMP), and 9.9% (8-core CMP)]

Bus Traffic of PADC
- [Chart: bus traffic in millions of cache lines. PADC reduces average bus traffic by 10.4% (single-core), 10.7% (4-core CMP), and 9.4% (8-core CMP)]

Performance with Other Prefetchers (4-core CMP)
- [Chart: IPC normalized to no prefetching with stride, Markov, and GHB prefetchers. PADC improves average performance over demand-first for all three (by 6.0%, 6.6%, and 2.2%)]

Bus Traffic with Other Prefetchers (4-core CMP)
- [Chart: bus traffic with stride, Markov, and GHB prefetchers; PADC reduces average bus traffic for all three (by 5.7%, 6.8%, and 10.3%)]

Conclusion

Conclusions
- Prefetch-Aware DRAM Controllers (PADC) combine Adaptive Prefetch Scheduling and Adaptive Prefetch Dropping
- APS increases DRAM throughput by exploiting row-buffer locality when prefetches are useful, and delays the service of prefetches when they are useless
- APD removes useless prefetches effectively while keeping the benefits of useful prefetches
- PADC improves performance and bandwidth efficiency for both single-core and CMP systems
- Low cost and easily implementable

Questions?

Performance Detail
- Single-core, 38 prefetch-sensitive benchmarks: 6.2% average improvement (29 prefetch-friendly, 9 prefetch-unfriendly)
- 17 of the 38 are memory-intensive (MPKI > 10): 11.8% average improvement
- 17 prefetch-insensitive benchmarks

Two-Channel Memory Performance
- [Chart: with two memory channels, PADC improves average performance over demand-first by 5.9% (4-core CMP) and 5.5% (8-core CMP); a single-channel demand-first baseline (1ch-demand-first, the 16% and 31% annotations) is shown for reference]

Two-Channel Memory Bus Traffic
- [Chart: PADC reduces average bus traffic by 12.9% (4-core CMP) and 13.2% (8-core CMP)]

Comparison with Feedback-Directed Prefetching (4-core CMP)
- [Charts: performance and bus traffic for demand-first, fdp-demand-first, apd-demand-first, fdp-demand-pref-equal, fdp-aps, and PADC (aps-apd); PADC performs best, 6.4% above demand-first on average]

Performance on Single-Core
- [Chart: per-benchmark IPC normalized to demand-first for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC), over benchmarks including soplex, omnetpp, leslie3d, libquantum, swim, bwaves, milc, art, galgel, and ammp, plus gmean]

Prefetch-Friendly Application: libquantum
- [Charts: performance (normalized IPC to demand-first) and bus traffic (million cache lines, split into demand, useful-prefetch, and useless-prefetch traffic) for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC)]

Prefetch-Unfriendly Application: art
- [Charts: performance and bus traffic (demand / useful / useless) for the same five configurations]

Average Performance on Single-Core
- All 55 SPEC 2000/2006 CPU benchmarks
- [Charts: average performance and bus traffic for the five configurations]

System Performance on 4-Core CMP
- 32 randomly chosen 4-core workloads
- [Charts: system performance (WS and HS metrics) and average bus traffic for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC)]

System Performance on 8-Core CMP
- 21 randomly chosen 8-core workloads
- [Charts: system performance (WS and HS metrics) and average bus traffic for the five configurations]
Prefetch-Friendly Application: leslie3d
- [Charts: performance (normalized IPC to demand-first) and bus traffic (demand / useful / useless cache lines) for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC)]

Prefetch-Unfriendly Application: ammp
- [Charts: performance and bus traffic (demand / useful / useless) for the same five configurations]

Performance on 4-Core
- Workload: omnetpp, libquantum, galgel, and GemsFDTD on the 4-core CMP
- [Charts: individual speedup (relative to each application running alone) per benchmark, and system performance, for no-pref, demand-first, demand-pref-equal, APS-only, and APS-APD (PADC)]
- [Chart: per-benchmark bus traffic (demand / useful / useless) for the same workload and configurations]

System Performance on 4-Core
- Workload: omnetpp, libquantum, galgel, and GemsFDTD
- [Charts: system performance (WS and HS metrics) and bus traffic (demand / useful / useless) for the five configurations]