BEAR: MITIGATING BANDWIDTH BLOAT IN GIGASCALE DRAM CACHES
ISCA 2015, Portland, OR, June 15, 2015
Chiachen Chou, Georgia Tech; Aamer Jaleel, NVIDIA*; Moinuddin K. Qureshi, Georgia Tech

3D DRAM HELPS MITIGATE THE BANDWIDTH WALL
• 3D DRAM standards: Hybrid Memory Cube (HMC), High Bandwidth Memory (HBM)
• Products: Intel Xeon Phi, NVIDIA Pascal
Stacked DRAM provides 4-8X bandwidth, but has limited capacity
(Images courtesy of Micron, JEDEC, Intel, NVIDIA)

3D DRAM IS USED AS A CACHE (DRAM CACHE)
[Figure: memory hierarchy from fast to slow — per-core L1$ and L2$, shared L3$, 3D DRAM cache, off-chip DRAM]
• A 1GB 3D DRAM cache holds 16M cache lines; at 4B per tag, that is 64MB of tag storage
• The DRAM cache (DRAM$) therefore stores its tags in the 3D DRAM itself, for scalability

CAN THE DRAM CACHE PROVIDE 4X BANDWIDTH?
[Figure: CPU attached to the DRAM cache (tags + data, 4X bandwidth) and to memory (1X bandwidth)]
• Hit: good use of bandwidth ✔
• Secondary operations waste bandwidth ✘: Miss Detection, Miss Fill, Writeback Detection, Writeback Fill
The DRAM$ does not utilize its full bandwidth

AGENDA
• Introduction
• Background
  – DRAM Cache Designs
  – Secondary Operations
  – Bloat Factor
• BEAR
• Results
• Summary

DRAM CACHE HAS A NARROW BUS
[Figure: Alloy Cache — a 2KB row buffer holding 8B-tag + 64B-data pairs, accessed by the CPU over 16-byte buses]
The DRAM$ accesses tag and data via a narrow bus [Qureshi and Loh, MICRO'12]

A CACHE REQUIRES MAINTENANCE OPERATIONS
• Hit on Line X: a useful hit (HIT)
• Miss on Line X: Miss Detection in the DRAM$, then a Miss Fill from memory
• Dirty Line Y evicted from the L3$: Writeback Detection, then a Writeback Fill
• Secondary operations: Miss Detection (MD), Miss Fill (MF), WB Detection (WD), WB Fill (WF)
DRAM$ bandwidth is used for secondary operations

QUANTIFYING THE BANDWIDTH USAGE

    Bloat Factor = (Σ Total Bytes Transferred) / (Σ Useful Bytes Transferred)

Example: if the bus carries HIT, WF, WD, HIT, MF, MD, HIT, the Bloat Factor is 7/3 ≈ 2.3; an ideal cache transfers only the three HITs, for a Bloat Factor of 3/3 = 1.
The Bloat Factor indicates the bandwidth inefficiency

BLOAT FACTOR BREAKDOWN
• Configuration: 8 cores, 8MB shared L3$, 1GB DRAM$, 16GB memory
• SPEC2006: 16 rate and 38 mix workloads
[Figure: Bloat Factor breakdown — Baseline: Hit (Tag+Data) 1.25, Miss Detection 0.7, Miss Fill 0.7, WB Detection 0.6, WB Fill 0.6; Ideal: hit data only]
The Baseline has a Bloat Factor of 3.8
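The Bloat Factor is simple to compute from bus-transfer counts; a minimal sketch in Python, assuming (for illustration, as in the bus-trace example) that every transfer moves the same number of bytes:

```python
# Bloat Factor = total bytes transferred / useful bytes transferred.
# The trace below mirrors the slide's example: three hits plus one
# each of WF, WD, MF, MD, with equal per-transfer byte costs assumed.

def bloat_factor(transfers):
    """transfers: dict mapping operation name -> number of bus transfers."""
    useful = transfers["HIT"]            # only hit transfers are useful bytes
    total = sum(transfers.values())      # hits plus all secondary operations
    return total / useful

trace = {"HIT": 3, "WF": 1, "WD": 1, "MF": 1, "MD": 1}
print(round(bloat_factor(trace), 2))    # 7 transfers / 3 useful -> 2.33
```

An ideal cache, which transfers only hit data, would give `bloat_factor({"HIT": 3})` = 1.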
POTENTIAL PERFORMANCE IMPROVEMENT OF 22%
• Configuration: 8 cores, 8MB shared L3$, 1GB DRAM$, 16GB memory
• SPEC2006: 16 rate and 38 mix workloads
[Figure: the Ideal design (no secondary operations) reduces hit latency (in cycles) versus the Baseline and achieves a 1.22X speedup]
Reducing the Bloat Factor improves performance

NOT ALL OPERATIONS ARE CREATED EQUAL
Opportunities to remove secondary operations:
1. Operations that exist to improve cache performance (e.g., inserting data returned from memory)
2. Operations that exist to ensure correctness (e.g., checking whether a line exists in the DRAM$)
We propose BEAR to exploit these opportunities

AGENDA
• Introduction
• Background
• BEAR: Bandwidth-Efficient ARchitecture
  1. Bandwidth-Efficient Miss Fill
  2. Bandwidth-Efficient Writeback Detection
  3. Bandwidth-Efficient Miss Detection
• Results
• Summary

BANDWIDTH-EFFICIENT MISS FILL
• When Line X returns from memory, insert it into the DRAM$ with some probability and throw it away (bypass) otherwise
• With a 90% bypass probability, hit latency drops by roughly 10% and performance improves by about 10%, but the hit rate degrades by around 5% on average, with larger losses on workloads such as mcf, lbm, soplex, milc, and several mixes
How can we enable bypass without hit-rate degradation?

BAB LIMITS THE HIT RATE LOSS
Bandwidth-Aware Bypass (BAB):
• Dedicate a few sample sets to No Bypass (hit rate X) and a few to Probabilistic Bypass at 90% (hit rate Y)
• If X - Y < Δ, the hit-rate loss is small, so the remaining sets use Probabilistic Bypass; otherwise they use No Bypass
Use Probabilistic Bypass when the hit-rate loss is small

BAB IMPROVES PERFORMANCE BY 5%
[Figure: BAB cuts the Miss Fill component of the Bloat Factor from 0.7 to 0.1 and achieves a 1.05X speedup over the Baseline]
• Hit rate: Alloy 64%, BAB 62%
BAB trades a small hit-rate loss for a 5% improvement

WHAT IS A WRITEBACK DETECTION?
• When the L3$ evicts dirty Line Ynew, the DRAM$ must first check whether Line Yold exists (WB Detection) before filling the new data (WB Fill)
How can we remove Writeback Detection?

DRAM CACHE PRESENCE FOR WB DETECTION
• Add a DRAM Cache Presence (DCP) bit to each L3$ line, next to the valid (V) and dirty (D) bits, recording whether the line is present in the DRAM$
• On eviction of dirty Line Ynew: if the DCP bit is True, only a WB Fill is needed; if it is False, a WB Detection plus a WB Fill are required
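The DCP check amounts to one extra bit of metadata per L3$ line; a minimal sketch of the eviction decision, with class and field names that are illustrative rather than taken from the paper:

```python
# Sketch of DRAM Cache Presence (DCP): each L3 line carries one extra
# bit recording whether the line is known to be present in the DRAM
# cache. On eviction of a dirty line, a set DCP bit lets the controller
# skip the Writeback Detection probe and issue only the Writeback Fill.

class L3Line:
    def __init__(self, tag, dirty=False, dcp=False):
        self.tag = tag
        self.dirty = dirty
        self.dcp = dcp          # True -> line known present in DRAM cache

def writeback_ops(line):
    """Return the DRAM-cache operations needed when `line` is evicted."""
    if not line.dirty:
        return []                           # clean line: nothing to write back
    if line.dcp:
        return ["WB Fill"]                  # presence known: skip the probe
    return ["WB Detection", "WB Fill"]      # must probe the DRAM$ tags first

print(writeback_ops(L3Line(0x42, dirty=True, dcp=True)))   # ['WB Fill']
```

With the DCP bit set, the eviction costs one fill instead of a probe plus a fill, which is exactly the bandwidth the WB Detection bar represents.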
DRAM Cache Presence reduces WB Detection

DCP IMPROVES PERFORMANCE BY 4%
[Figure: DCP cuts the WB Detection component of the Bloat Factor from 0.6 to 0.1; BAB+DCP achieves a 1.1X speedup, versus 1.05X for BAB alone]
DCP provides a 4% improvement on top of BAB

WHAT IS A MISS DETECTION?
• When Line X misses in the L3$, the DRAM$ (tags + data) must be probed to check whether Line X exists (Miss Detection)
Can we detect a miss without using bandwidth?

THE NEIGHBOR'S TAG COMES FREE WITH A DEMAND ACCESS
• A demand access to address X reads Tag+Data+Tag (8+64+8 = 80 bytes) from the 2KB DRAM row buffer, so the neighboring line's tag arrives for free
• The Neighboring Tag Cache (NTC) stores these free tags
The Neighboring Tag Cache saves Miss Detection

NTC SHOWS A 2% PERFORMANCE IMPROVEMENT
[Figure: NTC cuts the Miss Detection component of the Bloat Factor from 0.7 to 0.5 and the Hit component from 1.25 to 1.15; BAB+DCP+NTC (i.e., BEAR) improves the speedup beyond BAB+DCP's 1.1X]
NTC improves performance by an additional 2%

METHODOLOGY
• Core chips: 8 cores at 3.2 GHz, 2-wide OOO; 8MB 16-way shared L3 cache

              DRAM Cache                Off-chip DRAM
  Capacity    1GB                       16GB
  Bus         DDR 3.2GHz, 128-bit       DDR 1.6GHz, 64-bit
  Channels    4 channels, 16 banks/ch   2 channels, 8 banks/ch

• Baseline: Alloy Cache [MICRO'12]
• Workloads: SPEC2006 (16 memory-intensive apps), run as 16 rate and 38 mix workloads

BEAR REDUCES THE BLOAT FACTOR BY 32%
[Figure: across all 54 workloads, BEAR reduces the Bloat Factor from the Baseline's 3.8 toward the Ideal, and improves the speedup accordingly]
BEAR improves performance by 11%

BW BLOAT IN TAGS-IN-SRAM DESIGNS
• Tags-In-SRAM (TIS) designs pay (1) a 64MB SRAM storage overhead and (2) extra access latency
[Figure: Bloat Factor for Alloy, BEAR, and TIS (64MB) — TIS avoids tag transfers but still pays Miss Fill and WB Fill bandwidth]
Tags-in-SRAM designs also have a bandwidth-bloat problem

TAGS-IN-SRAM PERFORMS SIMILARLY TO BEAR
[Figure: speedups for Alloy, BEAR, and TIS (64MB) are comparable]
BEAR can be applied to reduce bandwidth bloat in Tags-in-SRAM DRAM$ designs
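The Neighboring Tag Cache idea from the miss-detection discussion above can be sketched as follows; the structure, indexing scheme, and names are assumptions for illustration, not the paper's exact design:

```python
# Sketch of the Neighboring Tag Cache (NTC). A demand access streams
# tag+data+tag (80 bytes) from the row buffer, so the adjacent line's
# tag arrives for free; caching it lets a later miss to that neighbor
# be detected without spending DRAM-cache bandwidth on a tag probe.

class NeighboringTagCache:
    def __init__(self):
        self.tags = {}   # neighbor set-index -> tag last seen in its entry

    def record_demand(self, neighbor_index, neighbor_tag):
        """Capture the free neighbor tag streamed with a demand access."""
        self.tags[neighbor_index] = neighbor_tag

    def known_miss(self, index, tag):
        """True if the NTC proves `tag` is absent, so Miss Detection is skipped."""
        return index in self.tags and self.tags[index] != tag

ntc = NeighboringTagCache()
ntc.record_demand(neighbor_index=13, neighbor_tag=0xAB)
print(ntc.known_miss(13, 0xCD))  # True: that entry holds a different tag
print(ntc.known_miss(13, 0xAB))  # False: tag matches, not a known miss
```

A known miss goes straight to memory; an unknown case still performs the normal Miss Detection, so the NTC only ever removes work.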
SUMMARY
• 3D DRAM used as a cache mitigates the memory bandwidth wall
• In DRAM caches, secondary operations slow down the delivery of critical data
• We propose BEAR, which targets three sources of bandwidth bloat in the DRAM cache:
  1. Bandwidth-Efficient Miss Fill
  2. Bandwidth-Efficient Writeback Detection
  3. Bandwidth-Efficient Miss Detection
• Overall, BEAR reduces bandwidth bloat by 32% and improves performance by 11%

THANK YOU
Computer Architecture and Emerging Technologies Lab, Georgia Tech

Backup Slides

THE OVERHEAD OF BEAR IS SMALL

  Design                    Cost                      Total
  Bandwidth-Aware Bypass    8 bytes per thread        64 bytes
  DRAM Cache Presence       One bit per line in LLC   16K bytes
  Neighboring Tag Register  44 bytes per bank         3.2K bytes
  Total Cost                                          19.2K bytes

Overall, BEAR incurs a hardware overhead of 19.2KB

COMPARISON TO OTHER DRAM$ DESIGNS
[Figure: speedup w.r.t. no L4 cache for the Tags-in-DRAM designs LH-Cache, Alloy, and Incl-Alloy versus BEAR, over RATE, MIX, and ALL workloads — BEAR outperforms Alloy by 11% and LH-Cache by a larger margin (28%)]
BEAR outperforms other DRAM$ designs
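The storage-overhead table in the backup slides can be sanity-checked with quick arithmetic; the system parameters below are taken from the methodology slide, and the Neighboring Tag Register total is taken as given from the table:

```python
# Back-of-the-envelope check of BEAR's storage overhead.
# Assumed parameters: 8 threads (one per core) and an 8MB L3 with
# 64-byte lines, per the methodology slide.

THREADS = 8
L3_BYTES = 8 * 1024 * 1024
LINE_BYTES = 64

bab_bytes = 8 * THREADS                    # 8 bytes of BAB state per thread
dcp_bytes = (L3_BYTES // LINE_BYTES) // 8  # one DCP presence bit per L3 line
ntr_bytes = int(3.2 * 1024)                # Neighboring Tag Register, slide total

print(bab_bytes, dcp_bytes, ntr_bytes)     # 64 16384 3276
total_kb = (bab_bytes + dcp_bytes + ntr_bytes) / 1024
print(total_kb)                            # ~19.26, matching the 19.2KB slide total
```

The DCP bit dominates: 128K L3 lines at one bit each is 16KB of the 19.2KB total.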