Meeting Midway: Improving CMP Performance with Memory-Side Prefetching
Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das, and Anand Sivasubramaniam
The Pennsylvania State University

Summary
• In modern multi-core systems, an increasing number of cores share common resources – the "Memory Wall"
• Application/core contention causes interference
• Proposal: a novel memory-side prefetching scheme
– Mitigates interference while exploiting row buffer locality
• Average 10% improvement in application performance

Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Background
[Diagram: Network-on-Chip based CMP – a 2D mesh of tiles, each with a core (C), router (R), L1, and an L2 bank; four memory controllers (MC0–MC3) sit at the chip edges and exchange request and response messages with the tiles.]
[Diagram: CPU, memory controller (MC), and DRAM banks (Bank 0, Bank 1), each bank with a row buffer. A request to the currently open row (row A) is a row buffer hit and is served directly; a request to a different row (row B) is a row buffer conflict, requiring a precharge followed by an activate before the access.]

Impact of Interference
[Chart: row buffer hit rate (%) of applications running individually vs. in a mix of 8; hit rates drop sharply under the mix.]

Latency Breakdown of L2 Miss
[Pie charts: L2-miss latency split into on-chip, off-chip queueing, and off-chip access components for high-, moderate-, and low-MPKI applications; the off-chip portion (queueing plus access) dominates in every class.]

Observations
• Memory requests from multiple cores interleave at the memory controllers
– The row buffer locality of individual applications is lost
• Off-chip latency is the dominant component of a memory access
• The on-chip network and caches are critical resources
– We cannot afford to pollute them

What about Cache Prefetching?
• Not effective for large CMPs
• Agnostic to memory state
– Gap between caches and memory (62% latency increase)
• On-chip resource pollution
– Both caches and network (22% network latency increase)
• Difficulty of stream detection in S-NUCA
– Each L2 bank caters to only a portion of the address space
– Each L2 bank gets requests from multiple L1s
• Our memory-side prefetching scheme can work alongside core-side prefetching

Memory-Side Prefetching
• Objective 1 – Reduce off-chip access latency
• Objective 2 – Without increasing on-chip resource contention
• Three design questions: what to prefetch, when to prefetch, and where to prefetch

What to Prefetch?
• Prefetch from an open row
– Minimizes overhead
• Looked at the line access patterns within a row
[Charts: % of accesses to each cache line of a row (Line 0 through Line 60) for libquantum, milc, and omnetpp.]

When to Prefetch?
[Chart: "Idle Periods" – histogram of memory idle-period lengths; x-axis: cycles (1 to 33+), y-axis: number of occurrences.]

Trigger                          Critical Path   Locality   # of Prefetches
Prefetch at RBC (row conflict)   Yes             No         High
Prefetch at RBH (row hit)        No              Yes        Low
Prefetch at Row ACT (activate)   No              No         High
Prefetch at Idle                 No              Yes        High

• Triggering on row buffer hits and on idle periods stays off the critical path while retaining locality; a minimal sketch of the hit-triggered logic follows below.
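Sketch: Prefetching on a Row Buffer Hit
A minimal C++ sketch of the hit-triggered decision described above. This is an illustration, not the paper's implementation: the names (onDemandRead, kPrefetchDepth), the depth of four lines, and the 64-line row geometry are assumptions made here for concreteness.

```cpp
// Hypothetical sketch of the memory-side prefetch trigger.
#include <cstdint>
#include <deque>
#include <iostream>

constexpr uint64_t kLineBytes     = 64;   // cache line size (assumed)
constexpr uint64_t kLinesPerRow   = 64;   // e.g., 4 KB row / 64 B lines (assumed)
constexpr int      kPrefetchDepth = 4;    // lines fetched per trigger (assumed)

struct Request { int coreId; uint64_t addr; };

struct Bank {
    bool     rowOpen = false;
    uint64_t openRow = 0;                 // row address of the open row
};

uint64_t rowOf(uint64_t addr)  { return addr / (kLineBytes * kLinesPerRow); }
uint64_t lineOf(uint64_t addr) { return (addr / kLineBytes) % kLinesPerRow; }

// On every demand read, decide whether to enqueue prefetches.
// Policy from the talk: prefetch only on a row buffer *hit*, so the extra
// column reads are cheap and never force an activate on the critical path.
void onDemandRead(Bank& bank, const Request& req,
                  std::deque<Request>& prefetchQueue) {
    bool rowHit = bank.rowOpen && bank.openRow == rowOf(req.addr);
    if (!rowHit) {                        // miss/conflict: activate the new
        bank.rowOpen = true;              // row but issue no prefetches
        bank.openRow = rowOf(req.addr);
        return;
    }
    // Row buffer hit: grab the next few lines of the already-open row.
    for (int i = 1; i <= kPrefetchDepth; ++i) {
        uint64_t line = lineOf(req.addr) + i;
        if (line >= kLinesPerRow) break;  // stay within the open row
        prefetchQueue.push_back({req.coreId, req.addr + i * kLineBytes});
    }
}

int main() {
    Bank bank;
    std::deque<Request> pq;
    onDemandRead(bank, {0, 0x1000}, pq);  // first access: opens the row
    onDemandRead(bank, {0, 0x1040}, pq);  // row hit: triggers prefetches
    for (const auto& r : pq)
        std::cout << "prefetch core " << r.coreId << " addr 0x"
                  << std::hex << r.addr << std::dec << "\n";
}
```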
Where to Prefetch?
• Prefetched lines should be stored on-chip
• Prefetch buffers in the memory controllers
– Avoids on-chip resource pollution
• Organization
– Per-core
– Shared

Memory-Side Prefetching Optimizations
• Applications vary in memory behavior
• Prefetch throttling
– Feedback-driven
• Precharge on prefetch
– The row is less likely to get another request
• Avert costly prefetches
– When demand requests are waiting

Memory-Side Prefetching: Example
[Diagram: Cores 0–3 issue interleaved request streams (Core 0: A10, A11, ...; Core 1: C32, C33, C34, C36; Core 2: R12, R13, R14, R15; Core 3: F20, F21, F22, F23). Core 0's demand for A10 hits open row A in Bank 0, so the MC prefetches A11, A12, A13, and A14 from the open row into Core 0's prefetch buffer; the later demand for A11 is served from the prefetch buffer at the MC instead of from DRAM.]

Memory-Side Prefetching: Comparison

                              Cache        Existing Memory-Side          Our Memory-Side
                              Prefetcher   Prefetchers [Lui et al.,      Prefetcher
                                           ILP '11] [Lin, HPCA '01]
Memory state aware            No           Yes                           Yes
On-chip resource pollution    Yes          Yes                           No
Accuracy                      Yes          No                            Yes

Implementation
• Prefetch buffer implementation
– Organized as n per-core prefetch buffers
– 256 KB per memory controller (<3% of LLC capacity)
– <1% area and power overhead
• Prefetch request timing
– Prefetch requests are generated internally by the memory controller, alongside a demand read that hits the row buffer

Evaluation Platform
• Cores: 32 at 2.4 GHz
• Network: 8x4 2D mesh
• Caches: 32 KB L1I; 32 KB L1D; 1 MB L2 per core
• Memory: 16 GB DDR3-1600 with 4 memory channels
• GEMS simulator with GARNET

Evaluation Methodology
• Benchmarks:
– Multi-programmed: SPEC 2006 (WL1 to WL5)
– Multi-threaded: SPEC OMP 2001 (WL6 & WL7)
• Metrics:
– Harmonic mean of IPCs
– Off-chip and on-chip latencies

IPC Improvement
[Chart: IPC improvement (%) for WL1–WL7 and AVG under CSP (core-side prefetching), MSP (memory-side prefetching), MSP-PUSH, IDLE-PUSH, and CSP+MSP; one workload peaks at 33.2%, and a 10% average improvement is highlighted.]

Latency
[Chart: memory access latency in cycles for WL1–WL7 and AVG under No Pref, CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; a 48.5% reduction in off-chip latency is highlighted.]

L2 Hit Rate
[Chart: L2 hit rate (%) for WL1–WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]

Row Buffer Hit Rate
[Chart: row buffer hit rate (%) for WL1–WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]

Conclusion
• Proposed a new memory-side prefetcher
– Opportunistic
– Has instantaneous knowledge of memory state
• Prefetching midway
– Does not pollute on-chip resources
• Reduces off-chip latency by 48.5% and improves performance by 6.2% on average
• Our technique can be combined with core-side prefetching to amplify the benefits

Thank You
• Questions?
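Backup: Prefetch Buffer Sketch
To make the "where to prefetch" design concrete, below is a minimal C++ sketch of a per-core prefetch buffer at the memory controller with simple feedback throttling. The class name, the 128-entry capacity, the warm-up window, and the 25% accuracy threshold are illustrative assumptions; the talk specifies only that each MC holds 256 KB of buffers organized per core and that throttling is feedback-driven.

```cpp
// Hypothetical per-core prefetch buffer at the memory controller.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <list>
#include <unordered_map>

constexpr size_t kEntriesPerCore = 128;  // e.g., 128 x 64 B lines (assumed)

class PrefetchBuffer {
    // FIFO of prefetched line addresses for one core (data payload omitted).
    std::list<uint64_t> fifo_;
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> index_;
public:
    size_t useful = 0, inserted = 0;     // feedback counters

    void insert(uint64_t lineAddr) {
        if (index_.count(lineAddr)) return;
        if (fifo_.size() == kEntriesPerCore) {   // evict the oldest entry
            index_.erase(fifo_.front());
            fifo_.pop_front();
        }
        fifo_.push_back(lineAddr);
        index_[lineAddr] = std::prev(fifo_.end());
        ++inserted;
    }

    // A demand request checks here before being sent to DRAM.
    bool lookup(uint64_t lineAddr) {
        auto it = index_.find(lineAddr);
        if (it == index_.end()) return false;
        fifo_.erase(it->second);         // entry is consumed on a hit
        index_.erase(it);
        ++useful;
        return true;
    }

    // Feedback throttling: shrink the prefetch depth when few prefetches
    // turn out to be useful (threshold and warm-up window are assumptions).
    int throttledDepth(int maxDepth) const {
        if (inserted < 64) return maxDepth;               // warm-up
        double acc = double(useful) / double(inserted);
        return acc < 0.25 ? 1 : maxDepth;
    }
};

int main() {
    PrefetchBuffer buf;                  // buffer for one core
    buf.insert(0x1080); buf.insert(0x10C0);
    std::cout << "demand 0x1080 served at MC: " << buf.lookup(0x1080) << "\n";
    std::cout << "demand 0x2000 served at MC: " << buf.lookup(0x2000) << "\n";
    std::cout << "depth: " << buf.throttledDepth(4) << "\n";
}
```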