Meeting Midway: Improving CMP Performance with Memory-Side Prefetching
Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das, and Anand Sivasubramaniam
The Pennsylvania State University

Summary
• In modern multi-core systems, an increasing number of cores share common resources – the "Memory Wall"
• Application/core contention causes interference
• Proposal: a novel memory-side prefetching scheme
– Mitigates interference while exploiting row buffer locality
• Average 10% improvement in application performance

Outline
• Background
• Motivation
• Memory-Side Prefetching
• Evaluation
• Conclusion

Background
[Diagram: Network-on-Chip based CMP – a 2D mesh of tiles, each with a core (C), router (R), L1, and an L2 bank; four memory controllers (MC0–MC3) sit at the chip edges and exchange request and response messages with the tiles.]
[Diagram: CPU, memory controller (MC), and DRAM banks (Bank 0, Bank 1), each bank with a row buffer. A request to the currently open row (row A) is a row buffer hit and is served directly; a request to a different row (row B) is a row buffer conflict, requiring a precharge followed by an activate before the access.]

Impact of Interference
[Chart: row buffer hit rate (%) of applications running individually vs. in a mix of 8; hit rates drop sharply under the mix.]

Latency Breakdown of L2 Miss
[Pie charts: L2-miss latency split into on-chip, off-chip queueing, and off-chip access components for high-, moderate-, and low-MPKI applications; the off-chip portion (queueing plus access) dominates in every class.]

Observations
• Memory requests from multiple cores interleave at the memory controllers
– The row buffer locality of individual applications is lost
• Off-chip latency is the dominant component of a memory access
• The on-chip network and caches are critical resources
– We cannot afford to pollute them

What about Cache Prefetching?
• Not effective for large CMPs
• Agnostic to memory state
– Gap between caches and memory (62% latency increase)
• On-chip resource pollution
– Both caches and network (22% network latency increase)
• Difficulty of stream detection in S-NUCA
– Each L2 bank caters to only a portion of the address space
– Each L2 bank gets requests from multiple L1s
• Our memory-side prefetching scheme can work alongside core-side prefetching

Memory-Side Prefetching
• Objective 1 – Reduce off-chip access latency
• Objective 2 – Without increasing on-chip resource contention
• Three design questions: what to prefetch, when to prefetch, and where to prefetch

What to Prefetch?
• Prefetch from an open row
– Minimizes overhead
• Looked at the line access patterns within a row
[Charts: % of accesses to each cache line of a row (Line 0 through Line 60) for libquantum, milc, and omnetpp.]

When to Prefetch?
[Chart: "Idle Periods" – histogram of memory idle-period lengths; x-axis: cycles (1 to 33+), y-axis: number of occurrences.]

Trigger                          Critical Path   Locality   # of Prefetches
Prefetch at RBC (row conflict)   Yes             No         High
Prefetch at RBH (row hit)        No              Yes        Low
Prefetch at Row ACT (activate)   No              No         High
Prefetch at Idle                 No              Yes        High

• Triggering on row buffer hits and on idle periods stays off the critical path while retaining locality; a minimal sketch of the hit-triggered logic follows below.
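Sketch: Prefetching on a Row Buffer Hit
A minimal C++ sketch of the hit-triggered decision described above. This is an illustration, not the paper's implementation: the names (onDemandRead, kPrefetchDepth), the depth of four lines, and the 64-line row geometry are assumptions made here for concreteness.

```cpp
// Hypothetical sketch of the memory-side prefetch trigger.
#include <cstdint>
#include <deque>
#include <iostream>

constexpr uint64_t kLineBytes     = 64;   // cache line size (assumed)
constexpr uint64_t kLinesPerRow   = 64;   // e.g., 4 KB row / 64 B lines (assumed)
constexpr int      kPrefetchDepth = 4;    // lines fetched per trigger (assumed)

struct Request { int coreId; uint64_t addr; };

struct Bank {
    bool     rowOpen = false;
    uint64_t openRow = 0;                 // row address of the open row
};

uint64_t rowOf(uint64_t addr)  { return addr / (kLineBytes * kLinesPerRow); }
uint64_t lineOf(uint64_t addr) { return (addr / kLineBytes) % kLinesPerRow; }

// On every demand read, decide whether to enqueue prefetches.
// Policy from the talk: prefetch only on a row buffer *hit*, so the extra
// column reads are cheap and never force an activate on the critical path.
void onDemandRead(Bank& bank, const Request& req,
                  std::deque<Request>& prefetchQueue) {
    bool rowHit = bank.rowOpen && bank.openRow == rowOf(req.addr);
    if (!rowHit) {                        // miss/conflict: activate the new
        bank.rowOpen = true;              // row but issue no prefetches
        bank.openRow = rowOf(req.addr);
        return;
    }
    // Row buffer hit: grab the next few lines of the already-open row.
    for (int i = 1; i <= kPrefetchDepth; ++i) {
        uint64_t line = lineOf(req.addr) + i;
        if (line >= kLinesPerRow) break;  // stay within the open row
        prefetchQueue.push_back({req.coreId, req.addr + i * kLineBytes});
    }
}

int main() {
    Bank bank;
    std::deque<Request> pq;
    onDemandRead(bank, {0, 0x1000}, pq);  // first access: opens the row
    onDemandRead(bank, {0, 0x1040}, pq);  // row hit: triggers prefetches
    for (const auto& r : pq)
        std::cout << "prefetch core " << r.coreId << " addr 0x"
                  << std::hex << r.addr << std::dec << "\n";
}
```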
Where to Prefetch?
• Prefetched lines should be stored on-chip
• Prefetch buffers in the memory controllers
– Avoids on-chip resource pollution
• Organization
– Per-core
– Shared

Memory-Side Prefetching Optimizations
• Applications vary in memory behavior
• Prefetch throttling
– Feedback-driven
• Precharge on prefetch
– The row is less likely to get another request
• Avert costly prefetches
– When demand requests are waiting

Memory-Side Prefetching: Example
[Diagram: Cores 0–3 issue interleaved request streams (Core 0: A10, A11, ...; Core 1: C32, C33, C34, C36; Core 2: R12, R13, R14, R15; Core 3: F20, F21, F22, F23). Core 0's demand for A10 hits open row A in Bank 0, so the MC prefetches A11, A12, A13, and A14 from the open row into Core 0's prefetch buffer; the later demand for A11 is served from the prefetch buffer at the MC instead of from DRAM.]

Memory-Side Prefetching: Comparison

                              Cache        Existing Memory-Side          Our Memory-Side
                              Prefetcher   Prefetchers [Lui et al.,      Prefetcher
                                           ILP '11] [Lin, HPCA '01]
Memory state aware            No           Yes                           Yes
On-chip resource pollution    Yes          Yes                           No
Accuracy                      Yes          No                            Yes

Implementation
• Prefetch buffer implementation
– Organized as n per-core prefetch buffers
– 256 KB per memory controller (<3% of LLC capacity)
– <1% area and power overhead
• Prefetch request timing
– Prefetch requests are generated internally by the memory controller, alongside a demand read that hits the row buffer

Evaluation Platform
• Cores: 32 at 2.4 GHz
• Network: 8x4 2D mesh
• Caches: 32 KB L1I; 32 KB L1D; 1 MB L2 per core
• Memory: 16 GB DDR3-1600 with 4 memory channels
• GEMS simulator with GARNET

Evaluation Methodology
• Benchmarks:
– Multi-programmed: SPEC 2006 (WL1 to WL5)
– Multi-threaded: SPEC OMP 2001 (WL6 & WL7)
• Metrics:
– Harmonic mean of IPCs
– Off-chip and on-chip latencies

IPC Improvement
[Chart: IPC improvement (%) for WL1–WL7 and AVG under CSP (core-side prefetching), MSP (memory-side prefetching), MSP-PUSH, IDLE-PUSH, and CSP+MSP; one workload peaks at 33.2%, and a 10% average improvement is highlighted.]

Latency
[Chart: memory access latency in cycles for WL1–WL7 and AVG under No Pref, CSP, MSP, MSP-PUSH, IDLE-PUSH, and CSP+MSP; a 48.5% reduction in off-chip latency is highlighted.]

L2 Hit Rate
[Chart: L2 hit rate (%) for WL1–WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]

Row Buffer Hit Rate
[Chart: row buffer hit rate (%) for WL1–WL7 and AVG under No Pref, CSP, MSP, and CSP+MSP.]

Conclusion
• Proposed a new memory-side prefetcher
– Opportunistic
– Has instantaneous knowledge of memory state
• Prefetching midway
– Does not pollute on-chip resources
• Reduces off-chip latency by 48.5% and improves performance by 6.2% on average
• Our technique can be combined with core-side prefetching to amplify the benefits

Thank You
• Questions?
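Backup: Prefetch Buffer Sketch
To make the "where to prefetch" design concrete, below is a minimal C++ sketch of a per-core prefetch buffer at the memory controller with simple feedback throttling. The class name, the 128-entry capacity, the warm-up window, and the 25% accuracy threshold are illustrative assumptions; the talk specifies only that each MC holds 256 KB of buffers organized per core and that throttling is feedback-driven.

```cpp
// Hypothetical per-core prefetch buffer at the memory controller.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <list>
#include <unordered_map>

constexpr size_t kEntriesPerCore = 128;  // e.g., 128 x 64 B lines (assumed)

class PrefetchBuffer {
    // FIFO of prefetched line addresses for one core (data payload omitted).
    std::list<uint64_t> fifo_;
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> index_;
public:
    size_t useful = 0, inserted = 0;     // feedback counters

    void insert(uint64_t lineAddr) {
        if (index_.count(lineAddr)) return;
        if (fifo_.size() == kEntriesPerCore) {   // evict the oldest entry
            index_.erase(fifo_.front());
            fifo_.pop_front();
        }
        fifo_.push_back(lineAddr);
        index_[lineAddr] = std::prev(fifo_.end());
        ++inserted;
    }

    // A demand request checks here before being sent to DRAM.
    bool lookup(uint64_t lineAddr) {
        auto it = index_.find(lineAddr);
        if (it == index_.end()) return false;
        fifo_.erase(it->second);         // entry is consumed on a hit
        index_.erase(it);
        ++useful;
        return true;
    }

    // Feedback throttling: shrink the prefetch depth when few prefetches
    // turn out to be useful (threshold and warm-up window are assumptions).
    int throttledDepth(int maxDepth) const {
        if (inserted < 64) return maxDepth;               // warm-up
        double acc = double(useful) / double(inserted);
        return acc < 0.25 ? 1 : maxDepth;
    }
};

int main() {
    PrefetchBuffer buf;                  // buffer for one core
    buf.insert(0x1080); buf.insert(0x10C0);
    std::cout << "demand 0x1080 served at MC: " << buf.lookup(0x1080) << "\n";
    std::cout << "demand 0x2000 served at MC: " << buf.lookup(0x2000) << "\n";
    std::cout << "depth: " << buf.throttledDepth(4) << "\n";
}
```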