Coordinated Control of Multiple Prefetchers in Multi-Core Systems
Eiman Ebrahimi*  Onur Mutlu‡  Chang Joo Lee*  Yale N. Patt*
* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University

Motivation
- Aggressive prefetching improves memory latency tolerance of many applications when they run alone
- Prefetching for concurrently-executing applications on a CMP can lead to
  - Significant system performance degradation and bandwidth waste
- Problem: Prefetcher-caused inter-core interference
  - Prefetches of one application contend with prefetches and demands of other applications

Potential Performance
[Figure: performance normalized to no throttling for workloads WL1 to WL14 and Gmean-32; the chart is annotated with 56%]
- System performance improvement of ideally removing all prefetcher-caused inter-core interference in shared resources
- Exact workload combinations can be found in paper

Outline
- Background
- Shortcoming of Prior Approaches to Prefetcher Control
- Hierarchical Prefetcher Aggressiveness Control
- Evaluation
- Conclusion

Increasing Prefetcher Accuracy
- Increasing prefetcher accuracy can reduce prefetcher-caused inter-core interference
  - Single-core prefetcher aggressiveness throttling (e.g., Srinath et al., HPCA ’07)
  - Filtering inaccurate prefetches (e.g., Zhuang and Lee, ICPP ’03)
  - Dropping inaccurate prefetches at memory controller (Lee et al., MICRO ’08)
- All such techniques operate independently on the prefetches of each application

Feedback-Directed Prefetching (FDP) (Srinath et al., HPCA ’07)
- Uses prefetcher feedback information local to the prefetcher’s core
  - Prefetch accuracy
  - Prefetch timeliness
  - Prefetch cache pollution
- Dynamically adapts the prefetcher’s aggressiveness
  - Stream prefetcher aggressiveness: Prefetch Distance and Prefetch Degree
[Diagram: access stream with labels A, A+1, P, P+1, P+2, P+3, P+4, marking Prefetch Distance and Prefetch Degree]
- Shown to perform better than and consume less bandwidth than static aggressiveness configurations

High Interference caused by Accurate Prefetchers
[Diagram: Core0 to Core3 issue demand (Dem) and prefetch (Pref) requests to a Shared Cache and a Memory Controller in front of DRAM Bank 0 and Bank 1 with their row buffers; legend: "Dem X Addr: Y" is a demand request from core X for address Y; the diagram marks a shared cache miss, row buffer hits, and the requests being serviced]
- In a CMP system, accurate prefetchers can cause significant interference with concurrently-executing applications

Shortcoming of Per-Core (Local-Only) Prefetcher Aggressiveness Control
[Diagram: Core 0 to Core 3 share a cache (Set 0, Set 1, Set 2); FDP throttles up two prefetchers (Prefetcher Degree labels show 2 and 4); the cache sets hold demand (Dem) and prefetch (Pref) lines of different cores, some marked Used_P]
- Local-only prefetcher control techniques have no mechanism to detect inter-core interference

Shortcoming of Local-Only Prefetcher Control
- 4-core workload example: lbm_06 + swim_00 + crafty_00 + bzip2_00
[Figure: speedup over alone run for bzip2_00, crafty_00, swim_00 and lbm_06, plus Hspeedup, comparing No Prefetching, Pref. + No Throttling, Feedback-Directed Prefetching, and HPAC]
- Our Approach: Use both global and per-core feedback to determine each prefetcher’s aggressiveness

Hierarchical Prefetcher Aggressiveness Control (HPAC)
- Local control’s goal: Maximize the prefetching performance of core i independently
- Global control’s goal: Keep track of and control prefetcher-caused inter-core interference in shared memory system
- Global control: accepts or overrides decisions made by local control to improve overall system performance
[Diagram: for Pref. i of Core i, Local Control uses accuracy feedback to produce a Local Throttling Decision; Global Control receives Bandwidth Feedback from the Memory Controller and Cache Pollution Feedback from the Shared Cache and produces the Final Throttling Decision]

Terminology
Global feedback metrics used in our mechanism. For each core i:
- Core i’s prefetcher accuracy: Acc (i)
- Core i’s prefetcher-caused inter-core cache pollution: Pol (i)
  - Demand cache lines of other cores evicted by this core’s prefetches that are requested subsequent to eviction
- Bandwidth consumed by core i: BW (i)
  - Accounts for how long requests from this core tie up DRAM banks
- Bandwidth needed by other cores j != i: BWNO (i)
  - Accounts for how long requests from other cores have to wait for DRAM banks because of requests from this core

Calculating Inter-Core Cache Pollution
- Prefetch from core i evicts a core j’s demand from shared cache
- Core j experiences a demand cache miss
[Diagram: Pollution Filter of core i, a table of entries holding a Pollution bit and a Core id, indexed through a Hash Function. The evicted line’s address selects the entry that is marked on eviction; a missing address from core j that hits a marked entry with core id j increments Pol (i)]

Hierarchical Prefetcher Aggressiveness Control (HPAC)
- High accuracy
- High pollution
- High bandwidth consumed while other cores need bandwidth
[Diagram: for Pref. i of Core i, the Local Throttling Decision from Local Control is Throttle Up. Global Control sees high Acc (i), high Pol (i) from Pol. Filter i at the Shared Cache, and high BW (i) and high BWNO (i) from the Memory Controller, so the Final Throttling Decision is Enforce Throttle Down]
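The case listed above (high accuracy, high pollution, high bandwidth consumed while other cores need bandwidth) can be written out as a small sketch. This is a hypothetical illustration, not the authors' implementation: the names `Feedback` and `slide_case_override` are invented here, the threshold values are the ones listed under "HPAC thresholds used" in the evaluation methodology, and treating a metric as "high" when it exceeds its threshold is an assumption. Only this one case is encoded; every other combination returns `None` rather than guessing the rest of the policy table.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold values as listed under "HPAC thresholds used".
# Treating "above threshold" as "high" is an assumption of this sketch.
ACC_THRESH = 0.6
BW_THRESH = 50_000
POL_THRESH = 90
BWNO_THRESH = 75_000


@dataclass
class Feedback:
    """Hypothetical container for the global feedback metrics of core i."""
    acc: float  # Acc(i): prefetcher accuracy
    pol: int    # Pol(i): inter-core cache pollution caused by core i
    bw: int     # BW(i): bandwidth consumed by core i
    bwno: int   # BWNO(i): bandwidth needed by the other cores


def slide_case_override(fb: Feedback) -> Optional[str]:
    """Return "throttle_down" for the single case shown on the slide.

    Returns None for every other combination, because this sketch does
    not encode the remaining rows of the policy table.
    """
    highly_accurate = fb.acc > ACC_THRESH
    high_pollution = fb.pol > POL_THRESH
    high_bw_consumed = fb.bw > BW_THRESH
    others_need_bw = fb.bwno > BWNO_THRESH
    if highly_accurate and high_pollution and high_bw_consumed and others_need_bw:
        return "throttle_down"
    return None


# Usage: local control wants to throttle up, global control overrides it.
local_decision = "throttle_up"
override = slide_case_override(Feedback(acc=0.9, pol=120, bw=60_000, bwno=80_000))
final_decision = override if override is not None else local_decision
```

Falling back to the local decision in the last line is only there to make the usage example complete; it does not describe how HPAC handles the cases this sketch leaves out.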
Heuristics for Global Control
- Classification of global control heuristics based on interference severity
  - Severe interference
    Action: Reduce the aggressiveness of interfering prefetcher
  - Borderline interference
    Action: Prevent prefetcher from transitioning into severe interference: allow local control to only throttle down
  - No interference or moderate interference from an accurate prefetcher
    Action: Allow local control to maximize local benefits from prefetching

HPAC Control Policies
[Table: columns Acc (i), Pol (i), BW (i), BWNO (i), Interference Class, Action]
- Acc (i): Inaccurate or Highly Accurate
- Pol (i): Causing Low Pollution or Causing High Pollution
- BW (i): Low BW Consumption or High BW Consumption
- BWNO (i): Others’ low BW need or Others’ high BW need
- Interference Class and Action entries shown: Severe interference, throttle down (three rows)

Hardware Cost (4-Core System)
  Total hardware cost, local control & global control:  15.14 KB
  Additional cost on top of FDP:                        1.55 KB
- Additional cost on top of FDP only 1.55 KB
- HPAC does not require any structures or logic that are on the processor’s critical path

Evaluation Methodology
- x86 cycle accurate simulator
- Baseline processor configuration
  - Per core
    - 4-wide issue, out-of-order, 256-entry ROB
    - Stream prefetcher with 32 streams, prefetch degree: 4, prefetch distance: 64
  - Shared
    - 2MB, 16-way L2 cache (4MB, 32-way for 8-core)
    - DDR3 1333MHz
    - 8B wide core to memory bus
    - 128, 256 L2 MSHRs for 4-, 8-core
    - Latency of 15ns per command (tRP, tRCD, CL)
- HPAC thresholds used
    Acc: 0.6    BW: 50k    Pol: 90    BWNO: 75k

Performance Results
[Figure: system performance normalized to no throttling for WL1 to WL14 and AVG-32, comparing No Prefetching, FDP, and HPAC; workloads are grouped into Class 1 to Class 4; the chart is annotated with 15%]
- Exact workload combinations can be found in paper

Summary of Other Results
- Further results and analysis are presented in the paper
  - Results with different types of memory controllers
    - Prefetch-Aware DRAM Controllers (PADC)
    - First-Ready First-Come-First-Served (FR-FCFS)
  - Effect of HPAC on system fairness
  - HPAC performance on 8-core systems
  - Multiple types of prefetchers per core and different local-control policies
  - Sensitivity to system parameters

Conclusion
- Prefetcher-caused inter-core interference can destroy potential performance of prefetching
  - When prefetching for concurrently executing applications in CMPs
  - Did not exist in single-application environments
- Develop one low-cost hierarchical solution which throttles different cores’ prefetchers in a coordinated manner
- The key is to take global feedback into account to determine aggressiveness of each core’s prefetcher
- Improves system performance by 15% compared to no throttling on a 4-core system
- Enables performance improvement from prefetching that is not possible without it on many workloads

Thank you! Questions?
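Backup sketch: Pol (i) bookkeeping

The pollution filter from the "Calculating Inter-Core Cache Pollution" slide can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the hardware design: the class name `PollutionFilter`, the table size, the line size, and the modulo hash are all invented for illustration, and clearing an entry once it has been counted is an assumption the slides do not confirm.

```python
class PollutionFilter:
    """Hypothetical software model of the pollution filter of core i."""

    def __init__(self, num_entries: int = 1024, line_bytes: int = 64):
        # num_entries and line_bytes are illustrative values, not from the slides.
        self.num_entries = num_entries
        self.line_bytes = line_bytes
        self.pollution_bit = [False] * num_entries
        self.core_id = [None] * num_entries
        self.pol = 0  # Pol(i)

    def _index(self, addr: int) -> int:
        # Stand-in hash function: cache-line number modulo the table size.
        return (addr // self.line_bytes) % self.num_entries

    def on_prefetch_evicts_demand(self, evicted_addr: int, victim_core: int) -> None:
        """A prefetch from core i evicted a demand line that belongs to victim_core."""
        idx = self._index(evicted_addr)
        self.pollution_bit[idx] = True
        self.core_id[idx] = victim_core

    def on_demand_miss(self, miss_addr: int, missing_core: int) -> None:
        """Core missing_core had a demand miss on miss_addr in the shared cache."""
        idx = self._index(miss_addr)
        if self.pollution_bit[idx] and self.core_id[idx] == missing_core:
            self.pol += 1
            # Assumption: clear the entry so the same eviction is counted once.
            self.pollution_bit[idx] = False
            self.core_id[idx] = None


# Usage: core i's prefetch evicts a line of core 2, which later misses on it.
filt = PollutionFilter()
filt.on_prefetch_evicts_demand(evicted_addr=0x1000, victim_core=2)
filt.on_demand_miss(miss_addr=0x1000, missing_core=2)  # counted toward Pol(i)
filt.on_demand_miss(miss_addr=0x2000, missing_core=3)  # unrelated miss, ignored
```

Because the table is indexed by a hash, two different addresses can share an entry, so a model like this can both overcount and undercount pollution.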