Feedback Directed Prefetching
Santhosh Srinath, Onur Mutlu, Hyesoon Kim, Yale N. Patt
HPCA-13

Problem
- Prefetching can significantly improve performance
  - When prefetches are accurate
  - And timely
- However, prefetching can also significantly degrade performance due to
  - Memory bandwidth impact
  - Pollution of the cache

Solution
- Feedback Directed Prefetching is a comprehensive mechanism that reduces the negative effects of prefetching and improves its positive effects

Outline
- Background and Motivation
- Feedback Directed Prefetching (FDP)
  - Metrics and how to collect them
  - How to adapt
    - Prefetcher Aggressiveness
    - Cache Insertion Policy for Prefetches
- Results

Background (Prefetcher Aggressiveness)
[Figure: prefetch distance and prefetch degree, shown on an access stream (X, X+1) and its predicted stream (up to Pmax) for the Very Conservative, Middle-of-the-Road, and Very Aggressive configurations]
- Very Aggressive
  - Well ahead of the load access stream
  - Hides memory access latency better
  - More speculative
- Very Conservative
  - Closer to the load access stream
  - Might not hide memory access latency completely
  - Reduces potential for cache pollution and bandwidth contention

Motivation
[Figure: Instructions per Cycle (0.0 to 5.0) for No Prefetching, Very Conservative, Middle-of-the-Road, and Very Aggressive on bzip2, gap, mcf, parser, vortex, vpr, ammp, applu, art, equake, facerec, galgel, mesa, mgrid, sixtrack, swim, wupwise, and gmean; annotations: 48%, 29%]
- Very Aggressive improves average performance by 84%
- However, it can also significantly reduce performance on some benchmarks
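The two aggressiveness knobs from the background slides, prefetch distance and prefetch degree, can be illustrated with a small sketch. It assumes the usual reading of the terms: distance bounds how far ahead of the demand access the prefetcher may run, and degree bounds how many prefetches are issued per triggering access. The function name `prefetch_candidates` and the parameter `last_prefetched` are hypothetical and do not come from the talk; this is not the stream prefetcher evaluated there, only a simplified model of the two parameters.

```python
def prefetch_candidates(demand_block, last_prefetched, distance, degree):
    """Return the block addresses to prefetch after one demand access.

    Assumptions (illustrative only):
    - addresses are cache-block numbers in an ascending stream
    - `distance` is the farthest the prefetcher may run ahead of the demand
    - `degree` is the maximum number of prefetches issued per trigger
    - `last_prefetched` is the highest block already prefetched
    """
    # Never re-issue blocks that were already prefetched or demanded.
    start = max(last_prefetched, demand_block) + 1
    # Do not run further ahead of the demand stream than `distance` blocks.
    limit = demand_block + distance
    # Issue at most `degree` prefetches for this trigger.
    end = min(limit, start + degree - 1)
    return list(range(start, end + 1))


# A conservative setting (distance 4, degree 1) stays close to the demand
# stream; an aggressive one (distance 64, degree 4) runs far ahead of it.
conservative = prefetch_candidates(100, 100, distance=4, degree=1)
aggressive = prefetch_candidates(100, 110, distance=64, degree=4)
```

A larger distance lets the prefetcher hide more latency at the cost of more speculation, which is the trade-off the Very Aggressive and Very Conservative bullets describe.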

Feedback Directed Prefetching
- Comprehensive mechanism that takes into account:
  - Prefetcher Accuracy
  - Prefetcher Lateness
  - Prefetcher-caused Cache Pollution
- Adapts:
  - Prefetcher Aggressiveness
  - Cache Insertion Policy for Prefetches

Metrics
- Prefetch Accuracy
- Prefetch Lateness
- Prefetcher-caused Cache Pollution

Prefetch Accuracy
$$\text{Prefetch Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}}$$
- A prefetch is useful if the prefetched block is referenced by a demand request while it is in the L2 cache
[Figure: Percentage IPC change over No Prefetching (-100% to 400%) plotted against Prefetcher Accuracy (0 to 1)]
- Low accuracy: more likely that prefetching reduces performance
$$\text{Prefetch Accuracy} = \frac{\mathit{used\_total}}{\mathit{pref\_total}}$$
- Implementation
  - pref-bit added to each L2 tag-store entry
  - Tracked using two counters: pref_total, used_total

Prefetch Lateness
$$\text{Prefetch Lateness} = \frac{\text{Number of Late Prefetches}}{\text{Number of Useful Prefetches}} = \frac{\mathit{late\_total}}{\mathit{used\_total}}$$
- Measure of how timely prefetches are
- Used to determine whether increasing the aggressiveness helps
- Implementation
  - pref-bit added to each L2 MSHR entry
  - New counter: late_total

Prefetcher-caused Cache Pollution
$$\text{Prefetcher-caused Cache Pollution} = \frac{\text{Number of Demand Misses Caused by the Prefetcher}}{\text{Number of Demand Misses}}$$
- Measure of the disturbance caused by prefetched data in the cache
- Used to determine whether the prefetcher is evicting useful data from the cache

Prefetcher-caused Cache Pollution (2)
$$\text{Prefetcher-caused Cache Pollution} = \frac{\mathit{pollution\_total}}{\mathit{demand\_total}}$$
- Hardware implementation
  - Insight: this does not need to be exact
  - Track pollution using a Pollution Filter
    - Based on the Bloom filter concept
    - Bit set when a prefetch evicts a block brought in by a demand miss
    - Bit reset when a prefetch is serviced
  - Two counters: pollution_total, demand_total

How to adapt? Prefetcher Aggressiveness

Dynamic Configuration Counter   Current Aggressiveness   Distance   Degree
1                               Very Conservative        4          1
2                               Conservative             8          1
3                               Middle-of-the-Road       16         2
4                               Aggressive               32         4
5                               Very Aggressive          64         4

How to adapt? Prefetcher Aggressiveness (2)
- For the current phase, classify accuracy, lateness, and cache pollution based on static thresholds
[Figure: decision diagram that combines accuracy (High, Medium, Low), lateness (Late, Not-Late), and pollution (Polluting, Not-Polluting) into one of three updates: Increase, Decrease, or No Change]
- Goals of the updates
  - Improve timeliness
  - Reduce cache pollution
  - Reduce memory bandwidth usage and cache pollution caused by prefetches

How to Adapt? Cache Insertion Policy for Prefetches
- Why adapt?
  - Reduce the potential for cache pollution
- Classify cache pollution based on static thresholds:
  - Low: insert at MID (n/2) position
    - E.g., for a 16-way cache, MID = 8 in the LRU stack
  - Medium: insert at LRU-4 (n/4) position
    - E.g., for a 16-way cache, LRU-4 = 4 in the LRU stack
  - High: insert at LRU position

Evaluation Methodology
- Execution-driven Alpha simulator
- Aggressive out-of-order superscalar processor
- 1 MB, 16-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Detailed memory model
- Prefetchers modeled:
  - Stream Prefetcher tracking 64 different streams
  - Global History Buffer Prefetcher (in paper)
  - PC-based Stride Prefetcher (in paper)

Results: Adjusting Only Aggressiveness
- 4.7% higher average IPC over the Very Aggressive configuration
- Most of the performance losses have been eliminated

Results: Adjusting Only Cache Insertion Policy
[Figure: Instructions per Cycle (0.0 to 5.0) with the Very Aggressive prefetcher for No Prefetching, LRU, LRU-4, MID, MRU, and Dynamic Insertion]
- 5.1% better than inserting prefetches in the MRU position
- 1.9% better than inserting prefetches in the LRU-4 position

Results: Putting it all together (FDP)
[Figure: IPC results for FDP; annotations: 11%, 13%]
- 6.5% IPC improvement over the Very Aggressive configuration
- Performance losses converted to performance gains!

Bandwidth Impact
- BPKI: Memory Bus Accesses per 1000 retired Instructions
  - Includes effects of L2 demand misses as well as pollution-induced misses and prefetches

        No Pref.   Very Cons   Mid     Very Aggr   FDP
IPC     0.85       1.21        1.47    1.57        1.67
BPKI    8.56       9.34        10.60   13.38       10.88

- 6.5% higher performance and 18.7% less bandwidth
- 13.6% higher performance with similar bandwidth usage
- FDP significantly improves bandwidth efficiency

Hardware Cost

pref-bits for L2 cache   16384 blocks           16384 bits
Pollution Filter         4096 entries * 1 bit   4096 bits
16-bit counters          11 counters            176 bits
pref-bits for MSHR       128 entries            128 bits
Total hardware cost                             20784 bits = 2.54 KB

- Percentage area overhead compared to baseline 1 MB L2 cache: 2.5KB/1024KB = 0.24%
- NOT on the critical path

Outline
- Background and Motivation
- Feedback Directed Prefetching
  - Metrics and collecting this information in hardware
  - How to adapt
- Results
- Conclusions

Contributions
- Comprehensive and low-cost feedback mechanism for hardware prefetchers
  - Uses
    - Prefetcher Accuracy
    - Prefetcher Lateness
    - Prefetcher-caused Cache Pollution
  - Adapts
    - Aggressiveness
    - Cache Insertion Policy for prefetches
- 6.5% higher performance and 18.7% less bandwidth compared to Very Aggressive prefetching
  - Eliminates negative impact of prefetching
- Applicable to any data prefetch algorithm

Questions?

Backups

FDP vs Prefetch Cache
- Prefetch caches eliminate prefetcher-induced cache pollution
- However, prefetches are now limited to the size of the prefetch cache
- 5.3% higher performance than Very Aggressive + 32KB prefetch cache
- Within 2% of Very Aggressive + 64KB prefetch cache
- Memory bandwidth of FDP is 16% less than with the 32KB prefetch cache and 9% less than with the 64KB prefetch cache

Performance on Other Prefetch Algorithms
- Global History Buffer Prefetcher
  - 20.8% less memory bandwidth than Very Aggressive with similar performance
  - 9.9% better performance than Middle-of-the-Road with similar bandwidth usage
- PC-based Stride Prefetcher
  - 4% better performance than Very Aggressive
  - 24% reduction in bandwidth usage

Backup figure slides (titles only)
- IPC Performance
- Dynamic Prefetcher Accuracy
- Prefetch Lateness
- Pollution Filter
- Thresholds
- Prefetches Sent
- Distribution of dynamic aggressiveness level
- Distribution of insertion position of prefetched blocks
- Effect of FDP on memory bandwidth consumption
- Performance of prefetch cache vs. FDP
- Bandwidth consumption of prefetch cache vs. FDP
- Effect of FDP on GHB
- Effect of FDP on GHB (Bandwidth)
- Effect of varying L2 size and memory latency
- IPC on other benchmarks
- BPKI on other benchmarks
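
The bookkeeping described in the metrics and adaptation slides can be summarized in a short sketch. It computes the three ratios from the named counters, maps the Dynamic Configuration Counter to the distance and degree values listed in the aggressiveness table, and picks an insertion position from the pollution class. The names `Counters`, `metrics`, `update_level`, and `insertion_position` are hypothetical. Two details are assumptions rather than statements from the slides: the configuration counter is treated as saturating at 1 and 5, and the LRU position is numbered 0 in the LRU stack. The static thresholds are not given in the slides, so the update decision and the pollution class are passed in by the caller rather than derived here.

```python
from dataclasses import dataclass

# Dynamic Configuration Counter -> (aggressiveness, distance, degree),
# values as listed in the aggressiveness table of the talk.
LEVELS = {
    1: ("Very Conservative", 4, 1),
    2: ("Conservative", 8, 1),
    3: ("Middle-of-the-Road", 16, 2),
    4: ("Aggressive", 32, 4),
    5: ("Very Aggressive", 64, 4),
}


@dataclass
class Counters:
    """The five event counters named in the slides."""
    pref_total: int = 0       # prefetches sent to memory
    used_total: int = 0       # useful prefetches
    late_total: int = 0       # late prefetches
    pollution_total: int = 0  # demand misses caused by the prefetcher
    demand_total: int = 0     # all demand misses


def _ratio(num, den):
    # Report 0.0 when nothing has been counted yet (an assumption).
    return num / den if den else 0.0


def metrics(c):
    """Compute accuracy, lateness, and pollution from the counters."""
    return {
        "accuracy": _ratio(c.used_total, c.pref_total),
        "lateness": _ratio(c.late_total, c.used_total),
        "pollution": _ratio(c.pollution_total, c.demand_total),
    }


def update_level(level, decision):
    """Apply "increase", "decrease", or "no change" to the counter.

    Saturation at 1 and 5 is an assumption of this sketch.
    """
    step = {"increase": 1, "decrease": -1, "no change": 0}[decision]
    return min(5, max(1, level + step))


def insertion_position(pollution_class, ways):
    """LRU-stack position for a prefetched block (0 = LRU, assumed)."""
    return {
        "low": ways // 2,     # MID (n/2)
        "medium": ways // 4,  # LRU-4 (n/4)
        "high": 0,            # LRU
    }[pollution_class]


# Example with made-up counter values for one sampling interval.
sample = metrics(Counters(pref_total=200, used_total=150, late_total=30,
                          pollution_total=5, demand_total=100))
next_level = update_level(4, "increase")
name, distance, degree = LEVELS[next_level]
```

Keeping the classification step outside the sketch avoids inventing threshold values; a caller would compare the three ratios against its own thresholds and then pass the resulting decision and pollution class to `update_level` and `insertion_position`.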