Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems

Eiman Ebrahimi*  Onur Mutlu‡  Yale N. Patt*
* HPS Research Group, University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University

Motivation
- Prefetching can significantly reduce the impact of memory latency on performance
- Stream prefetching is very useful but unable to reduce the latency of many misses
- Access patterns that follow pointers in linked data structures (LDS) are prevalent in many applications
- High-performance and bandwidth-efficient LDS prefetchers are needed

Potential Performance
[Figure: IPC delta of ideal LDS prefetching over stream prefetching. Y-axis: IPC Delta (%), 0 to 140, with one off-scale bar labeled 615. X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, gmean, gmean-no-health.]

Our Goal
Develop techniques that
1) Enable low-cost and bandwidth-efficient prefetching of linked data structure accesses
2) Efficiently combine such prefetchers with commonly employed stream-based prefetchers

Outline
- Background
- Efficient Content Directed LDS Prefetching
- Managing Multiple Prefetchers in a Hybrid Prefetching System
- Evaluation
- Conclusion

Content-Directed Prefetching (CDP) (Cooksey et al.
ASPLOS ’02)
- Attractive approach: requires no state
- Searches for pointers as data is fetched from memory
- Virtual address predictor
    - Compares the high-order bits of values within a cache line with the cache line's address
    - Generates prefetch requests on a match

Content-Directed Prefetching (CDP)
[Figure: virtual address predictor. Bits [31:20] of each value in a fetched cache line (shown values: x40373551, x80011100) are compared with bits [31:20] of the cache line's address (shown: x80022220); a match generates a prefetch. The line is shown between DRAM and the L2.]

Shortcomings of CDP
- CDP prefetches all identified pointers
- Indiscriminate prefetching of all discovered pointers leads to
    - Low prefetch accuracy
    - High cache pollution
    - High bandwidth consumption

Shortcomings of CDP – An example
Example from mst

    HashLookup(int Key) {
        …
        /* walk the chain; stop at the matching node or at the end of the list */
        for (node = head; node && node->Key != Key; node = node->Next)
            ;
        if (node) return node->D1;
    }

    struct node {
        int Key;
        int * D1_ptr;
        int * D2_ptr;
        node * Next;
    };

[Figure: linked list of nodes, each with Key, D1, and D2 fields.]

Shortcomings of CDP – An example
[Figure: the virtual address predictor compares bits [31:20] of every field in the fetched cache line (Key, D1_ptr, D2_ptr, Next) with bits [31:20] of the cache line address.]

Efficient Content Directed Prefetching (ECDP) – Basic Idea
- A compiler-guided technique that identifies likely-useful pointer addresses to
  prefetch
- The compiler profiles and provides hints as to which pointer addresses are likely useful to prefetch
- Hardware uses the hints to prefetch only likely-useful pointers

Terminology – Pointer Group (PG)

    LD1: data = node -> data;
    …
    node = node -> left;

    struct node {
        int data;
        int key;
        node * left;
        node * right;
    }

PG(L, X) = { all pointers at offset X from the byte accessed by instruction L }
[Figure: consecutive nodes (data, key, left, right). LD1 accesses the data field of each node; the pointers at offset 8 from that byte are labeled P1, P2, ...]
PG(LD1, 8) = {P1, P2, etc.}

Efficient Content-Directed Prefetching (ECDP)
- The PG definition naturally associates a number of PGs with each load instruction
1) Compile-time profiling classifies PGs as beneficial or harmful
2) Hardware prefetches PGs that are beneficial
   - Information is conveyed to hardware with a hint bit vector embedded in the load instruction

Beneficial vs Harmful PG
[Figure: PG1 = {P1, P2, P3}. The prefetches triggered by the three pointers are counted as 25 useful / 12 useless, 50 useful / 9 useless, and 33 useful / 10 useless.]
25 + 50 + 33 > 12 + 9 + 10
PG1's useful prefetches > PG1's useless prefetches
A pointer group is classified as beneficial when the majority of its prefetches are useful

ECDP mechanism – Example
- LD1's associated beneficial pointer groups: PG1 = {LD1, 8}, PG2 = {LD1, 24}, PG3 = {LD1, 44}
- Assuming 4-byte address values
- LD1 hint bit vector: bit 2 (offset 8), bit 6 (offset 24), and bit 11 (offset 44) are 1; the other bits are 0
[Figure: the hint bit vector laid over a cache line of nodes (data, key, left, right); pointers are marked Prefetch or Don't Prefetch. Other labels: offset 12, byte 12.]

Managing Multiple Prefetchers in a Hybrid Prefetching System
- ECDP can be complementary to a stream prefetcher
- Multiple prefetchers can deny service to each other as they
  contend for
    - Memory request buffer entries
    - DRAM bus bandwidth and DRAM banks
    - Cache space
- Unmanaged use of multiple prefetchers causes
    - Performance degradation
    - Inability to gain the full performance benefit of using multiple prefetchers

Coordinated Throttling of Multiple Prefetchers – Basic Idea
- Dynamic feedback is gathered for every prefetcher in the system
- Simple heuristics use the feedback to adapt each prefetcher's aggressiveness

Adapting Stream Prefetcher Aggressiveness
- Stream prefetcher aggressiveness: prefetch distance and prefetch degree
[Figure: access stream A, A+1, ..., P, P+1, P+2, P+3, P+4, annotated with prefetch distance and prefetch degree.]

Adapting CDP Aggressiveness
- Each memory request is assigned a depth value
- A demand-accessed line is assigned depth 0
- CDP aggressiveness: maximum allowed prefetch depth
[Figure: a line fetched by a demand access (depth 0) holds ptr1 to ptr4; lines at depth 1 hold ptr5 to ptr7; lines at depth 2 hold ptr8 to ptr10.]

Coordinated Prefetcher Aggressiveness Control Policies
- Each prefetcher adapts its own aggressiveness
- The goal: allow the prefetcher most likely to improve performance to use more shared resources
- The deciding prefetcher adapts its own aggressiveness based on
    - Deciding prefetcher coverage and accuracy
    - Rival prefetcher coverage
[Figure: the stream prefetcher and the content-directed prefetcher each send prefetches to the shared memory resources and receive feedback; each is labeled both (Deciding) and (Rival).]

Coordinated Prefetcher Aggressiveness Control Policies

Case | Rival Feedback | Action        | Reason
(a)  | -              | throttle down | avoid unnecessary bandwidth consumption and cache pollution
(b)  | Rival Cov High | throttle down | give rival prefetcher chance to use more shared resources
(c)  | Rival Cov Low  | throttle up   | give deciding prefetcher chance to improve coverage
(d)  | Rival Cov High | do nothing    | deciding prefetcher not causing trouble, rival performing well
(e)  | -              | throttle up   | deciding prefetcher performing well, avoid performance loss

Deciding Prefetcher Feedback column values: Deciding Cov Low, Deciding Cov High, Deciding Acc Low, Deciding Acc Med, Deciding Acc High

Hardware
Cost of all Techniques
- Total hardware cost: 2.11 KB
- Percentage area overhead (as a fraction of the baseline 1MB L2 cache): 0.206%
- Major components
    - 'prefetched' bits for each L2 line, used to account for useful/useless prefetches
    - Eleven 16-bit counters to estimate prefetcher coverage and accuracy

Evaluation Methodology
- x86 cycle-accurate simulator
- Baseline processor configuration
    - Per core
        - 4-wide issue, out-of-order, 256-entry ROB
        - 1MB, 8-way L2 cache
        - Stream prefetcher with 32 streams, prefetch degree 4, prefetch distance 32
        - Content-directed prefetcher, compare bits 8, max depth 4
    - Shared
        - 450-cycle memory latency
        - 8B-wide core-to-memory bus
        - 32, 64, 128 L2 MSHRs for 1-, 2-, 4-core
- Coordinated prefetcher throttling thresholds: Tcoverage = 0.2, Alow = 0.4, Ahigh = 0.7

Overall Performance
[Figure: IPC normalized to stream prefetching (y-axis 0 to 1.4; off-scale bars labeled 2.27, 2.27, 2.58, 1.75; annotation: 22.5%). Legend: Str Pref. + Orig CDP; Str Pref. + ECDP; Str Pref. + ECDP + Coord. Thrott. X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, gmean, gmean-no-health.]

Memory Bandwidth Consumption
[Figure: bus accesses per 1K instructions (y-axis 0 to 100; one off-scale bar labeled 375). Legend: Str Pref. Only; Str Pref. + Orig. CDP; Str Pref. + ECDP; Str Pref. + ECDP + Coord. Thrott. X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, gmean.]

Comparison to other LDS/Correlation Prefetchers
[Figure: IPC normalized to stream prefetching (y-axis 0 to 1.4; off-scale bars labeled 2.58, 2.36, 1.73, 1.88, 1.75). X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, gmean, gmean-no-health. Legend: Str Pref. + DBP; Str Pref. + Markov GHB; Str Pref. + ECDP + Coord. Thrott.]

Summary of Other Results
- Further comparisons and analysis are presented in the paper
    - Feedback Directed Prefetching: 5% avg. improvement
    - HW Prefetch Filtering: 17% avg. improvement
- Multi-core results
    - Dual core: 10.4% avg. improvement
    - Quad core: 9.5% avg. improvement
- Effects of techniques on prefetcher accuracy and coverage

Conclusion
- Developed a low-cost and bandwidth-efficient HW/SW cooperative linked data structure prefetcher
    - ECDP utilizes compiler hints to prefetch only likely-useful pointers
- Inter-prefetcher interference can destroy potential performance
    - Coordinated throttling manages interference between multiple prefetchers
- Efficient integration of ECDP with stream prefetching
    - Improves average performance by 22% over stream prefetching alone
    - Reduces bandwidth consumption by 25%

Thank you! Questions?

Comparison to Feedback-Directed Prefetching (Srinath et al. HPCA ’07)
[Figure: IPC normalized to stream prefetching (y-axis 0 to 1.4). Legend: Str Pref. + ECDP + FDP; Str Pref. + ECDP + Coord. Thrott. X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, gmean, gmean-no-health.]

Comparison to HW Prefetch Filtering (Zhuang and Lee ICPP ’03)
[Figure: IPC normalized to stream prefetching (y-axis 0 to 1.4; off-scale bars labeled 2.27, 2.58, 1.55, 1.61, 1.75, 1.77). Legend: Str Pref. + Orig. CDP + HW-Filter; Str Pref. + Orig. CDP + HW-Filter + Coord. Thrott.; Str Pref. + ECDP; Str Pref. + ECDP + Coord. Thrott.]
X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, gmean, gmean-no-health.

Performance on Dual-Core
[Figure: IPC normalized to stream prefetching (y-axis 0 to 2.4). Legend: Str Pref. Only; Str Pref. + DBP; Str Pref. + Markov GHB; Str Pref. + ECDP + Coord. Thrott. Workload pairs: (mcf, gcc); (xalanc, astar); (gcc, milc); (astar, leslie); (xalanc, namd); (omnetpp, soplex); (astar, mcf); (astar, h264); (pfast, xalanc); (omnetpp, perl); (pfast, leslie); (GemsFDTD, h264); gmean.]

Performance on Quad-Core
[Figure: IPC normalized to stream prefetching (y-axis 0 to 3.8). Legend: Str Pref. Only; Str Pref. + DBP; Str Pref. + Markov GHB; Str Pref. + ECDP + Coord. Thrott. Workloads: (mcf, astar, xalanc, perl); (omnetpp, gcc, h264, milc); (tonto, soplex, xalanc, pfast); (omnetpp, namd, tonto, gobmk); gmean.]

Stream Prefetcher Accuracy
[Figure: stream prefetcher accuracy (y-axis 0 to 100). Legend: Str Pref. + Orig. CDP; Str Pref. + ECDP; Str Pref. + Orig. CDP + Coord. Thrott.; Str Pref. + ECDP + Coord. Thrott. X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, amean.]

CDP Accuracy
[Figure: CDP accuracy (y-axis 0 to 100). Legend: Str Pref. + Orig. CDP; Str Pref. + ECDP; Str Pref. + Orig. CDP + Coord. Thrott.; Str Pref. + ECDP + Coord. Thrott. X-axis: perlbench_06, gcc_06, mcf_06, astar_06, xalancbmk_06, omnetpp_06, parser_00, art_00, ammp_00, bisort, health, mst, perimeter, voronoi, pfast, amean.]

Sensitivity Study
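
Illustrative sketch: CDP line scan with ECDP hint filtering

As a companion to the CDP and ECDP mechanism slides, the C sketch below shows one way the two checks could be combined when a line is fetched: the CDP virtual address predictor compares the high-order bits of each word with the line's address, and the ECDP hint bit vector keeps only the offsets the compiler marked as beneficial. The names looks_like_pointer, ecdp_scan_line, and WORD_BYTES, the 32-bit hint width, and the restriction to offsets at or after the loaded byte are all assumptions made for illustration; they are not taken from the slides and this is not the authors' hardware design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BYTES 4u /* the ECDP example slide assumes 4-byte address values */

/* CDP virtual address predictor (sketch): a word is treated as a likely
 * pointer when its top compare_bits bits equal those of the address of the
 * cache line it was found in. */
static int looks_like_pointer(uint32_t value, uint32_t line_addr,
                              unsigned compare_bits)
{
    assert(compare_bits >= 1 && compare_bits <= 32);
    unsigned shift = 32u - compare_bits;
    return (value >> shift) == (line_addr >> shift);
}

/* Scan one fetched line and collect prefetch candidates (hypothetical helper).
 *   words, n_words : contents of the line, as 4-byte words
 *   line_addr      : address of the line
 *   load_off       : byte offset, within the line, accessed by the load
 *   hints          : hint bit vector; bit i covers byte offset i * 4 from
 *                    load_off (assumption: only non-negative offsets)
 * Returns the number of candidate addresses written to out. */
static size_t ecdp_scan_line(const uint32_t *words, size_t n_words,
                             uint32_t line_addr, size_t load_off,
                             uint32_t hints, unsigned compare_bits,
                             uint32_t *out, size_t out_cap)
{
    assert(load_off % WORD_BYTES == 0);
    size_t n = 0;
    for (size_t i = 0; i < n_words; i++) {
        size_t byte_off = i * WORD_BYTES;
        if (byte_off < load_off)
            continue; /* sketch ignores words before the loaded byte */
        size_t bit = (byte_off - load_off) / WORD_BYTES;
        if (bit >= 32 || !((hints >> bit) & 1u))
            continue; /* ECDP: pointer group not marked beneficial */
        if (!looks_like_pointer(words[i], line_addr, compare_bits))
            continue; /* CDP: value does not look like a virtual address */
        if (n < out_cap)
            out[n++] = words[i];
    }
    return n;
}
```

As a usage illustration with made-up values: for a 16-word line at address 0x80022200 holding pointer-like values at byte offsets 8, 12, and 44, a hint vector with bits 2, 6, and 11 set (offsets 8, 24, and 44, as in the ECDP example) yields candidates only for offsets 8 and 44. With every hint bit set, the scan behaves like unfiltered CDP and also returns the value at offset 12.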