Efficiently Prefetching Complex Address Patterns Manjunath Shevgoor, Sahil Koladiya, Rajeev Balasubramonian University of Utah Chris Wilkerson, Zeshan Chishti, Seth Pugsley *Intel Labs Variable Length Delta Prefetcher 1 Prefetchers Confirmation Based Prefetchers • Issue predictions after a few deltas • High Accuracy • Short Streams Lose out Immediate Prefetchers • • • Aggressive Low Accuracy Waste DRAM bandwidth and cache capacity Accurate Fast Variable Length Delta Prefetcher 2 Spatial Correlation • Learn Access (Delta) Patterns • Apply patterns when similar conditions re-occur. • Eg: PC, physical address, delta patterns Delta Patterns • Regular Delta Patterns. Eg: ( +1, +1, +1)…, (+2, +2, +2, +2)… • Irregular Delta Patterns. Eg: ( +1, +2, +3 )… Variable Length Delta Prefetcher 3 Long Repeatable Streams of Irregular Deltas Delta patterns for milc Page Num: 479218 Deltas: 1, 9, -8, 1, 8, 1, -8, 1, 1, 7…….. Variable Length Delta Prefetcher 4 Long Repeatable Streams of Irregular Deltas Deltas : 1, 9, -8, 1, 8, 1, -8, 1, 1, 7, -1, -5,….. Cache Line: A+1, A+10, A+2, A+3, A+11, A+12, A+4, A+5, A+6, A+13, A+12, A+7…… Stream 1 : A+1, A+2, A+ 3, A+4, A+5, A+6, A+7 Stream2: A+10, A+11, A+12, A+13 Confirmation Prefetches Stride Prefetcher Coverage: 5/11 SandBox Prefetcher Coverage: 9/11 Neither are perfectly timely! Variable Length Delta Prefetcher 5 Variable Length Delta Prefetcher Variable Length Delta Prefetcher 6 $ Access Delta Prediction Tables Offset Prediction Tables Predicted Delta/Offset Last Level $$ Core 1 Per Page Delta History Tables Core 8 $ Access Per Page Delta History Tables Delta Prediction Tables Offset Prediction Tables Predicted Delta/Offset Structure of VLDP Variable Length Delta Prefetcher 7 Delta History Table Tracks delta within a page for (i=0;i<BIGNUM; i++) { a[i]=b[i]+c[i]; } Delta = Last Address- Current Address a, b, c can each belong to different pages So Deltas between pages is meaningless Variable Length Delta Prefetcher 8 Delta History Table Page Num. Last Add. Last 4 Deltas Last Predictor Num. Times Used Variable Length Delta Prefetcher Last Four Prefetched Offsets 9 Delta Prediction Tables Highest Priority (t=3) Lowest Priority (t=1) Delta(1) Pred. Accuracy 8b 8b 2b Deltas (3) 8b 8b 8b Pred. Accuracy 8b 2b 64 Rows per Table … Match? Match? MUX Predicted Delta Variable Length Delta Prefetcher 10 Offset Prediction Table First Page Offset Pred. Offset Accuracy 7b 7b 2b OPT is used only to predict the second access to a page Variable Length Delta Prefetcher 11 Need for Multiple Tables Repeating Delta Pattern- (1, 2, 3, 5, 2, 4)… 50% Accuracy Table 1 Delta Pred. 1 2 2 3 3 5 5 2 Table 2 Delta Pred. 1,2 3 2,3 5 3,5 2 5,2 4 Search for Delta pattern match starts from right most table Variable Length Delta Prefetcher 12 Looking farther than one Delta ahead Repeating Delta Pattern- (1, 2, 3), (1, 2, 3)……. Current Delta Delta 1 2 Pred. 2 3 Delta 1,2 2,3 Pred. 3 1 3 - 1 - 3,1 -,- 2 - Variable Length Delta Prefetcher Degree 1 Prediction 13 Looking farther than one Delta ahead Repeating Delta Pattern- 1, 2, 3, 1, 2, 3……. Current Delta Deg 1 Prediction Delta 1 Pred. 2 Delta 1,2 Pred. 3 2 3 - 3 1 - 2,3 3,1 -,- 1 2 - Degree 2 Prediction Degree 1 Prediction Use Recursive lookup to look farther than one Delta Variable Length Delta Prefetcher 14 Case Study: Streaming Workloads Repeating Delta Pattern- 1, 1, 1, 1, 1… Table 1 Delta Pred. 1 1 - Table 2 Delta Pred. -,-,-,-,- Patterns learned from one page is applied to another Variable Length Delta Prefetcher 15 Updating the Delta History Tables Evict Not Recently Used If Page not present, replace Page Num. Last Add. Last 4 Deltas Last Predictor Num. Used Last 4 Prefetches Page Num. Last Add. Last 4 Deltas Last Predictor Num. Used Last 4 Prefetches LLC Access If Page present, add Delta Variable Length Delta Prefetcher 16 Updating the Prediction Tables Last Predictor Last 3 Deltas Last Add. Page Num. B, C, D Can the current state predict Latest Delta? E Latest Delta Delta D Pred. F? Delta C,D Pred. E? Table 1 - Table 2 - - - - - Delta Pred. B,C,D E? Table 3 - Variable Length Delta Prefetcher If Prediction is Correct Increment Accuracy If Prediction of Wrong Decrement Accuracy If Accuracy==0 Update + Promote Prediction If Prediction is Missing Seed T1 with prediction 17 Populating the Prediction Tables Delta 1 Pattern Missing Pred. A Table 1 NRU Delta 1,1 Table 1 Wrong Pred. B -,Table 2 -,-,NRU Table 2 Wrong Delta Pred. 1,1,1 C Table 3 NRU If mis-predict, a longer Delta history might be needed Variable Length Delta Prefetcher 18 Evaluation Methodology • Simics + USIMM • • • • 8 RISC cores, UltraSPARC III ISA 3.2 GHz, 4-wide OoO, 128-entry RoB 32 KB I&D L1 caches, 4 cycles 8 MB shared (1MB per core) L2 cache, 10 cycles • DRAM Specifications • 2Channels, 2 Ranks per Channel, 8 Banks per Rank • 800MHz DDR3 DRAM • SPEC 2006, NPB, and Cloudsuite • Mix1- milc, astar, lbm, libq; Mix2- xalancbmk, lbm, zeusmp, milc; Variable Length Delta Prefetcher 19 VLDP Configuration • • • • • Per-Core VLDP 1 Offset Prediction Table, 64 entry 3 Delta Prediction Tables, 64 entries each 16 entry Delta History Table Only Delta Prediction Tables 2,3 contribute to multi degree prefetch Offset Prediction Table 128 B Delta History Table 222 B Delta Prediction Table 648 B Total 998 B/Core Variable Length Delta Prefetcher 20 Speedup Performance Improvement (Vs No PC) 2,0 1,8 1,6 1,4 1,2 1,0 0,8 FDP SBP AMPM VLDP VLDP is 6% better than AMPM 9% better than SBP 17% better than FDP Variable Length Delta Prefetcher 21 Speedup Performance Improvement (Vs PC) 2,0 1,8 1,6 1,4 1,2 1,0 0,8 SMS GHB_PC_DC VLDP VLDP is 7.1% better than GHB 7.6% better than SMS Variable Length Delta Prefetcher 22 Coverage Coverage 120% 100% 80% 60% 40% 20% 0% FDP NPB SMS CloudSuite FDP 16% SMS 55% SBP 40% SBP GHB_PC_DC AMPM Spec2006 Spec2006-Mix VLDP GM GHB 33% AMPM 49% VLDP 61% Variable Length Delta Prefetcher 23 Speedup Sensitivity to table size 1,03 1,02 1,01 1,00 0,99 0,98 2% increase in performance when DPT size is increased Variable Length Delta Prefetcher 24 Sensitivity number of Delta Prediction Tables 1,5 Speedup DRAM Accesses 1,4 1,3 1,2 1,1 1 1DPT_NoOPT 1DPT+OPT 2DPT+OPT 3DPT+OPT 4DPT+OPT 3DPT improves efficiency despite a modest 1% performance improvement by reducing DRAM requests by 3% Variable Length Delta Prefetcher 25 Conclusions • OPT Issues predictions without confirmation • DPT recognizes Irregular Delta Patterns • Long delta patterns provide high accuracy • Less than 1KB per core overhead • 6% better performance Variable Length Delta Prefetcher 26 Thank You Variable Length Delta Prefetcher 27