Instruction Based Memory Distance Analysis and its Application to Optimization
Changpeng Fang, Steve Carr, Soner Önder, Zhenlin Wang

Motivation
- Widening gap between processor and memory speed: the "memory wall"
- Static compiler analysis has limited capability
  - handles regular array references only
  - falls short on index arrays and integer code
- Reuse distance is predictable across program inputs
  - the number of distinct memory locations accessed between two references to the same memory location
  - applicable to more than just regular scientific code
  - gives locality as a function of data size
  - predictable on a whole-program and per-instruction basis for scientific codes

Motivation
- Memory distance: a dynamic, quantifiable distance, in terms of memory references, between two accesses to the same memory location
  - reuse distance, access distance, value distance
- Is memory distance predictable across both integer and floating-point codes? If so, we can:
  - predict miss rates
  - predict critical instructions
  - identify instructions for load speculation

Related Work
- Reuse distance: Mattson et al. '70; Sugumar and Abraham '94; Beyls and D'Hollander '02; Ding and Zhong '03; Zhong, Dropsho and Ding '03; Shen, Zhong and Ding '04; Fang, Carr, Önder and Wang '04; Marin and Mellor-Crummey '04
- Load speculation: Moshovos and Sohi '98; Chrysos and Emer '98; Önder and Gupta '02

Background
- Memory distance
  - can use any granularity (cache line, address, etc.)
  - either forward or backward
  - represented as a pattern: consecutive distance ranges are divided into intervals (powers of 2 up to 1K, then 1K-wide intervals)
- Data size
  - the largest reuse distance for an input set
  - characterize reuse distance as a function of the data size
- Given two sets of patterns from two runs, can we predict a third set of patterns given its data size?

Background
Let d_i^1 be the distance of the ith bin in the first pattern and d_i^2 be that of the second pattern.
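The reuse-distance measurement and logarithmic binning described above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function names and the naive distinct-address count are my own.

```python
from math import floor, log2

def reuse_distances(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (the backward reuse distance).
    First-time accesses yield None.  Naive O(n^2) version for clarity."""
    last_pos = {}  # address -> index of its most recent access
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # distinct addresses strictly between the two accesses to addr
            dists.append(len(set(trace[last_pos[addr] + 1:i])))
        else:
            dists.append(None)
        last_pos[addr] = i
    return dists

def bin_index(d):
    """Map a reuse distance to a histogram bin: power-of-2 intervals up
    to 1K, then fixed 1K-wide intervals, as on the slide above."""
    if d < 2:
        return d                # bins 0 and 1
    if d < 1024:
        return floor(log2(d)) + 1
    return 10 + d // 1024       # 1K-wide bins beyond 1K
```

For the trace a b c a b, the second access to `a` has reuse distance 2 (addresses b and c intervene), as does the second access to `b`.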
Given the data sizes s1 and s2, we can fit the memory distances using

    d_i^1 = c_i + e_i * f_i(s1)
    d_i^2 = c_i + e_i * f_i(s2)

Given c_i, e_i and f_i, we can predict the memory distance of another input set from its data size.

Instruction Based Memory Distance Analysis
- How can we represent the memory distance of an instruction?
- For each active interval, we record 4 words of data: min, max, mean, frequency
- Some locality patterns cross interval boundaries
  - merge adjacent intervals i and i+1 if min_{i+1} - max_i <= max_i - min_i
  - the merging process stops when a minimum frequency is found
  - needed to make reuse distance predictable
- The set of merged intervals makes up an instruction's memory distance patterns

Merging Example [figure]

What do we do with patterns?
- Verify that we can predict patterns given two training runs (coverage, accuracy)
- Predict miss rates for instructions
- Predict loads that may be speculated

Prediction Coverage
- Prediction coverage indicates the percentage of instructions whose memory distance can be predicted:
  - the instruction appears in both training runs
  - its access pattern appears in both runs and the memory distance does not decrease as the data size increases (spatial locality); such a pattern is called a regular pattern
  - the number of intervals is the same in both runs
- For each instruction, we predict its ith pattern by curve fitting the ith pattern of both training runs and applying the fitting function to construct a new min, max and mean for the third run
- Simple, fast prediction

Prediction Accuracy
- An instruction's memory distance is correctly predicted if all of its patterns are predicted correctly:
  - the predicted and observed patterns fall in the same interval, or
  - given two patterns A and B such that B.min <= A.max <= B.max,
      A.max - max(A.min, B.min) >= 0.9 * max(B.max - B.min, A.max - A.min)

Experimental Methodology
- Use 11 CFP2000 and 11 CINT2000 benchmarks (the others do not compile correctly)
- Use ATOM to collect reuse distance statistics
- Use the test and train data sets for the training runs
- Evaluation based on dynamic weighting
- Report reuse distance prediction accuracy
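The two-run fit and third-run prediction from the earlier Background slides can be sketched as follows. This is an illustrative helper under the assumption that the fitting function f (e.g. linear or square-root) has already been chosen per bin; the names are my own.

```python
def fit_pattern(d1, d2, s1, s2, f):
    """Fit d = c + e * f(s) through the two training observations
    (d1 at data size s1, d2 at data size s2); returns (c, e)."""
    e = (d2 - d1) / (f(s2) - f(s1))
    c = d1 - e * f(s1)
    return c, e

def predict(c, e, s3, f):
    """Predict the memory distance for a third input of data size s3."""
    return c + e * f(s3)
```

For example, with a linear f and distances 100 and 200 at data sizes 100 and 200, the fit gives c = 0, e = 1, and a third run of size 400 is predicted to have distance 400.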
  - value- and access-distance results are very similar

Reuse Distance Prediction

    Suite      %constant  %linear  Coverage %  Accuracy %
    CFP2000    85.1       7.7      93.0        97.6
    CINT2000   81.2       5.1      91.6        93.8

Coverage Issues
Reasons for no coverage:
1. the instruction does not appear in at least one training run
2. the reuse distance of test is larger than that of train
3. the number of patterns does not remain constant across the training runs

    Suite      Reason 1  Reason 2  Reason 3
    CFP2000    4.2%      0.3%      2.5%
    CINT2000   2.2%      4.4%      1.8%

Prediction Details
- Other patterns: 183.equake has 13.6% square-root patterns; 200.sixtrack and 186.crafty are all constant (no data size change)
- Low coverage:
  - 189.lucas: 31% of static memory operations do not appear in the training runs
  - 164.gzip: the test reuse distance is greater than the train reuse distance (cache-line alignment)

Number of Patterns

    Suite      1      2      3     4     5
    CFP2000    81.8%  10.5%  4.8%  1.4%  1.5%
    CINT2000   72.3%  10.9%  7.6%  4.6%  5.3%

Miss Rate Prediction
- Predict a miss for a reference if its backward reuse distance is greater than the cache size (neglects conflict misses)
- Miss-rate prediction accuracy:

    accuracy = 1 - |actual - predicted| / max(actual, predicted)

Miss Rate Prediction Methodology
Three miss-rate prediction schemes:
- TCS (test cache simulation): use the actual miss rates from running the program on the test data as the reference-data miss rates
- RRD (reference reuse distance): use the actual reuse distance of the reference data set to predict its miss rate; an upper bound on using reuse distance
- PRD (predicted reuse distance): use the predicted reuse distance for the reference data set to predict the miss rate

Cache Configurations
- config 1: L1 32K, fully associative; L2 1M, fully associative
- configs 2-4: L1 32K, 2-way; L2
1M, 8-way, 4-way and 2-way, respectively

L1 Miss Rate Prediction Accuracy

    Suite      PRD   RRD   TCS
    CFP2000    97.5  98.4  95.1
    CINT2000   94.4  96.7  93.9

L2 Miss Rate Prediction Accuracy

               2-way             Fully Associative
    Suite      PRD  RRD  TCS     PRD  RRD    TCS
    CFP2000    91%  93%  87%     97%  99.9%  91%
    CINT2000   91%  95%  87%     94%  99.9%  89%

Critical Instructions
- Can we determine which instructions are critical in terms of cache performance?
- An instruction is critical if it is among the top miss-rate instructions whose cumulative misses account for 95% of the L2 cache misses in a program
- Use the execution frequency of one training run to determine the relative number of misses for each instruction
- Compare the actual critical instructions with the predicted ones
- Use cache configuration 2

Critical Instruction Prediction

    Suite      PRD   RRD   TCS   %pred  %act
    CFP2000    92%   98%   51%   1.66%  1.67%
    CINT2000   89%   98%   53%   0.94%  0.97%

Critical Instruction Patterns

    Suite      1     2     3     4     5
    CFP2000    22.1  38.4  20.0  12.8  6.7
    CINT2000   18.7  14.5  25.5  22.5  18.0

Miss Rate Discussion
- PRD performs better than TCS when data size is a factor
- TCS performs better when the data size doesn't change much and there are conflict misses
- PRD is much better than TCS at identifying the critical instructions; these instructions should be the targets of optimization

Memory Disambiguation
- Load speculation: can a load safely be issued prior to a preceding store?
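The reuse-distance miss criterion and accuracy metric from the miss-rate slides can be sketched as follows. These are illustrative helpers under the slides' fully-associative assumption; under LRU, a backward reuse distance of at least the cache capacity implies the block was evicted.

```python
def predict_miss(backward_reuse_distance, cache_lines):
    """Predict a cache miss when the backward reuse distance (in
    distinct cache lines) reaches the number of lines the cache holds;
    conflict misses are deliberately ignored, as on the slides."""
    return backward_reuse_distance >= cache_lines

def miss_rate_accuracy(actual, predicted):
    """accuracy = 1 - |actual - predicted| / max(actual, predicted)."""
    if actual == predicted:
        return 1.0
    return 1.0 - abs(actual - predicted) / max(actual, predicted)
```

For example, predicting a 8% miss rate when the actual rate is 10% scores an accuracy of 0.8 under this metric.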
- Use a memory distance to predict the likelihood that a store to the same address has not finished
- Access distance
  - the number of memory operations between a store to and a load from the same address
  - correlated to instruction distance and window size
- Use only two runs; if the access distance is not constant, use the access distance of the larger of the two data sets as a lower bound on the access distance

When to Speculate
- Definitely "no": access distance less than the threshold
- Definitely "yes": access distance greater than the threshold
- Otherwise, compute the predicted mis-speculation frequency (PMSF):
  - when the threshold lies between intervals (does not intersect any), PMSF is the total of the frequencies that lie below the threshold
  - otherwise, the interval the threshold cuts additionally contributes (threshold - min) / (max - min) * frequency
- Speculate if PMSF < 5%

Value-based Prediction
- A memory dependence exists only if both the addresses and the values match:

    store a1, v1
    store a2, v2
    store a3, v3
    load  a4, v4

- The load can move ahead if a1 = a2 = a3 = a4, v2 = v3 and v1 != v2
- The access distance of a load to the first store in a sequence of stores storing the same value is called the value distance

Experimental Design
- SPEC CPU2000 programs
  - CFP2000: 171.swim, 172.mgrid, 173.applu, 177.mesa, 179.art, 183.equake, 188.ammp, 301.apsi
  - CINT2000: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 300.twolf
- Compiled with gcc 2.7.2 -O3
- Comparison: access distance and value distance vs. store set (16KB table, also with values) vs. perfect disambiguation

Micro-architecture

    fetch width: 8           issue width: 8
    retire width: 16         window size: 128
    load/store queue: 128    memory ports: 2
    functional units: 8      fetch: multiblock gshare
    data cache: perfect

    Operation latency: load 2; int multiply 4; int division 8; other int 1;
    float addition 3; float multiply 4; float division 8; other float 2

IPC and Mis-speculation

    IPC
    Suite      Access Distance  Store Set (16KB Table)  Perfect
    CFP2000    3.21             3.37                    3.71
    CINT2000   2.90             3.22                    3.35

               Mis-speculation Rate  % Speculated Loads
    Suite      Access  Store Set     Access  Store Set
    CFP2000    2.36    0.07          57.2    62.0
    CINT2000   2.33    0.08          26.9    34.7

Value-based Disambiguation

    IPC
    Suite      Value Distance  Store Set (16KB, with Values)
    CFP2000    3.34            3.55
    CINT2000   3.00            3.23

    Suite      Mis-speculation Rate  % Speculated Loads
    CFP2000    1.22                  59.3
    CINT2000   1.55                  27.6

Cache Model

    IPC
    Suite      Access  Store Set 16K
    CFP2000    1.55    1.61
    CINT2000   1.53    1.60

    Suite      Value   Store Set 16K
    CFP2000    1.59    1.63
    CINT2000   1.55    1.65

Summary
- Over 90% of memory operations can have their reuse distance predicted, with 97% and 93% accuracy for floating-point and integer programs, respectively
- We can accurately predict miss rates for floating-point and integer codes
- We can identify 92% of the instructions that cause 95% of the L2 misses
- Access- and value-distance-based memory disambiguation is competitive with the best hardware techniques, without requiring a hardware table

Future Work
- Develop a prefetching mechanism that uses the identified critical loads
- Develop an MLP system that uses critical loads and access distance
- Path-sensitive memory distance analysis
- Apply memory distance to working-set-based cache optimizations
- Apply access distance to EPIC-style architectures for memory disambiguation
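The PMSF-based speculation test from the When to Speculate slide can be sketched as follows. This is an illustrative implementation: representing a memory distance pattern as (min, max, frequency) triples and the function name are my own choices, not the talk's code.

```python
def should_speculate(threshold, pattern, pmsf_cutoff=0.05):
    """Decide whether a load may be speculated past earlier stores.
    `pattern` is a list of (min, max, frequency) access-distance
    intervals with frequencies summing to 1.  The predicted
    mis-speculation frequency (PMSF) totals the frequency mass below
    the threshold, prorating any interval the threshold cuts through
    as (threshold - min) / (max - min) * frequency."""
    pmsf = 0.0
    for lo, hi, freq in pattern:
        if hi <= threshold:
            pmsf += freq                                  # entirely below
        elif lo < threshold:
            pmsf += (threshold - lo) / (hi - lo) * freq   # partial overlap
    return pmsf < pmsf_cutoff
```

With a pattern of [(0, 10, 0.02), (100, 200, 0.98)], a threshold of 50 gives PMSF = 0.02 (speculate), while a threshold of 150 gives PMSF = 0.51 (do not speculate).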