2D-Profiling Detecting Input-Dependent Branches with a Single Input Data Set Hyesoon Kim M. Aater Suleman Onur Mutlu Yale N. Patt HPS Research Group The University of Texas at Austin Motivation Profile-guided code optimization has become essential for achieving good performance. Run-time behavior profile-time behavior: Good! Run-time behavior ≠ profile-time behavior: Bad! 2 Motivation Profiling with one input set is not enough! Because a program can show different behavior with different input data sets Example: Performance of predicated execution is highly dependent on the input data set Because some branches behave differently with different input sets 3 Input-dependent Branches Definition A branch is input-dependent if its misprediction rate differs by more than some Δ over different input data sets. Inp. 1 Inp. 2 Inp.1 – Inp. 2 Misprediction rate of Br. X 30% 29% 1% Misprediction rate of Br. Y 30% 5% 25% Input-dependent branch Input-dependent br. hard-to-predict br. 4 An Example Input-dependent Branch Example from Gap (SPEC2K): Type checking branch TypHandle Sum (A,B) If ((type(A) == INT) && (type(B) == INT)) { Result = A + B; Return Result; //input-dependent br } Return SUM(A, B); Train input set: A&B are integers 90% of the time misprediction rate: 10% Reference input set: A&B are integers 42% of the time misprediction rate: 30% (30%-10%)>Δ 5 Predicated Execution (normal branch code) (predicated code) A if (cond) { b = 0; } else { b = 1; } T N C B A B C D A B C D p1 = (cond) branch p1, TARGET A mov b, 1 jmp JOIN B TARGET: mov b, 0 C p1 = (cond) (!p1) mov b, 1 (p1) mov b, 0 Eliminate hard-to-predict branches but fetch blocks B and C all the time 6 Predicated Code Performance vs. Branch Misprediction Rate Predicated code performs better Execution time (cycles) 8 predicated code 7 normal branch code 6 5 4 run-time (input B) profile-time (input A) 3 2 0 1 2 3 4 5 6 X7 8 9 10 11 12 13 14 15 Branch misprediction rate (%) Normal branch code performs better Converting a branch to predicated code could hurt performance if run-time misprediction rate is lower than profile-time misprediction rate 7 Exec. time normalized to non-predicated code Predicated Code Performance vs. Input Set 1.2 run-time: input-A run-time: input-B run-time: input-C -16% -4% 9% 1% non-predicated 1 0.8 0.6 Predicated execution loses performance because of input-dependent branches 0.4 0.2 0 gzip vpr gcc mcf crafty parser perlbmk gap vortex bzip2 twolf Measured on an Itanium-II machine 8 If We Know a Branch is Input-Dependent May not convert it to predicated code. May convert it to a wish branch. [Kim et al. Micro’05] May not perform other compiler optimizations or may perform them less aggressively. Hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99] 9 Our Goal Identify input-dependent branches by using a single input set for profiling 10 Talk Outline Motivation 2D-profiling Mechanism Experimental Results Conclusion 11 Key Insight of 2D-profiling Branch Prediction Accuracy Branch Phase behavior in prediction accuracy is a good indicator of input dependence phase 2 1 0.8 phase 1 0.6 0.4 input-dependent phase 3 input-independent 0.2 0 0 500 1000 1500 Time (in terms of number of executed instructions x 100M) 12 brA pr. Acc Traditional Profiling MEAN pr.Acc(brA) brB pr. Acc time MEAN pr.Acc(brB) time MEAN pr.Acc(brA) MEAN pr.Acc(brB) behavior of brA behavior of brB 13 brA pr. Acc 2D-profiling MEAN pr.Acc(brA) brB pr. Acc time STD pr.Acc(brA) MEAN pr.Acc(brB) STD pr.Acc(brB) time MEAN pr.Acc(brA) MEAN pr.Acc(brB) STD pr.Acc(brA) ≠ STD pr.Acc(brB) behavior of brA ≠ behavior of brB A: input-dependent br, B: input-independent br 14 2D-profiling Mechanism The profiler collects branch prediction accuracy information for every static branch over time slice size = M instructions Slice 1 Slice 2 … Slice N time mean Pr.Acc(brA,s1) mean Pr.Acc(brA,s2) ... mean Pr.Acc(brA,sN) mean Pr.Acc(brB,s1) mean Pr.Acc(brB,s2) ... ... ... mean Pr.Acc(brB,sN) ... PAM:50% brA Calculate MEAN (brA,mean brB, brA …), Standard deviation (brA, brB, …), brB …) PAM:Points Above Mean (brA, brB, mean brB PAM:0% 15 Input-dependence Tests STD&PAM-test: Identify branches that have large variations in accuracy over time (phase behavior) STD-test (STD > threshold): Identify branches that have large variations in the prediction accuracy over time PAM-test (PAM > threshold): Filter out branches that pass STDtest due to a few outlier samples MEAN&PAM-test: Identify branches that have low prediction accuracy and some time-variation in accuracy MEAN-test (MEAN < threshold): Identify branches that have low prediction accuracy PAM-test (PAM > threshold): Identify branches that have some variation in the prediction accuracy over time A branch is classified as input-dependent if it passes either STD&PAM-test or MEAN&PAM-test 16 Talk Outline Motivation 2D-profiling Mechanism Experimental Results Conclusion 17 Experimental Methodology Profiler: PIN-binary instrumentation tool Benchmarks: SPEC 2K INT Input sets Profiler: Train input set Input-dependent Branches: Reference input set and train/other extra input sets Input-dependent branch: misprediction rate of the branch changes more than Δ = 5% when input data set changes Different Δ are examined in our TechReport [reference 11]. Branch predictors Profiler: 4KB Gshare, Machine: 4KB Gshare Profiler: 4KB Gshare, Machine: 16KB Perceptron (in paper) 18 Evaluation Metrics Coverage and Accuracy for input-dependent branches Correctly Predicted Input-dependent br. Predicted Input-dependent br. (2D-profiler) A B AB Correctly Predicted COV A Actual Input - dependent Actual Input-dependent br. ACC AB Correctly Predicted B Predicted Input - dependent 19 % of branches that are input-dependent Input-dependent Branches 100% 2 input sets 3 input sets 4 input sets 5 input sets 6 input sets 7 input sets 8 input sets 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% bzip2 gzip twolf gap crafty gcc 20 2D-profiling Results 1 88% 93% 0.9 0.8 Fraction 0.7 76% 2 input sets 63% 3 input sets 0.6 4 input sets 0.5 5 input sets 0.4 6 input sets 0.3 7 input sets 0.2 8 input sets 0.1 0 coverage accuracy input-dependent branches coverage accuracy input-independent branches Phase behavior and input-dependence are strongly correlated! 21 The Cost of 2D-profiling? Normalized execution time 1.8 54%1% 1.6 1.4 1.2 1 0.8 0.6 Edge 0.4 Gshare 0.2 2D+Gshare 0 bzip2 gzip twolf gap crafty gcc AVG 2D-profiling adds only 1% overhead over modeling the branch predictor in software Using a H/W branch predictor [Conte’96] 22 Conclusion 2D-profiling is a new profiling technique to find inputdependent characteristics by using a single input data set for profiling 2D-profiling uses time-varying information instead of just average data Phase behavior in prediction accuracy in a profile run input-dependent 2D-profiling accurately identifies input-dependent branches with very little overhead (1% more than modeling the branch predictor in the profiler) Applications of 2D-profiling are an open research topic Better predicated code/wish branch generation algorithms Detecting other input-dependent program characteristics 23 Questions?