Hardware-based Devirtualization (VPC Prediction)
Hyesoon Kim, Jose A. Joao, Onur Mutlu++, Chang Joo Lee, Yale N. Patt, Robert Cohn*

Outline
  - Background and Motivation
  - VPC (Virtual Program Counter) Prediction
  - Results
  - Conclusion

Direct vs. Indirect Branch
  [Figure: a conditional (direct) branch "br.cond TARGET" at address A goes to TARG when taken (T) and to A+1 when not taken (N); an indirect branch "R1 = MEM[R2]; branch R1" can go to any of several targets.]
  - Indirect branches are costly for processor performance.
  - They are much more difficult to predict than conditional (direct) branches because they have multiple target addresses.
  - An indirect branch predictor requires a large structure.

Source Code Examples
  - Switch structures
  - Virtual function calls

  Source code:
    Shape *s = …;
    a = s->area();          // virtual function call

  Static assembly code:
    R1 = MEM[R2]            // function address lookup
    call R1                 // a register-indirect call

Indirect Branch Mispredictions
  [Figure: mispredictions per kilo-instruction (MPKI, axis 0 to 16), split into direct and indirect branches, per application and on average (AVG).]
  Data from Intel Core Duo processor.

Branch Predictor
  [Figure: baseline branch predictor. Labels: direction predictor, GHR (..1001010), PC, address 0x0800, hash, indirect branch predictor, TARG1, TARG2, taken (T), direct branch?, indirect branch?]
  [Figure labels, continued: Branch Target Buffer (BTB), PC+1, predicted target.]

VPC Prediction: Basic Idea
  - Key idea: treat an indirect branch as multiple “virtual” conditional branches.
      - Only for prediction purposes
      - Use the conditional branch predictor

VPC Branch Predictor
  [Figure: VPC branch predictor. Labels: direction predictor, GHR (..1001010), PC, address 0x0800, hash, Branch Target Buffer holding TARG1 for VPC1 and TARG2 for VPC2, predicted target.]

VPC Prediction: Basic Idea (continued)
  - Benefits:
      - No separate complex structure
      - Works with any conditional branch prediction algorithm
      - Improving the conditional branch prediction algorithm also improves indirect branch prediction accuracy

Inspiration: Static Devirtualization
  Source code:
    Shape *s = …;
    a = s->area();                      // an indirect call

  Optimized source code:
    Shape *s = …;
    if (s->type == Rectangle)           // a conditional branch at PC: X
        a = Rectangle::area();
    else if (s->type == Circle)         // a conditional branch at PC: Y
        a = Circle::area();
    else
        a = s->area();                  // an indirect call at PC: Z

  Smalltalk (’84), Calder and Grunwald (’94), Garrett et al. (’94), Ishizaki et al. (’00)

VPC Prediction
  Source code:
    Shape *s = …;
    a = s->area();                      // an indirect call

  Static assembly code:
    R1 = MEM[R2]
    call R1                             // PC: L

  Dynamic virtual branches (for prediction purposes):
    conditional jump TARGET1            // virtual PC = L
    conditional jump TARGET2            // virtual PC = L XOR HASHVAL[1]
    conditional jump TARGET3            // virtual PC = L XOR HASHVAL[2]
    conditional jump TARGET4            // virtual PC = L XOR HASHVAL[3]

Virtual PC Address Generation
  - Use the original PC address and the iteration counter value.
  [Figure: the iteration counter value selects an entry of a hash value table (0xabcd, 0x018a, 0x7a9c, 0x…); that entry is combined with the PC to form the virtual PC.]

VPC Prediction Process-I
  Real instruction:
    call R1                             // PC: L
  Virtual instructions:
    cond. jump TARG1                    // VPC: L
    cond. jump TARG2                    // VPC: VL2
    cond. jump TARG3                    // VPC: VL3
    cond. jump TARG4                    // VPC: VL4
  [Figure, iteration 1: the direction predictor is accessed with PC = L and GHR = 1111 and predicts not taken; the BTB entry holds TARG1. Move to the next iteration.]

VPC Prediction Process-II
  Same real and virtual instructions as above.
  [Figure, iteration 2: the direction predictor is accessed with VPC = VL2 and VGHR = 1110 and predicts not taken; the BTB entry holds TARG2. Move to the next iteration.]

VPC Prediction Process-III
  Same real and virtual instructions as above.
  [Figure, iteration 3: the direction predictor is accessed with VPC = VL3 and VGHR = 1100 and predicts taken; the BTB entry holds TARG3. Predicted target = TARG3.]

VPC Prediction Algorithm
  - Access the conditional branch predictor and the BTB with VPCA and VGHR.
  - Compute VPCA and VGHR for the next iteration:
      VPCA = PC XOR HASHVAL[iter]
      VGHR = VGHR << 1
  - Predicted not taken: move to the next iteration.
  - Predicted taken: use the target in the BTB as the target of the indirect branch.
  - Give up and stall if the iteration count exceeds MAX_ITER or the BTB misses.

VPC Training Algorithm
  - An iterative process performed when an indirect branch is retired (not on the critical path).
  - Update the conditional branch predictor:
      - Virtual branch has the correct target: taken
      - Virtual branch has a wrong target: not-taken
  - Update the replacement policy bits of the correct target in the BTB.
  - If needed, insert the correct target into the BTB:
      - Train the conditional branch predictor as taken
      - Replace the least frequently used (LFU) target

Hardware Cost and Complexity
  [Figure: block diagram of the VPC predictor hardware. The GHR or VGHR and the PC or VPCA (chosen by whether the branch is direct or indirect) access the branch direction predictor (BP), which gives the taken/not-taken prediction, and the BTB, which gives the target address; a hash function driven by an iteration counter produces the VPCA.]
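The prediction loop on the "VPC Prediction Algorithm" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware design: the BTB is modeled as a dict keyed by virtual PC, the direction predictor as a callable `predicts_taken(vpca, vghr)`, the history is 4 bits wide to match the 1111 / 1110 / 1100 example, and `HASHVAL` reuses the three values visible in the hash value table figure with an assumed `HASHVAL[0] = 0` so that the first iteration uses the real PC.

```python
MAX_ITER = 4      # assumed iteration limit for this sketch
GHR_BITS = 4      # assumed history width, matching the 4-bit slide example
# Values taken from the hash value table figure; index 0 is an assumption
# so that the first virtual branch uses the original PC.
HASHVAL = [0x0000, 0xABCD, 0x018A, 0x7A9C]


def vpc_predict(pc, ghr, btb, predicts_taken):
    """Return (predicted_target, iterations_used).

    The target is None when the predictor gives up: either the BTB has no
    entry for the current virtual PC, or MAX_ITER virtual branches were all
    predicted not taken.
    """
    mask = (1 << GHR_BITS) - 1
    vpca, vghr = pc, ghr & mask
    for it in range(1, MAX_ITER + 1):
        target = btb.get(vpca)              # BTB access with the virtual PC
        if target is None:
            return None, it                 # BTB miss: stall
        if predicts_taken(vpca, vghr):      # direction predictor access
            return target, it               # use the BTB target
        if it < MAX_ITER:
            vpca = pc ^ HASHVAL[it]         # VPCA = PC XOR HASHVAL[iter]
        vghr = (vghr << 1) & mask           # VGHR = VGHR << 1
    return None, MAX_ITER                   # iteration limit reached: stall


# Example mirroring Process-I to Process-III: the third virtual branch is taken.
L = 0x0800
btb = {L: "TARG1", L ^ HASHVAL[1]: "TARG2", L ^ HASHVAL[2]: "TARG3"}
taken = {(L ^ HASHVAL[2], 0b1100)}
result = vpc_predict(L, 0b1111, btb, lambda vpca, vghr: (vpca, vghr) in taken)
print(result)  # ('TARG3', 3)
```

Passing the direction predictor in as a callable reflects the point made on the slides that any conditional branch predictor can be used; the set-based predictor here is just a stand-in.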
Simulation Methodology
  - Pin-based x86 simulator
  - Processor configuration
      - 4K-entry BTB
      - 64KB perceptron conditional branch predictor
      - Minimum 30-cycle branch misprediction penalty
      - 8-wide, 512-entry instruction window
  - Less aggressive processor (in the paper)
  - Gshare, O-GEHL conditional branch predictors
  - Indirect branch intensive benchmarks
      - 5 SPEC CPU2000, 5 SPEC CPU 2006, 2 other C++
  - IBM server benchmarks (OLTP) (in the paper)

VPC MPKI
  [Figure: indirect branch mispredictions (MPKI, axis 0 to 16) for baseline and for VPC-ITER-2 through VPC-ITER-16 in steps of 2, per benchmark: gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, AVG.]

VPC Performance
  [Figure: % IPC improvement over baseline (axis 0 to 110) for VPC-ITER-2 through VPC-ITER-16 in steps of 2, per benchmark and on average.]

Different Direction Predictors
  [Figure: IPC improvement (%, axis 0 to 35) for gshare, perceptron, and O-GEHL, annotated with conditional branch accuracy of 98%, 98.3%, and 99%.]
  Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!

VPC vs.
Static Devirtualization
  - Advantages
      - Enables other compiler optimizations (function inlining)
      - Can reduce the number of mispredictions
  - Disadvantages/Limitations
      - Not all indirect branches can be statically devirtualized
      - Requires extensive static analysis/profiling
      - Lacks adaptivity to run-time input set and phase behavior
  - VPC prediction can be used with statically devirtualized binaries
      - 10% improvement on top of static devirtualization

Conclusion
  - VPC dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware.
  - VPC prediction reduces the branch misprediction penalty without significant extra hardware storage.
      - Baseline: 26% IPC improvement
      - O-GEHL: 31% IPC improvement
  - VPC can be an enabler encouraging programmers to use object-oriented programming styles.

Thank you! Questions?

VPC vs. Cascaded IBP
  [Figure: % IPC improvement over baseline (axis -20 to 120) for cascaded predictors of 704B, 1.4KB, 2.8KB, 5.5KB, 11KB, 22KB, 44KB, 88KB, and 176KB and for VPC-ITER-12, per benchmark: gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx, AVG.]

VPC vs. Other Indirect BP
                               gcc       crafty    eon       perlbmk
  Tagged Target Cache (TTC)    12KB      1.5KB     >192KB    1.5KB
  Cascaded                     >176KB    2.8KB     >176KB    2.8KB

  TTC: Chang et al. (’96); Cascaded: Driesen and Hölzle (’98)

Iterative prediction
  - It doesn’t hurt performance significantly (see results).
  - Why? Most predictions complete within a few iterations (see results).

VPC Hit Iteration Counter
  [Figure: per-benchmark distribution (0% to 100%) of the iteration at which VPC prediction hits, in bins 1, 2, 3, 4, 5-6, 7-8, 9-10, and 11-12.]

Can the BTB be pipelined?
  - Yes.
  - The next VPC iteration can be started in the pipeline without waiting for the outcome of the previous iteration.
  - Consecutive VPC prediction iterations can simply be pipelined.
  - If an iteration turns out not to be needed, its prediction is simply discarded.

Is 4K-entry BTB too large?
  - Pentium 4 has a 4K-entry BTB
  - IBM Z series (z990) has an 8K-entry BTB
  - AMD Athlon and Hammer have 2K-entry BTBs

BTB Size Effects
  [Figure: indirect branch mispredictions (MPKI, axis 0 to 8) for base and vpc, and % IPC improvement over baseline (axis 0 to 40), for BTB sizes of 512, 1024, 2048, and 4096 entries.]

VPC Prediction Accuracy
  [Figure: per-benchmark breakdown of VPC accesses (%, 0% to 100%) into no target, wrong target, and correct, for gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, richards, ixx.]

Target Distribution
  [Figure: per-benchmark distribution (0% to 100%) of the number of targets, in bins 1, 2, 3, 4, 5, 6-10, 11-15, and 16+.]

VPC vs. Tagged Target Cache
  [Figure: % IPC improvement over baseline (axis 0 to 120) for TTC-384B, TTC-768B, TTC-1.5KB, TTC-3KB, TTC-6KB, TTC-12KB, TTC-24KB, TTC-48KB, TTC-96KB, TTC-192KB, and VPC-ITER-12, per benchmark and on average.]

VPC Prediction Delay Effects
  [Figure: % IPC improvement over baseline (axis 0 to 120) for 1, 2, 4, 6, 8, and 10 br/cycle, per benchmark and on average.]

VPC with O-GEHL BP
  [Figure: % IPC improvement over baseline (axis 0 to 120) for TTC-384B through TTC-48KB and VPC-ITER-12, per benchmark and on average.]

VPC with a Less Aggressive Processor
  [Figure: % IPC improvement over baseline (axis 0 to 70) for TTC-384B through TTC-48KB and VPC-ITER-12, per benchmark and on average.]

Server Benchmarks
  [Figure: indirect branch mispredictions (MPKI, axis 0 to 16); legend: baseline, VPC-ITER-2, VPC-ITER-4, VPC-ITER-6, VPC-ITER-8, VPC-ITER-10, VPC-ITER-12, VPC-ITER-14.]
  [Figure, continued: legend also includes VPC-ITER-16; x-axis: OLTP1, OLTP2, OLTP3, AVG.]

Server Benchmarks (VPC vs. TTC)
  [Figure: indirect branch mispredictions (MPKI, axis 0 to 18) for baseline, TTC-384B, TTC-768B, TTC-1.5KB, TTC-3KB, TTC-6KB, TTC-12KB, TTC-24KB, TTC-48KB, and VPC-ITER-10, on OLTP1, OLTP2, OLTP3, and AVG.]

VPC Prediction vs. Compiler-Based Devirtualization (With TTC)
  [Figure: % IPC improvement over baseline (axis -10 to 90) for TTC-384B through TTC-48KB and VPC-ITER-12, per benchmark: gcc, crafty, eon, perlbmk, gap, perlbench, gcc06, sjeng, namd, povray, AVG.]

Conditional Br. Prediction Effects
  [Figure: conditional branch MPKI (axis 0 to 4) for Base and VPC with the gshare, perceptron, and O-GEHL predictors.]
  VPC prediction reduces the accuracy of branch direction prediction, but not by much!

Indirect Branch Mispredictions
  [Figure: indirect branches as a percentage of all mispredicted branches (%, axis 0 to 60) per application: simics, excel, cygwin, sqlservr, winexplorer, iexplorer, emacs, firefox, vtune, pptview, nasa-worldwind, outlook, desktop-search, avida, acroread, winamp, windvd, AVG.]

VPC Prediction with Static Devirtualization
  [Figure: % IPC improvement over baseline for VPC-ITER-4, VPC-ITER-6, VPC-ITER-8, VPC-ITER-10, and VPC-ITER-12, per benchmark and on average (AVG).]
  - VPC prediction can be used with statically devirtualized binaries.
  - Not all indirect branches could be devirtualized.

VPC Training: Correct Prediction
  Retirement: real instruction call R1 // PC: L
  Known: correctly predicted, predicted iter = 3

  Iter  VPCA  VGHR     Direction BP  BTB
  1     L     GHR      Not-taken     -
  2     VL2   GHR<<1   Not-taken     -
  3     VL3   GHR<<2   Taken         Update replacement

VPC Training: Misprediction
  Retirement: real instruction call R1 // PC: L
  Known: mispredicted, correct target address

  Iter  VPCA  VGHR     BTB Access        Train Direction BP  Train BTB
  1     L     GHR      TARG != Correct   Not-taken           -
  2     VL2   GHR<<1   TARG != Correct   Not-taken           -
  3     VL3   GHR<<2   Target = Correct  Taken               Update replacement

VPC Training: Misprediction (No Target)
  Retirement: real instruction call R1 // PC: L
  Known: mispredicted, correct target address

  Iter  VPCA  VGHR     BTB Access        Train Direction BP  Train BTB
  1     L     GHR      TARG != Correct   Not-taken           -
  2     VL2   GHR<<1   TARG != Correct   Not-taken           -
  3     VL3   GHR<<2   TARG != Correct   Not-taken           -

VPC Training: Misprediction (Replacement)
  Retirement: real instruction call R1 // PC: L
  Known: mispredicted, correct target address

  Iter  VPCA  VGHR     BTB Access        Repl. counter  Train BP             Train BTB
  1     L     GHR      TARG != Correct   3              Not-taken            -
  2     VL2   GHR<<1   TARG != Correct   10             Not-taken -> Taken   Nothing -> Insert
  3     VL3   GHR<<2   TARG != Correct   8              Not-taken            -

Does VPC need an extra BTB port?
  - No.
  - A read from the BTB is only needed when a branch is mispredicted.
  - 95% of branches are correctly predicted with VPC.
  - The read is performed only when there is an available BTB port.
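The training tables above can be condensed into a small Python sketch. It is illustrative only and rests on several assumptions that are not in the slides: the BTB, the direction predictor, and the replacement counters are plain dicts; `HASHVAL` reuses the values from the hash value table figure with an assumed `HASHVAL[0] = 0`; the history is 4 bits wide; and the replacement counter is read as a simple use count, with an empty BTB slot preferred over evicting the least frequently used target. How the counters on the Replacement slide are meant to be interpreted is not clear from the slide text, so that part is a guess.

```python
MAX_ITER = 4      # assumed iteration limit for this sketch
GHR_BITS = 4      # assumed history width
# Values from the hash value table figure; index 0 is an assumption.
HASHVAL = [0x0000, 0xABCD, 0x018A, 0x7A9C]


def virtual_branches(pc, ghr):
    """Return the (VPCA, VGHR) pair of each virtual branch, in iteration order."""
    mask = (1 << GHR_BITS) - 1
    vghr = ghr & mask
    pairs = []
    for it in range(MAX_ITER):
        pairs.append((pc ^ HASHVAL[it], vghr))
        vghr = (vghr << 1) & mask
    return pairs


def vpc_train(pc, ghr, correct_target, btb, use_count, direction):
    """Train at retirement; return the iteration now holding the correct target."""
    pairs = virtual_branches(pc, ghr)
    hit = next((i for i, (vpca, _) in enumerate(pairs)
                if btb.get(vpca) == correct_target), None)
    if hit is not None:
        # Correct target found: earlier virtual branches are trained not-taken.
        not_taken = pairs[:hit]
    else:
        # No virtual branch holds the correct target: insert it, preferring an
        # empty slot, otherwise replacing the least frequently used target.
        empty = [i for i, (vpca, _) in enumerate(pairs) if vpca not in btb]
        hit = empty[0] if empty else min(
            range(MAX_ITER), key=lambda i: use_count.get(pairs[i][0], 0))
        btb[pairs[hit][0]] = correct_target
        use_count[pairs[hit][0]] = 0
        not_taken = [p for i, p in enumerate(pairs) if i != hit and p[0] in btb]
    for key in not_taken:
        direction[key] = False              # wrong target: train not-taken
    direction[pairs[hit]] = True            # correct target: train taken
    use_count[pairs[hit][0]] = use_count.get(pairs[hit][0], 0) + 1
    return hit + 1


L, GHR = 0x0800, 0b1111
btb = {L: "A", L ^ HASHVAL[1]: "B"}
use_count, direction = {}, {}
print(vpc_train(L, GHR, "B", btb, use_count, direction))  # 2
```

Calling `vpc_train` again with a target that is not yet in the BTB fills the next empty virtual-branch slot, and once all slots are full it evicts the target with the lowest use count.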