Diverge-Merge Processor (DMP) Hyesoon Kim José A. Joao Onur Mutlu* Yale N. Patt HPS Research Group University of Texas at Austin *Microsoft Research Outline Predicated Execution Diverge-Merge Processor (DMP) Implementation of DMP Experimental Evaluation Conclusion 2 Predicated Execution (normal branch code) (predicated code) A if (cond) { b = 0; } else { b = 1; } T N C B A B C D D A B C p1 = (cond) branch p1, TARGET A mov b, 1 jmp JOIN B C TARGET: mov b, 0 p1 = (cond) (!p1) mov b, 1 (p1) mov b, 0 Convert control flow dependence to data dependence 3 Benefit of Predicated Execution Predicated Execution can be high performance and energy-efficient. Predicated Execution Fetch Decode Rename Schedule RegisterRead Execute A F E A D B C C F D E C A B F E C D B A A D B C E F C A B D E F B A D C E F A E F C D B D E B C A F C D A B E B C A D A B C A B A B Branch Prediction D Fetch Decode Rename Schedule RegisterRead Execute F E Pipeline flush!! F 4 E D B A Limitations/Problems of Predication ISA: Predicate registers and predicated instructions Dynamic-Hammock Predication[Klauser’98] can solve this problem but it is only applicable to simple hammocks. Adaptivity: Static predication is not adaptive to run-time branch behavior. Branch behavior changes based on input set, phase, control-flow path. Wish Branches[Kim’05] Complex CFG: A large subset of control-flow graphs is not converted to predicated code. Function calls, loops, many instructions inside a region, and complex CFGs Hyperblock[Mahlke’92] cannot adapt to frequently-executed paths dynamically. 5 Outline Predicated Execution Diverge-Merge Processor (DMP) Implementation of DMP Experimental Evaluation Conclusion 6 Diverge-Merge Processor (DMP) DMP can dynamically predicate complex branches (in addition to simple hammocks). The compiler identifies Diverge branches Control-flow merge (CFM) points The microarchitecture decides when and what to predicate dynamically. 7 Dynamic Predication Low-confidence A T N C B A (mov R1, 1) PR10 = 1 B H A B C (mov R1, 0) PR11 = 0 C p1 = (cond) branch p1, TARGET mov R1, 1 jmp JOIN TARGET: mov R1, 0 H JOIN: add R5, R1, 1 select-µops (φ-nodes in SSA) PR12 = (cond) ? PR11 : PR10 H Klauser et al.[PACT’98]: Dynamic-hammock predication 8 Diverge-Merge Processor A C Diverge Branch B B D F A C E E G H Insert select-µops H CFM point Frequently executed path Not frequently executed path 9 Diverge-Merge Processor A C A A A A A B D F E A G H Frequently executed path diverge-branch Not frequently executed path 10 executed block CFM point Control-Flow Graphs A A A A A . . . . . . . . . . . simple hammock nested hammock DMP Dynamic Hammock SW pred Wish br. Dual-path 11 frequently-hammock loop non-merging Dual-path Execution vs. DMP Dual-path A C Low-confidence B D E F path 1 path 2 DMP path 1 path 2 C B C B D D CFM CFM E F E F D E F 12 Control-Flow Graphs A A A A A . . . . . . . . . . . simple hammock nested hammock frequently-hammock DMP Dynamichammock SW pred sometimes Wish br. sometimes Dual-path 13 loop non-merging Distribution of Mispredicted Branches 66% of mispredicted branches can be dynamically predicated in DMP. non-merging 12 loop 10 frequently nested 8 simple 6 4 2 14 88 li ks im am ea n m go ij p eg af pa ty rs er e pe on rlb m k ga vo p rte x bz ip 2 tw ol co f m p cr m cf 0 gz ip vp r gc c Mispredictions per kilo instructions (MPKI) Distribution of Mispredicted Branches 66% of mispredicted branches can be dynamically predicated in DMP. non-merging 12 loop 10 frequently nested 8 simple 6 4 2 15 88 li ks im am ea n m go ij p eg af t pa y rs er e pe on rlb m k ga vo p rte x bz ip 2 tw ol co f m p cr m cf 0 gz ip vp r gc c Mispredictions per kilo instructions (MPKI) Outline Predicated Execution Diverge-Merge Processor (DMP) Implementation of DMP Experimental Evaluation Conclusion 16 Fetch Mechanism Low Confidence A C Diverge Branch B A B Round-robin fetch D F E C E G H CFM point H predicted path 17 Dynamic Predication A B C E branch r0, C add r1 r3, #1 add r1 r2, # -1 branch pr10,C p1 = pr10 add pr21 pr13, #1 (p1) add pr31 pr12, # -1(!p1) select-µop pr41 = p1? pr21 : pr31 H add r4 r1, r3 add pr24 pr41, pr13 Arch. Phy. M R1 PR11 PR41 PR21 1 R2 PR12 R3 PR13 RAT1 Arch. Phy. M R1 PR11 PR31 1 R2 PR12 R3 PR13 RAT2 Forks RAT, RAS, and GHR 18 DMP Support ISA Support Mark diverge branches/CFM points. Compiler Support [CGO’07] The compiler identifies diverge branches and the corresponding CFM points. Hardware Support Confidence estimator Fetch mechanisms Load/store processing Instruction retirement Dynamic predication 19 Hardware Complexity Analysis DMP Dyn. Dual ham. path Multi path SW Wish pred. br. Front-End Confidence Estimator Rename Support Predicate Registers Select-Uop Gen. ST-LD Forwarding Check Flush/no Flush 20 Outline Predicated Execution Diverge-Merge Processor (DMP) Implementation of DMP Experimental Evaluation Conclusion 21 Simulation Methodology 12 SPEC 2000 INT, 5 SPEC 95 INT Different input sets for profiling and evaluation Alpha ISA execution driven simulator Baseline processor configuration 64KB perceptron predictor/O-GEHL (paper) Minimum 30-cycle branch misprediction penalty 8-wide, 512-entry instruction window 2 KB 12-bit history enhanced JRS confidence estimator Less aggressive processor (paper) Power model using Wattch 22 Different CFG types simple simple,nested simple,nested,frequently simple,nested,frequently,loop 50 40 30 20 10 23 hmean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc vpr 0 gzip IPC improvement (%) 60 Performance Improvement Performance Improvement (%) 25 DMP dynamic-hammock dual-path multipath limited software predication wish branches 20 15 10 5 0 24 Energy Consumption Reduction (%) 10 DMP dynamic-hammock dual-path multipath limited software predication wish branches 5 0 -5 25 Outline Predicated Execution Diverge-Merge Processor (DMP) Implementation of DMP Experimental Evaluation Conclusion 26 Conclusion DMP introduces the concept of frequently-hammocks and it dynamically predicates complex CFGs. DMP can overcome the three major limitations of software predication: ISA support, adaptivity, complex CFG. DMP reduces branch mispredictions energy efficiently 19% performance improvement, 9% less energy DMP divides the work between the compiler and the microarchitecture: The compiler analyzes the control-flow graphs. The microarchitecture decides when and what to predicate dynamically. 27 Thank You!! Questions? Handling Mispredictions Diverge Br. A C D F B Misprediction! E G H A A CFM point branch pr10,C p1 = pr10 B add pr21 pr13, #1 (p1) (0) B C C E E (1) add pr31 pr12, # -1(!p1) add pr44 pr34, # -1(!p1) (1) select-µop pr41 = p1? pr21 : pr31 D add pr34 pr31, pr13 H add pr24 pr41, pr13 D H predicted path 30 Flush Loop Branches Exit Condition The loop branch is predicted to exit the loop. Benefit Reduced pipeline flushes: when the predicated loop is iterated more times than it should be. Instructions in the extra iterations of the loop become NOPs. Instructions after loop-exit can still be executed. Negative Effects Increased execution delay of loop-carried dependencies The overhead of select-µops 31 Loop Branches Predicate each loop iteration separately A B A add r1 r1, #1 r0 = (cond1) branch A, r0 A A A add r1 r1, #1 r0 = (cond1) branch A, r0 A add r1 r1, #1 r0 = (cond1) branch A, r0 branch A, pr10 add pr21 pr11, #1 pr20 = (cond1) branch A, pr20 p1 = pr10 (p1) (p1) (p1) p2 = pr20 select-uop pr22 = p1 ? pr21: pr11 select-uop pr23 = p1? pr20: pr10 A add pr31 pr22, #1 pr30 = (cond1) branch A, pr30 B add r7 r1, #10 (p2) (p2) (p2) select-uop pr32 = p2 ? pr31: pr22 select-uop pr33 = p2 ? pr30: pr23 Loop br. is predicted to exit the loop B add pr7 pr32, #10 32 Enhanced Mechanisms Multiple CFM points The hardware chooses one CFM point for each instance of dynamic predication. Exit Optimizations Counter Policy: What if one path does not reach the CFM point? Number of fetched instructions > Threshold Yield Policy: What if another low confidence diverge branch is encountered in dynamic predication mode? Later low confidence branch is more likely mispredicted. 33 A B G H C D E F Detailed DMP Support 32 Predicate register ids Fetch mechanism High performance I-Cache Fetch two cache lines Predict 3 branches Fetch stops at the first taken branch 34 35 amean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc vpr gzip Merge (%) Diverge and Merge? 100% 80% 60% 40% 20% 0% 36 amean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc vpr gzip Diverge branch actually mispredicted (%) Useful Dynamic Predication Mode 30 25 20 15 10 5 0 Perfect Branch Prediction 4 wide-20 stages-128 window 0 delta (%) 8 wide-30 stages-512 window -5 100 90 80 70 60 50 40 30 20 10 0 0 Energy EDP -10 -10 -20 -15 -30 -20 -40 -25 -50 -30 Performance -35 -60 -40 -70 37 Maximum Power Maximum Power Increament (%) 8 DMP dynamic-hammock dual-path 6 multipath software predication wish branches 4 2 0 38 Branch Predictor Effects 35 IPC delta (%) 30 25 perceptron-dynamic-hammock perceptron-dual-path perceptron-multipath perceptron-DMP OGEHL-base OGEHL-dynamic-hammock OGEHL-dual-path OGEHL-multipath OGEHL-DMP 20 15 10 5 0 39 Confidence Estimator Effects 35 IPC delta (%) 30 25 20 512B 2KB 4KB 16KB perfect 15 10 5 0 dynamic-hammock multipath dual-path 40 DMP ip vp r gc c m c cr f af p a ty rs er eo pe n rlb m k ga vo p rte x bz ip 2 tw ol co f m p go ijp eg m 88 li ks i hm m ea n gz -5 gz ip vp r gc c m c cr f af pa ty rs er e pe on rlb m k ga vo p rte bz x ip 2 tw ol co f m p go ijp eg m 88 li ks am im ea n Execution time normalized to the baseline IPC delta (%) Results in Less Aggressive Processors 35 30 dynamic-hammock 25 dual-path multi-path 20 15 dmp 10 5 0 1.05 1 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 limited software predication wish branches dmp 41 42 hmean 120 m88ksim 227 li ijpeg 229 go comp twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc 140 vpr gzip IPC delta (%) DMP vs. Perfect Conditional BP dmp Perf BP 100 80 60 40 20 0 Enhanced DMP Mechanisms 60 single-cfm multiple-cfm mcfm-counter mcfm-counter-yield 40 30 20 10 43 hmean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc -10 vpr 0 gzip IPC delta (%) 50 IPC improvement (%) 40 35 30 25 20 15 10 5 0 -5 47% mcf 44 hmean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty gcc 58% vpr gzip DMP vs. Other Mechanisms dynamic-hammock dual-path multipath DMP Comparisons with Predication/Wish Branches non-predicated 0.8 0.6 0.4 limited software predication wish branches DMP 0.2 45 amean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc vpr 0 gzip Normalized execution time 1 Average overhead: Dynamic-hammock: 4 instructions/entry Dual-path: 150 instructions/entry Multipath: 200 instructions/entry DMP: 20 instructions/entry 46 amean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty gcc vpr mcf dynamic-hammock dual-path multipath DMP 80 70 60 50 40 30 20 10 0 gzip Reduction in pipeline flushes (%) Reduction in Pipeline Flushes Handling Nested Diverge Branches Diverge Br. Basic DMP A C Ignore other low confidence div. branches B Enhanced DMP D F E Exit dynamic predication mode and re-enter from the younger low confidence branch on predicted path (Yield policy) G H CFM point 47 Compiler Support [CGO’07] Compiler analyzes the control flow and the profile data Step1: Identify diverge branch candidates and CFM points. Step2: Select diverge branches based on (1) the number of instructions between a branch and the CFM point (2) the probability of merging at the CFM point Heuristics or a cost-benefit model Step3: Mark the selected branches/CFM points. 48 Future Research Hardware Support Better confidence estimators Efficient hardware mechanism to detect diverge branches and CFM points Increase hardware complexity but eliminate the need for ISA/compiler support Compiler Support Better compiler algorithms [CGO’07] 49 Power Measurement Configurations 100 nm Technology Baseline processor 4GHZ Less aggressive processor 1.5GHz CC3 clock-gating model in Wattch: unused units dissipate only 10% of their maximum power DMP: one more RAT/RAS/GHR, select-uop generation module, additional fields in BTB, predicate registers, CFM registers, loadstore forwarding, instruction retirement 50 350 dynamic-hammock dual-path multipath dmp 300 250 200 150 100 50 51 amean m88ksim li ijpeg go comp twolf bzip2 vortex gap perlbmk eon parser crafty mcf gcc vpr 0 gzip Wrong-path instructions per entry Fetched wrong-path instructions per entry into dynamic-predication/dual-path mode Fetched/Executed Instructions 5 0 baseline less-aggressive delta (%) -5 -10 fetched instructions executed instructions max power energy energy-delay product -15 -20 -25 52 ISA Support Example of Diverge Br and CFM markers OPCODE TARGET 00 : normal branch 10 : diverge forward branch 11 : diverge loop branch CFM = CFM rel address + PC 53 CFM rel address Entering Dynamic Predication Mode Entry condition When a diverge branch has low confidence. The Front-end Stores the address of the CFM point to the CFM register. Forks the RAS, GHR, and RAT. Allocates a predicate register. Fetch Mechanisms Round-robin fetch from two paths The processor follows the branch predictor until it reaches the corresponding CFM point. 54 Exiting Dynamic Predication Mode Exit condition Both paths of a diverge branch have reached the corresponding CFM point. A diverge branch is resolved. Select-µop mechanism Similar to φ-node in SSA Merges register values from two paths. 55 Multipath Execution Low-confidence A path 2 C path 3 B path 4 D E F G H H H H H I I I I I C D path 1 B E F Low-confidence G Instructions after the control-flow merge point are fetched multiple times. Waste of resources and energy. 56 Modeling Software Predication Mark using a binary instrumentation tool All simple and nested hammocks can be predicated. All instruction between a branch and the control-flow merge point are fetched. All nested branches are predicated. 57