Program Demultiplexing: Data-flow based Speculative Parallelization
Saisanthosh Balakrishnan, Guri Sohi
University of Wisconsin-Madison

Speculative Parallelization
• Construct threads from a sequential program
  – Loops, methods, …
• Execute threads speculatively
  – Hardware support to enforce program order
• Application domain
  – Irregularly parallel
• Importance now
  – Single-core performance gains are incremental

Speculative Parallelization Execution
• Execution model
  – Fork threads (T1, T2, T3, T4) in program order
  – Commit tasks in that order
• Control-flow speculative parallelization
• Limitation
  – Reaching distant parallelism

Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation

Program Demultiplexing Framework
• Trigger
  – Begins execution of the handler
• Handler
  – Sets up the execution: computes parameters
• Demultiplexed execution of M()
  – Speculative
  – Stored in the Execution Buffer (EB)
• At the call site of M()
  – Search the EB for a completed execution
• Dependence violations
  – Invalidate executions

Program Demultiplexing Highlights
• Method granularity
  – Well defined: parameters, stack for local communication
• Trigger forks the execution
  – Means for reaching a distant method
  – Different from the call site
• Independent speculative executions
  – No control dependence with other executions
  – Triggers lead to unordered execution
    • Not according to program order

Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation

Example: 175.vpr, update_bb ()

  ..
  x_from = block[b_from].x;
  y_from = block[b_from].y;
  find_to (x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
  ..
  for (k = 0; k < num_nets_affected; k++) {
    inet = nets_to_update[k];
    if (net_block_moved[k] == FROM_AND_TO)
      continue;
    ..
    if (net[inet].num_pins <= SMALL_NET) {
      get_non_updateable_bb (inet, &bb_coord_new[bb_index]);
    } else {
      if (net_block_moved[k] == FROM)
        /* Call site 2 */
        update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
                   x_from, y_from, x_to, y_to);
      else
        /* Call site 1 */
        update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
                   x_to, y_to, x_from, y_from);
    }
    ..
    bb_index++;
  }

Handlers
• Provide parameters to the execution:
    update_bb (inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
               x_from, y_from, x_to, y_to);
• Achieve separation of call site and execution
• Handler code
  – Slice of dependent instructions leading to the call site
  – Many variants possible

Handlers Example
  [Figure: the update_bb code above, with the instruction slices for
   handlers H1 and H2 highlighted.]

Triggers
• Fork a demultiplexed execution
  – Usually when method and handler are ready
    • i.e., when their data dependencies are satisfied
• Begin execution of the handler

Identifying Triggers
• Generate a memory profile
• Identify the trigger point
  – The point where the program state for handler + method (H + M)
    is available
• Collect over many executions
  – Good coverage
• Represent trigger points by instruction attributes
  – PCs, memory write addresses

Triggers Example
  [Figure: triggers T1 and T2 fork handlers H1 and H2 and the
   update_bb executions well before the call sites; a minimum of
   400 cycles separates the trigger from the call site.]
  (Each update_bb execution takes about 90 cycles.)

Handlers Example … (2)
  [Figure: the update_bb code again, with the stack references made
   at the call sites highlighted, and triggers T1 and T2 marked.]

Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation

Hardware Support Outline
• Support for triggers
• Demultiplexed execution
• Maintaining executions
  – Storage
  – Invalidation
  – Committing
    • Dealt with in other speculative
parallelization proposals

Support for Triggers
• Triggers are registered with hardware
  – ISA extensions
  – Similar to debug watchpoints
• Triggers evaluated only on committed instructions
  – PC, memory address
  – Fast lookup with filters

Demultiplexed Execution
• Hardware: a typical multiprocessor system
  – Private cache for speculative data
  – Cache lines extended with an "access" bit
• Misses serviced by the main processor
  – No communication with other executions
• On completion
  – Collect the read set (R): accessed lines
  – Collect the write set (W): dirty lines
  – Invalidate the write set in the cache
  [Figure: a main processor P0 and auxiliary processors P1-P3, each
   with a private cache C.]

Execution Buffer Pool
• Holds speculative executions
• Each execution entry contains
  – Method name and parameters
  – Read set and write set (<tag, data> pairs)
  – Return value
• Alternatives
  – Use the cache instead; may be more efficient
  – Similar to other proposals; not the focus of this paper

Invalidating Executions
• For each committed store address
  – Search the read and write sets of all entries
  – Invalidate matching executions

Using Executions
• For a given call site
  – Search by method name and parameters
  – Get the write and read sets
  – Commit
• If accessed by the program
  – Use the return value
• If accessed by another method
  – Nested methods
Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation

Reaching Distant Parallelism
  [Figure: for each benchmark, the distance between the fork of M()
   and its call site, on a log scale.]

Performance Evaluation
  [Figure: speedup over sequential execution for 2 to 6 processors
   (2p-6p) per benchmark; y-axis from 1.0 to 2.8.]
• Performance benefits limited by
  – Methods in the program
  – Handler implementation

Summary of Other Results (refer to paper)
• Method sizes
  – 10s to 1000s of instructions; usually in the lower 100s
• Demultiplexed execution overheads
  – Common case 1.1x to 2.0x
• Trigger points
  – 1 to 3; outliers exist (macro usage)
• Handler length
  – 10 to 50 instructions on average
• Cache lines
  – Read: ~20s; written: ~10s
• Demultiplexed executions
  – Held an average of 100s of cycles

Conclusions
• Method granularity
  – Exploits modularity in the program
• Trigger and handler allow "earliest" execution
  – Data-flow based
• Unordered execution
  – Reaches distant parallelism
• Orthogonal to other speculative parallelization
  – Can be used to further speed up demultiplexed executions

Backup

Average Trigger Points per Call Site
  [Figure: average and maximum number of trigger PCs per call site
   and benchmark; averages of roughly 1-4, maxima up to 27.]
• Small set of trigger points for a given call site
  – Defines reachability from trigger to call site

Evaluation
• Full-system execution-based simulator
  – Intel x86 ISA and Virtutech Simics
  – 4-wide out-of-order processors
  – 64K level-1 caches (2 cycle), 1 MB level-2 (12 cycle)
  – MSI coherence
• Software toolchain
  – Modified gcc compiler and lancet tool
    • Debugging information, CFG, program dependence graph
  – Simulator-based memory profile
  – Generates triggers and handlers
• No mis-speculations occur

Reaching Distant Parallelism
  [Figure: A = cycles between the fork and the call site, average and
   maximum per benchmark; averages are in the hundreds of cycles,
   maxima in the thousands.]
Execution Buffer Entries
  [Figure: average and maximum EB entries per benchmark (maximum 52),
   and average cycles an entry is held, roughly 70 (gzip) to 900
   (crafty).]
• Storage requirements
  – Max case 284 KB
• Minimize entries by better scheduling

Read and Write Set
  [Figure: min/avg/max cache lines (64 B) written and read per
   benchmark; typically ~10 lines written and ~20 lines read.]

Demultiplexed Execution Overheads
  [Figure: min/avg/max execution-time overhead per benchmark.]
• Overheads due to
  – Handler
  – Cache misses during the demultiplexed execution
• Common case: between 1.1x and 2.0x
• Small methods lead to high overheads

Length of Handlers
  [Figure: min/avg/max handler length per benchmark; averages of
   roughly 10-50 instructions, with handler instruction-count
   overheads mostly between 4% and 16% (outliers up to 100%).]

Method Sizes
  [Figure: min/avg/max method sizes per benchmark; averages mostly
   in the low hundreds of instructions.]

Methods
                  crafty  gap  gzip  mcf  parser  twolf  vortex  vpr
  Methods             24   16     9    8      12     10      11   11
  Call Sites         206   59    27    9      84     26     106   20
  Exec. time (%)     85   90    51   30      55     92      88   99
• Runtime includes frequently called methods

Loop-level Parallelization
• Unit: loop iterations
• Live-ins from a p-slice (e.g., Mitosis)
  – Similar to a handler
• Fork instruction
  – Restricted: same basic-block level, same method
  – Program-order dependent
  – Ordered forking

Method-level Parallelization
• Unit: method continuations
  – The program after the method returns
• Orthogonal to PD

Reaching Distant Parallelism
  [Figure: B/A per call site on a log scale for methods M1() and
   M2(); the fraction of call sites with B/A > 1:
   crafty 60%, gap 72%, gzip 30%, mcf 80%, parser 70%, twolf 40%,
   vortex 63%, vpr 47%.]

Reaching Distant Parallelism
• B = call time to earliest execution time (1 outstanding execution)
  [Figure: average and maximum B per benchmark, and multiplying
   ratios R1 = C/B (multiple executions) and R2 (no parameters),
   mostly between 1.1x and 2.7x.]

Issues with Stack
• The stack pointer is position dependent
  – The handler has to insert parameters at the right position
• The same stack addresses denote different variables
  – Affects triggers
• Different stack pointers in program and execution
  – The stack may be discarded
  – Committing requires relocation of stack results
    • Example: parameters passed by reference

Benchmarks
• SPECint2000 benchmarks
  – C programs
  – Did not evaluate gcc, perl, bzip2, and eon
• Written with no intention of creating concurrency
  – No specific or clean programming style
    • Many methods perform several tasks
  – May offer fewer opportunities

Hardware System
• Intel x86 simulation
  – Virtutech Simics-based full system, Bochs decoder
  – 4 processors at 3 GHz
  – Simple memory system
• Micro-architecture model
  – 4-wide out-of-order, without cracking into micro-ops
  – Branch predictors
  – 32K L1 (2-cycle), 1 MB L2 (12-cycle)
  – MSI coherence, 15-cycle cache-to-cache communication
  – Infinite execution buffer pool

Software
• Modified gcc compiler tool chain and lancet tool
• Extracted from the compiled binary
  – Debugging information
  – CFG, program dependence graph
• Software
  – Dynamic information from the simulator
  – Generates handler and trigger for each call site as encountered
    • Control flow in handlers not included [ongoing work]
  – Perfect control transfer from trigger to method
    • The handler does not execute if a branch leads to the method
      not being called

Generating Handlers
• Cannot easily identify and demarcate code
  – Heuristic: terminate the slice when a load address is from the heap
• A handler has
  – Loads and stores to the stack
  – No stores to the heap
• Limitation
  – It is a heuristic; it does not always work

Generating Handlers
• 1: Specify parameters to the method
  – Pushed onto the stack by the program
    • Introduces a dependency
    • Prevents separation
• 2: Compute the parameters
  – The program performs this near the call site
  – Need to identify the code
  – Must deal with
    • Use of the stack
    • Control flow
    • Inter-method dependence

    1: G = F (N)
    2: if (…)
    3:   X = G + 2
    4: else
    5:   X = G * 2
    6: M (X)

Control-flow in Handlers
• Depends on the call site's control flow
• Handler for D, with the call site in C()
  – Include loop: CFG of C, back edge from BB 4 to BB 1
  – Include branch: the branch in BB 1
• Inclusion depends on the trigger
  – Multiple iterations, different triggers
• Ongoing work
  [Figure: call graph with C calling D; C's CFG has basic blocks
   1-4 with a back edge from BB 4 to BB 1.]

Other Dependencies in Handlers
• C calls D; A or B calls C
  – The dependence (X) extends up the call graph:
    A(X) or B(X) → C(X) → D(X)
• May need multiple handlers
  – If there are multiple call sites

Buffering Handler Writes
• General case
  – Writes in the handler must be buffered
  – Provided to the execution
  – Discarded after the execution
• Current implementation
  – Only stack writes
  [Figure: processors P1-P3 with private caches C and the EB.]

Methods for Speculative Execution
• Well encapsulated
  – Defined by parameters and return value
  – Stack for local computation
  – Heap for global state
• Often perform specific tasks
  – Access limited global state
  – Limits side effects