Techniques for Efficient Processing in Runahead Execution Engines
Onur Mutlu, Hyesoon Kim, Yale N. Patt

Talk Outline
- Background on Runahead Execution
- The Problem
- Causes of Inefficiency and Eliminating Them
- Evaluation
- Performance Optimizations to Increase Efficiency
- Combined Results
- Conclusions

Efficient Runahead Execution

Background on Runahead Execution
- A technique to obtain the memory-level parallelism benefits of a large instruction window.
- When the oldest instruction is an L2 miss: checkpoint the architectural state and enter runahead mode.
- In runahead mode:
  - Instructions are speculatively pre-executed.
  - The purpose of pre-execution is to generate prefetches.
  - L2-miss-dependent instructions are marked INV (invalid) and dropped.
- Runahead mode ends when the original L2 miss returns: the checkpoint is restored and normal execution resumes.

Runahead Example
[Timeline figure. Small window: the processor computes, stalls on Load 1's L2 miss, computes briefly, then stalls again on Load 2's L2 miss. Runahead: during Load 1's miss the processor pre-executes ahead and issues Load 2's miss early, so after runahead both loads hit; overlapping the two misses saves cycles.]

The Problem
- A runahead processor pre-executes some instructions speculatively.
- Each pre-executed instruction consumes energy.
- Runahead execution significantly increases the number of executed instructions, sometimes without providing a significant performance improvement.

The Problem (cont.)
[Chart: per-benchmark % increase in IPC and % increase in executed instructions under runahead execution, across the SPEC CPU2000 benchmarks (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise). On average, runahead improves IPC by 22.6% but increases executed instructions by 26.5%; the worst-case increase in executed instructions is 235%.]

Efficiency of Runahead Execution
- Efficiency = (% increase in IPC) / (% increase in executed instructions)
- Goals:
  - Reduce the number of executed instructions without reducing the IPC improvement.
  - Increase the IPC improvement without increasing the number of executed instructions.

Causes of Inefficiency
- Short runahead periods
- Overlapping runahead periods
- Useless runahead periods

Short Runahead Periods
- The processor can initiate runahead mode due to an already in-flight L2 miss generated by the prefetcher, the wrong path, or a previous runahead period.
[Timeline figure: Load 2's miss is initiated during the runahead period caused by Load 1; when normal execution resumes and later blocks on Load 2, that miss is already partly serviced, so the resulting runahead period is short.]
- Short periods are less likely to generate useful L2 misses and have high overhead due to the flush penalty at runahead exit.

Eliminating Short Runahead Periods
- Mechanism to eliminate short periods:
  - Record the number of cycles C an L2 miss has been in flight.
  - If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss.
- T can be determined statically (at design time) or dynamically.
- T = 400 works well for a minimum main memory latency of 500 cycles.

Overlapping Runahead Periods
- Two runahead periods that execute the same instructions.
[Timeline figure: the runahead period caused by Load 1's miss pre-executes past Load 2, which is INV; the later period caused by Load 2's miss re-executes the same instructions, overlapping the first period.]
- The second period is inefficient.
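The short-period filter described above can be sketched in a few lines. This is a hypothetical illustration, not the talk's hardware design; the class and function names are assumptions made for the example.

```python
# Illustrative sketch of the short-runahead-period filter (names are
# hypothetical, not from the talk). The idea: if a blocking L2 miss has
# already been in flight for more than T cycles, it will return soon,
# so the runahead period it would trigger is too short to be useful.

MIN_MEMORY_LATENCY = 500   # minimum main memory latency (cycles)
THRESHOLD_T = 400          # threshold T from the slide

class L2Miss:
    def __init__(self, issue_cycle):
        self.issue_cycle = issue_cycle  # cycle the miss request was issued

def should_enter_runahead(miss, current_cycle, threshold=THRESHOLD_T):
    """Return True if entering runahead on this blocking miss is worthwhile.

    C = cycles the miss has been in flight. If C > T, the remaining
    latency is small, and the flush penalty at runahead exit would
    likely outweigh any prefetching benefit.
    """
    c = current_cycle - miss.issue_cycle
    return c <= threshold

miss = L2Miss(issue_cycle=1000)
print(should_enter_runahead(miss, current_cycle=1100))  # in flight 100 cycles -> True
print(should_enter_runahead(miss, current_cycle=1450))  # in flight 450 > 400 -> False
```

The threshold comparison is the entire mechanism; the only state added per outstanding miss is the in-flight cycle count.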
Overlapping Runahead Periods (cont.)
- Overlapping periods are not necessarily useless: the availability of a new data value can result in the generation of useful L2 misses.
- But this does not happen often enough.
- Mechanism to eliminate overlapping periods:
  - Keep track of the number of pseudo-retired instructions R during a runahead period.
  - Keep track of the number of fetched instructions N since the exit from the last runahead period.
  - If N < R, do not enter runahead mode.

Useless Runahead Periods
- Periods that do not result in prefetches for normal mode.
[Timeline figure: Load 1 misses and the processor runs ahead, but the period generates no new L2 misses before Load 1 returns.]
- They exist due to the lack of memory-level parallelism.
- Mechanism to eliminate useless periods:
  - Predict whether a period will generate useful L2 misses.
  - Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window.
  - Useless-period predictors are trained based on this estimation.

Predicting Useless Runahead Periods
- Prediction based on the past usefulness of runahead periods caused by the same static load instruction: a 2-bit state machine records the past usefulness of each load.
- Prediction based on too many INV loads: if the fraction of INV loads in a runahead period is greater than a threshold T, exit runahead mode.
- Sampling (phase-based) prediction: if the last N runahead periods generated fewer than T L2 misses, do not enter runahead for the next M runahead opportunities.
- Compile-time profile-based prediction: if the runahead periods caused by a load were not useful in the profiling run, mark it as a non-runahead load.

Baseline Processor
- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Aggressive stream-based prefetcher
- 32 DRAM banks, 32-byte-wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
- Detailed memory model

Impact on Efficiency
[Chart: increase over the baseline out-of-order processor in executed instructions and in IPC, for baseline runahead and for the short, overlapping, useless, and short+overlapping+useless elimination techniques. The techniques reduce baseline runahead's 26.5% executed-instruction increase step by step (bar labels: 20.1%, 15.3%, 14.9%, 11.8%, 6.7%) while the IPC improvement remains close to baseline runahead's 22.6%.]

Performance Optimizations for Efficiency
- Both efficiency AND performance can be increased by increasing the usefulness of runahead periods.
- Three optimizations:
  - Turning off the floating-point unit (FPU) in runahead mode
  - Optimizing the update policy of the hardware prefetcher (HWP) in runahead mode
  - Early wake-up of INV instructions (in paper)

Turning Off the FPU in Runahead Mode
- FP instructions do not contribute to the generation of load addresses, so they can be dropped after decode.
- Spares processor resources for more useful instructions.
- Increases performance by enabling faster progress.
- Enables dynamic/static energy savings.
- Results in an unresolvable branch misprediction if a mispredicted branch depends on an FP operation (rare).
- Overall: increases IPC and reduces executed instructions.

HWP Update Policy in Runahead Mode
- Aggressive hardware prefetching in runahead mode may hurt performance if the prefetcher's accuracy is low.
- Runahead requests are more accurate than prefetcher requests.
- Three policies:
  - Do not update the prefetcher state.
  - Update the prefetcher state just as in normal mode.
  - Only train existing streams, but do not create new streams.
- Runahead mode improves the timeliness of the prefetcher in many benchmarks.
- Training only the existing streams is the best policy.

Overall Impact on Executed Instructions
[Chart: per-benchmark % increase in executed instructions for baseline runahead vs. runahead with all techniques, across the SPEC CPU2000 benchmarks. The average increase drops from 26.5% to 6.2%; the worst case for baseline runahead is 235%.]

Overall Impact on IPC
[Chart: per-benchmark % increase in IPC for baseline runahead vs. runahead with all techniques, across the SPEC CPU2000 benchmarks. The average IPC improvement is 22.6% for baseline runahead and 22.1% with all techniques; the best case is 116%.]

Conclusions
- Three major causes of inefficiency in runahead execution: short, overlapping, and useless runahead periods.
- Simple efficiency techniques can effectively reduce the three causes of inefficiency.
- Simple performance optimizations can increase efficiency by increasing the usefulness of runahead periods.
- The proposed techniques:
  - Reduce the extra instructions from 26.5% to 6.2%, without significantly affecting performance.
  - Are effective for a variety of memory latencies, ranging from 100 to 900 cycles.

Backup Slides

Baseline IPC
[Chart: per-benchmark IPC (0 to 5.5) for four configurations: no prefetcher, baseline, runahead, and perfect L2.]

Memory Latency (Executed Instructions)
[Chart: % increase in executed instructions for baseline runahead vs. all techniques at memory latencies of 100, 300, 500, 700, and 900 cycles.]

Memory Latency (IPC Delta)
[Chart: % increase in IPC for baseline runahead vs. all techniques at memory latencies of 100 to 900 cycles.]

Cache Sizes (Executed Instructions)
[Chart: % increase in executed instructions for baseline runahead vs. all techniques with L2 cache sizes of 512 KB, 1 MB, 2 MB, and 4 MB.]

Cache Sizes (IPC Delta)
[Chart: % increase in IPC for baseline runahead vs. all techniques with L2 cache sizes of 512 KB to 4 MB.]

INT (Executed Instructions)
[Chart: % increase in executed instructions on the integer benchmarks for runahead (INT) vs. all techniques (INT) at memory latencies of 100 to 900 cycles.]

INT (IPC Delta)
[Chart: % increase in IPC on the integer benchmarks for runahead (INT) vs. all techniques (INT) at memory latencies of 100 to 900 cycles.]

FP (Executed Instructions)
[Chart: % increase in executed instructions on the floating-point benchmarks for runahead (FP) vs. all techniques (FP) at memory latencies of 100 to 900 cycles.]

FP (IPC Delta)
[Chart: % increase in IPC on the floating-point benchmarks for runahead (FP) vs. all techniques (FP) at memory latencies of 100 to 900 cycles.]

Early INV Wake-up
- Keep track of the INV status of an instruction in the scheduler; the scheduler wakes up the instruction if any source is INV.
- + Enables faster progress during runahead mode by removing the useless INV instructions faster.
- - Increases the number of executed instructions.
- - Increases the complexity of the scheduling logic.
- Not worth implementing due to the small IPC gain.
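The three entry filters from the talk (the short-period threshold, the overlap check N < R, and the per-static-load usefulness predictor with a 2-bit counter) can be combined into a single runahead-entry decision. The sketch below is an assumption-laden illustration, not the talk's hardware: class and method names are invented, and the counter update policy is one plausible reading of "a 2-bit state machine records the past usefulness of a load."

```python
# Hypothetical combined runahead-entry controller (all names illustrative).
class RunaheadController:
    def __init__(self, short_threshold=400):
        self.short_threshold = short_threshold
        self.usefulness = {}              # load PC -> 2-bit counter (0..3)
        self.retired_in_last_period = 0   # R: pseudo-retired in last period
        self.fetched_since_exit = 0       # N: fetched since last runahead exit

    def record_period_outcome(self, load_pc, generated_useful_l2_miss,
                              pseudo_retired):
        """Train the predictor at runahead exit and reset the N/R state."""
        ctr = self.usefulness.get(load_pc, 2)  # start weakly useful
        ctr = min(3, ctr + 1) if generated_useful_l2_miss else max(0, ctr - 1)
        self.usefulness[load_pc] = ctr
        self.retired_in_last_period = pseudo_retired
        self.fetched_since_exit = 0

    def on_fetch(self, n=1):
        self.fetched_since_exit += n

    def should_enter(self, load_pc, miss_cycles_in_flight):
        if miss_cycles_in_flight > self.short_threshold:
            return False  # period would be too short (miss returns soon)
        if self.fetched_since_exit < self.retired_in_last_period:
            return False  # N < R: would overlap the previous period
        if self.usefulness.get(load_pc, 2) < 2:
            return False  # this static load's past periods were useless
        return True
```

A normal-mode front end would call `on_fetch` per fetched instruction, query `should_enter` when the oldest instruction is an L2 miss, and call `record_period_outcome` at every runahead exit.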