Efficient Runahead Execution Processors
A Power-Efficient Processing Paradigm for Tolerating Long Main Memory Latencies
Onur Mutlu
PhD Defense, 4/28/2006

Talk Outline
- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

Motivation
Memory latency is very long in today's processors, and it continues to increase (in terms of processor cycles):
- CDC 6600: 10 cycles [Thornton, 1970]
- Alpha 21264: 120+ cycles [Wilkes, 2001]
- Intel Pentium 4: 300+ cycles [Sprangle & Carmean, 2002]
DRAM latency is not decreasing as fast as processor cycle time, and conventional techniques to tolerate memory latency do not work well enough.

Conventional Latency Tolerance Techniques
- Caching [initially by Wilkes, 1965]: Widely used, simple, and effective, but inefficient and passive; not all applications/phases exhibit temporal or spatial locality.
- Prefetching [initially in IBM 360/91, 1967]: Works well for regular memory access patterns; prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive.
- Multithreading [initially in CDC 6600, 1964]: Works well if there are multiple threads; improving single-thread performance using multithreading hardware is an ongoing research effort.
- Out-of-order execution [initially by Tomasulo, 1967]: Tolerates cache misses that cannot be prefetched, but requires extensive hardware resources for tolerating long latencies.

Out-of-order Execution
- Instructions are executed out of sequential program order to tolerate latency.
- Instructions are retired in program order to support precise exceptions/interrupts.
- Not-yet-retired instructions and their results are buffered in hardware structures, collectively called the instruction window.
- The size of the instruction window determines how much latency the processor can tolerate (see the back-of-envelope sketch after the example below).

Small Windows: Full-window Stalls
- When a long-latency instruction is not complete, it blocks retirement.
- Incoming instructions fill the instruction window.
- Once the window is full, the processor cannot place new instructions into the window. This is called a full-window stall.
- A full-window stall prevents the processor from making progress in the execution of the program.

Small Windows: Full-window Stalls (example)
8-entry instruction window:
    LOAD  R1 <- mem[R5]     (oldest; L2 miss, takes 100s of cycles)
    BEQ   R1, R0, target
    ADD   R2 <- R2, 8
    LOAD  R3 <- mem[R2]
    MUL   R4 <- R4, R3
    ADD   R4 <- R4, R5      (independent of the L2 miss; executed out of
    STOR  mem[R2] <- R4      program order, but cannot be retired)
    ADD   R2 <- R2, 64
Younger instructions (e.g., LOAD R3 <- mem[R2]) cannot be executed because there is no space in the instruction window. The processor stalls until the L2 miss is serviced. L2 cache misses are responsible for most full-window stalls.
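Not from the talk: a minimal back-of-envelope sketch of the window-size argument, under the standard assumption that hiding a miss of L cycles at sustained issue width W requires roughly W x L window entries of independent work. The numbers below are illustrative, taken from the talk's baseline configuration.

    #include <stdio.h>

    /* Rough model: window entries needed to fully hide a miss of
       'latency' cycles at sustained issue width 'width', assuming all
       intervening instructions are independent of the miss. */
    static int window_needed(int width, int latency) {
        return width * latency;
    }

    int main(void) {
        /* Talk's baseline: 3-wide machine, ~500-cycle L2 miss latency. */
        printf("entries needed: %d\n", window_needed(3, 500)); /* ~1500 >> 128 */
        return 0;
    }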
Impact of L2 Cache Misses
[Chart: normalized execution time broken into non-stall (compute) time and full-window stall time due to L2 misses, for a 128-entry window. 512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]
[Chart: the same breakdown comparing a 128-entry window against a 2048-entry window. 500-cycle DRAM latency, aggressive stream-based prefetcher; same 147 benchmarks.]

The Problem
- Out-of-order execution requires large instruction windows to tolerate today's main memory latencies.
- As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency.
- Building a large instruction window is a challenging task if we would like to achieve low power/energy consumption, short cycle time, and low design and verification complexity.

Overview of Runahead Execution [HPCA'03]
A technique to obtain the memory-level parallelism benefits of a large instruction window (without having to build it!):
- When the oldest instruction is an L2 miss: checkpoint architectural state and enter runahead mode.
- In runahead mode: instructions are speculatively pre-executed. The purpose of pre-execution is to discover other L2 misses. The processor does not stall due to L2 misses.
- Runahead mode ends when the original L2 miss returns: the checkpoint is restored and normal execution resumes.

Runahead Example
- Perfect caches:  Compute | Load 1 hit | Compute | Load 2 hit | Compute
- Small window:    Compute | Load 1 miss -> stall (Miss 1) | Compute | Load 2 miss -> stall (Miss 2) | Compute
- Runahead:        Compute | Load 1 miss -> runahead (Miss 1 and Miss 2 serviced in parallel) | Load 1 hit, Load 2 hit | Compute  => saved cycles

Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
- Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches, for both regular and irregular access patterns.
- Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
- Hardware prefetcher and branch predictor tables are trained using future access information.

Runahead Execution Mechanism
- Entry into runahead mode: checkpoint architectural register state.
- Instruction processing in runahead mode.
- Exit from runahead mode: restore architectural register state from checkpoint.

Instruction Processing in Runahead Mode
Runahead mode processing is the same as normal instruction processing, EXCEPT:
- It is purely speculative: architectural (software-visible) register/memory state is NOT updated in runahead mode.
- L2-miss dependent instructions are identified and treated specially: they are quickly removed from the instruction window, and their results are not trusted.

L2-Miss Dependent Instructions
- Two types of results are produced: INV and VALID. INV = dependent on an L2 miss.
- INV results are marked using INV bits in the register file and store buffer.
- INV values are not used for prefetching or branch resolution. (A sketch of INV propagation follows.)
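Not from the talk: a minimal C sketch of how INV bits might propagate through a register file during runahead mode, following the rules just described. The structure and function names are illustrative, not the thesis' actual design.

    #include <stdbool.h>

    #define NUM_REGS 64

    typedef struct {
        long value;
        bool inv;            /* set if the value depends on an L2 miss */
    } Reg;

    static Reg regfile[NUM_REGS];

    /* An L2-miss load cannot produce a value: mark its destination INV. */
    void runahead_l2_miss_load(int dst) {
        regfile[dst].inv = true;
    }

    /* Any instruction with an INV source produces an INV result;
       otherwise it executes normally and produces a VALID result. */
    void runahead_execute(int dst, int src1, int src2, long result) {
        if (regfile[src1].inv || regfile[src2].inv) {
            regfile[dst].inv = true;       /* result not trusted */
        } else {
            regfile[dst].value = result;   /* VALID: usable for prefetching
                                              and branch resolution */
            regfile[dst].inv = false;
        }
    }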
Removal of Instructions from Window
- The oldest instruction is examined for pseudo-retirement; pseudo-retired instructions free their allocated resources.
- An INV instruction is removed from the window immediately; a VALID instruction is removed when it completes execution.
- This allows the processing of later instructions.
- Pseudo-retired stores communicate their data to dependent loads (see the runahead cache sketch later in this section).

Store/Load Handling in Runahead Mode
- A pseudo-retired store writes its data and INV status to a dedicated memory, called the runahead cache. Purpose: data communication through memory in runahead mode.
- A dependent load reads its data from the runahead cache. The forwarded data does not always need to be correct.
- The runahead cache is very small.

Branch Handling in Runahead Mode
- INV branches cannot be resolved. A mispredicted INV branch causes the processor to stay on the wrong program path until the end of runahead execution.
- VALID branches are resolved and initiate recovery if mispredicted.

Hardware Cost of Runahead Execution
- Checkpoint of the architectural register state: already exists in current processors.
- INV bits per register and store buffer entry.
- Runahead cache (512 bytes).
- Total: <0.05% area overhead.

Baseline Processor
- 3-wide fetch, 29-stage pipeline x86 processor
- 128-entry instruction window
- 512 KB, 8-way, 16-cycle unified L2 cache
- Approximately 500-cycle L2 miss latency; bandwidth, contention, and conflicts modeled in detail
- Aggressive streaming data prefetcher (16 streams); next-two-lines instruction prefetcher

Evaluated Benchmarks
147 Intel x86 benchmarks, each simulated for 30 million instructions:
- SPEC CPU 95 (S95): mostly scientific FP applications
- SPEC FP 2000 (FP00)
- SPEC INT 2000 (INT00)
- Internet (WEB): SPECjbb, Webmark2001
- Multimedia (MM): MPEG, speech recognition, games
- Productivity (PROD): PowerPoint, Excel, Photoshop
- Server (SERV): transaction processing, e-commerce
- Workstation (WS): engineering/CAD applications

Performance of Runahead Execution
[Chart: micro-operations per cycle for four configurations (no prefetcher/no runahead; prefetcher only, the baseline; runahead only; prefetcher + runahead) across S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG. Runahead on top of the prefetcher baseline improves performance by 12-52% depending on the suite, 22% on average.]
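Not from the talk: a minimal sketch of the runahead cache's forwarding role, assuming a tiny direct-mapped structure keyed by address. The geometry and names are illustrative; the real 512-byte structure tracks INV status at finer granularity and never affects architectural state.

    #include <stdbool.h>
    #include <stdint.h>

    #define RA_ENTRIES 64   /* illustrative; the real structure is ~512 bytes */

    typedef struct {
        uint64_t addr;
        uint64_t data;
        bool     inv;       /* the store's data was L2-miss dependent */
        bool     valid;     /* entry in use */
    } RACacheEntry;

    static RACacheEntry ra_cache[RA_ENTRIES];

    /* Pseudo-retired store: record data + INV status for later forwarding. */
    void ra_store(uint64_t addr, uint64_t data, bool inv) {
        RACacheEntry *e = &ra_cache[(addr / 8) % RA_ENTRIES];
        e->addr = addr; e->data = data; e->inv = inv; e->valid = true;
    }

    /* Runahead load: forward from the runahead cache on a hit. The value
       need not always be correct -- it only steers speculative prefetching. */
    bool ra_load(uint64_t addr, uint64_t *data, bool *inv) {
        RACacheEntry *e = &ra_cache[(addr / 8) % RA_ENTRIES];
        if (e->valid && e->addr == addr) {
            *data = e->data; *inv = e->inv;
            return true;    /* hit */
        }
        return false;       /* miss: fall back to the memory hierarchy */
    }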
Runahead Execution vs. Large Windows
[Chart: micro-operations per cycle for a 128-entry window (baseline), a 128-entry window with runahead, and 256-, 384-, and 512-entry windows, across S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG. A 128-entry window with runahead performs approximately as well as a 384-entry window.]

Limitations of the Baseline Runahead Mechanism
- Energy inefficiency: a large number of instructions are speculatively executed. Addressed by Efficient Runahead Execution [ISCA'05, IEEE Micro Top Picks'06].
- Ineffectiveness for pointer-intensive applications: runahead cannot parallelize dependent L2 cache misses. Addressed by Address-Value Delta (AVD) Prediction [MICRO'05].
- Irresolvable branch mispredictions in runahead mode: the processor cannot recover from a mispredicted L2-miss dependent branch. Addressed by Wrong Path Events [MICRO'04].

The Efficiency Problem [ISCA'05]
- A runahead processor pre-executes some instructions speculatively, and each pre-executed instruction consumes energy.
- Runahead execution significantly increases the number of executed instructions, sometimes without providing any performance improvement.
[Chart: per-benchmark % increase in IPC vs. % increase in executed instructions for baseline runahead on SPEC 2000 (bzip2 through wupwise). On average, IPC increases by 22% while executed instructions increase by 27%; the worst case reaches a 235% instruction increase.]

Efficiency of Runahead Execution
    Efficiency = (% Increase in IPC) / (% Increase in Executed Instructions)
Goals:
- Reduce the number of executed instructions without reducing the IPC improvement.
- Increase the IPC improvement without increasing the number of executed instructions.

Causes of Inefficiency
- Short runahead periods
- Overlapping runahead periods
- Useless runahead periods

Short Runahead Periods
The processor can initiate runahead mode due to an already in-flight L2 miss generated by the prefetcher, the wrong path, or a previous runahead period. Short periods are less likely to generate useful L2 misses and have high overhead due to the flush penalty at runahead exit.

Eliminating Short Runahead Periods
- Record the number of cycles C an L2 miss has been in flight.
- If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss (see the sketch below).
- T can be determined statically (at design time) or dynamically; T=400 works well for a minimum main memory latency of 500 cycles.
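Not from the talk: a minimal sketch of the short-period filter, assuming a per-miss cycle counter kept alongside the outstanding-miss bookkeeping (e.g., in the MSHRs). The T=400 threshold is the value quoted in the talk; the structure names are illustrative.

    #include <stdbool.h>

    #define T_THRESHOLD 400   /* works well for a >=500-cycle minimum
                                 main memory latency (from the talk) */

    typedef struct {
        int cycles_in_flight;  /* C: incremented every cycle the miss
                                  has been outstanding */
    } L2Miss;

    /* Enter runahead only if the blocking miss is young enough that a
       useful amount of runahead time remains before its data returns. */
    bool should_enter_runahead(const L2Miss *m) {
        return m->cycles_in_flight <= T_THRESHOLD;
    }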
Overlapping Runahead Periods
Two runahead periods that execute the same instructions: the second period largely re-executes what the first already pre-executed, so it is inefficient.
Overlapping periods are not necessarily useless: the availability of a new data value can result in the generation of useful L2 misses. But this does not happen often enough.
Mechanism to eliminate overlapping periods:
- Keep track of the number of pseudo-retired instructions R during a runahead period.
- Keep track of the number of fetched instructions N since the exit from the last runahead period.
- If N < R, do not enter runahead mode.

Useless Runahead Periods
Periods that do not result in prefetches for normal mode. They exist due to the lack of memory-level parallelism.
Mechanism to eliminate useless periods:
- Predict whether a period will generate useful L2 misses.
- Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window.
- Useless-period predictors are trained based on this estimation.

Predicting Useless Runahead Periods
- Prediction based on the past usefulness of runahead periods caused by the same static load instruction: a 2-bit state machine records the past usefulness of each load.
- Prediction based on too many INV loads: if the fraction of INV loads in a runahead period is greater than a threshold T, exit runahead mode.
- Sampling (phase) based prediction: if the last N runahead periods generated fewer than T L2 misses, do not enter runahead for the next M runahead opportunities.
- Compile-time profile-based prediction: if runahead periods caused by a load were not useful in the profiling run, mark it as a non-runahead load.
(A sketch combining these entry filters follows the charts below.)

Performance Optimizations for Efficiency
Both efficiency AND performance can be increased by increasing the usefulness of runahead periods. Three major optimizations:
- Turning off the floating point unit (FPU) in runahead mode: FP instructions do not contribute to the generation of load addresses.
- Optimizing the update policy of the hardware prefetcher (HWP) in runahead mode: improves the positive interaction between runahead and the HWP.
- Early wake-up of INV instructions: enables the faster removal of INV instructions.

Overall Impact on Executed Instructions
[Chart: per-benchmark increase in executed instructions for baseline runahead vs. all efficiency techniques (SPEC 2000). The average increase drops from 26.5% to 6.2%; the worst case for baseline runahead is 235%.]

Overall Impact on IPC
[Chart: per-benchmark increase in IPC for baseline runahead vs. all efficiency techniques (SPEC 2000). The average IPC improvement is essentially unchanged: 22.6% vs. 22.1%; the best case is 116%.]
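Not from the talk: a sketch of how the three period filters above might jointly gate runahead entry. All structure, field, and threshold names are illustrative; the thesis evaluates the filters individually and in combination.

    #include <stdbool.h>

    typedef struct {
        int miss_cycles_in_flight;   /* C, for the short-period filter */
        int fetched_since_last_exit; /* N, for the overlap filter */
        int pseudo_retired_last;     /* R from the previous period */
        int usefulness_counter;      /* 2-bit saturating counter for the
                                        runahead-causing static load */
    } RunaheadState;

    #define T_CYCLES 400

    bool enter_runahead(const RunaheadState *s) {
        /* Short-period filter: the miss is about to return anyway. */
        if (s->miss_cycles_in_flight > T_CYCLES) return false;
        /* Overlap filter: we would re-execute instructions the previous
           period already pre-executed (N < R). */
        if (s->fetched_since_last_exit < s->pseudo_retired_last) return false;
        /* Useless-period filter: this load's past periods generated no
           useful L2 misses (counter below confidence threshold). */
        if (s->usefulness_counter < 2) return false;
        return true;
    }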
The Problem: Dependent Cache Misses
Load 2 is dependent on Load 1, so in runahead mode Load 2 is INV: the processor cannot compute its address. Runahead execution cannot parallelize dependent misses:
- a wasted opportunity to improve performance
- wasted energy (useless pre-execution)
Runahead performance would improve by 25% if this limitation were ideally overcome.

The Goal of AVD Prediction
Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism.
How: predict the values of L2-miss address (pointer) loads. An address load loads an address into its destination register, which is later used to calculate the address of another load (as opposed to a data load).

Parallelizing Dependent Cache Misses
With Load 1's value predicted, Load 2's address can be computed in runahead mode, so Miss 1 and Miss 2 are serviced in parallel: saved cycles and saved speculative instructions.

AVD Prediction [MICRO'05]
The address-value delta (AVD) of a load instruction is defined as:
    AVD = Effective Address of Load - Data Value of Load
For some address loads, the AVD is stable. An AVD predictor keeps track of the AVDs of address loads. When a load misses in the L2 cache in runahead mode, the AVD predictor is consulted; if the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:
    Predicted Value = Effective Address - Predicted AVD

Why Do Stable AVDs Occur?
Regularity in the way data structures are allocated in memory AND traversed. Two types of loads can have stable AVDs:
- Traversal address loads: produce addresses consumed by address loads.
- Leaf address loads: produce addresses consumed by data loads.

Traversal Address Loads
Regularly-allocated linked list with nodes at A, A+k, A+2k, A+3k, ...
A traversal address load loads the pointer to the next node: node = node->next.
    Effective Addr   Data Value   AVD
    A                A+k          -k
    A+k              A+2k         -k
    A+2k             A+3k         -k
The data value strides, but the AVD is stable. (A small C demo follows the next slide.)

Leaf Address Loads
Sorted dictionary in parser: nodes point to strings (words), and each string and its node are allocated consecutively. The dictionary is looked up for an input word:
    lookup (node, input) {
        // ...
        ptr_str = node->string;
        m = check_match(ptr_str, input);
        // ...
    }
A leaf address load loads the pointer to the string of each node.
    Effective Addr   Data Value   AVD
    A+k              A            k
    C+k              C            k
    F+k              F            k
There is no stride in the data values, but the AVD is stable.
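Not from the talk: a small, self-contained C demo of why regular allocation yields a stable AVD for the traversal load node = node->next. A pool allocator stands in for the contiguous, fixed-size-chunk allocation behavior the slide assumes.

    #include <stdio.h>
    #include <stdint.h>

    typedef struct Node { struct Node *next; int payload; } Node;

    /* Pool allocator handing out contiguous, fixed-size chunks --
       the regular allocation behavior the slide assumes. */
    static Node pool[4];
    static int  pool_top = 0;
    static Node *alloc_node(void) { return &pool[pool_top++]; }

    int main(void) {
        /* Build a regularly allocated 4-node list: nodes sizeof(Node) apart. */
        Node *head = alloc_node(), *n = head;
        for (int i = 0; i < 3; i++) { n->next = alloc_node(); n = n->next; }
        n->next = NULL;

        /* For the traversal load "node = node->next":
           effective address = &node->next, data value = node->next. */
        for (Node *node = head; node->next != NULL; node = node->next) {
            intptr_t avd = (intptr_t)&node->next - (intptr_t)node->next;
            printf("AVD = %ld\n", (long)avd);   /* constant: -sizeof(Node) */
        }
        return 0;
    }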
Performance of AVD Prediction
[Chart: normalized execution time and executed instructions of runahead with AVD prediction on pointer-intensive benchmarks (bisort, health, mst, perimeter, treeadd, tsp, voronoi, mcf, parser, twolf, vpr; AVG). On average, AVD prediction reduces execution time by 14.3% and executed instructions by 15.5% relative to plain runahead.]

Summary of Contributions
- Runahead execution provides the latency-tolerance benefit of a large instruction window by parallelizing independent cache misses: a 128-entry window with runahead performs approximately as well as a 384-entry window.
- Efficient runahead execution techniques improve the energy efficiency of base runahead execution with a very modest increase in hardware cost and complexity: only 6% extra instructions executed for a 22% performance benefit.
- Address-Value Delta (AVD) prediction enables the parallelization of dependent cache misses by exploiting regular memory allocation patterns: a 16-entry (102-byte) AVD predictor improves the performance of runahead execution by 14% on pointer-intensive applications.

Future Work in Runahead Execution
- Compilation/programming techniques for runahead processors
- Keeping runahead execution on the correct program path
- Parallelizing dependent cache misses in linked data structure traversals
- Runahead co-processors/accelerators
- Evaluation of runahead execution on multithreaded and multiprocessor systems

Research Summary
- Runahead execution: original runahead proposal [HPCA'03, IEEE Micro Top Picks'03]; efficient runahead execution [ISCA'05, IEEE Micro Top Picks'06]; AVD prediction [MICRO'05]; result reuse in runahead execution [Comp. Arch. Letters'05]
- High-performance memory system designs: pollution-aware caching [IJPP'05]; parallelism-aware caching [ISCA'06]; performance analysis of speculative memory references [IEEE Trans. on Computers'05]; latency/bandwidth tradeoffs in memory controllers [Patent'04]
- Branch instruction handling through compiler-microarchitecture cooperation: wish branches [MICRO'05 & IEEE Micro Top Picks'06]; wrong path events [MICRO'04]; compiler-assisted dynamic predication [in progress]
- Efficient compile-time profiling techniques for detecting input-dependent program behavior: 2D profiling [CGO'06]
- Fault-tolerant microarchitecture design: microarchitecture-based introspection [DSN'05]

Thank you.

Backup Slides

Thesis Statement
Efficient runahead execution is a cost- and complexity-effective microarchitectural technique that can tolerate long main memory latencies without requiring unreasonably large, slow, complex, and power-hungry hardware structures, i.e., without significant increases in processor complexity and power consumption.
Impact of L2 Cache Misses (backup)
[Chart: normalized execution time split into non-stall (compute) time and full-window stall time for 128-entry, 2048-entry, and infinite windows. 500-cycle DRAM latency, aggressive stream-based prefetcher; data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

Entry into Runahead Mode
When an L2-miss load instruction is the oldest in the instruction window:
- The processor checkpoints the architectural register state.
- The processor records the address of the L2-miss load.
- The L2-miss load marks its destination register as INV (invalid) and is removed from the instruction window.

Exit from Runahead Mode
When the runahead-causing L2 miss is serviced:
- All instructions in the machine are flushed, INV bits are reset, and the runahead cache is flushed.
- The processor restores the architectural state as it was before the runahead-causing instruction was fetched.
- Architecturally, NOTHING happened; but hopefully useful prefetch requests were generated (caches warmed up).
- The mode is switched to normal mode; instructions executed in runahead mode are re-executed in normal mode.
(A C sketch of this entry/exit sequence appears after the front-end charts below.)

When to Enter Runahead Mode
- Why not at the time an L2 miss happens? The missing instruction is not guaranteed to be a valid correct-path instruction until it becomes the oldest; the potential is limited (an L2-miss instruction becomes the oldest instruction 10 cycles later on average); and the state would need to be checkpointed at the oldest instruction (throw away all instructions older than the L2 miss?).
- Why not when the window becomes full? This delays the removal of instructions from the window, which can result in slow progress in runahead mode, and there is no significant gain (the window becomes full 98% of the time after we see an L2 miss).
- Why not on L1 cache misses? More than 50% of L1 cache misses hit in the L2 cache, leading to many short runahead periods.

When to Exit Runahead Mode
- Why not exit early to fill the pipeline and the window? It is unclear how to determine "how early" (memory does not have a fixed latency), and it reduces the progress made in runahead mode. Even exiting early using oracle information performs worse than exiting when the miss returns.
- Why not exit late to make further progress in runahead? Staying in runahead longer is not necessarily beneficial; on average, this policy hurts performance. But more intelligent runahead period extension schemes improve performance.

Modifications to Pipeline
[Diagram: processor pipeline with the runahead additions highlighted — INV bits in the uop queues, register files, and store buffer; checkpointed state alongside the frontend/backend RATs; and the runahead cache next to the store buffer. Total: <0.05% area overhead.]

Effect of a Better Front-end
[Chart: micro-operations per cycle for the base machine and runahead, each with a realistic front-end, a perfect trace cache (TC), and a perfect TC plus perfect branch predictor (BP), per suite. Runahead's average improvement grows from 22% to 27% with a perfect TC and to 31% with a perfect TC/BP.]
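Not from the talk: a condensed C sketch of the runahead entry/exit sequence described above. The types and helper names are illustrative stubs standing in for the real microarchitectural actions.

    #include <stdint.h>

    typedef enum { NORMAL, RUNAHEAD } Mode;

    typedef struct {
        Mode     mode;
        uint64_t checkpoint_pc;   /* address of the runahead-causing load */
    } Core;

    /* Stubs standing in for the real microarchitectural actions. */
    static void checkpoint_arch_regs(Core *c)     { (void)c; }
    static void mark_dest_inv_and_remove(Core *c) { (void)c; }
    static void flush_pipeline(Core *c)           { (void)c; }
    static void reset_inv_bits(Core *c)           { (void)c; }
    static void flush_runahead_cache(Core *c)     { (void)c; }
    static void restore_arch_regs(Core *c)        { (void)c; }

    /* Entry: the oldest instruction in the window is an L2-miss load. */
    void enter_runahead_mode(Core *c, uint64_t miss_load_pc) {
        checkpoint_arch_regs(c);          /* save architectural register state */
        c->checkpoint_pc = miss_load_pc;  /* remember where to restart */
        mark_dest_inv_and_remove(c);      /* load's dest <- INV; load removed */
        c->mode = RUNAHEAD;
    }

    /* Exit: the runahead-causing L2 miss has been serviced. */
    void exit_runahead_mode(Core *c) {
        flush_pipeline(c);         /* discard everything executed in runahead */
        reset_inv_bits(c);
        flush_runahead_cache(c);
        restore_arch_regs(c);      /* state as before the load was fetched */
        c->mode = NORMAL;          /* re-fetch from checkpoint_pc; runahead
                                      instructions are re-executed normally */
    }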
Why is Runahead Better with a Better Front-end?
A better front-end provides more correct-path instructions (hence more, and more accurate, L2 misses) in runahead periods:
- Average number of instructions during runahead: 711 (431 before a mispredicted INV branch); with a perfect TC/BP this average increases to 909.
- Average number of L2 misses during runahead: 2.6 (2.38 before a mispredicted INV branch); with a perfect TC/BP this average increases to 3.18.
- If all INV branches were resolved correctly during runahead, the performance gain would be 25% instead of 22%.

Importance of Store-Load Communication
[Chart: micro-operations per cycle for the baseline, runahead without communication through memory, and runahead with the runahead cache, per suite. The runahead cache is essential; per-suite improvement labels range from 24% to 90%.]

In-order vs. Out-of-order
[Chart: micro-operations per cycle for in-order and out-of-order baselines, each with and without runahead, per suite. Runahead improves both machines; per-suite improvement labels range from 10% to 73%.]

Sensitivity to L2 Cache Size
[Chart: micro-operations per cycle with and without runahead for 0.5 MB, 1 MB, and 4 MB L2 caches, per suite. Runahead's improvement persists at all sizes, shrinking as the cache grows (per-suite labels range from 6% to 48%).]

Instruction vs. Data Prefetching Benefits
[Chart: micro-operations per cycle for no runahead, runahead without its instruction prefetching benefit, and runahead with all benefits, per suite. Most of runahead's benefit (per-suite labels of 55-97%) remains without the instruction prefetching component.]

Why Does Runahead Work?
- ~70% of instructions are VALID in runahead mode, and these values show periodic behavior.
- Runahead prefetching reduces L1, L2, and TC misses during normal mode:
  - Data misses: 18% decrease in normal-mode L1 misses (base: 13.7 per 1K uops); 33% of normal-mode L2 data misses are fully or partially covered by runahead prefetching (base L2 data miss rate: 4.3 per 1K uops); 15% of normal-mode L2 data misses are fully covered (these misses are never seen in normal mode).
  - Instruction misses: 3% decrease in normal-mode TC misses; 14% decrease in normal-mode L2 fetch misses (some of these are only partially covered by runahead requests).
- Overall increase in data misses: L2 misses increase by 5% (due to contention and useless prefetches).

Correlation Between L2 Miss Reduction and Speedup
[Scatter plot: % IPC increase vs. % of normal-mode L2 misses covered (fully or partially) by runahead prefetches; R^2 = 0.3893.]

Runahead on a More Aggressive Processor
[Chart: micro-operations per cycle for a 4MB L2, 256-entry window baseline vs. runahead, per suite. Gains shrink (per-suite labels of 1-27%) but remain.]

Further backup charts: Runahead on Future Model; Future Model with Perfect Front-end.

Baseline Alpha Processor
- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Aggressive stream-based prefetcher
- 32 DRAM banks; 32-byte wide processor-memory bus (4:1 frequency ratio); 128 outstanding misses; detailed memory model
Runahead vs. Large Windows (Alpha)
[Two charts: IPC vs. instruction window size (64 to 8192 entries) for the baseline and runahead, at 500-cycle and 1000-cycle memory latencies.]

In-order vs. Out-of-order Execution (Alpha)
[Chart: IPC vs. memory latency (100-1900 cycles) for OOO+RA, OOO, IO+RA, and IO configurations.]

Further Alpha backup charts: Comparison to 1024-entry Window; Runahead vs. HWP (Alpha); Effect of Memory Latency (Alpha); 1K & 2K Memory Latency (Alpha).

Efficient Runahead (backup)

Methods for Efficient Runahead Execution
- Eliminating inefficient runahead periods
- Increasing the usefulness of runahead periods
- Reuse of runahead results
- Value prediction of L2-miss load instructions
- Optimizing the exit policy from runahead mode

Impact on Efficiency
[Chart: increase in executed instructions and in IPC over the baseline OOO machine, for baseline runahead and for the short, overlapping, useless, and combined (short+overlapping+useless) filters. Baseline runahead: 26.5% more instructions for a 22.6% IPC gain; the filters reduce the instruction increase (bar labels include 20.1%, 15.3%, 14.9%, 11.8%, and 6.7%) while preserving most of the IPC gain.]

Extra Instructions / Performance with Efficient Runahead
[Two charts: increase in executed instructions and increase in IPC vs. memory latency (100-900 cycles) for baseline runahead and all techniques.]

Cache Sizes
[Two charts: increase in executed instructions and increase in IPC for 512 KB, 1 MB, 2 MB, and 4 MB L2 caches, for baseline runahead and all techniques.]

Turning Off the FPU in Runahead Mode
- FP instructions do not contribute to the generation of load addresses, so they can be dropped after decode.
- Spares processor resources for more useful instructions; increases performance by enabling faster progress; enables dynamic/static energy savings.
- Results in an unresolvable branch misprediction if a mispredicted branch depends on an FP operation (rare).
- Overall: increases IPC and reduces executed instructions.

HWP Update Policy in Runahead Mode
Aggressive hardware prefetching in runahead mode may hurt performance if the prefetcher accuracy is low; runahead requests are more accurate than prefetcher requests. Three policies:
- Do not update the prefetcher state.
- Update the prefetcher state just like in normal mode.
- Only train existing streams, but do not create new streams.
Runahead mode improves the timeliness of the prefetcher in many benchmarks; only training the existing streams is the best policy.

Early INV Wake-up
Keep track of the INV status of an instruction in the scheduler; the scheduler wakes up the instruction if any source is INV (see the sketch after this list).
+ Enables faster progress during runahead mode by removing useless INV instructions faster.
- Increases the number of executed instructions.
- Increases the complexity of the scheduling logic.
Not worth implementing due to the small IPC gain.
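Not from the talk: a minimal sketch of early INV wake-up in the scheduler (structure names illustrative). An instruction becomes ready as soon as any source is known INV, because its result will be INV regardless of the remaining sources.

    #include <stdbool.h>

    typedef struct {
        bool src_ready[2];   /* source value available */
        bool src_inv[2];     /* source known to be INV */
    } SchedEntry;

    /* Conventional wake-up waits for all sources. Early INV wake-up:
       if any source is INV the result is INV anyway, so the instruction
       can issue (and be dropped) immediately. */
    bool ready_to_issue(const SchedEntry *e) {
        if (e->src_inv[0] || e->src_inv[1])
            return true;                        /* early INV wake-up */
        return e->src_ready[0] && e->src_ready[1];
    }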
Further efficiency backup slides: Short Runahead Periods; RCST Counter; Sampling for Useless Periods; Efficiency Techniques (Extra Instructions); Efficiency Techniques (IPC Increase); Effect of Memory Latency on Efficient Runahead; Usefulness of Runahead Periods; L2 Misses Per Useful Runahead Period; Other Considerations for Efficient Runahead.

Performance Potential of Result Reuse
An ideal reuse study, to determine the upper bound on the performance gain possible by reusing the results of runahead instructions:
- Valid pseudo-retired runahead instructions magically update architectural state during normal mode; they do not consume any resources (ROB or buffer entries).
- Only invalid pseudo-retired runahead instructions are re-executed; they are fed into the renamer (the fetch/decode pipeline is skipped).

Ideal Reuse of All Valid Runahead Results
[Chart: micro-operations per cycle for the baseline, runahead, and runahead + ideal reuse, per suite. Even ideal reuse adds little on top of runahead (per-suite labels of 2-8%).]

Further reuse/value-prediction backup slides: Alpha Reuse IPCs; Number of Reused Instructions; Why Does Reuse Not Work?; IPC Increase with Reuse; Extra Instructions with Reuse; Runahead Period Statistics; Mem Latency and BP Accuracy in Reuse; Runahead + VP (Extra Instructions); Runahead + VP (IPC Increase); Late Exit from Runahead (Extra Instructions); Late Exit from Runahead (IPC Increase).

AVD Prediction and Optimizations

Identifying Address Loads in Hardware
Observation: if the AVD is too large, the value being loaded is likely NOT an address. So only predict loads that have satisfied:
    -MaxAVD < AVD < +MaxAVD
This identification mechanism eliminates almost all data loads from consideration and enables the AVD predictor to be small.

An Implementable AVD Predictor
- Set-associative prediction table; each entry consists of a tag (PC of the load), the last AVD seen for the load, and a confidence counter for the recorded AVD.
- Updated when an address load is retired in normal mode; accessed when a load misses in the L2 cache in runahead mode.
- Recovery-free: no need to recover the state of the processor or the predictor on a misprediction, because runahead mode is purely speculative.
(A C sketch of the table logic follows the properties slides below.)

AVD Update Logic
[Diagram: on retirement of a load, compute AVD = Effective Addr - Data Value; if -MaxAVD <= AVD <= +MaxAVD (a valid AVD), look up the tagged entry by the load's PC and update or reset the confidence counter and stored AVD.]

AVD Prediction Logic
[Diagram: on an L2-miss load in runahead mode, the table is indexed by the load's PC; on a tag match with sufficient confidence, Predicted Value = Effective Addr - stored AVD; otherwise no prediction is made (the destination stays INV).]

Properties of Traversal-based AVDs
- Stable AVDs can be captured with a stride value predictor.
- Stable AVDs disappear with the re-organization of the data structure (e.g., sorting): the distance between nodes is no longer constant.
- Stability of AVDs is dependent on the behavior of the memory allocator: allocation of contiguous, fixed-size chunks is useful.

Properties of Leaf-based AVDs
- Stable AVDs cannot be captured with a stride value predictor.
- Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting): the distance between a node and its string is still constant.
- Stability of AVDs is dependent on the behavior of the memory allocator.
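Not from the talk: a direct-mapped C sketch of the predictor just described. The table organization, MaxAVD bound, and counter thresholds are illustrative; the thesis' predictor is a 16-entry set-associative table.

    #include <stdbool.h>
    #include <stdint.h>

    #define AVD_ENTRIES 16
    #define MAX_AVD     64        /* illustrative MaxAVD bound */
    #define CONF_MAX    3         /* 2-bit confidence counter */
    #define CONF_THRESH 2

    typedef struct {
        uint64_t tag;             /* PC of the load */
        int64_t  avd;             /* last AVD seen */
        int      conf;            /* confidence in the stored AVD */
    } AVDEntry;

    static AVDEntry avd_table[AVD_ENTRIES];

    /* Normal-mode retirement of a load: train the predictor. */
    void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t data) {
        int64_t avd = (int64_t)(eff_addr - data);
        if (avd <= -MAX_AVD || avd >= MAX_AVD) return;  /* likely a data load */
        AVDEntry *e = &avd_table[pc % AVD_ENTRIES];
        if (e->tag == pc && e->avd == avd) {
            if (e->conf < CONF_MAX) e->conf++;          /* same AVD: strengthen */
        } else {
            e->tag = pc; e->avd = avd; e->conf = 0;     /* new/changed: reset */
        }
    }

    /* Runahead-mode L2 miss: predict the load's value if confident. */
    bool avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *pred_value) {
        AVDEntry *e = &avd_table[pc % AVD_ENTRIES];
        if (e->tag == pc && e->conf >= CONF_THRESH) {
            *pred_value = eff_addr - (uint64_t)e->avd;  /* value = EA - AVD */
            return true;
        }
        return false;   /* no prediction: destination stays INV */
    }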
AVD Prediction vs. Stride Value Prediction
Performance:
- Both can capture traversal address loads with stable AVDs.
- Stride VP cannot capture leaf address loads with stable AVDs (e.g., health, mst, parser).
- The AVD predictor cannot capture data loads with striding data values (e.g., treeadd). Predicting these can be useful for the correct resolution of mispredicted L2-miss dependent branches (e.g., parser).
Complexity:
- The AVD predictor requires far fewer entries (only address loads).
- AVD prediction logic is simpler (no stride maintenance).

AVD vs. Stride VP Performance
[Chart: normalized execution time for AVD, stride, and hybrid predictors at 16 and 4096 entries. At 16 entries: AVD 12.1%, stride 2.5%, hybrid 4.5% execution time reduction; at 4096 entries: AVD 13.4%, stride 12.6%, hybrid 16%.]
[Chart: the same comparison excluding health; the trends hold with smaller magnitudes (reductions between 2.7% and 8.6%).]

AVD vs. Stream Prefetching (Performance)
[Chart: normalized execution time for AVD prediction, stream prefetching, and AVD + stream, at prefetch distances 32 and 8. AVD alone: 12.1%; stream: 13.4-16.5%; combined: 20.1-22.5% reduction.]

AVD vs. Stream Prefetching (L2 bandwidth)
[Chart: normalized number of L2 accesses. AVD adds only 5.1%; stream prefetching adds 24.5-26%; combined, 32.8-35.3%.]

AVD vs. Stream Prefetching (Memory bandwidth)
[Chart: normalized number of memory accesses. AVD adds only 3.2%; stream prefetching adds 12.1-14.9%; combined, 16.4-19.5%.]

Source Code Optimization for AVD Prediction
(a) Base source code that allocates the nodes of the dictionary (binary tree) and the strings. The string is allocated FIRST, then its Dict_node:

    struct Dict_node {
        char *string;
        Dict_node *left, *right;
        // ...
    };

    char *get_a_word(...) {
        // read a word from file
        s = (char *) xalloc(strlen(word) + 1);   // string allocated FIRST
        strcpy(s, word);
        return s;
    }

    Dict_node *read_word_file(...) {
        // ...
        char *s;
        Dict_node *dn;
        while ((s = get_a_word(...)) != NULL) {
            dn = (Dict_node *) xalloc(sizeof(Dict_node));  // Dict_node allocated NEXT
            dn->string = s;
            // ...
        }
        return dn;
    }

(b) Modified source code (modified lines marked). The Dict_node is allocated FIRST, then its string, so each node and its string are always a fixed distance apart:

    struct Dict_node {
        char *string;
        Dict_node *left, *right;
        // ...
    };

    char *get_a_word(..., Dict_node **dn) {               // modified
        // read a word from file
        *dn = (Dict_node *) xalloc(sizeof(Dict_node));    // modified: Dict_node FIRST
        s = (char *) xalloc(strlen(word) + 1);            // string allocated NEXT
        strcpy(s, word);
        return s;
    }

    Dict_node *read_word_file(...) {
        // ...
        char *s;
        Dict_node *dn;
        while ((s = get_a_word(..., &dn)) != NULL) {      // modified
            dn->string = s;
            // ...
        }
        return dn;
    }
Effect of Code Optimization (parser)
[Chart: normalized execution time of parser for the base and modified binaries, each with runahead alone and with runahead + 16-entry AVD prediction. The improvement labels are 6.4% for the base binary and 10.5% for the modified binary.]

Accuracy/Coverage with Code Optimization
[Chart: AVD prediction coverage and accuracy (percent) for the base vs. modified binary.]

AVD and Efficiency Techniques
[Chart: normalized execution time on the pointer-intensive benchmarks (bisort, health, mst, perimeter, treeadd, tsp, voronoi, mcf, parser, twolf, vpr; AVG) for no runahead, runahead, AVD (16-entry), efficiency techniques, and AVD + efficiency.]
[Chart: the corresponding normalized number of executed instructions for the same configurations.]

Further backup slides: Effect of AVD on Runahead Periods; AVD Example from treeadd; AVD Example from parser; AVD Example from health; Motivation for NULL-value Optimization; Effect of NULL-value Optimization; Related Work; Methodology; Future Research Directions.