Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
Onur Mutlu, Yale N. Patt, Jared Stark, Chris Wilkerson

Outline
• Motivation
• Overview
• Mechanism
• Experimental Evaluation
• Conclusions

Motivation
• Out-of-order processors require very large instruction windows to tolerate today's main memory latencies.
  – This is true even in the presence of caches and prefetchers.
• As main memory latency (in terms of processor cycles) increases, the instruction window size should also increase to fully tolerate the memory latency.
• Building a very large instruction window is not an easy task.

Small Windows: Full-window Stalls
• Instructions are retired in order from the instruction window to support precise exceptions.
• When a very long-latency instruction is not complete, it blocks retirement, and incoming instructions fill the instruction window if the window is not large enough.
• The processor cannot place new instructions into the window if the window is already full. This is called a full-window stall.
• L2 misses are responsible for most full-window stalls.

Impact of Full-window Stalls
[Chart: percentage of cycles spent in full-window stalls for three machine models (L2 size, instruction window size) — 512 KB L2 with a 128-entry window (IPC 0.77), perfect L2 with a 128-entry window (IPC 1.69), and 512 KB L2 with a 2048-entry window (IPC 1.15).]

Overview of Runahead Execution
• During a significant percentage of full-window stall cycles, no work is performed in the processor.
• Runahead execution unblocks the full-window stall caused by a long-latency L2-miss instruction.
• Enter runahead mode when the oldest instruction is an L2-miss load, and remove that load from the processor.
• While in runahead mode, keep processing instructions without updating architectural state and without blocking the instruction window due to L2 misses.
• When the original load miss returns, resume normal-mode execution starting with the runahead-causing load.

Benefits of Runahead Execution
• Loads and stores independent of L2-miss instructions generate useful prefetch requests:
  – From main memory to L2
  – From L2 to L1
• Instructions on the predicted program path are prefetched into the trace cache and L2.
• Hardware prefetcher tables are trained using future memory access information. The prefetcher also runs ahead along with the processor.

Mechanism

Entry into Runahead Mode
When an L2-miss load instruction reaches the head of the instruction window:
• The processor checkpoints the architectural register state, the branch history register, and the return address stack.
• The processor records the address of the L2-miss load.
• The processor enters runahead mode.
• The L2-miss load marks its destination register as invalid (INV) and is removed from the instruction window.

Processing in Runahead Mode
• Two types of results are produced: INV (invalid) and VALID.
• The first INV result is produced by the L2-miss load that caused entry into runahead mode.
• An instruction produces an INV result
  – if it sources an INV result, or
  – if it misses in the L2 cache (a prefetch request is generated).
• INV results are marked using INV bits in the register file, store buffer, and runahead cache.
  – INV bits prevent the introduction of bogus data into the pipeline.
  – Bogus values are not used for prefetching or branch resolution.

Pseudo-retirement in Runahead Mode
• An instruction is examined for pseudo-retirement when it reaches the head of the instruction window.
• An INV instruction is removed from the window immediately.
• A VALID instruction is removed when it completes execution; it updates only the microarchitectural state.
• Pseudo-retired instructions free their allocated resources.
• Pseudo-retired runahead stores communicate their data and INV status to dependent runahead loads.
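To make the INV-propagation rules above concrete, here is a minimal Python sketch; it is illustrative only, not the authors' hardware or simulator. The names Instruction, RegisterFile, runahead_step, l2_hit, and issue_prefetch are assumptions introduced for this sketch; data values are elided, and only the register-file INV bits are modeled (the store buffer and runahead cache carry analogous INV bits).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Instruction:
    src_regs: List[int]            # architectural source registers
    dest_reg: Optional[int]        # destination register (None for stores, branches)
    is_load: bool = False
    address: Optional[int] = None  # load address, if any

class RegisterFile:
    """Register file augmented with one INV bit per register."""
    def __init__(self, num_regs: int = 16):
        self.inv = [False] * num_regs

def runahead_step(inst: Instruction, regs: RegisterFile,
                  l2_hit: Callable[[int], bool],
                  issue_prefetch: Callable[[int], None]) -> bool:
    """Return True if the instruction produces an INV result.
    Only the INV dataflow is modeled; data values are elided."""
    result_is_inv = any(regs.inv[r] for r in inst.src_regs)  # INV sources poison the result

    if inst.is_load and not l2_hit(inst.address):
        issue_prefetch(inst.address)   # the L2 miss becomes a useful prefetch request
        result_is_inv = True           # the load's data is unavailable, so it is INV

    if inst.dest_reg is not None:
        regs.inv[inst.dest_reg] = result_is_inv
    return result_is_inv

# Example: a load into r1 misses in the L2, so r1 becomes INV and a prefetch
# is issued; a dependent add into r2 then sources r1 and is INV as well.
regs = RegisterFile()
load = Instruction(src_regs=[], dest_reg=1, is_load=True, address=0x40)
add  = Instruction(src_regs=[1, 3], dest_reg=2)
runahead_step(load, regs, l2_hit=lambda a: False, issue_prefetch=print)
runahead_step(add, regs, l2_hit=lambda a: True, issue_prefetch=print)
```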
Runahead Cache
• An auxiliary structure that holds the data values and INV bits for memory locations modified by pseudo-retired runahead stores.
• Its purpose is memory communication during runahead mode.
• Runahead loads access the store buffer, the runahead cache, and the L1 data cache in parallel.
• The runahead cache is very small (512 bytes).

Runahead Branches
• Runahead branches use the same predictor as normal branches.
• VALID branches are resolved and trigger recovery if mispredicted.
• INV branches cannot be resolved.

Exit from Runahead Mode
When the data for the instruction that caused entry into runahead mode returns from main memory:
• All instructions in the machine are flushed.
• INV bits are reset, and the runahead cache is flushed.
• The processor restores the state as it was before the runahead-inducing instruction was fetched.
• The processor starts fetch beginning with the runahead-inducing L2-miss instruction.
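Before the evaluation, here is a similarly minimal sketch of the memory communication described on the Runahead Cache slide. StoreBuffer, RunaheadCache, runahead_load, and l1_read are illustrative names introduced for this sketch; a dictionary keyed by address stands in for the 512-byte hardware structure, age checks inside the store buffer are omitted, and the parallel probe of the three structures is shown as a sequential priority for clarity.

```python
from typing import Callable, Dict, Optional, Tuple

Entry = Tuple[int, bool]   # (data value, INV bit); INV data must not be used

class StoreBuffer:
    """Runahead stores that have not pseudo-retired yet (age checks omitted)."""
    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}

    def write(self, addr: int, value: int, is_inv: bool) -> None:
        self.entries[addr] = (value, is_inv)

    def lookup(self, addr: int) -> Optional[Entry]:
        return self.entries.get(addr)

class RunaheadCache:
    """Data and INV bits written by pseudo-retired runahead stores
    (a dictionary stands in for the small 512-byte hardware structure)."""
    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}

    def write(self, addr: int, value: int, is_inv: bool) -> None:
        self.entries[addr] = (value, is_inv)

    def read(self, addr: int) -> Optional[Entry]:
        return self.entries.get(addr)

    def flush(self) -> None:
        self.entries.clear()            # discarded on exit from runahead mode

def runahead_load(addr: int, store_buffer: StoreBuffer,
                  runahead_cache: RunaheadCache,
                  l1_read: Callable[[int], int]) -> Entry:
    """Resolve a runahead load; the hardware probes all three structures in
    parallel, shown here as a sequential priority."""
    hit = store_buffer.lookup(addr)     # matching store that has not pseudo-retired
    if hit is not None:
        return hit
    hit = runahead_cache.read(addr)     # data forwarded from pseudo-retired stores
    if hit is not None:
        return hit
    # Otherwise the load reads the data caches; an L2 miss would still make
    # the result INV and generate a prefetch, as in the previous sketch.
    return (l1_read(addr), False)
```

Because pseudo-retirement is in order, any matching store still in the store buffer is younger in program order than every store already in the runahead cache, which is why the store buffer is checked first in this sketch; flush() corresponds to discarding the runahead cache contents on exit from runahead mode.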
Experimental Evaluation

Baseline Processor
• 3-wide fetch, 29-stage pipeline
• 128-entry instruction window
• 32 KB, 8-way, 3-cycle L1 data cache, write-back
• 512 KB, 8-way, 16-cycle unified L2 cache, write-back
• Approximately 500-cycle penalty for L2 misses
• Streaming prefetcher (16 streams)

Benchmarks
• Selected traces out of a pool of 280 traces.
• Evaluated performance on those that gain at least 10% IPC improvement with a perfect L2 cache.
• 147 traces simulated for 30 million x86 instructions.
• Trace suites:
  – SPEC CPU95 (S95): 10 benchmarks, mostly FP
  – SPEC FP2k (FP00): 11 benchmarks
  – SPECint2k (INT00): 6 benchmarks
  – Internet (WEB): 18 benchmarks: SpecJbb, Webmark2001
  – Multimedia (MM): 9 benchmarks: mpeg, speech recognition, quake
  – Productivity (PROD): 17 benchmarks: Sysmark2k, Winstone
  – Server (SERV): 2 benchmarks: tpcc, timesten
  – Workstation (WS): 7 benchmarks: CAD, nastran, verilog

Performance of Runahead Execution
[Chart: IPC of four configurations (no prefetcher/no runahead, prefetcher only, runahead only, runahead and prefetcher) across the benchmark suites (S95, FP00, INT00, WEB, MM, PROD, SERV, WS) and their average; runahead with the prefetcher improves average IPC by 22% over the prefetcher-only baseline.]

Effect of Frontend on Runahead
• Average number of instructions during runahead: 711
  – Before a mispredicted INV branch: 431
• Average number of L2 misses during runahead: 2.6
  – Before a mispredicted INV branch: 2.38
• Runahead becomes more effective with a better frontend:
  – Real trace cache: 22% IPC improvement
  – Perfect trace cache: 27% IPC improvement
  – Perfect branch predictor and trace cache: 31% IPC improvement

Runahead vs. Large Windows
[Chart: IPC of a 128-entry window with runahead compared to 256-, 384-, and 512-entry windows without runahead, across the benchmark suites and their average.]

Importance of Store-Load Data Communication
[Chart: IPC of the baseline, runahead without the runahead cache, and runahead with the runahead cache, across the benchmark suites and their average.]

Conclusions
• Runahead execution results in a 22% IPC increase over the baseline processor with a 128-entry window and a streaming prefetcher.
• This is within 1% of the IPC of a 384-entry-window machine.
• Runahead and the streaming prefetcher interact positively.
• Store-load data communication through memory in runahead mode is vital for high performance.

Backup Slides

Added Bits & Structures
[Figure: hardware bits and structures added for runahead execution.]

Runahead-Prefetcher Interaction
[Chart: IPC of the four prefetcher/runahead configurations for individual benchmarks (tpcc, verilog, nastran, gcc, vortex, mgrid, apsi, ammp).]

Effect of a Better Frontend
[Chart: IPC of the baseline and of runahead with a perfect trace cache, and with a perfect trace cache and branch predictor, across the benchmark suites and their average.]

When do we see the L2 misses in Runahead?
[Chart: L2 data miss distances — cycles and instructions after the runahead-causing L2 miss for the 1st through 16th out-of-window miss.]

Where does the first L2 miss occur?
[Chart: distribution of the distance, in instructions from the runahead-causing L2 miss, at which the first out-of-window L2 miss occurs (buckets from 128–256 to 1536+).]

Instruction vs. Data Prefetching Benefit
[Chart: IPC of the baseline, runahead with no instruction-prefetching benefit, and runahead with all benefits, across the benchmark suites and their average.]

Runahead with a Larger L2 Cache
[Chart: IPC with and without runahead for 0.5 MB, 1 MB, and 4 MB L2 caches, across the benchmark suites and their average.]

Runahead on in-order?
[Chart: IPC of an in-order baseline, in-order with runahead, an out-of-order baseline, and out-of-order with runahead, across the benchmark suites and their average.]

Future Model Results
Future Model with Better Frontend