Value Prediction: Are(n't) We Done Yet?
Mikko Lipasti, University of Wisconsin-Madison
Keynote, 2nd Value Prediction Workshop, Oct. 10, 2004

Definition
- What is value prediction? Broadly, three salient attributes:
  1. Generate a speculative value (predict)
  2. Consume the speculative value (execute)
  3. Verify the speculative value (compare/recover)
- This subsumes branch prediction
- Focus here on operand values

Some History
- "Classical" value prediction: independently invented by 4 groups in 1995-1996
  1. AMD (Nexgen): L. Widigen and E. Sowadsky, patent filed March 1996, invented March 1995
  2. Technion: F. Gabbay and A. Mendelson, invented sometime in 1995, TR 11/96, US patent Sep 1997
  3. CMU: M. Lipasti, C. Wilkerson, J. Shen, invented Oct. 1995, ASPLOS paper submitted March 1996
  4. Wisconsin: Y. Sazeides, J. Smith, Summer 1996

Why?
- Possible explanations:
  1. Natural evolution from branch prediction
  2. Natural evolution from memoization
  3. Natural evolution from rampant speculation
     - Cache hit speculation
     - Memory independence speculation
     - Speculative address generation
  4. Improvements in tracing/simulation technology
     - Values, not just instructions & addresses
     - "There's a lot of zeroes out there." (C. Wilkerson)
     - TRIP6000 [A. Martin-de-Nicolas, IBM]

Publications by Year
- [Chart: cumulative value prediction publications, 1996-2004, broken out by venue (ISCA, MICRO, HPCA, Others) and in total; y-axis 0-70.]
- Excludes journals, workshops, compiler conferences

What Happened?
- Tremendous academic interest: dozens of research groups, papers, proposals
- No industry uptake: no present or planned CPU with value prediction
- Why?
  - Meager performance benefit (< 10%)
  - Power consumption: dynamic power for extra activity, static power (area) for prediction tables
  - Complexity and correctness: subtle memory ordering issues [MICRO '01], misprediction recovery [HPCA '04]

Performance?
- Relationship between timely fetch and value prediction benefit [Gabbay, ISCA]
  - Value prediction doesn't help when the result can be computed before the consumer instruction is fetched
  - High-bandwidth fetch helps
- Wide trace caches were studied in the late 1990s
  - But these have several negative attributes
  - Recent designs focus on frequency, not ILP
- High-bandwidth fetch is a red herring: it is more important to fetch the right instructions

Future Adoption?
- Classical value prediction will only make it in the context of a very different microarchitecture, one that explicitly and aggressively exposes ILP
- Promising trends
  - The deep pipelining craze appears to be over
  - The high-frequency mania appears to be over
  - Can't manage the design complexity; can't afford the power
- Architects are pursuing ILP once again; value prediction has another opportunity

What Value Prediction Begat
- Value prediction catalyzed a new focus on values in computation, which had not been studied before
- A whole new realm of research: Value-Aware Microarchitecture
  - Spans numerous subdisciplines
  - Significant industrial impact already
- Also, developments in supporting technologies
Value-Aware Microarchitecture
- Memory Hierarchy
  - Register file compression [several]
  - Cache compression [Gupta, Alameldeen]
  - Memory compression [e.g. IBM MXT]
  - Bandwidth compression: address and data bus encoding [Rudolph], initialization traffic [Lewis]
- Load/Store Processing
  - Load value prediction [numerous]
  - Fast address calculation [Austin]
  - Value-aware alias prediction [Onder]
  - Memory consistency [Cain]
- Execution Core
  - Value prediction
  - Operand significance: low power [Canal], execution bandwidth [Loh], bit-slicing [Pentium 4, Mestan]
  - Instruction reuse [Sodani]
  - Carry prediction [Circuit-level Speculation]
- Cache Coherence
  - Producer side: silent stores and temporally silent stores [Lepak], speculative lock elision [Rajwar; Wisc, UIUC]
  - Consumer side: load value prediction using stale lines [Lepak], "coherence decoupling" [Burger, Sohi, ASPLOS 04]

Supporting Technologies
- Value prediction presented some unique challenges:
  - Relatively low correct prediction rate (initially 40-50%)
  - Nontrivial misprediction rate with avoidable misprediction cost
- These drove study of:
  - Confidence prediction/estimation
    - First microarchitectural application of confidence estimation, though not widely credited or cited as such
    - Since studied for numerous applications, e.g. gating control speculation
  - Selective recovery [Sazeides Ph.D., Kim HPCA '04]
    - Numerous challenges in extending recovery to the entire window
  - Both have proved to be fruitful research areas
- Also stimulated development of software technology:
  - Value profiling
  - Value-based compiler optimizations
  - Run-time code specialization
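To make the predict/execute/verify loop and the confidence-gating idea concrete, here is a minimal sketch (not from the talk) of a last-value predictor guarded by small saturating confidence counters. The table size, indexing scheme, and threshold are illustrative assumptions, not parameters of any published design.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

// Minimal last-value predictor with 2-bit confidence counters.
// All sizes and thresholds below are illustrative assumptions.
class LastValuePredictor {
public:
    explicit LastValuePredictor(size_t entries = 4096)
        : values_(entries, 0), conf_(entries, 0) {}

    // Predict: return a value only when confidence is high enough.
    std::optional<uint64_t> predict(uint64_t pc) const {
        size_t i = index(pc);
        if (conf_[i] >= kThreshold) return values_[i];
        return std::nullopt;              // low confidence: do not speculate
    }

    // Verify/train: compare the committed result against the stored value.
    void update(uint64_t pc, uint64_t actual) {
        size_t i = index(pc);
        if (values_[i] == actual) {
            if (conf_[i] < 3) ++conf_[i]; // saturate at 3
        } else {
            conf_[i] = 0;                 // misprediction: lose confidence
            values_[i] = actual;          // remember the new last value
        }
    }

private:
    static constexpr uint8_t kThreshold = 2;
    size_t index(uint64_t pc) const { return (pc >> 2) % values_.size(); }
    std::vector<uint64_t> values_;
    std::vector<uint8_t> conf_;
};

int main() {
    LastValuePredictor vp;
    const uint64_t pc = 0x4000;           // hypothetical load PC
    for (int i = 0; i < 5; ++i) {
        auto guess = vp.predict(pc);
        uint64_t actual = 7;              // a highly value-local load
        std::cout << (guess ? "predicted" : "no prediction")
                  << (guess && *guess == actual ? " (correct)\n" : "\n");
        vp.update(pc, actual);
    }
}
```

Gating the prediction on the confidence counter is what keeps the (initially 40-50%) raw prediction accuracy from translating into a flood of costly mispredictions, which is the point of the slide above.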
Outline
- Some History
- Industry Trends
- Value-Aware Microarchitecture
- Case study: Memory Consistency [Trey Cain, ISCA 2004]
  - Conventional load queue microarchitecture
  - Value-based memory ordering
  - Replay-reduction heuristics
  - Performance evaluation
- Conclusions

Value-based Memory Consistency
- High ILP => large instruction windows
  - Larger physical register file
  - Larger scheduler
  - Larger load/store queues
  - All of which result in increased access latency
- Value-based Replay
  - If load queue scalability is a problem…who needs one!
  - Instead, re-execute load instructions a 2nd time in program order
  - Filter replays: heuristics reduce the extra cache bandwidth to 3.5% on average

Enforcing RAW dependences
- [Diagram: program order with execution order in parentheses: 1. store A (1), 2. store ? (3), 3. load A (2); the load executes before the second store's address is known.]
- Load queue contains load addresses
- Memory independence speculation
  - Hoist the load above an unknown store, assuming it is to a different address
  - Check correctness at store retirement: one search per store address calculation
  - If the address matches, the load is squashed

Enforcing memory consistency
- [Diagram: Processor p1 issues two loads of A in program order, executed 3rd and 1st; Processor p2 issues a store to A, executed 2nd. The older load executes last and receives the new value (RAW edge), while the younger load executed first and received the old value (WAR edge), violating consistency.]
- Two approaches
  - Snooping: search per incoming invalidate
  - Insulated: search per load address calculation

Load queue implementation
- [Diagram: the load queue is built from a load-address CAM plus a load meta-data RAM (load address, load age), with queue-management and squash-determination logic driven by external (snooped) requests/addresses and by store addresses/ages.]
- Number of write ports = load address calculation width
- Number of read ports = load + store address calculation width (+1)
- Current generation designs: 32-48 entries, 2 write ports, 2 (or 3) read ports

Load queue scaling
- Larger instruction window => larger load queue
  - Increases access latency
  - Increases energy consumption
- Wider issue width => more read/write ports
  - Also increases latency and energy

Related work: MICRO 2003
- Park et al., Purdue
  - Extra structure dedicated to enforcing memory consistency
  - Increase capacity through segmentation
- Sethumadhavan et al., UT-Austin
  - Add a set of filters summarizing the contents of the load queue

Keep it simple…
- Throw more hardware at the problem?
  - It must be designed, implemented, and verified
  - The execution core is already complicated
- The load queue checks for rare errors
  - Why not move error checking away from the execution core?

Value-based Consistency
- Pipeline: IF1 IF2 D … S EX WB REP CMP C (replay and compare stages added before commit)
- Replay: access the cache a second time, cheaply!
  - Almost always a cache hit
  - Reuse address calculation and translation
  - Share the cache port used by stores in the commit stage
- Compare: compare the new value to the original value; squash if the values differ
- This is value prediction!
  - Predict: access the cache prematurely
  - Execute: as usual
  - Verify: replay the load, compare values, recover if necessary

Rules of replay
1. All prior stores must have written their data to the cache
   - No store-to-load forwarding
2. Loads must replay in program order
3. If a load is squashed, it should not be replayed a second time
   - Ensures forward progress
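As a rough illustration of the replay-and-compare step, the check at commit might look like the sketch below. This is a simplified model over a toy flat memory; none of the names (Memory, LoadRecord, replay_and_check) come from the paper or from PHARMsim.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Toy memory and a committed-load record; structure is illustrative only.
using Memory = std::unordered_map<uint64_t, uint64_t>;

struct LoadRecord {
    uint64_t addr;
    uint64_t value;      // value obtained during out-of-order execution
    bool     replayed;   // rule 3: a squashed load is not replayed again
};

// Value-based replay at commit: prior stores have written the cache (rule 1),
// there is no store-to-load forwarding, and loads replay in program order
// (rule 2).  Returns true if the load must be squashed.
bool replay_and_check(const Memory& cache, LoadRecord& ld) {
    if (ld.replayed) return false;       // already verified once
    ld.replayed = true;
    uint64_t fresh = 0;
    auto it = cache.find(ld.addr);
    if (it != cache.end()) fresh = it->second;
    return fresh != ld.value;            // mismatch => squash and re-fetch
}

int main() {
    Memory cache = {{0x100, 42}};
    LoadRecord ok {0x100, 42, false};    // speculation was correct
    LoadRecord bad{0x100, 17, false};    // raced with a store: stale value
    std::cout << "load 1 squash? " << replay_and_check(cache, ok)  << "\n";
    std::cout << "load 2 squash? " << replay_and_check(cache, bad) << "\n";
}
```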
Replay reduction
- Replay costs
  - Consumes cache bandwidth (and power)
  - Increases reorder buffer occupancy
- Can we avoid these penalties?
  - Infer the correctness of certain operations
  - Four replay filters
  - These avoid checking our value prediction when in fact no value prediction occurred (the loaded value is known to be correct)
  - Similar to "constant prediction" in the initial work

No-Reorder filter
- Avoid replay if the load isn't reordered wrt other memory operations
- Can we do better?

Enforcing single-thread RAW dependences
- No-Unresolved-Store-Address Filter
  - Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
  - Works for intra-processor RAW dependences
  - Doesn't enforce memory consistency

Enforcing MP consistency
- No-Recent-Miss Filter
  - Avoid replay if there have been no cache line fills (to any address) while the load was in the instruction window
- No-Recent-Snoop Filter
  - Avoid replay if there have been no external invalidates (to any address) while the load was in the instruction window

Constraint graph
- Defined for sequential consistency by Landin et al., ISCA-18
- A directed graph represents a multithreaded execution
  - Nodes represent dynamic instruction instances
  - Edges represent their transitive orders (program order, RAW, WAW, WAR)
- If the constraint graph is acyclic, then the execution is correct

Constraint graph example - SC
- [Diagram: Proc 1 executes ST A then LD B (program order edge); Proc 2 executes ST B then LD A (program order edge); a RAW edge and a WAR edge between the processors close a cycle.]
- A cycle indicates that the execution is incorrect

Anatomy of a cycle
- [Diagram: the same cycle, annotated with the processor-observable events that accompany its inter-processor edges: an incoming invalidate (WAR edge) and a cache miss (RAW edge).]

Filter Summary (conservative to aggressive)
1. Replay all committed loads
2. No-Reorder Filter
3. No-Unresolved-Store / No-Recent-Miss Filter
4. No-Unresolved-Store / No-Recent-Snoop Filter
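One way the filters above could be expressed is as replay predicates over a few per-load status bits, as in the sketch below. The LoadState fields and function names are hypothetical, and the hardware bookkeeping that would actually set those bits is omitted; this is only meant to show how the four points in the summary differ.

```cpp
#include <iostream>

// Per-load status gathered between issue and commit (tracking logic omitted).
struct LoadState {
    bool reordered;               // issued out of order wrt other memory ops
    bool unresolved_prior_store;  // some prior store address unknown at issue
    bool miss_while_in_window;    // any cache line fill while in the window
    bool snoop_while_in_window;   // any external invalidate while in the window
};

// Conservative: replay every committed load.
bool replay_all(const LoadState&) { return true; }

// No-Reorder filter: replay only loads that were actually reordered.
bool no_reorder(const LoadState& ld) { return ld.reordered; }

// No-Unresolved-Store + No-Recent-Miss: replay if a prior store address was
// unresolved at issue (single-thread RAW) or a fill occurred while the load
// was in flight (possible consistency violation).
bool unresolved_or_recent_miss(const LoadState& ld) {
    return ld.unresolved_prior_store || ld.miss_while_in_window;
}

// No-Unresolved-Store + No-Recent-Snoop: the most aggressive filter studied.
bool unresolved_or_recent_snoop(const LoadState& ld) {
    return ld.unresolved_prior_store || ld.snoop_while_in_window;
}

int main() {
    LoadState quiet{true, false, false, false};  // reordered, nothing risky
    std::cout << "replay-all: "       << replay_all(quiet)                 << "\n"
              << "no-reorder: "       << no_reorder(quiet)                 << "\n"
              << "unresolved|miss: "  << unresolved_or_recent_miss(quiet)  << "\n"
              << "unresolved|snoop: " << unresolved_or_recent_snoop(quiet) << "\n";
}
```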
Outline
- Some History
- Industry Trends
- Value-Aware Microarchitecture
- Case study: Memory Consistency [Cain, ISCA]
  - Conventional load queue microarchitecture
  - Value-based memory ordering
  - Replay-reduction heuristics
  - Performance evaluation
- Conclusions

Base machine model
- PHARMsim: PowerPC execute-at-execute simulator with OoO cores and an aggressive split-transaction snooping coherence protocol
- Out-of-order execution core: 5 GHz, 15-stage, 8-wide pipeline; 256-entry reorder buffer; 128-entry load/store queue; 32-entry issue queue
- Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4/4); 4 L1 dcache load ports in the OoO window, 1 L1 dcache load/store port at commit
- Front-end: combined bimodal (16k-entry) / gshare (16k-entry) branch predictor with a 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB
- Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines; memory: 400-cycle / 100 ns best-case latency, 10 GB/s bandwidth; stride-based prefetcher modeled after Power4

% L1 dcache bandwidth increase
- [Chart: % L1 dcache bandwidth increase for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads under (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter.]
- On average, 3.4% bandwidth overhead using the no-recent-snoop filter

Value-based replay performance (relative to constrained load queue)
- [Chart: speedup for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads.]
- Value-based replay is 8% faster on average than a baseline using a 16-entry load queue

Does value locality help?
- Not much…
- Value locality does avoid memory ordering violations
  - 59% of single-thread violations avoided
  - 95% of consistency violations avoided
- But these violations rarely occur
  - ~1 single-thread violation per 100 million instructions
  - 4 consistency violations per 10,000 instructions

What About Power?
- Simple power model:
  ΔEnergy = #replays × (E_per_cache_access + E_per_word_comparison) + replay overhead - (E_per_ldq_search × #ldq_searches)
- Empirically: 0.02 replay loads per committed instruction
- If the load queue CAM energy per instruction exceeds 0.02 × the energy expenditure of a cache access and comparison, the value-based implementation saves power!
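A worked instance of the break-even comparison above: the 0.02 replays per committed instruction is the empirical figure quoted on the slide, every energy value is a made-up placeholder, and the replay-overhead term is dropped for brevity.

```cpp
#include <iostream>

// Hypothetical per-event energies in nanojoules, chosen only to illustrate
// the break-even condition in the simple power model.
int main() {
    const double replays_per_insn      = 0.02;  // from the talk
    const double e_cache_access_nj     = 0.50;  // assumed
    const double e_word_compare_nj     = 0.05;  // assumed
    const double e_ldq_search_nj       = 1.00;  // assumed (CAM search)
    const double ldq_searches_per_insn = 0.30;  // assumed (~load frequency)

    // Energy added by value-based replay, per committed instruction.
    double added = replays_per_insn * (e_cache_access_nj + e_word_compare_nj);

    // Energy removed by eliminating the associative load-queue search.
    double removed = e_ldq_search_nj * ldq_searches_per_insn;

    double delta = added - removed;     // replay-overhead term omitted
    std::cout << "added   " << added   << " nJ/insn\n"
              << "removed " << removed << " nJ/insn\n"
              << "delta   " << delta   << " nJ/insn "
              << (delta < 0 ? "(replay saves energy)\n"
                            : "(replay costs energy)\n");
}
```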
Value-based replay Pros/Cons
+ Eliminates the associative lookup hardware: the load queue becomes a simple FIFO
+ Negligible IPC or L1D bandwidth impact
+ Can be used to fix value prediction: enforces the dependence order consistency constraint [MICRO '01]
- Requires additional pipeline stages
- Requires additional cache datapath for loads

Conclusions
- Value prediction
  - Continues to generate lots of academic interest
  - Little industry uptake so far
  - Historical trends (narrow, deep pipelines) minimized the benefit; a sea change is underway on this front
  - Value prediction will be revisited in the quest for ILP; power consumption is key!
- Value-Aware Microarchitecture
  - Multiple fertile areas of research
  - Some has found its way into products
- Are we done yet? No!
- Questions?

Backups

Caveat: Memory Dependence Prediction
- Some predictors train using the conflicting store (e.g. the store-set predictor)
  - The replay mechanism is unable to pinpoint the conflicting store
- Fair comparison:
  - Baseline machine: store-set predictor with a 4k-entry SSIT and a 128-entry LFST
  - Experimental machine: simple 21264-style dependence predictor with a 4k-entry history table

Load queue search energy
- [Chart: access energy (nJ, 0-3.5) vs. number of entries (16-512) for rd2wr2, rd4wr4, and rd6wr6 port configurations.]
- Based on 0.09 micron process technology using Cacti v. 3.2

Load queue search latency
- [Chart: access latency (ns, 0-1.4) vs. number of entries (16-512) for rd2wr2, rd4wr4, and rd6wr6 port configurations.]
- Based on 0.09 micron process technology using Cacti v. 3.2

Benchmarks
- MP (16-way)
  - Commercial workloads (SPECweb, TPC-H)
  - SPLASH2 scientific application (ocean)
- UP
  - A few from SPECint2000
  - 3 from SPECfp2000: apsi, art, wupwise
  - 3 commercial: SPECjbb2000, TPC-B, TPC-H
  - Selected due to high reorder buffer utilization
- Error bars signify 95% statistical confidence

Life cycle of a load
- [Diagram: a load flows through the OoO execution window and is entered into the load queue; a later store to the same address A matches the completed load's queue entry ("Blam!") and forces a squash.]

Performance relative to unconstrained load queue
- Good news: replay with the no-recent-snoop filter is only 1% slower on average

Reorder-Buffer Utilization
- [Chart only.]

Why focus on the load queue?
- The load queue has different constraints than the store queue
  - More loads than stores (30% vs. 14% of dynamic instructions)
  - The load queue is searched more frequently (consuming more power)
  - Store-forwarding logic is performance-critical
- Many non-scalable structures in an OoO processor
  - Scheduler
  - Physical register file
  - Register map

Prior work: formal memory model representations
- Local, WRT, and global "performance" of memory ops (Dubois et al., ISCA-13)
- Acyclic graph representation (Landin et al., ISCA-18)
- Modeling a memory operation as a series of sub-operations (Collier, RAPA)
- Acyclic graph + sub-operations (Adve, thesis)
- Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)

Some History
- "Classical" value prediction: independently invented by 4 groups in 1995-1996
  1. AMD (Nexgen): L. Widigen and E. Sowadsky, patent filed March 1996, invented March 1995
  2. Technion: F. Gabbay and A. Mendelson, invented sometime in 1995, TR 11/96, US patent Sep 1997
  3. CMU: M. Lipasti, C. Wilkerson, J. Shen, invented Oct. 1995, ASPLOS paper submitted March 1996
  4. Wisconsin: Y. Sazeides, J. Smith, Summer 1996
- Overlaid on the slide, an email received by the author:

  From: Larry.Widigen@amd.com (Larry Widigen)
  Date: Wed, 14 Aug 96 10:33:12 PDT
  To: Mikko_H._Lipasti@cmu.edu
  Subject: www location of paper

  I would like to review your forthcoming paper, "Value Locality and Load Value Prediction." Could you provide a www address where it resides? I am curious as to its contents since its title suggests that it may discuss an area where I have done some work.

  Cordially,
  Larry Widigen
  Manager of Processor Development