Pre-Execution via Speculative Data-Driven Multithreading
Amir Roth
University of Wisconsin, Madison
August 10, 2001

Explanation of Title Slide
• pre-execution: a new way of extracting additional instruction-level parallelism (ILP), and hence performance, from ordinary sequential programs
• speculative data-driven multithreading (DDMT): an implementation of pre-execution
• thesis: pre-execution and DDMT are good
Amir Roth Pre-Execution via Speculative Data-Driven Multithreading 2

Summary of Contributions
• pre-execution (concept)
  – idea: execute to get unpredictable but performance-critical values
  – technology: proactive out-of-order sequencing, decoupling
• DDMT (proposed implementation)
  – idea: extend superscalar design, siphon pre-execution bandwidth
  – technology: register integration
• algorithm for selecting what to pre-execute (framework)
  – idea: automatically select from computations executed by the program
  – technology: pre-execution benefit-cost function
• performance evaluation (of framework and implementation)

Why Am I Doing This?
• still need higher performance
  – new app's, better performance on existing app's
• still need higher sequential-program performance
  – many out there; parallel/MT programs composed of sequential code
• need parallelism to complement frequency
  – frequency getting harder, performance returns diminishing
• need ILP to complement [program, thread, bit]-LP

Outline
• motivation and introduction to pre-execution
  – problem we are solving and potential gain of solving it (3 slides)
  – pre-execution basics (5 slides)
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace

ILP: Incumbent Model
• von Neumann: retire instructions in (program) order
• performance: execute useful instructions each cycle
• the superscalar way
  – examine a "sliding window", dataflow execution within window
  – in-order retirement implements sequential semantics
  – out-of-order execution increases useful execution rate
  – in-order fetch establishes data dependences
• performance loss: not enough ready-to-execute useful instructions in window
[Figure: sliding window between in-order fetch and in-order retire]

Value Latency and PDIs
• problem: value latency
  – need correct value faster than execution can supply
  – two important kinds: branch outcomes, load addresses
• branch prediction / address prediction (prefetching)
  – faster than execution, correct ~95% of the time
  – last 5% are performance-degrading instances (PDIs)
• effects of PDIs
  – branch mis-predictions: stall fetch of useful instructions
  – cache misses: execution latency stalls retirement, backpressure stalls fetch

Why Worry About 5%?
[Chart "Perfect Memory Latency and Branch Resolution": IPC of BASE vs. PERFECT across em3d, mst, bzip2, crafty, eon.c/k/r, gap, gcc, gzip, mcf, parser, perl.d/s, twolf, vortex, vpr.p/r]
• performance gain by "fixing" PDIs
  – 8-wide processor, 64KB L1, 1MB L2, 80-cycle memory latency
  – branches: not perfect prediction, perfect resolution (fix-up at rename)
  – not as good, but matches our implementation

Pre-Execution
• dilemma
  – need PDI values faster than execution
  – can only accurately get PDI values using execution
• pre-execution: execution that is faster than execution
  – part I: (pre-)execute PDIs faster than original program
  – part II: communicate pre-executed PDI values to original program
  – part II: cache for loads, ?? for branches (implementation dependent)
  – part I: the crux
  – important: pre-execute PDI computations, otherwise just guessing

Q: How to Execute Faster than Execution?
• A, part I: proactive, out-of-order sequencing (fetch)
  – execute (and fetch) fewer instructions
  – out-of-order: PDI computation only (not full program)
  – proactive: "know" PDI is coming and its computation
  – hoist computation (and its latency) arbitrary distances
• A, part II: decoupling
  – pre-execute PDI computation in separate "thread"
  – "move" stalls to pre-execution thread (inter-thread overlapping)
• proactive OoO sequencing + decoupling = pre-execution

Pre-Execution: Example
[Figure: master thread forks a data-driven thread (DDT, the p-e thread) at the fork point; the DDT fetches and executes only the PDI computation, ahead of the master thread]
• pre-executing computation
  – OoO sequencing: PDI computation fetched quickly, absorbs latency
  – decoupling: overlap memory latency with master thread
  – result communication: send results to master thread, helps with branches
• speedup

How does Processor "Know" PDI is Coming?
• DDT associated with trigger
  – master thread (MT) sees trigger, forks DDT
  – assume: MT will execute PDI via computation == DDT
  – choose DDT/trigger s.t. assumption is statistically true
• two other possibilities
  – (a) master may not execute computation matching DDT
  – (b) load may hit, branch may be correctly predicted
  – useless (actually harmful) pre-execution
  – take these probabilities into account

Role of Pre-Execution
• need for in-order sequencing
  – OoO sequence contains all dependences? can't tell
  – p-e results must be speculative
  – can't "interleave" two OoO sequences together
  – must in-order sequence full program once (at least)
  – OoO sequencing (p-e) must be redundant
• role of pre-execution
  – speculative: no correctness obligations
  – redundant: (relatively) high cost
  – pre-execute anything, just so performance improves

Outline
• intro to pre-execution
• pre-execution DDMT style
  – DDMT basics (1 slide)
  – intro to register integration (2 slides)
  – implicit data-driven sequencing (2 slides)
• automated computation selection
• DDMT microarchitecture
• experimental evaluation

Pre-Execution DDMT Style
[Figure: pipeline with DDT$, I$, CMIS, IT, rename, ROB, RS, PREG, D$]
• extend dynamically scheduled superscalar processor
  – DDT$ holds static DDTs
  – replicate register maps; CMIS manages, schedules DDT "injection"
  – DDT instructions look "normal", but are not put into ROB or retired; put into RS, read/write pregs, execute on FUs
  – IT implements register integration (result sharing via pregs)
  – centralized organization: no dedicated P-E bandwidth, steal as needed
  – bonus: bandwidth available to steal when P-E is needed most (overall ILP is low)

Register Integration
• master thread directly reuses pre-executed results
  – saves execution bandwidth, compresses dataflow graph
  – instant branch resolution: integrated mis-predicted branch resolved at register renaming time (at fetch would be better, but harder)
• register integration
  – like instruction reuse [Sodani] using pregs instead of lregs/values
  – reuse test: match PCs (op) & input pregs
  – reuse: map output to IT entry's output preg
  – map table manipulations only; pregs not read or written
  – pregs naturally track dependence chains (e.g., DDTs)
  RB: PC   Vin1  Vin2  Val      IT: PC   pin1  pin2  preg
      x18  0x44  4     0x48         x18  p7    -     p6
      x1c  -     0x48  17           x1c  p6    -     p9

Pre-Execution Reuse via Integration
[Figure: master thread renames R1=R1+4 (x18) and R2=ld[R1] (x2c); with map table R1→p7 copied at the fork, the IT entries (x18: p7→p6; x2c: p6→p9) let both instructions integrate the DDT's renamed results p6=p7+4 and p9=ld[p6]]
• DDT allocates pregs, creates IT entries
• master thread reads IT, integrates pregs allocated by DDT
• recursive process: 0x18 integration (p6) sets up 0x2c integration
• need two things to make this work
  – initial R1 mapping (p7) must match (copy map table on fork)
  – data dependences must match (instructions and PCs!)

Sequencing DDTs (OoO sequencing)
• Q: how to implement out-of-order sequencing?
  – how to sequence instructions with non-contiguous PCs?
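The reuse test behind register integration (match PC and already-renamed input pregs against an IT entry, then update the map table instead of executing) can be sketched as a small software model. This is a hypothetical simplification for illustration; the class and function names are mine, not the hardware's, and real integration happens in the rename stage, not in a dictionary.

```python
# Simplified model (illustrative, not the actual hardware) of the integration
# table (IT) reuse test: an instruction integrates a pre-executed result when
# its PC and renamed input pregs match an IT entry.

class IntegrationTable:
    def __init__(self):
        self.entries = {}            # (pc, input pregs) -> output preg

    def record(self, pc, in_pregs, out_preg):
        # DDT rename/execute creates one entry per renamed DDT instruction
        self.entries[(pc, tuple(in_pregs))] = out_preg

    def lookup(self, pc, in_pregs):
        # master-thread rename: a hit means the result already lives in a preg
        return self.entries.get((pc, tuple(in_pregs)))

def rename(pc, arch_srcs, map_table, it, alloc_preg):
    """Rename one master-thread instruction; returns (preg, integrated?)."""
    in_pregs = [map_table[r] for r in arch_srcs]
    hit = it.lookup(pc, in_pregs)
    if hit is not None:
        return hit, True             # integrated: map-table update only
    return alloc_preg(), False       # normal path: fresh preg, must execute
```

Note how integration is naturally recursive: once x18's output preg is written into the map table, x2c's renamed inputs match its IT entry too, exactly as in the slide's example.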
• implicit data-driven sequencing
  static DDT:  Trig:
               x18  R1=R1+4
               x2c  R2=ld[R1]
               x30  bz R2, x44
  – list DDT instructions, inject list as-is
  – processor doesn't "interpret" DDT branches
  – branches in DDTs only for subsequent integration
• DDTs are static and finite
  – good: natural overhead control, no "runaway" DDTs
  – bad: can't pre-execute loops (important for latency tolerance)

Faking Control-Flow in DDTs
• "trick" processor into pre-executing any control flow
  – processor doesn't interpret DDT branches
• faking conditional control
  – implicit conditional: pre-execute along common path
  – greedy conditional: pre-execute along both paths
• faking loop control (important for latency tolerance!)
  – unrolling: unroll a loop within a DDT
  – unoverlapped unrolling: powerful, difficult in DDMT
  – induction unrolling: unroll induction only
  unrolled DDT:  Trig:
                 x18  R1=R1+4
                 x18  R1=R1+4
                 x18  R1=R1+4
                 x2c  R2=ld[R1]
                 x30  bz R2, x44

Outline
• intro to pre-execution
• pre-execution DDMT style
• automated DDT selection
  – identifying problem instructions (2 slides)
  – optimizing DDTs for latency tolerance and overhead (cool, 2 slides)
  – merging DDTs to reduce overhead (ultra boring, 0 slides)
• DDMT microarchitecture
• experimental evaluation
• summary

Automated DDT Selection Algorithm
• goal: DDTs that hide the most PDI latency with the least overhead
• automated? integration requirement bounds search space
  – examine program traces
  – slice backwards from PDIs to enumerate DDTs
  – choose DDTs that maximize some benefit-cost function
• 3 steps
  – identify static problem instructions (PIs)
  – find DDTs for each PI
  – merge partially overlapping DDTs to reduce overhead
• implementation
  – H/W? S/W? VM?
  – (we model S/W, but leave other possibilities open)

Problem Instructions (PIs)
• impractical to pre-execute all PDIs
  – divide PDIs by static instruction
  – choose static problem instructions (PIs) with good "pre-executability"
  – only find DDTs for / pre-execute PDIs of PIs
• good pre-executability criteria
  – problem ratio: high ratio of PDIs (high miss/misprediction rate)
  – problem contribution: high PDI representation
  – problem latency: high latency per PDI

Potential of Pre-Executing PIs
[Chart "Performance Potential of Perfecting Problem Instructions": IPC of BASE vs. PERFECT PI vs. PERFECT across the same benchmarks]
• PI definition
  – contribution: 1 in 500 PDIs
  – ratio: 1 in 10 PDIs (10% miss/misprediction rate)
  – latency: 10 cycles for loads, 5 for branches

Selecting DDTs for a Single PI
• simple case
  – PDIs computed by one slice
  – choose the sub-slice that should be the DDT
• in general
  – multiple, partially overlapping slices
  – choose set of non-overlapping DDTs
  [Figure: candidate triggers ("Trig?") at several points above the slice: x18 R1=R1+4 (×3); x2c R2=ld[R1]; x30 bz R2, x44]
• approach
  – mantra: maximize latency tolerance, minimize overhead
  – aggregate (over all pre-executions)
  – compute benefit-cost (LT − OH) for each potential DDT
  – choose DDT with maximum benefit-cost

Pre-Execution Benefit-Cost Function
• aggregate advantage: ADV_AGG = LT_AGG − OH_AGG
• LT_AGG = # PDIs covered × LT_PDI
  – don't count pre-executions for cache hits (no latency to tolerate)
  – LT_PDI = EXECT_MT − EXECT_DDT, using sequencing-constrained dataflow height (SCDH) to approximate EXECT
  – LT_PDI <= problem latency (can't tolerate more latency than there is)
• OH_AGG = # pre-executions × OH_PE
  – all pre-executions count: for cache hits, for no corresponding loads
  – OH_PE = rename bandwidth consumed, in cycles (most direct OH measure)

Outline
• intro to pre-execution
• pre-execution DDMT style
• automated DDT selection
• DDMT microarchitecture
  – implementing register integration (1 slide)
  – other implementation notes (1 slide)
• experimental evaluation
• summary

Implementing Register Integration
• more physical registers
  – to keep pre-executed DDT results alive longer; pipestage++?
• integration circuit
  – looks like conventional register renaming dependence cross-check
  – IT may be M-way associative (M candidates per instruction)
  – circuit complexity: N·M² (M > 4 not advised); pipestage++?
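The selection algorithm's benefit-cost function can be restated as a few lines of arithmetic. This is my back-of-envelope sketch of the slide's formulas (ADV_AGG = LT_AGG − OH_AGG), not code from the thesis; the parameter names are mine.

```python
# Sketch of the pre-execution benefit-cost function ADV_AGG = LT_AGG - OH_AGG.
# Parameter names are illustrative; EXECT values would come from SCDH estimates.

def ddt_advantage(pdis_covered, pre_executions,
                  exect_mt, exect_ddt,
                  problem_latency, rename_cycles_per_pe):
    # latency tolerated per PDI: how much sooner the DDT finishes than the
    # master thread would, capped at the latency that actually exists
    lt_pdi = min(exect_mt - exect_ddt, problem_latency)
    lt_agg = pdis_covered * lt_pdi                   # only true PDIs add benefit
    oh_agg = pre_executions * rename_cycles_per_pe   # every pre-execution costs
    return lt_agg - oh_agg
```

The asymmetry matters: benefit only accrues on the fraction of pre-executions that cover actual PDIs, while every pre-execution (including those for cache hits or never-executed computations) pays the rename-bandwidth overhead.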
• integrating loads
  – DDTs may miss conflicting stores from master thread
  – not detected by integration (preg-based): coherence problem
  – solution: re-execute integrated loads & squash (alt: snoop)
  – learn which load integrations cause squashes, don't integrate them

Other Implementation Notes
• forking
  – only master thread can fork, no "chaining"
• injection scheduling
  – Q: how fast should DDTs be "injected"?
  – A: at dataflow speed, but no faster (approximate with DDT-1)
• stores and DDT memory communication
  – DDTs can contain stores, can't write DDT stores into D$
  – small queue (DDSQ): write DDT stores, direct the "right" DDT loads
• exceptions
  – buffer until instruction integrated (potentially abort rest of DDT)

Outline
• intro to pre-execution
• pre-execution DDMT style
• automated DDT selection
• DDMT microarchitecture
• experimental evaluation
  – numbers (3 slides)
  – more numbers (6 slides)
  – explanations of numbers (2 slides)

Experimental Framework
• SPECint2K, 2 Olden microbenchmarks, Alpha EV6, -O3 -fast
  – training runs, 10% sampling
• SimpleScalar-based simulation environment
  – 8-wide, superscalar, out-of-order
  – 128 ROB, 64 LDQ, 32 STQ, 80 RS
  – pipe: 3 fetch, 2 rename/integrate, 2 schedule, 2 reg read, 3 L1 hit
  – 32KB IL1 / 64KB DL1 (2-way), 1MB L2$ (4-way), memory bandwidth: 8 bytes/cycle
  – 1024 pregs, 1024-entry 4-way IT (baseline does squash reuse)
• methodology
  – DDT selection/DDMT: same input sample (for now)

DDMT Performance
[Chart "DDMT Performance": IPC of BASE vs. DDMT vs. PERFECT PI across the same benchmarks]
• microbenchmarks: +50%; SPEC2K: +1%–10%
  – avg. load latency reduced up to 20%
  – avg. branch resolution latency reduced up to 40%
  – 15%–40% of pre-executed instructions integrated
  – inability to use unoverlapped full unrolling hurts bzip2, gzip, mcf, vpr

Importance of Integration
[Chart "Effect of Register Integration": IPC of BASE, DDMT, NO BRANCH RESOLUTION, NO PRE-EXECUTION REUSE]
• no branch resolution: don't integrate pre-executed branches
  – −3% for branch pre-execution benchmarks: crafty, eon, vpr.p, twolf
• no pre-execution reuse: don't integrate any DDT result
  – hurts branch pre-execution & high-IPC programs; DDMT < base
  – scheduling/RS contention important in higher-IPC cases (intuitive)

Another Way of Measuring Overhead
[Chart "Overhead-less DDMT": IPC of BASE, DDMT, OVERHEAD-LESS DDMT, PERFECT PI]
• remove overhead from DDMT
  – renaming, scheduling, re-execution bandwidth free; limit master RS only
  – +2–3% performance
  – current IT formulation suppresses DDMT overhead (more later)

Sensitivity: DDT Selection Implementation
[Chart "Stability of DDMT across DDT Selection Inputs": IPC of BASE, LIMIT, OFFLINE, ONLINE]
• model via relationship of DDT selection input to DDMT input
  – limit: same input, H/W implementation? (default)
  – offline: different input, S/W implementation
  – online: different sample within same input, VM implementation
• DDT selection is input-insensitive; DDTs = f(program structure)

Sensitivity: Integration Associativity
[Chart "Impact of IT Associativity": IPC of BASE, DIRECT MAPPED, 2-WAY, 4-WAY (DEFAULT), 8-WAY, FULLY ASSOCIATIVE]
• high associativity increases successful integration rate
  – baseline (squash reuse only) not very sensitive to associativity
  – integration reduces RS/scheduler contention
  – low associativity interferes with unrolling (in current IT formulation)

Sensitivity: Unrolling Degree
[Chart "Impact of DDT Unrolling Degree": IPC of BASE, UNROLL2, UNROLL4 (DEFAULT), UNROLL8]
• max DDT length 64, fully associative IT (to avoid interference)
  – increased unrolling important for tolerating memory latencies
  – but increases OH
  – modulo unrolling, increased allowed DDT size/scope doesn't help
  – DDTs no longer than needed to tolerate the required latency

Sensitivity: Memory Latency
[Chart "Sensitivity to Memory Latency": IPC of BASE and DDMT at ML=70 and ML=140]
• double memory latency: 140 cycles
  – more latency per PDI
  – increase maximum unrolling degree to 8 (also 8-way associative IT)
  – higher relative speedups

Sensitivity: Cache Size
[Chart "Sensitivity to Cache Size": IPC of BASE and DDMT with DL1=64KB/L2=1MB vs. DL1=16KB/L2=256KB]
• cut cache sizes by 4: DL1=16KB, L2=256KB
  – more PDIs
  – roughly same absolute speedups, higher relative
  – increased contention

IT Formulation?
• old: IT doubles as ledger for pregs allocated by DDTs
  – if evicted from IT, preg is freed, downstream DDT destroyed
  – keeps overhead (RS contention) down
  – only need to re-execute integrated loads
  – restricts effective unrolling degree to IT associativity
  – requires incremental invalidations on IT (associative matches)
• new: decouple pre-execution from presence in IT
  – re-execute all integrated instructions
  – no incremental invalidations
  – downstream DDT not destroyed on IT eviction

Preliminary New-Formulation Results
[Chart "New IT Formulation": IPC of BASE, DDMT-OLD, DDMT-NEW]
• 3–5% better for some, 1–2% worse for others
  – RS contention increases greatly (too many not-ready DDT instructions)
  – DDT-1 injection too aggressive; slower policy ties up DDT contexts
  – different character than older formulation, needs work

Evaluation Summary
• performance
  – does well on microbenchmarks (like it's supposed to)
  – modest to moderate gains on SPEC2K + aggressive baseline
  – relatively better the more there is to do (PDIs or latency per PDI)
• limitations (future work?)
  – pre-execution/branch predictor interface would be nice
    (others and I have looked at this)
  – difficulties with unoverlapped unrolling
    (external unrolling mechanism, static selection framework)
  – need more RS entries
    (clustered RS queues should be good; DDTs are dependence chains)

The End
• pre-execution: more ILP from sequential programs
  – attacks performance problems directly
  – key technologies: proactive out-of-order sequencing + decoupling
• DDMT: a superscalar-friendly implementation
  – no dedicated pre-execution bandwidth
  – register integration for pre-execution reuse
• automated DDT selection
  – stable, has the right knobs

Outline
• motivation and introduction to pre-execution
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace

Tuning DDT Selection
• can't change DDT structure, can control length
  – longer DDTs tolerate more latency, more overhead, fewer PDIs
• from below
  – minimal latency tolerance
• from above
  – maximum length, slicing window size, unrolling degree
• upshot: maximum ADV_AGG DDT has a characteristic length
  – loosening controls doesn't make much difference

Related Work: Architectures
• dataflow architectures: [Dennis75], Manchester [Gurd+85], TTDA [Arvind+90], ETS [Culler+90]
  – decoupling, data-driven fetch to the limit, but no sequential interface
  – pre-execution: sequential interface with speculative dataflow helper [UW-CS TR #1411]
• decoupled access/execute architecture [Smith82]
  – decoupling, single execution, proactive OoO?
  – pre-execution: speculative decoupled miss/execute microarchitecture [MEDEA'00]

Related Work: Microarchitectures
• decoupled runahead: slipstream [Rotenberg+00]
  – decoupled, not proactive out-of-order
• speculative thread-level parallelism: Multiscalar [Franklin+93], SPSM [Dubey+95], DMT [Akkary+98]
  – decoupled, not proactive out-of-order

Pre-Execution Genealogy
[Figure: family tree relating Dependence-Based Prefetching, Dependence-Based Target Pre-Computation, Speculative Dataflow, Speculative Slices, Assisted Execution, SSMT, Branch Flow Microarchitecture, DDMT, Speculative Pre-Computation, Slice Processors]

Performance: Separating the Effects
[Charts: speedup (%) for loads (mcf, em3d, vpr, mst) and branches (eon, gzip, em3d) under full DDMT, no integration (–integ), and no decoupling (–decoup)]
• no integration
  – loads: prefetching still OK; mcf unrolling needs integration
  – branches: effects lost
• no decoupling
  – DDT prefetches too
  – DDTs as scheduler hints: "problem-aware OoO"
  – no speedup

Pre-Execution vs. Inlined Helper Code
• must hoist computation
  – must copy: a straight hoist will create WAR dependences
  – hoist past/out-of/into procedures?
  – schedule past/out-of/into procedures?
  – non-binding prefetches or inline stalls
  – branch pre-execution?
• pre-execution vs. Itanium®
  – speculative loads ease hoisting past procedures
  – but that's it

Loads in DDTs
• missed memory dependences (store invalidations)?
  – problem is with integration
  – keep address/value pairs in integration table (shadow MOB)
  – snoop
• multiprocessors?
  – pre-execution is sequentially consistent (SC)
  – just radical out-of-order
  – actual execution, snoopable state for all loads
  – many cases occur in uniprocessor as interactions with main thread

Contributions: Prelim vs. Thesis
• prelim
  – implementation: Speculative Dataflow (TTDA), DDMT (SMT base)
  – applications: prefetch data, pre-compute branches, "smooth" ILP
  – automated pre-execution computation selection (enough to get by)
• dissertation
  – implementation: DDMT (superscalar based)
  – applications: prefetch data, pre-compute branches
  – automated computation selection

Superscalar: Obstacles to ILP
• backpressure, backpressure…
  – out-of-order retirement? can't do
  – larger window (ROB)? hard (engineering)
  – bigger useful window? harder (P[no mis-pred] shrinks)
  – out-of-order fetch? getting to that

Calculating Execution Times
• why do DDTs execute faster than MT?
  – sequence fewer instructions!
  – SCDH (sequencing-constrained dataflow height) accounts for this

  master thread:                          DDT:
  PC   inst        DSTtrig  SC  SCDH     PC   inst        DSTtrig  SC  SCDH
  x18  R1=R1+4       0       0    1      x18  R1=R1+4       0       0    1
  x18  R1=R1+4      40       5    6      x18  R1=R1+4       1       1    2
  x18  R1=R1+4      80      10   11      x18  R1=R1+4       2       2    3
  x2c  R2=ld[R1]    85      11   12      x2c  R2=ld[R1]     3       3    4
  x30  bz R2, x44   86      11   13      x30  bz R2, x44    4       4    5
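The SCDH tables above can be reproduced by a short computation: each instruction completes at max(its sequencing constraint, its producers' completion times) plus its latency. The sketch below is my reconstruction from the tables (unit latencies, sequencing constraint = ceil(dynamic sequencing distance / bandwidth)), not code from the thesis.

```python
# Sketch of SCDH (sequencing-constrained dataflow height), the execution-time
# estimate used by DDT selection: completion = max(sequencing constraint,
# inputs ready) + latency. Reconstruction from the slide's worked example.

import math

def scdh(insts, seq_bandwidth):
    """insts: (pc, dst, src_pcs, latency) tuples in sequencing order, where
    dst is the dynamic sequencing distance from the trigger. Returns the
    completion time of the last instruction."""
    done = {}                        # pc -> completion time of latest instance
    t = 0
    for pc, dst, srcs, lat in insts:
        constraint = math.ceil(dst / seq_bandwidth)        # fetch/inject rate
        ready = max((done[s] for s in srcs if s in done), default=0)
        t = max(constraint, ready) + lat
        done[pc] = t
    return t
```

With the slide's example, an 8-wide master thread (instructions 0, 40, 80, 85, 86 dynamic instructions past the trigger) reaches SCDH 13, while 1-per-cycle DDT injection of the same five instructions reaches SCDH 5, giving LT_PDI = 13 − 5 = 8 cycles.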