
Pre-Execution
via
Speculative Data Driven Multithreading
Amir Roth
University of Wisconsin, Madison
August 10, 2001
Explanation of Title Slide
• pre-execution: a new way of extracting additional instruction-level parallelism (ILP), and hence performance, from ordinary sequential programs
• speculative data-driven multithreading (DDMT): an implementation of pre-execution
• thesis: pre-execution and DDMT are good
Amir Roth
Pre-Execution via Speculative Data-Driven Multithreading
2
Summary of Contributions
• pre-execution (concept)
  – idea: execute to get unpredictable but performance-critical values
  – technology: proactive out-of-order sequencing, decoupling
• DDMT (proposed implementation)
  – idea: extend superscalar design, siphon pre-execution bandwidth
  – technology: register integration
• algorithm for selecting what to pre-execute (framework)
  – idea: automatically select from computations executed by program
  – technology: pre-execution benefit-cost function
• performance evaluation (of framework and implementation)
Why Am I Doing This?
• still need higher performance
  – new app's, better performance on existing app's
• still need higher sequential-program performance
  – many out there, parallel/MT programs composed of sequential code
• need parallelism to complement frequency
  – frequency getting harder, performance returns diminishing
• need ILP to complement [program, thread, bit]-LP
Outline
• motivation and introduction to pre-execution
  – problem we are solving and potential gain of solving it (3 slides)
  – pre-execution basics (5 slides)
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace
ILP: Incumbent Model
• von Neumann: retire instructions in (program) order
• performance: execute useful instructions each cycle
• the superscalar way
  – examine "sliding window", dataflow execution w/in window
  – in-order retirement implements sequential semantics
  – out-of-order execution increases useful execution rate
  – in-order fetch establishes data dependences
[Figure: sliding instruction window between in-order fetch and in-order retire]
• performance loss: not enough ready-to-execute useful instructions in window
Value Latency and PDIs
• Problem: value latency
  – need correct value faster than execution can supply
  – two important kinds: branch outcomes, load addresses
• branch prediction / address prediction (prefetching)
  – faster than execution, correct ~95% of the time
  – last 5% are performance-degrading instances (PDIs)
• effects of PDIs
  – branch mis-predictions: stall fetch of useful instructions
  – cache misses: execution latency stalls retirement, backpressure stalls fetch
[Figure: fetch/exec timeline showing LD and BR PDIs]
Why Worry About 5%?
[Chart: "Perfect Memory Latency and Branch Resolution": IPC (0-8) for BASE vs. PERFECT across em3d, mst, bzip2, crafty, eon.c, eon.k, eon.r, gap, gcc, gzip, mcf, parser, perl.d, perl.s, twolf, vortex, vpr.p, vpr.r]
• performance gain by "fixing" PDIs
  – 8-wide processor, 64KB L1, 1MB L2, 80-cycle memory latency
  – branches: not perfect prediction, perfect resolution (fixup at rename)
    • not as good, but matches our implementation
Pre-Execution
• dilemma
  – need PDI values faster than execution
  – can only accurately get PDI values using execution
• pre-execution: execution that is faster than execution
  – part I: (pre-)execute PDIs faster than original program
  – part II: communicate pre-executed PDI values to original program
    • cache for loads, ?? for branches (implementation dependent)
  – part I is the crux
  – important: pre-execute PDI computations, otherwise just guessing
Q: How to Execute Faster than Execution?
• A, part I: proactive, out-of-order sequencing (fetch)
  – execute (and fetch) fewer instructions (fetch >> battle/2)
  – out-of-order: PDI computation only (not full program)
  – proactive: "know" PDI is coming and its computation
  – hoist computation (and its latency) arbitrary distances
• A, part II: decoupling
  – pre-execute PDI computation in separate "thread"
  – "move" stalls to pre-execution thread (inter-thread overlapping)
• proactive OoO sequencing + decoupling = pre-execution
Pre-Execution: Example
[Figure: original program (fetch/exec) vs. master thread that forks a data-driven thread (DDT, the pre-execution thread); the DDT pre-executes the LD/BR computation, absorbs latency, and sends results back to master-thread instructions]
– OoO sequencing: PDI computation fetched quickly
– decoupling: overlap memory latency with master thread
– result communication: helps with branches → speedup
How does Processor "Know" PDI is Coming?
[Figure: master-thread (MT) stream with trigger forking a DDT containing the LD/BR computation]
• DDT associated with trigger
  – master thread (MT) sees trigger → forks DDT
  – assume: MT will execute PDI via computation == DDT
  – choose DDT/trigger s.t. assumption statistically true
• two other possibilities
  – (a) master may not execute computation matching DDT
  – (b) load may hit, branch may be correctly predicted
  – useless (actually harmful) pre-execution
  – take these probabilities into account
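The trigger mechanism above can be sketched in a few lines. This is a minimal model with hypothetical structure names (`DDT`, `MasterThread`), not the hardware design: the master thread checks each fetched PC against a table of triggers, and on a hit forks the associated DDT with a snapshot of the current register map.

```python
# Minimal sketch (hypothetical names): the master thread checks each fetched
# PC against its trigger table; on a hit it forks the associated data-driven
# thread (DDT), copying the current register map so integration can work later.

class DDT:
    def __init__(self, trigger_pc, insts):
        self.trigger_pc = trigger_pc   # PC whose fetch forks this DDT
        self.insts = insts             # static list of (pc, operation)

class MasterThread:
    def __init__(self, ddts):
        # index DDTs by trigger PC for a constant-time check at fetch
        self.triggers = {d.trigger_pc: d for d in ddts}
        self.map_table = {"R1": "p7"}  # architectural -> physical register map
        self.forked = []

    def fetch(self, pc):
        ddt = self.triggers.get(pc)
        if ddt is not None:
            # fork: DDT starts from a snapshot of the current map table
            self.forked.append((ddt, dict(self.map_table)))

mt = MasterThread([DDT(0x18, [(0x18, "R1=R1+4"), (0x2c, "R2=ld[R1]")])])
mt.fetch(0x10)   # not a trigger: nothing forked
mt.fetch(0x18)   # trigger: DDT forked with a copy of the map table
```

Because the fork is only a statistical bet (the master may never execute the matching computation), nothing here is architecturally binding; the DDT's results stay speculative until integrated.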
Role of Pre-Execution
• need for in-order sequencing
  – OoO sequence contains all dependences? can't tell
  – p-e results must be speculative
  – can't "interleave" two OoO sequences together
  – must in-order sequence full program once (at least)
  – OoO sequencing (p-e) must be redundant
• role of pre-execution
  – speculative: no correctness obligations
  – redundant: (relatively) high cost
  – pre-execute anything, just so performance improves
Outline
• intro to pre-execution
• pre-execution DDMT style
  – DDMT basics (1 slide)
  – intro to register integration (2 slides)
  – implicit data-driven sequencing (2 slides)
• automated computation selection
• DDMT microarchitecture
• experimental evaluation
Pre-Execution DDMT Style
[Figure: superscalar pipeline extended with DDT$ (alongside I$), CMIS, IT, renaming (RE), ROB, RS, PREG file, FUs, D$]
• extend dynamically scheduled superscalar processor
  – DDT$ holds static DDTs
  – replicated register maps; CMIS manages, schedules DDT "injection"
  – DDT instructions look "normal", but not put into ROB or retired
    • put into RS, read/write pregs, execute on FUs
  – IT implements register integration (result sharing via pregs)
  – centralized organization: no dedicated PE B/W, steal as needed
    • bonus: B/W available to steal when PE is needed most (overall ILP is low)
Register Integration
• master thread directly reuses pre-executed results
  – saves execution bandwidth, compresses dataflow graph
  – instant branch resolution: integrated mis-predicted branch resolved at register renaming time (at fetch would be better, but harder)
• register integration
  – like instruction reuse [Sodani] using pregs instead of lregs/values
  – reuse test: match PCs (op) & input pregs
  – reuse: map output to IT entry's output preg
  – map-table manipulations only, pregs not read or written
  – pregs naturally track dependence chains (e.g., DDTs)

  RB: PC   Vin1  Vin2  Val        IT: PC   pin1  pin2  preg
      x18  0x44  4     0x48           x18  p7    -     p6
      x1c  -     0x48  17             x1c  p6    -     p9
Pre-Execution Reuse via Integration
[Example: on fork, the map table maps R1 → p7.
  DDT:            PC   raw inst    renamed
                  x18  R1=R1+4     p6=p7+4
                  x2c  R2=ld[R1]   p9=ld[p6]
  IT afterwards:  (x18, pin p7 → preg p6), (x2c, pin p6 → preg p9)
  master thread:  x18  R1=R1+4     p?=p7+4   → integrates p6
                  x2c  R2=ld[R1]   p?=ld[p6] → integrates p9]
– DDT allocates pregs, creates IT entries
– master thread reads IT, integrates pregs allocated by DDT
– recursive process: 0x18 integration (p6) sets up 0x2c integration
• need two things to make this work
  – initial R1 mapping (p7) must match (copy map table on fork)
  – data dependences must match (instructions and PCs!!)
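The reuse test above can be sketched as a dictionary lookup. This is a minimal model, not the renaming circuit: IT entries are keyed by (PC, input preg), and a hit at rename time means the master thread adopts the DDT-allocated output preg instead of allocating and executing.

```python
# A minimal sketch (hypothetical structures) of register integration: the
# master thread's renamer checks each instruction's (PC, input preg) against
# the integration table (IT); on a match it reuses the DDT-allocated output
# preg (a map-table update only) instead of re-executing the instruction.

it = {}   # (pc, input_preg) -> output_preg, written as the DDT renames

def ddt_rename(pc, in_preg, out_preg):
    it[(pc, in_preg)] = out_preg

def master_rename(pc, in_preg, alloc_preg):
    """Return (preg, integrated). alloc_preg allocates a fresh preg."""
    out = it.get((pc, in_preg))
    if out is not None:
        return out, True           # integration: map-table manipulation only
    return alloc_preg(), False     # normal rename: allocate and execute

# DDT pre-executes x18 (p6 = p7+4) and x2c (p9 = ld[p6])
ddt_rename(0x18, "p7", "p6")
ddt_rename(0x2c, "p6", "p9")

fresh = iter(["p20", "p21"])
# master thread renames the same computation with a matching input preg (p7):
# integrating p6 at x18 recursively sets up the x2c match on p6
p, hit = master_rename(0x18, "p7", lambda: next(fresh))   # -> ("p6", True)
p2, hit2 = master_rename(0x2c, p, lambda: next(fresh))    # -> ("p9", True)
```

Note how the recursion in the slide falls out naturally: the x2c lookup succeeds only because the x18 integration installed p6 as the master's R1 mapping.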
Sequencing DDTs (OoO sequencing)
• Q: how to implement out-of-order sequencing?
  – how to sequence instructions with non-contiguous PCs?
• implicit data-driven sequencing
  – list DDT instructions, inject list as is
  – processor doesn't "interpret" DDT branches
  – branches in DDTs only for subsequent integration

  static DDT:  PC    inst
       Trig:   x18   R1=R1+4
               x2c   R2=ld[R1]
               x30   bz R2, x44
• DDTs are static and finite
  – good: natural overhead control, no "runaway" DDTs
  – bad: can't pre-execute loops (important for latency tolerance)
Faking Control-Flow in DDTs
• "trick" processor into pre-executing any control flow
  – processor doesn't interpret DDT branches
• faking conditional control
  – implicit conditional: pre-execute along common path
  – greedy conditional: pre-execute along both paths
• faking loop control (important for latency tolerance!)
  – unrolling: unroll a loop within a DDT
  – unoverlapped unrolling: powerful, difficult in DDMT
  – induction unrolling: unroll induction only

  static DDT:  PC    inst
       Trig:   x18   R1=R1+4
               x18   R1=R1+4
               x18   R1=R1+4
               x2c   R2=ld[R1]
               x30   bz R2, x44
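Induction unrolling can be sketched as simple list construction. This is a hypothetical representation of a static DDT, not the selection tool's format: the induction update (x18) is repeated N times so the load runs N iterations ahead, while the loop branch is listed only for later integration and never interpreted.

```python
# Minimal sketch of induction unrolling (hypothetical DDT representation):
# repeat the induction step `degree` times before the loop body so the
# pre-executed load is hoisted `degree` iterations ahead of the master thread.

def induction_unroll(induction, body, degree):
    """Build a static DDT: `degree` copies of the induction step, then the body."""
    return induction * degree + body

induction = [(0x18, "R1=R1+4")]                       # loop induction update
body = [(0x2c, "R2=ld[R1]"), (0x30, "bz R2, x44")]    # load + branch (uninterpreted)

ddt = induction_unroll(induction, body, degree=3)
# -> three copies of (0x18, "R1=R1+4"), then (0x2c, ...), (0x30, ...)
```

Because the DDT stays static and finite, the unrolling degree is a hard bound on how far ahead it can run: a natural overhead control, but also the reason full loops cannot be pre-executed.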
Outline
• intro to pre-execution
• pre-execution DDMT style
• automated DDT selection
  – identifying problem instructions (2 slides)
  – optimizing DDTs for latency tolerance and overhead (cool, 2 slides)
  – merging DDTs to reduce overhead (ultra boring, 0 slides)
• DDMT microarchitecture
• experimental evaluation
• summary
Automated DDT Selection Algorithm
• goal: DDTs that hide most PDI latency with least overhead
• automated? integration requirement bounds search space
  – examine program traces
  – slice backwards from PDIs to enumerate DDTs
  – choose DDTs that maximize some benefit-cost function
• 3 steps
  – identify static problem instructions (PIs)
  – find DDTs for each PI
  – merge partially overlapping DDTs to reduce overhead
• implementation
  – H/W? S/W? VM? (we model S/W, but leave other possibilities open)
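The enumeration step (slice backwards from PDIs) can be sketched over a register-dependence trace. This is a simplified model with a hypothetical trace format, ignoring memory dependences and control flow:

```python
# Minimal sketch (hypothetical trace format) of backward slicing: walk the
# trace backwards from a problem instruction, collecting the producers of the
# registers it (transitively) reads; the collected computation is a DDT candidate.

def backward_slice(trace, pi_index):
    """trace: list of (pc, dest_reg, src_regs). Returns trace indices in the slice."""
    needed = set(trace[pi_index][2])       # registers the PI reads
    slice_ix = [pi_index]
    for i in range(pi_index - 1, -1, -1):
        pc, dest, srcs = trace[i]
        if dest in needed:                 # producer of a needed value
            slice_ix.append(i)
            needed.discard(dest)
            needed.update(srcs)            # now need this producer's inputs
    return list(reversed(slice_ix))

trace = [
    (0x10, "R3", ["R4"]),      # unrelated
    (0x18, "R1", ["R1"]),      # R1 = R1 + 4
    (0x14, "R5", ["R3"]),      # unrelated
    (0x2c, "R2", ["R1"]),      # R2 = ld[R1]  <- problem load
]
print(backward_slice(trace, 3))   # -> [1, 3]: the PI's computation only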
Problem Instructions (PIs)
• impractical to pre-execute all PDIs
  – divide PDIs by static instruction
  – choose static problem instructions (PIs) with good "pre-executability"
  – only find DDTs for / pre-execute PDIs of PIs
• good pre-executability criteria
  – problem ratio: high ratio of PDIs (high miss/misprediction rate)
  – problem contribution: high PDI representation
  – problem latency: high latency per PDI
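The three criteria combine into a simple filter. A minimal sketch, using the thresholds quoted on the next slide (contribution 1 in 500, ratio 1 in 10, latency 10 cycles for loads / 5 for branches) as defaults; the stats field names are hypothetical:

```python
# Minimal sketch of the PI filter: a static instruction qualifies as a
# problem instruction (PI) only if it clears all three pre-executability
# thresholds. Stats field names are hypothetical.

def is_problem_instruction(stats, total_pdis, is_load,
                           min_contrib=1/500, min_ratio=1/10):
    min_latency = 10 if is_load else 5            # loads vs. branches
    contribution = stats["pdis"] / total_pdis     # share of all PDIs
    ratio = stats["pdis"] / stats["executions"]   # miss/misprediction rate
    return (contribution >= min_contrib and
            ratio >= min_ratio and
            stats["avg_latency"] >= min_latency)

# a static load that misses 20% of the time, 80-cycle average miss latency
stats = {"pdis": 400, "executions": 2000, "avg_latency": 80}
print(is_problem_instruction(stats, total_pdis=100_000, is_load=True))  # True
```

All three tests are necessary: a high-ratio but rarely executed instruction (low contribution) is not worth a DDT, and a frequent but cheap one tolerates no latency.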
Potential of Pre-Executing PIs
[Chart: "Performance Potential of Perfecting Problem Instructions": IPC (0-8) for BASE vs. PERFECT PI vs. PERFECT across em3d, mst, and the SPECint2K benchmarks]
• PI definition
  – contribution: 1 in 500 PDIs
  – ratio: 1 in 10 PDIs (10% miss/misprediction rate)
  – latency: 10 cycles for loads, 5 for branches
Selecting DDTs for a Single PI
• simple case
  – PDIs computed by one slice
  – choose sub-slice that should be DDT
• in general
  – multiple, partially overlapping slices
  – choose set of non-overlapping DDTs

  candidate triggers:  PC    inst
              Trig?    x18   R1=R1+4
              Trig?    x18   R1=R1+4
              Trig?    x18   R1=R1+4
              Trig?    x2c   R2=ld[R1]
                       x30   bz R2, x44
• approach
  – mantra: maximize latency tolerance, minimize overhead
  – compute benefit-cost (LT - OH) for each potential DDT
  – aggregate (over all pre-executions)
  – choose DDT with maximum benefit-cost
Pre-Execution Benefit-Cost Function
• aggregate advantage: ADVAGG = LTAGG - OHAGG
• LTAGG = # PDIs covered * LTPDI
  – don't count pre-executions for cache hits (no latency to tolerate)
  – LTPDI = EXECTMT - EXECTDDT
    • sequencing-constrained dataflow height (SCDH) approximates EXECT
  – LTPDI <= problem latency (can't tolerate more latency than there is)
• OHAGG = # pre-executions * OHPE
  – all pre-executions count: for cache hits, for no corresponding loads
  – OHPE = rename b/w consumed in cycles (most direct OH measure)
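The function above is a straight formula; a minimal sketch, with the SCDH-based execution times and the counts taken as inputs:

```python
# Minimal sketch of the benefit-cost function as defined above:
# ADVAGG = LTAGG - OHAGG, with LTPDI capped at the PDI's problem latency.

def aggregate_advantage(n_pdis_covered, n_pre_executions,
                        exect_mt, exect_ddt, problem_latency, oh_pe):
    # latency tolerance per PDI: how much earlier the DDT finishes,
    # capped at the latency actually available to tolerate
    lt_pdi = min(exect_mt - exect_ddt, problem_latency)
    lt_agg = n_pdis_covered * lt_pdi       # hits contribute no benefit...
    oh_agg = n_pre_executions * oh_pe      # ...but every pre-execution costs
    return lt_agg - oh_agg

# DDT finishes 75 cycles earlier than the master thread (SCDH estimates
# 86 vs. 11), PDI latency is 80 cycles, each pre-execution costs 2 cycles
# of rename bandwidth; 100 of 500 pre-executions cover real PDIs
adv = aggregate_advantage(n_pdis_covered=100, n_pre_executions=500,
                          exect_mt=86, exect_ddt=11, problem_latency=80,
                          oh_pe=2)
print(adv)   # 100*75 - 500*2 = 6500
```

The asymmetry is the point: only covered PDIs earn benefit, but every pre-execution (including those for cache hits) pays overhead, so low-ratio instructions naturally score negative advantage.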
Outline
• intro to pre-execution
• pre-execution DDMT style
• automated DDT selection
• DDMT microarchitecture
  – implementing register integration (1 slide)
  – other implementation notes (1 slide)
• experimental evaluation
• summary
Implementing Register Integration
• more physical registers
  – to keep pre-executed DDT results alive longer; pipestage++?
• integration circuit
  – looks like conventional register-renaming dependence cross-check
  – IT may be M-way associative (M candidates per instruction)
  – circuit complexity: NM^2 (M > 4 not advised); pipestage++?
• integrating loads
  – DDTs may miss conflicting stores from master thread
  – not detected by integration (preg-based): coherence problem
  – solution: re-execute integrated loads & squash (alt: snoop)
  – learn which load integrations cause squashes, don't integrate them
Other Implementation Notes
• forking
  – only master thread can fork, no "chaining"
• injection scheduling
  – Q: how fast should DDTs be "injected"?
  – A: at dataflow speed, but no faster (approximate with DDT-1)
• stores and DDT memory communication
  – DDTs can contain stores, but can't write DDT stores into D$
  – small queue (DDSQ): write DDT stores, direct "right" DDT loads
• exceptions
  – buffer until instruction integrated (potentially abort rest of DDT)
Outline
• intro to pre-execution
• pre-execution DDMT style
• automated DDT selection
• DDMT microarchitecture
• experimental evaluation
  – numbers (3 slides)
  – more numbers (6 slides)
  – explanations of numbers (2 slides)
Experimental Framework
• SPECint2K, 2 Olden microbenchmarks, Alpha EV6, -O3 -fast
  – training runs, 10% sampling
• SimpleScalar-based simulation environment
  – 8-wide, superscalar, out-of-order
  – 128 ROB, 64 LDQ, 32 STQ, 80 RS
  – pipe: 3 fetch, 2 rename/integrate, 2 schedule, 2 reg read, 3 L1 hit
  – 32KB IL1 / 64KB DL1 (2-way), 1MB L2$ (4-way), mem b/w: 8 b/cyc.
  – 1024 pregs, 1024-entry 4-way IT (baseline does squash reuse)
• methodology
  – DDT selection/DDMT: same input sample (for now)
DDMT Performance
[Chart: "DDMT Performance": IPC (0-5) for BASE vs. DDMT vs. PERFECT PI across em3d, mst, and the SPECint2K benchmarks]
• microbenchmarks: +50%, SPEC2K: +1%-10%
  – avg. load latency reduced up to 20%
  – avg. branch resolution latency reduced up to 40%
  – 15%-40% of pre-executed instructions integrated
  – inability to use unoverlapped full unrolling hurts bzip2, gzip, mcf, vpr
Importance of Integration
[Chart: "Effect of Register Integration": IPC (0-5) for BASE vs. DDMT vs. NO BRANCH RESOLUTION vs. NO PRE-EXECUTION REUSE across the benchmarks]
• no branch resolution: don't integrate pre-executed branches
  – -3% for branch pre-execution benchmarks: crafty, eon, vpr.p, twolf
• no pre-execution reuse: don't integrate any DDT result
  – hurts branch pre-execution & high-IPC programs, DDMT < base
  – scheduling/RS contention important in higher-IPC cases (intuitive)
Another Way of Measuring Overhead
[Chart: "Overhead-less DDMT": IPC (0-5) for BASE vs. DDMT vs. OVERHEAD-LESS DDMT vs. PERFECT PI across the benchmarks]
• remove overhead from DDMT
  – renaming, scheduling, re-execution B/W free; limit master RS only
  – +2-3% performance
  – current IT formulation suppresses DDMT overhead (more later)
Sensitivity: DDT Selection Implementation
[Chart: "Stability of DDMT across DDT Selection Inputs": IPC (0-5) for BASE vs. LIMIT vs. OFFLINE vs. ONLINE across the benchmarks]
• model via relationship of DDT-selection input to DDMT input
  – limit: same input, H/W implementation (default)
  – offline: different input, S/W implementation
  – online: different sample within same input, VM implementation
• insensitive to DDT-selection input: DDTs = f(program structure)
Sensitivity: Integration Associativity
[Chart: "Impact of IT Associativity": IPC (0-5) for BASE vs. DIRECT MAPPED, 2-WAY, 4-WAY (DEFAULT), 8-WAY, FULLY ASSOCIATIVE across the benchmarks]
• high associativity increases successful integration rate
  – baseline (squash reuse only) not very sensitive to associativity
  – integration reduces RS/scheduler contention
  – low associativity interferes with unrolling (in current IT formulation)
Sensitivity: Unrolling Degree
[Chart: "Impact of DDT Unrolling Degree": IPC (0-5) for BASE vs. UNROLL2, UNROLL4 (DEFAULT), UNROLL8 across the benchmarks]
• max DDT length 64, fully associative IT (to avoid interference)
  – increased unrolling important for tolerating memory latencies
  – but increases OH
  – modulo unrolling, increased allowed DDT size/scope doesn't help
    • DDTs no longer than needed to tolerate required latency
Sensitivity: Memory Latency
[Chart: "Sensitivity to Memory Latency": IPC (0-5) for BASE vs. DDMT at ML=70 and ML=140 across the benchmarks]
• double memory latency: 140 cycles
  – more latency per PDI
  – increase maximum unrolling degree to 8 (also 8-way associative IT)
  – higher relative speedups
Sensitivity: Cache Size
[Chart: "Sensitivity to Cache Size": IPC (0-5) for BASE vs. DDMT at DL1=64KB/L2=1MB and DL1=16KB/L2=256KB across the benchmarks]
• cut cache sizes by 4: DL1=16KB, L2=256KB
  – more PDIs
  – roughly same absolute speedups, higher relative
  – increased contention
IT Formulation?
• old: IT doubles as ledger for pregs allocated by DDTs
  – if evicted from IT, preg is freed, downstream DDT destroyed
  – keeps overhead (RS contention) down
  – only need to re-execute integrated loads
  – restricts effective unrolling degree to IT associativity
  – requires incremental invalidations on IT (associative matches)
• new: decouple pre-execution from presence in IT
  – re-execute all integrated instructions
  – no incremental invalidations
  – downstream DDT not destroyed on IT eviction
Preliminary New-Formulation Results
[Chart: "New IT Formulation": IPC (0-5) for BASE vs. DDMT-OLD vs. DDMT-NEW across the benchmarks]
• 3-5% better for some, 1-2% worse for others
  – RS contention increases greatly (too many not-ready DDT instr's)
  – DDT-1 injection too aggressive, slower policy ties up DDT contexts
  – different character than older formulation, needs work
Evaluation Summary
• performance
  – does well on microbenchmarks (like it's supposed to)
  – modest to moderate gains on SPEC2K + aggressive baseline
  – relatively better the more there is to do (PDIs or latency per PDI)
• limitations (future work?)
  – pre-execution/branch-predictor interface would be nice
    • others and I have looked at this
  – difficulties with unoverlapped unrolling
    • external unrolling mechanism, static selection framework
  – need more RS entries
    • clustered RS queues should be good, DDTs are dependence chains
The End
• pre-execution: more ILP from sequential programs
  – attacks performance problems directly
  – key technologies: proactive out-of-order sequencing + decoupling
• DDMT: a superscalar-friendly implementation
  – no dedicated pre-execution bandwidth
  – register integration for pre-execution reuse
• automated DDT selection
  – stable, has the right knobs
Outline
• motivation and introduction to pre-execution
• pre-execution DDMT style
• automated computation selection
• DDMT microarchitecture
• performance evaluation
• terrace
Tuning DDT Selection
• can't change DDT structure, can control length
  – longer DDTs tolerate more latency, more overhead, fewer PDIs
• from below
  – minimal latency tolerance
• from above
  – maximum length, slicing window size, unrolling degree
• upshot: maximum ADVAGG DDT has characteristic length
  – loosening controls doesn't make much difference
Related Work: Architectures
• dataflow architectures
  [Dennis75], Manchester [Gurd+85], TTDA [Arvind+90], ETS [Culler+90]
  – decoupling, data-driven fetch to the limit, but no sequential interface
  – pre-execution: sequential interface with speculative dataflow helper [UW-CSTR-#1411]
• decoupled access/execute architecture [Smith82]
  – decoupling, single execution, proactive OoO?
  – pre-execution: speculative decoupled miss/execute microarchitecture [MEDEA'00]
Related Work: Microarchitectures
• decoupled runahead
  slipstream [Rotenberg+00]
  – decoupled, not proactive out-of-order
• speculative thread-level parallelism
  Multiscalar [Franklin+93], SPSM [Dubey+95], DMT [Akkary+98]
  – decoupled, not proactive out-of-order
Pre-Execution Genealogy
[Diagram: family tree of pre-execution techniques: Dependence-Based Prefetching, Dependence-Based Target Pre-Computation, Speculative Dataflow, Speculative Slices, Assisted Execution, SSMT, Branch Flow Microarchitecture, DDMT, Speculative Pre-Computation, Slice Processors]
Performance: Separating the Effects
[Charts: speedup (%) for full DDMT vs. no integration (-integ) vs. no decoupling (-decoup); loads: mcf, vpr, mst, em3d; branches: eon, gzip, em3d]
• no integration
  – loads: prefetching still OK
    • mcf unrolling needs integration
    • em3d DDT prefetches too
  – branches: effects lost
• no decoupling
  – DDTs as scheduler hints, "problem-aware OOO"
  – no speedup
Pre-Execution vs. Inlined Helper Code
• must hoist computation
  – must copy, straight hoist will create WAR dependences
  – hoist past/out-of/into procedures?
  – schedule past/out-of/into procedures?
  – non-binding prefetches or inline stalls
  – branch pre-execution?
• pre-execution vs. Itanium®
  – speculative loads ease hoisting past procedures
  – but that's it
Loads in DDTs
• missed memory dependences (store invalidations)?
  – problem is with integration
  – keep address/value pairs in integration table (shadow MOB)
  – snoop
• multiprocessors?
  – pre-execution is sequentially consistent (SC)
  – just radical out-of-order
  – actual execution, snoopable state for all loads
  – many cases occur in uniprocessor as interactions with main thread
Contributions: Prelim vs. Thesis
• Prelim
  – implementation: Speculative Dataflow (TTDA), DDMT (SMT base)
  – applications: prefetch data, pre-compute branches, "smooth" ILP
  – automated pre-execution computation selection (enough to get by)
• Dissertation
  – implementation: DDMT (superscalar-based)
  – applications: prefetch data, pre-compute branches
  – automated computation selection
Superscalar: Obstacles to ILP
• backpressure, backpressure…
  – out-of-order retirement? can't do
  – larger window (ROB)? hard (engineering)
  – bigger useful window? harder (P[no mis-pred] shrinks)
  – out-of-order fetch? getting to that
Calculating Execution Times
• why do DDTs execute faster than MT?
  – sequence fewer instructions!
  – SCDH (Sequencing-Constrained Dataflow Height): accounts for this

  master thread:                            DDT:
  PC   inst        DSTtrig  SC  SCDH       PC   inst        DSTtrig  SC  SCDH
  x18  R1=R1+4        0      0    1        x18  R1=R1+4        0      0    1
  x18  R1=R1+4       40      5    6        x18  R1=R1+4        1      1    2
  x18  R1=R1+4       80     10   11        x18  R1=R1+4        2      2    3
  x2c  R2=ld[R1]     85     11   12        x2c  R2=ld[R1]      3      3    4
  x30  bz R2, x44    86     11   13        x30  bz R2, x44     4      4    5
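The SCDH estimate can be reproduced with a few lines. A minimal sketch assuming an 8-wide machine and unit execution latency (both assumptions, chosen to match the numbers on this slide): an instruction starts no earlier than its sequencing constraint (fetch distance from the trigger divided by machine width) and no earlier than its dataflow inputs complete.

```python
# Minimal sketch (assumed 8-wide machine, unit execution latency) of the
# SCDH estimate: completion time = max(sequencing constraint, dataflow
# constraint) + execution latency, computed in injection order.

def scdh(insts, width=8, exec_lat=1):
    """insts: list of (dist_from_trigger, dep_indices). Returns completion times."""
    done = []
    for dist, deps in insts:
        seq_constraint = dist // width                        # SC term
        df_constraint = max((done[d] for d in deps), default=0)
        done.append(max(seq_constraint, df_constraint) + exec_lat)
    return done

# master thread: same computation, but ~40 intervening instructions per iteration
mt = [(0, []), (40, [0]), (80, [1]), (85, [2]), (86, [3])]
# DDT: injected back-to-back, so sequencing never constrains it
ddt = [(0, []), (1, [0]), (2, [1]), (3, [2]), (4, [3])]

print(scdh(mt))    # [1, 6, 11, 12, 13]
print(scdh(ddt))   # [1, 2, 3, 4, 5]
```

This reproduces the slide's SCDH columns: the master thread's sequencing constraint dominates (EXECT_MT = 13) while the DDT runs at pure dataflow speed (EXECT_DDT = 5), which is exactly the LTPDI input to the benefit-cost function.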