WaveScalar
Swanson et al.
Presented by Andrew Waterman
ECE259 Spring 2008
Why Dataflow?
• “[...] as wire delays grow relative to gate delays,
improvements in clock rate and IPC become
directly antagonistic” [Agarwal00]
• Large bypass networks, highly associative
structures especially problematic
– Can only ameliorate somewhat in superscalar
designs (21264 clustering, WIB, etc.)
• Shorter wires, smaller loads => higher fclk possible
with point-to-point networks, decentralized
structures
Dataflow Locality
• Def: predictability of instruction dependencies
– 3/5 source operands come from most recent producer
• Completely ignored by most superscalars
– Over-general design: large bypass networks,
regular references to huge PRF, ...
• Partial exceptions: clustering, hierarchical RFs
• P4: 1 cycle of 31 stages devoted to execution
• Can exploit to greatly cheapen communication
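A tiny C illustration of dataflow locality (my own example, not from the paper): each consumer reads the value its immediate predecessor just produced, so a short point-to-point producer-to-consumer link could stand in for a broadcast bypass network or a trip through a huge PRF.
int dataflow_locality_example(int a, int b, int c)
{
    int t = a + b;   /* producer of t                          */
    int u = t * c;   /* source operand is the newest t         */
    int v = u - a;   /* source operand is the newest u         */
    return v;        /* every dependence is producer->consumer */
}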
The von Neumann abstraction
• Elegant as it is, the von Neumann execution model
is inherently sequential
– Control dependencies limit exploitable ILP
considerably
• P4 again: 20 stage (!) branch misprediction loop
– Store/load aliasing hurts, too
Why Not Dataflow?
• Dataflow architectures may
scale further, but...
• Who the hell wants to write a
program in Id?
– For commercial adoption and
future sanity, must support von
Neumann memory semantics
– But ideally without fetch
serialization
Enter WaveScalar
• WaveScalar: dataflow's new groove
• Enabled by process improvements: can integrate ~2K (2^11)
processing elements (PEs) + nearby storage on-die
– “Cache-only” architecture (not in the COMA sense)
• Provides total load/store ordering
– Can be programmed conventionally
• ...without a program counter
WaveScalar ISA
• WaveScalar binary encodes the DFG
• ISA is RISCy, plus a few new primitives
• Control flow:
– ɸ insn implements the C ternary operator
• Similar to predication
– ɸ-1 insn conditionally sends data to one PE or
another based upon a predicate value (C sketch below)
– Indirect-Send(arg,addr,offset) insn implements
indirect jumps, calls, returns
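A C-level sketch of the two control-flow primitives (my own illustration; the comments use informal notation, not the paper's actual encoding):
/* phi: both sides are computed, then one value is selected by the
 * predicate p -- essentially predication / the C ternary operator.  */
int phi_select(int p, int a, int b)
{
    return p ? a + 1 : b - 1;
}

/* phi^-1: the value x and the predicate p arrive at a steering insn,
 * which forwards x to exactly one of two consumer insns, so only the
 * chosen side of the branch ever executes.                           */
int phi_inverse_steer(int p, int x)
{
    if (p)
        return x * 2;   /* x steered here when p is true  */
    return x + 7;       /* x steered here when p is false */
}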
WaveScalar ISA: Waves
• Wave === connected DAG, subset of DFG
– Can span multiple hyperblocks iff each insn
executed at most once (no loops)
• Easily elongated with unrolling
• To disambiguate which dynamic insn is being
executed by a PE, data values carry a wave number
– Wave numbers incremented by Wave-Advance insn
• Wave number assignment is not centralized!
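A minimal sketch of the tagging idea (the struct and field names are mine, not the ISA's):
/* Every data value travels as a token tagged with its wave number;
 * Wave-Advance bumps the tag on each trip around a loop.            */
struct token {
    unsigned wave;   /* which dynamic wave this value belongs to */
    int      value;
};

/* Two tokens feed the same dynamic instance of a static insn only if
 * their wave numbers match, so overlapping loop iterations never mix
 * up their operands.                                                 */
static int same_dynamic_instance(struct token a, struct token b)
{
    return a.wave == b.wave;
}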
WaveScalar ISA: Memory Ordering
• Wave-ordered memory
– Where possible, mem ops labeled
with location within its wave:
<predecessor,this,successor>
– Control flow may prohibit this;
when unknown, '?' used as label
– Rule: no op with ? in succ. field
may connect to an op with ? in
pred. field
• Solution: memory-nops
• Result: memory has enough info to
establish total load/store order
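An illustrative C fragment with hypothetical wave-ordering labels in the comments (the sequence numbers are my own assumptions, not taken from the paper):
void wave_ordering_example(int *p, int *q, int *r, int cond)
{
    int a = *p;       /* Load  <., 1, ?> : branch below makes succ unknown */
    if (cond) {
        *q = a;       /* Store <1, 2, 3>                                   */
    } else {
        /* No memory op on this path, so the Load's '?' successor would
         * connect straight to the Store's '?' predecessor below, breaking
         * the rule.  A Memory-Nop with its own <pred,this,succ> label is
         * inserted here so the StQ can still reconstruct the total order. */
    }
    *r = a + 1;       /* Store <?, 3, .> : join above makes pred unknown   */
}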
WaveCache: WaveScalar Implemented
• Grid of 2^11 (~2K) PEs in clusters of 16
– On each PE: control logic, IQ/OQ, ALU, buffering for
8 static insns
• Small L1 D$ per 4 clusters
• Traditional unified L2$
• 1 StQ per 4 clusters
– Each wave bound to a
StQ dynamically
• Intra-cluster comm:
shared buses
• Inter-cluster: mesh?
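The slide's geometry written out as C constants for quick arithmetic (identifier names are mine):
enum {
    TOTAL_PES            = 1 << 11,                      /* 2^11 = 2048 PEs   */
    PES_PER_CLUSTER      = 16,
    CLUSTERS             = TOTAL_PES / PES_PER_CLUSTER,  /* 128 clusters      */
    STATIC_INSNS_PER_PE  = 8,                            /* per-PE insn store */
    CLUSTERS_PER_L1D     = 4,                            /* one L1 D$ each    */
    CLUSTERS_PER_STQ     = 4                             /* one StQ each      */
};
/* So roughly 2048 * 8 = 16K static insns can be resident at once. */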
Compilation
• Compilation basically same as for traditional arch.
– To the point that binary translation is possible
– Additional steps:
• inserting memory-nops, wave-advances
• converting branches to ɸ-1
• Binaries larger
– Extra insns
– Larger insns (list of target PEs)
• ...but this is OK (no repeated fetch)
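A sketch of where the extra compilation steps land in a small loop (the comment placement reflects my reading of the slides, not real compiler output):
int sum(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        /* The i < n branch becomes phi^-1 insns that steer s and i either
         * back into the body or out to the return; Wave-Advance insns go
         * on the back-edge so each iteration starts a new wave.           */
        s += v[i];
        /* The load of v[i] gets a <pred,this,succ> label; Memory-Nops are
         * added on any control path that has no memory ops of its own.    */
    }
    return s;
}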
Program Load/Termination
• Loading
– As usual, program loaded by setting PC & incurring
I$ miss
– Insn targets labeled "not-loaded" until they, too, miss
– In general, hopefully I$ misses are infrequent
• Must back up evicted insn's state (queues), restore new insn's state
• Probably need to invoke OS
• Termination
– OS purges all insns from all PEs
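A hypothetical sketch of the demand-load path described above (every name and helper below is made up for illustration):
struct pe;
struct insn;

extern int          pe_is_full(struct pe *pe);
extern struct insn *choose_victim(struct pe *pe);
extern void         save_state(struct pe *pe, struct insn *victim); /* queues; may need OS */
extern struct insn *fetch_from_istore(unsigned addr);               /* I$ -> L2 -> memory  */
extern void         install(struct pe *pe, struct insn *insn);

/* Called when a data token arrives for an insn still tagged "not-loaded". */
void load_on_demand(struct pe *pe, unsigned target_addr)
{
    if (pe_is_full(pe)) {
        struct insn *victim = choose_victim(pe);
        save_state(pe, victim);        /* back up its input/output queues */
    }
    install(pe, fetch_from_istore(target_addr));
}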
Execution Example
void s(char in[10], char out[10])
{
    int i = 0, j = 0;
    do {
        int t = in[i];
        if (t)
            out[j++] = t;
    } while (++i < 10);
}
• And it's that simple!
Just Kidding...
void s(char in[10], char out[10])
{
    int i = 0, j = 0;
    do {
        int t = in[i];
        if (t)
            out[j++] = t;
    } while (++i < 10);
}
[Figure: the dataflow graph for s(), unmapped and then mapped onto the WaveCache PEs]
How Well Does It Do?
• Methodology
– Benchmarks: SPEC and a few others
– Compiled for Alpha & binary-translated
• Fairness; better overall code generation
• But no WaveCache-specific optimizations
– Results reported in Alpha-equivalent IPC
• Fairness (WaveScalar has extra insns)
How Well Does It Do?
• Favorable comparison to
superscalar
– 16-wide (!!), out-of-order
– |PRF|=|IW|=1024
• Better IPC than TRIPS, but
certainly lower fclk
– TRIPS limited by its smaller unit of
execution (hyperblocks vs. waves)
Other performance results
• Extra instruction overhead
– In terms of static code size: 20%-140%
– In terms of execution time: 10%
• Parallelism
• Input queue size
– 8 sets of input values sufficient for most programs
• Except for victims of parallelism explosion
Performance improvements
• Control speculation
– Baseline WaveCache: no branch prediction
– 47% perf. improvement with perfect prediction
• Memory Speculation
– Baseline WaveCache: no memory disambiguation
– 62% perf. improvement with perfect memory
disambiguation
• Upshot: unrealistic, but lots of headroom
– 340% improvement with both
Analysis
• WaveScalar makes dataflow much more general-purpose
• Seems fast enough to spend the time implementing
– Good IPC; more clock period headroom
• Why isn't this the gold standard?
– Why are Swanson, Oskin no longer into dataflow?
Questions?