WaveScalar
Swanson et al.
Presented by Andrew Waterman
ECE259 Spring 2008

Why Dataflow?
• “[...] as wire delays grow relative to gate delays, improvements in clock rate and IPC become directly antagonistic” [Agarwal00]
• Large bypass networks, highly associative structures especially problematic
  – Can only ameliorate somewhat in superscalar designs (21264 clustering, WIB, etc.)
• Shorter wires, smaller loads => higher fclk possible with point-to-point networks, decentralized structures

Dataflow Locality
• Def: predictability of instruction dependencies
  – ~3/5 of source operands come from the most recent producer
• Completely ignored by most superscalars
  – Over-general design: large bypass networks, regular references to a huge PRF, ...
  – Partial exceptions: clustering, hierarchical RFs
• P4: 1 cycle of 31 stages devoted to execution
• Can exploit to greatly cheapen communication

The von Neumann abstraction
• Elegant as it is, the von Neumann execution model is inherently sequential
  – Control dependencies limit exploitable ILP considerably
    • P4 again: 20-stage (!) branch misprediction loop
  – Store/load aliasing hurts, too

Why Not Dataflow?
• Dataflow architectures may scale further, but...
• Who the hell wants to write a program in Id?
  – For commercial adoption and future sanity, must support von Neumann memory semantics
  – But ideally without fetch serialization

Enter WaveScalar
• WaveScalar: dataflow's new groove
• Enabled by process improvements: can integrate ~2K processing elements (PEs) + nearby storage on-die
  – “Cache-only” architecture (not in the COMA sense)
• Provides total load/store ordering
  – Can be programmed conventionally
• ...without a program counter

WaveScalar ISA
• WaveScalar binary encodes the DFG
• ISA is RISCy, plus a few new primitives
• Control flow:
  – ɸ insn implements the C ternary operator
    • Similar to predication
  – ɸ⁻¹ insn conditionally sends data to one PE or another based upon a Boolean selector input
  – Indirect-Send(arg, addr, offset) insn implements indirect jumps, calls, returns

WaveScalar ISA: Waves
• Wave === connected DAG, subset of DFG
  – Can span multiple hyperblocks iff each insn is executed at most once (no loops)
    • Easily elongated with unrolling
• To disambiguate which dynamic insn is being executed by a PE, data values carry a wave number
  – Wave numbers incremented by the Wave-Advance insn
• Wave number assignment is not centralized!

WaveScalar ISA: Memory Ordering
• Wave-ordered memory
  – Where possible, each mem op is labeled with its location within its wave: <predecessor, this, successor>
  – Control flow may prohibit this; when unknown, '?' is used as the label
  – Rule: no op with ? in its succ. field may connect to an op with ? in its pred. field
    • Solution: memory-nops
• Result: memory has enough info to establish a total load/store order
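To make the <predecessor, this, successor> labeling concrete, here is a small C fragment annotated (in comments) with the kind of ordering information the compiler would attach to each memory operation within one wave. The sequence numbers and the memory-nop placement are my own illustrative assumptions, not values taken from the paper or the slides.

    /* Hedged sketch: wave-ordered memory annotations for one wave. */
    void f(int *p, int *q, int cond) {
        int a = *p;      /* load   <., 1, ?>  successor unknown: a branch follows   */
        if (cond) {
            *q = a + 1;  /* store  <1, 2, 4>  both neighbors known on this path     */
        } else {
            /* no memory op on this path, so the compiler inserts a
               memory-nop <1, 3, 4>; otherwise the ?-successor load above
               would connect directly to the ?-predecessor store below    */
        }
        *p = a + 1;      /* store  <?, 4, .>  predecessor unknown: paths merge here */
    }

With labels like these, the store queue can tell whether the operation between sequence numbers 1 and 4 has arrived (2 or 3, depending on the branch) before firing the final store — enough information to recover a total load/store order without a program counter.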
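Looking back at the control-flow primitives on the WaveScalar ISA slide, a rough C-level intuition for ɸ versus ɸ⁻¹ (the function names and mapping notes are illustrative, not the compiler's actual output):

    /* ɸ (select): when this ternary is compiled to a ɸ insn, both candidate
       values arrive as inputs and the predicate merely picks one --
       analogous to predication.                                           */
    int select_style(int p, int a, int b) {
        return p ? a + 1 : b - 1;
    }

    /* ɸ⁻¹ (steer): the predicate decides which consumer the value x is sent
       to, so only the instructions on the chosen side ever receive an input
       and execute -- the dataflow stand-in for a conditional branch.       */
    int steer_style(int p, int x) {
        if (p)
            return x + 1;   /* x steered to the "then"-side consumers */
        else
            return x - 1;   /* x steered to the "else"-side consumers */
    }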
WaveCache: WaveScalar Implemented
• Grid of 2^11 (2048) PEs in clusters of 16
  – On each PE: control logic, IQ/OQ, ALU, buffering for 8 static insns
• Small L1 D$ per 4 clusters
• Traditional unified L2$
• 1 StQ per 4 clusters
  – Each wave bound to a StQ dynamically
• Intra-cluster comm: shared buses
• Inter-cluster: mesh?

Compilation
• Compilation basically the same as for a traditional arch.
  – To the point that binary translation is possible
  – Additional steps:
    • inserting memory-nops, wave-advances
    • converting branches to ɸ⁻¹
• Binaries larger
  – Extra insns
  – Larger insns (list of target PEs)
• ...but this is OK (no repeated fetch)

Program Load/Termination
• Loading
  – As usual, program loaded by setting the PC & incurring an I$ miss
  – Insn targets labeled "not-loaded" until those miss, as well
  – In general, hopefully I$ misses are infrequent
    • Must back up an evicted insn's state (queues), restore the new insn's state
    • Probably need to invoke the OS
• Termination
  – OS purges all insns from all PEs

Execution Example

    void s(char in[10], char out[10]) {
        int i = 0, j = 0;
        do {
            int t = in[i];
            if (t)
                out[j++] = t;
        } while (i++ < 10);
    }

• And it's that simple!

Just Kidding...
• The same s() routine, shown as a dataflow graph
[Figure: the graph unmapped, and mapped onto the WaveCache]

How Well Does It Do?
• Methodology
  – Benchmarks: SPEC and a few others
  – Compiled for Alpha & binary-translated
    • Fairness; better overall code generation
    • But no WaveCache-specific optimizations
  – Results reported in Alpha-equivalent IPC
    • Fairness (WaveScalar has extra insns)

How Well Does It Do?
• Favorable comparison to superscalar
  – 16-wide (!!), out-of-order
  – |PRF| = |IW| = 1024
• Better IPC than TRIPS, but certainly lower fclk
  – TRIPS limited by smaller execution units (hyperblocks vs. waves)

Other performance results
• Extra instruction overhead
  – In terms of static code size: 20%-140%
  – In terms of execution time: 10%
• Parallelism
• Input queue size
  – 8 sets of input values sufficient for most programs
    • Except for victims of parallelism explosion

Performance improvements
• Control speculation
  – Baseline WaveCache: no branch prediction
  – 47% perf. improvement with perfect prediction
• Memory speculation
  – Baseline WaveCache: no memory disambiguation
  – 62% perf. improvement with perfect memory disambiguation
• Upshot: unrealistic, but lots of headroom
  – 340% improvement with both

Analysis
• WaveScalar makes dataflow much more general-purpose
• Seems fast enough to be worth the time to implement
  – Good IPC; more clock-period headroom
• Why isn't this the gold standard?
  – Why are Swanson, Oskin no longer into dataflow?

Questions?