Spatial Computation
Computing without General-Purpose Processors

Mihai Budiu
Microsoft Research – Silicon Valley
joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Carnegie Mellon University
May 10, 2005

Outline
• Intro: Problems of current architectures
• Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions
[Figure: performance over time, 1980 to 2000, on a logarithmic axis from 1 to 1000]

Resources [Intel]
• We do not worry about not having hardware resources
• We worry about being able to use hardware resources

[Figure: logarithmic scale from 10^4 to 10^10; labels: Complexity, ALUs, gate, wire, 5ps, 20ps]
• Cannot rely on global signals (clock is a global signal)

[Figure: the same scale, annotated with the points below]
• Simple, short, unidirectional interconnect
• Automatic translation C → HW
• Simple hw, mostly idle
• No interpretation
• Distributed control, asynchronous
• Cannot rely on global signals (clock is a global signal)

Our Proposal: Application-Specific Hardware
• ASH addresses these problems
• ASH is not a panacea
• ASH “complementary” to CPU
[Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation) sharing $ and Memory]

Outline
• Problems of current architectures
• CASH: Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions

Application-Specific Hardware
C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw

Computation = Dataflow
    Program       IR               Circuits
    Operations    Nodes            Pipeline stages
    Variables     Def-use edges    Channels (wires)
Example:
    x = a & 7;
    ...
    y = x >> 2;
[Figure: the example as a dataflow graph: a feeds &7, whose result x feeds >>2]
• No interpretation

Basic Computation = Pipeline Stage
[Figure: an adder (+) followed by a latch; signals: data, valid, ack]

Asynchronous Computation
[Figure: a network of adder stages with latches exchanging data, valid, and ack; events numbered 1 to 8]

Distributed Control Logic
[Figure: labels: global FSM, ack, rdy, +]
• short, local wires

MUX: Forward Branches
    if (x > 0)
        y = -x;
    else
        y = b*x;
[Figure: dataflow circuit for this code, with inputs b and x, nodes *, -, >, !, 0, f, and a MUX producing y]
• SSA = no arbitration
• Conditionals ⇒ Speculation
[Figure annotation: critical path]

Control Flow ⇒ Data Flow
• Merge (label)
• Gateway
• Split (branch)
[Figure: the three node types with their data and predicate (p, !) inputs]

Loops
    int sum=0, i;
    for (i=0; i < 100; i++)
        sum += i*i;
    return sum;
[Figure: dataflow graph of the loop; nodes: i, sum, 0, *, +, +1, < 100, !, ret]

Pipelining (animation, steps 1 to 7)
[Figure: the circuit for the same loop, with nodes i, *, +, <=, 100, 1, sum, stepped through seven animation frames]
• pipelined multiplier (8 stages)
• Steps 5 and 6: i=0 and i=1 are in flight at the same time
• Step 7 labels: i’s loop, sum’s loop, predicate, long latency pipe

Pipelining
[Figure: the same circuit; labels: i’s loop, sum’s loop, critical path]
• Predicate ack edge is on the critical path.

Pipeline balancing
[Figure: the same circuit with a decoupling FIFO; labels: i’s loop, sum’s loop, critical path]

Procedures
[Figure: labels: Caller, Callee, Call, Argument, Return, Continuation]

Memory Access
[Figure: LD and ST operations connected through a pipelined arbitrated network to a Monolithic Memory; labels: local communication, global structures]
• Future work: fragment this!

Outline
• Problems of current architectures
• Compiling ASH
• ASH Evaluation
• Conclusions

Evaluating ASH
• Input: C, Mediabench kernels (1 hot function/benchmark)
• CASH core
• Verilog back-end
• Commercial tools: Synopsys, Cadence P/R
• 180nm std.
  cell library, 2V (~1999 technology)
• ModelSim (Verilog simulation), Mem
• ASIC
• performance numbers

Compile Time
    C                        200 lines
    CASH core                20 seconds
    Verilog back-end         10 seconds
    Synopsys, Cadence P/R    20 minutes, 1 hour
[Figure: the same toolflow, ending in Mem and ASIC]

ASH Area
[Figure: bar chart of area in sq mm (axis 0 to 4.5) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; bars split into Memory access and Circuit; reference marks: P4: 217 mm², minimal RISC core]

ASH vs 600MHz CPU [4-wide OOO, 0.18 μm]
[Figure: bar chart of Times faster (axis 0 to 2.5) for the same kernels plus the average; bar labels range from 0.43 to 2.40]

Bottleneck: Memory Protocol
[Figure: LD and ST operations, LSQ, Memory]
• Enabling dependent operations requires round-trip to memory.
• Exploring novel memory access protocols.

Power
[Figure: bar chart of Power [mW] for the same kernels plus the average; reference points: Xeon [+cache] 67000, μP 4000, DSP 110]

Energy-delay
[Figure: bar chart of Times better than superscalar, logarithmic axis from 1 to 10000, for the same kernels plus the average]

Energy Efficiency (non-speculative arithmetic)
[Figure: bar chart of Energy Efficiency [Operations/nJ] (axis 0 to 160) for the same kernels]

Energy Efficiency
[Figure: Energy Efficiency [Operations/nJ] on a logarithmic axis from 0.01 to 1000; categories: Microprocessors, General-purpose DSP, FPGA, Asynchronous μP, ASH media kernels, Dedicated hardware; a 1000x span is marked]

Outline
Problems of current architectures + Compiling ASH + Evaluation = Related work, Conclusions

Bibliography
• Dataflow: A Complement to Superscalar. Mihai
  Budiu, Pedro Artigas, and Seth Copen Goldstein. ISPASS 2005
• Spatial Computation. Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. ASPLOS 2004
• C to Asynchronous Dataflow Circuits: An End-to-End Toolflow. Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, and Seth Copen Goldstein. IWLS 2004
• Optimizing Memory Accesses For Spatial Computation. Mihai Budiu and Seth Copen Goldstein. CGO 2003
• Compiling Application-Specific Hardware. Mihai Budiu and Seth Copen Goldstein. FPL 2002

Related Work
• Optimizing compilers
• High-level synthesis
• Reconfigurable computing
• Dataflow machines
• Asynchronous circuits
• Spatial computation
We target an extreme point in the design space: no interpretation, fully distributed computation and control

ASH Design Point
• Design an ASIC in a day
• Fully automatic synthesis to layout
• Fully distributed control and computation (spatial computation)
  – Replicate computation to simplify wires
• Energy/op rivals custom ASIC
• Performance rivals superscalar
• E×t 100 times better than any processor

Conclusions
Spatial computation strengths
    Feature                  Advantages
    No interpretation        Energy efficiency, speed
    Spatial layout           Short wires, no contention
    Asynchronous             Low power, scalable
    Distributed              No global signals
    Automatic compilation    Designer productivity

Backup Slides
• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs

Absolute Performance
[Figure: bar chart of Millions of Operations per Second for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; series: MOPSall, MOPSspec, MOPS; axis 0 to 6000 with one bar labeled 12300; the CPU range is marked]

Pipeline Stage
[Figure: control logic of one stage; labels: rdyin, rdyout, ackin, ackout, datain, dataout, C, =, Reg, D]

Exceptions
[Figure: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation) sharing $$$ and Memory]
• Strictly speaking, C has no exceptions
• In practice hard to accommodate exceptions in hardware implementations
• An advantage of
  software flexibility: PC is single point of execution control

Critical Paths
    if (x > 0)
        y = -x;
    else
        y = b*x;
[Figure: dataflow circuit for this code; nodes: b, x, 0, *, -, >, !, y]

Lenient Operations
    if (x > 0)
        y = -x;
    else
        y = b*x;
[Figure: the same dataflow circuit]
• Solves the problem of unbalanced paths

Normalized Area
[Figure: bar chart per kernel plus the average; series: Source Lines/sq mm (axis 0 to 250) and Object code KBytes/sq mm (axis 0 to 6)]

ASH Weaknesses
• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is “far”
• Fully static
  – No branch prediction
  – No dynamic unrolling
  – No register renaming
• Calls/returns not lenient

Branch Prediction
    for (i=0; i < N; i++) {
        ...
        if (exception) break;
    }
[Figure: the loop as a circuit; nodes: i, 1, +, <, exception, !; labels: ASH crit path, CPU crit path]
• Predicted not taken: effectively a noop for CPU!
• Predicted taken.
• Result available before inputs

Memory Partitioning
• MIT RAW project: Babb FCCM ’99, Barua HiPC ’00, Lee ASPLOS ’00
• Stanford SpC: Semeria DAC ’01, TVLSI ’02
• Illinois FlexRAM: Fraguela PPoPP ’03
• Hand-annotations #pragma

Recursion
[Figure: a recursive call bracketed by two steps that use the stack]
• save live values
• recursive call
• restore live values

Leakage Power
P_s = k · Area · e^{-V_T}
• Employ circuit-level techniques
• Cut power supply of idle circuit portions
  – most of the circuit is idle most of the time
  – strong locality of activity

Why Not Compare To…
• In-order processor
  – Worse in all metrics than superscalar, except power
  – We beat it in all metrics, including performance
• DSP
  – We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)
• ASIC
  – No available tool-flow supports C to the same degree
• Asynchronous ASIC
  – We compared with a Balsa synthesis system
  – We are 15 times better in E×t compared to the resulting ASIC
• Async processor
  – We are 350 times better in E×t than Amulet (scaled to 0.18 μm)

Why not target FPGA
• Do not support asynchronous circuits
• Very inefficient in area, power, delay
• Too fine-grained for datapath circuits
• We are designing an async FPGA
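
Illustration: forward branches as speculation

The "MUX: Forward Branches" slide turns a conditional into speculation: both arms are computed and a multiplexer selects the result. The following is a minimal software sketch of that idea, not the output of CASH. The names branchy, speculated, forms_agree, and the macro SPECULATION_DEMO_MAIN are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Control-flow form, as written on the "MUX: Forward Branches" slide. */
int branchy(int x, int b)
{
    int y;
    if (x > 0)
        y = -x;
    else
        y = b * x;
    return y;
}

/*
 * Speculated form (illustrative assumption, not CASH output): both arms
 * are evaluated unconditionally, like two hardware operators that always
 * run, and the predicate only steers the final selection.
 */
int speculated(int x, int b)
{
    int neg  = -x;      /* "then" arm, computed whether or not x > 0 */
    int prod = b * x;   /* "else" arm, computed whether or not x > 0 */
    int pred = x > 0;   /* predicate driving the mux select input */
    return pred ? neg : prod;
}

/* Returns 1 if both forms agree for every x and b in [-limit, limit]. */
int forms_agree(int limit)
{
    for (int x = -limit; x <= limit; x++)
        for (int b = -limit; b <= limit; b++)
            if (branchy(x, b) != speculated(x, b))
                return 0;
    return 1;
}

#ifdef SPECULATION_DEMO_MAIN   /* hypothetical macro: build with -DSPECULATION_DEMO_MAIN */
int main(void)
{
    assert(forms_agree(50));
    printf("speculated(3, 5)  = %d\n", speculated(3, 5));
    printf("speculated(-2, 5) = %d\n", speculated(-2, 5));
    return 0;
}
#endif
```

Speculating both arms is safe here only because neither arm has side effects. The inputs are kept small so that -x and b*x cannot overflow.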