Spatial Computation
Computing without General-Purpose Processors

Mihai Budiu
Microsoft Research – Silicon Valley

Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Carnegie Mellon University

Outline
• Intro: problems of current architectures
• Compiling Application-Specific Hardware
• ASH evaluation
• Conclusions
[Chart: processor performance, 1980–2000, log scale 1–1000]

Resources [Intel]
• We do not worry about not having hardware resources
• We worry about being able to use hardware resources

[Chart: chip complexity, 10^4–10^10 ALUs/gates/wires; gate delay 5 ps vs. wire delay 20 ps]
• Cannot rely on global signals (the clock is a global signal)

[Same chart, annotated with our answers:]
• Simple hardware, mostly idle
• Short, unidirectional interconnect
• No interpretation
• Distributed control, asynchronous
• Automatic translation C → HW

Our Proposal: Application-Specific Hardware
• ASH addresses these problems
• ASH is not a panacea
• ASH is "complementary" to the CPU
[Diagram: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation) side by side, sharing the cache ($) and memory]

Paper Content
• Automatic translation of C to hardware dataflow machines
• High-level comparison of dataflow and superscalar
• Circuit-level evaluation: power, performance, area

Outline
• Problems of current architectures
• CASH: Compiling Application-Specific Hardware
• ASH evaluation
• Conclusions

Application-Specific Hardware
C program → Compiler → Dataflow IR → HW backend → reconfigurable/custom hardware

Computation
    x = a & 7;
    y = x >> 2;
The program IR maps directly to circuits:
• Operations → nodes (&7 and >>2 become pipeline stages)
• Variables → def-use edges (channels, i.e., wires)
• Dataflow execution: no interpretation

Basic Computation = Pipeline Stage
[Diagram: an operator (+) followed by a latch, with data, valid, and ack signals]
(A C sketch of this handshake appears at the end of this part.)

Distributed Control Logic
• No global FSM: local rdy/ack handshakes + short, local wires

MUX: Forward Branches
    if (x > 0) y = -x;
    else       y = b*x;
[Diagram: >, !, -, and * feeding a multiplexer that produces y]
• SSA = no arbitration
• Conditionals ⇒ speculation

Memory Access
[Diagram: LD/ST nodes reach a monolithic memory through a pipelined, arbitrated network]
• Local communication vs. global structures
• Future work: fragment this!

Outline
• Problems of current architectures
• Compiling ASH
• ASH evaluation
• Conclusions

Evaluating ASH
• C: MediaBench kernels (one hot function per benchmark)
• CASH core → Verilog back-end → commercial tools (Synopsys synthesis, Cadence place & route)
• 180 nm standard-cell library at 2 V: an ASIC in roughly 1999 technology
• Performance numbers from ModelSim (Verilog simulation)

Compile Time (for ~200 lines of C)
• CASH core: 20 seconds
• Verilog back-end: 10 seconds
• Synopsys synthesis: 20 minutes
• Cadence place & route: 1 hour

ASH Area
[Chart: ASH area per kernel in mm², split into datapath and memory access, for the adpcm, g721, gsm, jpeg, mpeg2, and pegwit encoders and decoders; compared against a minimal RISC core; for scale, the P4 is 217 mm²]

ASH vs. 600 MHz CPU [180 nm]
[Chart: "times slower" than the CPU, per kernel, ranging from 0.48x to 3.65x; several kernels run faster than the CPU]

Bottleneck: Memory Protocol
[Diagram: ST and LD nodes connected to memory through the load-store queue (LSQ)]
• Enabling dependent operations requires a round-trip to memory
• Limit study: a zero-time round trip ⇒ up to 5x speed-up
• Exploring novel memory access protocols
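Aside (not from the talk): a toy C model of the firing rule behind the "Basic Computation = Pipeline Stage" and "Distributed Control Logic" slides. The channel type and fire() function are invented for this sketch; the real handshake is an asynchronous circuit, not software. The point it illustrates: each stage advances on purely local conditions, with no global clock or FSM.

    #include <stdbool.h>
    #include <stdio.h>

    /* A point-to-point channel: one latched value plus handshake state.
       (Names are ours; this only models the rdy/valid/ack discipline.) */
    typedef struct {
        int  data;    /* latched value                */
        bool valid;   /* producer has presented data  */
        bool ack;     /* consumer has acknowledged it */
    } channel;

    /* One pipeline stage: fires only when its input is valid and its
       output channel is free, which are purely local conditions. */
    static bool fire(channel *in, channel *out, int (*op)(int))
    {
        if (in->valid && (!out->valid || out->ack)) {
            out->data  = op(in->data);
            out->valid = true;
            out->ack   = false;
            in->ack    = true;    /* acknowledge the producer */
            in->valid  = false;
            return true;
        }
        return false;             /* stall locally until the handshake allows */
    }

    static int and7(int a) { return a & 7;  }
    static int shr2(int x) { return x >> 2; }

    int main(void)
    {
        /* Circuit for:  x = a & 7;  y = x >> 2;  (the "Computation" slide) */
        channel a = { 29, true, false };
        channel x = { 0, false, false };
        channel y = { 0, false, false };

        bool progress = true;
        while (progress) {        /* fire stages until the circuit is quiescent */
            progress  = fire(&a, &x, and7);
            progress |= fire(&x, &y, shr2);
        }
        printf("y = %d\n", y.data);   /* (29 & 7) >> 2 == 1 */
        return 0;
    }

In the actual circuits each stage's handshake logic is a few gates next to its latch, which is why control is fully distributed and the wires stay short.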
Power
[Chart: ASH power per kernel in mW, roughly 9–35 mW; reference points: DSP 110 mW, µP 4,000 mW, Xeon (+cache) 67,000 mW]

Energy-delay vs. superscalar (times better)
[Chart: per-kernel energy-delay relative to a superscalar (Wattch simulation), log scale 1–10,000]

Energy Efficiency
[Chart: energy efficiency in operations/nJ, log scale 0.01–1,000; from least to most efficient: microprocessors, general-purpose DSP, FPGA, asynchronous µP, ASH media kernels, dedicated hardware; the ASH kernels sit roughly 1,000x above microprocessors, near dedicated hardware]

Outline
Problems of current architectures + Compiling ASH + Evaluation = Related work, conclusions

Related Work
• Optimizing compilers
• High-level synthesis
• Reconfigurable computing
• Dataflow machines
• Asynchronous circuits
• Spatial computation
We target an extreme point in the design space: no interpretation, fully distributed computation and control.

ASH Design Point
• Design an ASIC in a day
• Fully automatic synthesis to layout
• Fully distributed control and computation (spatial computation)
  – Replicate computation to simplify wires
• Energy/op rivals a custom ASIC
• Performance rivals a superscalar
• E×t 100 times better than any processor

Conclusions
Spatial computation strengths:
  Feature                  Advantage
  No interpretation        Energy efficiency, speed
  Spatial layout           Short wires, no contention
  Asynchronous             Low power, scalable
  Distributed              No global signals
  Automatic compilation    Designer productivity

Backup Slides
• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• Loops
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs

Absolute Performance
[Chart: per-kernel throughput, 0–9,000 megaoperations per second; three series: MOPS, MOPSall, MOPSspec]

Pipeline Stage
[Circuit: a register (D → Reg) holding datain/dataout, controlled by a C-element combining rdyin/rdyout and ackin/ackout]

Exceptions
• Strictly speaking, C has no exceptions
• In practice it is hard to accommodate exceptions in hardware implementations
• An advantage of software flexibility: the PC is a single point of execution control
[Diagram: as before, the CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation) share the cache and memory]

Critical Paths
    if (x > 0) y = -x;
    else       y = b*x;
[Diagram: in the straightforward circuit, y waits for the slower path (the multiply) even when the other branch is selected]

Lenient Operations
    if (x > 0) y = -x;
    else       y = b*x;
[Diagram: the multiplexer emits y as soon as the selected input arrives]
• Solves the problem of unbalanced paths (see the C sketch after the Branch Prediction slide below)

Normalized Area
[Chart: per kernel, lines of C per mm² (0–120) and mm² per kilobyte of code (0–2.5)]

Control Flow ⇒ Data Flow
[Diagrams: control flow is translated into dataflow using Merge (label), Gateway, and Split (branch) nodes operating on data and predicate values]

Loops
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i*i;
    return sum;
[Diagram: the loop as a dataflow circuit: i flows through +1 and < 100; i*i accumulates into sum through +; the loop-exit predicate (!) steers i and sum around the backedge or out to return]

ASH Weaknesses
• Both branch and join are not free
• Static dataflow (no re-issue of the same instruction)
• Memory is "far"
• Fully static
  – No branch prediction
  – No dynamic unrolling
  – No register renaming
• Calls/returns are not lenient

Branch Prediction
    for (i = 0; i < N; i++) {
        ...
        if (exception) break;
    }
[Diagram: the ASH critical path runs through +1, <, and the & combining the loop-exit conditions; the CPU critical path does not]
• "if (exception)" is predicted not taken: effectively a noop for the CPU!
• The backedge is predicted taken.
• A lenient & can make the result available before all of its inputs arrive.
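Aside (not from the talk): a toy C sketch of leniency, using the running example from the "Lenient Operations" slide. The token type and lenient_mux() are invented names; real lenient operators are asynchronous circuits. A lenient multiplexer emits its result as soon as the predicate and the *selected* input are known, without waiting for the slower, unselected path.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int data; bool valid; } token;

    /* Returns true (and sets *out) as soon as the result is determined;
       the unselected branch may still be in flight. */
    static bool lenient_mux(token pred, token t, token f, int *out)
    {
        if (!pred.valid)
            return false;                 /* must know the predicate */
        if (pred.data && t.valid) {       /* true branch selected and ready */
            *out = t.data;
            return true;
        }
        if (!pred.data && f.valid) {      /* false branch selected and ready */
            *out = f.data;
            return true;
        }
        return false;                     /* selected input not here yet */
    }

    int main(void)
    {
        /* y = (x > 0) ? -x : b*x, with the multiply still in flight:
           leniency lets y emerge before b*x finishes. */
        int x = 4, y;
        token pred = { x > 0, true };
        token neg  = { -x, true };        /* fast path: done    */
        token mul  = { 0, false };        /* slow path: pending */
        if (lenient_mux(pred, neg, mul, &y))
            printf("y = %d (before b*x completed)\n", y);
        return 0;
    }

This is how leniency fixes unbalanced paths: when the cheap branch is selected, the multiply's latency disappears from the critical path.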
Memory Partitioning
• MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00
• Stanford SpC: Semeria DAC '01, TVLSI '02
• Illinois FlexRAM: Fraguela PPoPP '03
• Hand annotations (#pragma)

Recursion
[Diagram: save live values → recursive call → restore live values, using an explicit stack]
(A C sketch of this scheme appears at the end of the deck.)

Leakage Power
Ps = k · Area · e^(–VT)
• Employ circuit-level techniques
• Cut the power supply of idle circuit portions
  – Most of the circuit is idle most of the time
  – Strong locality of activity
• High-VT transistors on the non-critical path

Why Not Compare To…
• An in-order processor?
  – Worse than a superscalar in all metrics except power
  – We beat it in all metrics, including performance
• A DSP?
  – We expect roughly the same results as for the superscalar (Wattch maintains high IPC on these kernels)
• An ASIC?
  – No available tool-flow supports C to the same degree
• An asynchronous ASIC?
  – We compared against the Balsa synthesis system
  – We are 15 times better in E×t than the resulting ASIC
• An asynchronous processor?
  – We are 350 times better in E×t than Amulet (scaled to 180 nm)

Compared to the Next Talk [180 nm]
  Engine     Performance [MIPS]   E/instruction [pJ]
  SNAP/LE            28                   24
  SNAP/LE           240                  218
  ASH              1100                   20

Why Not Target FPGAs?
• They do not support asynchronous circuits
• They are very inefficient in area, power, and delay
• They are too fine-grained for datapath circuits
• We are designing an async FPGA
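Aside (not from the talk): a minimal C sketch of the save/restore scheme from the "Recursion" backup slide. Since spatial computation has no implicit call stack or register file, live values that cross a recursive call are spilled to an explicit stack. The stack array, sp, and fact() below are invented for this illustration; they mirror the diagram, not the compiler's actual output.

    #include <stdio.h>

    #define STACK_DEPTH 64
    static int stack[STACK_DEPTH];   /* explicit spill area */
    static int sp;                   /* stack pointer */

    /* fact(n): the live value n is saved across the recursive call
       and restored afterwards, as in the slide's diagram. */
    static int fact(int n)
    {
        if (n <= 1)
            return 1;
        stack[sp++] = n;             /* save live values    */
        int r = fact(n - 1);         /* recursive call      */
        n = stack[--sp];             /* restore live values */
        return n * r;
    }

    int main(void)
    {
        printf("fact(5) = %d\n", fact(5));   /* prints 120 */
        return 0;
    }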