Computing Without Processors
Thesis Proposal
Mihai Budiu, July 30, 2001

Thesis Committee: Seth Goldstein (chair), Todd Mowry, Peter Lee, Babak Falsafi (ECE), Nevin Heintze (Agere Systems)
This presentation uses TeXPoint by George Necula.

Four Types of Research
• Solve nonexistent problems
• Solve past problems
• Solve current problems
• Solve future problems

The Law (source: Intel)
[Chart: exponential growth of transistor counts over time (Moore's law).]

The Crossover Phenomenon
[Chart: two technology curves crossing at some point in time.]

Example Crossover
[Chart: CPU vs. DRAM access speed (ns) since 1980, starting near 200 ns; the crossover marks the shift from machines without caches to machines with caches.]

Trouble Ahead for Microarchitecture

Signal Propagation
[Chart: die size (about 20 mm) vs. the distance a signal travels in one clock cycle; the reachable distance drops below the die size in the near future.]

Reliability & Yield
[Chart: defects per chip for a new process; the defects occurring exceed the defects tolerable.]

Energy
[Chart: CPU power consumption vs. thermal dissipation over time, crossing around 100 W.]

Instruction-Level Parallelism (ILP)
[Chart: instructions fetched vs. instructions committed over time.]

Premises of this Research
• We will have lots of gates
  – Moore's law continues
  – Nanotechnology
• Contemporary architectures do not scale

Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work

ASH: Application-Specific Hardware
HLL program -> Compiler -> Circuit -> Reconfigurable hardware

ASH: A Scalable Architecture -- Thesis Statement --
Application-specific hardware on a reconfigurable-hardware substrate is a solution for the smooth evolution of computer architecture. We can provide scalable compilers for translating high-level languages into hardware.

Example
    int f(void) {
        int i = 0, j = 0;
        for (; i < 10; i++)
            j += i;
        return j;
    }

Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work

ASH and Nanotechnology
• Build reconfigurable hardware using nanotechnology
• Low power: 10^10 gates use less than 2 W, enabling huge structures
• Low cost: nanocents/gate
• High density: 10^5x over CMOS
[Figure: a nano-RAM cell; shown in yellow, a CMOS RAM cell for scale.]

A Limit Study of Performance
A graph of the whole program execution:
• Basic block
• Control-flow transfer
• Memory write
• Memory read
• Memory word

Typical Program Graph (g721_e)
[Figure: memory reads and control-flow transfers link a cluster containing 100% of the code (including memcpy) and a cluster containing 100% of the memory.]

Program Graph After Inlining
[Figure: the same program graph after inlining memcpy; memcpy now appears in multiple places.]

Application Slowdown
[Chart: slowdown relative to native execution (times slower, up to roughly 11x) for 099.go, 129.compress, 130.li, 132.ijpeg, adpcm_d, adpcm_e, epic, g721_Q_d, g721_Q_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d; two series, 1 clock/square and 5 clocks/square.]

How Time Is Spent
[Chart: for each of the same benchmarks, the percentage of time spent idle vs. in execution, control flow, and register traffic.]
• No caches: reads are expensive
• No speculation

Lesson
The spatial model of computation has different properties.

Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work

CASH: Compiling for ASH
• Program to circuits
• Memory partitioning
• Interconnection net

Compilation (Reliability)
1. Program:
    int reverse(int x) {
        int k, r = 0;
        for (k = 0; k < 32; k++) {
            r = r << 1;
            r |= x & 1;
            x = x >> 1;
        }
        return r;
    }
2. Split-phase Abstract Machines: computations & local storage; unknown-latency operations
3. Configurations placed independently
4. Placement on chip
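The split-phase idea above can be sketched in ordinary C. The following is a minimal illustration only, not part of the proposal; token_t, mem_read_issue, and mem_read_complete are hypothetical names. It models an unknown-latency memory read that is issued, overlapped with local computation, and consumed only when its completion token arrives.

    #include <stdio.h>

    /* Hypothetical token type: a value plus a "ready" flag modeling the
       completion signal of an unknown-latency (split-phase) operation. */
    typedef struct {
        int value;
        int ready;
    } token_t;

    static int memory[8] = { 3, 1, 4, 1, 5, 9, 2, 6 };

    /* Phase 1: issue the read; in hardware this starts a transaction whose
       latency is unknown to the issuing machine. */
    static token_t mem_read_issue(int addr) {
        token_t t = { memory[addr], 0 };    /* value in flight, not yet ready */
        return t;
    }

    /* Phase 2: completion arrives later and marks the token ready. */
    static token_t mem_read_complete(token_t t) {
        t.ready = 1;
        return t;
    }

    int main(void) {
        token_t load = mem_read_issue(2);       /* split-phase read of memory[2] */
        int local = 7 * 6;                      /* local computation overlaps the read */
        load = mem_read_complete(load);         /* the token signals completion */
        if (load.ready)                         /* consumer fires only on a ready token */
            printf("%d\n", local + load.value); /* prints 42 + 4 = 46 */
        return 0;
    }

In ASH the "ready" bit is a hardware token rather than a data field, but the firing rule is the same: an operation executes only once all of its input tokens have arrived.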
Split-phase Abstract Machines (Power)
[Figure: the control-flow graph (CFG) is partitioned into SAM 1, SAM 2, and SAM 3.]

Hyperblock => SAM
• Single-entry, multiple-exit
• May contain loops

SAM => FSM
[Figure: the SAM's states: Start, Loop, and Exit; the SAM has local memory and accesses remote memory.]

Implementing SAMs - interesting details -

The SAM FSM
[Figure: combinational logic plus registers implement the SAM FSM; arguments and a start signal enter, results and an exit signal leave; predicates provide control over the computation.]

Computation = Dataflow (Signals)
Programs become circuits:
    x = a & 7;
    ...
    y = x >> 2;
[Figure: the corresponding circuit: a and the constant 7 feed an AND gate producing x, which with the constant 2 feeds a shift-right producing y.]
• Variables => wires + tokens
• No token store; no token matching
• Local communication only

Tokens & Synchronization
• Tokens signal operation completion
• Possible implementations:
[Figure: three token implementations, labeled Local, Global, and Static, built from data, ack, valid, and reset signals.]

Speculation and Eager Muxes (ILP)
    if (x > 0)
        y = -x;
    else
        y = b * x;
[Figure: both y = -x and y = b * x are computed speculatively (the multiply is the slow path); the comparison x > 0 and its negation form the predicates driving an eager mux that selects y.]
Static Single Assignment implemented in hardware.

Predicates
• Select variable definition (e.g., two definitions x = ... reaching a use ... = x)
• Guard side-effects
  – Memory access (e.g., *q = 2;)
  – Procedure calls
• Control looping
• Decide exit branch

Computing Predicates
[Figure: predicate computation over a control-flow graph with blocks s, t, and b.]
• Correct for irreducible graphs
• Correct even when speculatively computed
• Can be eagerly computed

Loops + Dataflow = Pipelining
    for (i = 0; i < 10; i++)
        a[i] += i;
[Figure: the loop body unrolls into a pipeline: i (starting at 0, incremented by 1) and the address &a[0] feed adders, loads, and stores, so accesses to a[0], a[1], a[2], a[3] are in flight simultaneously at different stages.]

Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work

Evolutionary Path
Microprocessors -> ASH
The problem with ASH: resources.

Virtualization

CPU + ASH
[Figure: a CPU and an ASH core share the memory; the CPU provides support computation plus OS and VM, while the ASH core runs the computation.]

Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work

ASH Benefits
Problem       Solution
Reliability   Configuration around defects
Power         Only "useful" gates switching
Signals       Localized computation
ILP           Statically extracted

Scalable Performance
[Chart: projected performance over time; ASH keeps scaling while CPU performance levels off.]

Summary
• Contemporary CPU architecture faces many problems
• Application-Specific Hardware (ASH) provides a scalable technology
• Compiling HLLs into hardware dataflow machines is an effective solution

Timeline
[Gantt chart, 06/01 through 12/02: CASH core; explore architectural/compiler trade-offs; hw/sw partitioning (ASH + CPU); cost models; loop parallelization; memory partitioning; ASH simulation; write thesis.]

Extras
• Related work
• Reconfigurable hardware
• Other cross-over phenomena
• A CPU + ASH study
• More about predicates

Related Work
• Hardware synthesis from HLLs
• Reconfigurable hardware
• Predicated execution
• Dataflow machines
• Speculative execution
• Predicated SSA

Reconfigurable Hardware
[Figure: universal gates and/or storage elements connected by an interconnection network with programmable switches.]

Main RH Ingredient: the RAM Cell
[Figure: a RAM cell addressed by inputs a0, a1 acts as a universal gate (e.g., storing the truth table 0 0 0 1 implements a0 & a1); a switch controlled by a 1-bit RAM cell provides programmable routing.]
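As a rough software analogue of the RAM-cell slide above (my own sketch, not from the proposal; lut2_eval, pswitch, and the configuration constants are invented names), a 2-input lookup table stores a 4-entry truth table in RAM and therefore implements any 2-input gate, while a 1-bit cell configures a routing switch.

    #include <stdio.h>

    /* A 2-input lookup table (LUT): a 4-bit RAM whose contents define the gate.
       The inputs a1, a0 form the address; the stored bit is the output. */
    static int lut2_eval(unsigned char config, int a1, int a0) {
        int addr = (a1 << 1) | a0;          /* address into the 4-entry truth table */
        return (config >> addr) & 1;        /* read the configured bit */
    }

    /* A routing switch controlled by a 1-bit RAM cell: when the control bit is 1
       the input passes through, otherwise the wire is disconnected (modeled as 0). */
    static int pswitch(int control_bit, int in) {
        return control_bit ? in : 0;
    }

    int main(void) {
        const unsigned char AND_CONFIG = 0x8;   /* truth table 0 0 0 1: only address 3 stores 1 */
        const unsigned char OR_CONFIG  = 0xE;   /* truth table 0 1 1 1 */

        printf("AND(1,1) = %d\n", lut2_eval(AND_CONFIG, 1, 1)); /* 1 */
        printf("AND(1,0) = %d\n", lut2_eval(AND_CONFIG, 1, 0)); /* 0 */
        printf("OR(1,0)  = %d\n", lut2_eval(OR_CONFIG, 1, 0));  /* 1 */

        printf("switch(on, 1)  = %d\n", pswitch(1, 1));         /* 1 */
        printf("switch(off, 1) = %d\n", pswitch(0, 1));         /* 0 */
        return 0;
    }

Reprogramming is just a RAM write: storing 0 0 0 1 yields AND, 0 1 1 1 yields OR, and so on, which is why the slide calls the RAM cell a universal gate.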
Reconfigurable Computing
• Back to ENIAC-style computation
• Synthesize one machine to solve one problem

Efficiency
[Chart: hardware resources over time, divided into used and idle; the idle fraction grows.]

Manufacturing Cost
[Chart: fabrication cost vs. affordable cost over time; cost approaches 3x10^9 dollars.]

Complexity
[Chart: transistors available (10^8, 10^9, 10^10) vs. transistors manageable, over time.]

CAD Tools
[Chart: manual interventions necessary vs. feasible, over time.]

ASH Benefits
Problem       Solution
Reliability   Configuration around defects
Power         Only "useful" gates switching
Signals       Localized computation
ILP           Statically extracted
Complexity    Hierarchy of abstractions
CAD           Compiler + local place & route
Efficiency    Circuit customized to application
Cost          No masks, no physics, same substrate
Performance   Scalable

CPU + ASH Study
• Reconfigurable functional unit on the processor pipeline
• Adapted SimpleScalar 3.0
• ASH & CPU use the same memory hierarchy (including L1)
• ASH can access CPU registers
• CPU pipeline interlocked with ASH
• Results pending

Simplifying Predicates
• Shared implementations
• Control equivalence
[Figure: predicate circuits for blocks a, b, and c sharing implementations.]

Deep Speculation
    if (p)
        if (q) x = a;
        else   x = b;
    else       x = c;
[Figure: all three definitions a, b, c are computed; the predicates p&q, p&!q, and !p select x.]

Predicates & Tokens
[Figure: combining predicates with tokens for a side-effecting operation (*q = 2): the operation fires only when it is both ready and safe (ready & safe, P & P_ready); predicated tokens eliminate wires and eliminate speculation.]
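To connect the last two slides, here is a minimal C sketch (my own illustration, not from the proposal; deep_speculation is a hypothetical name) of the deep-speculation example flattened into predicated form: all three definitions of x are computed eagerly, and the mutually exclusive predicates p&q, p&!q, and !p select the one that is live.

    #include <stdio.h>

    /* Straight-line, predicated version of:
           if (p) { if (q) x = a; else x = b; } else x = c;
       Every definition is computed speculatively; the predicates p&q, p&!q, and !p
       are mutually exclusive, so exactly one term contributes to x. */
    static int deep_speculation(int p, int q, int a, int b, int c) {
        int x_a = a;                 /* speculative definition from the (p && q) path  */
        int x_b = b;                 /* speculative definition from the (p && !q) path */
        int x_c = c;                 /* speculative definition from the (!p) path      */

        int pq  = p && q;            /* predicate guarding x = a */
        int pnq = p && !q;           /* predicate guarding x = b */
        int np  = !p;                /* predicate guarding x = c */

        /* Eager "mux": exactly one predicate is true, so the OR of the
           predicate-gated values yields the live definition. */
        return (pq ? x_a : 0) | (pnq ? x_b : 0) | (np ? x_c : 0);
    }

    int main(void) {
        printf("%d\n", deep_speculation(1, 1, 10, 20, 30)); /* p&q  -> 10 */
        printf("%d\n", deep_speculation(1, 0, 10, 20, 30)); /* p&!q -> 20 */
        printf("%d\n", deep_speculation(0, 1, 10, 20, 30)); /* !p   -> 30 */
        return 0;
    }

In hardware the final selection is the eager mux from the earlier slide; predicates additionally guard side-effecting operations such as *q = 2, which is where the predicated tokens above come in.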