Asynchronous Logic: Results and Prospects
Alain J. Martin
California Institute of Technology
NTU, March 2007

What Is Asynchronous Logic?

Sequencing and Computation
“An algorithm is a sequence of computational steps.” (CL&R)
How do we implement sequencing in a continuous physical medium?
Traditional answer: use of a global time reference (“the clock”)
[Figure: modules A, B, C, D, E driven by a common clock CLK]

Can we compute without a clock?
Yes: “asynchronous” or “clockless” logic
Also “self-timed” or “speed-independent”
David Muller, “Theory of Asynchronous Circuits” (1959)
ILLIAC (1959) and ILLIAC II (1962): partially asynchronous
PDP6 (1960): asynchronous

Can we compute without a clock and without delay assumptions?
Delay-insensitivity (Molnar, 198x…)
Almost: “The class of delay-insensitive circuits is limited (not Turing-complete).” (Martin, 1990)
Quasi-delay-insensitive (QDI) logic:
– Delay-insensitive
– Isochronic forks (only delay assumption)
QDI is Turing-complete (Martin & Manohar, 1996)

What is an Asynchronous Circuit?
Asynchronous system: collection of modules communicating by handshake protocols
Distributed system on a chip (communicating by message exchange)
[Figure: modules A, B, C, D, E connected by channels, each with an ack wire]

Caltech QDI Approach
Quasi-delay-insensitive (QDI) design
Minimal delay assumptions (only isochronic forks)
Stricter logic synthesis (DI codes for datapath, completion trees), but…
Robust and efficient (no evidence that delay assumptions improve efficiency)

Why Asynchronous and QDI Logic?

Scientific Reasons
Understanding the role of time in computation
Limit of delay insensitivity
Implementing a digital computation directly in a continuous physical medium
Design by program transformation (real correctness-by-construction approach)
“VLSI design as programming” paradigm

Engineering Reasons
Better match for high-level synthesis
– Can separate correctness from performance issues
Modularity and better use of concurrency
Large system design (SoC): only local communication
Efficiency
– Average-case instead of worst-case behavior
– Less pressure for global optimization (“timing closure”)
Robustness and reliability
– Robust to variations in fabrication technology, temperature, voltage, and noise; SEU-tolerant
Energy efficiency

Energy Advantages of Async
No clock
– Up to 50% of clock power recovered
Automatic shut-off of idle parts
– Perfect clock gating
No glitches (spurious transitions)
– Up to 50% of power in combinational circuits
Automatic adaptation to parameter variations
– Voltage scaling: perfect exchange of delay against energy
Flexibility of asynchronous interfaces
– Better use of concurrency

Reactive Use in Embedded Systems
Archetype of a reactive system
Average execution time may be much shorter than maximal execution time
Sleep sequence without race condition
– Modeled after wait/signal with condition variables
Instant wake-up from deep sleep

Robustness to PVT Variations
Increase in physical parameter variations (PVT) is becoming a huge problem…
Even worse in future technologies (nano CMOS or others)
Variations of physical parameters all affect timing
Increased timing variations reduce robustness and/or performance
Single time reference (clock) may become unavailable or too expensive in future technologies and large systems (SoC)

Robustness to Voltage and Temperature Variations

Single-event Upset and Soft-error Tolerance of QDI Circuits
Soft errors caused by alpha particles, cosmic rays, and other radiation sources are becoming increasingly problematic, even at ground level
QDI circuits can absorb most “dose effects”
Single-event upsets that cause a soft error (bit flip) can be corrected efficiently in QDI circuits
Error-correction scheme specific to QDI
Entire async microcontroller SEU-tolerant

Detection and Correction of SE in QDI Circuits
Single-error detection: duplicate and compare
Correction:
– Prevent propagation of detected SE
– Stability of guards corrects automatically
– “Detection is correction”
Simplest, most expensive coding, but simplest detection mechanism
Entire microcontroller SEU-tolerant

Disadvantages of Async
Size overhead (more transistors)
Poorly understood and rarely taught
No industrial CAD tools (yet)
No well-developed testing procedure (yet)
No easy transition path for large established companies…

Experimental Evidence

Asynchronous Chips @ Caltech
World-first Asynchronous Microprocessor (1988)
Lattice-Structure Filter (1994)
MiniMIPS (1998)
Lutonium 8051 Microcontroller (2005)

First Asynchronous Microprocessor (Caltech, 1988)
16-bit RISC, 2-micron CMOS
Formal synthesis:
– Initial sequential description was a single page of CHP code
– 5 months from start of project to tape-out (small group)
– Fully functional on first silicon
Performance:
– 5 MIPS, 5mA @ 2V
– 18 MIPS, 45mA @ 5V
– 26 MIPS, 100mA @ 10V
Potato-chip experiment
– Runs on a potato as power supply!
– 50kHz @ 0.75V, 300kHz @ 0.9V

Asynchronous MIPS R3000 Microprocessor
Standard 32-bit RISC ISA
Single instruction issue, one branch delay slot
Precise exceptions
2 on-chip caches: 4kB Icache and 4kB Dcache
First prototype (1998):
– No TLB
– 2M transistors
– First asynchronous processor competitive with large synchronous designs

MiniMIPS Low-Voltage Operation
Functional from 0.5V Vdd up
Functional at 0.4V with some transistor resizing

Asynchronous MIPS: Practical Results
HP’s 0.6-micron CMOS
– Expected: 275 MIPS @ 7W @ 3.3V @ 25 °C
– First prototype: 190 MIPS @ 4W @ 3.3V @ 25 °C
– Voltage range: 1V (9.66MHz @ 0.021 W) to 8V
Functional on first silicon despite
– Inconsistencies in HP’s process parameters (e.g. higher Vt’s)
– Long polysilicon wire overlooked in the critical fetch loop
– (Testament to the robustness of asynchronous design style!)
Roughly 4x faster than commercial synchronous MIPS ported to same technology
– Note: no particular effort made towards designing for low power.

Lutonium-18: QDI 8051 Microcontroller
TSMC SCN018 through MOSIS
– 0.18 µm CMOS
– 1.8V nominal
– |Vt| = 0.4V to 0.5V
Expected area: 5 mm^2 (including 8kB SRAM)
Performance from low-level simulation (conservative!)

Vdd     Speed      Power      Energy/inst   Efficiency
1.8 V   200 MIPS   100.0 mW   500 pJ/inst   1800 MIPS/W
1.1 V   100 MIPS   20.7 mW    207 pJ/inst   4830 MIPS/W
0.9 V   66 MIPS    9.2 mW     139 pJ/inst   7200 MIPS/W
0.8 V   48 MIPS    4.4 mW     92 pJ/inst    10900 MIPS/W
0.5 V   4 MIPS     170 µW     43 pJ/inst    23000 MIPS/W

Energy Efficiency Metric: Et^2
E = C*V^2, t = k/V
Hence E*t^2 = C*k^2, which is independent of V
Estimate of energy efficiency
Comparison of designs
“Algorithmic of energy”
See Chapter 15 in “Power Aware Computing”, Graybill & Melhem, eds., Kluwer

Voltage Scaling Advantage: Comparison to Intel XScale

Energy Breakdown and Comparisons
Microprocessor (results), MIPS:
– Energy: 33nJ (async 0.6 µm) vs. 70nJ (sync 0.6 µm)
– Cycle time: 6ns (async 0.6 µm) vs. 21ns (sync 0.6 µm)
[Figure: energy-breakdown chart with sectors icache, fetch, bus, exec units (adder, shifter, fblock, mem, mult/div), decode, write back, regfile (bypass); labeled shares: fetch 11%, bus 12%, icache 31%, decode 4%, exec units 8%, writeback 7%, regfile 27%]
Microcontroller (estimation), 8051:

Design                  Energy per instr   Cycle time
sync 0.5 µm             10.00nJ (1X)       20ns (1X)
async 0.5 µm            1.67nJ (6X)        10ns (2X)
async 0.18 µm @ 1.8V    0.56nJ (18X)       5ns (4X)
async 0.18 µm @ 0.9V    0.14nJ (72X)       10ns (2X)

More than 100X Et^2 improvement over any other 8051

Design Methodology

Handshakes & Dual-Rail Encoding
BUFFER: *[ L?x; R!x ]
[Figure: buffer with input channel L? (data rails L0, L1; acknowledge La) and output channel R! (data rails R0, R1; acknowledge Ra)]
Four-phase handshake
Dual-rail encoding:
– 3 wires (2 data, 1 ack) for one bit of information
– Other DI codes are used: 1-of-N

C0   C1   Data
0    0    Hasn’t arrived
1    0    0
0    1    1
1    1    invalid

A QDI pipeline stage
*[ L?x; R!f(x) ]

QDI Pipeline vs. Bundled Data
Dual-rail or 1-of-n data encoding
Completion tree
Criticism: high overhead (2*N + 1 wires and completion tree)
Alternative: bundled data
– N + 1 wires, no completion tree
– Delay line for indicating completion, spurious transitions
Big controversy!

Fine-grain Pipeline (PCHB)
[Figure: PCHB stage with input channel L? and acknowledge La, function block f, output channel R!, enable en, validity signals Lv and Rv, and completion logic]
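
The four-phase handshake and the dual-rail code table from the Handshakes & Dual-Rail Encoding slide can be sketched in software. This is a minimal illustration under stated assumptions, not the author's tooling: the function names (`encode`, `decode`, `four_phase_transfer`) and the trace format are hypothetical. It walks one bit through the four phases: a data rail rises, the acknowledge rises, the rails return to neutral, the acknowledge falls.

```python
# Illustrative model only: names and trace format are assumptions,
# not part of the original slides or the Caltech tool suite.

NEUTRAL = (0, 0)  # both rails low: data has not arrived


def encode(bit):
    """Dual-rail encode one bit as (C0, C1): 0 -> (1, 0), 1 -> (0, 1)."""
    return (1, 0) if bit == 0 else (0, 1)


def decode(rails):
    """Return the bit carried by (C0, C1), or None while the code is neutral."""
    if rails == NEUTRAL:
        return None
    if rails == (1, 1):
        raise ValueError("invalid dual-rail code: both rails high")
    return 0 if rails == (1, 0) else 1


def four_phase_transfer(bit):
    """Send one bit over a dual-rail channel; return (received bit, trace).

    The trace lists (C0, C1, ack) after each of the four phases.
    """
    rails, ack = NEUTRAL, 0
    trace = []

    rails = encode(bit)          # phase 1: sender raises one data rail
    trace.append((*rails, ack))

    received = decode(rails)     # receiver sees valid data ...
    ack = 1                      # phase 2: ... and raises the acknowledge
    trace.append((*rails, ack))

    rails = NEUTRAL              # phase 3: sender returns rails to neutral
    trace.append((*rails, ack))

    ack = 0                      # phase 4: receiver lowers the acknowledge
    trace.append((*rails, ack))

    return received, trace


if __name__ == "__main__":
    for b in (0, 1):
        value, trace = four_phase_transfer(b)
        print(b, value, trace)
```

The model is sequential and ignores delays entirely, which is the point of a delay-insensitive code: the receiver learns that data has arrived from the rails themselves, not from a timing assumption.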

Fine-Grain Pipeline
No need for separate register
Very high throughput and low forward latency
Excellent Et^2 performance
Entirely QDI
Used in MiniMIPS and Lutonium
Area overhead significant

Lower-Level Synthesis: HSE
CHP Program:
*[ L?x; R!x ]
Handshaking Expansion:
*[ [ ~Ra & L0 -> R0+ [] ~Ra & L1 -> R1+ ]; La+; [ Ra -> R0-, R1- ]; [ ~L0 & ~L1 -> La- ] ]
Four-phase handshake on each channel (transitions numbered 1 to 8 in the figure):
– L: [ Ld ]; La+; [ ~Ld ]; La-
– R: [ ~Ra ]; Rd+; [ Ra ]; Rd-

Lower-Level Synthesis: PRS
CHP Program:
*[ L?x; R!x ]
Handshaking Expansion:
*[ [ ~Ra & L0 -> R0+ [] ~Ra & L1 -> R1+ ]; La+; [ Ra -> R0-, R1- ]; [ ~L0 & ~L1 -> La- ] ]
Production Rule Set:
L0 | L1 -> Lv+
~La & ~Ra & L0 -> R0+
~La & ~Ra & L1 -> R1+
R0 | R1 -> Rv+
Lv & Rv -> La+
~L0 & ~L1 -> Lv-
Ra & La -> R0-
Ra & La -> R1-
~R0 & ~R1 -> Rv-
~Lv & ~Rv -> La-
To PRS for CMOS …

Lower-Level Synthesis: PRS
Each production rule has the form:
guard expr -> node+ or guard expr -> node-
These can be evaluated as
if (guard expr is true) node = Vdd or if (guard expr is true) node = GND
A set of production rules must be stable and non-interfering (for hazard-free circuits)

Asynchronous Architectures
New asynchronous solutions for pipelined microprocessors
Execution units are in parallel, allowing concurrent and out-of-order execution of instructions

CAD Tools
Complete suite of tools: synthesis, simulation, verification, optimization, layout
Designer-assisted compilation
Tools are modular and customizable
Main representations: CHP, PRS, Cast

Design Flow
[Figure: design flow from sequential program through synthesis to concurrent system, logical PRS, sized PRS, collection of cells, placed cells, routed cells, and physical layout; tools shown: chpsim, cosim, prsim/esim, spice, DDD, SDD, PL2, database, placer, router, sizer (resize using wire information)]

Robustness and Reliability

Robustness to Power-Supply Noise
HSPICE simulation of a typical QDI asynchronous circuit: a five-stage ring of async (PCHB) pipeline stages.
Technology: TSMC 0.18 micron CMOS, Vdd: 1.8V, Vt: 0.5V, complete layout.
Vdd oscillates between 3.5V and 0V (maximal amplitude) at various frequencies.
The circuit keeps working correctly!
(It will malfunction at some very high-frequency noise in phase with circuit frequency.)

SE-Tolerant QDI Circuits
[Figure: SE-tolerant gate with signals xa, ya, xb, yb, z’a, za, zb, z’b; C-elements; intermediate and final stages]

Soft-error Tolerant Asynchronous Microprocessor (STAM)
The STAM architecture defines a simplified 32-bit RISC instruction set with eight general registers and four types of instructions: arithmetic, branch, memory, and shift operations.
A partially wired layout of the STAM was completed in TSMC SCN 0.18 µm CMOS. In SPICE simulation, it runs at about 120 MHz.
The soft-error tolerance of the STAM was tested by injecting errors randomly while the STAM ran the RC4 program (a simple stream cipher) in the digital-level simulator.
About five soft errors, at locations chosen randomly from a list of all nodes of the STAM, are injected during the execution of each instruction.
About 25% of the 203,000 nets in the STAM experience a bit flip in each test.
The figure shows the locations of errors as dots; each box in the figure represents a CHP process.

Async Molecular Nanoelectronics
Molecular nano was our motivation for XQDI: extreme case of variability!

“Extreme” QDI (XQDI)
Can we improve QDI to eliminate (or further reduce) the remaining variability dependencies?
– Isochronic forks
– Keepers on state-holding nodes
– Slew rates and oscillating rings

Isochronic Forks
Only timing assumption in QDI design
New design style that (1) minimizes the number of isochronic forks, and (2) mitigates their effect
d(single transition) << d(multi-transition path)
One-sided inequality can always be satisfied

Cell Design without Keeper
Keepers needed for state-holding cells
Keeper requires transistor sizing and balancing current strengths. Difficult with variability…
[Figure: C-element with keeper and without keeper]

Ring Oscillators
An async system is a collection of rings of operators.
Oscillating rings are the engine of an asynchronous circuit.
Right choices of slew rates and number of stages guarantee that each ring oscillates.
What are the limits? How many restoring stages per ring?
[Figure: ring of C-elements]

Theoretical Results & General Comments

Concurrency and the Digital/Analog Interface
Elementary building block: guarded transition (PR: guard expr -> node+ or guard expr -> node-)
Stability and non-interference are necessary and sufficient to guarantee the absence of logical hazards
A stable and non-interfering PR set is deterministic (Church-Rosser property)
Any sequential execution is OK (powerful simulator and execution model)

Analog Implementation
There exists a QDI (stable, non-interfering) implementation for any deterministic computation (Turing-completeness)
Arbitration treated separately. Metastability of arbiters is not a problem because of asynchrony
Analog requirements on isochronic forks and ring oscillation can always be satisfied by adding restoring delays to the circuit (single-sided timing requirement).

Knowledge vs. Ignorance
Cost of implementing sequencing
In a clocked discipline: relies on knowledge of delays
Because of increasing variability and complexity, this knowledge is increasingly expensive!
In a QDI system, timing is ignored; cost to implement sequencing is high but fixed!
“If knowledge is expensive, try ignorance”

At some point in time the costs cross…
Crossing point already passed for SoC…
[Figure: cost per component versus technology (increasing variability and complexity), with curves for clocked and QDI design]

Intel Says…
From ISSCC 2005 article by Intel about Itanium L3 cache:
“…traditional synchronous design becomes increasingly inefficient. Much of total delay is dedicated to clock skew, latch delay, margin in each cycle, and non-ideal division to cycle boundaries. …Significant margins must be added to account for slow marginal cells that are statistically probable in a 24MB cache. The delivery of low clock skew over such an area is also difficult and costly. This single-ended asynchronous design eliminates the drawbacks above…”

Conclusion
Async QDI logic can be made extremely robust to timing variations and therefore to parameter variability
Flexible interfaces of async and absence of global signal are better suited for complex system design as in SoC
Better match for probabilistic design
Energy efficient
No synchronization failure because of metastability
As technology advances, less costly for complex designs

Conclusion
As we enter the nanoscale era:
– System complexity (interfaces, clocking in SoC, reuse)
– Robustness issues (parameter variations, soft errors, noise)
– Costs: masks, design time
– Power and energy consumption
– “End-of-Moore’s-law” argument for parallelism
An asynchronous approach offers many advantages and is unavoidable in the long run.

Industrial Prospects
Time is ripe. Why is industry so aloof?
Absence of industrial CAD tools
No seamless transition (GALS the stopgap solution?)
Maybe not in Intel’s interest?
Perhaps we need an industrial environment untied to traditional approaches and EDA tools
Async offers an opportunity to leapfrog the current technology limitations

Managing Complexity: The Design Productivity Gap
From: The International Technology Roadmap for Semiconductors, 1999

Managing Complexity
All circuits designed have been found fully functional on first silicon:

Year   Transistors   Description
1985   200           Distributed mutual exclusion element
1986   2 000         Stack element
1989   20 000        First microprocessor
1995   500 000       DSP filter
1998   2 000 000     MIPS microprocessor
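
As a closing illustration of the guarded-transition model from the Theoretical Results slides (a stable, non-interfering production-rule set is deterministic, so any sequential execution order is acceptable), the following toy simulation may help. It is a sketch under assumptions, not prsim or any Caltech tool: the rule set models a C-element `c` with a hypothetical environment driving inputs `a` and `b`, and the helper names are invented for this example. It explores every interleaving of effective rules over one full cycle, checks stability along the way, and confirms that all interleavings end in the same state.

```python
# Toy production-rule executor. The rule set, signal names, and helper
# functions are illustrative assumptions, not taken from the slides.

# Each rule is (guard, node, value): "guard -> node+" sets 1, "node-" sets 0.
RULES = [
    (lambda s: not s["c"], "a", 1),                 # ~c -> a+   (environment)
    (lambda s: not s["c"], "b", 1),                 # ~c -> b+   (environment)
    (lambda s: s["c"], "a", 0),                     #  c -> a-   (environment)
    (lambda s: s["c"], "b", 0),                     #  c -> b-   (environment)
    (lambda s: s["a"] and s["b"], "c", 1),          # a & b   -> c+  (C-element)
    (lambda s: not s["a"] and not s["b"], "c", 0),  # ~a & ~b -> c-  (C-element)
]


def effective(state):
    """Rules whose guard holds and whose firing would change the state."""
    return [(n, v) for g, n, v in RULES if g(state) and state[n] != v]


def fire(state, node, value):
    """Return a copy of the state with one node driven to a new value."""
    new = dict(state)
    new[node] = value
    return new


def explore(state, steps):
    """Return (set of final states, number of paths) over all interleavings."""
    if steps == 0:
        return {tuple(sorted(state.items()))}, 1
    finals, paths = set(), 0
    enabled = effective(state)
    for node, value in enabled:
        nxt = fire(state, node, value)
        # Stability: firing one rule must not disable another pending one.
        still = effective(nxt)
        assert all(r in still for r in enabled if r != (node, value))
        f, p = explore(nxt, steps - 1)
        finals |= f
        paths += p
    return finals, paths


start = {"a": 0, "b": 0, "c": 0}
finals, paths = explore(start, 6)   # one full cycle is six transitions
print(paths, finals)
```

The inputs may rise and fall in either order, so there are several execution paths, yet the C-element only switches once both inputs agree, and every path returns to the starting state.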