Uncle – An RTL Approach to Asynchronous Design
Presenter: Chi-Chuan Chuang
Date: 2012.12.20

Outline
Introduction
◦ C-element
◦ Null convention logic (NCL)
◦ NCL asynchronous systems
UNCLE synthesis flow
◦ From RTL to gates
◦ Ack generation
◦ Net buffering
◦ Latch balancing
◦ Relaxation, cell merging
Comparisons
Conclusion

C-element
Commonly used asynchronous logic component
Hysteresis: the output changes only when all inputs agree and otherwise holds its previous value
Implementations
◦ Semi-static: state is held by two cross-coupled inverters
◦ Static: does not rely on feedback inverters
◦ Gate-level: depends on which gates are used

C-element (cont.)
Semi-static, static, and gate-level implementations

Null convention logic
Dual-rail
Delay-insensitive logic style
Based on threshold logic
Uses 27 fundamental threshold gates with 2 to 4 inputs
Gates have hysteresis (state-holding capability)

Null convention logic (cont.)
Definitions for a threshold gate
◦ set: the equation that determines the gate function
◦ hold1: all inputs ORed together
◦ reset: complement of hold1
◦ hold0: complement of set
Z = set + (Z⁻ ∙ hold1)
Z′ = reset + (Z⁻′ ∙ hold0)
where Z⁻ is the previous output value
Example: implementing TH23
set = AB + BC + AC
hold1 = A + B + C
reset = A′B′C′
hold0 = A′B′ + B′C′ + A′C′

Null convention logic (cont.)
Comparison between two types of dual-rail (DR) AND2

27 Basic NCL macros

NCL asynchronous systems
Data-driven approach
◦ Uses NCL gates for both registers and control
Control-driven approach
◦ Uses Balsa-style registers and control

Data-driven approach
Uses dual-rail latches with acknowledge signals ki and ko to control the datapath

Dual-rail latches
Signals and components
◦ C_0 = C-element with async reset to 0
◦ C_1 = C-element with async reset to 1
◦ t_d/f_d = dual-rail in
◦ t_q/f_q = dual-rail out
◦ ki = ackin
◦ ko = ackout
Types of latch
◦ drlatn
◦ drlatr
◦ drlats

Dual-rail latches (cont.)
drlatn, drlatr, drlats
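
The threshold-gate definition above (Z = set + Z⁻ ∙ hold1) can be checked with a small simulation. The sketch below is only an illustration, not part of the Uncle flow: the function names are hypothetical, and it models the TH23 gate as a pure next-state function with no timing. It also checks that the complemented form (Z′ = reset + Z⁻′ ∙ hold0) describes the same gate.

```python
from itertools import product

def th23_next(a, b, c, z_prev):
    """Next TH23 output from Z = set + (Z_prev AND hold1). Inputs are 0 or 1."""
    set_ = (a & b) | (b & c) | (a & c)   # at least 2 of the 3 inputs asserted
    hold1 = a | b | c                    # at least one input still asserted
    return set_ | (z_prev & hold1)

def th23_next_complement(a, b, c, z_prev):
    """Same gate via the complemented form Z' = reset + (Z_prev' AND hold0)."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    reset = na & nb & nc                         # complement of hold1
    hold0 = (na & nb) | (nb & nc) | (na & nc)    # complement of set
    z_not = reset | ((1 - z_prev) & hold0)
    return 1 - z_not

def both_forms_agree():
    """Exhaustively compare the two forms over all inputs and previous outputs."""
    return all(
        th23_next(a, b, c, z) == th23_next_complement(a, b, c, z)
        for a, b, c, z in product((0, 1), repeat=4)
    )

def run_sequence(inputs, z0=0):
    """Apply a sequence of (a, b, c) input tuples and record the output after each."""
    outputs, z = [], z0
    for a, b, c in inputs:
        z = th23_next(a, b, c, z)
        outputs.append(z)
    return outputs

# One input is not enough to set the gate; two inputs set it; the output then
# holds with only one input asserted (hysteresis) and resets when all are 0.
trace = run_sequence([(1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0)])
```

Here `trace` shows the state-holding behaviour: the third step keeps the output at 1 even though only one input remains asserted.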
Finite state machine (data-driven approach)
◦ The middle half-latch contains the initial data
◦ All ports and registers are read and written every cycle

Control-driven Approach
Registers with selective read/write
Control network is separate from the datapath
Read ports can easily be added to a register

UNCLE synthesis flow
Both data-driven and control-driven styles are supported
Lower-level synthesis tool
Uses Verilog as its input language

From RTL to Gates
RTL is transformed to a gate-level netlist using commercial synthesis tools
The target library read by the tool contains:
◦ AND2, XOR2, OR2, inverter
◦ D flip-flop (DFF), D latch (DLAT)
◦ Special gates (T-elements, S-elements…)
◦ Complex gates that have been mapped into NCL
Gates have unit delays for timing
Area is proportional to transistor count

Ack Generation
Data-driven
◦ Each latch receives an ack signal from each destination latch of its output
Control-driven
◦ Each control element receives an ack signal from each destination latch
A simple ack-merging algorithm:
◦ Any latches having at least one common destination have their ack networks merged
An ack checker step at the end of the flow checks ack network validity

Net Buffering
Timing data uses a non-linear delay model (NLDM)
The target transition time for signal nets, used for all examples in the paper, is approximately that of a 1X inverter driving four separate 4X inverter loads
Gate sizing
Builds a buffer tree with inverters

Latch Balancing
For the data-driven style: moves half-latches in the netlist to balance data delays against ack delays
Ack delay
◦ Depends on the number of destinations, which sets the completion network depth
Data delay
◦ Depends on the data logic complexity

Latch Balancing (cont.)
Generally results in more transistors, because the datapath width increases moving toward the source registers
◦ This requires more latches and a larger ack network
Implemented by an iterative heuristic algorithm
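
The simple ack-merging rule from the Ack Generation slide (latches with at least one common destination share an ack network) can be sketched with a union-find structure. This is a minimal illustration under stated assumptions, not Uncle's actual implementation: the function name, the latch names, and the input format (a mapping from each source latch to its set of destination latches) are hypothetical, and merging is assumed to be transitive.

```python
from collections import defaultdict

def merge_ack_groups(dests):
    """Group source latches whose ack networks would be merged.

    dests maps each source latch name to the set of its destination latches.
    Latches sharing at least one destination end up in the same group
    (assumed transitive: A~B and B~C puts A, B, C together).
    """
    parent = {latch: latch for latch in dests}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_source = {}  # destination -> first source latch seen driving it
    for latch, ds in dests.items():
        for d in ds:
            if d in first_source:
                union(first_source[d], latch)
            else:
                first_source[d] = latch

    groups = defaultdict(set)
    for latch in dests:
        groups[find(latch)].add(latch)
    # Sort groups by their members so the result is deterministic.
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical netlist: L1 and L2 both feed D2; L3 and L4 both feed D3.
example = {
    "L1": {"D1", "D2"},
    "L2": {"D2"},
    "L3": {"D3"},
    "L4": {"D3", "D4"},
}
merged = merge_ack_groups(example)
```

In this example `merged` contains two groups, one for L1 and L2 and one for L3 and L4, so two ack networks would be built instead of four.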
Latch Balancing (cont.)
Several sorting/pruning stages based on data, ack, and cycle delays find the latches most likely to improve performance if pushed
Chosen latches are pushed one gate level, and the affected ack networks are rebuilt
Latches that feed only primary outputs are ineligible

Latch Balancing (cont.)
Works well for FSMs
Has problems with linear pipelines if latches are pushed in one direction only

Relaxation and Cell Merging
Relaxation
◦ Looks for redundant paths from a primary input (PI) to a primary output (PO)
◦ Finds gates that do not have to be fully expanded to dual-rail versions but can be implemented by eager versions that require fewer transistors
Cell merging
◦ Adjacent gates with no fanout are merged into more complex gates
◦ Area-driven

Example RTL Statements

Comparison
GCD16 with different Uncle versions

Uncle ver.        DD      DD/NB   DD/LB/NB  CD      CD/NB
transistors       16192   16226   20128     8658    8662
  *               1.87    1.87    2.32      1.00    1.00
cyc. time (ns)    105.7   86.0    64.9      75.7    62.4
  *               1.69    1.38    1.04      1.21    1.00
energy (pJ)       32.4    35.3    49.7      10.2    10.8
  *               3.17    3.44    4.85      1.00    1.05

DD: data-driven, CD: control-driven, LB: latch-balanced, NB: net-buffered, *: ratio to best
Conditional port activity caused the data-driven designs to be large and slow
Latch balancing helped DD performance
Control-driven produced the best results

Comparison (cont.)
GCD16: Balsa vs. Uncle

                  Balsa   Uncle (CD/NB)
transistors       11455   8662
  *               1.32    1.00
cyc. time (ns)    85.2    62.4
  *               1.37    1.00
energy (pJ)       13.7    10.8
  *               1.27    1.00

Balsa used more read ports on registers, reducing loading but increasing transistor count
Net buffering helped offset the increased loading in the Uncle design and improved performance
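
The rows marked * in the comparison tables are ratios to the best (smallest) value in the row above. As a quick worked check, the sketch below recomputes them for the transistor and cycle-time rows of the GCD16 table; the helper name is hypothetical. Recomputing the energy row from the rounded figures can differ from the printed ratios in the last digit, presumably because the printed ratios come from unrounded data, so it is left out here.

```python
def ratio_to_best(values):
    """Divide each value by the smallest (best) one, rounded to two decimals."""
    best = min(values)
    return [round(v / best, 2) for v in values]

# Figures copied from the GCD16 table.
# Column order: DD, DD/NB, DD/LB/NB, CD, CD/NB.
gcd16 = {
    "transistors": [16192, 16226, 20128, 8658, 8662],
    "cyc. time (ns)": [105.7, 86.0, 64.9, 75.7, 62.4],
}

ratios = {metric: ratio_to_best(vals) for metric, vals in gcd16.items()}
```

With these inputs, `ratios["transistors"]` and `ratios["cyc. time (ns)"]` reproduce the corresponding * rows of the table.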
Viterbi decoder design
◦ Branch Metric Unit (BMU)
   Just combinational logic
   With a half-latch at the output for the UNCLE ack
◦ Path Metric Unit (PMU)
   A set of parallel accumulator-like registers, resulting in many parallel three-half-latch loops
◦ History Unit (HU)
   Has three 16-entry register files (4-bit, 2-bit, and 1-bit)
   An outer loop writes the registers and can conditionally trigger an inner while loop that contains register read/write operations and executes a variable number of iterations

Comparison (cont.)
Viterbi Branch Metric Unit comparison
◦ Combinational logic only

                  Balsa   Uncle (CD/NB)
transistors       9040    5338
  *               1.69    1.00
cyc. time (ns)    9.30    8.87
  *               1.05    1.00
energy (pJ)       2.33    1.35
  *               1.73    1.00

The Uncle version is just combinational logic with a half-latch on the output
The Balsa version used loop splitting to split the combinational logic into concurrent blocks, which increased the parallelism of internal computations at the cost of more transistors

Comparison (cont.)
Uncle's Viterbi Path Metric Unit (PMU)

Uncle ver.        DD/NB   DD/NB/LB  DD/NB/LB+  CD/NB
transistors       20184   21778     24561      18838
  *               1.07    1.16      1.30       1.00
cyc. time (ns)    13.4    13.4      6.9        13.3
  *               1.93    1.93      1.00       1.91
energy (pJ)       5.1     5.7       6.8        4.6
  *               1.12    1.24      1.48       1.00

LB+ = latch-balanced, with two sets of half-latches added to the RTL (one in the FSM loop and one on the output port)

Comparison (cont.)
Viterbi Path Metric Unit comparison

                  Balsa   Uncle (DD/NB/LB+)
transistors       38328   24561
  *               1.56    1.00
cyc. time (ns)    9.39    6.94
  *               1.35    1.00
energy (pJ)       9.73    6.81
  *               1.43    1.00

Comparison (cont.)
Viterbi History Unit comparison

                      Balsa   Uncle CD/NB   Uncle CD
transistors           21819   16471         16425
  *                   1.33    1.00          1.00
V1  cyc. time (ns)    10.8    6.8           8.4
      *               1.60    1.00          1.25
    energy (pJ)       1.34    1.17          1.07
      *               1.26    1.09          1.00
V2  cyc. time (ns)    230.7   161.3         192.0
      *               1.43    1.00          1.19
    energy (pJ)       25.4    19.6          18.7
      *               1.36    1.05          1.00

Comparison (cont.)
Viterbi decoder: Balsa vs. Uncle

                  Balsa   Uncle (DD/NB/LB+)
transistors       71370   46752
  *               1.53    1.00
cyc. time (ns)    22.0    17.3
  *               1.27    1.00
energy (pJ)       15.0    10.5
  *               1.43    1.00

The Uncle decoder uses the DD/NB/LB+ PMU RTL

Comparison (cont.)

                         Balsa                                      Uncle
Combinational synthesis  Yes                                        Yes
Control synthesis        Yes                                        Data-driven only
Logic style              Different dual-rail styles, bundled data   NCL only
Behavioral simulation    Yes                                        Limited
Area optimizations       No                                         Relaxation, limited cell merging, ack sharing
Timing model             Fixed delay                                NLDM

Uncle also offers: an RTL style that allows area/performance tradeoffs, latch balancing, net buffering

Conclusion
Uncle requires more effort from the designer than Balsa, but can produce a higher-quality design
If the goal is performance of an always-active module, the data-driven style is the better choice
The control-driven style is better for modules with conditional port activity

Appendix: Teak
Teak is a successor toolset to Balsa that uses a data-driven style
One of Teak's goals is to automatically insert latch stages and balance delays for optimum throughput
Teak is a fairly new tool with only one public release

Reference
Uncle – An RTL Approach to Asynchronous Design
ASYNC12 PowerPoint about Uncle – An RTL Approach to Asynchronous Design
Design of Asynchronous Circuits Using Synchronous CAD Tools
Optimization of NULL convention self-timed circuits