Uncle – An RTL Approach to Asynchronous Design

Presenter: Chi-Chuan Chuang
Date: 2012.12.20

Outline
• Introduction
  ◦ C-element
  ◦ Null convention logic (NCL)
  ◦ NCL asynchronous systems
• UNCLE synthesis flow
  ◦ From RTL to gates
  ◦ Ack generation
  ◦ Net buffering
  ◦ Latch balancing
  ◦ Relaxation, cell merging
• Comparisons
• Conclusion

C-element
• Commonly used asynchronous logic component
• Hysteresis: the output changes only when all inputs agree; otherwise it holds its previous value
• Implementations
  ◦ Semi-static: with two cross-coupled inverters
  ◦ Static: doesn't rely on feedback inverters
  ◦ Gate-level: depends on which gates are used

C-element (cont.)
• Semi-static

C-element (cont.)
• Static
• Gate-level

Null convention logic
• Dual-rail
• Delay-insensitive logic style
• Based on threshold logic
• Uses 27 fundamental threshold gates with 2 to 4 inputs
• Hysteresis gives the gates state-holding capability

Null convention logic (cont.)
• Definitions for a threshold gate
  ◦ set: the equation that determines the gate function
  ◦ hold1: all inputs ORed together
  ◦ reset: complement of hold1
  ◦ hold0: complement of set
• Output equations (Z⁻ is the previous output value)
  ◦ Z = set + (Z⁻ ∙ hold1)
  ◦ Z′ = reset + (Z⁻′ ∙ hold0)
• Example: implementing TH23
  ◦ set = AB + BC + AC
  ◦ hold1 = A + B + C
  ◦ reset = A′B′C′
  ◦ hold0 = A′B′ + B′C′ + A′C′

Null convention logic (cont.)
• Comparison between two types of dual-rail (DR) AND2

27 Basic NCL macros

NCL asynchronous systems
• Data-driven approach
  ◦ Uses NCL gates for both registers and control
• Control-driven approach
  ◦ Uses Balsa-style registers and control

Data-driven approach
• Uses dual-rail latches with acknowledge signals ki and ko to control the datapath

Dual-rail latches
• Dual-rail latches
  ◦ C_0 = C-element with async reset to 0
  ◦ C_1 = C-element with async reset to 1
  ◦ t_d/f_d = dual-rail in
  ◦ t_q/f_q = dual-rail out
  ◦ ko = ackout
  ◦ ki = ackin
• Types of latch
  ◦ drlatn
  ◦ drlatr
  ◦ drlats

Dual-rail latches (cont.)
• drlatn
• drlatr
• drlats

Data-driven approach (cont.)
• Finite state machine
  ◦ The middle half-latch contains the initial data
  ◦ All ports and registers are read and written every cycle

Control-driven approach
• Registers with selective read/write
• The control network is separate from the datapath
• Read ports can easily be added to a register

UNCLE synthesis flow
• Both data-driven and control-driven styles are supported
• Lower-level synthesis tool
• Verilog as its input language

From RTL to Gates
• RTL is transformed to a gate-level netlist using commercial synthesis tools
• The target library read by the tool contains:
  ◦ AND2, XOR2, OR2, inverter
  ◦ D-flip-flop (DFF), D-latch (DLAT)
  ◦ Special gates (T-elements, S-elements…)
  ◦ Complex gates that have been mapped into NCL
• Gates have unit delays for timing
• Area is proportional to transistor count

Ack Generation
• Data-driven
  ◦ Each latch receives an ack signal from each destination latch of its output
• Control-driven
  ◦ Each control element receives an ack signal from each destination latch
• A simple ack merging algorithm:
  ◦ Any latches having at least one common destination have their ack networks merged
• An ack checker step at the end of the flow checks ack network validity

Net Buffering
• Timing data uses a non-linear delay model (NLDM)
• The signal net target transition time used for all examples in this paper is approximately equivalent to a 1X inverter driving four separate 4X inverter loads
• Gate sizing
• Builds a buffer tree with inverters

Latch Balancing
• An optimization for the data-driven style that moves half-latches in the netlist to balance data delays against ack delays
• Ack delay
  ◦ Depends on the number of destinations, which sets the completion network depth
• Data delay
  ◦ Depends on the data logic complexity
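The ack merging rule from the Ack Generation slide (latches with at least one common destination share an ack network) can be sketched as a union-find grouping. This is only an illustration, not Uncle's actual implementation: the function name `merge_ack_groups`, the dictionary-based netlist summary, and the latch names are hypothetical, and the sketch assumes that merging is applied transitively.

```python
from collections import defaultdict


def merge_ack_groups(destinations):
    """Group source latches whose destination sets overlap.

    destinations: dict mapping a source latch name to the set of
    destination latch names it drives (a hypothetical netlist summary).
    Returns a list of sets; each set shares one merged ack network.
    """
    parent = {src: src for src in destinations}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Two sources that drive the same destination get merged.
    first_source_for = {}
    for src, dests in destinations.items():
        for dest in dests:
            if dest in first_source_for:
                union(first_source_for[dest], src)
            else:
                first_source_for[dest] = src

    groups = defaultdict(set)
    for src in destinations:
        groups[find(src)].add(src)
    return sorted(groups.values(), key=lambda g: sorted(g))


# r0 and r1 both drive r3, so their ack networks merge; r4 stays alone.
example = {"r0": {"r2", "r3"}, "r1": {"r3"}, "r4": {"r5"}}
merged = merge_ack_groups(example)
```

Larger merged groups mean fewer but deeper completion networks, which is one reason ack delay enters the latch balancing tradeoff.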
Latch Balancing (cont.)
• Generally results in more transistors, because the datapath width increases when moving towards the source registers
• This requires more latches, with an increase in the ack network size
• Implemented by an iterative heuristic algorithm

Latch Balancing (cont.)
• Several sorting/pruning stages based on data/ack/cycle delays are used to find the latches that are most likely to improve performance if pushed
• Chosen latches are pushed one gate level, and affected ack networks are rebuilt
• Latches that feed only primary outputs are ineligible

Latch Balancing (cont.)
• Works appropriately for FSMs
• Has problems with linear pipelines if latches are pushed in one direction only

Relaxation and Cell Merging
• Relaxation is a technique that
  ◦ Looks for redundant paths from a primary input (PI) to a primary output (PO)
  ◦ Finds gates that don't have to be fully expanded to dual-rail versions, but can be implemented by eager versions that require fewer transistors
• Cell merging
  ◦ Adjacent gates with no fanout are merged into more complex gates
  ◦ Area-driven

Example RTL Statements

Comparison
• GCD16 with different Uncle versions

Uncle ver.        DD       DD/NB    DD/LB/NB   CD       CD/NB
transistors       16192    16226    20128      8658     8662
*                 1.87     1.87     2.32       1.00     1.00
cyc. time (ns)    105.7    86.0     64.9       75.7     62.4
*                 1.69     1.38     1.04       1.21     1.00
energy (pJ)       32.4     35.3     49.7       10.2     10.8
*                 3.17     3.44     4.85       1.00     1.05

DD: data-driven, CD: control-driven, LB: latch balanced, NB: net buffered, *: ratio to best

• Conditional port activity caused the data-driven designs to be large and slow
• Latch balancing helped DD performance
• Control-driven produced the best results

Comparison (cont.)
• GCD16, Uncle versus Balsa

                  Balsa    Uncle (CD/NB)
transistors       11455    8662
*                 1.32     1.00
cyc. time (ns)    85.2     62.4
*                 1.37     1.00
energy (pJ)       13.7     10.8
*                 1.27     1.00

• Balsa used more read ports on registers, reducing loading but increasing transistor count
• Net buffering helped offset the increased loading in the Uncle design and improved performance

Comparison (cont.)
• Viterbi decoder design
  ◦ Branch Metric Unit (BMU)
    - Combinational logic only
    - With a half-latch at the output for the UNCLE ack
  ◦ Path Metric Unit (PMU)
    - A set of parallel accumulator-like registers, resulting in many parallel three-half-latch loops
  ◦ History Unit (HU)
    - Has three 16-entry register files (4-bit, 2-bit, and 1-bit)
    - An outer loop writes the registers and can conditionally trigger an inner while loop that contains register read/write operations and executes a variable number of iterations

Comparison (cont.)
• Viterbi Branch Metric Unit comparison
  ◦ Combinational logic only

                  Balsa    Uncle (CD/NB)
transistors       9040     5338
*                 1.69     1.00
cyc. time (ns)    9.30     8.87
*                 1.05     1.00
energy (pJ)       2.33     1.35
*                 1.73     1.00

• The Uncle version is just combinational logic with a half-latch on the output
• The Balsa version used loop splitting to split the combinational logic into concurrent blocks, which increased the parallelism of internal computations at the cost of more transistors

Comparison (cont.)
• Uncle's Viterbi Path Metric Unit (PMU)

Uncle ver.        DD/NB    DD/NB/LB   DD/NB/LB+   CD/NB
transistors       20184    21778      24561       18838
*                 1.07     1.16       1.30        1.00
cyc. time (ns)    13.4     13.4       6.9         13.3
*                 1.93     1.93       1.00        1.91
energy (pJ)       5.1      5.7        6.8         4.6
*                 1.12     1.24       1.48        1.00

LB+: latch-balanced, with two sets of half-latches added to the RTL (one in the FSM loop and one on the output port)

Comparison (cont.)
• Viterbi Path Metric Unit comparison

                  Balsa    Uncle (DD/NB/LB+)
transistors       38328    24561
*                 1.56     1.00
cyc. time (ns)    9.39     6.94
*                 1.35     1.00
energy (pJ)       9.73     6.81
*                 1.43     1.00

Comparison (cont.)
• Viterbi History Unit comparison

                      Balsa    Uncle CD/NB   Uncle CD
transistors           21819    16471         16425
*                     1.33     1.00          1.00
V1 cyc. time (ns)     10.8     6.8           8.4
*                     1.60     1.00          1.25
V1 energy (pJ)        1.34     1.17          1.07
*                     1.26     1.09          1.00
V2 cyc. time (ns)     230.7    161.3         192.0
*                     1.43     1.00          1.19
V2 energy (pJ)        25.4     19.6          18.7
*                     1.36     1.05          1.00

Comparison (cont.)
• Viterbi decoder comparison between Balsa and Uncle

                  Balsa    Uncle (DD/NB/LB+)
transistors       71370    46752
*                 1.53     1.00
cyc. time (ns)    22.0     17.3
*                 1.27     1.00
energy (pJ)       15.0     10.5
*                 1.43     1.00

• The Uncle decoder uses the DD/NB/LB+ PMU RTL

Comparison (cont.)

                          Balsa                                       Uncle
Combinational synthesis   Yes                                         Yes
Control synthesis         Yes                                         Data-driven only
Logic style               Different dual-rail styles, bundled data    NCL only
Behavioral simulation     Yes                                         Limited
Area optimizations        No                                          Relaxation, limited cell merging, ack sharing
Timing model              Fixed delay                                 NLDM

• In Uncle, the RTL style allows area/performance tradeoffs, latch balancing, and net buffering

Conclusion
• Uncle requires more effort from the designer than Balsa, but can produce a higher-quality design
• If the goal is performance for an always-active module, the data-driven style is better
• The control-driven style is better for modules with conditional port activity

Appendix: Teak
• Teak is a successor toolset to Balsa that uses a data-driven style
• One of Teak's goals is to automatically insert latch stages and balance delays for optimum throughput
• Teak is a fairly new tool with only one public release

References
• Uncle – An RTL Approach to Asynchronous Design
• ASYNC12 PowerPoint about Uncle – An RTL Approach To Asynchronous Design
• Design of Asynchronous Circuits Using Synchronous CAD Tools
• Optimization of NULL convention self-timed circuits
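Backup example: threshold gate with hysteresis

As a worked illustration of the threshold-gate definitions from the Null convention logic slides, the following behavioral sketch models an unweighted THmn gate using Z = set + (Z⁻ ∙ hold1). The class name `ThresholdGate` and its interface are hypothetical, chosen only for this example; the sketch assumes an unweighted gate, so set is true when at least m inputs are 1.

```python
class ThresholdGate:
    """Behavioral model of an unweighted NCL THmn gate (illustrative only)."""

    def __init__(self, threshold, n_inputs):
        self.threshold = threshold  # m: number of asserted inputs needed to set
        self.n_inputs = n_inputs    # n: total number of inputs
        self.z = 0                  # previous output Z-, starting at 0

    def update(self, *inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        set_ = sum(inputs) >= self.threshold  # e.g. AB + BC + AC for TH23
        hold1 = any(inputs)                   # all inputs ORed together
        # Z = set + (Z- . hold1): assert on set, hold until all inputs are 0
        self.z = int(set_ or (self.z and hold1))
        return self.z


th23 = ThresholdGate(threshold=2, n_inputs=3)
vectors = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 0, 0)]
trace = [th23.update(*v) for v in vectors]
# trace == [0, 0, 1, 1, 0]: the output stays at 1 after one input drops
# and returns to 0 only once every input is 0.
```

With threshold equal to the number of inputs (for example `ThresholdGate(2, 2)`), the same model behaves as a C-element.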
