Circuit Design for SRCMOS Asynchronous Wave Pipelines
Oliver Hauck
Integrated Circuits and Systems Lab
Departments of Computer Science and Electrical Engineering
Darmstadt University of Technology

Outline
- Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP)
- Comparison: AWPs vs. sync, async, and sync wave pipes
- AWP Circuit Design
- Conclusion

Pipelining
- Pipelining is used as the premier technique to better exploit hardware and boost performance of VLSI chips
- Clocking overhead presents a serious threat for deeply pipelined systems built upon sub-micron CMOS processes running at GHz frequencies

General Framework for Pipelines
[Figure: Data passes through an input Latch/Reg, the Logic, and an output Latch/Reg; Clk reaches the input and output registers with skews $\Delta_i$ and $\Delta_o$]

Some Notations...
- $G$: set of all gate output nodes in logic
- $t_{min}$, $t_{max}$: minimum and maximum logic delay
- $t_{min}(i)$, $t_{max}(i)$: minimum and maximum logic delay from input to internal node $i \in G$
- $t_{stable}(i)$: minimum time internal node $i \in G$ has to be stable
- $T_{clk}$: clock period or cycle time
- $\Delta_i$, $\Delta_o$: intentional skew at input and output registers
- $\Delta = \Delta_o - \Delta_i$: delay between input and output clock
- $t_{skew}$: uncontrolled clock skew at register
- $t_d$: propagation delay of register
- $t_{setup}$: set-up time of register
- $t_{hold}$: hold time of register

General Relations
- Data is latched by the output clock at time
  $t = k\,T_{clk} + \Delta_o$   (1)
- $k$ is called "global clock latency"; it equals the number of clocks at the output before data
- Lower bound:
  $t \ge \Delta_i + t_{max} + t_d + t_{setup} + t_{skew}$   (2)
- Upper bound:
  $t \le T_{clk} + \Delta_i + t_{min} + t_d - t_{hold} - t_{skew}$   (3)
- Combining (2) and (3):
  $t_{max} + t_{setup} + t_{skew} \le k\,T_{clk} + \Delta - t_d \le T_{clk} + t_{min} - t_{hold} - t_{skew}$   (4)
- By transitivity, (4) implies:
  $T_{clk} \ge (t_{max} - t_{min}) + t_{setup} + t_{hold} + 2\,t_{skew}$   (5)
- I.e., cycle time is bounded by delay variation, register overhead, and clock skew
- Similarly, the minimum pulse width has to be respected for all $i \in G$:
  $T_{clk} \ge (t_{max}(i) - t_{min}(i)) + t_{stable}(i) + t_{skew}$   (6)

Synchronous Pipeline
[Figure: Data, Latch/Reg, Logic, Latch/Reg; Clk]
- Throughput determined by longest logic path: $k = 1$, $\Delta = 0$,
  $T_{clk} \ge t_{max} + t_d + t_{setup} + t_{skew}$
- Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead
- Negative side-effects of gate-level pipelining:
  - Increased latency, clock load/skew, power, area, design time
  - More area for clocks and registers than for logic
- Implementation options:
  - Register- vs. latch-based, explicit latches vs. latchless
  - TSPC vs. local clocking derived from global clock
  - Static vs. dynamic, single-ended vs. dual-rail

Asynchronous Pipeline
[Figure: Data, Handshake, Logic, Handshake; signals req_in, ack_in, req_out, ack_out]
- Synchronous clock replaced by asynchronous handshaking
- As with fine-grain sync pipelines, throughput can be high; distributed handshake causes high latency
- Operation is data dependent, saves power during idle
- Plug & Play composability
- Elastic operation: input and output rate may differ momentarily, and the pipeline can buffer and stall at will
- Implementation options:
  - 4-phase (level) vs. 2-phase (event) protocol on req and ack lines
  - Bundled data (matched delay) vs. completion detection
- Micropipeline (Sutherland 1989); used by Furber's group (Manchester U) for AMULET1/2/3

Synchronous Wave Pipeline
[Figure: Data, Latch/Reg, Wave Logic, Latch/Reg; Clk; data waves 1 and 2 in flight in the logic]
- Several data waves simultaneously active in the logic
- For $\Delta = 0$ and $k > 1$, relation (4) gives a double-sided constraint:
  $\frac{t_{max} + t_d + t_{setup} + t_{skew}}{k} \le T_{clk} \le \frac{t_{min} + t_d - t_{hold} - t_{skew}}{k - 1}$
- Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area, and power
- Logic has to minimize delay variations over P, T, V corners
- Global clock used with constructive skew to adjust phases
- However, tuning the logic and the delay elements is difficult

Wave Pipelining: A Short Outline
- Wave pipelining occurs when combinational logic is clocked faster than latency would allow
- Several data waves are then active in the logic without being separated by storage elements
- Latency remains constant and throughput is determined by delay differences rather than absolute delay
- Requirement for delay balanced logic and complicated timing are the main hurdles

Wave Pipelining: A Little History
- Technique stems from the 60s and has had a reputation for being exotic ever since
- Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU
- Some working academic chips exist, mainly datapath
- Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know

Asynchronous Wave Pipeline (AWP)
[Figure: Data, Wave Latch, Wave Logic, Wave Latch; req_in passes through a matched delay to req_out]
- Data words associated with events on request line
- Several data waves and protocol events simultaneously active in the logic and the matched delay element, respectively
- AWP is a special case of the sync wave pipeline with the constructive skew set to worst-case logic delay: $k = 0$,
  $\Delta \ge t_{max} + t_d + t_{setup} + t_{skew}$
  $T_{clk} \ge \Delta - t_{min} - t_d + t_{hold} + t_{skew}$
- It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners

AWPs vs. Synchronous Pipelines
- No global clock; instead a local clock (request) is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with an event on request
- Many pipeline registers removed, thus requirements on the clock (request) relaxed
- Synchronous pipelines can reach the throughput of AWPs only with excessive cost in area, power, and latency

AWPs vs. Asynchronous Pipelines
- AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead
- AWPs are not elastic: data at output has to be consumed
- AWPs eliminate hazards as a side-effect of delay balancing
- AWPs have in common with other async methodologies: data dependent operation (avoids redundant transitions), composability (though inelastic), no global clock

AWPs vs. Synchronous Wave Pipelines
AWPs tackle two main difficulties in sync wave pipes:
- Replacing the constructive skew by worst-case delay removes the double-sided timing constraint, i.e., in contrast to sync wave pipes, AWPs operate at any rate below the maximum
- Using dynamic self-resetting logic controls delay variation and doesn't impact latency much

Wave Pipelining Combinational Logic
- Overall goal: keep data wave coherent under all possible conditions (data, PTV)
- Desirable architecture features:
  - most logic paths have same depth
  - fanin/fanout the same everywhere
- First step: pad all short paths to maximum length

Example: 64-b Brent-Kung Parallel Adder
[Figure: adder tree with logic columns 0 to 4, built from pg, PG, G, and xor cells plus buffers]
- Buffers provide for same depth on every logic path
- All gates in the same column must have the same delay

Circuits
- Logic style used has to minimize delay variation
- Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream
- Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed
- Pass transistor logic gives sloppy edges, thereby introducing delay variation
- Dynamic logic is attractive as only the output high transition is data-dependent; output pulldown is done by precharge

Circuits (cont.)
- Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge
- What is needed is a dynamic logic family without precharge overhead: SRCMOS
- Work done at IBM: classic paper by Chappell et al., "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC (26), 11, 1991; or, more recently, Hwang et al., "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC (34), 8, 1999

SRCMOS
[Figure: SRCMOS gate with inputs, NMOS tree N, and output]
- Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced

Operation of a 2-AND
[Figure: operation of a 2-input AND gate]

Delay Balancing at Transistor Level
- NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices
- Short paths are padded with dummy devices
- Delay variation is minimal when exactly one path is on, i.e., wide fanin ORs are hard to use
- Every output has to see the same load
- Lightly loaded outputs are given dummy cap

Example: Carry tree in a 64-bit adder
$G_{im} = G_{lm} + P_{lm}\,(G_{kl} + P_{kl}\,(G_{jk} + P_{jk}\,G_{ij}))$

Gim Layout
[Figure: layout of the Gim cell]

Simulation of Gim cell
[Figure: simulated output pulses of the Gim cell]
- Pulses of the 4 possible input situations giving '1' at the output are tightly matched
- Note: $P_{xy} = G_{xy} = 1$ never occurs in this case

First Pulse Problem
[Figure: first pulse problem]

Miller Effect
[Figure: Miller effect]

64-bit Adder Output Waveforms
[Figure: adder output waveforms with the latching window marked]

Transistor Sizing
[Figure: SRCMOS gate with inputs, NMOS tree N, and output, annotated with Wprecharge, Wkeeper, Wpd, Cfeedback, Cload, and Cdrive]
- Wpd / Cdrive = const
- Cdrive / (Cload + Cfeedback + Wkeeper) = const
- Cfeedback / Wprecharge = const
- Wprecharge / Cdrive = const
- Result: LINEAR SIZING

Interconnect: Resistive Effects
- 0.9 µm x 900 µm MET2 parasitics: C = 116 fF, R = 70 Ohms
[Figure: waveforms for the wire models "C only", "R/3, R/3, R/3", "R/2, R/2", and "RC only"]

Interconnect: Coupling Effects
- 2 adjacent MET2 lines coupled by C = 54 fF
[Figure: coupling waveforms]

PTV Variations
- SRCMOS provides some robustness by generating fresh pulses at every gate output
- Pulsed operation reduces data dependency, coupling
- PTV noise is not critical when drift is in the same direction across die
- Critical are: temperature gradient, supply drop, and local variations
- What is needed: rule of thumb like "For process X, to be on the safe side, keep area between two latches < Y sqmm"

Conclusion
- AWPs presented as an alternative approach to high-speed design; they show potential for GHz throughput without clocks
- AWPs avoid some problems of conventional wave pipes and (a)synchronous systems
- 64b adder + test circuit and EC crypto layout in the making
- Not covered here: feedback + controllers
- To do: support transistor sizing
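
Worked example for the General Relations: the bounds (4) and (5) and the AWP case k = 0 can be checked numerically. The following is a minimal sketch, assuming all times are given in one common unit (picoseconds here). The function names and the numbers are illustrative assumptions, not values from the slides.

```python
def min_cycle_time(t_max, t_min, t_setup, t_hold, t_skew):
    """Relation (5): lower bound on T_clk set by delay variation,
    register overhead and clock skew."""
    return (t_max - t_min) + t_setup + t_hold + 2 * t_skew


def timing_ok(k, t_clk, delta, t_max, t_min, t_d, t_setup, t_hold, t_skew):
    """Relation (4): check both sides of the constraint for a given
    global clock latency k and input-to-output clock delay delta."""
    latch_point = k * t_clk + delta - t_d
    lower_ok = t_max + t_setup + t_skew <= latch_point
    upper_ok = latch_point <= t_clk + t_min - t_hold - t_skew
    return lower_ok and upper_ok


def awp_matched_delay(t_max, t_d, t_setup, t_skew):
    """AWP case (k = 0): smallest matched delay that still meets the
    lower bound, i.e. the worst-case logic delay plus overhead."""
    return t_max + t_d + t_setup + t_skew


# Hypothetical stage, all values in picoseconds (assumed, not measured).
stage = dict(t_max=1000, t_min=800, t_d=100, t_setup=50, t_hold=50, t_skew=25)

# Cycle time bound from delay variation only: 200 + 50 + 50 + 2*25 = 350
t_clk_awp = min_cycle_time(stage["t_max"], stage["t_min"],
                           stage["t_setup"], stage["t_hold"], stage["t_skew"])

# Matched delay for the AWP: 1000 + 100 + 50 + 25 = 1175
delta_awp = awp_matched_delay(stage["t_max"], stage["t_d"],
                              stage["t_setup"], stage["t_skew"])

# The AWP (k = 0) meets relation (4) at the 350 ps cycle time.
awp_ok = timing_ok(0, t_clk_awp, delta_awp, **stage)

# A synchronous pipeline (k = 1, delta = 0) fails at the same cycle time
# and needs T_clk >= t_max + t_d + t_setup + t_skew = 1175 instead.
sync_ok_fast = timing_ok(1, t_clk_awp, 0, **stage)
sync_ok_slow = timing_ok(1, 1175, 0, **stage)
```

With these assumed numbers the AWP cycle time is limited by the 200 ps delay spread plus register overhead, while the synchronous stage is limited by the full 1000 ps logic delay. This is the point made on the wave pipelining slides: throughput depends on delay differences rather than absolute delay.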