Lecture 8: Latch and Flip Flop Design
Slides originally from: Vladimir Stojanovic & Vojin G. Oklobdzija
Computer Systems Laboratory, Stanford University
horowitz@stanford.edu
4/24/02, EE371

Outline
• Recent interest in latches and flip-flops
• Timing and Power metrics
• Design and optimization tradeoffs
• Master-slave vs. Pulse-triggered Latch
• Representative designs
• Comparison

Recent Interest in Flip-Flops
• Trends in high-performance systems
  → Higher clock frequency
  → More transistors on chip
• Consequences
  → Increased flip-flop overhead relative to cycle time
    • Cycle time 10 - 20 FO4 delays, flop overhead 2 - 4 FO4
  → Difficult to control both edges of the clock
  → Higher impact of clock skew
  → Higher crosstalk and substrate coupling
  → Higher power consumption
    • expensive packages and cooling systems
    • limit in performance
  → Clock burns up to 40%, flops up to 20% of total power

Requirements in the Flip-Flop Design
• Small Clk-Output delay, Narrow sampling window
• Low power
• Small clock load
• High driving capability (increased levels of parallelism)
  → Typical flip-flop load in a 0.18µm CMOS ranges from 50fF to over 200fF, with typical values of 100-150fF in critical paths
• Integration of logic into the flop
• Multiplexed or clock scan
• Crosstalk insensitivity - dynamic/high impedance nodes are affected

Flip-Flop Delay
• Sum of setup time and Clk-output delay is the only true measure of the performance with respect to the system speed
• T = TClk-Q + TLogic + Tsetup + Tskew
[Figure: two flip-flops (D, Q) clocked by Clk with a logic block between them, annotated with TClk-Q, TLogic and TSetup.]

Delay vs.
Setup/Hold Times
[Figure: Clk-Output delay [ps] versus Data-Clk offset [ps], marking the setup and hold times, the sampling window and the minimum Data-Output delay.]

Timing parameters, details
[Figure: D-Q and Clk-Q delay (time [ps]) versus D-Clk delay [ps], showing the stable Clk-Q region, the unstable Clk-Q region, the failure region, the optimum setup time and the minimum D-Q point.]
The best point to pick on the delay curve is minimum D-Q.

Types of State-Elements
[Figure: block diagrams of a master-slave latch pair (L1, L2) and a pulse-triggered latch (L), each with Data and Clk inputs.]

Master-Slave Latches
• Positive setup times
• Two clock phases:
  → distributed globally
  → generated locally
• Small penalty in delay for incorporating MUX
• Some circuit tricks needed to reduce the overall delay

T-G Master-Slave Latch
• PowerPC 603 (Gerosa, JSSC 12/94)
[Figure: transmission-gate master-slave latch schematic (D, Q, Clk, Clkb).]

T-G Master-Slave Latch
• Low power feedback
• Unbuffered input
  → input capacitance depends on the phase of the clock
  → over-shoot and under-shoot with long routes
  → wirelength must be restricted at the input
• Clock load is high
• Low power
• Small clk-output delay, but positive setup
• Easily embedded scan or mux

C2MOS MS Latches
Y. Suzuki, “Clocked CMOS Calculator Circuitry”, IEEE J. Solid-State Circuits, Dec. 1973
[Figure: C2MOS master-slave latch schematics (D, Q, Ck, Ckb).]
• Low power feedback
• Locally generated second phase
• Poor driving capability
• Robustness to clock slope

Single-Transistor-Clocked MS latches
[Figure: DSTC and SSTC latch schematics.]
• Yuan and Svensson, JSSC Jan.
‘97
• Ratioed DCVS and SRPL based designs
• Relatively small clock load
• Very sensitive to input glitching
• Capacitive coupling and charge sharing related speed and power problems

Pulse-Triggered Latches
• First stage is a pulse generator
  → generates a pulse (glitch) on a rising edge of the clock
• Second stage is a latch
  → captures the pulse generated in the first stage
• Pulse generation results in a negative setup time
• Frequently exhibit a soft edge property
• Must check for hold time violations
Note: power is always consumed in the clocked pulse generator

Hybrid Latch Flip-Flop (H. Partovi, ISSCC’96)
[Figure: HLFF schematic with pulse generator and second-stage latch, and the signal at node X for D=0 and D=1.]

HLFF: pulse generation
[Figure: HLFF schematic highlighting the pulse generator, the keepers and the second-stage latch, with the signal at node X for D=0 and D=1.]

HLFF Operation
• 1-0 and 0-1 transitions at the input with 0ps setup time

Hybrid Latch Flip-Flop
Skew absorption
Partovi et al, ISSCC’96

Hybrid Latch Flip-Flop
• Flip-flop features:
  → single phase clock
  → edge triggered, on one clock edge
• Latch features: Soft clock edge property
  → brief transparency, equal to 3 inverter delays
  → negative setup time
  → allows slack passing
  → absorbs skew
• Hold time is comparable to HLFF delay
  → minimum delay between flip-flops must be controlled
• Fully static
• Possible to incorporate logic

Semi-Dynamic Flip-Flop (SDFF)
• Sun UltraSparc III, Klass, VLSI Circuits’98
[Figure: SDFF schematic (D, Clk, Q).]
• Soft edge conditioned by data since first stage is precharged - cross-coupled latch is added for robustness
• Small penalty for adding logic
• Latch has one transistor less in stack - faster than HLFF, but 1-1 glitch exists

Sense-amplifier-based flip-flop
Madden & Bowhill, 1990, Matsui et al. 1994.
DEC Alpha 21264, StrongARM 110
• First stage is a sense amplifier
• On the rising clock edge, a monotonic transition on S_b or R_b triggers the S-R latch
• Cross-coupled NAND latch is the speed bottleneck
• Big power savings in reduced swing designs
• Nice interface to/from domino logic

Modified Sense Amplifier-Based Flip-Flop
• The first stage is the unchanged sense amplifier
• Second stage is sized to provide maximum switching speed
• Driver transistors are large
• Keeper transistors are small and disengaged during transitions
Nikolic & Stojanovic, ISSCC ‘99

Modified Sense Amplifier-Based Flip-Flop
• Delay of each of the outputs is independent of the load on the other output
• Delay of Q and Qb is symmetrical, as opposed to the NAND based design
• Convenient for dual rail logic; driving strength for standard CMOS is effectively doubled
• SAFF presents a small clock load, small setup time and all the advantages of the original design
• Possible tradeoff between speed and robustness to crosstalk

K-6 Dual-Rail ETL
• Self-reset property
  → increases dynamic power
  → drives domino logic
• Precharge increases speed
• Very fast but burns a lot of power
• Small clock load
[Figure: K6 dual-rail ETL schematic (inputs Clk, D).]

Power and Delay Definitions
• All power related to the SE can be divided into:
  → Input power
    • Data power (PD)
    • Clock power (PCLK)
  → Internal power (PINT)
  → Load power (PLOAD)
• PLOAD can be merged into PINT
• Internal power is a function of
  → data activity ratio (α): number of captured data transitions with respect to number of clock transitions (αmax=100%)
    • no activity (0000… and 1111…)
    • maximum activity (0101010..)
    • average activity (random sequence)
  → Glitching activity
• Ptot = Pinternal + Σ Pdriver, summed over the driver inputs (D, CLK)
• Delay is the minimum D-Q, i.e. Clk-Q + setup time
[Figure: state element with inputs D and CLK and outputs Q and Qb, annotated with PD, PCLK, PINT and PLOAD.]

State Element Performance Metrics
It is always possible to trade power for speed
Common metrics:
• Power-Delay Product (PDP)
  → Misleading measure
  → Good only if measured at constant frequency (then equivalent to EDP)
• Energy-Delay Product (EDP)
  → More accurate measure (Gonzalez & Horowitz)
• Energy-Delay² Product (ED²P)
  → A new measure, being justified by new results (Hofstee, Nowka, IBM)

Design & optimization tradeoffs
• Opposite Goals
  → Minimal Total power consumption
  → Minimal Delay
• Power-Delay tradeoff
• Minimize Power-Delay product (PDPtot) @ f=const.
[Figure: three plots of PDPtot [fJ] versus total power [µW], transistor width [µm] and delay [ps], each with its optimum ("Opt.") marked.]
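
The three metrics listed under State Element Performance Metrics can be compared with a few lines of Python. This is only a sketch: the two designs and their power and delay numbers are hypothetical, the function names are illustrative, and energy per cycle is assumed to be E = P / f at a fixed clock frequency.

```python
# Sketch of the PDP, EDP and ED^2P metrics for two hypothetical designs.
# All numbers below are made up for illustration; they are not from the slides.

def pdp_fj(power_uw: float, delay_ps: float) -> float:
    """Power-delay product in fJ (1 uW * 1 ps = 1e-18 J = 1e-3 fJ)."""
    return power_uw * delay_ps * 1e-3

def energy_fj(power_uw: float, clock_mhz: float) -> float:
    """Energy per clock cycle in fJ, assuming E = P / f (1 uW / 1 MHz = 1 pJ)."""
    return power_uw / clock_mhz * 1e3

def edp(power_uw: float, delay_ps: float, clock_mhz: float) -> float:
    """Energy-delay product in fJ*ps."""
    return energy_fj(power_uw, clock_mhz) * delay_ps

def ed2p(power_uw: float, delay_ps: float, clock_mhz: float) -> float:
    """Energy-delay^2 product in fJ*ps^2."""
    return energy_fj(power_uw, clock_mhz) * delay_ps ** 2

CLOCK_MHZ = 500.0  # constant frequency for every design being compared

# name -> (total power [uW], delay [ps]); hypothetical values
designs = {"A": (100.0, 300.0), "B": (200.0, 200.0)}

for name, (power_uw, delay_ps) in designs.items():
    print(
        name,
        "PDP [fJ] =", pdp_fj(power_uw, delay_ps),
        "EDP [fJ*ps] =", edp(power_uw, delay_ps, CLOCK_MHZ),
        "ED2P [fJ*ps^2] =", ed2p(power_uw, delay_ps, CLOCK_MHZ),
    )
```

With these made-up numbers, design A has the lower PDP and EDP while design B has the lower ED²P, which shows how the choice of metric shifts the preferred power-delay tradeoff. Because the frequency is held constant, PDP and EDP differ only by the factor 1/f and rank the designs identically.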

Overall Results
Delay Comparison (50% activity)
[Figure: bar chart of delay [FO4] for PowerPC, C2MOS, HLFF, SDFF, StrongArm and SAbFF, grouped as MS latch, pulsed latch and differential designs.]

Conventional Clk-Q vs. minimum D-Q
[Figure: two plots of total power [µW] for HLFF, SDFF, PowerPC, StrongArm FF, SA-F/F, mC2MOS latch, K6 ETL, SSTC and DSTC, one versus delay [FO4] with the pulsed and MS designs marked, and one versus Clk-Q delay [FO4].]
• Hidden positive setup time
• Degradation of total delay
Older 0.22µm comparison results

Overall Results: Single-Edge Triggered Structures
Power Consumption Comparison (50% activity)
[Figure: bar chart of power consumption [µW], split into internal power, data power and clock power, for the MS latch, single-ended and dual-ended designs (PowerPC, C2MOS, DSTC, SSTC, HLFF, SDFF, StrongArm, SAbFF and others).]

Internal Power distribution
[Figure: internal power [µW] under four data patterns (random, activity=0.5; …01010101…, activity=1; …11111111…, activity=0; …00000000…, activity=0) for HLFF, SDFF, PowerPC 603 latch, mC2MOS latch, StrongARM FF, Alpha 21264 FF and K6 ETL.]
• Four sequences characterize the boundaries for internal power consumption
  → …010101…: maximum
  → random, equal transition probability: average
  → …111111…: precharge activity
  → …000000…: leakage + internal clock processing
Older 0.22µm comparison results

Comparison of Clock power consumption
[Figure: bar chart of local clock power consumption [µW] for DSTC MS latch, SSTC MS latch, K6 ETL, StrongArm FF, SA-F/F, mC2MOS, PowerPC MS latch, SDFF and HLFF.]
Older 0.22µm comparison results

Design goals
• Apply
  → Small clock load
  → Short direct path
  → Reduced node swing
  → Low-power feedback
  → Pulsed design
  → Optimization of both Master and Slave latch
• Avoid
  → Positive setup time
  → Sensitivity to clock slope and skew
  → Dynamic (floating) nodes
  → Dynamic Master latch
Conduct Energy-Delay optimizations
Take into account all sources of power dissipation
ALWAYS use Clk-Q + setup time for max delay
For more details on storage elements check Prof. Oklobdzija’s ISSCC’02 talk: http://www.ece.ucdavis.edu/acsel under Presentations

Simulation Conditions:
• Power Supply Voltage: VDD=1.8V nominal
• Temperature T=27°C nominal
• Technology: 0.18µm Fujitsu
• Fan-Out of 4 Delay = 75ps
• Transistor Widths
  → Minimal 0.36µm, Maximal 10µm
• Load: 14 minimal inverters in the technology used
• Clock frequency: 500MHz (250MHz for Dual-Edge)
• Data/Clock slopes of ideal signal 100ps
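
The cycle-time relation from the Flip-Flop Delay slide, T = TClk-Q + TLogic + Tsetup + Tskew, can be combined with the 75 ps FO4 delay and the 500 MHz clock listed in the simulation conditions. The sketch below is only illustrative: the function names and the Clk-Q, setup and skew values are hypothetical, while the FO4 delay, the clock frequency and the 2 to 4 FO4 overhead in a 10 to 20 FO4 cycle are taken from the slides.

```python
# Sketch of the cycle-time budget T = TClk-Q + TLogic + Tsetup + Tskew.
# FO4 delay and clock frequency come from the simulation conditions;
# the Clk-Q, setup and skew values are hypothetical placeholders.

FO4_PS = 75.0  # fan-out-of-4 inverter delay in ps (0.18 um process in the slides)

def cycle_time_ps(clock_mhz: float) -> float:
    """Clock period in ps for a frequency given in MHz."""
    return 1e6 / clock_mhz

def to_fo4(delay_ps: float) -> float:
    """Express a delay in units of FO4 delays."""
    return delay_ps / FO4_PS

def logic_time_ps(cycle_ps: float, t_clk_q_ps: float,
                  t_setup_ps: float, t_skew_ps: float) -> float:
    """Time left for logic: TLogic = T - TClk-Q - Tsetup - Tskew."""
    return cycle_ps - t_clk_q_ps - t_setup_ps - t_skew_ps

def overhead_fraction(flop_overhead_fo4: float, cycle_fo4: float) -> float:
    """Share of the cycle consumed by the flip-flop (Clk-Q + setup)."""
    return flop_overhead_fo4 / cycle_fo4

cycle = cycle_time_ps(500.0)  # 500 MHz clock from the simulation conditions
logic = logic_time_ps(cycle, t_clk_q_ps=200.0, t_setup_ps=50.0, t_skew_ps=100.0)
print("cycle [ps]:", cycle, "cycle [FO4]:", to_fo4(cycle))
print("logic [ps]:", logic, "logic [FO4]:", to_fo4(logic))

# Bounds quoted in the slides: 2-4 FO4 of overhead in a 10-20 FO4 cycle.
print("best-case overhead:", overhead_fraction(2.0, 20.0))
print("worst-case overhead:", overhead_fraction(4.0, 10.0))
```

Under the ranges quoted in the slides, the flip-flop takes between 10% and 40% of the cycle, which is why the deck insists on using Clk-Q + setup time rather than Clk-Q alone when budgeting delay.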