Clocked storage elements
Vladimir Stojanovic

Material in this presentation is adapted from “Digital System Clocking: High-Performance and Low Power Aspects”, V.G.Oklobdzija, V.M.Stojanovic, D.M.Markovic, N.M.Nedovic, © 2003 J.Wiley & IEEE

EE371 Lecture 6

Outline
• Latch and Flip-Flop
• Timing and Power Metrics
• High-Performance Issues
• Low-Energy Issues
• State-of-the-art circuits
• Microprocessor Examples

Recent Interest in Flip-Flops
• Trends in high-performance systems
  » Higher clock frequency
  » More transistors on chip
• Consequences
  » Increased flip-flop overhead relative to cycle time
    – Cycle time 10-20 FO4 delays, flop overhead 2-4 FO4
  » Difficult to control both edges of the clock
  » Higher impact of clock skew
  » Higher crosstalk and substrate coupling
  » Higher power consumption
    – expensive packages and cooling systems
    – limit in performance
  » Clock burns up to 40%, flops up to 20% of total power

Why are clocked storage elements important?
• Cycle time ~ 12-20 FO4 delays
• Flip-flop overhead 2-3 FO4 (about 20% of cycle time!)
[Figure: two flip-flops (D, Q) clocked by Clk with a logic block between them; timing diagram marking the cycle time T, T_{Clk-Q}, T_{Logic}, and T_{Setup}, with T_{D-Q} = T_{Clk-Q} + T_{Setup}]

Latch and Flip-Flop
• Latch – “soft” edge clocking
  » Latch is “transparent” (clock-level sensitive)
• Flip-Flop – “hard” edge clocking
  » After the transition of the clock, data change does not affect the output (clock-edge sensitive)
[Figure: latch and flip-flop symbols with Data, Clock, and Q waveforms]

Flip-Flop and Master-Slave Latch
[Figure: flip-flop (Data, Clock, Q) next to a master-slave latch built from a master (L1) and a slave (L2) latch clocked by Clk1 and Clk2, with Data, Clock, and Q waveforms]
• Operational behavior appears the same…
• How can one recognize the difference without knowing what is inside the “blackbox”?
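The level-sensitive versus edge-sensitive behavior described in the Latch and Flip-Flop slide can be sketched with a small behavioral model. This is only an illustrative sketch, not a circuit from the lecture: the class names, the discrete time steps, and the input waveforms are all assumptions made for the example.

```python
class Latch:
    """Level-sensitive ("soft" edge): Q follows D while clk is high."""

    def __init__(self):
        self.q = 0

    def step(self, clk, d):
        if clk:              # transparent phase: output tracks the input
            self.q = d
        return self.q        # opaque phase: output holds the last value


class FlipFlop:
    """Edge-sensitive ("hard" edge): Q samples D only on a 0->1 clock edge."""

    def __init__(self):
        self.q = 0
        self.prev_clk = 0

    def step(self, clk, d):
        if clk and not self.prev_clk:   # rising edge detected
            self.q = d
        self.prev_clk = clk
        return self.q


# Hypothetical waveforms: data toggles while the clock is high.
clk = [0, 1, 1, 1, 0, 0, 1, 1]
data = [1, 1, 0, 1, 0, 0, 0, 1]

latch, ff = Latch(), FlipFlop()
latch_q = [latch.step(c, d) for c, d in zip(clk, data)]
ff_q = [ff.step(c, d) for c, d in zip(clk, data)]

print("latch:", latch_q)
print("flop: ", ff_q)
```

In this model the latch output changes whenever data changes during the high phase of the clock, while the flip-flop output changes only at the steps where the clock rises.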
Flip-Flop and M-S Latch: Structural Difference
[Figure: Flip-Flop: a pulse generator (Input, Clock, outputs S and R) followed by a pulse-capturing slave latch with no clock (outputs Q); M-S Latch: a master (L1) latch clocked by Φ1 (output Q1) followed by a slave (L2) latch clocked by Φ2 (output Q2)]

T-G Master-Slave Latch
• PowerPC 603 (Gerosa, JSSC 12/94)
[Figure: transmission-gate master-slave latch schematic with D, Q, Clk, Clkb, and Vdd]

Flip-Flop Example 1: SAFF (Sense-Amplifier-Based Flip-Flop)
• DEC Alpha 21264 (Madden & Bowhill, 1990, Matsui 1994)
[Figure: SAFF schematic: pulse generator (differential D inputs, Clk, outputs S and R) driving a capturing latch (outputs Q); S and R pulse waveforms for D=0 and D=1]

Flip-Flop Example 2: HLFF (Hybrid Latch Flip-Flop)
[Figure: HLFF schematic: pulse generator (D, Clk) and second-stage latch (Q); signal at node X for D=0 and D=1]

Logic Diagram of HLFF
[Figure: logic diagram: pulse generator (D, Clk, Enable) and second-stage latch (Q); signal at node X for D=0 and D=1]

Performance Metrics: Timing and Power

Timing Parameters in Latches
• Setup and hold times are defined relative to the closing edge of the clock signal
[Figure: latch (D, Q, Clk) timing diagram with setup time U, hold time H, and clock width tCW; (a) early data arrival: D1, clock-to-output delay tCQ, Q1; (b) late data arrival: D2, data-to-output delay tDQ, Q2]

Timing Parameters in Flip-Flops
• Setup and hold times are defined relative to the rising edge of the clock
[Figure: flip-flop (D, Q, Clk) timing diagram with setup time U, hold time H, the sampling window, clock width tCW, and clock-to-output delay tCQ]

Data-to-Output Delay
• Sum of setup time and Clk-Q delay is the only true measure of performance w.r.t. system speed
• T = T_{Clk-Q} + T_{Logic} + T_{Setup} + T_{Skew}
[Figure: two flip-flops (D, Q) clocked by Clk with a logic block between them; timing diagram marking T, T_{Clk-Q}, T_{Logic}, and T_{Setup}, with T_{D-Q} = T_{Clk-Q} + T_{Setup}]

Clk-Q Delay is a function of D-Clk
[Figure: plot of Clk-Output delay [ps] (50 to 350) versus Data-Clk [ps] (-200 to 200), with Setup and Hold curves and the sampling window marked]

Setup Time vs. Data-to-Output (D-Q) Delay
[Figure: plot of data-to-output delay versus data-to-clock delay, showing the D-Q and Clk-Q curves, the constant Clk-Q region (data arrives early), the variable Clk-Q region, and the failure region (data arrives late); the optimum setup time U_opt, marked at the 45° point, gives the minimum D-Q delay D_DQm]

Power Consumption
Power related to a CSE can be divided into:
  » Input power
    – Data power (P_D)
    – Clock power (P_CLK)
  » Internal power (P_INT)
    – Depends on data activity and glitching activity
  » Load power (P_LOAD)
    – Can be merged into P_INT
P_tot = P_{internal&load} + Σ P_{driver}, summed over the inputs (D, CLK)
[Figure: CSE with D and CLK inputs driven from VDD-supplied drivers (P_D, P_CLK), internal power P_INT, and outputs Q and Qb driving the load (P_LOAD)]

High-Performance Issues

Clock Skew and Jitter
[Figure: Ref_Clock at the driver (t_DRVCLK) and the received clock (t_RCVCLK) over period T, offset by t_skew, with jitter bounds t−jit and t+jit]

The Idea of Clock Uncertainty Absorption
• Change in D-Q delay is much smaller than the clock uncertainty (CSE absorbs a part of the uncertainty)
[Figure: plot of D-Q delay [ps] (200 to 300) versus Clk arrival time [ps] (-30 to 60) around the nominal Clk; over the clock uncertainty t_CU the D-Q delay varies between D_DQm and D_DQM]

Clock Uncertainty Absorption
[Figure: timing diagram of D, Clk, and Q with nominal, early, and late D_D-Clk, clock uncertainty t_CU around T_Nominal = 0, and worst-case D_DQ between D_DQm and D_DQM]

Example: HLFF [Partovi et al, ISSCC’96]

State-of-the-Art CSEs in CMOS Technology

Requirements in the Flip-Flop Design
• Small Clk-Output delay, narrow sampling window
• Low power
• Small clock load
• High driving capability (increased levels of parallelism)
  » Typical flip-flop load in a 0.18µm CMOS ranges from 50fF to over 200fF, with typical values of 100-150fF in critical paths
• Integration of logic into the flop
• Multiplexed or clock scan
• Crosstalk insensitivity - dynamic/high impedance nodes are affected

Low-Energy Designs: Master-Slave Latch Examples
• C2MOS Latch [Suzuki et al, JSSC 1973]
• Master-Slave Latch (MSL) [G.Gerosa et al, JSSC 1994]
• Feedback for pseudo-static operation
• State node SS protection in PPC (decoupled Q)
[Figure: C2MOS latch and MSL schematics with D, Q, Clk, Clk1, master node QM, and state nodes SM and SS]

Master-Slave Latches
• Positive setup times
• Two clock phases:
  » distributed globally
  » generated locally
• Small penalty in delay for incorporating MUX
• Some circuit tricks needed to reduce the overall delay

High-Performance Designs: Flip-Flop Examples 1/2
• HLFF (Hybrid Latch-Flip-Flop) [Partovi et al, ISSCC 1996]
• SDFF (Semi-Dynamic Flip-Flop) [F.Klass et al, JSSC 1998]
• Pulse-generating first stage (precharge-evaluate)
• Keepers for pseudo-static operation
• Output load decoupled from internal nodes
[Figure: HLFF and SDFF schematics with D, Clk, Clk1, S, and Q]

High-Performance Designs: Flip-Flop Examples 2/2
• SAFF (Sense-Amplifier-Based Flip-Flop) [Matsui et al, 1994]
• Fully-differential circuit
• First stage sense-amp can take reduced-swing inputs
• Second stage is a capturing latch
• Delays to Q and !Q are not equal
[Figure: SAFF schematic with differential D inputs, Clk, S, R, and Q outputs]

SAFF with Improved S-R Latch [Nikolic & Stojanovic ISSCC ’99]
• The first stage is the unchanged sense amplifier
• Second stage is sized to provide maximum switching speed
• Driver transistors are large
• Keeper transistors are small and disengaged during transitions

Flip-Flops
• First stage is a pulse generator
  » generates a pulse (glitch) on a rising edge of the clock
• Second stage is a latch
  » captures the pulse generated in the first stage
• Pulse generation potentially results in a negative setup time and soft-edge property
• Must check for hold time violations
Note: power is always consumed in the clocked pulse generator

Delay Comparison: M-S Latches and Flip-Flops
• Flip-Flops are faster
[Figure: bar chart “Min D-Q Delay Comparison”: Delay [FO4] (0.0 to 5.0) for MSL, C2MOS, HLFF, SDFF, SAFF, and M-SAFF; 0.18um, high load (14 min inverters)]

Energy Comparison: M-S Latches and Flip-Flops
• Latches are lower energy
[Figure: bar chart “Energy breakdown (50% activity)”: Energy [fJ] (0 to 120), split into Ext. clock, Ext. data, Int. clock, and Internal non-clk, for MSL, C2MOS, HLFF, SDFF, SAFF, and M-SAFF; 0.18um, high load (14 min inverters)]

Summary
• CSE topology depends on target application
  » Master-Slave Latches for low-energy
  » Flip-Flops & Pulsed latches for high-performance
• Delay is critical in high-speed systems, although minimizing Clk energy is of increasing importance
• Methods for reducing Clk energy
  » Clock gating (more effective in high-performance than in low-energy designs)
  » Reduced-swing clocking
  » Dual-edge clocking

Microprocessor Examples

Sun UltraSPARC-III Basic Flip-Flop (SDFF) [Klass, 1998]
[Figure: SDFF schematic with transistors MN1-MN5 and MP1-MP2, inverters Inv1-Inv6, a NAND gate, and nodes D, S, Q, Clk, Clk1, Vdd]

Sun UltraSPARC-III Dynamic Flip-Flops
[Figure: single-ended and differential dynamic flip-flop schematics with transistors MN1-MN7 and MP1-MP4, inverters Inv1-Inv6, a NAND gate, and nodes D, S, R, Q, Clk, Vdd]

Sun UltraSPARC-III Flip-Flops with Logic Embedding
[Figure: flip-flop with an embedded 2-input XOR: inputs D1 and D2 on transistors MN2a-MN2d, plus MN1, MN3-MN5, MP1-MP2, Inv1-Inv6, a NAND gate, and nodes S, Q, Clk, Vdd]

Latches used in 21064 Alpha [Gronowski et al, 1998]
[Figure: 21064 modified TSPC latches: schematics with transistors P1-P5 and N1-N5, and nodes D, X, Q, Clk]

Logic Embedding in 21064/21164 Alpha
[Figure: latches with embedded logic: 1 level of logic (21064 Alpha) and 2 levels of logic (21164 Alpha); signals D1-D4, X, X1, X2, Q, Clk]

Flip-Flop used in 21264 Alpha
[Figure: differential flip-flop schematic with D, Clk, and Q outputs]

IBM Processors: Level-Sensitive Scan Design (LSSD)
• Hazard-free level-sensitive polarity-hold latch [Eichelberger 1983]
[Figure: latch logic diagram with Data, +Clock, -Clock, and Out]

The PowerPC 603 Master-Slave Latch [Gerosa et al, 1994]
[Figure: master-slave latch schematic with Din, Dout, SCANin, ACLK, C1, C2, and VDD]

IBM Power4™ processor
• Scannable Split Latch with LSSD Capability [Warnock et al, 2002]
[Figure: split-latch schematic: L1 and L2 latches with Data_in, Scan_in, Data_out, clocks c1, c2, and scan_clk, and an attached logic gate]

Low-Energy Issues: Clock & Clocked elements burn more than 60% of the processor power

Low-Swing Clocking: Clock Driver Re-design [H. Kojima, JSSC, April 1995]
• 50% power reduction with half-swing clock (minus some penalty in clock drivers)
[Figure: half-swing clock driver with capacitors Cp1, Cp2, Cn1, Cn2, CA, and CB; clock waveforms CPT, CPB, CNT, and CNB between VDD, H-VDD, and GND, with thresholds Vthp and Vthn marked]

Low-Swing Clocking: CSE Re-design [H. Kawaguchi and T. Sakurai, JSSC, May 1998]
• E_(a),(b) ~ VDD·(VDD − Vth)
• E_(c) ~ (VDD − Vth)^2
• PMOS does not fully turn off
[Figure: clock drivers (a), (b), and (c) producing reduced-swing clocks, labeled Clk (VDD-Vth), Clk (VDD-nVth), and Clk (VDD-Low) from a VDD-Low supply; CSE (D, Clk, Q) with Vwell > VDD]

Low-Swing Clocking: N-only CSEs [D.Markovic, J.Tschanz, V.De, 2001, patent pending]
• N-only clocked transistors, M-S Latch Example (N1 and N2 improve pull-up on SM)
[Figure: M-S latch schematic with D, Q, Clk, master node QM, state nodes SM and SS, and transistors N1-N4]

Clock Gating: Global Clock Gating
• (a) Time-mux (no gating!)
• (b) Global Clk Gating
• Used to save clocking energy when data activity is low
[Figure: (a) register (D, Q, Clk) fed by a 2:1 mux selecting between In and the held output under Load; (b) register (D, Q) whose clock is gated by EN]

Clock Gating: Local Clock Gating [M. Nogawa and Y. Ohtomo, JSSC, May 1998]
• Used to save clocking energy when data activity is low
[Figure: flip-flop (D, QM, Q) clocked by CP, with a pulse generator (Clk, CPI), clock control (P1), and data-transition look-ahead]

Dual-Edge Triggering: Latch-Mux [R.P. Llopis and M. Sachdev, ISLPED Aug. 1996]
[Figure: concept (two latches D, C, Q on opposite clock phases feeding a 2:1 mux selected by Clk) and a circuit example]
• Used to save clocking energy regardless of data activity!
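The latch-mux concept above can be sketched behaviorally: two latches that are transparent on opposite clock levels, and a multiplexer that always selects the latch currently holding its value. This is a minimal sketch, assuming idealized zero-delay elements and discrete time steps; the class names and waveforms are hypothetical and not taken from the cited paper.

```python
class Latch:
    """Level-sensitive latch, transparent when clk equals transparent_level."""

    def __init__(self, transparent_level):
        self.level = transparent_level
        self.q = 0

    def step(self, clk, d):
        if clk == self.level:
            self.q = d       # transparent: follow the input
        return self.q        # otherwise hold


class LatchMuxDET:
    """Dual-edge-triggered element: two opposite-phase latches plus a mux."""

    def __init__(self):
        self.low_latch = Latch(0)    # transparent while clk = 0
        self.high_latch = Latch(1)   # transparent while clk = 1

    def step(self, clk, d):
        q_low = self.low_latch.step(clk, d)
        q_high = self.high_latch.step(clk, d)
        # The mux picks the latch that is opaque in the current phase,
        # so the output only changes at clock edges (rising AND falling).
        return q_low if clk == 1 else q_high


# Hypothetical waveforms: the clock toggles every step.
clk = [0, 1, 0, 1, 0, 1]
data = [1, 0, 0, 1, 1, 0]

det = LatchMuxDET()
out = [det.step(c, d) for c, d in zip(clk, data)]
print(out)
```

In this model every clock edge, rising or falling, passes the value that was present just before the edge to the output, so one data item is captured per half clock period. For the same data rate, a dual-edge element can therefore be clocked at half the frequency of a single-edge one.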
Dual-Edge Triggering: Pulsed-Latch
[Figure: concept (pulse generator driving the clock input C of a latch with D and Q) and circuit examples of (a) single-edge and (b) dual-edge pulse generators with clocks Clk, Clk1, and Clk2]

Dual-Edge Triggered Flip-Flop [N.Nedovic, V.G.Oklobdzija, ESSCIRC 2002]
• Pulse-generating latches trigger capturing latch
[Figure: concept and circuit example: first-stage pulse-generating (PG) latches X and Y (outputs SX and SY) driving a capturing latch CL (outputs Q); clocks Clk, Clk1, and Clk2]

Design goals
• Apply
  » Small clock load
  » Short direct path
  » Reduced node swing
  » Low-power feedback
  » Pulsed design
  » Optimization of both Master and Slave latch
• Avoid
  » Positive setup time
  » Sensitivity to clock slope and skew
  » Dynamic (floating) nodes
  » Dynamic Master latch
Conduct Energy-Delay optimizations
Take into account all sources of power dissipation
ALWAYS use Clk-Q + setup time for max delay
For more details on storage elements check Prof. Oklobdzija’s ISSCC’02 workshop: under Presentations

What to Expect in the Future?
• Incorporating logic into the CSE
• Absorbing clock skew
• Pipeline boundaries will start to blur – pulsed latches
• Latch-less domino style clocking, signals used to clock
• Synchronous design only in a limited domain
• Asynchronous communication between synchronous domains
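The rule "ALWAYS use Clk-Q + setup time for max delay" and the cycle-time relation T = T_{Clk-Q} + T_{Logic} + T_{Setup} + T_{Skew} from the timing slides can be illustrated with a short calculation. This is a sketch only: the characterization points, the logic delay, and the skew below are made-up numbers for illustration, not measurements from the lecture.

```python
# Hypothetical characterization of one CSE: each entry is
# (setup time in ps, i.e. how early data arrives before the clock edge,
#  resulting Clk-Q delay in ps). Clk-Q grows as data arrives later.
samples = [(150, 120), (100, 120), (50, 125), (20, 140), (0, 170), (-20, 230)]


def d_to_q(setup_ps, clk_q_ps):
    """Data-to-output delay: the sum of setup time and Clk-Q delay."""
    return setup_ps + clk_q_ps


def cycle_time(clk_q_ps, logic_ps, setup_ps, skew_ps):
    """T = T_Clk-Q + T_Logic + T_Setup + T_Skew."""
    return clk_q_ps + logic_ps + setup_ps + skew_ps


# The optimum operating point minimizes D-Q, not Clk-Q alone.
best_setup, best_clk_q = min(samples, key=lambda s: d_to_q(*s))
min_dq = d_to_q(best_setup, best_clk_q)

# Assumed logic delay and clock skew, again purely illustrative.
t_cycle = cycle_time(best_clk_q, logic_ps=800, setup_ps=best_setup, skew_ps=40)

print(f"optimum setup = {best_setup} ps, Clk-Q = {best_clk_q} ps, min D-Q = {min_dq} ps")
print(f"cycle time = {t_cycle} ps")
```

With these made-up numbers the smallest Clk-Q delay (120 ps) occurs at a large setup time, but the smallest D-Q delay (160 ps) occurs at a 20 ps setup time, which is why the sum is the quantity to optimize.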