Timing Issues
Mohammad Sharifkhani

Reading
• Textbook II, Chapter 10
• Textbook I, Chapters 12 and 13

Motivation
• Time is of the essence!
– We do things in order, and so do processors
• Procedural dependency
• Resource reusability
• Synchronous architectures are preferred
– Ease of implementation
– Predictability
– Compatibility with well-known arithmetic algorithms
• A reference clock plays a key role
– We usually neglect the non-idealities of the clock in the design cycle

Timing
[Figure: clock frequency; two signals]
• Signals that can only transition at predetermined times with respect to a signal clock are called "{syn,meso,plesio}chronous"
• An asynchronous signal can transition at any arbitrary time.

Definitions
• Data passed between two different clock domains

Mesochronous Timing
[Figure: mesochronous link with unknown interconnect delay]

Plesiochronous
• Two interacting modules have independent clocks generated from separate crystal oscillators

Asynchronous Interconnect
• No clock is needed
• Speed is determined by job completion

Hand Shaking
• The four-phase handshake is level-sensitive, while the two-phase handshake is edge-triggered (fewer transitions, at the expense of edge-triggered circuitry).
• System A places data on the bus. It then raises Req to indicate that the data is valid.
• System B samples the data when it sees a high value on Req and raises Ack to indicate that the data has been captured. System A lowers Req, then system B lowers Ack.
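The four-phase sequence just described can be traced step by step. The sketch below is illustrative only: the function names and the data value are hypothetical, and both systems are modeled as plain sequential code rather than concurrent hardware.

```python
def four_phase_transfer(data):
    """Simulate one four-phase (level-sensitive) transfer from A to B.

    Returns the value captured by B and the (Req, Ack) levels after each step.
    """
    req, ack = 0, 0
    bus = data                  # A places data on the bus
    levels = [(req, ack)]

    req = 1                     # A raises Req: data is valid
    levels.append((req, ack))

    captured = bus              # B samples the bus because Req is high
    ack = 1                     # B raises Ack: data has been captured
    levels.append((req, ack))

    req = 0                     # A lowers Req
    levels.append((req, ack))

    ack = 0                     # B lowers Ack: both wires are back at rest
    levels.append((req, ack))
    return captured, levels


def count_transitions(levels):
    """Count wire transitions on Req and Ack over a sequence of levels."""
    return sum((r0 != r1) + (a0 != a1)
               for (r0, a0), (r1, a1) in zip(levels, levels[1:]))
```

Each transfer costs four wire transitions here. A two-phase protocol signals with one toggle of Req and one toggle of Ack, which is the saving in transitions mentioned in the first bullet.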
• Req is not synchronous to clkB, so a synchronizer is needed

Hand Shaking (Cont'd)

Synchronous Timing
[Figure: In → register R1 → combinational logic (Cin → Cout) → register R2 → Out, both registers clocked by CLK]
• A quick look

Timing Definitions and Basics

Latch Parameters
[Figure: latch timing diagram with transparent and opaque phases, showing T, PWm, tsu, thold, tc-q, td-q]
• Delays can be different for rising and falling data transitions

Register Parameters
[Figure: register timing diagram showing T, tsu, thold, tc-q]
• Delays can be different for rising and falling data transitions

Clock Uncertainties
Sources of clock uncertainty:
1. Clock generation
2. Devices
3. Interconnect
4. Power supply
5. Temperature
6. Capacitive load
7. Coupling to adjacent lines

Clock Nonidealities
• Clock skew
– Spatial variation in temporally equivalent clock edges; deterministic + random, tSK
• Clock jitter
– Temporal variations in consecutive edges of the clock signal; modulation + random noise
– Cycle-to-cycle (short-term) tJS
– Long-term tJL
• Variation of the pulse width
– Important for level-sensitive clocking

Clock Skew and Jitter
[Figure: clock waveforms showing tSK and tJS]
• Both skew and jitter affect the effective cycle time
• Only skew affects the race margin
• Do not touch the clock signal if not necessary!
– Sometimes the simplest architecture is the safest
– But not necessarily the lowest power!
• Data- and state-independent clock distribution is desired
• An enabled FF is a popular choice in design
• Consider the effect of the clock load on power!
Clock Skew
[Figure: histogram of number of registers vs. clock delay; earliest occurrence of the Clk edge at nominal − δ/2, latest at nominal + δ/2; insertion delay and maximum Clk skew marked]

Positive and Negative Skew
[Figure: pipeline In → R1 → combinational logic → R2 → combinational logic → R3 with clock arrival times tCLK1, tCLK2, tCLK3 set by delay elements; (a) positive skew, (b) negative skew]

Positive Skew
[Figure: CLK1 and CLK2 waveforms with edges 1–4, showing TCLK, TCLK + δ, δ, and δ + th]
• Launching edge arrives before the receiving edge

Negative Skew
[Figure: CLK1 and CLK2 waveforms with edges 1–4, showing TCLK, TCLK + δ, and δ]
• Receiving edge arrives before the launching edge

Timing Constraints (positive skew)
[Figure: In → R1 (tCLK1) → combinational logic → R2 (tCLK2); register parameters tc-q, tc-q,cd, tsu, thold; logic delays tlogic, tlogic,cd]
• Minimum cycle time: $T + \delta > t_{c-q} + t_{su} + t_{logic}$
• With positive δ there is more time to process the data
• Worst case is when the receiving edge arrives early (δ < 0)
• Hold time constraint: $t_{c-q,cd} + t_{logic,cd} > t_{hold} + \delta$
• Otherwise R2 cannot latch In1 before it changes after the CLK1 edge
• Worst case is when the receiving edge arrives late

Race between data and clock (positive skew)
• $\delta + t_{hold} < t_{c-q,cd} + t_{logic,cd}$, independent of T

Considerations
• δ > 0: This corresponds to a clock routed in the same direction as the flow of the data through the pipeline. The skew has to be strictly controlled. If the hold constraint is not met, the circuit malfunctions independent of the clock period.

Question
• Would there be any race if the skew is negative?
• What would you do to avoid a race?

Negative Skew
• δ < 0: When the clock is routed in the opposite direction of the data, the skew is negative and the condition to avoid a race is unconditionally met. The circuit operates correctly independent of the skew. The skew reduces the time available for actual computation, so the clock period T has to be increased by |δ|.
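The two skew constraints can be checked numerically. The sketch below assumes the sign convention δ = tCLK2 − tCLK1 (positive when the receiving register's clock is later) and uses made-up delay values in picoseconds; none of the numbers come from the slides.

```python
def min_clock_period(t_cq, t_logic, t_su, skew):
    """Lower bound on T from the setup constraint T + skew > t_cq + t_logic + t_su."""
    return t_cq + t_logic + t_su - skew


def hold_met(t_cq_cd, t_logic_cd, t_hold, skew):
    """Hold (race) constraint: t_cq_cd + t_logic_cd > t_hold + skew.

    Note that the clock period does not appear: a violation cannot be
    fixed by slowing the clock down.
    """
    return t_cq_cd + t_logic_cd > t_hold + skew


# Hypothetical register and logic delays, in ps (assumptions for illustration).
T_CQ, T_LOGIC, T_SU = 100, 800, 50
T_CQ_CD, T_LOGIC_CD, T_HOLD = 40, 30, 20

for skew in (-50, 0, 49, 50):
    bound = min_clock_period(T_CQ, T_LOGIC, T_SU, skew)
    ok = hold_met(T_CQ_CD, T_LOGIC_CD, T_HOLD, skew)
    print(f"skew={skew:4d} ps  T must exceed {bound} ps  hold met: {ok}")
```

Positive skew lowers the bound on the period but eats into the race margin; negative skew does the reverse, which is the trade-off stated in the Considerations and Negative Skew slides.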
• If race (hold time) is a problem, route the clock in the direction opposite to the data flow

Impact of Jitter
[Figure: clock with period TCLK and edge uncertainty ±tjitter driving registers (tc-q, tc-q,cd, tsu, thold) around combinational logic (tlogic, tlogic,cd)]
• Both skew and jitter should be accounted for in feedback structures

Longest Logic Path in Edge-Triggered Systems
[Figure: clock waveform over period T showing the latest point of launching (considering jitter), TClk-Q, TLM, TSU, and the earliest arrival of the next cycle's edge (TJI + δ)]

Clock Constraints in Edge-Triggered Systems
• If the launching edge is late and the receiving edge is early, the data will not be too late if:
$T_{c-q} + T_{LM} + T_{SU} < T - T_{JI,1} - T_{JI,2} - \delta$
• Minimum cycle time is determined by the maximum delays through the logic:
$T_{c-q} + T_{LM} + T_{SU} + \delta + 2T_{JI} < T$
• Skew can be either positive or negative

Shortest Path
[Figure: clock waveforms showing the earliest point of launching, TClk-Q, TLm, TH, and the nominal clock edge; data must not arrive before this time]

Clock Constraints in Edge-Triggered Systems
• If the launching edge is early and the receiving edge is late:
$T_{c-q} + T_{Lm} - T_{JI,1} > T_H + T_{JI,2} + \delta$
• Minimum logic delay:
$T_{c-q} + T_{Lm} > T_H + 2T_{JI} + \delta$

False path
• Path 1 (5 tgate) is never exercised.
• If A = 1, the critical path goes through OR1 and OR2.
• If A = 0 and B = 0, the critical path is through I1, OR1 and OR2 (corresponding to a delay of 3 tgate).
• If A = 0 and B = 1, the critical path is through I1, OR1, AND3 and OR2.
• The critical path does not depend on C or D.

How to counter Clock Skew?
[Figure: data and clock routing through registers and logic (In → REG → logic → REG → Out); the clock distribution gives negative skew along one direction and positive skew along the other]

Sources of Uncertainty

Device variation
• Variation
• Matching
– Poly orientation
– Dopant profiles
• Can be modeled and compensated for

Interconnect variation (ILD)
• Pattern and ILD correlation
• Use of fillers is necessary

Temp. and Power
• Temp.
– Time varying (milliseconds)
– Effect of clock gating
– Has a gradient; systematic, so it can be compensated for
• Power
– Instantaneous IR drop (switching activity)
– Jitter (short pulses, data dependent)
– Cannot be compensated for (only decoupling caps help)

Data dependent loading
• Capacitive coupling and crosstalk work the same way. They are modeled as a form of jitter because of their random nature.

Clock Distribution
[Figure: H-tree driven by CLK]
• Clock is distributed in a tree-like fashion

Example
• Clock H-tree
– Clock skew: difference between the arrival times of the clock signal at two leaves
– Identical branches and leaves

Example
• Considering three parameters:
– Both FETs and wires; 64 samples + main buffer
– All deterministic factors are nulled out; only within-chip variation is considered
– Random ΔL of FETs with distribution N(0, 0.035 um)
– Random ΔW of wires with distribution N(0, 0.25 um)
– Spatial ΔL: $\Delta L = w_0 + w_x x + w_y y$

Example
• Results
– For random ΔL: 139 ps vs. 171 ps without considering spatial constraints
– For random ΔW: 41 ps vs. 49 ps
– Without considering spatial constraints, the worst case is too pessimistic

More realistic H-tree
• 10 balanced segments
• Each segment contains 580 drivers
• All RC-matched
• If we leave the clock tree for the last minute, we may end up with multiple timing constraint violations!
[Restle98]

The Grid System
[Figure: clock grid with GCLK drivers]
• Absolute delay is minimized
• Allows late design changes
• No RC-matching
• Large power

Examples
• Alpha 21064 (0.75 um), 200 MHz
• Clock load 3.25 nF (40%)
• Skew < 200 ps (10%)

Example: DEC Alpha 21164
• Clock frequency: 300 MHz, 9.3 million transistors
• Total clock load: 3.75 nF
• Power in clock distribution network: 20 W (out of 50)
• Uses two-level clock distribution:
– Single 6-stage driver at center of chip
– Secondary buffers drive left and right side clock grid in Metal3 and Metal4
• Total driver size: 58 cm!
21164 Clocking
[Figure: clock waveform with tcycle = 3.3 ns, tskew = 150 ps, trise = 0.35 ns; location of the pre-driver and final drivers on the die]
• 2-phase single-wire clock, distributed globally
• 2 distributed driver channels
– Reduced RC delay/skew
– Improved thermal distribution
– 3.75 nF clock load
– 58 cm final driver width
• Local inverters for latching
• Conditional clocks in caches to reduce power
• More complex race checking
• Device variation
• Skew: 90 ps (65 ps effective)

21164 Clocking
• Clock buffers carefully sized to minimize the skew
• The direction of the clock is considered
• One gate between the latches
• Dummy fillers (increase capacitance)
– Dummies are shielded

Reducing Skew
1. Balance clock paths from a central distribution source to individual clocking elements using H-tree structures.
2. The use of local clock grids (instead of routed trees) can reduce skew at the cost of increased capacitive load and power dissipation.
3. If data-dependent clock load variations cause significant jitter, differential registers that have a data-independent clock load should be used.
– The use of gated clocks to save power also results in data-dependent clock load and increased jitter. In clock networks where the fixed load is large (e.g., using clock grids), the data-dependent variation might not be significant.
4. If data flows in one direction, route data and clock in opposite directions. This eliminates races at the cost of performance.
5. Shield clock wires from adjacent signal wires.
6. ILD: dummy fills.
7. Temperature: delay-locked loops, as discussed later in this chapter, can easily compensate for temperature variations.
8. Power supply variation: on-chip decoupling capacitors. Unfortunately, decoupling capacitors require a significant amount of area, and efficient packaging solutions must be leveraged to reduce chip area.
Clock Drivers

Clock Skew in Alpha Processor

EV6 (Alpha 21264) Clocking
600 MHz, 0.35 micron CMOS
[Figure: global clock waveform with tcycle = 1.67 ns, trise = 0.35 ns, tskew = 50 ps; PLL]
• 2-phase, with multiple conditional buffered clocks
– 2.8 nF clock load
– 40 cm final driver width
• Local clocks can be gated "off" to save power
• Reduced load/skew
• Reduced thermal issues
• Multiple clocks complicate race checking

21264 Clocking
• Hierarchical clocking
• Trade-off between power and skew
• Flexibility in the types of clocks in each region
• Not shielded

EV6 Clock Results
[Figure: GCLK skew at Vdd/2 crossings (scale 5 to 50 ps) and GCLK rise times, 20% to 80% extrapolated to 0% to 100% (scale 300 to 345 ps)]

EV7 Clock Hierarchy
Active Skew Management and Multiple Clock Domains
[Figure: clock hierarchy with a PLL, three DLLs, SYSCLK, GCLK (CPU core), NCLK (memory controller), L2L_CLK and L2R_CLK (L2 cache)]
+ widely dispersed drivers
+ DLLs compensate static and low-frequency variation
+ divides design and verification effort
- DLL design and verification is added work
+ tailored clocks

Latch based timing
• We can have combinational circuits between the two latches of a FF
– More flexibility in terms of timing

Flip-Flop-Based Timing
[Figure: flip-flop → logic loop; the cycle starting at φ = 0 is divided into flip-flop delay (TClk-Q), logic delay, TSU, and skew]
Representation after M. Horowitz, VLSI Circuits 1996.
Latch timing
[Figure: latch with D, Q and Clk, showing tD-Q and tClk-Q]
• When data arrives at a transparent latch, it passes with delay tD-Q: the latch is a 'soft' barrier
• When data arrives at a closed latch, the data has to be 're-launched' (delay tClk-Q)

Single-Phase Clock with Latches
[Figure: latch → logic loop; clock with period P and pulse width PW (latch transparent), with edge skews Tskl and Tskt]

Preventing late arrivals
• Case 1: the logic (LM) can start ahead of time; clock-to-Q (c2q) limits
• Case 2: D-to-Q (d2q) limits; the logic can still operate

Preventing Premature Arrivals
• Data should not pass through the latch more than once during its transparent mode
• Otherwise the data loops within the transparent window
[Figure: single latch timing]

Latch-Based Design
• L1 latch is transparent when φ = 0
• L2 latch is transparent when φ = 1
[Figure: L1 latch → logic → L2 latch → logic]

Latch-Based Timing
[Figure: L1 latch → static logic → L2 latch → logic, with the L1-transparent (φ = 0) and L2-transparent (φ = 1) phases and the skew marked]
• Can tolerate skew!
• Long path 1: hits L2 while it is transparent and goes through L2
• Short path 1: hits L2 while it is closed and has to wait till L2 becomes transparent

Latch based timing
[Figure: alternating latches, transparent when high and transparent when low]

Slack-borrowing
[Figure: In → L1 (CLK1) → CLB_A (tpd,A) → L2 (CLK2) → CLB_B (tpd,B) → L1 (CLK1), with nodes a to e; timing diagram over TCLK with clock edges (1) to (4) showing when a, b, c, d and e become valid, the tDQ delays, and the slack passed to the next stage]
• CLB_B starts before edge (3) arrives to latch its input; i.e., since CLB_A finished earlier than (3), the extra time is passed to CLB_B
• Again, e is valid before (4), in time to latch the input of the next CLB

Example
• T = 125
• L4 becomes transparent at the edge; no problem when f arrives exactly then

Design consideration: hold time violation
[Figure: data available for CLL]
• If the falling edge of clk2 comes with too much skew, THL might not be able to latch the previous data because of a hold time violation (i.e., D2 is overwritten too quickly after the edge)

Domino logic with delays
• Clock skew
• No time slack borrowing

Skew tolerant domino
• Can we borrow time?
• Multiphase: time borrowing is possible

Self-timed and Asynchronous Design
Functions of the clock in synchronous design
1) Acts as completion signal
2) Ensures the correct ordering of events
Truly asynchronous design
1) Completion is ensured by careful timing analysis
2) Ordering of events is implicit in logic
Self-timed design
1) Completion ensured by completion signal
2) Ordering imposed by handshaking protocol

Synchronous Pipelined Datapath
[Figure: In → R1 (tpd,reg) → Logic Block #1 (tpd1) → R2 → Logic Block #2 (tpd2) → R3 → Logic Block #3 (tpd3) → R4, all clocked by CLK]
• What the clock does:
1) It ensures that the physical timing constraints are met
2) Clock events serve as a logical ordering mechanism for the global system events
• If we can guarantee these two items by other means, we can remove the clock and save the power, area, and complexity of the clock tree…

Synch. design
• It assumes that all clock events or timing references happen simultaneously over the complete circuit. This is not the case in reality, because of effects such as clock skew and jitter.
• Significant current flows over a very short period of time
• Linking the physical and logical constraints has some obvious effects (e.g., on throughput)

Self-Timed Pipelined Datapath
[Figure: In → R1 → F1 (tpF1) → R2 → F2 (tpF2) → R3 → F3 (tpF3) → Out; each stage has a handshaking block (HS) with Req/Ack toward its neighbors and Start/Done toward its logic block]
• What does each signal do?
• The logical ordering of the operations is ensured by the acknowledge-request scheme, often called a handshaking protocol.

Asynch. properties
• Timing signals are generated locally; no high-precision clock distribution over the chip (skew, etc.)
• Separating the physical and logical ordering: performance (data dependency and no worst-case design)
• The automatic shut-down of blocks that are not in use can result in power savings (power)
• Robust to variations in manufacturing and operating conditions such as temperature

Completion Signal Generation
[Figure: logic network (In → Out) with a delay module driven by Start]
Using Delay Element (e.g.
in memories); the delay module produces Done

Completion Signal Generation
• Using redundant signal encoding

Completion Signal in DCVSL
[Figure: DCVSL gate with two pull-down networks (PDN) driven by In1, In2 and their complements, precharged by Start; outputs B0 and B1, and the Done signal]

Self-Timed Adder
[Figure: (a) differential carry generation: Start-precharged carry chains C0 to C4 and their complements, using the Pi, Gi and Ki signals; (b) completion signal: Done generated from C0 to C4 and their complements]

Completion Signal Using Current Sensing
[Figure: inputs → input register (Start) → static CMOS logic → output; a current sensor on GNDsense produces A, a minimum delay generator produces B (tMDG), and Done is formed from A and B; timing diagram shows tdelay, toverlap, tpd-NOR, and the point where the output is valid]
• Data-independent reference!
• Minimum delay

Hand-Shaking Protocol: Two Phase Handshake
• The four events (data change, request, data acceptance, acknowledge) proceed in a cyclic order.
• Every transition means that the action is valid!

Event Logic – The Muller-C Element
[Figure: (a) schematic symbol with inputs A, B and output F]
(b) Truth table:
A B | F(n+1)
0 0 | 0
0 1 | F(n)
1 0 | F(n)
1 1 | 1
• Sequential element
[Figure: implementations: (a) logic (set-reset latch), (b) majority function, (c) dynamic]

2-Phase Handshake Protocol
• Start from (DataReady, Ack) = (0, 0). When the inputs go to (1, 0), Req = 1.
• The C-element is blocked (and locked), and no new data is sent to the data bus (Req stays high) as long as the transmitted data has not been processed by the receiver, no matter what DataReady is.
• Advantage: FAST; minimal number of signaling events (important for global interconnect)
• Disadvantage: requires the detection of transitions that may occur in either direction; initialization is important

Problem: Self-timed FIFO
[Figure: In → R1 → R2 → R3 → Out with register enables (En) and Done signals; a chain of C-elements links Reqi/Acki to Reqo/Acko]
• All 1s or 0s → pipeline empty
• Alternating 1s and 0s → pipeline full

2-Phase Protocol Example
• Assume there is a register at the input which loads the data at the beginning of the Eval phase
From [Horowitz]

Example
• DataReady1 is asserted.
• Req to the second block is asserted; the first C-element is locked.
• The second block loads data and starts the evaluation process.

Example
• DataReady2 is asserted.
• Req to the third block is asserted; the second C-element is locked.
• The third block loads data and starts the evaluation process.
• The first C-element is released and can accept a DataReady from the previous stage. (If Req has already come, the first Req is released and the stage goes to the eval phase.)

Example

4-Phase Handshake Protocol
• Also known as RTZ (return-to-zero)
• Slower, but unambiguous

Problem: 4-Phase Handshake Protocol
• Implementation using Muller-C elements

Example
• Latches: positive edge-triggered or a level-sensitive implementation (latch when level = 1)

Self-Resetting Logic
[Figure: chain of precharged logic blocks (L1), (L2), (L3), each with its own completion detection; gate-level detail with inputs A, B, C, internal node int, output out, between VDD and GND]
• Post-charge logic
• Self-resetting

Clock-Delayed Domino
[Figure: domino stage with pulldown network, input D1, output Q1 (also D2), clocked by CLK1 and generating CLK2 for the next stage]
• This is a style of dynamic logic where there is no global clock signal. Instead, the clock for one stage is derived from the previous stage.

Asynchronous-Synchronous Interface
[Figure: asynchronous system (fin) → synchronization → synchronous system (fCLK)]

Synchronizers and Arbiters
• Arbiter: circuit to decide which of 2 events occurred first
• Synchronizer: arbiter with the clock as one of the inputs
• Problem: the circuit HAS to make a decision in limited time; which decision it makes is not important
• Caveat: it is impossible to ensure correct operation
• But we can decrease the error probability at the expense of delay

A Simple Synchronizer
[Figure: clocked latch with input D, internal node int, inverters I1 and I2, output Q, and CLK]
• Data sampled on rising edge of the clock
• Latch will eventually resolve the signal value, but ... this might take infinite time!
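The trade-off between waiting time and error probability can be made concrete. The sketch below assumes the usual single-pole (exponential decay) model of metastability resolution, in which the error rate is the probability of sampling inside the undefined input range, divided by the sampling period, and reduced by exp(−T/τ) after waiting T. The formula and the function names are assumptions, since no formula survives in the slide text; the input values are those of the Mean Time to Failure example that follows.

```python
import math


def sync_error_rate(wait, tau, t_phi, t_signal, t_r, v_gap, v_swing):
    """Synchronization errors per second after waiting `wait` seconds.

    Assumed model: P_init = (v_gap / v_swing) * (t_r / t_signal) is the
    chance that a sample lands in the undefined region; one sample is taken
    every t_phi seconds; metastability decays as exp(-wait / tau).
    """
    p_init = (v_gap / v_swing) * (t_r / t_signal)
    return p_init * math.exp(-wait / tau) / t_phi


def mean_time_to_failure(**kwargs):
    """Mean time to failure in seconds (inverse of the error rate)."""
    return 1.0 / sync_error_rate(**kwargs)


# Values from the Mean Time to Failure example (VDD = 5 V taken as the swing).
params = dict(tau=310e-12, t_phi=10e-9, t_signal=50e-9, t_r=1e-9,
              v_gap=1.0, v_swing=5.0)

rate = sync_error_rate(wait=10e-9, **params)
mtf = mean_time_to_failure(wait=10e-9, **params)
print(f"N(T) = {rate:.2e} errors/sec, MTF(T) = {mtf:.2e} sec")
```

With these inputs the model gives roughly 3.9e-9 errors/sec after waiting one full clock period, in line with the figure quoted in the example. Because the wait enters through an exponential, each additional τ of waiting improves the MTF by a factor of e.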
Synchronizer: Output Trajectories
[Figure: Vout (0.0 to 2.0 V) vs. time (0 to 300 ps)]
• Single-pole model for a flip-flop

Mean Time to Failure

Example
• $T_\phi$ = 10 nsec = T
• $T_{signal}$ = 50 nsec
• $t_r$ = 1 nsec
• $\tau$ = 310 psec
• $V_{IH} - V_{IL}$ = 1 V ($V_{DD}$ = 5 V)
• N(T) = $3.9 \times 10^{-9}$ errors/sec
• MTF(T) = $2.6 \times 10^{8}$ sec = 8.3 years
• MTF(0) = 2.5 µsec

Influence of Noise
[Figure: probability density p(v) between VIL and VIH, uniform around VM at T = 0; after time T the distribution is reduced (logarithmic scale) but still uniform]
• Low-amplitude noise does not influence synchronization behavior

Typical Synchronizers
[Figure: synchronizer using a 2-phase clocking circuit (φ1, φ2); synchronizer using a delay line]

Cascaded Synchronizers Reduce the Failure Rate (Increase MTF)
[Figure: In → Sync → O1 → Sync → O2 → Sync → Out, all clocked by φ]

Arbiters
[Figure: (a) schematic symbol: arbiter with inputs Req1, Req2 and outputs Ack1, Ack2; (b) implementation with internal nodes A and B; (c) timing diagram of Req1, Req2, A, B and Ack1, showing the metastable interval and the VT gap]

PLL-Based Synchronization
[Figure: Chip 1 and Chip 2, each a digital system with its own PLL and clock buffer, exchanging data; both PLLs are fed by a reference clock from a crystal oscillator through a divider]
• fsystem = N × fcrystal
• fcrystal < 200 MHz

PLL Block Diagram
[Figure: reference clock and local clock → phase detector (Up/Down) → charge pump → loop filter (vcont) → VCO → system clock, with a divide-by-N in the feedback path]

Phase Detector
[Figure: output before filtering; transfer characteristic]

Phase-Frequency Detector
[Figure: (a) schematic: two resettable D flip-flops clocked by A and B, producing UP and DN, with a common Rst; (b) state transition diagram with states (UP = 0, DN = 1), (UP = 0, DN = 0), (UP = 1, DN = 0), transitions on edges of A and B; (c) timing waveforms of A, B, UP, DN]

PFD Response to Frequency
[Figure: waveforms of A, B, UP, DN]

PFD Phase Transfer Characteristic
[Figure: average (UP − DN) vs. phase error (deg) from −2π to 2π, with VDD marked]

Charge Pump
[Figure: UP and DN switches between VDD and ground, output to the VCO control input]

PLL Simulation

Clock Generation using DLLs
[Figure: Delay-Locked Loop (delay-line based): fREF → phase detector (U/D) → charge pump → filter → delay line (DL) → fO]
[Figure: Phase-Locked Loop (VCO-based): fREF → PD (U/D) → CP → filter → VCO → fO, with ÷N in the feedback path]

Delay Locked Loop

DLL-Based Clock Distribution
[Figure: GLOBAL CLK feeding several digital circuits, each through its own VCDL controlled by a phase detector and CP/LF]