Clock Distribution Shmuel Wimer Bar Ilan Univ. Eng. Faculty Technion, EE Faculty July 2010 1 Clock System Architecture clk3 External Clock ext_clk Clock Generator Clock Distribution gclk Buffers clk1 Clocked Elements Gaters clk2 Chip Chip receives external clock through I/O pad. Clock generator adjusts the global clock to the external clock. Global clock is distributed across the chip. Local drivers and gaters drive the physical clocks to clocked elements. July 2010 2 Global Clock Generation • Receives external clock signal and produce the global clock distributed across the die. • A large skew occurs between external clock and the physical clocks at clocked elements due to delay of distribution network (wires, buffers, gaters). • Therefore, data at clocked elements is no more in sync with data at I/O pins. • Phased Locked Loop (PLL) compensates this delay. • PLL can perform frequency multiplication to obtain the required on-chip frequencies. July 2010 3 Synchronous Chip Interface with PLL Chip A communicates synchronously with chip B Chip B uses the clock sent by chip A. Data in and out must be synchronized to the common clock. A PLL produces the global clock of chip B such that it is in sync with the external clock. Chip A Chip B CLKin ext_clk CLKout clk Dout Din July 2010 Din ref_clk PLL clk_out gclk fdbk_clk Dout Clock Distribution 4 How PLL Works? Charge Pump Loop Filter C Up I R ref_clk M fdbk_clk Phase Detect Vctrl N Voltage Controlled Oscillator clk_out I Down July 2010 5 Phase – Frequency Detector (PFD) 1 D Q Q_A: B should go faster A CLR CLR B 1 D Q Q_B: B should go slower The two flip-flops receive the signals at their clock input (one is usually a reference and the other is the sampled). The output of the leading flip-flop is 1 for the lead duration. Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero. July 2010 6 What happens when the reference and the sampled signals are a shift of each other? A: reference B: sampled Q_A: sampled should go faster Q_B: sampled should go slower The spikes at Q_B are a result of the delay of the AND gate driving the CLR input of flip-flip and the internal delay from CLR to Q. July 2010 7 What happens when the reference and the sampled signals have different frequencies? A: reference B: sampled Q_A: sampled should go faster Q_B: sampled should go slower Sampled is more often 1-value than the reference is, since rising edge of B occurs more often than rising edge of A. July 2010 8 Charge Pump faster 1 D Q Icp CLK_ref CLR Sup Vctrl Sdn CLR CLK_fdbk 1 D Icp Q slower Converts PFD error (digital) to charge (analog), which then controls PLL VCO. Charge is proportional to PFD widths: Qcp Iup tfaster Idn tslower . July 2010 9 Current Mirror Iin Vcc V Iout N2 N1 P1 Iin Vss V P2 Iout Charge pump consists of current mirrors which are sources of constant current. Device N1 is in saturation since its gate is connected to high voltage. Ids (=Iin) depends only on Vgs. Vgs is similar in N2, hence Iout=Iin. This is an ideal current source with infinite output impedance since Iout is independent of N2 load; a change in output voltage doesn’t affect Iout. Current mirror works similarly for P transistors. July 2010 10 How Charge Pump Works? R load – determines the current through current mirror Vcc I I Vcc I C P-type current mirror faster R Switch – open when faster = 1 Switch – open when slower = 1 Vout slower I I I N-type current mirror Vss July 2010 11 Faster Mode Vout → Vcc Vcc Vcc C faster=1 R Vout slower=0 Vss July 2010 12 Slower Mode Vout → Vss Vcc Vcc C faster=0 R Vout slower=1 Vss July 2010 13 Loop Filter Vcc C R Vctrl + Vctrl - Differential amplifier connected as a unity-gain follower is used. July 2010 14 Voltage Controlled Oscillator (VCO) A Ring Oscillator cascades an odd number of inverters and feeds back the last output to first inverter (even number of inverters will be stable). It starts to oscillate spontaneously. Vout If tinv is the delay of an inverter and n is the number of inverters, oscillation frequency is f 1 2ntinv Frequency can be controlled by number of inverters and supply voltage of inverter (higher voltage obtains faster inverter). July 2010 15 Components of VCO Buffering for driving clk_out clk_out Vctrl Vcc - cc V V-cc + - +- Vss Ring of 5 inverters July 2010 Level Converter from Vctrl-Vss to Vcc-Vss 16 Delay Locked Loop (DLL) • It is a variant of PLL that uses voltage-controlled delay line rather than oscillator. • It adjusts phase only. Frequency multiplication is impossible. • It is simpler than PLL, less sensitive to Vctrl noise and requires simpler loop filter. • It is very difficult to correctly design PLL and DLL. It requires expertise in control systems and analog circuit design. July 2010 17 How DLL Works? Charge Pump Loop Filter C Up I R ref_clk Phase Detect Vctrl fdbk_clk Voltage Controlled Delay Line clk_out I Down July 2010 18 Delay Line The signal from input to output is delayed In 1 or not according to the control bit. Out 0 In 2n1 1 2n2 1 0 0 Dn1 Dn2 1 1 Out 0 D0 Connecting a series of n delay elements, each of delay 2i , n 1 i 0, it is possible to control the dely of the line to any value from 0 to 2n 1 dealy units by an n-bit control word Dn 1 , Dn 2 , July 2010 , D0 . 19 Clock Distribution Networks July 2010 20 Tree Clock Network (Unconstrained) clk1 clk2 clk3 clkn No constraints imposed on buffers and wires. Used mostly by automatic tools in automatic synthesis flows. Can be used for small blocks within large design. Tools aim at minimizing the variance of clock delays. July 2010 21 If TCLK is the clock delay at a leaf, the variance exapression i 2 1 n T T i1 CLKi n i1 CLKi should be minimized. n Serpentine routing or extra buffers may be introduced to obtain small variance. Constraints on power can be imposed by limiting number and size of clock buffers and width of wires. July 2010 22 Clock Distribution with Grids Grid feeds flops directly, no local buffers Clock driver tree spans height of chip Internal levels shorted together Low skew but high power July 2010 23 Clock drivers are on perimeter July 2010 Clock drivers are on grid points 24 Delay and Skew in Grid Distribution July 2010 25 DEC’s Alpha Microprocessor Clocking July 2010 26 DEC Alpha 21264 Microprocessor Clock distribution July 2010 27 Clock Distribution with Spines July 2010 28 Intel’s Pentium4 Clock Distribution July 2010 29 July 2010 30 Clock Distribution with Trees RC-Tree Each branch is individually routed to balance RC delay H-Tree Recursive pattern to distribute signals uniformly with equal delay over area More skew but less power July 2010 31 Clock H-Tree chip / functional block / IP sequential elements clock / PLL July 2010 32 IBM / Motorola PowerPC Clock Distribution July 2010 33 Delay Calculation We use Elmore delay model. Sub trees are modeled as capacitive loads July 2010 34 Clock Skew and Jitter • Clock should theoretically arrive simultaneously to all sequential circuits. • Practically it arrives in different times. The differences are called clock skews. • Skews result from paths mismatches, process variations and ambient conditions, resulting physical clocks. • Most system distribute a global clock and then use local clock gaters located near clocked elements. Clock skew consists of the following components: July 2010 35 • Systematic is the portion existing under nominal conditions. It can be minimized by appropriate design. • Random is caused by process variations like devices’ channel length, oxide thickness, threshold voltage, wire thickness, width and space. It can be measured on silicon and adjusted by delay components. • Drift is caused by time-dependent environmental variations, occurring relatively slowly. Compensation of those must takes place periodically. • Jitter is rapid clock changes, occurring by power noise and clock generator jitter. It cannot be compensated. July 2010 36 Factors affecting clock skew, Intel 1998, 0.25u. July 2010 37 Skew, Clock Cycle and Design Margins Clock Jitter is the same order as skew, but far more difficult to compensate. July 2010 38 Skew Modeling Point of divergence Q 1 2 Clock Generator 1 2 Tclk1 m CL Tclk2 n D TCLK1 m T i 1 i m TCLK2 n T i 1 i n Cl - capacitive load, Id - drive current, VCC - voltage swing Cl VCC Id July 2010 39 Consider a small change in the delay, taking the linear term. VCC ClVCC Cl VCC Cl I d VCC Cl I d 2 VCC Cl I d Id Id Id VCC Cl I d Cl Id VCC Standard deviation of stage delay is = . is around 5%. If clock buffers delays are normally distributed and independent of each other, then TCLK1 m , and TCLK2 n . Skew is: Tskew | TCLK2 TCLK1 | July 2010 m n skew . 40 Clock Distribution Switching Power Consider m - level clock tree. Let Cl_m be the total load of its far-end driven sequential elements. Assuming a fixed fanout k of each clock tree buffer, the dynamic power is: 2 PCLK_m Cl_mVCC f, PCLK_ m j July 2010 Cl_m kj 2 VCC f , 0 j m 1. 41 Summing over whole clock tree: 2 PCLK j 0 Cl_ m j VCC f m 1 m 1 Cl_m 2 VCC f j 0 kj m 1 1 k 2 VCC fCl_m 1 1 k How much of the power is consumed by the far end drivers of clock tree? PCLK _ m PCLK July 2010 1 1 k 1 1 k m . 42 Given the number of sequential elements in a block, at least 50% of the switching power is consumed by the far end drivers (clock tree is binary, k=2). This number approaches 1 rapidly with k growth. Example: Assume a block with 214 sequential elements and H-tree clock distribution. Then k=4 and m=7. The far end drivers consume nearly 75% of the clock tree switching power, while adding the next upper level drivers brings it to more than 90%. July 2010 43 Active Clock De-Skewing • Compensates process variability, temperature gradients, imperfect design. • Can be implemented for global fixes (small HW overhead) or local fixes (high HW overhead). • Can be used at testing for one time fix (variability occurring during manufacturing), or dynamically concurrently with chip operation. • Its implementation is a difficult design challenge. July 2010 44 Intel’s Pentium2 De-Skewing System 1998, 450MHz Clock 0.25u process 60pSec skew w/o fix 15pSec skew with fix Two clock spines for two clock regions. A phase detector detects relative shifts. Clock of a region is shifted by a delay line. July 2010 45 Delay line consists of two cascaded inverters. Each has a programmable load consists of eight parallel P-N gate capacitors. The shift register stores a thermometer code for load programming in steps of 12pSec. July 2010 46 Intel’s IA64 Itanium1 De-Skewing System 2000, 800MHz clock 0.18u process 28pSec skew with fix X4 increase w/o fix 30 independent de-skew regions. PLL Each cluster is driven from a global H-tree. Delay circuit in de-skew region are similar to Pentium3 with 20-bit registers. July 2010 47 July 2010 48 Proposal for H-Tree Clock De-Skew – Hierarchical Approach If a phase detector (PD) has a skew guard band g, then guard bands may accumulate along tree paths. For example, if a logic stage is shared between region B and C, it may add 7g time units to path delay. July 2010 49 Proposal for H-Tree Clock De-Skew – Mesh Approach Clock is distributed by Htree, but de-skew takes place by neighbor leaves phase detection. A delay buffer accepts phase inputs from its 4 neighbors and then decides of whether to increase, decrease or not change its delay. July 2010 50 Clock Characteristics of Commercial Processors July 2010 51 Power Consumption in Chips • Clock power may reach 50% of total (dynamic + static). • Clock gating is very useful and standard design practice • Four gating methods: – Synthesis based, automated by EDA tools, RTL compilers, inserted into clock-tree – Clock enable signals manually defined by designer, inserted into clock-tree, FFs’ clock input – Data-Driven clock gating, inserted at FF-level – Auto-gated FF, inserted at latch-level 52 FF Data Toggling (DSP core) 23k FFs, Test bench of 250k CLK cycles 10% of application run-time CLK is enabled. Of which, only 1.6% CLK pulses are useful! 53 FF Data Toggling 0.3 1.2 63 control blocks of ϻP, 200k FFs 0.25 1 0.2 0.8 0.15 0.6 0.1 0.4 a 0.05 0.2 0.03 0 0 1 6 11 16 21 26 31 36 block count 41 46 51 56 61 cumulative normalized clock capacitive load data activity / clock activity b Data-Driven CLK Gating 55 tight constraint 56 How Many FFs To Group ? k: # flip-flops, q: FF probability of D=Q q=1-p Worst case : All FF are toggling independently of each other. Latch overhead amortized over k FFs Net saving per FF csaving q k cFF cw clatch k 1 q cw cOR FF Gater’s disabling probability Derivate by k: Probability of enabling FF qk ln q cFF cW clatch k 2 0 57 58 Optimal Flip-flop k-size Grouping Given n flip-flops and m 1 clock cycles a a1, , am is the activity (toggling) of flip-flop FF1: 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0 FF2: 0 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1 ai a j is the number of redundant clock pulses ocurring by jointly clocking FFi and FFj 59 FF Pairwise Activity Model G V , E, w: FF pairwise activity graph. vi V corresponds to FFi . eij vi , v j E is FF pairing. ai | a j is joint toggling. w eij ai a j is redundant clock pulses, hence a waste. E E: vertex matching 60 Total power: P 2e E ai | a j ij v V i v V i ai e E e E ai ai | a j a j ai | a j ij ij ai e E ij Essential + Waste ai a j v V ai e E w eij i ij 61 Flip-Flop Grouping Algorithm FFj Minimize: FFi FFj FFi FF j FFi Optimal FFs pairing can be solved by minimal cost perfect graph matching. Can be repeated for groups of size 4, 8, 16 … 62 Is repeated perfect matching optimal ? a7 | a8 a1 | a2 a1 a3 a3 | a4 a6 | a5 a2 a7 a4 a6 a5 a8 63 No! Here is the optimal 4-size grouping 64 Multi-Bit Flip-Flop Saving the power of internal CLK drivers 1-bit flip-flop 2-bit flip-flop CLK CLK D Master Latch CLK Slave Latch + Q = CLK CLK D Master Latch Master Latch CLK Slave Latch Q1 CLK CLK D2 CLK Slave Latch D1 Master Latch Slave Latch Q2 Q 65 66 MBFF should be combined with data-driven CG to maximize energy savings. Toggling vectors (VCD) are unfortunately not always available. Data-to-clock toggling ratio (probability) is more often available. How to utilize it for MBFF optimal grouping? 67 What is the energy waste in 2-bit MBFF? p j 1 pi pi 1 p j pi p j 2 pi p j For n FFs grouped in n/2 MBFFs it is n n2 j 1 i 1 p j 2 psi pti What is the optimal MBFF grouping? Group the FFs such that p1 p2 pn 68 Auto-Gated FF 69 Look-Ahead CLK Gating 70 Relaxed constraint 71 Power Saving Per FF save cdyn 1 p cFF+CLK cFF cO k cFF+CLK p cX kcO cAint cFF cO 3 72 Which FF to Gate? 73 Results 74 75