CSE241 VLSI Digital Circuits, Winter 2003
Lecture 06: Timing
CSE241 L3 ASICs.1    Kahng & Cichy, UCSD ©2003

This Class + Logistics
- Timing
- Reading
  - Storage elements, clock distribution, clock tree synthesis
  - Whitepapers/datasheets on STA; papers on clock tree synthesis
- Schedule
  - MT in one week (lab/recitation fair game); Lab #2 due Mon 1/27
- HW #9: As a block’s layout is compacted to fit into a smaller and smaller region, the timing of the block at first improves, but then worsens. Explain.
- HW #10: Hold time violations mean that the chip doesn’t work at any frequency. Propose several distinct methods for fixing hold time violations (guided by post-routing static timing analysis), and explain the pros and cons of each.
- HW #11: Compare DEC’s first Alpha and first StrongARM processors (look up transistor counts, supply voltage, frequency, etc.). (a) How much of StrongARM’s power efficiency can be attributed to process, supply, and frequency scaling? (b) What factors might contribute to the remainder?
(Slide courtesy of S. P. Levitan, U. Pittsburgh)

Review
- Static timing analysis (Lecture 4)
  - Pin-based timing graph
  - Directed acyclic graph (DAG) of timing arcs
  - Longest path in a DAG can be found in time linear in #arcs (edges)
  - Slack = required arrival time - actual arrival time (long path analysis)
- Logic synthesis (Lecture 5)

Static Analysis vs. Dynamic Analysis
- Why static analysis when dynamic simulation is more accurate?
- Drawbacks of simulation
  - Requires input vectors (stimuli for circuit)
  - Long runtimes
- Example: calculate worst-case rising delay from a to z
  - Exponential explosion with number of possible design input states
[Figure: circuit with inputs a, b, c and output z]

          b=0           b=1
  c=0     a-z delay1    a-z delay3
  c=1     a-z delay2    a-z delay4

STA Terminology
- (Actual) arrival time (AAT, or AT) = time at which a pin switches state
  - Usually 50% point on voltage curve, i.e., AT = t50
- Slew time = time over which signal switches
  - Usually difference between 10% and 90% points on voltage curve, i.e., tslew = t90 - t10
- Required arrival time (RAT) = time at which a signal must arrive in order to avoid a chip fail
- Slack = RAT - AAT
  - Positive slack good (= margin), negative slack bad
[Figure: voltage vs. time waveform rising to Vdd, with the 10%, 50%, and 90% points marked]

Example: What is slack at PO?
[Figure: timing graph annotated with delays (d) and arrival times (at); the primary output has at=11 and rat=10, so Slack = -1]

Example: Incremental Timing Analysis
[Figure: the same timing graph after a local change, with updated arrival times; at the primary output rat=10 and Slack = 0]
- Amount of work is bounded by sizes of fanin, fanout cones of logic

Early-Mode Analysis
- Definitions change as follows
  - RAT = lower bound on arrival time
  - Propagate shortest possible instead of longest possible delays
  - Slack = Arrival - Required
- Example: negative slack because ATc is too small (early)
[Figure: small network over pins a, b, c, x, y annotated with arrival times (AT), required arrival time (RAT), and slacks (SL)]

Enhancements of STA
- Incremental timing analysis
- Nanometer-scale process effects: variation (probabilistic timing analysis)
- Interference: crosstalk
- Multiple inputs switching
- Conservatism of delay propagation
(Old: HW #8: Suppose you change the size of one (combinational) gate in your design, thus invalidating the
previous timing analysis. How much work must be done to regain a correct timing analysis?)
(Courtesy K. Keutzer et al., UCB)

Timing Correction Driven by STA
- “Incremental performance analysis backplane”
- Fix electrical violations
  - Resize cells
  - Buffer nets
  - Copy (clone) cells
- Fix timing problems
  - Local transforms (bag of tricks)
  - Path-based transforms
(DAC-2002, Physical Chip Implementation)

Local Synthesis Transforms
- Resize cells
- Move critical signals forward
- Buffer or clone to reduce load on critical nets
- Decompose large cells
- Swap connections on commutative pins or among equivalent nets
- Pad early paths
- Area recovery
(DAC-2002, Physical Chip Implementation)

Transform Example
[Figure: double inverter removal; path delay drops from 4 to 2]
(DAC-2002, Physical Chip Implementation)

Resizing
[Figure: delay d vs. load curves for cell sizes A, B, C; a gate drives loads of 0.2, 0.2, and 0.3; delay values 0.035 and 0.026 are marked]
(DAC-2002, Physical Chip Implementation)

Cloning
[Figure: delay d vs. load curves for cell sizes A, B, C; a gate driving loads d, e, f, g, h of 0.2 each is cloned so that the loads are split between two copies]
(DAC-2002, Physical Chip Implementation)

Buffering
[Figure: delay d vs. load curves for cell sizes A, B, C]
[Figure: buffering example; a buffer B with input load 0.1 is inserted to drive part of the fanout of a gate whose loads d, e, f, g, h are 0.2 each]
(DAC-2002, Physical Chip Implementation)

Redesign Fan-in Tree
[Figure: fan-in tree of unit-delay gates with Arr(a)=4, Arr(b)=3, Arr(c)=1, Arr(d)=0; restructuring the tree reduces Arr(e) from 6 to 5]
(DAC-2002, Physical Chip Implementation)

Redesign Fan-out Tree
[Figure: fan-out tree before and after redesign; longest path drops from 5 to 4]
- Slowdown of buffer due to load
(DAC-2002, Physical Chip Implementation)

Decomposition
[Figure]
(DAC-2002, Physical Chip Implementation)

Swap Commutative Pins
[Figure: gate with inputs a, b, c before and after swapping pin connections, annotated with arrival times and pin delays]
- Simple sorting on arrival times and delay works
(DAC-2002, Physical Chip Implementation)

Outline
- Clocking
  - Storage elements
  - Clocking metrics and methodology
  - Clock distribution
  - Package and useful-skew degrees of freedom
  - Clock power issues
- Gate timing models

Why Clocks?
- Clocks provide the means to synchronize
  - By allowing events to happen at known timing boundaries, we can sequence these events
- Greatly simplifies building of state machines
  - No need to worry about variable delay through combinational logic (CL)
  - All signals delayed until clock edge (clock imposes the worst-case delay)
[Figure: FSM (combinational logic with a feedback register) and dataflow (register, combinational logic, register)]

Clock Cycle Time
- Cycle time is determined by the delay through the CL
  - Signal must arrive before the latching edge
  - If too late, it waits until the next cycle
    - Synchronization and sequential order become incorrect
- tcycle > tprop_delay + toverhead
- Can change circuit architecture to obtain smaller tcycle

Pipelining
- For dataflow:
  - Instead of a long critical path, split the critical path into chunks
  - Insert registers to store intermediate results
  - This allows 2 waves of data to coexist within the CL
- Can we extend this ad infinitum?
  - Overhead eventually limits the pipelining
    - E.g., 1.5 to 2 gate delays for latch or FF
  - Granularity limits as well
    - Minimum time quantum: delay of a gate
- Single stage (CL block A+B with delay tpd): tcycle > tpd + toverhead
- Two stages (CL block A with delay tpd1, CL block B with delay tpd2): tcycle > max(tpd1, tpd2) + toverhead
[Figure: register, CL A+B, register; and register, CL A, register, CL B, register]

FO4 INV Delays Per Clock Period
[Figure: number of FO4 inverter delays per clock period (0 to 120) vs. year (1982 to 2004) for the 386, 486, DX2, DX4, Pentium, Pentium MMX, Pentium Pro, Pentium II, Celeron, Pentium III, and Pentium 4]
- FO4 INV = inverter driving 4 identical inverters (no interconnect)
- Half of frequency improvement has been from reduced logic stages, i.e., pipelining

Outline
- Clocking
  - Storage elements
  - Clocking metrics and methodology
  - Clock distribution
  - Package and useful-skew degrees of freedom
  - Clock power issues
- Gate timing models

Clock Skew
- Most
“high-profile” of clock network metrics
- Skew
  - Maximum difference in arrival times of clock signal to any 2 latches/FFs fed by the network
  - Skew = max | t1 - t2 |
[Figure: clock source (e.g., PLL) driving CLK1 and CLK2, which arrive at times t1 and t2; latency and skew are marked on the waveforms. Fig. from Zarkesh-Ha]
(Sylvester / Shepard, 2001)

Clock Skew Causes
- Designed (unavoidable) variations: mismatch in buffer load sizes, interconnect lengths
- Process variation: process spread across die yielding different Leff, Tox, etc. values
- Temperature gradients: change MOSFET performance across die
- IR voltage drop in power supply: changes MOSFET performance across die
- Note: delay from clock generator to fan-out points (clock latency) is not important by itself
  - BUT: increased latency leads to larger skew for same amount of relative variation
(Sylvester / Shepard, 2001)

Clock Jitter
- Clock network delay uncertainty
  - From one clock cycle to the next, the period is not exactly the same each time
- Maximum difference in phase of clock between any two periods is jitter
- Must be considered in max path (setup) timing; typically O(50ps) for high-end designs
(Sylvester / Shepard, 2001)

Clock Jitter Causes
- PLL oscillation frequency
- Various noise sources affecting clock generation and distribution
  - E.g., power supply noise dynamically alters drive strength of intermediate buffer stages
- Jitter reduced by minimizing IR and L*(di/dt) noise
(Courtesy Cypress Semi; Sylvester / Shepard, 2001)

Clocking Methodology (Edge-Triggered)
[Figure: flip-flop, combinational logic, flip-flop, with clock period tper]
- Max(tpd) < tper - tsu - tc2q - tskew
  - If violated: delay is too long for data to be captured
- Min(tpd) > th - tc2q + tskew
  - If violated: delay is too short and data can race through, skipping a state

Example of tpdmax Violation
- Suppose there is skew between the registers in a dataflow (regA after regB)
- “i” gets its input values from regA at
transition in Ck’
- CL output “o” arrives after Ck transition due to skew
- To correct this problem, can increase cycle time
[Figure: regA, combinational logic, and regB with clocks Ck and Ck’ offset by tskew; waveforms of Ck, Ck’, i, and o show o arriving too late after tpdmax]

Example of tpdmin Violation: Race Through
- Suppose clock skew causes regA to be clocked before regB
- “i” passes through the CL with little delay (tpdmin)
- “o” arrives before the rising Ck’ causes the data to be latched
- Cannot be fixed by changing frequency: have rock instead of chip
[Figure: regA, combinational logic, and regB with clocks Ck and Ck’ offset by tskew; waveforms of Ck, Ck’, i, and o show o changing too early after tpdmin]

Outline
- Clocking
  - Storage elements
  - Clocking metrics and methodology
  - Clock distribution
  - Package and useful-skew degrees of freedom
  - Clock power issues
- Gate timing models

Clock Distribution
- General goal of clock distribution
  - Deliver clock to all memory elements with acceptable skew
  - Deliver clock edges with acceptable sharpness
- Clocking network design is one of the greatest challenges in the design of a large chip
- Clocks generally distributed via wiring trees (and meshes)
  - Low-resistance interconnect to minimize delay
  - Multiple drivers to distribute driver requirements
    - Use optimal sizing principles to design buffers
  - Clock lines can create significant crosstalk

Clock Distribution Problem Statement
- Objective
  - Minimum skew (performance and hold time issues)
  - Minimum cell area and metal use
  - (sometimes) minimal latency
  - (sometimes) particular latency
  - (sometimes) intermixed gating for power reduction
  - (sometimes) hold to particular duty cycle: e.g.
50:50 ± 1 percent
- Subject to:
  - Process variation from lot to lot
  - Process variation across the die
  - Radically different loading (FF density) around the die
  - Metal variation across the die
  - Power variation across the die (both static IR and dynamic)
  - Coupling (same and other layers)

Issues in Clock Distribution Network Design
- Skew
  - Process, voltage, and temperature
  - Data dependence
  - Noise coupling
  - Load balancing
- Power, CV²f (no ½ or α)
  - Clock gating
- Flexibility/Tunability
- Compactness: fit into existing layout/design
- Reliability
  - Electromigration

Skew: Clock Delay Varies With Position
[Figure]

Clock Distribution Methods
- RC-Tree
  - Less capacitance
  - More accuracy
  - Flexible wiring
- Grids
  - Reliable
  - Less data dependency
  - Tunable (late in design)
- Shown here for final stage drivers driving F/F loads

RC-Trees
- H-Tree
- X-Tree
- Binary-Tree
- Asymmetric trees can be, and are, used due to uneven sink distribution, hard macros in floorplan (hierarchical clock distribution), etc.; the basic goal is to have even RC delays

Grids
- Gridded clock distribution common on earlier DEC Alpha microprocessors
- Advantages:
  - Skew determined by grid density, not too sensitive to load position
  - Clock signals available everywhere
  - Tolerant to process variations
  - Usually yields extremely low skew values
- Disadvantages:
  - Huge amount of wiring and power
  - To minimize such penalties, need to make grid pitch coarser → lose the grid advantage
[Figure: predrivers driving a global grid]
(Sylvester / Shepard, 2001)

Trees
- H-tree (Bakoglu)
  - One large central driver, recursive structure to match wirelengths
  - Halve wire width at branching points to reduce reflections
- Disadvantages
  - Slew degradation along long RC paths
  - Unrealistically large central driver
courtesy of P.
Zarkesh-Ha
    - Clock drivers can create large temperature gradients (e.g., Alpha 21064: ~30 °C)
  - Non-uniform load distribution
  - Inherently non-scalable (wire R growth)
- Partial solution: intermediate buffers at branching points
(Sylvester / Shepard, 2001)

Buffered Tree
[Figure: buffered clock tree from the PLL through global buffers (NGBuf, WGBuf, EGBuf, SGBuf) and buffer levels L2 and L3; labels: “Drives all clock loads within its region”, “Other regions of the chip”]
(Sylvester / Shepard, 2001)

Buffered H-tree
- Advantages
  - Ideally zero-skew
  - Can be low power (depending on skew requirements)
  - Low area (silicon and wiring)
  - CAD tool friendly (regular)
- Disadvantages
  - Sensitive to process variations
    - Devices: want same size buffers at each level of tree
    - Wires: want similar segment lengths on each layer in each source-sink path
  - !!! Local clocking loads inherently non-uniform
(Sylvester / Shepard, 2001)

Tree Balancing
- Some techniques:
  a) Introduce dummy loads
  b) Snaking of wirelength to match delays
- Con: routing area often more valuable than silicon
(Sylvester / Shepard, 2001)

Examples From Processor Chips
- H-Tree, Asymmetric RC-Tree (IBM)
- Grids: DEC [Alphas]
- Serpentines: Intel x86 [Young ISSCC97]

Examples From Processor Chips
[Figure panels: DEC Alpha 21064 clock spines; DEC Alpha 21064 RC delays; DEC Alpha 21164 RC local delays; DEC Alpha 21164 RC delays for global distribution (spine + grid)]

ReShape Clocks Example (High-End ASIC)
- Balanced, shielded H-tree for pre-clock distribution
- Mesh for block level distribution

Pre-clock 2 Level H-tree
- All routes 5-6 µm M6/5, shielded with 1 µm grounds
- ~10 buffers per node
  - E.g., ganged BUFx20’s
- Output mesh must hit every sub-block
[Figure label: output mesh]

Block Level Mesh (0.18 µm)
- Clumps of 1-6 clock buffers, surrounded by capacitor pads
- Shielded input and output M6 shorting straps
- Pre-clock connects to input shorting straps
- 1 µm M5 ribs every 20-30 µm (4 to 6 rows)
- Max 600 µm stride

Problems with Meshes
- Burn more power at low frequencies
- Difficult for ‘spare’ clock domains that will not tolerate regioning
- Post placement (and routing) tuning required
- Blocks more routing resources (solution: integrated power distribution with ribs can provide shielding for ‘free’)
- No ‘beneficial skew’ possible

Problems with Meshes (#2)
- Clock gating only easy at root
- Fighting tools to do analysis:
  - Clumped buffers a problem in static timing analysis tools
  - Large shorted meshes a problem for STA tools
    - What does Elmore delay calculation look like for a non-tree?
  - Need full extractions and SPICE-like simulation (e.g., Avant! Star-Sim) to determine skew

Benefits of Meshes (#3)
- Deterministic since shielded all the way down to rib distribution
- No ECO placement required: all buffers preplaced before block placement
- Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew
- ECO placements of FFs later do not require rebalance of tree
- “Idealized” clocking environment for concurrent RTL design and timing convergence dance

Mesh Example
[Figure: ~100k flops, 6 blocks]

Clock Skew Thermal Map
[Figure: pre-tuning]

Clock Skew Thermal Map #2
[Figure: 50ps block / 100ps global skew, post tuning]

Alternative Clock Network Strategy
- Globally: tree
  - Power requirements reduced relative to global grid
  - Smaller routing requirements, frees up global tracks
  - Trees balanced easily at global level
  - Keeps global skew low (with minimal process variation)
(Sylvester / Shepard, 2001)

Vertex Locations in a
Bounded-Skew Tree
- Given a skew bound, where can internal nodes of the given topology (e.g., a, b, v) be placed?
[Figure: topology over source s0, internal nodes a, b, v, and sinks s1 to s4, with candidate placement regions for a, b, and v labeled by skew values 0, 2, 4, 6]

Deferred-Merge Embedding (DME) Algorithm
- Bottom-Up: build tree of merging regions corresponding to given topology
- Special case: skew = 0 → merging segments
[Figure: topology over s0, a, b, v, s1 to s4 and merging regions mr(a), mr(b), mr(v) for skew bound B = 4]

Top-Down Embedding Phase of DME
- Top-Down: choose embedding points within merging regions
[Figure: the same topology with embedding points for a, b, and v chosen inside their merging regions, B = 4]

Zero-Skew Example
[Figure: zero-skew tree for 555 sinks, 40 obstacles]

Outline
- Clocking
  - Storage elements
  - Clocking metrics and methodology
  - Clock distribution
  - Package and useful-skew degrees of freedom
  - Clock power issues
- Gate timing models

Skew Reduction Using Package
• Most clock network latency occurs at global level (largest distances spanned)
• Larger latency → larger skew
• With reverse scaling, routing low-RC signals at global level becomes more difficult & area-consuming
(Sylvester / Shepard, 2001)

Skew Reduction Using Package
- Incorporate global clock distribution into the package
- Flip-chip packaging allows for high density, low parasitic access from substrate to IC
[Figure: µP/ASIC on solder bumps over the package substrate, with the system clock routed in the substrate]
• RC of package-level wiring up to 4 orders of magnitude smaller than on-chip wiring
• Global skew reduced
• Lower capacitance → lower power
• Opens up global routing tracks
• Results not yet conclusive
(Sylvester / Shepard, 2001)

Useful Skew (= cycle-stealing)
[Figure: FF, fast path, FF, slow path, FF under zero skew and under useful skew, with hold and setup timing slacks for each]
- Zero skew
  • Global skew constraint
  • All skew is bad
W.
Dai, UC Santa Cruz
- Useful skew
  • Local skew constraints
  • Shift slack to critical paths

Skew = Local Constraint
- Timing is correct as long as the signal arrives in the permissible skew range
- Permissible range (safe): -d + thold < Skew < Tperiod - D - tsetup
  - D: longest path
  - d: shortest path
  - Skew below the lower bound: race condition
  - Skew above the upper bound: cycle time violation
[Figure: FF, combinational logic, FF, and the skew axis divided into race condition, safe (permissible range), and cycle time violation]
(W. Dai, UC Santa Cruz)

Skew Scheduling for Design Robustness
- Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge
- Can solve a linear program to maximize robustness = determine prescribed sink skews
[Figure: three FFs connected by paths of 2 ns and 6 ns, T = 6 ns; the schedule “0 0 0” is at the verge of violation, while “2 0 2” has more safety margin]
(W. Dai, UC Santa Cruz)

Potential Advantages of Useful Skew
- Reduce peak current consumption by distributing the FF switch point in the range of permissible skew
[Figure: CLK current waveforms for 0-skew and U-skew]
- Affords extra margin to increase clock frequency or reduce sizing (= power)
(W. Dai, UC Santa Cruz)

Conventional Zero-Skew Flow
1. Synthesis
2. Placement
3. 0-Skew Clock Synthesis
4. Clock Routing
5. Signal Routing
6. Extraction & Delay Calculation
7. Static Timing Analysis
(W. Dai, UC Santa Cruz)

Useful-Skew Flow
[Flow diagram; boxes listed in extraction order]
- Permissible range generation
- Existing Placement
- Initial skew scheduling
- U-Skew Clock Synthesis
- Clock tree topology synthesis
- Clock net routing
- Clock Routing
- Clock timing verification
- Signal Routing
- Extraction & Delay Calculation
- Static Timing Analysis
W.
Dai, UC Santa Cruz

Outline
- Clocking
  - Storage elements
  - Clocking metrics and methodology
  - Clock distribution
  - Package and useful-skew degrees of freedom
  - Clock power issues
- Gate timing models

Clock Power
- Power consumption in clocks due to:
  - Clock drivers
  - Long interconnections
  - Large clock loads: all clocked elements (latches, FFs) are driven
- Different components dominate
  - Depending on type of clock network used
  - E.g., grid: huge pre-drivers and wire cap. drown out load cap.
(Sylvester / Shepard, 2001)

Clock Power Is LARGE
- P = α C Vdd² f
- Not only is the clock capacitance large, it switches every cycle!
(Sylvester / Shepard, 2001)

Low-Power Clocking
- Gated clocks
  - Prevent switching in areas of chip not being used
  - Easier in static designs
- Edge-triggered flops in ARM rather than transparent latches in Alpha
  - Reduced load on clock for each latch/flop
  - Eliminated spurious power-consuming transitions during latch flowthrough (transparency)
(Sylvester / Shepard, 2001)

Clock Area
- Clock networks consume silicon area (clock drivers, PLL, etc.)
and routing area
- Routing area is most vital
- Top-level metals are used to reduce RC delays
  - These levels are precious resources (unscaled)
  - Power routing, clock routing, key global signals
- Reducing area also reduces wiring capacitance and power
- Typical #’s: Intel Itanium: 4% of M4/5 used in clock routing
(Sylvester / Shepard, 2001)

Clock Slew Rates
- To maintain signal integrity and latch performance, minimum slew rates are required
- Too slow: clock is more susceptible to noise, latches are slowed down, setup times eat into timing budget [Tsetup = 200 + 0.33 * Tslew (ps)], more short-circuit power for large clock drivers
- Too fast: burns too much power, overdesigned network, enhanced ground bounce
- Rule of thumb: Trise and Tfall of clock are each between 10% and 20% of clock period (10%: aggressive target)
  - 1 GHz clock: Trise = Tfall = 100-200ps
(Sylvester / Shepard, 2001)

Example: Alpha 21264
- Grid + H-tree approach
- Power = 32% of total
- Wire usage = 3% of metals 3 & 4
- 4 major clock quadrants, each with a large driver connected to local grid structures
(Sylvester / Shepard, 2001)

Alpha 21264 Skew Map
[Figure: skew map. Ref: Compaq, ASP-DAC00]
(Sylvester / Shepard, 2001)

Power vs.
Skew
- Fundamental design decision
  - Meeting skew requirements is easy with unlimited power budget
  - Wide wires reduce RC product but increase total C
  - Driver upsizing reduces latency (→ reduces skew as well) but increases buffer cap
- SOC context: plastic package power limit is 2-3 W
(Sylvester / Shepard, 2001)

Clock Distribution Trends
- Timing
  - Clock period dropping fast, skew must follow
  - Slew rates must also scale with cycle time
  - Jitter: PLLs get better with CMOS scaling but other sources of noise increase
    - Power supply noise more important
    - Switching-dependent temperature gradients
- Materials
  - Cu reduces RC slew degradation, potential skew
  - Low-k decreases power, improves latency, skew, slews
- Power
  - Complexity, dynamic logic, pipelining → more clock sinks
  - Larger chips → bigger clock networks
(Sylvester / Shepard, 2001)

Outline
- Clocking
  - Storage elements
  - Clocking metrics and methodology
  - Clock distribution
  - Package and useful-skew degrees of freedom
  - Clock power issues
- Gate timing models

Gate Timing Characterization
[Figure: gate with inputs A, B, D, output F, and load CL]
- “Extract” exact transistor characteristics from layout
  - Transistor width, length, junction area and perimeter
  - Local wire length and inter-wire distance
- Compute all transistor and wire capacitances

Cell Timing Characterization
- Delay tables generated using a detailed transistor-level circuit simulator, SPICE (differential-equations solver)
- For a number of different input slews and load capacitances, simulate the circuit of the cell
  - Propagation time (50% Vdd at input to 50% at output)
  - Output slew (10% Vdd at output to 90% Vdd at output)
[Figure: input and output waveforms vs. time, with tpd and tslew marked]

Delay and Transition Measurement
[Figure: waveforms with the 20%, 50%, and 80% levels marked, showing the transition time and the cell delay]

Non-linear effects reflected in tables
- DG = f (CL, Sin) and Sout = f (CL, Sin)
  - Non-linear
- Interpolate between table entries
- Interpolation error is usually below 10% of SPICE
[Figure: two lookup surfaces over output capacitance and input slew: delay at the gate (intrinsic delay marked) and output slew (resulting waveform)]

Timing Library Example (.lib)

    library(my_lib) {
      delay_model : table_lookup;
      library_features (report_delay_calculation);
      time_unit : "1ns";
      voltage_unit : "1V";
      current_unit : "1mA";
      leakage_power_unit : 1uW;
      capacitive_load_unit(1,pf);
      pulling_resistance_unit : "1kohm";
      default_fanout_load : 1.0;
      default_inout_pin_cap : 1.0;
      default_input_pin_cap : 1.0;
      default_output_pin_cap : 0.0;
      default_cell_leakage_power : 0.0;
      nom_voltage : 1.08;
      nom_temperature : 125.0;
      nom_process : 1.0;
      slew_derate_from_library : 0.500000;
      operating_conditions("slow_125_1.08") {
        process : 1.0 ;
        temperature : 125 ;
        voltage : 1.08 ;
        tree_type : "worst_case_tree" ;
      }
      default_operating_conditions : slow_125_1.08 ;
      lu_table_template("load") {
        variable_1 : input_net_transition;
        variable_2 : total_output_net_capacitance;
        index_1( "1, 2, 3, 4" );
        index_2( "1, 2, 3, 4" );
      }
      cell("INV") {
        pin(A) {
          max_transition : 1.500000;
          direction : input;
          rise_capacitance : 0.0739000;
          fall_capacitance : 0.0703340;
          capacitance : 0.07278646;
        }
        pin(Z) {
          direction : output;
          function : "!A";
          max_transition : 1.500000;
          max_capacitance : 5.1139;
          timing() {
            related_pin : "A";
            cell_rise(load) {
              index_1( "0.0375, 0.2329, 0.6904, 1.5008" );
              index_2( "0.0010, 0.9788, 2.2820, 5.1139" );
              values ( \
                "0.013211, 0.071051, 0.297500, 0.642340", \
                "0.028657, 0.110849, 0.362620, 0.707070", \
                "0.053289, 0.165930, 0.496550, 0.860400", \
                "0.091041, 0.234440, 0.661840, 1.091700" );
            }
            cell_fall(load) {
              index_1( "0.0326, 0.1614, 0.5432, 1.5017" );
              index_2( "0.0010, 0.4249, 3.6538, 8.1881" );
              values ( \
                "0.009472, 0.072284, 0.317370, 0.688390", \
                "0.009992, 0.095862, 0.360530, 0.731610", \
                "0.009994, 0.126620, 0.477260, 0.867670", \
                "0.009996, 0.144150, 0.644140, 1.127700" );
            }
            fall_transition(load) {
              index_1( "0.0326, 0.1614, 0.4192, 1.5017" );
              index_2( "0.0010, 0.4249, 2.1491, 8.1881" );
              values ( \
                "0.011974, 0.071668, 0.317800, 1.189560", \
                "0.033212, 0.101182, 0.328540, 1.189562", \
                "0.059282, 0.155052, 0.389900, 1.202360", \
                "0.162830, 0.317380, 0.628160, 1.441260" );
            }
            rise_transition(load) {
              index_1( "0.0375, 0.1650, 0.5455, 1.5078" );
              index_2( "0.0010, 0.4449, 1.7753, 5.1139" );
              values ( \
                "0.016690, 0.115702, 0.418200, 1.189060", \
                "0.038256, 0.139336, 0.422960, 1.189081", \
                "0.076248, 0.213280, 0.491820, 1.203700", \
                "0.170992, 0.353120, 0.694740, 1.384760" );
            }
          }
        }
      }
    }

Delay Calculation

  Cell Fall
    Cap\Tr   0.05   0.2    0.5
    0.01     0.02   0.16   0.30
    0.5      0.04   0.32   0.60
    2.0      0.08   0.64   1.20

  Cell Rise
    Cap\Tr   0.05   0.2    0.5
    0.01     0.03   0.18   0.33
    0.5      0.06   0.36   0.66
    2.0      0.09   0.72   1.32

  Fall Transition
    Cap\Tr   0.05   0.2    0.5
    0.01     0.01   0.09   0.15
    0.5      0.03   0.27   0.45
    2.0      0.06   0.54   0.90

[Figure: cell annotated with 0.1ns, 0.12ns, 1.0pf, and 0.147ns]
- Fall delay = 0.178ns
- Rise delay = 0.261ns
- Fall transition = 0.147ns
- Rise transition = …

PVT (Process, Voltage, Temperature) Derating
- Actual cell delay = Original delay x KPVT

PVT Derating: Example + Min/Typ/Max Triples
- Proc_var (0.5:1.0:1.3)
- Voltage (5.5:5.0:4.5)
- Temperature (0:20:50)
- KP = 0.80 : 1.00 : 1.30
- KV = 0.93 : 1.00 : 1.08
- KT = 0.80 : 1.07 : 1.35
- KPVT = 0.60 : 1.07 : 1.90
- Cell delay = 0.261ns
- Derated delay = 0.157 : 0.279 : 0.496 {min : typical : max}

Conservatism of Gate Delay Modeling
- True gate delay depends on input arrival time patterns
  - STA will assume that only 1 input is switching
  - Will use worst slope among several inputs
[Figure: gate with inputs A, B, D, output F, and load CL; waveforms vs. time showing tpd from A to F when A switches alone and when A and B both switch]
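
Worked check of the Delay Calculation and PVT derating numbers

The table lookup and derating steps on the Delay Calculation and PVT Derating slides can be checked numerically. The following is a minimal sketch, assuming bilinear interpolation between the four surrounding table entries, with rows indexed by load capacitance (pF) and columns by input transition (ns). The function name interp_table is hypothetical, and the pairing of the figure annotations with the tables (0.1 ns for the fall tables, 0.12 ns for the rise table, 1.0 pF load) is an assumption, adopted because it reproduces the results quoted on the slide. The derating step multiplies the rounded rise delay by the rounded KPVT values, which is also an assumption about how the slide's figures were obtained.

```python
from bisect import bisect_right

def interp_table(tr_axis, cap_axis, values, tr, cap):
    """Bilinear interpolation in a lookup table (hypothetical helper).

    values[i][j] is the entry for cap_axis[i] (row) and tr_axis[j] (column).
    """
    def bracket(axis, x):
        # Index of the lower grid point, clamped so axis[k + 1] exists.
        k = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        frac = (x - axis[k]) / (axis[k + 1] - axis[k])
        return k, frac

    i, u = bracket(cap_axis, cap)   # position along the capacitance axis
    j, t = bracket(tr_axis, tr)     # position along the transition axis
    lo = values[i][j] + t * (values[i][j + 1] - values[i][j])
    hi = values[i + 1][j] + t * (values[i + 1][j + 1] - values[i + 1][j])
    return lo + u * (hi - lo)

# Table axes and entries as printed on the Delay Calculation slide.
TR = [0.05, 0.2, 0.5]     # input transition (ns)
CAP = [0.01, 0.5, 2.0]    # load capacitance (pF)
CELL_FALL = [[0.02, 0.16, 0.30], [0.04, 0.32, 0.60], [0.08, 0.64, 1.20]]
CELL_RISE = [[0.03, 0.18, 0.33], [0.06, 0.36, 0.66], [0.09, 0.72, 1.32]]
FALL_TRANSITION = [[0.01, 0.09, 0.15], [0.03, 0.27, 0.45], [0.06, 0.54, 0.90]]

# Assumed operating point: 0.1 ns / 0.12 ns input transitions, 1.0 pF load.
fall_delay = interp_table(TR, CAP, CELL_FALL, tr=0.10, cap=1.0)
rise_delay = interp_table(TR, CAP, CELL_RISE, tr=0.12, cap=1.0)
fall_transition = interp_table(TR, CAP, FALL_TRANSITION, tr=0.10, cap=1.0)

# Derating factors as {min, typical, max} triples from the PVT slide.
K_P = (0.80, 1.00, 1.30)
K_V = (0.93, 1.00, 1.08)
K_T = (0.80, 1.07, 1.35)
k_pvt = tuple(round(p * v * t, 2) for p, v, t in zip(K_P, K_V, K_T))

# Assumption: the slide multiplies the rounded delay by the rounded KPVT.
derated = tuple(round(round(rise_delay, 3) * k, 3) for k in k_pvt)

print(f"fall delay      = {fall_delay:.3f} ns")       # 0.178
print(f"rise delay      = {rise_delay:.3f} ns")       # 0.261
print(f"fall transition = {fall_transition:.3f} ns")  # 0.147
print("KPVT            =", k_pvt)
print("derated delay   =", derated)
```

Under these assumptions the sketch gives 0.178 ns, 0.261 ns, and 0.147 ns for the three lookups, KPVT = 0.60 : 1.07 : 1.90, and a derated rise delay of 0.157 : 0.279 : 0.496 ns, matching the slides. Multiplying the unrounded values instead gives a slightly smaller minimum (about 0.156 ns), so the rounding order matters at the third decimal place.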