L22-Clock Tree Synthe..

L22: Clock Issues in Deep Submicron Design
1999. 10
Jun Dong Cho
Sungkyunkwan Univ., Dept. of ECE
Mail: Jdcho@skku.ac.kr
Homepage: vada.skku.ac.kr

Agenda
- The Issues of Clock Tree Synthesis
- The Basic Consideration of Clock Tree Synthesis
- Traditional Clock Routing Algorithm
- Recent Approaches in Clock Tree Synthesis

The Issues of Clock Tree Synthesis
- Global Distribution of Clocks and Power
- Consideration of Timing/Density/Clock/Power
- Clock Management Scheme
- Clock Power Reduction
- Clocking Techniques in Deep Submicron Design
- Multiple Clock Design
- Need of PLL

Low-Power Clock Distribution
- The clock is a major source of dynamic power dissipation in synchronous systems.
- Characteristics: $P = \alpha f C V_{dd}^2$, where the clock's switching activity is $\alpha = 1$.
  - The clock is loaded by every latch and flip-flop.
  - The clock is distributed throughout the chip.
- Clock power = power consumed in the clocking system = power consumed in the clock network + power consumed in the blocks directly connected to the clock.

Low-Power Clock Distribution
- Clock driving schemes
  - Single driver scheme: clock drivers are at the clock source, with no buffers elsewhere.
  - Distributed buffers scheme: intermediate buffers are distributed in the clock tree.
[Figure: single driver scheme vs. distributed buffers scheme]

Low-Power Clock Distribution
Two ways of reducing the delay of a long clock line:
- Widening the line makes it a capacitive line. This requires larger buffers, and the short-circuit current increases.
- Inserting intermediate buffers partitions the line into short segments, with little penalty on short-circuit current.

Global Distribution of Clocks and Power (1)
When ASICs are built on a deep submicron process with tens of millions of gates and a clock frequency below 1 GHz, the designer must consider many details in the clock and power circuitry.
Normally these circuit elements are not given much thought; power is drawn from the rails drawn across the top and bottom of the page, and the clock has ideal characteristics: a square wave running at the specified frequency. In reality, many other effects need to be evaluated in the clocking and power distribution areas of the design when the total chip power consumption will be in the range of tens to over 100 W and the clock power can be as much as half of the total power consumption.

The clocking scheme cannot be assumed to be a clean, uniform signal network. It might be a complicated distribution structure, with architectures ranging from a large distributed clock buffer for high-performance chips to a complex system with multiple derived sub-clocks to help manage power consumption.

Global Distribution of Clocks and Power (2)
The interaction between clocks and power consumption may require the ability to generate clock signals that can be stopped in the inactive sections to minimize power consumption. The power and ground system will take up about half of the available package pins to handle the tens of amperes of average current consumed by the IC.

Many side effects of the basic IC process will have to be addressed to make the chip meet all the requirements of speed, power, and silicon area. If the supply voltage is reduced to take advantage of the power savings available at a lower supply voltage, noise margins and leakage currents may become significant problems. The various secondary effects within the system, like voltage drops on the supply lines, ground bounce, crosstalk, and glitches, may exacerbate the problems by adding enough noise to the system to decrease the clock slew rates and the clock rise and fall times.
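As a rough illustration of how frequency and supply voltage enter the clock power budget, the following Python sketch evaluates the standard dynamic power relation P = α f C V_dd² for a clock net (α = 1). The function name `dynamic_power`, the 2 nF total clock load, and the 2.5 V and 1.8 V supplies are hypothetical values chosen for illustration, not figures from the lecture.

```python
def dynamic_power(alpha, freq_hz, cap_farads, vdd_volts):
    """Dynamic switching power in watts: P = alpha * f * C * Vdd^2."""
    return alpha * freq_hz * cap_farads * vdd_volts ** 2


# Hypothetical total capacitance seen by the clock network (assumption).
CLOCK_CAP_F = 2e-9

# The clock toggles every cycle, so its switching activity alpha is 1.
full_rate = dynamic_power(1.0, 250e6, CLOCK_CAP_F, 2.5)

# A derived half-rate sub-clock driving the same load (assumption).
half_rate = dynamic_power(1.0, 125e6, CLOCK_CAP_F, 2.5)

# The same clock at a reduced, hypothetical 1.8 V supply.
low_vdd = dynamic_power(1.0, 250e6, CLOCK_CAP_F, 1.8)

print(f"250 MHz @ 2.5 V: {full_rate:.3f} W")
print(f"125 MHz @ 2.5 V: {half_rate:.3f} W")
print(f"250 MHz @ 1.8 V: {low_vdd:.3f} W")
```

Power scales linearly with frequency and quadratically with supply voltage, which is why both derived sub-clocks and a lower supply are attractive despite the noise-margin and leakage concerns noted above.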
Consideration of Timing/Density/Clock/Power (1)
This further exacerbates the power consumption problems by making the big clock buffers stay in the high-current linear portions of their transfer curves for a greater part of each clock cycle. In addition, the clock network has many signals switching simultaneously, adding large power surges and a very large potential for crosstalk and interference to the clock and power distribution sub-systems.

The most basic problems facing designers are managing the skew in the clock system (which entails getting the clock everywhere at the same time), supplying sufficient clock drive to operate all of the clocked elements in the system, and getting operating power to all active circuit elements. For single-frequency clock systems, the tradeoffs are speed, power consumption, and area. The problems with skew and the process of balancing the delays across the chip occur in parallel with the increases in density and complexity.

Consideration of Timing/Density/Clock/Power (2)
Tom Katsioulas, marketing director of the IC design group of Cadence Design Systems (San Jose, CA), notes that timing, density, clock, and power are intricately related in the following ways:
- Downsizing cells to reduce power may degrade timing.
- Upsizing cells to improve timing degrades power and area.
- Timing-driven placement may increase wire length and power.
- Clock generation before placement yields high skew.
- Clock creation without wire delays affects power and delay.
- Clock creation after placement adds area and affects density.
- Large load count yields high clock delay and affects timing.
- Post-placement ECOs may require clock redesign.

Clock Management Scheme (1)
Some of the clocking problems of complex, high-speed circuits are associated with the physics of the devices and interconnections. At 250 MHz, the clock period is only 4 ns.
The time remaining after accounting for clock skew and setup and hold times leaves very little for buffer and propagation delays. The large clock buffers lead to high power consumption, often as large as 30 to 50 percent of the total chip power consumption, as well as noise problems due to the large current spikes generated when the buffers switch. An alternative approach is to distribute the buffers into the clock tree. This reduces the power consumption by requiring buffers of smaller size and also helps reliability by reducing the size of the current spike.

Clock Management Scheme (2)
The clock system and the wide-word datapaths all switching at the same time increase the possibility of glitches and higher peak switching currents. They put additional loads on the power delivery systems. The resulting datapath skews will require close scrutiny of the datapath localization and grouping, as well as careful analysis of pipeline lengths. Careful analysis of the signal paths relative to the clocks is critical to making a working integrated circuit.

"In synthesized circuits," says Charlie Xiaoli Huang, a senior architect at Epic Design Technology (Santa Clara, CA), "the software tries to make all paths the same length. This makes all data paths complete at the same time, which generates glitches and power surges at the end of each clock cycle. This effect gets worse at higher speeds."

Clock Management Scheme (3)
Clocking schemes and power distribution are going to be affected by the system requirements. The areas for compromise are power, area, and performance. If one of the areas is defined as much more critical than the others, it will drive the design. For example, if performance is the key parameter, a single-point clock with sufficient buffers to drive all the circuitry would be the best choice. The tradeoff would be a clock system that draws up to half of the total system current.
An intermediate solution might be a multiply-driven clock spine. If not all of the circuitry needs to run at the same speed, multiple derived clocks could be generated from the master reference clock, so that each section gets a clock appropriate for its function. Why have a 250 MHz clock for a serial I/O channel controller? This could save more power, since the frequency term in the power equation is reduced for much of the on-chip circuitry.

The designer can also gate the clock signals to unused sections of the chip, with the understanding that the gate delay will exacerbate the clock skew and clock edge uncertainty for those sections. Gating keeps the clocks from toggling the inputs of sections with no data changes. If the gate is used in place of a buffer in the clock tree section, the clock tree does not require an additional level of buffers to match the delays due to the extra gate levels.

Clock Power Reduction
If power consumption and/or management is the most important concern, then the complicated scheme described in the introduction should be considered. This could be multiple clocks with multiple frequencies, so that only those circuits requiring extremely high performance get the highest-speed clocks. Other areas would have lower-speed clocks, gated clocks, and power-down circuitry to minimize the capacitive charging currents. Analyzing the intricacies of multiple clock interactions requires more detail and different techniques than are available in the standard ASIC flow.

If power consumption is minimized in the design through whatever techniques are available, it ameliorates the power distribution problems. The use of the "unused" gates as local decoupling capacitors mitigates the package isolation problems and minimizes the local IR drops. This additional on-chip capacitance reduces the effects of the synchronous power surges and the associated noise on the power and ground lines.
The additional metal to the distributed local decoupling devices helps to reduce total supply and ground resistance, which reduces the potential for electromigration and improves overall manufacturability.

Clocking Techniques in Deep Submicron Design (1)
Physical implementation of a clock network requires novel approaches to balance the tradeoffs between minimization of skew, small latency, and power usage. One innovative approach is a clock network driven from multiple clock driver pads, also known as a multiply-driven clock spine network. Its benefit is that it can reduce both skew and latency.

One reason this technique produces low skew is that the clock signal is driven from multiple points on the chip, thereby reducing the effective distance between drivers and clock signal receivers (otherwise known as flip-flops). Additionally, the clock signal arrival time difference between the first flip-flop and the last flip-flop is much smaller, minimizing the skew. In multiply-driven clock networks, latency is reduced because fewer layers of buffer trees are needed to drive the clock net from multiple ends.

Clocking Techniques in Deep Submicron Design (2)
Physical design manager Herman Lam of Fujitsu (San Jose, CA) says that they are encouraging place and route of the clock system first, then the rest of the signals. For high-performance functions, a large clock buffer driving a minimum-size clock tree is the best way to accomplish the clocking. They place virtual flip-flops at the ends of the clock lines as loads, then let the software move the virtual flip-flops to optimal locations based on the actual logic use. When people route the logic interconnections first and then try to balance the clock trees for matched delays, the resulting circuit has a much larger clock tree, and its associated parasitics increase power consumption.

Clock networks for deep submicron designs are typically inserted during physical layout.
This may be done with a clock tree place and route tool, or the network may be inserted manually in the physical layout of the design. After place and route of the design, the RC values for the clock network are extracted and measured.

Multiple Clock Design
Multiply-driven clock spine network delays are very difficult to model because analytical RC algorithms only work for a net with a single driver. Circuit (Spice) simulation has been used as an alternative to analyze multiply-driven clock nets, but the Spice results must be manually analyzed and back-annotated to timing analysis tools. One alternative is a manual solution that breaks the multiply-driven net into multiple subnets and extracts the subnet segments for RC analysis. This method totally breaks down when more than a few drivers drive a single clock net.

For accurate skew and latency analysis, special EDA tools are needed to model multiply-driven clock networks automatically, and the extracted data needs to be back-annotated to timing analysis tools. Multiply-driven clock networks can be designed with very small skew and latency, but special tools beyond RC extraction and analysis are required to ensure that such networks meet the requirements of high-performance deep submicron designs.

Need of PLL
A phase-locked loop (PLL) is useful to resynchronize clocks and to generate multiples of the base system clock. The PLL can develop a clock with zero or even negative effective skew by adjusting the phase comparator response. One caveat is that one must monitor the phase jitter and noise associated with the PLL and clock regeneration circuitry. The jitter and synchronization can create repeatable phase relationships within the clock network for continuous signals. However, PLLs consume a lot of power, making them less attractive for low-power applications.
According to John Harrington, manager of ASIC products at AT&T Microelectronics (Reading, PA), "PLLs are useful for clock doublers and triplers [and other multiples]. This can help by reducing external clock frequencies and allow lower cost crystals which can normally go up to 40 MHz. Three-fourths of their designs have a PLL to synchronize and or align clock edges. The designer needs to be careful of PLL latency and lock times for those situations where the clock is not continuous."

Jim Smith, ASIC product manager at Hitachi America (Brisbane, CA), agrees, noting, "We try to add PLLs to compensate and resync the clocks where possible. For multiple clocks, the problem is the latency and lock times for the clocks as well as the added jitter errors. The jitter errors add to the total clock skew."

The Basic Consideration of Clock Tree Synthesis
- Basic Feature of Clock Skew
- Consideration of Synchronous Design
- Design Style Specific Problems

Basic Feature of Clock Skew
Circuit operation speed is increasingly limited by clock skew, which is the maximum difference in arrival times of the clocking signal at the logic gates. The figure shows the definition of clock skew. The limit is seen from the inequalities below, which govern the clock period of a clock signal net:

$T_{GATE(min)} + T_{RC(min)} - T_{HOLD(max)} > T_{SKEW}$
$T_{GATE(max)} + T_{RC(max)} + T_{SETUP(max)} + T_{SKEW} < t$

where:
- $T_{GATE}$ = signal propagation delay from clock input to data output of a logic gate
- $T_{RC}$ = signal propagation delay because of metal-interconnect RC effects for a logic gate
- $T_{HOLD}$ = data-valid hold time requirement for a logic gate
- $T_{SETUP}$ = data-valid setup time requirement for a logic gate
- $T_{SKEW}$ = maximum amount of skew between clock signals
- $t$ = time for one period of the clock

The clock period, $t$, in most VLSI ASIC designs is getting shorter, and the tolerance of $T_{HOLD}$ and $T_{SETUP}$ is getting smaller. In submicron and deep submicron technologies, the effect of $T_{RC}$ becomes important.
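The two skew inequalities above can be checked numerically. The sketch below is a minimal Python illustration, not part of the lecture: the names `PathTiming`, `hold_ok`, `setup_ok`, and `max_tolerable_skew` are hypothetical, and all delay values are made-up numbers in picoseconds. Only the 4000 ps period corresponds to the 250 MHz example mentioned earlier.

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    """Delay figures for one register-to-register path, in picoseconds."""
    t_gate_min: int
    t_gate_max: int
    t_rc_min: int
    t_rc_max: int
    t_setup_max: int
    t_hold_max: int


def hold_ok(p, t_skew):
    # TGATE(min) + TRC(min) - THOLD(max) > TSKEW
    return p.t_gate_min + p.t_rc_min - p.t_hold_max > t_skew


def setup_ok(p, t_skew, period):
    # TGATE(max) + TRC(max) + TSETUP(max) + TSKEW < t
    return p.t_gate_max + p.t_rc_max + p.t_setup_max + t_skew < period


def max_tolerable_skew(p, period):
    """Upper bound on skew implied by both inequalities (a strict bound)."""
    hold_bound = p.t_gate_min + p.t_rc_min - p.t_hold_max
    setup_bound = period - (p.t_gate_max + p.t_rc_max + p.t_setup_max)
    return min(hold_bound, setup_bound)


# Hypothetical path; 4000 ps is the period of a 250 MHz clock.
path = PathTiming(t_gate_min=400, t_gate_max=2000,
                  t_rc_min=100, t_rc_max=800,
                  t_setup_max=300, t_hold_max=200)
PERIOD_PS = 4000

for skew in (200, 400, 1000):
    print(skew, hold_ok(path, skew), setup_ok(path, skew, PERIOD_PS))
print("skew must stay below", max_tolerable_skew(path, PERIOD_PS), "ps")
```

With these assumed numbers the hold inequality is the tighter one, which shows why fast paths with small interconnect delay can constrain skew more than the clock period does.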
The goal of balanced clock tree distribution is to make the clock skew, $T_{SKEW}$, as small as possible.

Consideration of Synchronous Design
Assuming that signals A and B arrive at two identical D flip-flops simultaneously, and that the clock signal reaches the D flip-flops within $t$ seconds, this circuit will produce the correct output, Y3, if it is built on non-submicron technology. This is because in non-submicron technology the main delay and cause of skew is the propagation delay of the logic gates. The figure illustrates that unequal wire lengths from the clock source to the D flip-flops do not contribute much unbalanced wire delay in non-submicron technology. The wire delay can be neglected compared to the logic gate delay.

However, in submicron and deep submicron technologies, logic gate delay is no longer the sole cause of delay. The wire load delay also contributes a large proportion of the delay, and the wire distance between logic gates can cause substantial delay. Since the distance from the clock source to the clock input of D flip-flop D1 is longer than the distance from the clock source to the clock input of D flip-flop D2, clock skew will occur. Y3 may generate incorrect results due to the clock skew.

Design Style Specific Problems
Full Custom: The clock routing problem in full-custom style depends on the availability of a routing layer for clocks. If a dedicated layer that is free of obstacles is available for routing, the clock routing problem in full-custom design is exactly the same as the CRP (Clock Routing Problem): minimizing total delay and minimizing skew between buffers. If obstacles are present, we refer to the problem as the BBCRP (Building Block Clock Routing Problem): minimizing both total delay and skew under the constraint that wires do not intersect any rectangles.

Standard Cell: The clock routing problem in standard cell designs is somewhat easier than full-custom in some aspects.
In standard cell designs, clock lines have to be routed in channels and feedthroughs. Conventional methods do not work in standard cell design since terminals are neither uniformly distributed (as in full-custom) nor symmetric in nature (as in gate array).

Gate Array: Gate arrays are symmetrically arranged in a plane and allow the clock to be routed in a symmetric manner as well. The algorithms for clock routing in such symmetric structures have been well studied and well analyzed.

Balanced Clock Tree
- A balanced clock tree distribution is the fundamental requirement for synchronous systems. It can minimize the clock skew and ensure that the clock signals arriving at any logic gates are within the clock skew specification. A typical balanced clock tree is like a binary tree where all children nodes at the same level have the same distance from the root (parent) node.
- If the time for passing signals down a level is identical for all children nodes, then all children nodes will receive the signal from the root (parent) node at the same instant.

Load Balancing
The load balancing method balances the clock delay according to the number of clocked components on each branch. If one node drives more clocked components than the other side, the method shortens the clock feed line to the side with more clocked components and assigns a longer clock feed line to the side with fewer.

Load Balancing using Elmore Delay

Load Balanced Clustering

Balanced Tree + Mesh

Single vs. Distributed

Clock Tree Distribution Algorithms
An optimal balanced clock tree distribution is to connect all logic gates directly to the clock source. Assuming that there is no buffer between any logic gate and the clock source, and that the wire width is constant, the furthest logic gate will experience the largest delay.
The delay time can be equalized for all logic gates by adding logic gate delay and interconnect delay to the faster signal paths. Then all signal paths will experience the same delay. This approach not only has near-zero clock skew, but also has the fastest speed. However, it is not feasible because the drive strength of the clock source is limited, and there is not enough room to route wires around the clock source.

Logic gates are usually placed by a cell placement program at an early stage of layout. The positions of the buffers and the clock source, however, are determined by the clock tree distribution algorithm. Two general clock tree distribution algorithms are discussed here. It should be noted that a few major assumptions are made for the following discussion: the wire resistance and wire capacitance have a linear relationship with the clock signal delay; all buffers are identical and contribute the same delay.

Comparison
Other clock tree distribution algorithms have been proposed, such as the buffer distribution algorithm [3], the general zero-skew clock net [4], and process-variation-tolerant zero-skew clock routing [5]. Each algorithm has its own distinct characteristics. It is difficult, if not impossible, to determine which algorithm is the best. If logic gates are evenly distributed, the clock trees generated by these algorithms may look similar. If the placement pattern of the logic gates is unique, clock trees built by different algorithms may differ noticeably in clock skew, clock signal speed, wire length, and design flexibility.
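The slides name Elmore delay as a basis for load balancing but the extracted text gives no formula, so the following Python sketch is only one common way to apply it, under stated assumptions: an unbuffered RC tree, wire resistance and capacitance proportional to length, and each wire segment modeled with half of its capacitance at the far end. The names `Node`, `subtree_cap`, `sink_delays`, and `skew`, and the per-micron constants, are hypothetical.

```python
from dataclasses import dataclass, field

R_PER_UM = 0.1      # ohms per micron of wire (assumed value)
C_PER_UM = 0.2e-15  # farads per micron of wire (assumed value)


@dataclass
class Node:
    name: str
    length_um: float = 0.0  # length of the wire from the parent node
    load_f: float = 0.0     # sink (flip-flop) capacitance at this node
    children: list = field(default_factory=list)


def subtree_cap(node):
    """Total capacitance at and below a node: loads plus child wires."""
    return node.load_f + sum(
        C_PER_UM * ch.length_um + subtree_cap(ch) for ch in node.children
    )


def sink_delays(node, arrival=0.0, out=None):
    """Elmore delay from the root to every leaf, returned as {name: seconds}."""
    if out is None:
        out = {}
    if not node.children:
        out[node.name] = arrival
    for ch in node.children:
        r_wire = R_PER_UM * ch.length_um
        c_wire = C_PER_UM * ch.length_um
        # Wire resistance charges half its own capacitance plus all
        # capacitance downstream of the segment.
        sink_delays(ch, arrival + r_wire * (c_wire / 2 + subtree_cap(ch)), out)
    return out


def skew(root):
    delays = sink_delays(root).values()
    return max(delays) - min(delays)


def two_sink_tree(len_a, load_a, len_b, load_b):
    return Node("src", children=[Node("a", len_a, load_a),
                                 Node("b", len_b, load_b)])


balanced = two_sink_tree(1000, 10e-15, 1000, 10e-15)
unbalanced = two_sink_tree(1000, 40e-15, 1000, 10e-15)
# Load balancing: shorter feed line to the heavier side, longer to the lighter.
rebalanced = two_sink_tree(950, 40e-15, 1050, 10e-15)

for label, tree in (("balanced", balanced), ("unbalanced", unbalanced),
                    ("rebalanced", rebalanced)):
    print(f"{label}: skew = {skew(tree) * 1e12:.3f} ps")
```

In this toy case, shortening the feed line to the heavily loaded sink and lengthening the other reduces the skew relative to the equal-length routing, which is the behavior the Load Balancing slide describes. Buffers are deliberately left out, so the sketch says nothing about the buffered algorithms cited above.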
