Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 Minimum Wirelength Zero Skew Clock Routing Trees with Buffer Insertion John Thompson, Kurt Ting, and Simon Wong Department of Electrical and Computer Engineering University of Wisconsin - Madison {jdthompson, kting, wangwong}@wisc.edu ABSTRACT signals are driven with a temporal reference to the on-chip clock signals, these clock signals must be particularly clean and Zero skew clock routing is an issue of increasing importance in the realm of VLSI design. As a result of the increasing speeds of on-chip clocks, zero skew clock tree construction has become critical for the correct operation of high performance VLSI circuits. In addition, in an effort to both reduce power consumption and the deformation of clock signals at synchronizing elements on a chip, a minimum wirelength characteristic of clock tree networks is highly desirable. In an effort to provide a solution to the current issues dealing with zero skew clock tree construction, we present an efficient two-phase algorithm based on the Elmore delay model, which successfully constructs zero skew clock routing trees with buffer insertion and minimum wirelength. The results of an implementation of this algorithm have been verified to display zero skew characteristics in conformance with the Elmore delay model equations. The first phase of the algorithm is a bottom-up delayed merge embedding (DME) with buffer insertion procedure which enumerates all of the possible zero skew clock trees for consideration in the second phase. In the second phase, a top-down procedure of merged embedding is performed with the objective of minimizing wirelength. sharp. In addition, as technology scales down in feature size, clock signals in particular become affected by increased resistance due to their long interconnect lengths and decreasing line dimensions. The point is illustrated by the following equation for a wire’s resistance. 1. INTRODUCTION In a synchronous sequential circuit, the clock signal is used to define a time reference for the movement of data from one storage element to another, through the circuit. Due to the extreme importance of the signal in synchronous circuits, much attention has been paid to the characteristics of these clock signals and the networks used to route them on-chip. While they are often regarded in the same light as any other control signals, further inspection makes it obvious that special consideration must be given when dealing with clock signals. These signals can typically be characterized as having the largest fan out, longest routing distances, and the highest operational speeds of any onchip signal, control or data. Furthermore, since data W02-1 R l / A (1) where R is the resistance of the wire, ρ is the resistivity of the wire in Ohms-meter, l is the length of the wire, and A is the cross sectional area of the wire is meters2. From the above equation, (1), it can be seen that in general, as the feature size of VLSI chips decreases by a factor of f in a single dimension, the resistance of all wires increases by a factor of f. This is because the length of the wire (l) decreases by a factor f, while the cross sectional area of the wire (A) also decreases, but by a factor of f2. This increased line resistance is one of the primary reasons for the growing importance of clock distribution on synchronous performance. Finally, the control of any differences in the delay of the clock signals can severely limit the maximum performance of the entire system as well as create catastrophic race conditions which may cause incorrect data values to be latched into the circuits registers. In addition to the trend of decreasing feature sizes in VLSI circuits, a second trend of heightening clock speeds is also prevalent. Currently, clock speeds achievable in synchronous VLSI design are mainly limited by two factors, i) the longest delay path through any block of combinational logic and ii) clock skew. While the first contributing factor, that of the maximum delay through combinational logic can only be solved by further design considerations, the notion of clock skew can largely be dealt with using clock routing algorithmic techniques. Clock skew can be defined as the maximum difference in arrival times of a clocking signal at the synchronizing elements which it triggers in a design. Its limiting characteristics on clock cycle period can be observed from the following well-known inequality that governs the clock period of a clock signal net. Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 clock period td tskew tsu tds (2) where td is the delay on the longest path through combinational logic, tskew is the maximum clock skew, tsu is the set-up time of the synchronizing elements (assuming they are edge triggered), and tds is the propagation delay within the synchronizing elements. Furthermore, the term td can further be broken down into two disjoint components according to the following equation. td td interconnect td gates (3) Figure 1. H-tree over four points where td-interconnect represents that portion of the delay through the longest path of combinational logic (td) that can be attributed to interconnect and td-gates represents that portion of the delay which can be attributed to the actual delay through the gates. Increased switching speeds in VLSI circuits due to decreasing feature sizes will decrease the td-gates term but not the td-interconnect term. From the above analysis, it can be inferred that the two dominant terms determining the minimum clock period (maximum clock speed) are those associated with clock skew, tskew, and wire interconnect through the longest block of combinational logic, td-interrconnect. In a paper authored by Bakoglu [1], it was noted that clock skew may account for over 10% of the system’s cycle time in high-performance VLSI circuits. Working to lessen the effect of clock skew, a vast amount of research has been devoted solely to reducing skew in clock routing networks. 2. RELATED WORK In the past few decades, much research has been devoted to the issue of minimizing clock skew to zero. Several algorithms have been proposed which all achieve zero clock skew at the cost of varying complexity, wire lengths, and other measures. The simplest approach was first proposed by Dhar, Franklin, and Wang [2]. Their proposed H-tree algorithm is based on the construction of a completely balanced clock tree as can be seen in Figure 1. While the algorithm proposed by Dhar et al. will result in zero skew clock tree construction with a minimum amount of computational complexity, the approach results in unacceptably large lengths of wire that are required to route the resulting clock network. Therefore, this approach is no longer considered for clock tree construction. More recently, several others have proposed much more complex algorithms which all succeed in constructing clock trees with a characteristic zero skew and varying costs in terms of complexity and resulting wirelengths. Examples of these algorithms are the Method of Means and Medians (MMM) algorithm, which was proposed by Jackson et al. [3], and the Geometric Matching Algorithm, which was proposed by Cong et al. [4]. In addition to development of new algorithms that achieve overall zero skew clock tree construction, other recent research has aimed at changing the most prominent view that the entire clock network should have a characteristic zero skew. The most notable research fitting into this category was conducted by Chen et al [5]. They researched and formulated the notion of associative skew. In a paper published in 1999, they proposed that it is not necessary that all synchronizing elements receive clock signals with relative zero skew. Instead, what is important is that closely related synchronizing elements receive clock signals with relative zero skew. For example, flip flops in a shift register, which are directly connected, should all receive identical clock signals; that is, clock signals with the same period and zero skew. On the other hand, synchronizing elements which are disjoint from one another in a circuit’s s-graph may not need to receive relative zero skew clock signals. Illustrations of these two scenarios are provided in Figures 2(a, b). In their paper, they also discuss strategies and algorithms to determine which synchronizing elements should be clustered together to have relative zero skew, and provide a version of DME which will construct appropriate clock routing trees. D Q CLK Figure 2(a). 4bit shift register (relative zero skew should be enforced for all of the synchronizing elements) W02-2 Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 S0 tED(u, w) S1 Figure 2(b): An unconnected s-graph (synchronizing element s2 may have clock skew relative to elements s0 and s1) 4. PROBLEM FORMULATION As was stated previously, the problem considered in this paper is that of constructing a zero skew clock tree with buffer insertion and minimum wirelength. Stated more formally, the problem can be formulated as follows: Given a set of clock sinks, {S}, construct a clock network which stems from a single root node, R, and terminates on all sinks, si {S}. In the construction of this network, all zero skew equations according to the Elmore delay model must be observed and satisfied. In addition, buffers must be inserted where needed, so as to limit the capacitance seen at any node to some upper threshold. Finally, wirelength should be minimized. The Elmore delay model, which is used in order to assure the zero skew characteristics is based on the first-order moment of the impulse response, and is developed as follows. Let α and β respectively denote the resistance and capacitance per unit length of interconnect wire, so that the resistance, rev, and capacitance, cev, of a given interconnect wire, ev, are given by α| rev| and β| cev|, respectively. For each sink si in the tree T(S), there is a loading capacitance cL, which is the input capacitance of the functional unit driven by si. We let Tv denote the subtree of T(S) rooted at v, and let cv denote the node capacitance of v. The tree capacitance of Tv is denoted by Cv and equals the sum of the capacitances in Tv. Cv is calculated using the following recursive formula: (v is a sink node si ) cL Cv cv (cew Cw) (v is an internal node) (4) w children( v) The Elmore delay can be calculated as follows: (5) Finally, it should be noted that Elmore delay is additive. That is, if v is a vertex on the u-w path, then tED(u, w) = tED(u, v) + tED(v, w), and in particular, if v is a child of u on the u-si path, then tED(u, si) = rev((1/2)cev + Cv) + tED(v, si). A sink node si may be treated as a trivial zero skew subtree with capacitance cL, and delay zero. S2 3. 1 rev cev Cv ev path(u , w) 2 DME WITH BUFFER INSERTION ALGORITHM We propose an algorithm for zero skew clock tree routing which both minimizes wirelength and inserts buffers at the appropriate locations simultaneously. We accomplish this task by combining the well-known Deferred Merge Embedding (DME) algorithm with a simple buffer insertion heuristic. The DME algorithm, which was first proposed by Chao et al. [6], is based on the idea that finalized interconnect embedding for a clock network should be postponed as long as possible in hopes that better embedding choices will be able to be made given a greater view of the problem at hand. With this in mind, the DME algorithm is performed in two phases, one of which determines all possible zero skew embedding tapping point locations for the clock network by working bottom-up, from the network sinks to the root tapping point. The other phase is a top-down merge embedding procedure which acts to choose optimal tapping point locations among those determined to provide zero skew in the first phase of the algorithm. More formal descriptions of the two phases implemented in our research are provided in the sections that follow. 4.1 Bottom-up Phase I As was stated previously, the purpose of the first phase of our two phase algorithm is to determine all of the possible zero skew tapping locations which will later most likely result in a minimum wirelength tree construction outcome. This phase of our algorithm was first described by Masato Edahiro [7] in 1993. His proposed bottom-up phase, which is what we have chosen to use, is described below. Let K be the current set of points and segments for consideration in an iteration of the algorithm; initially K will be the set of all clock sinking locations (K = {S}). On each iteration of the procedure, the nearest neighbor pair in the current set, K, is first found. Next, a merging segment is constructed between the two nearest neighbors. The process involved for merging segment construction was first described by Chao et al. [5]. In their work, two general cases of merging segment construction are W02-3 Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 discussed. These two cases correspond to two scenarios, the normal case and the case in which interconnect snaking is required. Before determining which merging segment scenario is the one present, the zero skew tapping length between the two selected Manhattan segments must be found. To do so, let TSa and TSb represent the two trees of merging segments to be merged. Let TSa and TSb, respectively, have capacitances C1 and C2 and delays t1 = tED(a) and t2 = tED(b). In addition, let pl(v) be a merging point with minimum merging cost. From the Elmore delay model equation given by tED(v, a) = rea((½)cea + C1), it can be seen that pl(v) satisfies the following equation. 1 1 rea Cea C1 t1 reb Ceb C 2 t 2 2 2 Figure 3(a). Merging Manhattan segments (Case #1) (6) Now, let the Manhattan distance between the two selected merging segments, d(ms(a), ms(b)), be equal to κ. Supposing that TSa and TSb can be merged with merging cost κ, that is |ea| = x and |eb| = κ – x for 0 x κ, then we have the resistances rea = x and reb = (κ – x) and the capacitances cea = x and ceb = (κ – x). After substituting into (6) and solving for x, the following is obtained. t 2 t1 C 2 x 1 2 C1 C 2 (7) From here, there are two cases, the first of which requires no interconnect snaking, and occurs when 0 x κ. In this case, |ea| = x and |eb| = κ – x. In this case, the merging Manhattan segment can be determined following a procedure shown in Figure 3(a). It can be seen from the figure that two Manhattan radii are drawn, one around each Manhattan segment or core, with characteristic radii as determined by the values ea and eb obtained previously. In this case, the new merged Manhattan segment is the overlap of the two Manhattan radii. In the other case, in which x 0 or x 1, the assumption of merging cost κ results in a negative edge length for either ea or eb. In this case, an extended distance κ’ κ is required to balance the delays of the two trees. If x 0, which means that t1 t2, then we choose pl(a) as the merging point and set |ea| = 0 and |eb| = κ’. Solving the following equation for κ’, Figure 3(b). Merging Manhattan segments (Case #2) 1 'C 2 t 2 2 t1 ' (8) one obtains the following. C 2 2 t t ' 1/ 2 1 1 2 C 2 (9) Similarly, if x κ, we set |eb| = 0 and C 2 2 t t | ea | 1/ 2 1 2 W02-4 1 C1 (10) Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 In the interconnect snaking case, only one Manhattan radius is drawn, for whichever of ea and eb is nonzero. In this case, the resulting merged Manhattan segment is the portion of the Manhattan core that is enclosed in the other Manhattan core’s Manhattan radius. This idea is illustrated in Figure 3(b). Now we present a formal description of our bottom-up phase of clock tree construction, which is the same algorithm originally used in Edahiro’s 1993 work [6], with modifications made to include a simple buffer insertion heuristic. Algorithm Find_Center([NS]) Input: Set of clock sinks {S} Output: Tree of merged Manhattan segments, TS K {S} while |K| != 1 (if |K| = 1, then the element in K is the segment for the center vc) Choose the nearest neighbor pair of Manhattan segments, v1 and v2, from K Calculate the segment for v from v1 and v2 using the zero skew merge Delete v1 and v2 from K Add v to K if v’s node capacitance > specified maximum node capacitance Insert a buffer at node v Set v’s capacitance to zero endif endwhile 4.2 Top-down Phase II At this point, the bottom-up phase has completed and a resulting tree of merged Manhattan segments has been obtained. The next step is to determine the exact final embeddings of internal nodes in the zero skew clock tree. This is accomplished in a top-down phase two of the algorithm. For a node v in topology G, i) if v is the root node, then select any point in ms(v), the root nodes Manhattan segment, to be pl(v), the new tapping or merging point; or ii) if v is an internal node other than the root, choose pl(v) to be any point in ms(v) that is at a distance |ev| or less from the placement of v’s parent p. Because the merging segment ms(p) was constructed such that d(ms(v), ms(p)) |ev|, there must exist some choice of pl(v) satisfying the previously mentioned condition. In case ii), the algorithm first creates a square tilted rectangular region (TRR), trrp, with a radius of |ev| and a core equal to {pl(p)}; then pl(v) can be any point from ms(v) trrp. This is illustrated in Figure 4. A more formal definition of this procedure, termed Find_Exact_Placement by Chao et al. [6], is provided below. In our implementation of this bottom-up phase, the nearest neighbor pair contained in a set of Manhattan segments was found by using the Delaunay triangulation graph method. This method will not be discussed any further in this paper, as it is out of the scope of the real research conducted; however, an excellent reference on the subject can be found in the book, Computational Geometry: An Introduction, by Preparata and Shamos [8]. Once the nearest neighbors have been found, the zero skew merge between the selected Manhattan segments is performed as was previously described. In addition, the consideration of a buffer insertion is performed at this point. This procedure of pairing and merging with buffer insertion is performed until only one node, the root node, remains. At this point, the top-down phase of our algorithm is performed. W02-5 Algorithm Find_Exact_Placement Input: Tree of merged Manhattan segments, TS, containing ms(v) and |ev| for each node v in G Output: Zero skew tree T(S) for each internal node v in G (top-down order) if v is the root Choose any pl(v) ms(v) else Let p be the parent node of v Construct trrp as follows: core(trrp) {pl(p)} radius(trrp) |ev| Choose any pl(v) ms(v) trrp endif Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 1 2 4 3 Figure 5. Side numbering convention for Manhattan radii Manhattan radius #1 side 1 2 3 4 Figure 4. Exact embedding illustration 5. IMPLEMENTATION DETAILS In the following subsections, some of the more noteworthy details of our implementation of the algorithm described in Section 4 are provided. Again, the content has been split to reflect the two phases of the algorithm. 5.1 Bottom-up Phase I There are three points to make about our implementation of the first phase of the algorithm. The first is to credit the use of the DCT Delaunay triangulation code, authored by Geoff Leach from the Department of Computer Science at RMIT. The second concerns the methods used to avoid mishaps due to rounding errors that are inherent in heavily computationally based computing. The primary concern of rounding errors in our implementation was that they would make overlapping Manhattan radii impossible to find consistently. Therefore, rather than looking for an exact overlap between Manhattan radii, an alternative stance was taken. It was first noted that although the truly overlapping segments would most likely not truly overlap in our representation, they would be very close to being truly overlapped. Therefore, it was decided that it would be best to determine which two sides of the two Manhattan radii should be overlapping, and then approximate the resulting merged Manhattan segment as best as possible. In order to determine the overlapping sides between two Manhattan radii, all four sides of each radius are projected onto the x-axis to find their xintercepts. Then the differences of all x-intercept combinations which result in valid overlapping pairs are computed. Table 1 provides the valid overlapping side pairs in reference to Figure 5, which gives the reference notation we used for numbering the sides of a Manhattan radius. Manhattan radius #2 side 3 4 1 2 Table 1. Legal overlapping Manhattan radii sides Once the four differences, in accordance to Table 1, are computed, the minimum of the four is found and the two corresponding sides are determined to be the truly overlapping sides. From here, a simple comparison of the four corresponding sides’ corner points, which searches for two points of inclusion, can be made to determine the newly merged Manhattan segment. The third point to be addressed concerns the handling of the Delaunay triangulation for line segments. Because the triangulation package used supports the nearest neighbor calculation for points only, each line segment was approximated by a series of points which lie on the corresponding line segment. 5.2 Top-down Phase II In the Top-down phase, the exact tapping point on each merging segment is located according to the algorithm steps presented in Section 4.2. The implementation of this algorithm is straight forward. However, there is one major concern about proper wiring in the case of a snaking situation. When there is required snaking between a parent and one of its child segments, the resulting wire length obtained from the Elmore delay model is larger than the Manhattan distance between them. Therefore, a plain Manhattan connection would violate the specific requirements on wire length that were determined in the first phase of our algorithm. To fulfill these requirements, detour wiring is implemented. W02-6 Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 The detour wire unit is a pre-defined wire pattern intended to provide the extra wire length required by the Elmore model. A sample is shown in Figure 6. The unit routes from tapping point one to point five, where the “source” is the parent’s tapping point, and the “target” is the tapping point on a child segment. Connections b and d constitute the actual Manhattan distance between the source and target, while connections a and c serve as the additional snaking wires. Detour wiring is accomplished by concatenating the detour units into the Manhattan path, as shown in Figure 6. Target 5 d c 4 3 b 1 a 2 Source Figure 7. Notation of a detour wiring unit on y-plane 2 b 3 a After connecting all of the detour units, the remaining endpoint is connected directly to the target. The tapping points in snaking case are accurately connected with required wire length. Figure 8 shows an example of a snaking case on tapping points with different coordinates. c Target 1 4 Source d 5 Target Figure 6. Notation of a detour wiring unit in the xplane Source In the above figure, a detouring example on the x-plane is demonstrated. By defining the number of detour units as n , the lengths of each connection l i are calculated by the following equations (11 – 14). Figure 8. Example of detour wiring 6. difference between x - coordinates of source and target; (11) E Required wire length; (12) lb ld / 2; n (13) la lc E / 2; n (14) RESULTS AND OBSERVATIONS In order to evaluate the performance of our algorithm, we implemented a version of it on a target SPARC Ultra 10 machine. After completing the implementation, several of the UCLA benchmarks, provided by C. W. Tsao, were run with our implementation to observe our results and to compare them to those of Tsao’s implementation. In what follows, several tables of data and figures are presented along with explanations and the implications of our findings. For a detouring example in the y-plane, represents the difference in y-coordinates between the two tapping points, and the notation of connections is illustrated in Figure 7. Name # Sinks Wirelength Prim1 Prim2 R1 R2 R3 R4 R5 269 603 267 598 862 1903 3101 167,621 388,351 1,374,093 2,689,683 3,440,234 7,010,219 10,408,226 Tsao’s Wirelength 132,120 313,613 1,320,665 2,602,907 3,388,951 6,828,510 10,242,660 # Buffers 135 300 30 64 84 183 285 Table 2. Implementation results and comparison to Tsao’s implementation W02-7 Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 From the data presented in Table 2, it can be seen that while our implementation does not perform quite as well as Tsao’s, who has a several year head start on us, it does perform quite well. We partially attribute the differences in our results to the approximations used that were discussed in Section 5. Other areas of improvement are discussed in Section 8. In addition to obtaining raw data on wirelength and the number of buffers inserted, Matlab plots of our merged Manhattan segment trees and routed zero skew clock trees were also generated in code. Examples of these plots for benchmark Prim1 are provided in Figures 9(a, b). In figure 9(b), the blue mark is the root node, and all boxes represent buffers. and more importantly, difficult to accomplish well. In our research, we both formulated and implemented a zero skew clock routing algorithm with minimum wire length and onthe-fly buffer insertion based on the famous deferred merge embedding approach. From the data results obtained from our runs on the UCLA benchmarks, it is apparent that while our implementation does quite well at solving the specified problem, it is by no means state-of-the-art. However, in the same light, we have noted several improvements that could be made to our implementation in the future work section (Section 8). 8. Figure 9(a). Merged Manhattan segment tree for benchmark R1 FUTURE WORK AND IMPROVEMENTS As one can imagine, a research project such as the one discussed in this paper deserves a large amount of time in order to obtain optimal results. However, we have only been able to spend roughly one month delving into both the theory behind our clock routing algorithm and its implementation. As a result, we feel that this project, while making some points about clock routing blatantly apparent, could be improved in several ways. First and foremost, it would be desirable to use a version of Delaunay triangulation code specifically designed for the task at hand. While line segment approximation through a series of points is fairly reliable, it is not fool proof. Secondly, we feel that a better means to verify the zero skew characteristic are necessary. We have attempted to prove that our trees are nearly zero skew by reapplying the Elmore delay model to our final results. While this can be argued to be sufficient to prove roughly zero skew (any model is not exact), in reality it is redundant since it is what we use to construct the tree in the first place. Thirdly, it would be interesting to investigate the use of more sophisticated buffer insertion heuristics with our algorithm. Finally, our top-down procedure uses a simple “connect the dots” type of router which does not look to share wires between interconnect paths. However, one which does could save further on interconnect wire length and should thus be employed in the future. 9. Figure 9(b). Zero skew clock tree for benchmark R1 7. CONCLUSIONS Several conclusions can be drawn from our work. First, it is obvious from all of the preceding discussion that clock routing is by no means trivial. The task of routing such a large critical net so that its signal arrives at many sinks literally scattered on a chip is both difficult to accomplish, REFERENCES [1] H. B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Reading, MA: Addison-Wesley, 1990. [2] Dhar, Franklin, and Wang. Reduction of Clock Delays in VLSI Structure. ICCAD, 1984. [3] Jackson, Sirinivasan, and Kuh. Clock Routing for High-Performance ICs. DAC, 1990. W02-8 Proc. 2002 IEEE Workshop on VLSI Design Automation, New Paltz, New York, Dec. 20-23, 2002 [4] Cong, Kahng, and Robins. Matching Based Models for High-Performance Clock Routing. IEEE TCAD, 1993. [5] Y. Chen, A. B. Kahng, G. Qu, and A. Zelikovsky. The Associative-Skew Clock Routing Problem. Proc. IEEE/ACM Intl. Conference on Computer-Aided Design, 168-172, November, 1999. [6] T.-H. Chao, Y.-C. Hsu, J.-M. Ho, K. Boese, and A. Kahng. Zero Skew Clock Routing with Minimum Wirelength. IEEE Transactions on Circuits and Systems - II: Analog and Digital Signal Processing 39 (1992), 799-814. [7] Masato Edahiro. A Clustering-Based Optimization Algorithm in Zero-Skew Routings. Proceedings of the 30th international on Design automation conference, p.612-616, June 14-18, 1993, Dallas, Texas, United States [8] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, New York, New York, 1985. W02-9