This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASC.2019.2943930, IEEE Transactions on Applied Superconductivity

IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY, VOL. X, NO. Y, Z 2019

A Minimum-Skew Clock Tree Synthesis Algorithm for Single Flux Quantum Logic Circuits

Soheil Nazar Shahsavani and Massoud Pedram, Fellow, IEEE

Abstract—This paper presents a synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits considering splitter delays and placement blockages. The proposed methodology improves on the state of the art by accounting for splitter delays and creating a fully-balanced clock tree structure in which the number of clock splitters from the clock source to all the sink nodes is identical. Additionally, a mixed integer linear programming (MILP) based algorithm is presented that removes the overlaps among the clock splitters and placed cells (i.e., placement blockages) and minimizes the clock skew, simultaneously. Using the proposed method, the average clock skew for 17 benchmark circuits is 4.6 ps, improving the state-of-the-art algorithm by 70%. Finally, a clock tree synthesis algorithm for imbalanced topologies is presented that reduces the clock skew and the number of clock splitters in the clock network by 56% and 37%, respectively, compared with a fully-balanced clock tree solution.

Index Terms—Single flux quantum (SFQ), superconducting electronics, physical design, placement, legalization, clock tree synthesis.

I. INTRODUCTION

CONVENTIONAL computing based on CMOS technology and metal interconnects has faced substantial issues in terms of total power consumption and energy efficiency [1]. Superconducting computing based on the Josephson effect is a promising replacement for CMOS technology aiming at high-performance and energy-efficient computing [4].
Josephson junctions (JJs), the basic circuit elements in single flux quantum (SFQ) technology, have a rapid switching speed ($\sim 1$ ps) and low switching energy ($\sim 10^{-19}$ J/bit) at temperatures of about 4 K [2], [3]. Rapid single flux quantum (RSFQ) technology was introduced in the 1980s. It uses quantized voltage pulses in digital data generation and memorization [4]. RSFQ circuits have been shown to be functional at operating frequencies of up to 770 GHz [5]. Recent developments introduce new SFQ logic families, such as energy-efficient single flux quantum technology (ERSFQ/eSFQ) [6], dual-rail RSFQ [7], self-clocked complementary logic (SCCL) [8], and reciprocal quantum logic (RQL) [9], as well as novel approaches including re-design of the current biasing network for RSFQ [10], [11], [12] and application of low supply voltage for RSFQ circuits [13].

In spite of the extraordinary characteristics of SFQ logic (including but not limited to high frequency and low energy dissipation), design automation methodologies and tools are less sophisticated than those of CMOS technology, preventing SFQ logic from becoming a realistic option for realizing large-scale, high-performance, and energy-efficient computing systems of the future [3].

Manuscript received April 15, 2019; accepted August 14, 2019. The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the U.S. Army Research Office grant W911NF-17-1-0120. This project is also supported in part by the Software and Hardware Foundations program of the National Science Foundation (NSF) under Grant No. 1619473. S. Nazar Shahsavani and M. Pedram are with the Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90007 USA (e-mail: nazarsha@usc.edu; pedram@usc.edu).
Although many advanced techniques have been developed for computer-aided design (CAD) for CMOS technology, these techniques cannot be directly applied to the design of SFQ circuits due to key differences between the two technologies. Some of these differences are (i) different active and passive components (JJs and inductors vs. transistors and capacitors), (ii) various types of logic gates and clocking structures, and (iii) the need for path-balancing D flip-flops (DFFs), splitters, and biasing networks, which increases the total cost of integration in terms of area and power consumption [14].

To address the aforementioned issues, researchers have started focusing on the development of front-end and back-end tools and methodologies for design automation of superconducting electronics to enable very large scale integration (VLSI) design and verification of superconductive electronics (SCE) as a step toward the development of energy-efficient and high-performance computers [15].

Physical design of logic circuits, especially the synthesis of the clock distribution network (CDN), plays an important role in designing high-performance circuits that are robust to process-induced variations. The layout of large circuits requires automated placement, clock network synthesis, and routing tools. Recent efforts by researchers have introduced effective techniques for placement, design of CDNs, and routing for large SFQ circuits [16]–[18].

Clock network synthesis is a crucial task in the physical design of logic circuits, as the clock network takes up substantial routing resources, consumes significant power, and determines the maximum frequency of the circuits. Minimizing the clock skew (i.e., the maximum difference in the arrival time of the clock signal at two different clock sinks) is of great importance since the clock skew directly limits the maximum achievable frequency of a circuit [19]. In SFQ logic circuits, the clock signal should be delivered to nearly all logic cells in the design.
Therefore, to maximize the performance, a well-balanced minimum-skew clock tree structure is an absolute requirement.

Previous zero-skew clock tree synthesis methods for SFQ circuits fail to produce high-quality solutions because they do not consider the delay of splitter cells (which are required to distribute the clock signal to sequential gates) and placement blockages (already placed logic cells) [20]. Additionally, the population density of cells in different regions of the chip can be very different, which can result in a highly-imbalanced clock tree topology, i.e., one where the maximum difference between splitter counts from the root of the clock tree to any pair of leaf nodes is large.

1051-8223 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This paper presents an algorithm for fully-balanced clock tree topology construction and a min-skew clock tree placement and legalization algorithm using a mixed integer linear programming (MILP) formulation to perform clock tree construction, splitter insertion, and skew minimization under the given placement blockages, considering both splitter and interconnect delays. The proposed clock tree topology generation algorithm guarantees that the maximum difference between the number of splitters from the clock source to any pair of sinks is zero. The effectiveness of the proposed CTS algorithm is verified using multiple SFQ circuits. The main contributions of this paper can be summarized as follows.
• An algorithm is presented that creates a fully-balanced clock tree in which the maximum difference between the number of splitters from the root of the tree to any pair of leaf nodes is zero.
• A min-skew clock tree placement and legalization algorithm is presented that places the clock splitters in the routing channels, i.e., empty spaces between the placement rows, and eliminates the overlaps among the clock splitters and logic cells while minimizing the skew.
• Using the proposed technique, the average clock skew for 17 benchmark circuits is 4.6 ps. This approach improves the state-of-the-art method by 70%.
• The proposed CTS algorithm is extended to generate a minimum-skew solution given imbalanced clock tree topologies. The modified algorithm reduces the clock skew and the number of clock splitters in the clock network by 56% and 37%, respectively, compared with a fully-balanced clock tree solution.

The rest of the paper is organized as follows. Background and prior work are discussed in Section II. Our SFQ-specific clock tree synthesis methodology, including the topology construction, splitter placement, and legalization algorithms, is discussed in Section III. Simulation results obtained by applying the proposed method to multiple benchmark circuits are reported in Section IV. A clock tree synthesis methodology for imbalanced topologies is presented in Section V. Finally, the paper is concluded in Section VI.

II. BACKGROUND

A. Definitions

In this section, we summarize some definitions and notations used throughout this paper.

• Clock phase delay refers to the delay from the clock source to any of the clock sinks (i.e., sequential elements such as flip-flops or latches). Phase delay, also known as insertion delay, increases as the feature size decreases and chip size increases. The phase delay is typically a combination of gate delay (e.g., buffers, clock gating elements, and clock dividers) and interconnect delay.
As the feature size decreases, the effect of process and on-chip variations (OCV) on phase delay increases, which in turn affects the clock uncertainty [19]. Accordingly, minimizing phase delay values is beneficial in reducing the clock uncertainty.

• Clock skew: Two flip-flops i and j connected by combinational gates and interconnects are called sequentially adjacent flip-flops (cf. Fig. 1). Clock skew between nodes i and j is defined as the difference between clock arrival times (phase delay values) at these two nodes. In this paper, clock skew for a circuit is defined as the maximum skew between any two flip-flops. Equations for calculating clock skew are as follows.

$$ skew_{i,j} = T_i - T_j \tag{1} $$

$$ skew_{max} = \max_{1 \le i,j \le n} |T_i - T_j| \tag{2} $$

where $T_i$ and $n$ denote the clock arrival time at sink i and the total number of clock sinks, respectively.

Fig. 1: A pair of sequentially adjacent flip-flops i and j connected by a combinational gate and interconnects.

Timing constraints can be categorized as setup and hold time constraints, defined as follows.

Setup time: the amount of time that the input to the capturing flip-flop ($FF_j$) should stay valid before the next triggering clock edge arrives [21]. The following inequality summarizes the relation between clock skew, clock period, and setup time.

$$ T_p \ge skew_{i,j} + t^{max}_{c2Q} + t^{max}_{comb} + t^{max}_{setup} \tag{3} $$

where $t^{max}_{c2Q}$ denotes the maximum clock-to-Q delay of a flip-flop, $t^{max}_{comb}$ accounts for the maximum delay through combinational logic (which also includes the interconnect delay), $T_p$ represents the clock cycle time, and $t^{max}_{setup}$ denotes the maximum setup time for a flip-flop. As shown, a positive clock skew increases the clock cycle time. On the other hand, a negative clock skew (i.e., the clock signal is received at the launching flip-flop earlier than at the capturing flip-flop) decreases the required clock period.
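As a quick numerical illustration of Eqs. (2) and (3), the following minimal Python sketch computes the maximum skew over a set of sinks and the smallest clock period allowed by the setup constraint for one pair of flip-flops. The function names and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper's benchmarks.

```python
def max_skew(arrival_ps):
    """Eq. (2): the largest |T_i - T_j| over all sink pairs.

    For a set of arrival times this equals max(T) - min(T).
    """
    return max(arrival_ps) - min(arrival_ps)


def min_clock_period(skew_ij, t_c2q_max, t_comb_max, t_setup_max):
    """Eq. (3): smallest T_p that satisfies the setup constraint
    for a launching flip-flop i and a capturing flip-flop j.

    skew_ij is the signed skew T_i - T_j of Eq. (1), in ps.
    """
    return skew_ij + t_c2q_max + t_comb_max + t_setup_max


# Hypothetical clock arrival times (ps) at four sinks.
arrivals = [12.0, 14.5, 13.0, 16.6]
skew = max_skew(arrivals)

# Hypothetical cell and logic delays (ps) for one launch/capture pair.
period = min_clock_period(skew, t_c2q_max=5.0, t_comb_max=20.0, t_setup_max=3.0)
print(f"max skew = {skew:.1f} ps, minimum period = {period:.1f} ps")
```

Using the worst-case skew in Eq. (3), as done here, is a conservative choice: it bounds the period for every pair of sequentially adjacent flip-flops, at the cost of some pessimism for pairs whose actual skew is smaller.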
Hold time: To ensure the proper propagation of an input signal through a flip-flop, the input must remain valid (hold steady) for a short duration after the clock edge, referred to as the hold time [22]. The hold time of the capturing flip-flop imposes an additional constraint on the total propagation delay of a signal through the launching flip-flop and the combinational logic as follows.

$$ skew_{i,j} \ge t^{max}_{hold} - t^{min}_{c2Q} - t^{min}_{comb} \tag{4} $$

In the worst case, the input signal at the capturing flip-flop (j) should remain stable for $t^{max}_{hold}$ after the clock edge of the same clock cycle arrives at node j. If the clock signal arrives at $FF_i$ earlier than at $FF_j$, the input signal to $FF_j$ can change before $FF_j$ has captured it.

As shown above, the clock skew directly limits the maximum clock frequency of a circuit and reduces the available positive time slack for setup constraints. Additionally, a negative clock skew may result in hold time violations. Unlike setup time violations, hold time violations cannot be fixed by increasing the clock period.

Actual Arrival Time (AAT): the latest transition time at a given node in a circuit, measured from the beginning of the clock cycle [21].

Required Arrival Time (RAT): the latest time at which a signal should arrive at a given node such that the circuit works correctly, given setup or hold constraints.
Timing Slack: For each node in a circuit (e.g., pins or gates), timing slack is calculated as the difference between RAT and AAT at that node. While a positive slack means that the timing constraint is satisfied (i.e., the signal arrives earlier than it is required), a negative slack is an indicator of a (setup or hold) timing violation (i.e., the signal arrives after its required time) [21].

• Clock Tree Topology: A clock tree topology is defined as a binary tree G, in which each node has a maximum of 2 children, that is rooted at the clock source R and has a total of |S| leaf nodes representing the set of clock sinks S. We define this tree topology to be a directed graph in which edges are directed from parents to children. The level of each node i, denoted by $L_i$, is defined as the number of nodes on the longest path from the root of the tree to node i. The height of a node, denoted by $H_i$, is the number of nodes on the longest path from that node to a leaf node. The height of a tree is defined as the height of its root node.

• Clock Tree Embedding: A clock tree embedding determines the location of each internal (non-sink) node v of the clock tree topology, denoted by $pl(v)$, in the Manhattan plane. If there is a connection between a parent node p and a child node c, the cost of the edge $e_{p,c}$, denoted as $l_{p,c}$, is defined as the Manhattan distance between $pl(p)$ and $pl(c)$. The total wirelength of a tree is calculated as the sum of the costs of all the edges of the tree.

Based on the above definitions, reducing the maximum clock skew increases the maximum frequency of the circuit, reduces the number of hold time violations, and facilitates the timing closure of the design (i.e., fixing the timing violations of the circuit, which is typically done after the placement and clock tree synthesis steps).

B. Delay Model

Single flux quantum pulses are typically propagated over long distances using passive transmission lines (PTLs).
PTL micro-strips transmit the pulses with extremely low losses, at a speed of approximately 1/3 of the speed of light in a vacuum [3]. Equation (5) models the propagation delay as a function of the length of the PTLs.

$$ D = \frac{L}{\frac{1}{3}c} \approx \frac{L\,(\mu\mathrm{m})}{100\,\mu\mathrm{m}}\ \mathrm{ps} \tag{5} $$

In Equation (5), D represents the delay over the PTL, L represents the length of the PTL, and c denotes the speed of light in a vacuum. As a result, the phase delay from the root of the tree R to a sink node $C_i$ over a path $path(R, C_i)$ is calculated as follows.

$$ D_{R,C_i} = \frac{1}{100} \times \sum_{e_{j,k} \in path(R,C_i)} l_{j,k} \tag{6} $$

where $l_{j,k}$ denotes the Manhattan distance between clock nodes $C_j$ and $C_k$.

Although SFQ signals can also be propagated using Josephson transmission lines (JTLs), we do not use JTLs in the global clock network, due to their low propagation speed and the difficulties they introduce in the routing stage. The introduced delay model for PTLs is similar to the path-length delay model used in CTS algorithms for early CMOS technology nodes [23]. Although the proposed linear delay model is used throughout the paper, other delay models can be integrated into the proposed design flow.

C. Prior Work

Multiple clock topologies have been proposed, and the tradeoffs are discussed in [24]. Synchronous clock tree synthesis using an H-tree structure has been proposed as the best option for large circuits in terms of maximum clock frequency [24]. In [25], a layout-driven CTS method is presented that groups cells by logic level and propagates a skewed clock signal to each logic group. For the first logic level, a clock tree is built to propagate the clock signal to each gate such that timing constraints are met. Then, the clock signal is passed to the root of the clock tree for the next logic level. This work employs splitter and JTL insertion and replacement of the logic cells in the same logic level for timing adjustments. An earlier SFQ design methodology presented in [26] first synthesizes a zero-skew clock network utilizing an H-tree structure.
The proposed algorithm then places the cells on predefined rectangular grid bins at the leaves of the clock tree, using the min-cut placement algorithm. However, since the placement slots are limited to grid bins and the placement is done after the CTS, the quality of placement in terms of the total wirelength and routability is degraded significantly. A novel clocking methodology (called hierarchical chains of homogeneous clover-leaves clocking) was proposed to improve the robustness of clock networks to timing variations [17]. In the proposed clock topology, the frequency of the circuit is determined by the structure and number of gates within each clover-leaf. However, since this clocking method does not consider wire delays and no algorithms for placement of the clock elements are presented, it is not possible to quantify the maximum clock frequency or the overhead in terms of the size of the clock network, or to compare it with synchronous zero-skew clocking algorithms [17].

In [20], a CTS algorithm for H-tree and HL-tree clock structures was presented and results were discussed. In HL-tree structures, an H-tree is used for global clock distribution and a linear tree (L) for local clock distribution [20]. The authors of [16] provide an algorithm for optimizing the placement of logic cells such that the HL-tree clock structures can be utilized efficiently.

III. PROPOSED CLOCK TREE SYNTHESIS METHODOLOGY

Conventional clock tree synthesis methods for CMOS circuits do not consider the placement of the splitters and their associated delay in clock skew minimization, as they use zero-area and zero-delay branching points for splitting the global clock signal. However, in SFQ circuits, the splitter delay adds to the overall insertion delay at the sink nodes. Additionally, the placement of these clock splitters determines the delay of each edge in the clock tree and therefore affects the insertion delays. Moreover, clock splitters should be placed in legal locations, i.e., they should not overlap with the placed logic cells and should not violate the layout rules. To address the aforementioned challenges, we propose a minimum-skew clock tree synthesis algorithm that considers splitter delays and placement blockages. In the following subsections, the overall flow of the proposed algorithm is presented and details of each step are explained.

A. Overall Design Flow

The overall design flow of our clock tree synthesis algorithm (called qCTS) is shown in Fig. 2a. The proposed approach generates a minimum-skew clock tree such that all the clock splitters are mapped to the routing channels between the placement rows (i.e., the logic cells are placed inside the rows and the clock splitters are placed between the rows). The proposed methodology can be employed for both H-tree and HL-tree clock structures. A sample output of the proposed algorithm for a placed netlist is shown in Fig. 2b. The inputs to the qCTS algorithm are (i) a placed netlist, (ii) a list of clock sink nodes and their locations, and (iii) a delay model. There are 4 steps in the proposed algorithm.

• In the first step (cf. Fig. 2a, Topology Generation), a fully-balanced tree topology is generated to reduce the maximum level difference among the clock sinks to zero. After this stage, it is guaranteed that all the sink nodes have the same level.
• In the second step (cf. Fig. 2a, Clock Tree Embedding), the clock tree embedding algorithm generates a zero-skew clock network and calculates the location of all the internal nodes of the clock network, given the tree topology and the location of the sink nodes.
• In the third step (cf. Fig. 2a, Splitter Insertion), the splitter cells in the clock tree are placed at the location of the embedding points of the clock network.
• In the final step (cf. Fig. 2a, Min-Skew Clock Tree Placement and Legalization), a MILP-based approach is used to map the clock splitters to the routing channels and to remove the horizontal overlaps between the clock splitters.

Once the qCTS algorithm generates the CDN, the static timing analysis (STA) tool calculates the maximum clock frequency and hold/setup time slacks, and the timing closure flow tries to resolve all the timing violations. In the following subsections, each step is explained in detail.

B. Clock Topology Generation

In SFQ logic circuits, the clock signal is distributed to nearly all the logic cells, as most of the cells are sequential elements, i.e., they need a clock signal for synchronization. Conventional clock topology generation methods do not consider the delay of splitter cells (needed to distribute the clock signal to multiple fan-outs). Additionally, the population density of cells in different regions of the chip can result in a highly-imbalanced clock tree topology with a large level difference among sink nodes. Consider an example with 8 leaf nodes as depicted in Fig. 3a. As shown, the leaf nodes have different levels (e.g., nodes 9 and 14 have levels 4 and 3, respectively). The maximum level difference among leaves is 2. Fig. 3b depicts a balanced tree topology in which all the leaves have the same level.
Insertion delay at each leaf node is a combination of all the splitter delays from the clock source to that leaf node and the interconnect delays. By creating a balanced tree topology, the portion of the insertion delay corresponding to splitter delays is balanced out among the leaf nodes, which helps reduce the clock skew. Some of the clock topology generation algorithms for CMOS circuits, such as greedy-DME or geometric matching, create imbalanced topologies as they allow merging two subtrees with different heights [27], [28]. We intend to create a fully-balanced binary tree in which the maximum level difference among leaf nodes is equal to 0. For this purpose, we propose using an algorithm similar to the method of means and medians (MMM) [29].

Fig. 2: (a) The overall flow of the proposed clock tree synthesis algorithm, qCTS. (b) The proposed placement of logic cells (blue rectangles, placed inside the rows) and clock splitters (black rectangles, placed between the rows) for a circuit with 32 logic gates and 8 rows. Rows are shown using red rectangles.

Fig. 3: Clock tree topologies for 8 leaf nodes (shown in blue). (a) An imbalanced tree with a max level difference of 2 among leaves. (b) A balanced tree with a max level difference of 0 among leaves.
The MMM algorithm, proposed as one of the initial minimum-skew clock tree synthesis algorithms, heuristically minimizes both the clock skew and the total wirelength of the CDN. The MMM method performs topology generation and clock embedding simultaneously, in a top-down manner. Note that in this work, we only use the topology generation part of this algorithm.

The MMM algorithm recursively bi-partitions the set of sinks in a region, creates a tree node for each of the sub-regions, and assigns these tree nodes as the children of the parent node (corresponding to the original region) [29]. In each step, sinks are sorted based on their x or y coordinate. Assuming the sinks are ordered by x (y) coordinate, half of the sinks are assigned to the left (bottom) sub-region and the other half are assigned to the right (top) sub-region. For each of the created sub-regions, a new tree node is created, and the root of the current tree (the node corresponding to the original region) is assigned as the parent of the two newly created nodes (i.e., nodes left and right). The same procedure is repeated recursively for the two created sub-regions until the number of sinks in each sub-region becomes less than 2. The MMM algorithm finishes in $\log n$ steps, and its run-time complexity is $O(n \log n)$, where n is the number of clock sinks [29]. An example is illustrated in Fig. 4. As shown, with 10 sink nodes, the maximum level difference among the leaf nodes is 1.

The MMM algorithm always creates a topology in which the max level difference among leaves is at most 1. The reason is that it first creates a fully-balanced binary tree with a height of $\lceil \log n \rceil - 1$, without adding any sink nodes to the tree. At this stage, in each created sub-region, there are either 2 or 1 sinks left.
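The recursive bi-partitioning described above can be sketched in a few lines of Python. This is only a sketch of the topology-generation part under stated assumptions, not the authors' implementation: the `Node` class, the function names, the alternation between x and y cuts at successive levels, and the sink coordinates are all hypothetical. The sketch also reports the level of every leaf, so the "at most 1" level-difference property can be checked on a 10-sink example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    sink: Optional[Point] = None      # set only for leaf (sink) nodes
    left: Optional["Node"] = None     # left/bottom sub-region
    right: Optional["Node"] = None    # right/top sub-region


def mmm_topology(sinks: List[Point], split_on_x: bool = True) -> Node:
    """Recursively bi-partition the sinks at the median coordinate.

    Assumption: the cut direction alternates between x and y.
    """
    if not sinks:
        raise ValueError("at least one sink is required")
    if len(sinks) == 1:               # fewer than 2 sinks: stop, make a leaf
        return Node(sink=sinks[0])
    axis = 0 if split_on_x else 1
    ordered = sorted(sinks, key=lambda p: p[axis])
    half = (len(ordered) + 1) // 2    # sizes differ by at most one sink
    return Node(
        left=mmm_topology(ordered[:half], not split_on_x),
        right=mmm_topology(ordered[half:], not split_on_x),
    )


def leaf_levels(node: Node, level: int = 1) -> List[int]:
    """Level = number of nodes on the path from the root to the leaf."""
    if node.sink is not None:
        return [level]
    return leaf_levels(node.left, level + 1) + leaf_levels(node.right, level + 1)


# Ten hypothetical sink locations with distinct coordinates.
sinks = [(float(3 * i % 10), float(7 * i % 10)) for i in range(10)]
levels = leaf_levels(mmm_topology(sinks))
print("leaf levels:", sorted(levels))
print("max level difference:", max(levels) - min(levels))
```

Because each cut splits a region into parts whose sizes differ by at most one sink, leaves can only appear on two adjacent levels, which is the property the text relies on.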
If there is only 1 sink left, that sink is assigned as a leaf node of the tree; therefore, it does not increase the height of the tree. If there are 2 sinks left, another bi-partitioning adds two children to one of the leaf nodes of the tree and increases the height of the tree to $\lceil \log n \rceil$. Consequently, leaves of the tree have a max level difference of 1.

To further reduce the max level difference to 0, we use JTL cells. If the max sink level is $l_{max}$, we find all the sinks with level $l_{max} - 1$ and add a JTL cell as their parent. Hence, their level becomes $l_{max}$ and the output clock tree becomes fully balanced. To balance the insertion delay at the sink nodes and reduce the max clock skew, we design special JTL cells to have the same propagation delay as splitters.

Given the location of all the sink nodes, the topology of the clock tree is generated using the outlined method. Next, the generated topology is passed to the clock tree embedding step, which calculates the location of the embedding points of the internal nodes of the clock tree in the Manhattan plane.

Fig. 4: The MMM clock topology generation algorithm applied to an example with 10 sinks [29]. Blue rectangles and black circles show the splitter cells and sinks of the clock tree, respectively.

C. Clock Tree Embedding and Splitter Insertion

The generated clock tree topology along with the locations of the sink nodes are the inputs to the clock tree embedding step. In this step, the location of the clock splitters is determined. The goal is to construct a zero-skew clock tree while minimizing the total wirelength of the clock network. We use the deferred merge embedding (DME) algorithm to embed the clock tree, as it generates a zero-skew solution with minimum cost in terms of wirelength, assuming a linear delay model [23]. The DME algorithm was developed according to the observation that there are multiple locations for an internal node in a given topology which satisfy the skew specifications [23].
The DME algorithm constructs a clock tree in two phases: (i) a bottom-up pass that finds all potential zero-skew merging locations for two nodes, called merging segments (ms), as a function of the distance between the child nodes and the downstream delay of each child node, where the downstream delay is defined as the max delay from a node to its leaf nodes; and (ii) a top-down tree traversal in which the DME picks one location on each merging segment. The DME algorithm has linear time complexity given the input topology. For a complete description of this algorithm, please refer to [23].

An example of applying the DME algorithm to a clock tree with 4 sinks is redrawn from [23] and shown in Fig. 5. Fig. 5a shows a balanced clock tree topology. The goal is to find the location of the internal nodes of the tree, a, b, and r. Fig. 5b shows the location of the sinks, the merging segments, and the embedding points of the internal nodes of the clock tree in the Manhattan plane. In the bottom-up pass, the DME algorithm finds the merging segments for the internal nodes. For instance, the Manhattan distance between sinks $s_3$ and $s_4$ is 4. Therefore, to generate a zero-skew solution, the distance between node b and each of its two children should be 2. Accordingly, a merging segment ($ms_b$) is formed at a distance of 2 from $s_3$ and $s_4$. Similarly, $ms_a$ is formed at a distance of 3 from $s_1$ and $s_2$. Note that the downstream delay of any node on $ms_a$ and $ms_b$ is 3 and 2, respectively, assuming a path-length delay model. Finally, the merging segment for node r is calculated such that the maximum clock skew becomes 0.

The clock tree embedding step generates the exact location of the internal nodes of the clock tree. Subsequently, we place the clock splitters at these embedding points and add nets between each splitter node and its children.
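The balancing rule used in the bottom-up pass can be illustrated with a short Python sketch. It covers only the edge-length computation for one merge under the path-length delay model, not the construction of the merging segments themselves or the top-down pass. The function name `merge_split` is hypothetical, and the distance of 5 between $ms_a$ and $ms_b$ used for the root merge is an assumed value, since the text does not state it.

```python
def merge_split(dist, delay_a, delay_b):
    """Zero-skew edge lengths (e_a, e_b) from a parent to children a and b.

    Path-length delay model: delay equals wire length. Solves
        e_a + e_b = dist   and   delay_a + e_a = delay_b + e_b.
    If the balance point falls outside [0, dist], the parent is placed on
    the slower child and the wire to the faster child is elongated
    (a detour), so e_a + e_b exceeds dist.
    """
    e_a = (dist + delay_b - delay_a) / 2.0
    if e_a < 0:                       # a is too slow: parent sits on a
        return 0.0, delay_a - delay_b
    if e_a > dist:                    # b is too slow: parent sits on b
        return delay_b - delay_a, 0.0
    return e_a, dist - e_a


# Node b merges sinks s3 and s4, which are 4 apart (both delays are 0).
e3, e4 = merge_split(4, 0, 0)
delay_b = 0 + e3                      # downstream delay on ms_b

# Node a merges s1 and s2; a distance of 3 to each implies they are 6 apart.
e1, e2 = merge_split(6, 0, 0)
delay_a = 0 + e1                      # downstream delay on ms_a

# Root r merges a and b; the distance 5 between ms_a and ms_b is assumed.
e_ra, e_rb = merge_split(5, delay_a, delay_b)
print("b:", (e3, e4), "a:", (e1, e2), "r:", (e_ra, e_rb))
print("root-to-sink delays:", delay_a + e_ra, delay_b + e_rb)
```

With these numbers the root sits closer to the slower subtree a, so both root-to-sink delays come out equal, which is the zero-skew condition the merging segments encode.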
In contrast to CMOS, where the embedding points are the locations of the (zero-area) branching points of the clock signal, in SFQ technology splitter cells (with non-zero area) are placed at these locations. The generated locations for clock splitters may already be occupied by logic cells or other clock splitters. Hence, after placing the clock splitters at these locations, the clock tree should be legalized. In other words, the overlaps among clock splitters and cells should be removed. Note that, because these overlaps force some splitters to move, the clock skew may no longer be zero after the legalization step.

Fig. 5: An example of a zero-skew clock tree produced by the DME algorithm for a circuit with 4 sinks. (a) The topology of the tree. (b) The location of sinks, merging segments, and the embedding points of the internal nodes of the tree. Black rectangles and yellow circles represent the clock sinks and splitters, respectively.

To remove the overlaps among the clock splitters and the placed cells, we employ a two-step approach: (i) map the clock splitters to routing channels (adjusting y coordinates), and (ii) remove the overlaps among splitters in each routing channel (modifying x coordinates). Note that the legalization step changes the insertion delays of the leaf nodes and results in clock skew.
Accordingly, the primary objective is to minimize the introduced clock skew. In the next step, we present our algorithms for finding the best routing channel and the best x coordinate for each clock splitter, such that the clock skew is minimized. The output of this step (an illegal solution) and a legal placement of the clock splitters that yields a minimum-skew solution for a 4-bit Kogge-Stone adder circuit [30] are shown in Figs. 6a and 6b, respectively.

Fig. 6: Illustration of a 4-bit Kogge-Stone adder circuit [30], after the placement and CTS steps. Logic cells, clock splitters, and I/O pads are shown using blue, red, and black rectangles, respectively. (a) An illegal zero-skew solution. (b) A legal nonzero-skew solution.

D. Min-Skew Clock Tree Placement and Legalization

In this section, we present a min-skew clock tree placement and legalization algorithm. This algorithm removes the overlaps among logic cells and clock splitters in two steps. In the first step, clock splitters are mapped to the routing channels while their x coordinates are fixed (displacement in the vertical direction). The motivations for moving the clock splitters to the routing channels are as follows: (i) the placement of logic cells already placed in the placement rows does not need to change, which helps ensure that the routability of the circuit is not affected; (ii) the width of the chip does not need to increase to accommodate the placement of clock splitters inside placement rows. Note that in a circuit with $n = 2^m$ clock sinks, a total of $n - 1$ splitters are needed to build a fully-balanced clock tree. Therefore, adding splitters to the placement rows may increase the width of the chip significantly. The main objective in this step is to minimize the clock skew, i.e., the difference between the largest and smallest insertion delay values at the sink nodes.
As explained in Section II, reducing the phase delay values increases the robustness to on-chip variations. Additionally, considering the delay model for PTLs, reducing the phase delay values also reduces the total wirelength of the clock tree and facilitates the clock routing. Consequently, minimizing the sum of the phase delays is considered as a secondary objective. The variables in this problem are the assignments of splitter cells to the routing channels. For each splitter cell $i$, the index of the assigned routing channel ($y_i$) is an integer value between 1 and the number of available routing channels ($n_R$). Parameters used for formulating the problem are listed in Table I. The constraints are defined as follows.
• The total number of clock splitters in each routing channel should not exceed the capacity of the routing channel (i.e., the sum of the widths of the splitters should be less than the width of a routing channel).
• The mappings of the cells to the routing channels should not be much different from the solution generated by the embedding algorithm, as it already provides a good initial solution.

TABLE I: Notations and definitions used for formulating the clock tree placement and legalization problem.

| Term | Definition |
|---|---|
| $C_i$ | Clock cell i, including sinks |
| $C_0$ | Clock source (root of the clock tree) |
| $L_i$ | Level of the clock cell i |
| $D_i$ | Phase delay at the sink node i |
| $(x_i, y_i)$ | Lower left coordinates of the clock cell i |
| $e_{k,j}$ | An edge in the clock tree connecting cells $C_k$ and $C_j$ |
| $\delta_{k,j}$ | Delay of the edge $e_{k,j}$ |
| $n_C$ | The total number of clock cells, excluding the sink nodes |
| $n_S$ | The total number of clock sinks |
| $n_R$ | The total number of placement rows (same for routing channels) |
| $W_a, H_a$ | Width and height of the layout area |
| $W_{ch}, H_{ch}$ | Width and height (≥ 40 µm) of the routing channels |
| $W_r, H_r$ | Width and height (120 µm) of the placement rows |
| $H_{r+ch}$ | The sum of the height of a placement row and a routing channel |
| $m_y$ ($m_x$) | The max difference between the y (x) coordinates of the clock cells |
| $W_{spl}, H_{spl}$ | Width and height of the clock splitter cells (40 µm) |
| $P$ | The minimum distance between adjacent clock cells |
| $d_{spl}$ | Splitter delay (5.5 ps) |
| $\lambda$ | Regularization constant ($10^{-3}$) |
| $\alpha$ | Delay constant ($10^{-2}$ ps/µm) |

Subsequently, we present a mathematical formulation of the min-skew clock tree placement problem in the vertical direction as follows.

$$
\begin{aligned}
\text{minimize} \quad & \max_{i=1 \ldots n_S} D_i - \min_{j=1 \ldots n_S} D_j + \lambda \cdot \sum_{k=1}^{n_S} D_k \\
\text{subject to} \quad & y_i \in \{1 \ldots n_R\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \delta_{k,j} \\
& \delta_{k,j} = \delta^x_{k,j} + \delta^y_{k,j} \\
& \delta^y_{k,j} = \alpha \cdot |y_k - y_j| \cdot H_{r+ch}
\end{aligned}
\tag{7}
$$

In the above formulation, the phase delay of sink node $i$, denoted by $D_i$, is calculated as the sum of the wire delays on the path from the clock source ($C_0$) to the sink node $i$. The delay constant $\alpha$ is used to convert PTL length to delay, as described in Section II-B. The main objective is to reduce the difference between the largest and smallest phase delay values, i.e., the clock skew. The sum of the phase delay values at the sink nodes ($D_i$), multiplied by a constant value ($\lambda$), is added to the objective function as a regularization term. In this formulation, the x coordinates of all the clock splitters are kept constant (i.e., the $\delta^x_{k,j}$ terms are constant values, a function of the locations calculated in the previous step). Since the absolute values add non-linearity to the problem, the following transformation is used to linearize the constraints.

$$
|z| \Rightarrow \eta^+ + \eta^-
\tag{8}
$$

Using the above transformation, the following constraints should be added to the problem.

$$
\begin{aligned}
& z = \eta^+ - \eta^- \\
& 0 \le \eta^+ \le b \cdot m \\
& 0 \le \eta^- \le (1 - b) \cdot m \\
& b \in \{0, 1\}
\end{aligned}
\tag{9}
$$

As shown, a new boolean variable $b$ is added to the problem. Parameter $m$ represents the maximum value of $|z|$. Additionally, in order to transform the max and min functions to linear functions, two new auxiliary variables $D_{max}$ and $D_{min}$ are introduced and the following constraints are added to the problem.

$$
D_i \le D_{max}, \quad D_{min} \le D_i \qquad \forall i \in \{1 \ldots n_S\}
\tag{10}
$$

Furthermore, to control the channel density, i.e., to limit the number of clock splitters mapped to each routing channel, separate constraints are added to the problem. The capacity of a channel, defined as the max number of splitters in each channel, is calculated using the following formula.

$$
n_{ch} = \left\lfloor \frac{W_{ch}}{W_{spl} + P} \right\rfloor
\tag{11}
$$

To model the assignment of each cell $i$ to each routing channel $j$, a boolean variable $c_{i,j}$ is defined and the following set of constraints is added to the problem.

$$
y_i = j \iff c_{i,j} = 1 \qquad \forall i \in \{1 \ldots n_C\} \;\; \forall j \in \{1 \ldots n_R\}
\tag{12}
$$

$$
\sum_{j=1}^{n_R} c_{i,j} = 1 \qquad \forall i \in \{1 \ldots n_C\}
\tag{13}
$$

$$
\sum_{i=1}^{n_C} c_{i,j} \le n_{ch} \qquad \forall j \in \{1 \ldots n_R\}
\tag{14}
$$

Constraint (12) ensures that if cell $i$ is assigned to channel $j$, then $c_{i,j}$ is one. If-then constraints can be transformed to linear constraints in a way similar to the one used to transform the absolute value constraints [31]. Constraint (13) ensures that each cell is assigned to only one channel. Constraint (14) controls the channel density by ensuring that the total number of cells assigned to each channel does not exceed the channel capacity (calculated by Equation (11)). The initial placement generated by the embedding algorithm is a good starting point for the final mapping of clock splitters to routing channels. This initial solution ($y_i^0$) simply maps each cell to the nearest channel below the original location. Accordingly, we restrict the final mapping of cell $i$ to either the same channel as the initial solution or the channel above it, using the following constraint.

$$
y_i^0 \le y_i \le y_i^0 + 1 \qquad \forall i \in \{1 \ldots n_C\}
\tag{15}
$$

Eventually, using the proposed transformations, objective function, constraints, and variables, the problem formulation is summarized as follows.

$$
\begin{aligned}
\text{minimize} \quad & D_{max} - D_{min} + \lambda \cdot \sum_{i=1}^{n_S} D_i \\
\text{subject to} \quad & D_i \le D_{max}, \quad D_{min} \le D_i && \forall i \in \{1 \ldots n_S\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^x_{k,j} + \delta^y_{k,j} \right) \\
& \delta^y_{k,j} = \alpha \cdot \left( \eta^+_{k,j} + \eta^-_{k,j} \right) \cdot H_{r+ch} \\
& y_k - y_j = \eta^+_{k,j} - \eta^-_{k,j} \\
& 0 \le \eta^+_{k,j} \le b_{k,j} \cdot m_y \\
& 0 \le \eta^-_{k,j} \le (1 - b_{k,j}) \cdot m_y \\
& \sum_{j=1}^{n_R} c_{i,j} = 1 && \forall i \in \{1 \ldots n_C\} \\
& \sum_{i=1}^{n_C} c_{i,j} \le n_{ch} && \forall j \in \{1 \ldots n_R\} \\
& y_i = j \iff c_{i,j} = 1 && \forall i \in \{1 \ldots n_C\} \;\; \forall j \in \{1 \ldots n_R\} \\
& y_i^0 \le y_i \le y_i^0 + 1 && \forall i \in \{1 \ldots n_C\} \\
\text{variables} \quad & y_i \in \{1 \ldots n_R\} && \forall i \in \{1 \ldots n_C\} \\
& b_{k,j} \in \{0, 1\} && \forall k, j \rightarrow \exists\, e_{k,j} \\
& c_{i,j} \in \{0, 1\} && \forall i \in \{1 \ldots n_C\} \;\; \forall j \in \{1 \ldots n_R\} \\
& D_i \in \mathbb{R}^+ && \forall i \in \{1 \ldots n_S\} \\
& \delta_{k,j} \in \mathbb{R}^+ && \forall k, j \rightarrow \exists\, e_{k,j}
\end{aligned}
\tag{16}
$$
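To make the vertical assignment problem concrete, the following Python sketch enumerates every choice allowed by constraint (15) on a toy tree, rejects the choices that violate the channel capacity of Equation (11), and keeps the assignment with the smallest objective. It is a brute-force stand-in for illustration only, not the MILP solved in this work: the tree, the coordinates, the spacing `P`, the channel width `W_CH`, the value of `H_RCH`, and the decision to keep the clock source fixed are all assumptions, and exhaustive enumeration is practical only for a handful of splitters.

```python
from itertools import product
from math import floor

ALPHA = 1e-2     # delay constant in ps/um (Table I)
LAMBDA = 1e-3    # regularization constant (Table I)
H_RCH = 160.0    # assumed: 120 um row height + 40 um channel height
W_SPL = 40.0     # splitter width in um (Table I)
P = 10.0         # assumed minimum spacing in um
W_CH = 100.0     # assumed channel width in um (kept tiny on purpose)


def channel_capacity(w_ch, w_spl, p):
    """Eq. (11): maximum number of splitters per routing channel."""
    return floor(w_ch / (w_spl + p))


def phase_delays(parent, x, y, sinks):
    """Sum the wire delays (x and y parts) along each source-to-sink path."""
    delays = {}
    for sink in sinks:
        total, node = 0.0, sink
        while parent[node] is not None:
            k = parent[node]
            dx = abs(x[k] - x[node])
            dy = abs(y[k] - y[node]) * H_RCH
            total += ALPHA * (dx + dy)
            node = k
        delays[sink] = total
    return delays


def best_vertical_assignment(parent, x, y_fixed, y0, sinks, n_rows):
    """Try y_i in {y_i^0, y_i^0 + 1} for every splitter and keep the best."""
    splitters = sorted(y0)
    cap = channel_capacity(W_CH, W_SPL, P)
    best = None
    for shift in product((0, 1), repeat=len(splitters)):
        y = dict(y_fixed)
        for cell, up in zip(splitters, shift):
            y[cell] = y0[cell] + up
        if any(not 1 <= y[c] <= n_rows for c in splitters):
            continue
        load = {}
        for c in splitters:
            load[y[c]] = load.get(y[c], 0) + 1
        if max(load.values()) > cap:      # channel density check, as in (14)
            continue
        d = phase_delays(parent, x, y, sinks)
        skew = max(d.values()) - min(d.values())
        cost = skew + LAMBDA * sum(d.values())
        if best is None or cost < best[0]:
            best = (cost, skew, {c: y[c] for c in splitters})
    return best


# Hypothetical 4-sink tree: source C0, splitters a and b, sinks s1..s4.
parent = {"C0": None, "a": "C0", "b": "C0",
          "s1": "a", "s2": "a", "s3": "b", "s4": "b"}
x = {"C0": 100, "a": 60, "b": 140, "s1": 40, "s2": 80, "s3": 120, "s4": 160}
y_fixed = {"C0": 1, "s1": 1, "s2": 1, "s3": 3, "s4": 3}
y0 = {"a": 1, "b": 2}                     # initial channels from the embedding
sinks = ["s1", "s2", "s3", "s4"]
cost, skew, rows = best_vertical_assignment(parent, x, y_fixed, y0, sinks, n_rows=3)
print(rows, round(skew, 6))
```

In this toy instance, moving splitter a one channel up lengthens the short branch and balances it against the long one, so the skew term dominates the small wirelength penalty weighted by λ.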
As observed, there are integer ($y_i$), binary ($b_{k,j}$ and $c_{i,j}$), and real-valued ($D_i$ and $\delta_{k,j}$) variables in this formulation. Therefore, this problem is an instance of mixed integer linear programming (MILP). Solving this problem yields the optimum assignment of splitter cells to routing channels such that the clock skew is minimized. Once this step is completed, a similar problem can be formulated to minimize the skew while eliminating all the horizontal overlaps among the cells allotted to the same routing channel. Using the horizontal ordering imposed by the embedding algorithm, we can set constraints on the location of each cell in the routing channel, such that no two adjacent cells overlap. Accordingly, after the assignment of cells to routing channels is determined, the cells in each routing channel are sorted based on their x coordinates and the following constraints are used to eliminate the horizontal overlaps among cells.

$$
W_{spl} + P \le x_j - x_i \qquad \forall i, j \in \{1 \ldots n_C\} \;|\; y_i = y_j, \; x_i^0 \le x_j^0
\tag{17}
$$

Since the ordering of the cells is determined, we can make use of the transitive relationships. For example, if there are three cells x, y, and z, the two constraints x ≤ y and y ≤ z imply that x ≤ z. Therefore, if there are a total of n cells in a row, n − 1 constraints (one per pair of adjacent cells) are required to ensure the legality of the placement solution in that row. Using the transformations and variables defined for legalization in the vertical direction, the min-skew placement and legalization problem in the horizontal direction is formulated as follows.

$$
\begin{aligned}
\text{minimize} \quad & D_{max} - D_{min} + \lambda \cdot \sum_{i=1}^{n_S} D_i \\
\text{subject to} \quad & D_i \le D_{max}, \quad D_{min} \le D_i && \forall i \in \{1 \ldots n_S\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^x_{k,j} + \delta^y_{k,j} \right) \\
& \delta^x_{k,j} = \alpha \cdot \left( \eta^+_{k,j} + \eta^-_{k,j} \right) \\
& x_k - x_j = \eta^+_{k,j} - \eta^-_{k,j} \\
& 0 \le \eta^+_{k,j} \le b_{k,j} \cdot m_x \\
& 0 \le \eta^-_{k,j} \le (1 - b_{k,j}) \cdot m_x \\
& x_i + W_{spl} + P \le x_j && y_i = y_j, \; x_i^0 \le x_j^0 \\
\text{variables} \quad & x_i \in [0, W_a - W_{spl}] && \forall i \in \{1 \ldots n_C\} \\
& b_{k,j} \in \{0, 1\} && \forall k, j \rightarrow \exists\, e_{k,j} \\
& D_i \in \mathbb{R}^+ && \forall i \in \{1 \ldots n_S\} \\
& \delta_{k,j} \in \mathbb{R}^+ && \forall k, j \rightarrow \exists\, e_{k,j}
\end{aligned}
\tag{18}
$$

Similar to formulation (16), min-skew placement and legalization in the horizontal direction is also an instance of MILP. Note that, assuming the lower left corner of the layout area to be at (0, 0), the $x_i$ coordinates are constrained to be within the boundaries of the layout. Solving the above problem yields the final placement of the clock splitters, such that the clock skew is minimized and a legal placement (with no overlaps) is produced. Note that the legalization in the vertical direction should be done before the horizontal direction. The reason is that the assignment of cells to channels and their initial ordering in the horizontal direction determine the necessary constraints for removing overlaps during horizontal legalization. Once the horizontal and vertical legalization problems are solved, a legal minimum-skew solution similar to Fig. 6b is produced. In the next section, our simulation framework, along with the results of applying the proposed CTS algorithm to multiple SFQ benchmarks, is presented.

IV. SIMULATION RESULTS

We used the qPlace package for placing the logic cells in the layout area [16], [20]. We added support for our proposed delay model (cf. Section II-B) to the implementation of the DME algorithm for embedding the clock trees [23]. We implemented the clock topology generation and the rest of the proposed algorithms in C++ and used the IBM CPLEX v12.8 package for solving the MILP problems [32]. The qSTA tool was used for static timing analysis. We used the clock tree synthesis approach in [20] as the baseline for comparison. This approach essentially uses the DME algorithm for clock tree synthesis.
There are two major differences between the proposed method and the baseline approach.
• The proposed method maps each splitter to the channel above or below the initial location, such that the clock skew is minimized. In contrast, the baseline approach greedily maps each clock splitter to the closest routing channel (either above or below), minimizing the displacement of each individual splitter while ignoring the clock skew and the total wirelength of the clock tree.
• The proposed approach moves all the cells in the horizontal direction, aiming at minimizing the skew and removing the overlaps. Conversely, the baseline approach moves only the cells that overlap with each other, by shifting the overlapping cell(s), ignoring the effect of displacement on the clock skew.

We assume the baseline approach uses a fully-balanced topology to minimize the skew, similar to our proposed approach. Note that this is an advantage for the baseline solution, as using an imbalanced clock topology while ignoring the splitter delays increases the clock skew. On the other hand, using a fully-balanced clock tree topology, which makes sure all the sink nodes in the clock tree have the same level, effectively removes the impact of clock splitter delays on the skew. Note that, in this work, it is assumed that all the splitter cells have the same delay and that process variations do not change the delay of the splitters.

The characteristics of the benchmark circuits, including the number of I/O pads and sink nodes, and the number of cells and nets before and after the clock tree synthesis, are listed in Table II. Since the same topology generation algorithm is used for both the proposed and baseline approaches, the number of cells after clock synthesis is equal for both solutions. The clock skew, total negative hold slack, worst negative hold slack, and the maximum achievable clock frequency for each design, after the placement and clock tree synthesis, are reported in Table III.
We have also listed the clock skew values after the clock routing, using the qGDR routing tool with 4 metal layers for routing [34]. The clock period for each circuit is calculated as the smallest value such that there are no setup time violations. For solving each of the MILP problems, a time limit of 30 minutes is set. As shown in Table III, the average clock skew value for the 17 benchmarks is 4.6ps. The proposed method improves the average clock skew by 70%, compared with the baseline approach. Additionally, the total and worst negative hold slack values are improved by 80% and 60%, respectively. The average values for the total and worst negative hold slack for all the benchmarks are −6.1ps and −1.7ps, respectively. As observed, even the worst-case hold time violation among all the benchmarks (cf. Table III, benchmark c3540) can be solved by the post-CTS timing closure flow, without the need for extensive refinement of the circuit, by inserting a small number of hold buffer/JTL cells. Finally, it should be mentioned that if the time limits for solving the MILP problems are increased, or if the movement of clock splitters is not restricted to the top or bottom channels only, lower skew values may be achieved. Post-routing maximum clock skew results are listed in Table III. The average clock skew increases to 8.6ps after the clock routing.
The main reason behind this increase in the maximum clock skew is that the routing tool focuses on finishing the routing of all the nets while reducing the total via count used for routing [34]. Therefore, it ignores the propagation delay along the nets, and consequently, the maximum clock skew values. Aside from the routing tool itself, we point out that the competition for limited available routing resources in large benchmarks (e.g., the 16-bit array multiplier with more than 14,000 logic cells and 13,000 clock sub-nets) results in an increase in the length of routed nets, compared with the ideal Manhattan distance, and hence the clock skew may increase. Note that the maximum post-routing clock skew among all benchmarks is 15.3ps, which is rather small, and the resulting negative slack values can be easily eliminated by a timing closure flow, which selectively adds a small number of gates on data propagation paths. A timing-driven routing tool would address the increase in maximum clock skew. Aside from the post-routing results reported in Table III, throughout the paper, all the delay values are reported after the placement and clock tree synthesis steps (and before routing), assuming that the net lengths are equal to the Manhattan distance between the corresponding logic gates.

TABLE II: Benchmark characteristics. KSA stands for Kogge-Stone adder [30], ArrMult stands for array multiplier, and ID stands for integer divider. The rest of the benchmarks are chosen from the ISCAS85 benchmark suite [35]. Post-CTS columns report the number of cells and nets in the design after adding clock splitters and clock nets (using the proposed approach).

| Benchmark | #I/O pads | #Clk sinks | Pre-CTS #cells | Pre-CTS #nets | Post-CTS #cells | Post-CTS #nets |
|---|---|---|---|---|---|---|
| KSA4 | 15 | 59 | 87 | 124 | 150 | 246 |
| KSA8 | 27 | 318 | 230 | 318 | 485 | 732 |
| KSA16 | 51 | 414 | 592 | 803 | 1103 | 1728 |
| KSA32 | 99 | 1049 | 1486 | 1988 | 3533 | 5084 |
| ArrMult8 | 33 | 1404 | 1875 | 2296 | 3922 | 5747 |
| ArrMult16 | 65 | 4798 | 6206 | 7646 | 14397 | 20635 |
| ID4 | 17 | 420 | 570 | 694 | 1081 | 1625 |
| ID8 | 33 | 2703 | 3192 | 3697 | 7287 | 10495 |
| c432 | 44 | 976 | 1186 | 1432 | 2209 | 3431 |
| c499 | 74 | 566 | 875 | 1225 | 1898 | 2814 |
| c880 | 87 | 1133 | 1469 | 1865 | 3516 | 5045 |
| c1355 | 74 | 618 | 922 | 1267 | 1945 | 2908 |
| c1908 | 59 | 1100 | 1516 | 1965 | 3563 | 5112 |
| c2670 | 206 | 1713 | 2195 | 2832 | 4242 | 6592 |
| c3540 | 73 | 2679 | 3936 | 5097 | 8031 | 11871 |
| c5315 | 285 | 4483 | 5931 | 7557 | 14122 | 20231 |
| c6288 | 65 | 5546 | 7236 | 8958 | 15427 | 22695 |

Although the proposed clock embedding, placement, and legalization algorithms (cf. steps 2-4 in Fig. 2a) minimize the skew given a fully-balanced clock topology, these algorithms can also be applied to imbalanced topologies with fewer splitters, to produce minimum-skew clock trees with fewer JJs than a fully-balanced tree topology. In the next section, we present the modifications to the methodology of Section III that are necessary to minimize the clock skew given imbalanced clock tree topologies.

V. CLOCK SYNTHESIS FOR IMBALANCED TREE TOPOLOGIES

In RSFQ circuits, the static power dissipation in the resistive bias network is about 100× larger than the dynamic power dissipation of the Josephson junctions [3]. Additionally, the clock network may require large amounts of current, exceeding 10A for large circuits. Consequently, the current delivery is a significant problem in large circuits with more than 100K JJs [33]. One of the possible ways to reduce the required amount of current delivered to the circuit and the static power consumption is to decrease the number of splitters (i.e., JJ count) in the clock network.
Although creating a fully-balanced clock tree in which all the sink nodes have the same level reduces the skew significantly, it may introduce a large overhead in terms of the total area, biasing current, and static power consumption of the clock network. To address this issue, imbalanced tree structures with fewer splitters than the fully-balanced solution may be utilized. As a consequence of the imbalance in the clock tree topology, to minimize the skew, the clock tree embedding, placement, and legalization algorithms should be modified to account for the delay of clock splitters. In the following subsections, we first propose using an algorithm for imbalanced clock tree topology generation [27]. Next, we present splitter-delay-aware zero-skew clock tree embedding and splitter-delay-aware min-skew clock tree placement and legalization algorithms in detail. The goal is to minimize both the clock skew and the number of JJs in the clock network.

A. Imbalanced Topology Generation

For topology generation, we use the greedy-DME algorithm [27]. In this approach, topology generation is performed in a bottom-up fashion (in contrast to our proposed approach, which works in a top-down manner, cf. Section III-B). Assume that initially there are n sink nodes. The greedy-DME algorithm starts with a set of nodes, representing the sinks of the clock tree. The algorithm iteratively finds the nearest neighbors u and v in the set of possible nodes, where the distance between nodes u and v is smaller than the distance between any other pair of nodes. A parent node is then created by merging nodes u and v, and these two child nodes are removed from the set of nodes and replaced by the parent. The next pair of nearest neighbors is detected and the same procedure is repeated until the total number of remaining nodes is 1 (this process takes n − 1 steps). A time complexity of O(n log n) can be obtained for this algorithm, where n denotes the number of sinks [27]. As observed, the greedy-DME algorithm merges the nearest neighbors, which may be sink nodes or roots of partial trees with different heights. Therefore, as a result of this bottom-up approach, which tries to minimize the total wirelength heuristically, the generated topology has fewer internal nodes (i.e., clock splitters) than a fully-balanced topology, which always merges sub-tree roots with the same height. Consequently, the total number of JJs in the clock tree generated by the greedy-DME algorithm is smaller than in the tree produced by the approach proposed in Section III-B (i.e., MMM [29]).

TABLE III: Simulation results (clock skew, total negative hold slack (TNS), worst negative hold slack (WNS), and clock frequency) for several benchmarks using the proposed method and the baseline method [20]. Impr. stands for improvements over the baseline. Freq. denotes the max clock frequency. Post-pl. and post-rt. stand for post-placement and post-routing, respectively. Values for skew, WNS, and TNS are in ps.

| Benchmark | Proposed skew, post-pl. | Skew impr. (%) | Proposed skew, post-rt. | Proposed TNS | TNS impr. (%) | Proposed WNS | WNS impr. (%) | Proposed freq. | Baseline skew | Baseline TNS | Baseline WNS | Baseline freq. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KSA4 | 2.7 | 66.3 | 3.6 | 0 | N/A | 0 | N/A | 37.6 | 8.0 | 0.0 | 0.0 | 34.6 |
| KSA8 | 5.3 | 47.0 | 8.4 | 0 | 100 | 0 | 100 | 26.7 | 10.0 | -1.4 | -1.4 | 21.9 |
| KSA16 | 4.5 | 59.5 | 6.7 | -1.3 | 38.1 | -1.3 | -8.3 | 25.2 | 11.1 | -2.1 | -1.2 | 26.5 |
| KSA32 | 4.6 | 52.1 | 9.1 | -0.6 | 76.0 | -0.6 | 45.5 | 14.2 | 9.6 | -2.5 | -1.1 | 13.4 |
| ArrMult8 | 4.0 | 70.4 | 8.3 | 0 | 100 | 0 | 100 | 17.2 | 13.5 | -10.0 | -2.8 | 16.0 |
| ArrMult16 | 6.1 | 63.5 | 14.2 | -3.2 | 89.9 | -1.8 | 62.5 | 9.4 | 16.7 | -31.6 | -4.8 | 9.1 |
| ID4 | 4.4 | 60.0 | 7.2 | 0 | 100 | 0 | 100 | 19.5 | 11.0 | -3.3 | -2.2 | 18.7 |
| ID8 | 4.4 | 73.7 | 9.7 | -0.5 | 98.9 | -0.5 | 88.6 | 6.5 | 16.7 | -45.9 | -4.4 | 6.4 |
| c432 | 3.6 | 75.5 | 5.9 | -0.4 | 94.2 | -0.4 | 83.3 | 4.8 | 14.7 | -6.9 | -2.4 | 4.8 |
| c499 | 4.9 | 84.5 | 9.5 | -11 | 87.5 | -2.7 | 77.1 | 6.5 | 31.6 | -88.3 | -11.8 | 6.5 |
| c880 | 3.9 | 78.7 | 7.0 | -6 | 73.8 | -3.8 | 0.0 | 10 | 18.3 | -22.9 | -3.8 | 9.8 |
| c1355 | 5.6 | 78.7 | 8.6 | -17.3 | 77.6 | -2.1 | 85.8 | 6.9 | 26.3 | -77.1 | -14.8 | 7.0 |
| c1908 | 4.8 | 64.4 | 9.9 | -3.6 | 76.8 | -1.6 | 15.8 | 4.3 | 13.5 | -15.5 | -1.9 | 4.3 |
| c2670 | 4.1 | 72.7 | 6.6 | -13.7 | 79.6 | -2.9 | 59.7 | 3.6 | 15.0 | -67.3 | -7.2 | 3.7 |
| c3540 | 4.3 | 69.1 | 8.0 | -18.1 | 20.3 | -5.2 | -188.9 | 4.4 | 13.9 | -22.7 | -1.8 | 4.5 |
| c5315 | 5.0 | 70.9 | 9.0 | -13 | 84.1 | -2.9 | 58.0 | 2.6 | 17.2 | -81.9 | -6.9 | 2.6 |
| c6288 | 5.7 | 67.8 | 15.3 | -14.8 | 71.2 | -2.6 | 43.5 | 4.5 | 17.7 | -51.3 | -4.6 | 4.5 |
| Average | 4.6 | 70.5 | 8.6 | -6.1 | 80.4 | -1.7 | 60.5 | - | 15.6 | -31.2 | -4.3 | - |

B. Splitter-Delay-Aware Clock Tree Embedding

In the zero-skew clock tree embedding algorithm proposed in Section III-C, we did not need to account for the splitter delays during the embedding of the tree. The reason was that the generated clock topology was already fully-balanced, so all the source-sink paths had the same number of splitters. Consequently, all the phase delays included an identical delay value associated with the sum of the delays of the clock splitters in each source-sink path, which did not affect the clock skew. In the DME algorithm, the location of a merging segment is a function of the location and downstream delay of its child merging segments [23]. Accordingly, assuming an imbalanced tree topology, splitter delays play an important role in determining the location of the merging segments and embedding points of a clock tree. However, the DME algorithm does not consider the delay of splitter cells in the clock tree. To address this issue, we modify the formulation of the DME algorithm as follows. Similar to the original algorithm, clock tree embedding is done in two phases. In the bottom-up pass, after merging two child nodes, we add the delay of the splitter cell to the total downstream delay of the parent node. Consequently, when merging two sub-trees, the algorithm accounts for both the wire delay of each sub-tree and the delay associated with the splitters inserted in each sub-tree. Accordingly, the locations of the merging segments differ from the ones produced by the original DME algorithm. The top-down phase of the DME algorithm remains the same. Finally, a zero-skew embedding is generated that accounts for both splitter and interconnect delays.

Fig. 7 depicts an imbalanced tree topology and the placement of internal nodes, using the original and the modified DME algorithms, for an example with 3 sinks. The delays associated with the different edges are also shown. Assume that the delay of a splitter cell is 2 units (of delay) and that a path-length delay model is used. As depicted in Fig. 7b, which is generated using the original DME algorithm that ignores the splitter delays, the phase delays of sinks $s_1$-$s_3$ are 10, 10, and 8, respectively. Hence, the clock skew is 2. Conversely, Fig. 7c depicts the location of merging segment $ms_r$ considering the splitter delay values. Once $ms_a$ is formed, its downstream delay, which is originally set to 3, is modified by adding the delay of the splitter cell corresponding to node a. Hence, the downstream delay of node a becomes 5. Accordingly, the merging segment $ms_r$ is formed further away from node $s_3$ and closer to node a, to create a zero-skew merging segment. Consequently, the phase delays of sinks $s_1$-$s_3$ are all equal to 9 and the clock skew becomes 0.
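The three-sink example can be reproduced with a short Python sketch. The helper names are hypothetical, and the distance of 9 between node a and sink $s_3$ is inferred from the edge labels of Fig. 7 (3 + 6 in panel b, 2 + 7 in panel c); the splitter delay of 2 units and the path-length delay model are taken from the example.

```python
D_SPL = 2.0  # splitter delay used in the three-sink example (delay units)


def merge(dist, t_a, t_b):
    """Zero-skew split of `dist` between two children (path-length model)."""
    e_a = (dist + t_b - t_a) / 2.0
    if e_a < 0:          # a is slower even with no wire: lengthen b's branch
        return 0.0, t_a - t_b
    if e_a > dist:       # b is slower even with no wire: lengthen a's branch
        return t_b - t_a, 0.0
    return e_a, dist - e_a


def embed_three_sinks(splitter_aware):
    """Return the phase delays of (s1, s2, s3) for the imbalanced tree."""
    # Node a merges s1 and s2, which are 6 units apart (3 + 3).
    e_1, e_2 = merge(6.0, 0.0, 0.0)
    t_a = e_1                        # wire-only downstream delay of a: 3
    if splitter_aware:
        t_a += D_SPL                 # modified DME: add the splitter at a
    # Root r merges node a and sink s3 (distance 9, inferred from Fig. 7).
    e_a, e_3 = merge(9.0, t_a, 0.0)
    # Actual phase delays always include the splitters at r and at a.
    d_s1 = D_SPL + e_a + D_SPL + e_1
    d_s2 = D_SPL + e_a + D_SPL + e_2
    d_s3 = D_SPL + e_3
    return d_s1, d_s2, d_s3


print(embed_three_sinks(splitter_aware=False))  # (10.0, 10.0, 8.0): skew 2
print(embed_three_sinks(splitter_aware=True))   # (9.0, 9.0, 9.0): skew 0
```

The only difference between the two runs is the downstream delay handed to the root merge (3 versus 5), which shifts the root one unit toward node a.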
Fig. 7: An example of applying the DME algorithm to a circuit with 3 sink nodes. (a) An imbalanced tree topology. The location of sinks, merging segments, and the embedding points of the internal nodes of the tree using (b) the original DME algorithm and (c) the splitter-delay-aware DME algorithm are shown.

The splitter-delay-aware clock tree embedding algorithm is applied to the imbalanced topology to calculate the location of the clock splitters. In the next subsection, we describe the modifications made to the placement and legalization algorithm to properly handle imbalanced tree topologies.

C. Splitter-Delay-Aware Clock Tree Placement and Legalization

The clock tree placement and legalization algorithm should account for splitter delays, while minimizing the clock skew and removing the overlaps. To do so, we modify the formula that calculates the insertion delay at each sink node (in formulations (16) and (18)) as follows.

$$
D_i = L_i \cdot d_{spl} + \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^x_{k,j} + \delta^y_{k,j} \right)
\tag{19}
$$

where $L_i$ denotes the level of sink $i$ and $d_{spl}$ denotes the delay of the splitter cells. As observed, the first term in the above formula is a constant value, a function of the tree topology and not one of the variables of the MILP formulation for clock tree legalization. Once the above modifications are made to the flow of Fig. 2a, the proposed clock tree synthesis algorithm can be used for both balanced and imbalanced tree topologies, possibly generating min-skew solutions with fewer clock splitters than the approach proposed in Section III. The splitter-delay-aware CTS flow tries to increase the length of the wires along the source-sink paths that have fewer splitters, to minimize the difference between the max and min insertion delays and hence to minimize the maximum clock skew. In the next subsection, the results of applying the CTS algorithm for imbalanced topologies to multiple benchmark circuits are presented and compared to the baseline solution.

D. Simulation Results for Imbalanced Clock Tree Topologies

We used the greedy-DME algorithm for imbalanced clock topology generation [27] and added the aforementioned modifications to the qCTS flow. Table IV lists the clock skew, total negative hold slack (TNS), worst negative hold slack (WNS) (in ps), the number of clock splitters, the maximum clock frequency, and the imbalance degree (the maximum level difference among sinks) for several benchmarks obtained by applying the proposed algorithm, and compares them with the baseline solution [20]. As shown, the average number of clock splitters and the average clock skew value are reduced by 37% and 56%, respectively, compared with the baseline solution [20] described in Section IV. The average clock skew value over all 17 benchmarks is 6.8ps. Additionally, the average total negative hold slack and average worst negative hold slack values are improved by 32% and 51%, respectively. Table IV also lists the imbalance degree of the clock trees for different benchmarks. As shown, some of the tree topologies have an imbalance degree as large as 8, i.e., there exists a source-sink path that has 8 fewer splitters than a path with the maximum number of splitters. Such large differences can potentially cause a large skew (i.e., 8 × $d_{spl}$ = 44ps), if a CTS algorithm ignores the splitter delays.
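The 44 ps figure and the role of the constant term in Equation (19) can be checked with a few lines of Python. This is a sketch under stated assumptions: $L_i$ is read as the number of splitters on the source-sink path, and the levels (10 and 2) and the wire lengths are made-up values chosen to give an imbalance degree of 8; only $d_{spl}$ and $\alpha$ come from Table I.

```python
D_SPL = 5.5    # splitter delay in ps (Table I)
ALPHA = 1e-2   # delay constant in ps/um (Table I)


def phase_delay(level, wire_lengths_um):
    """Eq. (19): constant splitter term plus wire delay along the path."""
    return level * D_SPL + ALPHA * sum(wire_lengths_um)


def skew(delays):
    """Clock skew: largest minus smallest phase delay."""
    return max(delays) - min(delays)


# Two hypothetical sinks whose paths differ by 8 splitters (levels 10 and 2).
deep = phase_delay(10, [1000.0])
shallow_equal_wire = phase_delay(2, [1000.0])
print(skew([deep, shallow_equal_wire]))        # about 44 ps if wires are equal

# Lengthening the shallow path by 44 / ALPHA = 4400 um cancels the difference.
shallow_compensated = phase_delay(2, [1000.0, 4400.0])
print(skew([deep, shallow_compensated]))       # about 0 ps
```

In the actual flow the splitters can move only by a bounded amount during legalization, so full compensation of this kind is not always available.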
However, the proposed embedding and legalization algorithms modify the location of the splitters and the delay of the interconnects along all the source-sink paths in such a way that the effect of splitter delays on the maximum clock skew is balanced out by the wire delays. Consequently, the maximum clock skew among all the benchmarks is limited to 10.1ps, approximately equal to the delay of only 2 splitters. As shown in Tables III and IV, using the same clock tree embedding and legalization algorithms, imbalanced tree topologies create a trade-off between the number of clock splitters (and hence the total static power consumption and the total biasing current delivered to the network) and the total negative hold slack values, compared with balanced topologies (cf. Section III). This suggests that the timing closure flow may need more hold buffers to fix all the hold time violations for imbalanced topologies, compared with balanced trees. As a future direction, we plan to quantify the effect of process variations on the timing yield of the circuits and to design a clock tree synthesis algorithm and a timing closure flow that minimize timing violations and improve the timing yield in the presence of process variations.

VI. CONCLUSION

This paper presents a minimum-skew clock tree synthesis methodology for single flux quantum logic circuits, called qCTS. The qCTS algorithm first builds a fully-balanced tree topology considering the placement of the sequential elements in a circuit, such that there is an equal number of splitters from the clock source to every clock sink. The fully-balanced tree topology removes the effect of splitter delays on the clock skew. The location of the clock splitters is then calculated using a zero-skew embedding algorithm. Finally, using a novel mixed integer linear programming based method, overlaps among clock splitters and logic cells are removed while the clock skew is minimized.
The qCTS method improves the state-of-the-art by accounting for splitter delays and placement blockages and reduces the clock skew by 70% on average over 17 benchmarks. Subsequently, this methodology is extended to minimize the clock skew given an imbalanced topology, in which the number of splitters from the root of the tree differs from one sink node to another. The modified algorithm generates solutions in which the average clock skew and the average number of splitters are reduced by 56% and 37%, respectively, compared with a fully-balanced clock tree synthesis and a greedy legalization algorithm.

ACKNOWLEDGMENT

The authors would like to thank Naveen Katam, Ghasem Pasandi, Ting-Ru Lin, and Bo Zhang from University of Southern California for providing tools and benchmarks used in this paper.

REFERENCES

[1] J. Koomey, “Worldwide electricity used in data centers,” Environ. Res. Lett., vol. 3, no. 3, pp. 034008-1–034008-8, Jul 2008.

TABLE IV: Clock skew, total negative hold slack (TNS), worst negative hold slack (WNS) (in ps), number of clock splitters, maximum clock frequency, and imbalance degree (maximum level difference among sinks), with comparisons against the baseline solution [20]. Impr., Freq., and Imb. Deg. stand for improvement, maximum clock frequency, and the imbalance degree of the clock tree, respectively.
Benchmark  Skew (ps)  Impr. (%)  Hold TNS (ps)  Impr. (%)  Hold WNS (ps)  Impr. (%)  #Clk Spl.  Impr. (%)  Freq.  Imb. Deg.
KSA4             4.9       38.8           -1.4        N/A           -1.4        N/A         58        7.9   38.8          2
KSA8             5.2       48.0           -3.1     -121.4           -1.7      -21.4        158       38.0   22.7          4
KSA16            5.5       50.5              0        100              0        100        413       19.2   26.3          3
KSA32            6.0       37.5           -2.6       -4.0           -1.3      -18.2       1048       48.8   14.1          8
ArrMult8         6.0       55.6           -7.6       24.0           -1.8       35.7       1403       31.5   15.3          5
ArrMult16        8.6       48.5          -42.5      -34.5           -4.3       10.4       4797       41.4    9.5          8
ID4              5.0       54.5              0        100              0        100        419       18.0   19.2          3
ID8              6.0       64.1          -14.8       67.8           -2.6       40.9       2702       34.0    6.5          6
c432             5.8       60.5           -2.7       60.9           -0.7       70.8        975        4.7    4.9          6
c499            10.1       68.0          -31.2       64.7           -5.5       53.4        565       44.8    6.4          5
c880             6.8       62.8          -18.2       20.5           -2.9       23.7       1132       44.7    9.9          6
c1355            5.8       77.9          -17.4       77.4            4.3       70.9        617       39.7    6.6          4
c1908            6.3       53.3          -12.4       20.0           -2.4      -26.3       1099       46.3    4.3          8
c2670            9.2       38.7          -34.9       48.1           -3.5       51.4       1712       16.4    3.6          5
c3540            7.5       46.0          -35.9      -58.1           -5.3     -194.4       2678       34.6    4.4          6
c5315            8.9       48.3          -84.4       -3.1           -4.6       33.3       4482       45.3    2.6          6
c6288            8.1       54.2          -48.5        5.5           -2.8       39.1       5545       32.3    4.5          6
Average          6.8       56.4          -21.0       32.7           -2.1       51.2     1753.1       37.1

[2] T. V. Duzer and C. W. Turner, Principle of Superconducting Circuits. New York: Elsevier, 1981.
[3] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, “Energy-Efficient Superconducting Computing Power Budgets and Requirements,” IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, 1701610, Jun 2013.
[4] K. K. Likharev and V. K. Semenov, “RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems,” IEEE Transactions on Applied Superconductivity, vol. 1, no. 1, pp. 3–28, Mar 1991.
[5] W. Chen, A. V. Rylyakov, V. Patel, J. E. Lukens, and K. K. Likharev, “Rapid single flux quantum T-flip flop operating up to 770 GHz,” IEEE Transactions on Applied Superconductivity, vol. 9, no. 2, pp. 3212–3215, Jun 1999.
[6] A. Mukhanov, “Energy-efficient single flux quantum technology,” IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 760–769, Jun 2011.
[7] S. Polonsky, “Delay insensitive RSFQ circuits with zero static power dissipation,” IEEE Transactions on Applied Superconductivity, vol. 9, pp. 3535–3538, Jun 1999.
[8] A. H. Silver and Q. P. Herr, “A new concept for ultra-low power and ultra-high clock rate circuits,” IEEE Transactions on Applied Superconductivity, vol. 11, pp. 333–336, Jun 2001.
[9] O. T. Oberg, Q. P. Herr, A. G. Ioannidis, and A. Y. Herr, “Integrated power divider for superconducting digital circuits,” IEEE Transactions on Applied Superconductivity, vol. 21, pp. 571–574, Jun 2011.
[10] Y. Yamanashi, T. Nishigai, and N. Yoshikawa, “Study of LR-loading technique for low-power single flux quantum circuits,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 150–153, Jun 2007.
[11] L. R. Eaton and M. W. Johnson, “Superconducting constant current source,” 2009. [Online]. Available: U.S. Patent 7 002 366 B2
[12] D. E. Kirichenko, A. F. Kirichenko, and S. Sarwana, “No static power dissipation biasing of RSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 21, pp. 776–779, Jun 2011.
[13] M. Tanaka, M. Ito, A. Kitayama, T. Kouketsu, and A. Fujimaki, “18-GHz, 4.0-aJ/bit operation of ultra-low-energy rapid single-flux-quantum shift registers,” Japanese Journal of Applied Physics, vol. 51, no. 5, pp. 053102-1–053102-4, May 2012.
[14] M. Pedram and Y. Wang, “Design automation methodology and tools for superconductive electronics,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2018, pp. 1–6.
[15] C. J. Fourie, K. Jackman, M. M. Botha, S. Razmkhah, P. Febvre, C. L. Ayala, Q. Xu, N. Yoshikawa, E. Patrick, M. Law, Y. Wang, M. Annavaram, P. Beerel, S. Gupta, S. Nazarian, and M. Pedram, “ColdFlux superconducting EDA and TCAD tools project: Overview and progress,” IEEE Transactions on Applied Superconductivity, vol. 29, no. 5, pp. 1–7, Aug 2019.
[16] S. N. Shahsavani, A. Shafaei, and M. Pedram, “A placement algorithm for superconducting logic circuits based on cell grouping and supercell placement,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 1465–1468.
[17] R. N. Tadros and P. A. Beerel, “A robust and tree-free hybrid clocking technique for RSFQ circuits - CSR application,” in 2017 16th International Superconductive Electronics Conference (ISEC), June 2017, pp. 1–4.
[18] N. Kito, K. Takagi, and N. Takagi, “A fast wire-routing method and an automatic layout tool for RSFQ digital circuits considering wire-length matching,” IEEE Transactions on Applied Superconductivity, vol. 28, no. 4, pp. 1–5, June 2018.
[19] K. Han, A. B. Kahng, and J. Li, “Optimal generalized H-tree topology and buffering for high-performance and low-power clock distribution,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1–1, 2018.
[20] S. N. Shahsavani, T. R. Lin, A. Shafaei, C. J. Fourie, and M. Pedram, “An integrated row-based cell placement and interconnect synthesis tool for large SFQ logic circuits,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–8, June 2017.
[21] A. B. Kahng, J. Lienig, I. L. Markov, and J. Hu, VLSI Physical Design: From Graph Partitioning to Timing Closure, 1st ed. Springer Publishing Company, Incorporated, 2011.
[22] L.-T. Wang, Y.-W. Chang, and K.-T. T. Cheng, Eds., Electronic Design Automation: Synthesis, Verification, and Test. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2009.
[23] J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao, “Bounded-skew clock and Steiner routing,” ACM Trans. Des. Autom. Electron. Syst., vol. 3, no. 3, pp. 341–388, Jul. 1998.
[24] K. Gaj, E. G. Friedman, and M. J. Feldman, “Timing of multi-gigahertz rapid single flux quantum digital circuits,” J. VLSI Signal Process. Syst., vol. 16, no. 2-3, pp. 247–276, Jul. 1997.
[25] K. Takagi, Y. Ito, S. Takeshima, M. Tanaka, and N.
Takagi, “Layout-Driven Skewed Clock Tree Synthesis for Superconducting SFQ Circuits,” IEICE Transactions, vol. 94-C, no. 3, pp. 288–295, 2011.
[26] S. Y. Yoshio Kameda and Y. Hashimoto, “A New Design Methodology for Single-Flux-Quantum (SFQ) Logic Circuits Using Passive-Transmission-Line (PTL) Wiring,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 508–511, Jun 2007.
[27] M. Edahiro, “A clustering-based optimization algorithm in zero-skew routings,” in 30th ACM/IEEE Design Automation Conference, June 1993, pp. 612–616.
[28] J. Cong, A. B. Kahng, and G. Robins, “Matching-based methods for high-performance clock routing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 12, no. 8, pp. 1157–1169, Aug 1993.
[29] M. A. Jackson, A. Srinivasan, and E. S. Kuh, “Clock routing for high-performance ICs,” in Design Automation Conference, 1990. Proceedings., 27th ACM/IEEE. IEEE, 1990, pp. 573–579.
[30] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. San Francisco, CA: Addison-Wesley, 2005.
[31] I. Griva, S. Nash, and A. Sofer, Linear and Nonlinear Optimization: Second Edition, ser. Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics, 2009.
[32] “IBM ILOG CPLEX.” [Online]. Available: www.ilog.com/products/cplex/
[33] S. K.
Tolpygo, “Superconductor digital electronics: Scalability and energy efficiency issues (review article),” Low Temperature Physics, vol. 42, no. 5, pp. 361–379, 2016. [Online]. Available: https://doi.org/10.1063/1.4948618
[34] T. R. Lin, T. Edwards, and M. Pedram, “qGDR: A Via-Minimization-Oriented Routing Tool for Large-Scale Superconductive Single-Flux-Quantum Circuits,” IEEE Transactions on Applied Superconductivity, vol. 29, no. 7, pp. 1–12, 2019.
[35] M. C. Hansen, H. Yalcin, and J. P. Hayes, “Unveiling the ISCAS-85 benchmarks: A case study in reverse engineering,” IEEE Des. Test, vol. 16, no. 3, Jul. 1999.