
A Minimum-Skew Clock Tree Synthesis Algorithm

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASC.2019.2943930, IEEE
Transactions on Applied Superconductivity
IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY, VOL. X, NO. Y, Z 2019
A Minimum-Skew Clock Tree Synthesis Algorithm
for Single Flux Quantum Logic Circuits
Soheil Nazar Shahsavani and Massoud Pedram, Fellow, IEEE
Abstract—This paper presents a synchronous minimum-skew
clock tree synthesis algorithm for single flux quantum circuits
considering splitter delays and placement blockages. The proposed methodology improves the state-of-the-art by accounting
for splitter delays and creating a fully-balanced clock tree
structure in which the number of clock splitters from the clock
source to all the sink nodes is identical. Additionally, a mixed
integer linear programming (MILP) based algorithm is presented
that removes the overlaps among the clock splitters and placed
cells (i.e., placement blockages) and minimizes the clock skew,
simultaneously. Using the proposed method, the average clock
skew for 17 benchmark circuits is 4.6 ps, improving on the state-of-the-art algorithm by 70%. Finally, a clock tree synthesis
algorithm for imbalanced topologies is presented that reduces
the clock skew and the number of clock splitters in the clock
network by 56% and 37%, respectively, compared with a fully-balanced clock tree solution.
Index Terms—Single flux quantum (SFQ), superconducting
electronics, physical design, placement, legalization, clock tree
synthesis.
I. INTRODUCTION
CONVENTIONAL computing based on CMOS technology and metal interconnects has faced substantial issues
in terms of total power consumption and energy efficiency [1].
Superconducting computing based on the Josephson effect is a
promising replacement for CMOS technology aiming at high-performance and energy-efficient computing [4]. Josephson
junctions (JJs), basic circuit elements in single flux quantum
(SFQ) technology, have a rapid switching speed (∼1 ps) and
low switching energy (∼$10^{-19}$ J/bit) at temperatures about
4 K [2], [3]. Rapid single flux quantum (RSFQ) technology
was introduced in the 1980s. It uses quantized voltage pulses in
digital data generation and memorization [4]. RSFQ circuits
have been shown to be functional at operating frequencies
of up to 770 GHz [5]. Recent developments introduce new
SFQ logic families, such as energy-efficient single flux quantum technology (ERSFQ/eSFQ) [6], dual-rail RSFQ [7], self-clocked complementary logic (SCCL) [8], reciprocal quantum
logic (RQL) [9], novel approaches including re-design of
the current biasing network for RSFQ [10], [11], [12], and
application of low supply voltage for RSFQ circuits [13].
Manuscript received April 15, 2019; accepted August 14, 2019. The
research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity
(IARPA), via the U.S. Army Research Office grant W911NF-17-1-0120. This
project is also supported in part by the Software and Hardware Foundations
program of the National Science Foundation (NSF) under Grant No. 1619473.
S. Nazar Shahsavani and M. Pedram are with the Department of Electrical
and Computer Engineering, University of Southern California, Los Angeles,
CA 90007 USA (e-mail: nazarsha@usc.edu; pedram@usc.edu).
In spite of the extraordinary characteristics of SFQ logic (including but not limited to high frequency and low energy dissipation), design automation methodologies and tools are less
sophisticated than those of CMOS technology, preventing
SFQ logic from becoming a realistic option for realizing large-scale,
high-performance, and energy-efficient computing systems of
the future [3]. Although many advanced techniques have
been developed for computer-aided design (CAD) for CMOS
technology, these techniques cannot be directly applied to the
design of SFQ circuits due to key differences between the two
technologies. Some of these differences are (i) different active
and passive components (JJs and inductors vs. transistors and
capacitors), (ii) various types of logic gates and clocking
structures, and (iii) the need for path-balancing D flip-flops
(DFF), splitters, and biasing networks which increases the total
cost of integration in terms of area and power consumption
[14]. To address the aforementioned issues, researchers have
started focusing on the development of front-end and back-end tools and methodologies for design automation of superconducting electronics to enable very large scale integration
(VLSI) design and verification of superconductive electronics
(SCE) as a step toward the development of energy-efficient
and high-performance computers [15].
Physical design of logic circuits, especially the synthesis of
clock distribution network (CDN), plays an important role in
designing high-performance circuits robust to process-induced
variations. The layout of large circuits requires automated
placement, clock network synthesis, and routing tools. Recent
efforts by researchers have introduced effective techniques for
placement, design of CDNs, and routing for large SFQ circuits
[16]–[18].
Clock network synthesis is a crucial task in physical design
of logic circuits as the clock network takes up substantial
routing resources, consumes significant power, and determines
the maximum frequency of the circuits. Minimizing the clock
skew (i.e., the maximum difference in the arrival time of
the clock signal at two different clock sinks) is of great
importance since the clock skew directly limits the maximum
achievable frequency of a circuit [19]. In SFQ logic circuits,
the clock signal should be delivered to nearly all logic cells in
the design. Therefore, to maximize the performance, a well-balanced minimum-skew clock tree structure is an absolute
requirement.
Previous zero-skew clock tree synthesis methods for SFQ
circuits fail to produce high-quality solutions because they do
not consider the delay of splitter cells (which are required to
distribute the clock signal to sequential gates) and placement
blockages (already placed logic cells) [20]. Additionally, the
population density of cells in different regions of the chip
can be very different which can result in a highly-imbalanced
clock tree topology, i.e., one where the maximum difference
between splitter counts from the root of the clock tree to any
pair of leaf nodes is large.
1051-8223 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This paper presents an algorithm for a fully-balanced
clock tree topology construction and a min-skew clock tree
placement and legalization algorithm using a mixed integer
linear programming (MILP) formulation to perform clock
tree construction, splitter insertion, and skew minimization
under the given placement blockages, considering both splitter
and interconnect delays. The proposed clock tree topology
generation algorithm guarantees that the maximum difference
between the number of splitters from the clock source to any
pair of sinks is zero. The effectiveness of the proposed CTS
algorithm is verified using multiple SFQ circuits. The main
contributions of this paper can be summarized as follows.
• An algorithm is presented that creates a fully-balanced clock
tree in which the maximum difference between the number
of splitters from the root of the tree to any pair of leaf nodes
is zero.
• A min-skew clock tree placement and legalization algorithm
is presented that places the clock splitters in the routing
channels, i.e., empty spaces between the placement rows,
and eliminates the overlaps among the clock splitters and
logic cells while minimizing the skew.
• Using the proposed technique, the average clock skew for
17 benchmark circuits is 4.6 ps. This approach improves the
state-of-the-art method by 70%.
• The proposed CTS algorithm is extended to generate a
minimum-skew solution given imbalanced clock tree topologies. The modified algorithm reduces the clock skew and the
number of clock splitters in the clock network by 56% and
37%, respectively, compared with a fully-balanced clock tree
solution.
The rest of the paper is organized as follows. Background
and prior work are discussed in Section II. Our SFQ-specific clock tree synthesis methodology, including topology
construction, splitter placement, and legalization algorithms,
is discussed in Section III. Simulation results obtained by
applying the proposed method to multiple benchmark circuits
are reported in Section IV. A clock tree synthesis methodology
for imbalanced topologies is presented in Section V. Finally,
the paper is concluded in Section VI.
II. BACKGROUND
A. Definitions
In this section, we summarize some definitions and notations
used throughout this paper.
• Clock phase delay refers to the delay from the clock
source to any of the clock sinks (i.e., sequential elements
such as flip-flops or latches). Phase delay, also known as
insertion delay, increases as the feature size decreases and
chip size increases. The phase delay is typically a combination of gate delay (e.g., buffers, clock gating elements,
and clock dividers) and interconnect delay. As the feature
size decreases, the effect of process and on-chip variations
(OCV) on phase delay increases, which in turn affects the
clock uncertainty [19]. Accordingly, minimizing phase delay
values is beneficial in reducing the clock uncertainty.
• Clock skew: Two flip-flops i and j connected by combinational gates and interconnects are called sequentially
adjacent flip-flops (cf. Fig. 1). Clock skew between nodes
i and j is defined as the difference between clock arrival
times (phase delay values) at these two nodes. In this paper,
clock skew for a circuit is defined as the maximum skew
between any two flip-flops. Equations for calculating clock
skew are as follows.
$$\mathrm{skew}_{i,j} = T_i - T_j \tag{1}$$
$$\mathrm{skew}_{\max} = \max_{1 \le i,j \le n} |T_i - T_j| \tag{2}$$
where $T_i$ and $n$ denote the clock arrival time at sink $i$ and
the total number of clock sinks, respectively.
Fig. 1: A pair of sequentially adjacent flip-flops i and j connected
by a combinational gate and interconnects.
Timing constraints can be categorized as setup and hold time
constraints, defined as follows.
Setup time: is the amount of time that the input to the
capturing flip-flop ($FF_j$) should stay valid before the next
triggering clock edge arrives [21]. The following inequality
summarizes the relation between clock skew, clock period,
and setup time.
$$T_p \ge \mathrm{skew}_{i,j} + t_{c2Q}^{\max} + t_{comb}^{\max} + t_{setup}^{\max} \tag{3}$$
where $t_{c2Q}^{\max}$ denotes the maximum clock-to-Q delay of a
flip-flop, $t_{comb}^{\max}$ accounts for the maximum delay through
combinational logic (which also includes the interconnect
delay), $T_p$ represents the clock cycle time, and $t_{setup}^{\max}$ denotes
the maximum setup time for a flip-flop. As shown, a positive
clock skew increases the clock cycle time. On the other
hand, a negative clock skew (if the clock signal is received
at the launching flip-flop earlier than the capturing flip-flop)
decreases the effective clock period.
Hold time: To ensure the proper propagation of an input
signal through a flip-flop, the input must remain valid
or hold steady for a short duration after the clock edge,
referred to as the hold time [22]. The hold time of the
capturing flip-flop imposes an additional constraint on the
total propagation delay of a signal through the launching
flip-flop and the combinational logic as follows.
$$\mathrm{skew}_{i,j} \ge t_{hold}^{\max} - t_{c2Q}^{\min} - t_{comb}^{\min} \tag{4}$$
In the worst case, the input signal at the capturing flip-flop
(j) should remain stable for $t_{hold}^{\max}$ after the clock edge of
the same clock cycle arrives at node j. If the clock signal
arrives at $FF_i$ earlier than at $FF_j$, the input
signal to $FF_j$ may change before $FF_j$ can capture it.
As shown above, the clock skew directly limits the maximum clock frequency of a circuit and reduces the available
positive time slack for setup constraints. Additionally, a
negative clock skew may result in hold time violations.
Unlike setup time violations, hold time violations cannot
be fixed by increasing the clock period.
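The relations in Eqs. (1) through (4) can be checked numerically. The following Python sketch is only an illustration: the function names and all delay values (in ps) are hypothetical and are not taken from the benchmark circuits of this paper.

```python
def max_skew(arrival):
    """Eq. (2): largest |T_i - T_j| over all pairs of clock sinks."""
    times = list(arrival.values())
    return max(times) - min(times)


def min_clock_period(skew_ij, t_c2q_max, t_comb_max, t_setup_max):
    """Smallest clock period T_p that satisfies the setup inequality, Eq. (3)."""
    return skew_ij + t_c2q_max + t_comb_max + t_setup_max


def hold_ok(skew_ij, t_hold_max, t_c2q_min, t_comb_min):
    """True if the hold inequality, Eq. (4), is satisfied."""
    return skew_ij >= t_hold_max - t_c2q_min - t_comb_min


# Hypothetical clock arrival times (ps) at launching FF_i and capturing FF_j.
arrival = {"ff_i": 12.0, "ff_j": 10.0}
skew_ij = arrival["ff_i"] - arrival["ff_j"]  # Eq. (1): T_i - T_j = 2.0 ps

print(max_skew(arrival))                          # 2.0
print(min_clock_period(skew_ij, 5.0, 20.0, 3.0))  # 30.0
print(hold_ok(skew_ij, 4.0, 3.0, 2.0))            # True
print(hold_ok(-4.0, 4.0, 3.0, 2.0))               # False: negative skew breaks hold
```

With these assumed numbers, a positive skew of 2 ps lengthens the minimum clock period, whereas a skew of −4 ps violates the hold inequality regardless of the clock period.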
Actual Arrival Time (AAT): is defined as the latest transition time at a given node in a circuit, measured from the
beginning of the clock cycle [21].
Required Arrival Time (RAT): is defined as the latest time
at which a signal should arrive at a given node, such that
the circuit works correctly, given setup or hold constraints.
Timing Slack: For each node in a circuit (e.g., pins or gates)
timing slack is calculated as the difference between RAT and
AAT at that node. While a positive slack means that the
timing constraint is satisfied (i.e., the signal arrives earlier
than it is required), a negative slack is an indicator of a
(setup or hold) timing violation (i.e., the signal arrives after
its required time) [21].
• Clock Tree Topology: A clock tree topology is defined as
a binary tree G, in which each node has a maximum of 2
children, is rooted at the clock source R, and has a total
number of |S| leaf nodes representing the set of clock sinks
S. We define this tree topology to be a directed graph in
which edges are directed from parents to the children. The
level of each node i is defined as the number of nodes in
the longest path from the root of the tree to node i, denoted
by Li . The height of a node is the number of nodes on the
longest path from that node to a leaf node, denoted by Hi .
The height of a tree is defined as the height of the root node
of the tree.
• Clock Tree Embedding: A clock tree embedding determines the location of each internal (non-sink node) v of
the clock tree topology, denoted by pl(v), in the Manhattan
plane. If there is a connection between a parent node p and
a child node c, the cost of the edge ep,c , denoted as lp,c , is
defined as the Manhattan distance between pl(p) and pl(c).
The total wirelength of a tree is calculated as the sum of
the cost of all the edges of the tree.
Based on the above definitions, reducing the maximum clock
skew increases the maximum frequency of the circuit, reduces
the number of hold time violations, and facilitates the timing
closure of the design (i.e., fixing the timing violations of the
circuit, which is typically done after the placement and clock
tree synthesis steps).
B. Delay Model
Single flux quantum pulses are typically propagated over
long distances using passive transmission lines (PTL). PTL
micro-strips transmit the pulses with extremely low losses,
with a speed of approximately 1/3 of the speed of light in a
vacuum [3]. Equation (5) models the propagation delay as a
function of the length of the PTLs.
$$D = \frac{L}{\frac{1}{3}c} \approx \frac{L\,(\mu\mathrm{m})}{100\,\frac{\mu\mathrm{m}}{\mathrm{ps}}} \tag{5}$$
In Equation (5), D represents the delay over the PTL, L
represents the length of the PTL, and c denotes the speed
of light in a vacuum. As a result, phase delay from the root
of the tree R to a sink node Ci over a path path(R, Ci ) is
calculated as follows.
$$D_{R,C_i} = \frac{1}{100} \times \sum_{e_{j,k} \in path(R,C_i)} l_{j,k} \tag{6}$$
where lj,k denotes the Manhattan distance between clock
nodes Cj and Ck .
Although the SFQ signals can also be propagated using
Josephson transmission lines (JTLs), we do not use JTLs in
the global clock network, due to low propagation speed and
difficulties introduced in the routing stage. The introduced
delay model for PTLs is similar to the path-length delay
model used in CTS algorithms for early CMOS technology
nodes [23]. Although the proposed linear delay model is used
throughout the paper, other delay models can be integrated
into the proposed design flow.
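As an illustration of Eqs. (5) and (6), the sketch below evaluates the PTL phase delay along a root-to-sink path. The function names and the coordinates (in µm) are hypothetical; only the interconnect term of Eq. (6) is modeled, so splitter delays are not included.

```python
PTL_SPEED_UM_PER_PS = 100.0  # c/3, approximately 100 um/ps, as in Eq. (5)


def manhattan(p, q):
    """Manhattan distance between two (x, y) points, in um."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def phase_delay_ps(path):
    """Eq. (6): sum of edge lengths along the path, divided by 100 um/ps.

    `path` lists the (x, y) locations of the clock nodes from the root R
    to the sink C_i, in order.
    """
    total_um = sum(manhattan(a, b) for a, b in zip(path, path[1:]))
    return total_um / PTL_SPEED_UM_PER_PS


# Hypothetical path: root -> two internal nodes -> sink (650 um of PTL in total).
path = [(0, 0), (300, 0), (300, 200), (450, 200)]
print(phase_delay_ps(path))  # 6.5
```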
C. Prior Work
Multiple clock topologies have been proposed and the tradeoffs are discussed in [24]. Synchronous clock tree synthesis
using an H-tree structure has been proposed as the best option
for large circuits in terms of max clock frequency [24].
In [25], a layout driven CTS method is presented that groups
cells by logic level and propagates a skewed clock signal to
each logic group. For the first logic level, a clock tree is built
to propagate the clock signal to each gate such that timing
constraints are met. Then, the clock signal is passed to the root
of the clock tree for the next logic level. This work employs
splitter and JTL insertion and replacement of the logic cells
in the same logic level for timing adjustments.
An earlier SFQ design methodology presented in [26] first
synthesizes a zero-skew clock network utilizing an H-tree
structure. The proposed algorithm then places the cells on
predefined rectangular grid bins at the leaves of the clock tree,
using the min-cut placement algorithm. However, since the
placement slots are limited to grid bins and the placement is
done after the CTS, the quality of placement in terms of the
total wirelength and routability is degraded significantly.
A novel clocking methodology (called hierarchical chains
of homogeneous clover-leaves clocking) was proposed to
improve the robustness of clock networks to timing variations
[17]. In the proposed clock topology, frequency of the circuit
is determined by the structure and number of gates within
each clover-leaf. However, since this clocking method does
not consider the wire delays and no algorithms for placement
of the clock elements are presented, quantifying the maximum
clock frequency, overhead in terms of the size of the clock network, and comparison with synchronous zero-skew clocking
algorithms is not possible [17].
In [20], a CTS algorithm for H-tree and HL-tree clock
structures was presented and results were discussed. In HL-tree structures, an H-tree is used for global clock distribution
and a Linear tree (L) for local clock distribution [20]. Authors
in [16] provide an algorithm for optimizing the placement
of logic cells such that the HL-tree clock structures can be
utilized efficiently.
III. PROPOSED CLOCK TREE SYNTHESIS METHODOLOGY
Conventional clock tree synthesis methods for CMOS circuits do not consider the placement of the splitters and their
associated delay in clock skew minimization, as they use zero-area and zero-delay branching points for splitting the global
clock signal. However, in SFQ circuits, the splitter delay adds
to the overall insertion delay at the sink nodes. Additionally,
the placement of these clock splitters determines the delay of
each edge in the clock tree and therefore affects the insertion
delays. Moreover, clock splitters should be placed in legal
locations, i.e., should not have any overlaps with the placed
logic cells and should not violate the layout rules. To address
the aforementioned challenges, we propose a minimum-skew
clock tree synthesis considering splitter delays and placement
blockages. In the following subsections, the overall flow of
the proposed algorithm is presented and details of each step
are explained.
A. Overall Design Flow
The overall design flow of our clock tree synthesis algorithm
(called qCTS) is shown in Fig. 2a. The proposed approach
generates a minimum-skew clock tree such that all the clock
splitters are mapped to the routing channels between the
placement rows (i.e., the logic cells are placed inside the
rows and the clock splitters are placed between the rows).
The proposed methodology can be employed for both H-tree
and HL-tree clock structures. A sample output of the proposed
algorithm for a placed netlist is shown in Fig. 2b.
The inputs to the qCTS algorithm are (i) a placed netlist,
(ii) a list of clock sink nodes and their locations, and (iii) a
delay model. There are 4 steps in the proposed algorithm.
• In the first step (cf. Fig. 2a, Topology Generation), a
fully-balanced tree topology is generated to reduce the
maximum level difference among the clock sinks to zero.
After this stage, it is guaranteed that all the sink nodes have
the same level.
• In the second step (cf. Fig. 2a, Clock Tree Embedding),
the clock tree embedding algorithm generates a zero-skew
clock network and calculates the location of all the internal
nodes of the clock network, given the tree topology and the
location of the sink nodes.
• In the third step (cf. Fig. 2a, Splitter Insertion), the splitter
cells in the clock tree are placed at the location of the
embedding points of the clock network.
• In the final step (cf. Fig. 2a, Min-Skew Clock Tree Placement and Legalization), a MILP based approach is used to
map the clock splitters to the routing channels and to remove
the horizontal overlaps between the clock splitters.
Once the qCTS algorithm generates the CDN, the static timing
analysis (STA) tool calculates the maximum clock frequency
and hold/setup time slacks and the timing closure flow tries to
solve all the timing violations. In the following subsections,
each step is explained in detail.
B. Clock Topology Generation
In SFQ logic circuits, the clock signal is distributed to nearly
all the logic cells as most of the cells are sequential elements,
i.e., need a clock signal for synchronization. Conventional
clock topology generation methods do not consider the delay
of splitter cells (needed to distribute the clock signal to
multiple fan-outs). Additionally, the population density of
cells in different regions of the chip can result in a highly-imbalanced clock tree topology with a large level difference
among sink nodes.
Consider an example with 8 leaf nodes as depicted in Fig.
3a. As shown, the leaf nodes have different levels (e.g., nodes
9 and 14 have levels 4 and 3, respectively). The maximum
level difference among leaves is 2. Fig. 3b depicts a balanced
tree topology in which all the leaves have the same level.
Insertion delay at each leaf node is a combination of all the
splitter delays from the clock source to each leaf node and
interconnect delays. By creating a balanced tree topology, the
portion of the insertion delay corresponding to splitter delays
is balanced out among the leaf nodes, which helps reduce
the clock skew.
Fig. 2: (a) The overall flow of the proposed clock tree synthesis
algorithm, qCTS. (b) The proposed placement of logic cells (blue
rectangles, placed inside the rows) and clock splitters (black rectangles, placed between the rows) for a circuit with 32 logic gates and
8 rows. Rows are shown using red rectangles.
Fig. 3: Clock tree topologies for 8 leaf nodes (shown in blue). (a)
An imbalanced tree with a max level difference of 2 among leaves.
(b) A balanced tree with a max level difference of 0 among leaves.
Some of the clock topology generation algorithms for
CMOS circuits, such as greedy-DME or geometric matching,
create imbalanced topologies as they allow merging two subtrees with different heights [27], [28]. We intend to create
a fully-balanced binary tree in which the maximum level
difference among leaf nodes is equal to 0. For this purpose,
we propose using an algorithm similar to the method of means
and medians (MMM) [29]. The MMM algorithm, proposed
as one of the initial minimum-skew clock tree synthesis
algorithms, heuristically minimizes both the clock skew and
total wirelength of the CDN. The MMM method performs
topology generation and clock embedding simultaneously, in
a top-down manner. Note that in this work, we only use the
topology generation part of this algorithm.
The MMM algorithm recursively bi-partitions the set of
sinks in a region, creates a tree node for each of the sub-regions, and assigns these tree nodes as the children of the
parent node (corresponding to the original region) [29]. In
each step, sinks are sorted based on their x or y coordinate.
Assuming the sinks are ordered in x (y) coordinate, half of
the sinks are assigned to the left (bottom) sub-region and
the other half are assigned to the right (top) sub-region. For
each of the created sub-regions, a new tree node is created,
and the root of the current tree (the node corresponding to the
original region) is assigned as the parent of the two newly
created nodes (i.e., nodes left and right).
The same procedure is repeated for the two
created sub-regions, recursively, until the number of sinks in
each sub-region becomes less than 2. The MMM algorithm
finishes in log n steps and its run-time complexity
is O(n log n), where n is the number of clock sinks [29]. An
example is illustrated in Fig. 4. As shown, with 10 sink nodes,
the maximum level difference among the leaf nodes is 1.
The MMM algorithm always creates a topology in which
the max level difference among leaves is at most 1. The reason
is that it initially creates a fully-balanced binary tree with a height
of $\lceil \log n \rceil - 1$, without adding any sink nodes to the tree. At
this stage, in each created sub-region, there are either 2 or 1
sinks left. If there is only 1 sink left, that sink is assigned as a
leaf node of the tree and therefore does not increase the height
of the tree. If there are 2 sinks left, another bi-partitioning adds
two children to one of the leaf nodes of the tree and increases
the height of the tree to $\lceil \log n \rceil$. Consequently, leaves of the
tree have a max level difference of at most 1.
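The recursive bi-partitioning described above can be sketched in a few lines of Python. This is a minimal illustration of the topology generation part only, not the implementation used in qCTS. The function names and sink coordinates are hypothetical, the cut direction is assumed to alternate between x and y at each recursion level, and the reported depth counts edges from the root, which does not affect level differences.

```python
def mmm_topology(sinks, axis=0):
    """Recursively bi-partition sinks given as (name, x, y) tuples.

    Returns a nested tuple (left, right) for an internal node and the sink
    name (a string) for a leaf. Alternating the cut axis is an assumption.
    """
    if len(sinks) == 1:
        return sinks[0][0]
    ordered = sorted(sinks, key=lambda s: s[1 + axis])
    mid = len(ordered) // 2  # half of the sinks go to each sub-region
    left = mmm_topology(ordered[:mid], 1 - axis)
    right = mmm_topology(ordered[mid:], 1 - axis)
    return (left, right)


def leaf_depths(tree, depth=0, out=None):
    """Map each sink name to its depth (number of edges from the root)."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = depth
    else:
        for child in tree:
            leaf_depths(child, depth + 1, out)
    return out


# Ten hypothetical sinks with arbitrary coordinates.
sinks = [(f"s{i}", (37 * i) % 101, (53 * i) % 89) for i in range(10)]
depths = leaf_depths(mmm_topology(sinks))
print(max(depths.values()) - min(depths.values()))  # 1
```

For 10 sinks the sketch reproduces the level difference of 1 noted in the text; the difference becomes 0 only when the number of sinks is a power of two.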
To further reduce the max level difference to 0, we use JTL
cells. If the max sink level is lmax , we find all the sinks with
level lmax − 1 and add a JTL cell as their parent. Hence, their
level becomes lmax and the output clock tree becomes fully
balanced. To balance the insertion delay at the sink nodes and
reduce the max clock skew, we design special JTL cells to
have the same propagation delay as splitters.
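The JTL insertion step can be illustrated with the following sketch. The `Node` class, the function names, and the three-sink example tree are hypothetical, and the level of the root is assumed to be 1 (counting nodes on the path from the root). The sketch only restructures the topology; it does not model the delay of the JTL cells.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                   # "splitter", "jtl", or "sink"
    name: str = ""
    children: list = field(default_factory=list)


def sink_levels(node, level=1, out=None):
    """Map each sink name to its level (root level assumed to be 1)."""
    if out is None:
        out = {}
    if node.kind == "sink":
        out[node.name] = level
    for child in node.children:
        sink_levels(child, level + 1, out)
    return out


def insert_jtls(node, l_max, level=1):
    """Give every sink at level l_max - 1 a JTL cell as its parent (in place)."""
    for idx, child in enumerate(node.children):
        if child.kind == "sink" and level + 1 == l_max - 1:
            node.children[idx] = Node("jtl", children=[child])
        else:
            insert_jtls(child, l_max, level + 1)


# Hypothetical 3-sink topology: s1 sits one level above s2 and s3.
root = Node("splitter", children=[
    Node("sink", "s1"),
    Node("splitter", children=[Node("sink", "s2"), Node("sink", "s3")]),
])
l_max = max(sink_levels(root).values())
insert_jtls(root, l_max)
print(sink_levels(root))  # {'s1': 3, 's2': 3, 's3': 3}
```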
Given the location of all the sink nodes, the topology of
the clock tree is generated using the outlined method. Next,
the generated topology is passed to the clock tree embedding
step, which calculates the location of the embedding points of
the internal nodes of the clock tree in the Manhattan plane.
C. Clock Tree Embedding and Splitter Insertion
The generated clock tree topology along with the locations
of the sink nodes are the inputs to the clock tree embedding
step. In this step, the location of clock splitters is determined.
The goal is to construct a zero-skew clock tree, while minimizing the total wirelength of the clock network. We use
the deferred merge embedding (DME) algorithm to embed the
clock tree as it generates a zero-skew solution with minimum
cost in terms of wirelength, assuming a linear delay model
[23].
Fig. 4: The MMM clock topology generation algorithm applied to an
example with 10 sinks [29]. Blue rectangles and black circles show
the splitter cells and sinks of the clock tree, respectively.
The DME algorithm was developed according to the observation that there are multiple locations for an internal node in
a given topology which satisfy the skew specifications [23].
The DME algorithm constructs a clock tree in two phases: (i)
a bottom-up pass that finds all potential zero-skew merging
locations for two nodes, called merging segments (ms), as
a function of the distance between the child nodes and the
downstream delay of each child node. Downstream delay is
defined as the max delay from a node to its leaf nodes. (ii) a
top-down tree traversal in which the DME picks one location
on each merging segment. The DME algorithm has linear
time complexity given the input topology. For a complete
description of this algorithm please refer to [23].
An example of applying the DME algorithm to a clock tree
with 4 sinks is redrawn from [23] and shown in Fig. 5. Fig.
5a shows a balanced clock tree topology. The goal is to find the
location of internal nodes of the tree, a, b, and r. Fig. 5b shows
the location of sinks, merging segments, and the embedding
points of the internal nodes of the clock tree in the Manhattan
plane. In the bottom-up pass, the DME algorithm finds the
merging segments for the internal nodes. For instance, the
Manhattan distance between sinks s3 and s4 is 4. Therefore,
to generate a zero-skew solution, the distance between node
b and each of its two children should be 2. Accordingly, a
merging segment (msb ) is formed within a distance 2 from s3
and s4 . Similarly, msa is formed with a distance 3 from s1
and s2 . Note that the downstream delay of any node on msa
and msb is 3 and 2, respectively, assuming a path-length delay
model. Finally, the merging segment for node r is calculated
such that the maximum clock skew becomes 0.
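The arithmetic behind the bottom-up pass in this example can be sketched as follows, assuming the path-length delay model. The function name is hypothetical, the distance of 6 between s1 and s2 is inferred from the stated distance of 3 to msa, and the distance of 5 between msa and msb is an assumed value, since it is not given in the text. The case in which one subtree is so slow that wire detouring is required is deliberately left out.

```python
def zero_skew_split(dist, delay_a, delay_b):
    """Wire lengths (e_a, e_b) from a merging point to children a and b.

    Solves e_a + e_b = dist and e_a + delay_a = e_b + delay_b, where
    delay_a and delay_b are the downstream delays of the two children
    under the path-length delay model.
    """
    diff = delay_b - delay_a
    if abs(diff) > dist:
        raise ValueError("wire detour needed; not covered in this sketch")
    e_a = (dist + diff) / 2
    return e_a, dist - e_a


# Node b merges sinks s3 and s4 (Manhattan distance 4, zero downstream delay).
print(zero_skew_split(4, 0, 0))  # (2.0, 2.0) -> downstream delay of b is 2

# Node a merges sinks s1 and s2 (distance 6, inferred from the text).
print(zero_skew_split(6, 0, 0))  # (3.0, 3.0) -> downstream delay of a is 3

# Node r merges a (delay 3) and b (delay 2); the distance 5 is assumed.
e_a, e_b = zero_skew_split(5, 3, 2)
print(e_a + 3, e_b + 2)          # 5.0 5.0 -> equal delays, i.e., zero skew
```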
The clock tree embedding step generates the exact location
of internal nodes of the clock tree. Subsequently, we place the
clock splitters at these embedding points and add nets between
each splitter node and its children. In contrast to CMOS, where
the embedding points are the locations of zero-area branching points of
the clock signal, in SFQ technology splitter
cells (with non-zero area) are placed at these locations. Generated locations for clock splitters may already be occupied
by logic cells or other clock splitters. Hence, after placing
the clock splitters in these locations, the clock tree should be
legalized. In other words, the overlaps among clock splitters
and cells should be removed. Note that because resolving these
overlaps moves the clock splitters, the
clock skew may no longer be zero after the legalization step.
Fig. 5: An example of a zero-skew clock tree produced by the DME
algorithm for a circuit with 4 sinks. (a) The topology of the tree. (b)
The location of sinks, merging segments, and the embedding points
of the internal nodes of the tree. Black rectangles and yellow circles
represent the clock sinks and splitters, respectively.
To remove the overlaps among the clock splitters and the
placed cells, we employ a two-step approach: (i) map the
clock splitters to routing channels (adjusting y coordinates), and (ii)
remove the overlaps among splitters in each routing channel
(modifying x coordinates). Since the legalization step
changes the insertion delays of the leaf nodes and thus introduces
clock skew, the primary objective is to minimize
this skew. In the next subsection, we present our
algorithms for finding the best routing channel and the best x
coordinate for each clock splitter, such that the clock skew is
minimized.
The output of the embedding step (an illegal solution) and
a legal placement of the clock splitters that yields a minimum-skew
solution for a 4-bit Kogge-Stone adder circuit [30] are
shown in Figures 6a and 6b, respectively.
D. Min-Skew Clock Tree Placement and Legalization
In this section, we present a min-skew clock tree placement
and legalization algorithm. This algorithm removes the overlaps
among logic cells and clock splitters in two steps. In the
first step, clock splitters are mapped to the routing channels
while their x coordinates are fixed (displacement in the vertical
direction). The motivations for moving the clock splitters to the
routing channels are as follows. (i) The placement of the logic cells
already placed in the placement rows does not need to change,
which helps ensure that the routability of the circuit is not
affected. (ii) The width of the chip does not need to increase
to accommodate clock splitters inside the
placement rows. Note that in a circuit with n = 2^m clock
sinks, a total of n − 1 splitters are needed to build a
fully-balanced clock tree. Therefore, adding splitters to the
placement rows may increase the width of the chip significantly.

Fig. 6: Illustration of a 4-bit Kogge-Stone adder circuit [30], after the
placement and CTS steps. Logic cells, clock splitters, and I/O pads
are shown using blue, red, and black rectangles, respectively. (a) An
illegal zero-skew solution. (b) A legal nonzero-skew solution.

The main objective in this step is to minimize the clock
skew, i.e., the difference between the largest and smallest
insertion delay value at the sink nodes. As explained in
Section II, reducing the phase delay values increases the
robustness to on-chip variations. Additionally, considering the
delay model for PTLs, reducing the phase delay values also
reduces the total wirelength of the clock tree and facilitates the
clock routing. Consequently, minimizing the sum of the phase
delays is considered as a secondary objective. The variables
in this problem are the assignments of splitter cells to the
routing channels. For each splitter cell (i), the index of the
assigned routing channel (yi ) is an integer value between 1
and the number of available routing channels (nR ). Parameters
used for formulating the problem are listed in Table I. The
constraints are defined as follows.
• The total number of clock splitters in each routing channel
should not exceed the capacity of the routing channel (i.e.,
the sum of the widths of the splitters should not exceed the
width of a routing channel).
• The mapping of the cells to the routing channels should
not differ much from the solution generated by the
embedding algorithm, as it already provides a good initial
solution.
TABLE I: Notations and definitions used for formulating the clock
tree placement and legalization problem.

Term                  Definition
$C_i$                 Clock cell i, including sinks
$C_0$                 Clock source (root of the clock tree)
$L_i$                 Level of the clock cell i
$D_i$                 Phase delay at the sink node i
$(x_i, y_i)$          Lower left coordinates of the clock cell i
$e_{k,j}$             An edge in the clock tree connecting cells $C_k$ and $C_j$
$\delta_{k,j}$        Delay of the edge $e_{k,j}$
$n_C$                 The total number of clock cells, excluding the sink nodes
$n_S$                 The total number of clock sinks
$n_R$                 The total number of placement rows (same for routing channels)
$W_a, H_a$            Width and height of the layout area
$W_{ch}, H_{ch}$      Width and height (≥ 40µm) of the routing channels
$W_r, H_r$            Width and height (120µm) of the placement rows
$H_{r+ch}$            The sum of the height of a placement row and a routing channel
$m_y$ ($m_x$)         The max difference between the y (x) coordinates of the clock cells
$W_{spl}, H_{spl}$    Width and height of the clock splitter cells (40µm)
$P$                   The minimum distance between adjacent clock cells
$d_{spl}$             Splitter delay (5.5ps)
$\lambda$             Regularization constant (1e−3)
$\alpha$              Delay constant (1e−2 ps/µm)

Subsequently, we present a mathematical formulation of the
min-skew clock tree placement problem in the vertical direction as follows.

\begin{equation}
\begin{aligned}
\text{minimize} \quad & \max_{i=1 \ldots n_S} D_i - \min_{j=1 \ldots n_S} D_j + \lambda \cdot \sum_{k=1}^{n_S} D_k \\
\text{subject to} \quad & y_i \in \{1 \ldots n_R\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \delta_{k,j} \\
& \delta_{k,j} = \delta^{x}_{k,j} + \delta^{y}_{k,j} \\
& \delta^{y}_{k,j} = \alpha * |y_k - y_j| * H_{r+ch}
\end{aligned}
\tag{7}
\end{equation}

In the above formulation, the phase delay of sink node i,
denoted by $D_i$, is calculated as the sum of the wire delays on
the path from the clock source ($C_0$) to the sink node i. The
delay constant $\alpha$ is used to convert PTL length to delay, as
described in Section II-B. The main objective is to reduce the
difference between the largest and smallest phase delay
values, i.e., the clock skew. The sum of the phase delay values at the sink
nodes ($D_i$), multiplied by a constant value ($\lambda$), is added to the
objective function as a regularization term. In this formulation, the
x coordinates of all the clock splitters are kept constant (i.e., the
$\delta^{x}_{k,j}$ terms are constant values, a function of the locations
calculated in the previous step). Since the absolute values add
non-linearity to the problem, the following transformation is
used to linearize the constraints.

\begin{equation}
\begin{aligned}
|z| \Rightarrow \; & \eta^{+} + \eta^{-} \\
& z = \eta^{+} - \eta^{-} \\
& 0 \le \eta^{+} \le b \cdot m \\
& 0 \le \eta^{-} \le (1 - b) \cdot m \\
& b \in \{0, 1\}
\end{aligned}
\tag{8}
\end{equation}

As shown, a new boolean variable b is added to the problem.
Parameter m represents the maximum value of z. Additionally, in order to transform the max and min functions to linear
functions, two new variables $D_{max}$ and $D_{min}$ are introduced
and the following constraints are added to the problem.

\begin{equation}
D_i \le D_{max}, \qquad D_{min} \le D_i \qquad \forall i \in \{1 \ldots n_S\}
\tag{9}
\end{equation}

Using the transformation in (8), the following constraints
should be added to the problem.

\begin{equation}
\begin{aligned}
& \delta^{y}_{k,j} = \alpha * (\eta^{+}_{k,j} + \eta^{-}_{k,j}) * H_{r+ch} \\
& y_k - y_j = \eta^{+}_{k,j} - \eta^{-}_{k,j} \\
& 0 \le \eta^{+}_{k,j} \le b_{k,j} * m_y \\
& 0 \le \eta^{-}_{k,j} \le (1 - b_{k,j}) * m_y
\end{aligned}
\tag{10}
\end{equation}

Furthermore, to control the channel density, i.e., to limit the
number of clock splitters mapped to each routing channel,
separate constraints are added to the problem. The capacity
of a channel, defined as the max number of splitters in each
channel, is calculated using the following formula.

\begin{equation}
n_{ch} = \left\lfloor \frac{W_{ch}}{W_{spl} + P} \right\rfloor
\tag{11}
\end{equation}

To model the assignment of each cell i to each routing channel
j, a binary variable $c_{i,j}$ is defined and the following set of
constraints is added to the problem.

\begin{align}
y_i = j \iff c_{i,j} = 1 & \qquad \forall i \in \{1 \ldots n_C\}, \; \forall j \in \{1 \ldots n_R\} \tag{12} \\
\sum_{j=1}^{n_R} c_{i,j} = 1 & \qquad \forall i \in \{1 \ldots n_C\} \tag{13} \\
\sum_{i=1}^{n_C} c_{i,j} \le n_{ch} & \qquad \forall j \in \{1 \ldots n_R\} \tag{14}
\end{align}

Constraint (12) ensures that if cell i is assigned to channel j,
then $c_{i,j}$ is one. If-then constraints can be transformed to linear
constraints in a way similar to that used for the absolute
value constraints [31]. Constraint (13) ensures that each cell is
assigned to exactly one channel. Constraint (14) controls the channel
density by ensuring that the total number of cells assigned to
each channel does not exceed the channel capacity (calculated by
Equation (11)).
The initial placement generated by the embedding algorithm
is a good starting point for the final mapping of clock splitters
to routing channels. This initial solution ($y_i^0$) simply maps
each cell to the nearest channel below the original location.
Accordingly, we restrict the final mapping of the cell i to either
the same channel as the initial solution or the channel above the initial
solution, using the following constraint.

\begin{equation}
y_i^0 \le y_i \le y_i^0 + 1 \qquad \forall i \in \{1 \ldots n_C\}
\tag{15}
\end{equation}

Eventually, using the proposed transformations, objective
function, constraints, and variables, the problem formulation
is summarized as follows.

\begin{equation}
\begin{aligned}
\text{minimize} \quad & D_{max} - D_{min} + \lambda \cdot \sum_{i=1}^{n_S} D_i \\
\text{subject to} \quad & D_i \le D_{max}, \quad D_{min} \le D_i && \forall i \in \{1 \ldots n_S\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^{x}_{k,j} + \delta^{y}_{k,j} \right) && \forall i \in \{1 \ldots n_S\} \\
& \delta^{y}_{k,j} = \alpha * (\eta^{+}_{k,j} + \eta^{-}_{k,j}) * H_{r+ch} \\
& y_k - y_j = \eta^{+}_{k,j} - \eta^{-}_{k,j} \\
& 0 \le \eta^{+}_{k,j} \le b_{k,j} * m_y \\
& 0 \le \eta^{-}_{k,j} \le (1 - b_{k,j}) * m_y \\
& \sum_{j=1}^{n_R} c_{i,j} = 1 && \forall i \in \{1 \ldots n_C\} \\
& \sum_{i=1}^{n_C} c_{i,j} \le n_{ch} && \forall j \in \{1 \ldots n_R\} \\
& y_i = j \iff c_{i,j} = 1 && \forall i \in \{1 \ldots n_C\}, \; \forall j \in \{1 \ldots n_R\} \\
& y_i^0 \le y_i \le y_i^0 + 1 && \forall i \in \{1 \ldots n_C\} \\
\text{variables} \quad & y_i \in \{1 \ldots n_R\} && \forall i \in \{1 \ldots n_C\} \\
& b_{k,j} \in \{0, 1\} && \forall k, j \rightarrow \exists \, e_{k,j} \\
& c_{i,j} \in \{0, 1\} && \forall i \in \{1 \ldots n_C\}, \; \forall j \in \{1 \ldots n_R\} \\
& D_i \in \mathbb{R}^{+} && \forall i \in \{1 \ldots n_S\} \\
& \delta_{k,j} \in \mathbb{R}^{+} && \forall k, j \rightarrow \exists \, e_{k,j}
\end{aligned}
\tag{16}
\end{equation}
As observed, there are integer (yi ), binary (bk,j and ci,j ),
and real-number (Di and δk,j ) variables in this formulation.
Therefore, this problem is an instance of mixed integer linear
programming (MILP). Solving this problem yields the optimum assignment of splitter cells to routing channels such that
the clock skew is minimized.
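To make the structure of formulation (16) concrete, the sketch below enumerates all assignments allowed by constraint (15) on a toy instance and keeps the one with the lowest objective value. It is an exhaustive-search stand-in for the MILP solver, suitable only for a handful of splitters, and is not the authors' implementation. The function names, the toy tree, the channel width of 1000µm, the spacing P = 10µm, and the choice of treating row and channel indices as one integer grid are all assumptions made for illustration; α, λ, and the 40µm and 120µm dimensions come from Table I.

```python
from itertools import product

ALPHA = 1e-2    # delay constant in ps/um (Table I)
LAMBDA = 1e-3   # regularization constant (Table I)


def channel_capacity(w_ch, w_spl, p):
    """Equation (11): floor(W_ch / (W_spl + P))."""
    return int(w_ch // (w_spl + p))


def assign_channels(parent, sinks, y_fixed, y0, dx_delay, h_rch, n_ch, n_r):
    """Try y_i in {y0_i, y0_i + 1} for every splitter (constraint (15)).

    parent   -- dict child -> parent; the clock source has no entry
    y_fixed  -- vertical indices of cells that do not move (source, sinks)
    y0       -- dict splitter -> initial channel index
    dx_delay -- dict (parent, child) -> constant horizontal wire delay in ps
    Returns (cost, assignment, skew) of the best feasible assignment.
    """
    splitters = sorted(y0)
    best = None
    for choice in product((0, 1), repeat=len(splitters)):
        y = dict(y_fixed)
        y.update({s: y0[s] + c for s, c in zip(splitters, choice)})
        if any(not 1 <= y[s] <= n_r for s in splitters):
            continue
        load = {}
        for s in splitters:
            load[y[s]] = load.get(y[s], 0) + 1
        if any(count > n_ch for count in load.values()):   # constraint (14)
            continue
        delays = []
        for sink in sinks:
            total, node = 0.0, sink
            while node in parent:                           # walk up to C_0
                up = parent[node]
                total += dx_delay.get((up, node), 0.0)
                total += ALPHA * abs(y[up] - y[node]) * h_rch
                node = up
            delays.append(total)
        skew = max(delays) - min(delays)
        cost = skew + LAMBDA * sum(delays)
        if best is None or cost < best[0]:
            best = (cost, {s: y[s] for s in splitters}, skew)
    return best


# Toy instance (hypothetical): source r, splitters a and b, four sinks.
parent = {"s1": "a", "s2": "a", "s3": "b", "s4": "b", "a": "r", "b": "r"}
y_fixed = {"r": 3, "s1": 2, "s2": 2, "s3": 3, "s4": 3}
y0 = {"a": 1, "b": 3}
dx_delay = {("r", "b"): 1.6}
n_ch = channel_capacity(1000, 40, 10)
best = assign_channels(parent, ["s1", "s2", "s3", "s4"], y_fixed, y0,
                       dx_delay, h_rch=160.0, n_ch=n_ch, n_r=4)
```

On this instance two assignments reach (numerically) zero skew; the regularization term selects the one with the smaller total phase delay, which is the role of the secondary objective.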
Once this step is completed, a similar problem can be
formulated to minimize the skew while eliminating all the
horizontal overlaps among the cells allotted to the same
routing channel. Using the horizontal ordering imposed by the
embedding algorithm, we can set constraints on the location
of each cell in the routing channel, such that no two adjacent
cells overlap. Accordingly, after the assignment of cells to
routing channels is determined, the cells in each routing channel
are sorted based on their initial x coordinates ($x_i^0$) and the following
constraints are used to eliminate the horizontal overlaps among
cells.

\begin{equation}
W_{spl} + P \le x_j - x_i \qquad \forall i, j \in \{1 \ldots n_C\} \;|\; y_i = y_j, \; x_i^0 \le x_j^0
\tag{17}
\end{equation}

Since the ordering of the cells is determined, we can make use
of the transitive relationships. For example, if there are three
cells u, v, and w, the two constraints u ≤ v and v ≤ w imply
that u ≤ w. Therefore, if there are a total of n cells
in a row, only n − 1 constraints (one per pair of adjacent cells) are required to
ensure the legality of the placement solution in that row. Using
the transformations and variables defined for legalization in
the vertical direction, the min-skew placement and legalization
problem in the horizontal direction is formulated as follows.
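
\begin{equation}
\begin{aligned}
\text{minimize} \quad & D_{max} - D_{min} + \lambda \cdot \sum_{i=1}^{n_S} D_i \\
\text{subject to} \quad & D_i \le D_{max}, \quad D_{min} \le D_i && \forall i \in \{1 \ldots n_S\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^{x}_{k,j} + \delta^{y}_{k,j} \right) \\
& \delta^{x}_{k,j} = \alpha * (\eta^{+}_{k,j} + \eta^{-}_{k,j}) \\
& x_k - x_j = \eta^{+}_{k,j} - \eta^{-}_{k,j} \\
& 0 \le \eta^{+}_{k,j} \le b_{k,j} * m_x \\
& 0 \le \eta^{-}_{k,j} \le (1 - b_{k,j}) * m_x \\
& x_i + W_{spl} + P \le x_j && y_i = y_j, \; x_i^0 \le x_j^0 \\
\text{variables} \quad & x_i \in [0, W_a - W_{spl}] && \forall i \in \{1 \ldots n_C\} \\
& b_{k,j} \in \{0, 1\} && \forall k, j \rightarrow \exists \, e_{k,j} \\
& D_i \in \mathbb{R}^{+} && \forall i \in \{1 \ldots n_S\} \\
& \delta_{k,j} \in \mathbb{R}^{+} && \forall k, j \rightarrow \exists \, e_{k,j}
\end{aligned}
\tag{18}
\end{equation}
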
Similar to formulation (16), min-skew placement and legalization in the horizontal direction is also an instance of MILP.
Note that assuming the lower left corner of the layout area to
be at (0, 0), the xi coordinates are constrained to be within
the boundaries of the layout. Solving the above problem yields
the final placement of the clock splitters, such that the clock
skew is minimized and a legal placement (with no overlaps)
is produced. Note that the legalization in the vertical direction
should be done before the horizontal direction. The reason
is that the assignment of cells to channels and their initial
ordering in the horizontal direction determine the necessary
constraints for removing overlaps during horizontal legalization. Once the horizontal and vertical legalization problems are
solved, a legal minimum-skew solution similar to Fig. 6b is
produced. In the next section, our simulation framework along
with the results of applying the proposed CTS algorithm to
multiple SFQ benchmarks are presented.
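The way the non-overlap constraints of (17) are reduced to n − 1 constraints per channel can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `adjacency_constraints` and `is_legal`, the cell names, and the spacing P = 10µm are assumptions, while the 40µm splitter width comes from Table I. Cells are grouped by their assigned channel, sorted by their initial x coordinate, and one constraint is emitted for each pair of neighbors.

```python
from collections import defaultdict


def adjacency_constraints(cells):
    """cells: dict name -> (channel index, initial x coordinate x0).

    Returns a list of (left, right) pairs, each standing for the
    constraint x_left + W_spl + P <= x_right.
    """
    by_channel = defaultdict(list)
    for name, (channel, x0) in cells.items():
        by_channel[channel].append((x0, name))
    pairs = []
    for channel in sorted(by_channel):
        ordered = [name for _, name in sorted(by_channel[channel])]
        # Transitivity: constraining neighbors is enough, n - 1 pairs.
        pairs.extend(zip(ordered, ordered[1:]))
    return pairs


def is_legal(x, pairs, w_spl, p):
    """Check a candidate set of x coordinates against the constraints."""
    return all(x[left] + w_spl + p <= x[right] for left, right in pairs)


# Hypothetical example: three splitters in channel 1, one in channel 2.
cells = {"a": (1, 100), "b": (1, 30), "c": (1, 60), "d": (2, 10)}
pairs = adjacency_constraints(cells)
overlapping = {"a": 100, "b": 30, "c": 60, "d": 10}
spread_out = {"a": 100, "b": 0, "c": 50, "d": 10}
```

In the MILP of formulation (18) these pairs become linear constraints on the x variables; here `is_legal` only verifies a given placement.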
IV. SIMULATION RESULTS
We used the qPlace package for placing the logic cells
in the layout area [16], [20]. We added the support for our
proposed delay model (cf. Section II-B) to the implementation
of the DME algorithm for embedding the clock trees [23]. We
implemented the clock topology generation and the rest of the
proposed algorithms in C++ and used the IBM CPLEX v12.8
package for solving the MILP problems [32]. The qSTA tool
was used for static timing analysis. We used the clock tree
synthesis approach in [20] as the baseline for comparison.
This approach essentially uses the DME algorithm for clock
tree synthesis. There are two major differences between the
proposed method and the baseline approach.
• The proposed method maps each splitter to the channel
above or below the initial location, such that clock skew
is minimized. However, the baseline approach maps the
clock splitters to the closest routing channel (either above
or below) greedily, minimizing the displacement of each
individual splitter, while ignoring the clock skew or the total
wirelength of the clock tree.
• The proposed approach moves all the cells in the horizontal
direction aiming at minimizing the skew and removing the
overlaps. Conversely, the baseline approach only moves the
cells that have overlap with each other, by shifting the
overlapping cell(s), ignoring the effect of displacement on
the clock skew.
We assume the baseline approach uses a fully-balanced topology to minimize the skew, similar to our proposed approach.
Note that this is an advantage for the baseline solution as
using an imbalanced clock topology while ignoring the splitter
delays increases the clock skew. On the other hand, using a
fully-balanced clock tree topology, which makes sure all the
sink nodes in the clock tree have the same level, effectively
removes the impact of clock splitter delays on the skew. Note
that, in this work, it is assumed that all the splitter cells have
the same delay and the process variations do not change the
delay of splitters.
The characteristics of the benchmark circuits, including
the number of I/O pads, sink nodes, cells, nets, before and
after the clock tree synthesis are listed in Table II. Since
the same topology generation algorithm is used for both the
proposed and baseline approaches, the number of cells after
clock synthesis is equal for both solutions.
The clock skew, total negative hold slack, worst negative
hold slack, and the maximum achievable clock frequency for
each design, after the placement and clock tree synthesis are
reported in Table III. We have also listed the clock skew values
after the clock routing using qGDR routing tool and 4 metal
layers for routing [34]. The clock period for each circuit is
calculated as the smallest value such that there are no setup
time violations. For solving each of the MILP problems, a
time limit of 30 minutes is established.
As shown in Table III, the average clock skew value for
17 benchmarks is 4.6ps. The proposed method improves the
average clock skew by 70%, compared with the baseline
approach. Additionally, the total and worst negative hold slack
values are improved by 80% and 60%, respectively. The
average values for the total and worst negative hold slack
for all the benchmarks are −6.1ps and −1.7ps, respectively.
As observed, even the worst case hold time violation among
all the benchmarks (cf. Table III, benchmark c3540) can be
solved by the post-CTS timing closure flow, without the need
for extensive refinement of the circuit and by insertion of a
small number of hold buffer/JTL cells. Finally, it should be
mentioned that if the time limits for solving MILP problems
are increased or movement of clock splitters is not restricted
to the top or bottom channels only, lower skew values may be
achieved.
Post-routing maximum clock skew results are listed in Table
III. The average clock skew increases to 8.6ps after the clock
routing. The main reason behind this increase in the maximum
clock skew is that the routing tool focuses on finishing the
routing of all the nets while reducing the total via count
used for routing [34]. Therefore, it ignores the propagation
delay along the nets, and consequently, the maximum clock
skew values. Aside from the routing tool itself, we point out
that the competition for limited available routing resources
in large benchmarks (e.g., 16-bit Array Multiplier with more
than 14,000 logic cells and 13,000 clock sub-nets) results in
an increase in the length of routed nets, compared with the
ideal Manhattan distance, and hence the clock skew may be
increased. Note that the maximum post-routing clock skew
among all benchmarks is 15.3ps, which is rather small, and
the resulting negative slack values can be easily eliminated by
a timing closure flow, which selectively adds a small number
of gates on data propagation paths. A timing-driven routing
tool will address the increase in maximum clock skew. Aside
from post-routing results reported in Table III, throughout the
paper, all the delay values are reported after the placement and
clock tree synthesis steps (and before routing), assuming that
the net lengths are equal to the Manhattan distance between
the corresponding logic gates.

TABLE II: Benchmark characteristics. KSA stands for Kogge-Stone
adder [30], ArrMult stands for array multiplier, and ID stands for
integer divider. The rest of the benchmarks are chosen from the ISCAS85
benchmark suite [35]. Post-CTS columns report the number of cells
and nets in the design after adding clock splitters and clock nets
(using the proposed approach).

                                        Pre-CTS           Post-CTS
Benchmark   #I/O pads   #Clk Sinks   #Cells   #Nets    #Cells   #Nets
KSA4               15           59       87     124       150     246
KSA8               27          318      230     318       485     732
KSA16              51          414      592     803      1103    1728
KSA32              99         1049     1486    1988      3533    5084
ArrMult8           33         1404     1875    2296      3922    5747
ArrMult16          65         4798     6206    7646     14397   20635
ID4                17          420      570     694      1081    1625
ID8                33         2703     3192    3697      7287   10495
c432               44          976     1186    1432      2209    3431
c499               74          566      875    1225      1898    2814
c880               87         1133     1469    1865      3516    5045
c1355              74          618      922    1267      1945    2908
c1908              59         1100     1516    1965      3563    5112
c2670             206         1713     2195    2832      4242    6592
c3540              73         2679     3936    5097      8031   11871
c5315             285         4483     5931    7557     14122   20231
c6288              65         5546     7236    8958     15427   22695

Although the proposed clock embedding, placement, and
legalization algorithms (cf. steps 2-4 in Fig. 2a) minimize the
skew given a fully-balanced clock topology, these algorithms
can also be applied to imbalanced topologies with fewer
splitters, to produce minimum-skew clock trees
with fewer JJs than a fully-balanced
tree topology. In the next section, we present the necessary
modifications to the methodology proposed in Section III
to minimize the clock skew given imbalanced clock tree
topologies.
V. CLOCK SYNTHESIS FOR IMBALANCED TREE TOPOLOGIES
In RSFQ circuits, the static power dissipation in the resistive
bias network is about 100× larger than the dynamic power
dissipation of the Josephson junctions [3]. Additionally, the
clock network may require large amounts of current, exceeding
10A for large circuits. Consequently, the current delivery is a
significant problem in large circuits with more than 100K JJs
[33].
One of the possible ways to reduce the required amount of
current delivered to the circuit and the static power consumption is to decrease the number of splitters (i.e., JJ count) in the
clock network. Although creating a fully-balanced clock tree
in which all the sink nodes have the same level reduces the
skew significantly, it may introduce a large overhead in terms
of the total area, biasing current, and static power consumption
of the clock network. To address this issue, imbalanced tree
structures with fewer splitters compared with the
fully-balanced solution may be utilized. As a consequence of
imbalance in the clock tree topology, to minimize the skew, the
clock tree embedding, placement, and legalization algorithms
should be modified to account for the delay of clock splitters.
In the following subsections, first we propose using an
algorithm for imbalanced clock tree topology generation
[27]. Next, we present splitter-delay-aware zero-skew clock
tree embedding and splitter-delay-aware min-skew clock tree
placement and legalization algorithms in detail. The goal is to
minimize both the clock skew and the number of JTLs in the
clock network.
A. Imbalanced Topology Generation
For topology generation, we use the greedy-DME algorithm
[27]. In this approach, topology generation is performed in
a bottom-up fashion (in contrast to our proposed approach
which was done in a top-down manner, cf. Section III-B).
Assume that initially there are n sink nodes. The greedy-DME
algorithm starts with a set of nodes, representing the sinks
of the clock tree. The algorithm iteratively finds the nearest
neighbors u and v in the set of possible nodes, where the
distance between nodes u and v is smaller than the distance
between any other pairs of nodes. A parent node is then
created by merging nodes u and v and these two child nodes
are removed from the set of nodes. The next pair of nearest
neighbors are detected and the same procedure is repeated
until the total number of remaining nodes is 1 (this process
takes n − 1 steps). A time complexity of O(n log n) can be
obtained for this algorithm, where n denotes the number of
sinks [27].

TABLE III: Simulation results (clock skew, total negative hold slack (TNS), worst negative hold slack (WNS), and clock frequency) for
several benchmarks using the proposed method and the baseline method [20]. Impr. stands for improvements over the baseline. Freq. denotes
the max clock frequency. Post-pl. and post-rt. stand for post-placement and post-routing, respectively. Values for skew, WNS, and TNS are
in ps. Proposed method, skew and hold columns:

                     Skew (Proposed)                  Hold (Proposed)
Benchmark   Post-Pl.  Impr. (%)  Post-Rt.      TNS  Impr. (%)     WNS
KSA4             2.7       66.3       3.6        0        N/A       0
KSA8             5.3       47.0       8.4        0        100       0
KSA16            4.5       59.5       6.7     -1.3       38.1    -1.3
KSA32            4.6       52.1       9.1     -0.6       76.0    -0.6
ArrMult8         4.0       70.4       8.3        0        100       0
ArrMult16        6.1       63.5      14.2     -3.2       89.9    -1.8
ID4              4.4       60.0       7.2        0        100       0
ID8              4.4       73.7       9.7     -0.5       98.9    -0.5
c432             3.6       75.5       5.9     -0.4       94.2    -0.4
c499             4.9       84.5       9.5      -11       87.5    -2.7
c880             3.9       78.7       7.0       -6       73.8    -3.8
c1355            5.6       78.7       8.6    -17.3       77.6    -2.1
c1908            4.8       64.4       9.9     -3.6       76.8    -1.6
c2670            4.1       72.7       6.6    -13.7       79.6    -2.9
c3540            4.3       69.1       8.0    -18.1       20.3    -5.2
c5315            5.0       70.9       9.0      -13       84.1    -2.9
c6288            5.7       67.8      15.3    -14.8       71.2    -2.6
Average          4.6       70.5       8.6     -6.1       80.4    -1.7

As observed, the greedy-DME algorithm merges the nearest
neighbors, which may be sink nodes or roots of partial trees
with different heights. Therefore, as a result of this bottom-up approach, which tries to minimize the total wirelength
heuristically, the generated topology has fewer
internal nodes (i.e., clock splitters) than a fully-balanced topology, which always merges sub-tree roots
of the same height. Consequently, the total number of JTLs
in the clock tree generated by the greedy-DME algorithm is
smaller than that of the tree produced by the proposed approach in
Section III-B (i.e., MMM [29]).
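The nearest-neighbor merging loop described above can be sketched in a few lines. This is a simplified illustration, not the greedy-DME algorithm of [27] itself: it is an assumption of this sketch that each merged sub-tree is represented by the midpoint of its two children (the actual algorithm works with merging segments), and the naive pair search used here does not achieve the O(n log n) bound. The function names and the sink coordinates are hypothetical.

```python
from itertools import combinations


def manhattan(p, q):
    """Manhattan distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def greedy_topology(sinks):
    """Build a binary topology by repeatedly merging the closest pair.

    sinks -- dict name -> (x, y)
    Returns (root, children), where children maps every internal node
    (i.e., splitter) to the pair of nodes it drives.
    """
    active = dict(sinks)
    children = {}
    next_id = 0
    while len(active) > 1:
        u, v = min(combinations(sorted(active), 2),
                   key=lambda pair: manhattan(active[pair[0]],
                                              active[pair[1]]))
        pu, pv = active.pop(u), active.pop(v)
        parent = f"n{next_id}"
        next_id += 1
        children[parent] = (u, v)
        # ASSUMPTION: the merged sub-tree is represented by the midpoint.
        active[parent] = ((pu[0] + pv[0]) / 2.0, (pu[1] + pv[1]) / 2.0)
    (root,) = active
    return root, children


# Hypothetical example with five sinks on a line.
sinks = {"s1": (0, 0), "s2": (1, 0), "s3": (10, 0),
         "s4": (11, 0), "s5": (30, 0)}
root, children = greedy_topology(sinks)
```

For these five sinks the loop creates n − 1 = 4 internal nodes and the far-away sink s5 is attached directly below the root, whereas a fully-balanced tree would need 8 leaf positions and therefore 7 splitters.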
B. Splitter-Delay-Aware Clock Tree Embedding
In the proposed zero-skew clock tree embedding algorithm
in Section III-C, we did not need to account for the splitter
delays during the embedding of the tree. The reason was that
the generated clock topology was already fully-balanced, so
all the source-sink paths had the same number of splitters.
Consequently, all the phase delays included an identical delay
value associated with the sum of the delay of the clock splitters
in each source-sink path, which did not affect the clock skew.
In the DME algorithm, the location of a merging segment
is a function of the location and downstream delay of its child
merging segments [23]. Accordingly, assuming an imbalanced
tree topology, splitter delays play an important role in determining the location of merging segments and embedding
points of a clock tree. However, the DME algorithm does
not consider the delay of splitter cells in the clock tree. To
address this issue, we modify the formulation of the DME
algorithm as follows. Similar to the original algorithm, clock
tree embedding is done in two phases. In the bottom-up pass,
after merging two child nodes, we add the delay of the splitter
cell to the total downstream delay of the parent node. Consequently,
when merging two sub-trees, the algorithm accounts
for both the wire delay of each sub-tree and the delay associated
with the splitters inserted in each sub-tree. Accordingly, the
locations of the merging segments differ from those
produced by the original DME algorithm. The top-down phase
of the DME algorithm remains the same. Finally, a zero-skew
embedding is generated that accounts for both splitter and
interconnect delays.

TABLE III (continued): Improvement in WNS and max clock frequency for the proposed method, followed by the results of the baseline method [20].

            Proposed                  Baseline
Benchmark   WNS Impr. (%)   Freq.     Skew      TNS      WNS   Freq.
KSA4                  N/A    37.6      8.0      0.0      0.0    34.6
KSA8                  100    26.7     10.0     -1.4     -1.4    21.9
KSA16                -8.3    25.2     11.1     -2.1     -1.2    26.5
KSA32                45.5    14.2      9.6     -2.5     -1.1    13.4
ArrMult8              100    17.2     13.5    -10.0     -2.8    16.0
ArrMult16            62.5     9.4     16.7    -31.6     -4.8     9.1
ID4                   100    19.5     11.0     -3.3     -2.2    18.7
ID8                  88.6     6.5     16.7    -45.9     -4.4     6.4
c432                 83.3     4.8     14.7     -6.9     -2.4     4.8
c499                 77.1     6.5     31.6    -88.3    -11.8     6.5
c880                  0.0      10     18.3    -22.9     -3.8     9.8
c1355                85.8     6.9     26.3    -77.1    -14.8     7.0
c1908                15.8     4.3     13.5    -15.5     -1.9     4.3
c2670                59.7     3.6     15.0    -67.3     -7.2     3.7
c3540              -188.9     4.4     13.9    -22.7     -1.8     4.5
c5315                58.0     2.6     17.2    -81.9     -6.9     2.6
c6288                43.5     4.5     17.7    -51.3     -4.6     4.5
Average              60.5       -     15.6    -31.2     -4.3       -

Fig. 7 depicts an imbalanced tree topology and the placement of internal nodes, using the original and the modified
DME algorithms for an example with 3 sinks. The delays
associated with different edges are also shown. Assume that the
delay of a splitter cell is 2 units (of delay) and that a path-length delay model is used. As depicted in Fig. 7b, generated using
the original DME algorithm that ignores the splitter delays,
the phase delays of sinks s1-s3 are 10, 10, and 8, respectively.
Hence, the clock skew is 2. Conversely, Fig. 7c depicts the
location of merging segment msr considering the splitter delay
values. Once msa is formed, its downstream delay,
which is originally 3, is increased by the delay
of the splitter cell corresponding to node a.
Hence, the downstream delay of node a becomes 5.
Accordingly, the merging segment msr is formed further away
from node s3 and closer to node a, to create a zero-skew
merging segment. Consequently, the phase delays of sinks s1-s3 are all equal to 9 and the clock skew becomes 0.
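The numbers in this example can be reproduced with a short sketch of the modified bottom-up merge. It is an illustration under the path-length delay model, not the authors' implementation; the function names are assumptions, and the two distances (6 between s1 and s2, 9 between node a and s3) are inferred from the edge lengths quoted for Fig. 7. The only change relative to the original merge is that the splitter delay of the new parent is added to its downstream delay.

```python
D_SPL = 2.0        # splitter delay assumed in the Fig. 7 example
DIST_S1_S2 = 6.0   # inferred: two edges of length 3
DIST_A_S3 = 9.0    # inferred: edges of length 3 and 6 in Fig. 7b


def merge(dist, t_a, t_b, d_spl=0.0):
    """Zero-skew merge of two children that are dist apart.

    Returns (e_a, e_b, t_parent). t_parent includes d_spl, the delay of
    the splitter placed at the new parent (0 reproduces the original DME).
    """
    e_a = (dist + t_b - t_a) / 2.0
    e_b = dist - e_a
    if e_a < 0 or e_b < 0:
        raise ValueError("detour wiring needed; not handled in this sketch")
    return e_a, e_b, t_a + e_a + d_spl


def phase_delays(splitter_aware):
    """Phase delays of (s1, s2, s3) for the 3-sink example."""
    d = D_SPL if splitter_aware else 0.0
    e_s1, e_s2, t_a = merge(DIST_S1_S2, 0.0, 0.0, d)   # node a
    e_a, e_s3, _ = merge(DIST_A_S3, t_a, 0.0, d)       # node r
    # The real phase delays always include the splitters on the path.
    d_s1 = e_s1 + D_SPL + e_a + D_SPL
    d_s2 = e_s2 + D_SPL + e_a + D_SPL
    d_s3 = e_s3 + D_SPL
    return d_s1, d_s2, d_s3


original = phase_delays(splitter_aware=False)
modified = phase_delays(splitter_aware=True)
```

With the splitter delay ignored the edge lengths at the root are 3 and 6; with it included they become 2 and 7, which shifts msr toward node a as described.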
The splitter-delay-aware clock tree embedding algorithm is
applied to the imbalanced topology to calculate the location of
clock splitters. In the next subsection, we describe the modifications made to the placement and legalization algorithm to
properly handle imbalanced tree topologies.
C. Splitter-Delay-Aware Clock Tree Placement and Legalization
The clock tree placement and legalization algorithm should
account for splitter delays, while minimizing the clock skew
and removing the overlaps. To do so, we modify the formulas
calculating the insertion delay at each sink node (i.e., in formulations (16) and (18)) as follows.

\begin{equation}
D_i = L_i * d_{spl} + \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^{x}_{k,j} + \delta^{y}_{k,j} \right)
\tag{19}
\end{equation}

where $L_i$ denotes the level of sink i and $d_{spl}$ denotes the
delay of splitter cells. As observed, the first term in the above
formula is a constant value, a function of the tree topology and
not one of the variables of the MILP formulation for clock tree
legalization.

Fig. 7: An example of applying DME algorithm to a circuit with 3
sink nodes. (a) An imbalanced tree topology. The location of sinks,
merging segments, and the embedding points of the internal nodes of
the tree using (b) original DME algorithm and (c) splitter-delay-aware
DME algorithm are shown.

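Equation (19) can be evaluated directly, as in the hedged sketch below. The constants are taken from Table I; the function name and the wire lengths are hypothetical, and the wire delay is approximated here as α times the total routed length on the path rather than being split into its x and y components.

```python
D_SPL = 5.5    # splitter delay in ps (Table I)
ALPHA = 1e-2   # delay constant in ps/um (Table I)


def insertion_delay(level, wire_lengths_um):
    """Equation (19): D_i = L_i * d_spl + sum of wire delays on the path.

    level           -- L_i, the level of the sink as used in (19)
    wire_lengths_um -- lengths (um) of the edges on the source-sink path
    """
    return level * D_SPL + ALPHA * sum(wire_lengths_um)


# Two hypothetical sinks at different levels of an imbalanced tree.
deep_sink = insertion_delay(3, [200, 150, 100])   # more splitters, short wires
shallow_sink = insertion_delay(2, [500, 500])     # fewer splitters, long wires
skew = abs(deep_sink - shallow_sink)
```

With these constants one splitter delay corresponds to 550µm of PTL, so in this example 550µm of additional wire on the shallower path offsets its missing splitter.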
Once the above modifications are done to the flow of Fig.
2a, the proposed clock tree synthesis algorithm can be used
for both balanced and imbalanced tree topologies, possibly
generating min-skew solutions with fewer clock
splitters compared with the approach proposed in Section
III. The splitter-delay-aware CTS flow tries to increase the
length of the wires along the source-sink paths that have fewer
splitters, to minimize the difference between max and min
insertion delays and hence, to minimize the maximum clock
skew. In the next subsection, the results of applying the CTS
algorithm for imbalanced topologies to multiple benchmark
circuits are presented and compared to the baseline solution.
D. Simulation Results for Imbalanced Clock Tree Topologies
We used the greedy-DME algorithm for imbalanced clock
topology generation [23] and added the aforementioned modifications to the qCTS flow. Table IV lists the clock skew,
total negative hold slack (TNS), worst negative hold slack
(WNS) (in ps), the number of clock splitters, the maximum
clock frequency, and imbalance degree (the maximum level
difference among sinks) for several benchmarks obtained by
applying the proposed algorithm and compares them with the
baseline solution [20]. As shown, the average number of clock
splitters and average clock skew value are reduced by 37%
and 56%, respectively, compared with the baseline solution
[20] described in Section IV. The average clock skew value
over all 17 benchmarks is 6.8ps. Additionally, the average
total negative hold slack and average worst negative hold slack
values are improved by 32% and 51%, respectively. Table IV
also lists the imbalance degree of the clock trees for different
benchmarks. As shown, some of the tree topologies have an
imbalance degree as large as 8, i.e., there exists a source-sink
path that has 8 fewer splitters than a path with the maximum
number of splitters. Such large differences can potentially
cause a large skew (e.g., 8 × dspl = 44ps) if a CTS algorithm
ignores the splitter delays. However, the proposed embedding
and legalization algorithms modify the location of splitters and
the delay of interconnects along all the source-sink paths in a
way that the effect of splitter delays on the maximum clock
skew is balanced out by the wire delays. Consequently, the
maximum clock skew among all the benchmarks is limited to
10.1ps, approximately equal to the delay of only 2 splitters.
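The arithmetic behind this comparison can be checked with a short sketch. It assumes dspl = 44/8 = 5.5ps, as implied by the figure quoted above, and uses the reported skew values of the three Table IV benchmarks whose imbalance degree is 8. The helper name `splitter_only_skew` is hypothetical.

```python
D_SPL_PS = 44.0 / 8  # 5.5 ps, implied by 8 * d_spl = 44 ps in the text


def splitter_only_skew(imbalance_degree: int, d_spl: float = D_SPL_PS) -> float:
    """Skew contributed by splitters alone if wire delays do not compensate."""
    return imbalance_degree * d_spl


# (imbalance degree, reported skew in ps) for the Table IV rows with degree 8
rows = {"KSA32": (8, 6.0), "ArrMult16": (8, 8.6), "c1908": (8, 6.3)}
for name, (deg, skew) in rows.items():
    bound = splitter_only_skew(deg)
    print(f"{name}: splitter-only skew {bound:.1f} ps, reported {skew:.1f} ps")

# Largest reported skew (c499, 10.1 ps) expressed in splitter delays.
print(round(10.1 / D_SPL_PS, 2))  # 1.84, i.e. just under 2 splitter delays
```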
As shown in Tables III and IV, using the same clock
tree embedding and legalization algorithms, imbalanced tree
topologies create a trade-off between the number of clock
splitters (also the total static power consumption and the total
biasing current delivered to the network) and the total negative
hold slack values, compared with balanced topologies (cf.
Section III). This suggests that the timing closure flow may
need more hold buffers to fix all the hold time violations for
imbalanced topologies, compared with balanced trees.
As a future direction, we plan to quantify the effect of
process variations on the timing yield of the circuits and design
a clock tree synthesis algorithm and a timing closure flow that
try to minimize timing violations and improve the timing yield,
in the presence of process variations.
VI. CONCLUSION
This paper presents a minimum-skew clock tree synthesis
methodology for single flux quantum logic circuits, called
qCTS. The qCTS algorithm first builds a fully-balanced tree
topology considering the placement of the sequential elements
in a circuit, such that there is an equal number of splitters from
the clock source to any clock sink. The fully-balanced tree
topology removes the effect of splitter delays on the clock
skew. The location of the clock splitters is then calculated
using a zero-skew embedding algorithm. Finally, using a novel
mixed integer linear programming based method, overlaps
among clock splitters and logic cells are removed while the
clock skew is minimized. The qCTS method improves the
state-of-the-art by accounting for splitter delays and placement
blockages and reduces the clock skew by 70% on average, over
17 benchmarks. Subsequently, this methodology is extended
to minimize the clock skew given an imbalanced topology in
which the number of splitters from the root of the tree
differs among sink nodes. The modified algorithm generates
solutions in which the average number of splitters and the
average clock skew are reduced by 37% and 56%, respectively,
compared with a fully-balanced clock tree synthesis and a
greedy legalization algorithm.
ACKNOWLEDGMENT
The authors would like to thank Naveen Katam, Ghasem
Pasandi, Ting-Ru Lin, and Bo Zhang from the University of
Southern California for providing tools and benchmarks used
in this paper.
TABLE IV: Results of the clock skew, total negative hold slack (TNS), worst negative hold slack (WNS) (in ps), the number of clock
splitters, maximum clock frequency, imbalance degree (maximum level difference among sinks), and comparisons with the baseline solution
[20]. Impr., Freq., and Imb. Deg. stand for improvement, maximum clock frequency, and the imbalance degree of the clock tree, respectively.

Benchmark    Skew  Impr. (%)  Hold TNS  Impr. (%)  Hold WNS  Impr. (%)  #Clk Spl.  Impr. (%)  Freq.  Imb. Deg.
KSA4          4.9       38.8      -1.4        N/A      -1.4        N/A         58        7.9   38.8          2
KSA8          5.2       48.0      -3.1     -121.4      -1.7      -21.4        158       38.0   22.7          4
KSA16         5.5       50.5         0        100         0        100        413       19.2   26.3          3
KSA32         6.0       37.5      -2.6       -4.0      -1.3      -18.2       1048       48.8   14.1          8
ArrMult8      6.0       55.6      -7.6       24.0      -1.8       35.7       1403       31.5   15.3          5
ArrMult16     8.6       48.5     -42.5      -34.5      -4.3       10.4       4797       41.4    9.5          8
ID4           5.0       54.5         0        100         0        100        419       18.0   19.2          3
ID8           6.0       64.1     -14.8       67.8      -2.6       40.9       2702       34.0    6.5          6
c432          5.8       60.5      -2.7       60.9      -0.7       70.8        975        4.7    4.9          6
c499         10.1       68.0     -31.2       64.7      -5.5       53.4        565       44.8    6.4          5
c880          6.8       62.8     -18.2       20.5      -2.9       23.7       1132       44.7    9.9          6
c1355         5.8       77.9     -17.4       77.4       4.3       70.9        617       39.7    6.6          4
c1908         6.3       53.3     -12.4       20.0      -2.4      -26.3       1099       46.3    4.3          8
c2670         9.2       38.7     -34.9       48.1      -3.5       51.4       1712       16.4    3.6          5
c3540         7.5       46.0     -35.9      -58.1      -5.3     -194.4       2678       34.6    4.4          6
c5315         8.9       48.3     -84.4       -3.1      -4.6       33.3       4482       45.3    2.6          6
c6288         8.1       54.2     -48.5        5.5      -2.8       39.1       5545       32.3    4.5          6
Average       6.8       56.4     -21.0       32.7      -2.1       51.2     1753.1       37.1

REFERENCES
[1] J. Koomey, “Worldwide electricity used in data centers,” Environ. Res.
Lett., vol. 3, no. 3, pp. 034 008–1–034 008–8, Jul 2008.
[2] T. V. Duzer and C. W. Turner, Principle of Superconducting Circuits.
New York: Elsevier, 1981.
[3] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, “Energy-Efficient
Superconducting Computing Power Budgets and Requirements,” IEEE
Transactions on Applied Superconductivity, vol. 23, no. 3, 1701610, Jun
2013.
[4] K. K. Likharev and V. K. Semenov, “RSFQ logic/memory family: A new
Josephson-junction technology for sub-terahertz-clock-frequency digital
systems,” IEEE Transactions on Applied Superconductivity, vol. 1, no. 1,
pp. 3–28, Mar 1991.
[5] W. Chen, A. V. Rylyakov, V. Patel, J. E. Lukens, and K. K. Likharev,
“Rapid single flux quantum T-flip flop operating up to 770 GHz,” IEEE
Transactions on Applied Superconductivity, vol. 9, no. 2, pp. 3212–3215,
Jun 1999.
[6] A. Mukhanov, “Energy-efficient single flux quantum technology,” IEEE
Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 760–769,
Jun 2011.
[7] S. Polonsky, “Delay insensitive RSFQ circuits with zero static power
dissipation,” IEEE Transactions on Applied Superconductivity, vol. 9, pp.
3535–3538, Jun 1999.
[8] A. H. Silver and Q. P. Herr, “A new concept for ultra-low power and
ultra-high clock rate circuits,” IEEE Transactions on Applied Superconductivity, vol. 11, pp. 333–336, Jun 2001.
[9] O. T. Oberg, Q. P. Herr, A. G. Ioannidis, and A. Y. Herr, “Integrated
power divider for superconducting digital circuits,” IEEE Transactions
on Applied Superconductivity, vol. 21, pp. 571–574, Jun 2011.
[10] Y. Yamanashi, T. Nishigai, and N. Yoshikawa, “Study of LR-loading
technique for low-power single flux quantum circuits,” IEEE Transactions
on Applied Superconductivity, vol. 17, no. 2, pp. 150–153, Jun 2007.
[11] L. R. Eaton and M. W. Johnson, “Superconducting constant current
source,” 2009. [Online]. Available: U.S. Patent 7 002 366 B2
[12] D. E. Kirichenko, A. F. Kirichenko, and S. Sarwana, “No static power
dissipation biasing of RSFQ circuits,” IEEE Transactions on Applied
Superconductivity, vol. 21, pp. 776–779, Jun 2011.
[13] M. Tanaka, M. Ito, A. Kitayama, T. Kouketsu, and A. Fujimaki, “18-GHz, 4.0-aJ/bit operation of ultra-low-energy rapid single-flux-quantum
shift registers,” Japanese Journal of Applied Physics, vol. 51, no. 5, pp.
053 102–1–053 102–4, May 2012.
[14] M. Pedram and Y. Wang, “Design automation methodology and tools
for superconductive electronics,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2018, pp. 1–6.
[15] C. J. Fourie, K. Jackman, M. M. Botha, S. Razmkhah, P. Febvre,
C. L. Ayala, Q. Xu, N. Yoshikawa, E. Patrick, M. Law, Y. Wang,
M. Annavaram, P. Beerel, S. Gupta, S. Nazarian, and M. Pedram,
“Coldflux superconducting eda and tcad tools project: Overview and
progress,” IEEE Transactions on Applied Superconductivity, vol. 29,
no. 5, pp. 1–7, Aug 2019.
[16] S. N. Shahsavani, A. Shafaei, and M. Pedram, “A placement algorithm
for superconducting logic circuits based on cell grouping and super-cell placement,” in 2018 Design, Automation Test in Europe Conference
Exhibition (DATE), 2018, pp. 1465–1468.
[17] R. N. Tadros and P. A. Beerel, “A robust and tree-free hybrid clocking
technique for rsfq circuits - csr application,” in 2017 16th International
Superconductive Electronics Conference (ISEC), June 2017, pp. 1–4.
[18] N. Kito, K. Takagi, and N. Takagi, “A fast wire-routing method and
an automatic layout tool for rsfq digital circuits considering wire-length
matching,” IEEE Transactions on Applied Superconductivity, vol. 28,
no. 4, pp. 1–5, June 2018.
[19] K. Han, A. B. Kahng, and J. Li, “Optimal generalized h-tree topology
and buffering for high-performance and low-power clock distribution,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, pp. 1–1, 2018.
[20] S. N. Shahsavani, T. R. Lin, A. Shafaei, C. J. Fourie, and M. Pedram,
“An integrated row-based cell placement and interconnect synthesis tool
for large sfq logic circuits,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–8, June 2017.
[21] A. B. Kahng, J. Lienig, I. L. Markov, and J. Hu, VLSI Physical Design:
From Graph Partitioning to Timing Closure, 1st ed. Springer Publishing
Company, Incorporated, 2011.
[22] L.-T. Wang, Y.-W. Chang, and K.-T. T. Cheng, Eds., Electronic Design
Automation: Synthesis, Verification, and Test. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 2009.
[23] J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao, “Bounded-skew
clock and steiner routing,” ACM Trans. Des. Autom. Electron. Syst.,
vol. 3, no. 3, pp. 341–388, Jul. 1998.
[24] K. Gaj, E. G. Friedman, and M. J. Feldman, “Timing of multi-gigahertz
rapid single flux quantum digital circuits,” J. VLSI Signal Process. Syst.,
vol. 16, no. 2-3, pp. 247–276, Jul. 1997.
[25] K. Takagi, Y. Ito, S. Takeshima, M. Tanaka, and N. Takagi, “Layout-Driven Skewed Clock Tree Synthesis for Superconducting SFQ Circuits,” IEICE Transactions, vol. 94-C, no. 3, pp. 288–295, 2011.
[26] S. Y. Yoshio Kameda and Y. Hashimoto, “A New Design Methodology for Single-Flux-Quantum (SFQ) Logic Circuits Using Passive-Transmission-Line (PTL) Wiring,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 508–511, Jun 2007.
[27] M. Edahiro, “A clustering-based optimization algorithm in zero-skew
routings,” in 30th ACM/IEEE Design Automation Conference, June 1993,
pp. 612–616.
[28] J. Cong, A. B. Kahng, and G. Robins, “Matching-based methods for
high-performance clock routing,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 12, no. 8, pp. 1157–1169,
Aug 1993.
[29] M. A. Jackson, A. Srinivasan, and E. S. Kuh, “Clock routing for high-performance ICs,” in Design Automation Conference, 1990. Proceedings.,
27th ACM/IEEE. IEEE, 1990, pp. 573–579.
[30] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and
Systems Perspective, 4th ed. San Francisco, CA: Addison-Wesley, 2005.
[31] I. Griva, S. Nash, and A. Sofer, Linear and Nonlinear Optimization:
Second Edition, ser. Other Titles in Applied Mathematics. Society for
Industrial and Applied Mathematics, 2009.
[32] “IBM ILOG CPLEX.” [Online]. Available: www.ilog.com/products/cplex/
[33] S. K. Tolpygo, “Superconductor digital electronics: Scalability
and energy efficiency issues (review article),” Low Temperature
Physics, vol. 42, no. 5, pp. 361–379, 2016. [Online]. Available:
https://doi.org/10.1063/1.4948618
[34] T. R. Lin, T. Edwards, and M. Pedram, “qGDR: A Via-Minimization-Oriented Routing Tool for Large-Scale Superconductive Single-Flux-Quantum Circuits,” IEEE Transactions on Applied Superconductivity,
vol. 29, no. 7, pp. 1–12, 2019.
[35] M. C. Hansen, H. Yalcin, and J. P. Hayes, “Unveiling the
ISCAS-85 benchmarks: A case study in reverse engineering,”
IEEE Des. Test, vol. 16, no. 3, Jul. 1999.