Timing Analysis and Optimization
for Many-Tier 3D ICs
Young-Joon Lee and Sung Kyu Lim
Electrical and Computer Engineering, Georgia Institute of Technology
email: yjlee@gatech.edu
Abstract— The tremendous investment required to keep semiconductor technology scaling has led industry and academia to search for alternatives that continue Moore's law. One such alternative, 3-dimensional integrated circuits (3D ICs), offers major advantages: smaller footprint, faster operation and/or lower power consumption, and a higher level of integration. These advantages can be further exploited with many-tier integration. Yet, to take advantage of many-tier systems, we need to consider design aspects of many-tier 3D ICs that are not addressed by existing design tools.
In this study, we focus on the timing analysis and optimization of a many-tier 3D IC. We perform 3D timing analysis using existing commercial 2D EDA tools as well as our in-house tools. We model the parasitics of the through-silicon vias (TSVs) based on their actual geometries.
We then present two kinds of 3D timing optimization based on different timing constraints: timing scaling and timing budgeting. Combined with the 3D timing analysis, the timing constraints successfully guide the optimization. We also discuss what is missing in the optimization flow and how it can be addressed.
I. INTRODUCTION
As process technology shrinks toward what is believed to be its physical limit, more and more people consider 3-dimensional integrated circuits (3D ICs) a viable way to continue Moore's law. The major advantages of 3D ICs over conventional 2D ICs include a smaller footprint; faster operation and/or lower power consumption due to reduced wire parasitics from shorter wirelength; and a higher level of integration, including heterogeneous systems. Since more benefits may be possible with more layers, we expect to see many-tier integration in the future. Yet, to take advantage of many-tier systems, we need to consider design aspects that are unique to many-tier 3D ICs and not addressed by existing design tools.
For 3D ICs to be beneficial, several grand challenges must be overcome [1]. In this study, we focus on the timing analysis and optimization of many-tier 3D ICs, which are crucial design steps for achieving higher operating frequency. A 3D IC design with many sets of a core die and a few memory dies is chosen as the target system, because it is one of the possible many-tier 3D IC applications in the near future.
We then present two kinds of timing-constraint-driven 3D timing optimization methods that are performed after circuit placement. Currently, there is no industrial-grade EDA tool that can handle multiple dies and perform timing optimization across them. Thus, we use existing 2D tools to perform the timing optimization.
Fig. 1. An illustration of the target many-tier 3D IC where 1,000 cores and
2,000 memory tiles are integrated. The bottom left shows a core layout, and
the bottom right shows a memory tile.
Timing constraints are set up to guide the timing optimization
engine. Combined with the 3D timing analysis, the timing
constraints successfully guide the optimization.
The remainder of the paper is organized as follows. We
present the target system in Section II. In Section III, we
discuss how to insert buffers for TSVs. Then, in Section IV we
present two kinds of 3D timing optimizations that are based on
different timing constraints. Experimental results are presented
in Section V, followed by discussions in Section VI. Finally,
we conclude in Section VII.
II. TARGET SYSTEM
A. Overall Architecture
Our target system has 1,000 processor cores and 2,000 memory tiles that perform massively parallel execution. An illustration of our many-tier system is shown in Fig. 1. Ten core layers and twenty memory layers are stacked together. The cores are connected in a 3D mesh structure (10x10x10) for inter-core communication. To reduce the design effort, a single core design is replicated and used for all the cores.
Figure 2 shows a core, its neighbor cores, and its memory
tiles. Except for the global synchronization signals, a core
communicates with only its nearest neighbors in the 3D mesh.
A single core has two memory tiles stacked on each other.
Fig. 2. A core, its neighbor cores, and its memory tiles. In (a), N/E/W/S/U/D stands for North/East/West/South/Up/Down. In (b), LP stands for landing pad. M1 and M6 are Metal Layers 1 and 6, respectively.
Fig. 3. The core architecture. Some datapaths were omitted for simplicity.
Fig. 4. A top-down view of a TSV and its circuit model (an oxide liner surrounds the TSV; the TSV is modeled as RTSV = 1 Ω with CTSV = 25 fF split into two CTSV/2 capacitors). Figure is not drawn to scale.
Fig. 5. Side view of the core and memory dies. The net to be buffered connects from a source gate on Die 1 to a sink gate on Die 4. We assume that both the source and sink gates are the same as the buffer gate.
These memory tiles serve as data storage. The benefits are twofold: 1) all cores can access their memory simultaneously, maximizing the total memory bandwidth, and 2) the memory blocks are very close to the core, enabling single-cycle access. The connections between a core and its memory tiles, as well as between a core and its neighbor core in the z-direction, are made by TSVs.
With this architecture, the entire system can perform computation in a 1D (each core executing an independent program), 2D (cores communicating with their N/E/W/S neighbors), or 3D (cores communicating with their N/E/W/S/U/D neighbors) fashion. If we were to implement the same 3D mesh architecture in 2D ICs, the die size would be very large, which degrades yield, and the core-to-core connections would be prohibitively complex.
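To make the mesh connectivity concrete, the short sketch below (our own illustration, not part of the original design files) enumerates the N/E/W/S/U/D neighbors of a core in a 10x10x10 mesh; the mapping of directions to axes is an assumption, and boundary cores simply have fewer neighbors.

# Illustrative sketch: neighbor lookup in the 10x10x10 3D core mesh.
# The direction-to-axis mapping is an assumption for illustration only.
MESH = (10, 10, 10)  # mesh dimensions in x, y, z

def neighbors(x, y, z, mesh=MESH):
    """Return the in-mesh N/E/W/S/U/D neighbor coordinates of core (x, y, z)."""
    candidates = {
        "N": (x, y + 1, z), "S": (x, y - 1, z),
        "E": (x + 1, y, z), "W": (x - 1, y, z),
        "U": (x, y, z + 1), "D": (x, y, z - 1),
    }
    return {d: c for d, c in candidates.items()
            if all(0 <= v < m for v, m in zip(c, mesh))}

print(neighbors(0, 0, 0))        # corner core: only N, E, and U neighbors
print(len(neighbors(5, 5, 5)))   # interior core: all 6 neighbors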
B. Core Architecture
To demonstrate high memory bandwidth, we designed a two-way VLIW core architecture that can issue one arithmetic or logic instruction and one memory instruction simultaneously. Figure 3 shows the core architecture. In the third pipeline stage, communication with the neighbor cores is performed. The receiving side is latched so that the core-to-core delay does not hurt overall performance. In the fourth stage, the core communicates with the memory tiles.
The N/E/W/S communication ports are placed on the N/E/W/S periphery of the core. To keep the core-to-core communication paths from becoming timing-critical, we insert pipeline flip-flops (F/Fs) on the receiving side of the communication ports.
The core was implemented in VHDL/Verilog and synthesized using Synopsys Design Compiler. The netlist of the core has about seven thousand gates, eight thousand nets, and several memory macro blocks.
C. TSV Model
A TSV is typically formed by first etching a hole in the bulk silicon, insulating the sidewall with an oxide liner, and finally filling the hole with metal. This structure is a metal-oxide-silicon (MOS) capacitor, as shown in Fig. 4. Detailed TSV MOS capacitor modeling is discussed in [2]. Although a TSV may pick up capacitance from other sources, such as coupling between neighboring TSVs and coupling between the M1 or M6 landing pads (LPs) and surrounding wires, the major source of TSV capacitance is the MOS capacitor effect.
The capacitance and resistance of a TSV are determined by various factors, such as the dimensions and material properties of the TSV, the thickness and material properties of the liner, the body bias voltage, and so on. Based on our structural assumptions, we determined RTSV and CTSV as shown in Figure 4.
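As a rough illustration of how the geometry enters these values, the sketch below computes a first-order TSV resistance and oxide-liner (cylindrical) capacitance. It ignores the depletion region and bias dependence of the full MOS capacitor model in [2], and the geometry and material values (copper fill, 2 µm diameter, 0.1 µm liner, 25 µm height) are illustrative assumptions rather than the exact parameters behind the RTSV = 1 Ω and CTSV = 25 fF used in this paper.

import math

# First-order TSV parasitic estimate (illustrative sketch only).
RHO_CU = 1.68e-8            # copper resistivity [ohm*m] (assumed fill metal)
EPS_OX = 3.9 * 8.854e-12    # SiO2 liner permittivity [F/m]

def tsv_parasitics(diameter=2e-6, liner=0.1e-6, height=25e-6):
    """Return (R, C_oxide) of a cylindrical TSV.

    C_oxide is the liner capacitance only; the full MOS model adds a
    series depletion capacitance, so the effective CTSV is lower.
    """
    r_in = diameter / 2.0
    r_out = r_in + liner
    resistance = RHO_CU * height / (math.pi * r_in ** 2)
    c_oxide = 2.0 * math.pi * EPS_OX * height / math.log(r_out / r_in)
    return resistance, c_oxide

r, c = tsv_parasitics()
print(f"R ~ {r * 1e3:.0f} mohm, C_ox ~ {c * 1e15:.0f} fF")  # ~134 mohm, ~57 fF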
III. BUFFERING TSVS
Since a TSV has a large capacitance, we may need to insert buffers (or repeaters) to achieve the desired timing goal. As shown in Figure 2(b), there are two major kinds of 3D connections in our system: 1) core-to-memory connections and 2) core-to-core connections. Based on a preliminary 3D timing analysis, we determined that the core-to-core connections were more timing-critical. Thus, we focused on buffering the core-to-core TSVs.
For a given net connecting a source gate on a core die to a sink gate on the neighbor core in the z-direction, we may need to insert buffers on the net to achieve optimal timing.
TABLE I
DELAY FROM THE SOURCE GATE TO THE SINK GATE WITH DIFFERENT TSV CAPACITANCE AND BUFFERING OPTIONS. UNIT IS ns.

                      CTSV = 5 fF    25 fF    125 fF
No buffer                   0.172    0.344     1.137
Buffer on Die 2             0.240    0.359     1.002
Buffer on Die 3             0.240    0.359     1.007
Buffers on Die 2/3          0.313    0.419     1.100
Assuming that buffers can be inserted near the TSV locations on the core dies, we need to determine whether inserting more buffers on the intermediate dies (i.e., the memory dies) is helpful. As shown in Fig. 5, we tried all possible buffering options: (a) no buffer on the intermediate dies, (b) a buffer on Die 2, (c) a buffer on Die 3, and (d) buffers on Dies 2 and 3. Since the optimal buffering solution may vary with the TSV capacitance, we ran experiments with three different TSV capacitances.
Table I shows the timing results for the different TSV capacitances and buffering options. We used Synopsys PrimeTime with the TSV parasitic model discussed in Section II-C. When the TSV capacitance is 5 fF or 25 fF, inserting buffers actually increases the delay. In contrast, when the TSV capacitance is 125 fF, inserting a buffer on Die 2 is optimal. Since in the later experiments we assume that CTSV is 25 fF, we do not insert buffers on the memory dies.
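For a quick screening of such options before a full PrimeTime run, one can estimate the source-to-sink delay with an Elmore-style sum, treating each TSV as a lumped RTSV followed by CTSV and each candidate buffer as a new driving stage. The sketch below is only an illustration with an assumed driver resistance, input capacitance, and intrinsic buffer delay rather than the actual 130 nm library data, so it does not reproduce the values in Table I; in particular, a purely linear model like this understates the benefit of buffering at very large TSV capacitance, where slew degradation dominates the real gate delays.

# Elmore-style screening of TSV buffering options (illustrative only).
# All gate parameters below are assumptions, not the paper's 130 nm library.
R_DRV = 2e3      # driver/buffer output resistance [ohm] (assumed)
C_IN  = 5e-15    # gate input capacitance [F] (assumed)
T_BUF = 50e-12   # intrinsic buffer delay [s] (assumed)
R_TSV = 1.0      # TSV resistance [ohm] (from the paper)

def stage_delay(n_tsv, c_tsv, c_load):
    """Elmore delay of one driver pushing n_tsv TSVs into c_load."""
    t, r_up = 0.0, R_DRV
    for _ in range(n_tsv):
        r_up += R_TSV
        t += r_up * c_tsv           # each TSV adds a lumped C at its far end
    return t + r_up * c_load

def path_delay(stage_tsvs, c_tsv):
    """stage_tsvs lists the TSVs driven per stage, e.g. [1, 2] = buffer on Die 2."""
    n_buffers = len(stage_tsvs) - 1
    return n_buffers * T_BUF + sum(stage_delay(n, c_tsv, C_IN) for n in stage_tsvs)

options = {"no buffer": [3], "buffer on Die 2": [1, 2],
           "buffer on Die 3": [2, 1], "buffers on Die 2/3": [1, 1, 1]}
for c_tsv in (5e-15, 25e-15, 125e-15):
    est = {name: path_delay(split, c_tsv) * 1e12 for name, split in options.items()}
    print(f"CTSV = {c_tsv * 1e15:.0f} fF:",
          ", ".join(f"{name} ~ {t:.0f} ps" for name, t in est.items()))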
IV. 3D TIMING ANALYSIS AND OPTIMIZATION
A. 3D Timing Analysis
The timing paths of the entire design can be categorized as
follows:
• Intra-core paths: This path starts from an F/F in a core, goes through combinational logic gates, and ends at an F/F in the same core.
• Core-to-core paths in the x/y-plane: This path is the communication path from a core to its N/E/W/S neighbor core.
• Core-to-core paths in the z-direction: This path is the communication path from a core to its U/D neighbor core. A path of this kind involves three TSVs.
• Core-to-memory paths: This path goes from a core to its memory tiles and vice versa. Memory clock and control signals as well as the write and read data buses fall into this category.
• Global synchronization paths: This path starts from an F/F in a core, goes through several pipeline F/Fs, and ends at the global synchronization F/F, and vice versa. Since these paths connect all the cores to a centralized synchronization logic, the wires tend to be very long and the fan-out is very high. Thus, we solve the timing issues in an architectural way, by inserting pipeline F/Fs. The penalty of this solution is the increased latency of synchronization.
With the synthesized netlist, we perform physical layout generation using Cadence Encounter. Since Encounter currently targets 2D ICs and can handle only a single die at a time, we design the system die by die.
Our 3D static timing analysis (STA) is performed as follows. First, we prepare the Verilog netlist files of all dies and the SPEF files containing the extracted parasitic values for all the nets of the dies. Then, we create a top-level Verilog netlist that instantiates the design of each die and connects them through the 3D nets. We also create an SPEF file that contains the TSV parasitics for the TSV nets. Finally, we run PrimeTime with all the Verilog files and SPEF files to obtain the 3D timing analysis results.
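The stitching step of this flow is mostly bookkeeping. The sketch below shows one way to emit the top-level Verilog netlist from a list of die instances and 3D nets; the module, instance, and port names are hypothetical placeholders, and in the real flow they are derived from the per-die netlists and the TSV locations. The TSV parasitics for these 3D nets go into the separate SPEF file mentioned above.

# Sketch of the top-level netlist generation step of the 3D STA flow.
# Die, module, and port names below are hypothetical placeholders.
def write_top_netlist(path, dies, tsv_nets):
    """dies: list of (instance_name, module_name);
    tsv_nets: list of (net_name, [(instance_name, port_name), ...])."""
    with open(path, "w") as f:
        f.write("module top;\n")
        for net, _ in tsv_nets:                    # one wire per 3D net
            f.write(f"  wire {net};\n")
        conn = {}                                  # instance -> {port: net}
        for net, pins in tsv_nets:
            for inst, port in pins:
                conn.setdefault(inst, {})[port] = net
        for inst, module in dies:
            ports = ", ".join(f".{p}({n})" for p, n in conn.get(inst, {}).items())
            f.write(f"  {module} {inst} ({ports});\n")
        f.write("endmodule\n")

# Hypothetical example: one z-direction core-to-core net from Die 1 to Die 4.
write_top_netlist("top_3d.v",
                  dies=[("die1", "core"), ("die2", "mem"),
                        ("die3", "mem"), ("die4", "core")],
                  tsv_nets=[("up_data_0",
                             [("die1", "up_out_0"), ("die4", "dn_in_0")])])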
B. 3D Timing Optimization
Since the target system is very large, we perform the physical design in a hierarchical manner. Compared to a flattened (non-hierarchical) design flow, in a hierarchical design flow the timing constraints on the hierarchy boundaries are important, because they are the key information that the timing optimization engine uses for each level of the design hierarchy.
After a preliminary timing analysis (before any timing optimization), we determined that the critical paths are the intra-core paths and the core-to-core paths in the z-direction. Since the intra-core paths can be optimized without any boundary timing constraints, we focus on the boundaries of the core-to-core paths in the z-direction.
Since we know that the intra-core paths and the core-to-core paths in the z-direction need timing optimization, we optimize the timing of the core under boundary timing constraints. We demonstrate two methods to generate the timing constraints on the boundaries of the core-to-core paths in the z-direction: timing scaling and timing budgeting.
1) Timing Scaling: The timing scaling method scales the input/output delay timing constraints at each boundary point. Consider a 3D path from a source F/F in a core through a boundary point to a sink F/F in a neighbor core in the z-direction. After the 3D timing analysis is done, we obtain the longest path delay from the source to the sink (TLPD) as well as the delay up to the boundary (Tboundary). To achieve the target clock period TCLK, ideally we need to make TLPD equal to TCLK. Thus, we set the scaling factor SF = TCLK / TLPD. Then we calculate the scaled boundary constraints as follows:
Tboundary,scaled = Tboundary × SF
The updated timing constraint file is used in the timing optimization. With this method, all the 3D paths are constrained so as to meet the target clock period. We implemented this method in PrimeTime Tcl and Perl.
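The arithmetic itself is simple; what matters is how the scaled delays are expressed as boundary constraints for the 2D tool. The sketch below shows one possible translation into standard SDC input/output delay commands (the paper's actual implementation is in PrimeTime Tcl and Perl); the clock name, port names, and delay values are hypothetical, and Tboundary is taken as the 3D path delay from the launching F/F up to the boundary port.

# Illustrative sketch of timing-scaling constraint generation.
# Clock/port names and delays are hypothetical; times are in ns.
T_CLK = 2.4  # target clock period

boundary_points = [
    # (port, direction as seen from this die, T_LPD, T_boundary)
    ("dn_in_0",  "input",  3.0, 1.8),   # 3D path enters this die here
    ("up_out_0", "output", 3.0, 1.2),   # 3D path leaves this die here
]

with open("scaled_boundary.sdc", "w") as sdc:
    for port, direction, t_lpd, t_bnd in boundary_points:
        sf = T_CLK / t_lpd              # SF = TCLK / TLPD
        t_scaled = t_bnd * sf           # Tboundary,scaled = Tboundary * SF
        if direction == "input":
            # scaled delay already spent outside this die before the port
            sdc.write(f"set_input_delay -clock clk {t_scaled:.3f} "
                      f"[get_ports {port}]\n")
        else:
            # scaled delay still needed outside this die after the port
            sdc.write(f"set_output_delay -clock clk {T_CLK - t_scaled:.3f} "
                      f"[get_ports {port}]\n")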
2) Timing Budgeting: Timing budgeting [3] distributes the timing slack of a path to each net on the path. This method analyzes the timing graph of the entire circuit to find out where the critical paths are. Nets on non-critical paths can be given a positive timing budget, which can be used for other circuit optimizations such as area and power minimization. On the other hand, nets on critical paths are given negative timing budgets, which means that the delays of those nets should be reduced by the timing optimization. We use Synopsys Design Compiler to perform the timing budgeting.
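The sketch below illustrates the core idea on a single path only, distributing the path slack over its nets in proportion to each net's delay. Real budgeting, including [3] and the Design Compiler flow used here, works on the whole timing graph and must reconcile budgets for nets shared by many paths; the net names and delays are hypothetical.

# Much-simplified, single-path illustration of timing budgeting.
def budget_path(net_delays, t_clk):
    """Distribute the path slack over the nets, proportionally to net delay.

    A negative slack yields negative budgets (net delays must shrink);
    a positive slack yields margins usable for area/power optimization.
    """
    path_delay = sum(net_delays.values())
    slack = t_clk - path_delay
    return {net: slack * d / path_delay for net, d in net_delays.items()}

# Hypothetical critical 3D path: 2.4 ns clock target, 3.0 ns path delay.
nets = {"u_alu/out": 1.0, "tsv_up_0": 1.4, "u_reg/d": 0.6}
print({net: round(b, 3) for net, b in budget_path(nets, t_clk=2.4).items()})
# -> {'u_alu/out': -0.2, 'tsv_up_0': -0.28, 'u_reg/d': -0.12}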
The overall design flow is shown in Fig. 6. With the
generated timing constraints, timing optimization is performed
by Encounter. We iterate the optimization loop several times.
Fig. 6. Design flow with (a) timing scaling and (b) timing budgeting. Both flows perform initial placement and circuit extraction, build the top-level netlist and TSV model, and run an initial 3D STA; they then iterate over calculating the scaling factor (a) or running timing budgeting (b), generating timing constraints, timing optimization per die, circuit extraction, and 3D STA.
Fig. 7. Experimental results with timing scaling and budgeting: total wirelength (µm), number of added buffers, WNS (ns), and TNS (ns) over optimization iterations 0 to 5, for scaling and budgeting (WNS is also shown for the 3D nets only).
V. EXPERIMENTAL RESULTS
We perform timing optimization with the two timing constraint generation methods described in Section IV-B. We use a 130 nm technology node with 6 metal layers. RTSV and CTSV are 1 Ω and 25 fF, respectively. The target clock period is 2.4 ns. We perform 5 optimization iterations.
The experimental results are shown in Fig. 7. In both the scaling and budgeting cases, the total wirelength keeps increasing during the first 5 iterations. The total wirelength with scaling is longer than that with budgeting, although the gap decreases with iterations. Budgeting used about 8.1% more buffers than scaling.
The worst negative slack (WNS) of the entire design comes from a timing-critical path that goes through the muxes of the forwarding logic and the ALU for address calculation, and arrives at the instruction memory. After 5 iterations, the WNS of the entire design is almost the same for both cases, with a difference of less than 20 ps. From the first iteration, the WNS of the 3D nets with scaling is positive, meaning the timing goal has been met. In contrast, with budgeting the WNS of the 3D nets is negative all the time. That is because, in the budgeting case, the WNS of the 3D nets was smaller than the WNS of the entire design, and the budgeting algorithm focused on minimizing the WNS of the entire design. Thus, budgeting did not generate tight timing constraints on the 3D nets. Meanwhile, the total negative slack (TNS) is about 14% better with scaling.
VI. DISCUSSIONS
From this study, we found the following noteworthy points.
• Even though timing budgeting is much more complex than timing scaling, it did not lead to better timing results. That is partly because, in this study, the 3D nets were not the timing-critical nets. However, if the 3D nets were indeed on critical paths, the results with budgeting could be different.
• It is possible to perform timing optimization with existing 2D CAD tools by providing timing constraints on the boundaries of the design hierarchy; however, we cannot exploit the full benefit of the 3D structure. The 2D tool does not see the whole 3D picture, so various powerful optimization techniques such as net transformation and restructuring cannot be performed. This shortcoming could get worse with many-tier systems, so a true 3D CAD tool should be developed.
• The placement of gates and TSVs should be 3D-aware. In this study, we performed the gate placement using an existing 2D tool. Moreover, the locations of the TSVs were determined manually by generating several layouts and evaluating them with timing analysis. We observed that moving the gates on the 3D paths to better locations led to better timing results. We also found that better TSV locations could lead to shorter wirelength for the 3D nets.
VII. CONCLUSIONS
In this paper, we presented timing analysis and optimization for many-tier 3D ICs. Two methods for timing constraint generation were discussed and experimental results were presented. Although it is possible to perform timing optimization in 2D CAD tools with timing constraints, it is not the optimal way to exploit the benefits of many-tier 3D ICs. Our future work includes further timing optimization with buffer insertion for timing-critical 3D nets and 3D net topology generation.
ACKNOWLEDGMENT
This material is based upon work supported by the National
Science Foundation under CAREER Grant No. CCF-0546382
and the Interconnect Focus Center (IFC).
REFERENCES
[1] S. K. Lim, “TSV-Aware 3D Physical Design Tool Needs for Faster
Mainstream Acceptance of 3D ICs,” ACM DAC Knowledge Center, Feb.
2010. [Online]. Available: http://www.dac.com
[2] G. Katti, M. Stucchi, K. D. Meyer, and W. Dehaene, “Electrical Modeling
and Characterization of Through Silicon via for Three-Dimensional ICs,”
IEEE Transactions on Electron Devices, vol. 57, no. 1, pp. 256–262, Jan.
2010.
[3] R. Nair, C. L. Berman, P. S. Hauge, and E. J. Yoffa, "Generation of performance constraints for layout," IEEE Transactions on Computer-Aided Design, vol. 8, pp. 860–874, Aug. 1989.