Timing Analysis and Optimization for Many-Tier 3D ICs

Young-Joon Lee and Sung Kyu Lim
Electrical and Computer Engineering, Georgia Institute of Technology
email: yjlee@gatech.edu

Abstract: The tremendous investment required to sustain semiconductor technology scaling has led industry and academia to search for alternative ways to continue Moore’s law. As one alternative, 3-dimensional integrated circuits (3D ICs) offer major advantages: smaller footprint, faster operation and/or lower power consumption, and a higher level of integration. These advantages can be further exploited with many-tier integration. Yet, to take advantage of many-tier systems, we need to consider design aspects of many-tier 3D ICs that are not addressed by existing design tools. In this study, we focus on the timing analysis and optimization of a many-tier 3D IC. We perform 3D timing analysis using existing commercial 2D EDA tools as well as our in-house tools. We consider parasitics for the through-silicon vias (TSVs) that depend on the actual TSV geometries. Then, we present two kinds of 3D timing optimization that are based on different timing constraints: timing scaling and timing budgeting. Combined with the 3D timing analysis, the timing constraints successfully guide the optimization. We also discuss what is missing in the optimization flow and how it can be addressed.

I. INTRODUCTION

As process technology shrinks toward what is believed to be the physical limit, more and more people consider 3-dimensional integrated circuits (3D ICs) a viable way to continue Moore’s law. The major advantages of 3D ICs over conventional 2D ICs include smaller footprint, faster operation and/or lower power consumption due to reduced wire parasitics from shorter wirelength, and a higher level of integration, including heterogeneous systems. Since more benefits may be possible with more layers, we expect to see many-tier integration in the future.
Yet, to take advantage of many-tier systems, we need to consider design aspects that are unique to many-tier 3D ICs and not addressed by existing design tools. For 3D ICs to be beneficial, several grand challenges must be overcome [1]. In this study, we focus on the timing analysis and optimization of many-tier 3D ICs, which are crucial design steps for achieving a higher operating frequency. A 3D IC design with many sets of a core die and a few memory dies is chosen as the target system, because it is one of the possible many-tier 3D IC applications in the near future. Then, we present two timing-constraint-driven 3D timing optimization methods that are performed after circuit placement. Currently, there is no industrial-grade EDA tool that can handle multiple dies and perform timing optimization. Thus, we use existing 2D tools to perform timing optimization. Timing constraints are set up to guide the timing optimization engine. Combined with the 3D timing analysis, the timing constraints successfully guide the optimization.

Fig. 1. An illustration of the target many-tier 3D IC where 1,000 cores and 2,000 memory tiles are integrated. The bottom left shows a core layout, and the bottom right shows a memory tile.

The remainder of the paper is organized as follows. We present the target system in Section II. In Section III, we discuss how to insert buffers for TSVs. Then, in Section IV, we present two kinds of 3D timing optimization that are based on different timing constraints. Experimental results are presented in Section V, followed by discussions in Section VI. Finally, we conclude in Section VII.

II. TARGET SYSTEM

A. Overall Architecture

Our target system has 1,000 processor cores and 2,000 memory tiles that perform massively parallel execution. An illustration of our many-tier system is shown in Fig. 1. Ten core layers and 20 memory layers are stacked together. Cores are connected in a 3D mesh structure (10x10x10) for inter-core communication.
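
The connectivity implied by a 10x10x10 mesh can be made concrete with a short Python sketch. The coordinate convention and the mapping of the direction labels N/E/W/S/U/D (north, east, west, south, up, down) onto axes are assumptions made here for illustration; the text does not specify them. The function names are hypothetical.

```python
from itertools import product

# Mesh dimensions from the text: 10 x 10 cores per layer, 10 core layers.
MESH = (10, 10, 10)

# Assumed mapping of direction labels to (x, y, z) offsets (illustrative only).
DIRECTIONS = {
    "N": (0, 1, 0),
    "S": (0, -1, 0),
    "E": (1, 0, 0),
    "W": (-1, 0, 0),
    "U": (0, 0, 1),
    "D": (0, 0, -1),
}


def neighbors(core, mesh=MESH):
    """Return {direction: coordinate} for the nearest neighbors of a core.

    A mesh has no wrap-around, so cores on a face, edge, or corner
    have fewer than six neighbors.
    """
    result = {}
    for name, delta in DIRECTIONS.items():
        pos = tuple(c + d for c, d in zip(core, delta))
        if all(0 <= p < n for p, n in zip(pos, mesh)):
            result[name] = pos
    return result


def count_links(mesh=MESH):
    """Count undirected core-to-core links in the mesh."""
    cores = product(*(range(n) for n in mesh))
    directed = sum(len(neighbors(core, mesh)) for core in cores)
    return directed // 2  # each link is seen once from each end


if __name__ == "__main__":
    print(neighbors((0, 0, 0)))       # corner core
    print(len(neighbors((5, 5, 5))))  # interior core
    print(count_links())
```

Under these assumptions, a corner core has three neighbors, an interior core has six, and the mesh contains 2,700 core-to-core links, 900 of which run in the z-direction.
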
To reduce the design effort, a single core design is replicated and used for all the cores. Figure 2 shows a core, its neighbor cores, and its memory tiles. Except for the global synchronization signals, a core communicates only with its nearest neighbors in the 3D mesh. A single core has two memory tiles stacked on each other. These memory tiles serve as data storage. The benefits are twofold: 1) all cores can access their memory simultaneously, maximizing total memory bandwidth, and 2) the memory blocks are very close to the core, enabling single-cycle access. The connections between a core and its memory tiles, as well as between a core and its neighbor core in the z-direction, are made by TSVs. With this architecture, the entire system can perform computation in 1D (each core executing an independent program), 2D (cores communicate to N/E/W/S), or 3D (cores communicate to N/E/W/S/U/D) fashion. If we were to implement the same 3D mesh architecture in 2D ICs, the die size would be very large, which degrades yield, and the core-to-core connections would be prohibitively complex.

Fig. 2. A core, its neighbor cores, and its memory tiles. (a) A core and its neighbor cores. (b) Side view of a core and its memory tiles. In (a), N/E/W/S/U/D stands for North/East/West/South/Up/Down. In (b), LP stands for landing pad. M1 and M6 are Metal Layer 1 and 6, respectively.

B. Core Architecture

To demonstrate high memory bandwidth, we designed a two-way VLIW core architecture that can issue one arithmetic or logic instruction and one memory instruction simultaneously. Figure 3 shows the core architecture. In the third stage, communication to the neighbor cores is performed. The receiving side is latched so that the core-to-core delay does not hurt overall performance. In the fourth stage, the core communicates with the memory tiles. The N/E/W/S communication ports are placed on the N/E/W/S periphery of the core. To prevent the core-to-core communication paths from becoming timing-critical, we insert pipeline flip-flops (F/Fs) on the receiving side of the communication ports. The core was implemented in VHDL/Verilog and synthesized using Synopsys Design Compiler. The netlist of the core has about seven thousand gates, eight thousand nets, and several memory macro blocks.

Fig. 3. The core architecture. Some datapaths were omitted for simplicity.

Fig. 4. A top-down view of a TSV and its circuit model. Figure is not drawn to scale. Labels recovered from the figure: TSV 2 um, liner 0.1 um, bulk Si, $R_{TSV}$ = 1 Ω, $C_{TSV}$ = 25 fF, with two $C_{TSV}/2$ elements in the circuit model.

Fig. 5. Side view of core and memory dies: Die 1 (core), Die 2 (mem.), Die 3 (mem.), and Die 4 (core), with buffering options (a) to (d). The net to be buffered connects from a source gate on Die 1 to a sink gate on Die 4. We assume that both the source and sink gates are the same as the buffer gate.

C. TSV Model

A TSV is typically formed by first etching a hole in the bulk silicon, insulating the side wall with an oxide liner, and lastly filling the space with metal. This structure is a metal-oxide-silicon (MOS) capacitor, as shown in Fig. 4. Detailed TSV MOS capacitor modeling is discussed in [2]. Although a TSV may have capacitance from different sources, such as coupling between neighboring TSVs and coupling between M1 or M6 landing pads (LPs) and surrounding wires, the major source of TSV capacitance is the MOS capacitor effect. The capacitance and the resistance of a TSV are determined by various factors such as the dimensions and material properties of the TSV, the thickness and material properties of the liner, the body bias voltage, and so on.
Based on our structural assumptions, we determined $R_{TSV}$ and $C_{TSV}$ as shown in Figure 4.

III. BUFFERING TSVS

Since a TSV has a large capacitance, we may need to insert buffers (or repeaters) to achieve the desired timing goal. As shown in Figure 2(b), there are two major kinds of 3D connections in our system: 1) core-to-memory connections and 2) core-to-core connections. Based on a preliminary 3D timing analysis, we determined that core-to-core connections were more timing-critical. Thus, we focused on buffering core-to-core TSVs. For a given net connecting a source gate on a core die to a sink gate on the neighbor core in the z-direction, we may need to insert buffers on the net to achieve optimal timing.

TABLE I
DELAY FROM THE SOURCE GATE TO THE SINK GATE WITH DIFFERENT TSV CAPACITANCE AND BUFFERING OPTIONS. UNIT IS ns.

$C_{TSV}$   No buffer   Buffer on Die 2   Buffer on Die 3   Buffer on Die 2/3
5 fF        0.172       0.240             0.240             0.313
25 fF       0.344       0.359             0.359             0.419
125 fF      1.137       1.002             1.007             1.100

Assuming that buffers can be inserted near the TSV location on the core dies, we need to determine whether inserting more buffers on the intermediate dies (i.e., memory dies) is helpful. As shown in Fig. 5, we tried all possible buffering options: (a) no buffer in the intermediate dies, (b) a buffer on Die 2, (c) a buffer on Die 3, and (d) buffers on Die 2 and Die 3. Since the optimal buffering solution may vary with TSV capacitance, we ran experiments with three different TSV capacitances. Table I shows the timing results with different TSV capacitances and buffering options. We used Synopsys PrimeTime with the TSV parasitic model discussed in Section II-C. When the TSV capacitance is 5 fF or 25 fF, inserting buffers actually increased the delay. In contrast, when the TSV capacitance is 125 fF, inserting a buffer on Die 2 was optimal. Since in later experiments we assume that $C_{TSV}$ is 25 fF, we do not insert buffers on memory dies.

IV. 3D TIMING ANALYSIS AND OPTIMIZATION

A. 3D Timing Analysis

The timing paths of the entire design can be categorized as follows:
• Intra-core paths: This path starts from a F/F in a core, goes through combinational logic gates, and ends at a F/F in the same core.
• Core-to-core paths in x/y-plane: This path is the communication path from a core to its N/E/W/S neighbor core.
• Core-to-core paths in z-direction: This path is the communication path from a core to its U/D neighbor core. A path of this kind involves three TSVs.
• Core-to-memory paths: This path goes from a core to its memory tiles and vice versa. Memory clock and control signals, as well as write and read data buses, fall into this category.
• Global synchronization paths: This path starts from a F/F in a core, goes through several pipeline F/Fs, and ends at the global synchronization F/F, and vice versa. Since these paths connect all the cores to a centralized synchronization logic, wires tend to be very long and fan-out is very high. Thus, we solve the timing issues architecturally, by inserting pipeline F/Fs. The penalty of this solution is the increased latency of synchronization.

With the synthesized netlist, we perform physical layout generation using Cadence Encounter. Since Encounter is currently a 2D IC tool and can handle only a single die at a time, we design die by die.

Our 3D static timing analysis (STA) is performed as follows. First, we prepare the Verilog netlist files of all dies and the SPEF files containing extracted parasitic values for all the nets of the dies. Then, we create a top-level Verilog netlist that instantiates the design of each die and connects them by the 3D nets. We also create an SPEF file that has the TSV parasitics for the TSV nets. Finally, we run PrimeTime with all the Verilog files and the SPEF files to get the 3D timing analysis results.

B. 3D Timing Optimization

Since the target system is very large, we perform the physical design in a hierarchical manner.
Compared to a flattened (non-hierarchical) design flow, in a hierarchical design flow the timing constraints on the boundaries are important because they are the key information that the timing optimization engine uses for each design hierarchy. After preliminary timing analysis (before any timing optimization), we determined that the critical paths are intra-core paths and core-to-core paths in the z-direction. Since the intra-core paths can be optimized without any boundary timing constraints, we focus on the boundary of core-to-core paths in the z-direction. Since both the intra-core paths and the core-to-core paths in the z-direction need timing optimization, we optimize the timing of the core under these boundary timing constraints. We demonstrate two methods to generate the timing constraints on the boundary of core-to-core paths in the z-direction: timing scaling and timing budgeting.

1) Timing Scaling: The timing scaling method scales the input/output delay timing constraints at each boundary point. Consider a 3D path from a source F/F in a core, through a boundary point, to a sink F/F in a neighbor core in the z-direction. After the 3D timing analysis is done, we obtain the longest path delay from the source to the sink ($T_{LPD}$) as well as the delay up to the boundary ($T_{boundary}$). To achieve the target clock period $T_{CLK}$, ideally we need to make $T_{LPD}$ equal to $T_{CLK}$. Thus, we set the scaling factor $SF = T_{CLK} / T_{LPD}$. Then we calculate the scaled boundary constraints as follows:

$$T_{boundary,scaled} = T_{boundary} \times SF$$

The updated timing constraint file is used in timing optimization. With this method, all the 3D paths are constrained so as to meet the target clock period. We implemented this method in PrimeTime Tcl and Perl.

2) Timing Budgeting: Timing budgeting [3] distributes the timing slack of a path to each net on the path. This method analyzes the timing graph of the entire circuit to find out where the critical paths are.
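
The two ways of deriving boundary constraints in this section can be illustrated with a minimal Python sketch. It is not the PrimeTime Tcl/Perl implementation or the Design Compiler budgeting used in this work; the function names and sample delays are hypothetical. The first function applies the scaling rule with SF = T_CLK / T_LPD to one boundary point. The second distributes the slack of a single path over its nets in proportion to their current delays, which is a simplifying assumption and not the algorithm of [3].

```python
def scale_boundary_delay(t_boundary, t_lpd, t_clk):
    """Timing scaling for one boundary point.

    Returns T_boundary * SF, where SF = T_CLK / T_LPD.
    All delays are in the same unit (ns in this sketch).
    """
    if t_lpd <= 0:
        raise ValueError("longest path delay must be positive")
    scaling_factor = t_clk / t_lpd
    return t_boundary * scaling_factor


def distribute_slack(net_delays, t_clk):
    """Simplified budgeting on ONE path (illustrative assumption).

    The path slack (T_CLK minus the sum of net delays) is split over
    the nets in proportion to each net's current delay. A negative
    budget means the net's delay should be reduced.
    """
    path_delay = sum(net_delays.values())
    if path_delay <= 0:
        raise ValueError("path delay must be positive")
    slack = t_clk - path_delay
    return {net: slack * delay / path_delay
            for net, delay in net_delays.items()}


if __name__ == "__main__":
    T_CLK = 2.4  # target clock period in ns

    # Hypothetical path: 3.0 ns end to end, 1.5 ns up to the boundary.
    print(scale_boundary_delay(t_boundary=1.5, t_lpd=3.0, t_clk=T_CLK))

    # Hypothetical net delays (ns) along one critical 3D path.
    print(distribute_slack({"n1": 1.0, "tsv_net": 0.5, "n2": 1.5}, T_CLK))
```

In the first call the path is slower than the clock period, so SF is below 1 and the boundary delay is tightened. In the second call the path slack is negative, so every net receives a negative budget, and the budgets sum to the path slack.
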
Nets on non-critical paths can be given a positive timing budget, which can be used for other circuit optimizations such as area and power minimization. On the other hand, nets on critical paths are given negative timing budgets, which means that the delays of these nets should be reduced by timing optimization. We use Synopsys Design Compiler to perform timing budgeting.

The overall design flow is shown in Fig. 6. With the generated timing constraints, timing optimization is performed by Encounter. We iterate the optimization loop several times.

Fig. 6. Design flow with timing scaling and timing budgeting. Initial steps: perform initial placement and circuit extraction; make top-level netlist and TSV model; initial 3D STA. Iterated steps: (a) with timing scaling: calculate scaling factor, generate timing constraints, timing optimization per die, circuit extraction, 3D STA; (b) with timing budgeting: run timing budgeting, generate timing constraints, timing optimization per die, circuit extraction, 3D STA.

V. EXPERIMENTAL RESULTS

We perform timing optimization with the two timing constraint generation methods described in Section IV-B. We use a 130 nm technology node with six metal layers. $R_{TSV}$ and $C_{TSV}$ are 1 Ω and 25 fF, respectively. The target clock period is 2.4 ns. We perform five optimization iterations. The experimental results are shown in Fig. 7.

In both the scaling and budgeting cases, the total wirelength keeps increasing during the first five iterations. The total wirelength of scaling is longer than that of budgeting, although the gap decreases with iterations. Budgeting used about 8.1% more buffers than scaling. The worst negative slack (WNS) of the entire design comes from a timing-critical path that goes through the muxes of the forwarding logic and the ALU for address calculation, and arrives at the instruction memory. After five iterations, the WNS of the entire design is almost the same for both cases, with a difference of less than 20 ps. From the first iteration, the WNS of 3D nets with scaling is positive, meaning the timing goal has been met. In contrast, with budgeting the WNS of 3D nets is negative all the time. That is because in the budgeting case, the WNS of 3D nets was smaller than the WNS of the entire design, and the budgeting algorithm focused on minimizing the WNS of the entire design. Thus, budgeting did not generate tight timing constraints on 3D nets. Meanwhile, total negative slack (TNS) is about 14% better with scaling.

Fig. 7. Experimental results with timing scaling and budgeting. Panels: total wirelength (um), number of added buffers, WNS (ns), and TNS (ns), each plotted against iteration (0 to 5). Curves: scaling and budgeting, plus scaling and budgeting for 3D nets.

VI. DISCUSSIONS

From this study, we found the following noteworthy points.
• Even though timing budgeting is much more complex than timing scaling, it did not lead to better timing results. That is partly because in this study the 3D nets were not the timing-critical nets. However, if the 3D nets are indeed on critical paths, the results with budgeting would be different.
• It is possible to perform timing optimization with existing 2D CAD tools by providing timing constraints on the boundaries of the design hierarchy; however, we cannot exploit the full benefit of the 3D structure. The 2D tool does not see the whole 3D picture, so various powerful optimization techniques such as net transformation and restructuring cannot be performed. This shortcoming could get worse with many-tier systems; thus, a true 3D CAD tool should be developed.
• The placement of gates and TSVs should be 3D-aware. In this study, we performed the gate placement using an existing 2D tool. Moreover, the locations of TSVs were determined manually by generating several layouts and evaluating them with timing analysis.
We observed that moving the gates on the 3D paths to better locations led to better timing results. We also found that better TSV locations could lead to shorter wirelength for 3D nets.

VII. CONCLUSIONS

In this paper, we presented the timing analysis and optimization for many-tier 3D ICs. Two methods for timing constraint generation were discussed, and experimental results were presented. Although it is possible to perform timing optimization in 2D CAD tools with timing constraints, it is not the optimal way to exploit the benefits of many-tier 3D ICs. Our future work includes further timing optimization with buffer insertion for timing-critical 3D nets and 3D net topology generation.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under CAREER Grant No. CCF-0546382 and the Interconnect Focus Center (IFC).

REFERENCES

[1] S. K. Lim, “TSV-Aware 3D Physical Design Tool Needs for Faster Mainstream Acceptance of 3D ICs,” ACM DAC Knowledge Center, Feb. 2010. [Online]. Available: http://www.dac.com
[2] G. Katti, M. Stucchi, K. D. Meyer, and W. Dehaene, “Electrical Modeling and Characterization of Through Silicon via for Three-Dimensional ICs,” IEEE Transactions on Electron Devices, vol. 57, no. 1, pp. 256–262, Jan. 2010.
[3] R. Nair, C. L. Berman, P. S. Hauge, and E. J. Yoffa, “Generation of performance constraints for layout,” IEEE Transactions on Computer-Aided Design, vol. 8, pp. 860–874, Aug. 1989.