A53 core optimization to achieve performances
Calogero TIMINERI
Julien BUVAT
Julien GUILLEMAIN
Sebastien PEURICHARD
STMicroelectronics
12 rue Jules Horowitz,
B.P. 217
38019 Grenoble Cedex,
France
www.st.com
ABSTRACT
In order to achieve the best PPA (performance, power, and area) on ARM A53 cores, the CPU/GPU team at STMicroelectronics has developed a dedicated physical implementation flow based on IC Compiler (ICC).
The purpose of this article is to share various techniques used by designers to meet the performance target all along the ICC flow. We will describe those techniques and their impact on the Quality of Results (QoR), such as timing, leakage, routability and DFM score.
Table of Contents
1. Introduction .............................................................. 4
2. CPU cores implementation challenges ....................................... 5
3. Implementation Flow Presentation .......................................... 6
4. Timing calibration throughout implementation flow ......................... 8
4.1. Purpose of calibration .................................................. 8
4.1.1. Project Schedule ...................................................... 9
4.1.2. PPA monitored with a dashboard ........................................ 9
4.1.3. Two types of miscorrelation can be addressed by calibration .......... 11
4.2. DCG vs ICC: setting alignment .......................................... 11
4.3. Pre-cts vs post-cts correlation ........................................ 11
4.3.1. A budget for each clock and every PVT (Process, Voltage, Temperature)  12
4.3.2. Clocktree exceptions ................................................. 13
4.3.3. Empirical calibration ................................................ 13
4.4. Pre-route vs post-route on data path ................................... 14
4.5. ICC vs PrimeTime correlation ........................................... 15
4.5.1. Crosstalk effect ..................................................... 15
4.5.2. Delay calculation in ICC and PrimeTime using Graph Based Analysis and Path Based Analysis ... 16
5. DRC and physical convergence ............................................. 18
5.1. Dealing with design related congestion ................................. 18
5.2. Resolving DRC problems under power stripe with PNET feature ............ 19
5.2.1. Power grid structure and routing resources ........................... 19
5.2.2. Synopsys pnet feature ................................................ 21
5.3. Results ................................................................ 21
5.4. DRC convergence with signoff ........................................... 22
6. Conclusions .............................................................. 24
7. References ............................................................... 24
Table of Figures
Figure 1: CPU core floorplan ................................................. 5
Figure 2: Top-down implementation methodology [2] ............................ 7
Figure 3: Vt usage KPI ...................................................... 10
Figure 4: Slack distribution histogram. Moving margins has a big impact on violation count and TNS ... 12
Figure 5: C sensitive path (yellow), R sensitive path (orange) .............. 14
Figure 6: Congestion prone hierarchy (red) and DRC (white) .................. 18
Figure 7: Congestion map without partial density blockage on congestion prone hierarchy ... 19
Figure 8: Congestion map with partial density blockage on congestion prone hierarchy ... 19
Figure 9: Layer structure from IA/IB tapping down to M1 with via stack ...... 20
Figure 10: Power grid structure ............................................. 20
Figure 11: Routed signals DRC under M3 power stripes ........................ 21
Figure 12: Via enclosure height is 0.095um in ICC ........................... 22
Figure 13: Built-in ICV flow [1] ............................................ 23
SNUG 2015
2
A53 core optimization to achieve performances
Table of Tables
Table 1: Example of area variations highlighting the need for calibration .... 8
Table 2: Example of timing difference between DCG and ICC prectsopt. Path delay contributions ... 9
Table 3: Timing results throughout the implementation steps ................. 10
1. Introduction
ARM-based CPU subsystems have become a major industry standard. They fit in a wide variety of consumer products whose requirements obey the rules of the consumer market: the right feature with a short time to market. One of our main challenges is to design various differentiated cores within an aggressive schedule. The CPU/GPU team has developed adaptive methodologies that allow a fast convergence to signoff.
CPU cores have specifications and expected performances that challenge the technology as well as the entire design flow from RTL to GDS. All design stages are stressed to pull the best performance, including physical implementation. Not only do CPU cores have to run at high frequency, they also have to operate in a wide range of conditions. This implies functionality must be guaranteed with multiple voltage supplies and various functional modes. Furthermore, area is shrunk and density pushed up to get the most cost-efficient product. Static and dynamic power are implementation constraints too, as they are key parameters when assessing CPU performance.
Advanced technology nodes, like STMicroelectronics' 28nm FDSOI technology, are becoming more constraining in terms of design rules. Some rules are mandatory to obey (often known as Design Rule Checks), others are good practices that help manufacturing and allow yield improvement, sometimes known as lithography rules. These rules need to be checked during implementation in order to minimise signoff loops and to avoid manual editing of the physical database.
2. CPU cores implementation challenges
Tight project schedules lead us to find methodologies that shorten the design phase. One way to do so is to improve turnaround time, for instance by decreasing signoff loops as much as possible. Minimising the QoR gap between each step helps the implementation tool converge faster and reduces runtime. Typically, we shall look at reducing timing deviation, as well as area or leakage divergence. Power, performance and area are part of the CPU specification. It is essential to know as early as possible whether these targets can be achieved in order to commit on feasibility.
With frequency targets over the gigahertz, margins are slim: a 50ps miscorrelation impacts CPU performance by 5%. A small timing miscorrelation may severely impact the usage of low leakage cells. CPU constraints have to be set accurately to achieve the expected performance within the static power budget. Density and area have to be tightly controlled considering the usual buffering of the clocktree and hold fixing. Increasing the area may improve routability, but on the other hand it may increase dynamic power too. Timing, area and density budgets have to be set accurately as they are a trade-off between all the key indicators of a CPU.
Main features of the CPU cores we implement:
 <1.1 mm²
 >1.4GHz
 >600k instances
 60k registers
 29 memory cuts
 200 scenarios
 Aggressive dynamic and static power targets
 Large voltage range
 28nm FDSOI
 10ML
Figure 1: CPU core floorplan
3. Implementation Flow Presentation
CPU cores are generally not stand-alone IPs. Their implementation has to be set in the more global context of a CPU subsystem.
Figure 2 shows the top-down methodology for subsystem design, implementation and signoff. Cores inherit all their inputs from the budgeting and partitioning: timing constraints, UPF, and floorplan. Cores go through the full physical implementation and signoff loop until timing is met and physical checks successfully pass. Top and blocks are then put together for a flat signoff and further ECO on the top partition.
[Figure 2 content, top-down approach: RTL Assembly + UPF → Flat Synthesis → Netlist/UPF Handoff → Design Planning (Partitioning, Budgeting) → Top Hier Implementation and Blocks Implementation (CPU Implementation) → Top Hier Signoff and Blocks Signoff with PT-ECO → Top Flat Signoff]
Figure 2: Top-down implementation methodology [2]
4. Timing calibration throughout the implementation flow
4.1. Purpose of calibration
In the early phases of implementation, some margin is kept to account for model inaccuracy. It generally applies to the entire design as extra conservatism. It is relaxed in later steps as the maturity of the design grows. Setting too much conservatism may result in overdesign, or in failure to converge as the tools strive to meet all constraints. On the other hand, too little conservatism makes the job easier for the tool, with the risk of the target performance not being met. Calibration is all about estimating these margins as precisely as possible.
In order to guarantee a smooth transition between implementation steps, all parameters causing divergence must be identified, whether they are tool settings, constraints, placement, clocktree or routing options.
Key parameters are kept under control: the QoR at the end of one step must be comparable to the start of the next one. Any deviation can reveal a miscorrelation or 'mis-calibration' and needs to be understood. For instance, an area increase may be the cause of density and, further on, congestion issues. The main indicators under close monitoring are runtime, performance (timing), power and area, referenced as PPA.
Step                  Combi area step increase
prePlace              0.00%
prectsOpt             14.25%
prectsOptIncr         0.00%
clockTreeSynthesis    0.77%
postctsOpt            5.37%
route                 2.25%
postrouteOpt          0.99%
Table 1: Example of area variations highlighting the need for calibration
Weight on timing path   Incremental synthesis   prectsopt
slack (ns)              0.084                   0.0202
clock                   (0.0532), 8%            (0.0132), 2%
CP to Q                 9%                      10%
setup time              4%                      11%
Data                    67%                     74%
Total                   88%                     97%
Table 2: Example of timing difference between DCG and ICC prectsopt. Path delay contributions.
4.1.1. Project Schedule
Calibration guarantees that each step does not significantly degrade the QoR. Therefore most of the design weaknesses are spotted as early as possible. Considering that a full CPU core physical implementation takes around five days, better predictability allows us to shorten the design phase and to commit on Performance, Power and Area (PPA) very early. A major benefit of calibration is a productivity improvement.
4.1.2. PPA monitored with a dashboard
The QoR at all steps from synthesis to post-signoff is tracked in a dashboard. It allows us to monitor key parameters through each implementation step, to get a timing summary for each scenario, and to quickly compare two implementations considering all major indicators. The examples below show Vt cell usage, giving an indication of leakage (Figure 3), and the timing summary along the implementation steps (Table 3). Other examples illustrate monitoring of the area (Table 1) and of the path delay contributors (Table 2).
[Figure 3 content: VT1 to VT4 cell usage with their relative speed and leakage impact]
Figure 3: Vt usage KPI
Step                  TNS        WNS
prePlace              0          0
prectsOpt             -0.656     -0.02
prectsOptIncr         -0.656     -0.02
clockTreeSynthesis    -179.115   -0.141
postctsOpt            -4.036     -0.033
route                 -272.789   -0.166
postrouteOpt          -0.462     -0.015
Table 3: Timing results throughout the implementation steps
4.1.3. Two types of miscorrelation can be addressed by calibration
Since the timing models or timing engines differ between DC-SPG, ICC and PrimeTime, there is a need to correlate whenever the database is handed over from one tool to another.
Within the physical implementation tool, timing models are updated and become more and more precise as the design goes through all the steps. We shall see how calibration has been done between pre- and post-cts, and between pre- and post-route.
4.2. DCG vs ICC: setting alignment
In the CPU/GPU team's methodology, synthesis is physically aware using DCG. The database is passed to ICC in ASCII format with DEF and Verilog. The DDC format contains design and optimisation directives which may not all be under control; hence we have abandoned the use of DDC. In order to be in full control of the settings, we have chosen to explicitly align all variables in both tools. We particularly cared about settings impacting net delays: scaling factors and via resistance.
Other necessary means for a good correlation are the use of the same technokit, libraries, timing constraints, design attributes (amongst them dont_use and ideal_nets), and tool variables.
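As an illustration, the kind of alignment we apply can be sketched as below; the paths, cell patterns and design name are placeholders, not the actual project settings:

```tcl
# Sketch: same parasitic models and dont_use list in both DCG and ICC
# (paths and cell patterns below are placeholders)
set TECH_DIR /path/to/technokit
set design   cpu_core
set_tlu_plus_files -max_tluplus  $TECH_DIR/max.tluplus \
                   -min_tluplus  $TECH_DIR/min.tluplus \
                   -tech2itf_map $TECH_DIR/tech2itf.map
foreach cell {HVT_DLY* WEAK_BUF*} {
    set_dont_use [get_lib_cells */$cell]   ;# identical list in both tools
}
# Hand off from DCG in ASCII form instead of DDC
# (plus the DEF from physical synthesis)
write_verilog ${design}.v
```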
4.3. Pre-cts vs post-cts correlation
The timing margin on a path depends on the data delay as well as the clock delay. In pre-cts steps, clock latencies and their variations must be modelled as accurately as possible as they impact all paths to sequential elements. The clock is budgeted by means of source or internal latencies and uncertainties. These are global constraints applied to the entire clock fanout. Therefore a small variation in the clock timing model may affect many timing paths in the design and change the global timing results (Figure 4). Increasing the uncertainty is equivalent to shifting the curve to the left: the violation count and TNS (red area) grow significantly with a shift of a few ps.
Figure 4: Slack distribution histogram. Moving margins has a big impact on violation count and TNS
4.3.1. A budget for each clock and every PVT (Process, Voltage, Temperature)
Pre-cts clock timing models account for the period, and for the latency with its variations induced by PLL jitter, OCV derating and skew. Each of these components can be optimised individually for each clock. For instance, OCV derating depends on the depth of the most divergent branches. It requires a good knowledge of what the clocktree structure and its latency will be once CTS is done.
Obviously, latencies depend on PVT corners, hence the clock budget must follow the same rules. All scenarios used in pre-cts have to be budgeted independently. A small miscorrelation in a single scenario may cause timing, area or leakage divergence. Ideally, no single mode should be excessively more constraining than the others.
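Such per-clock, per-scenario budgets are typically applied through latency and uncertainty constraints. A minimal sketch, with hypothetical scenario names, clock name and values:

```tcl
# Sketch: one clock budget per PVT scenario (names and numbers illustrative)
array set budget {ss_0.81v_125c 0.520 ff_1.21v_m40c 0.310}
array set uncert {ss_0.81v_125c 0.085 ff_1.21v_m40c 0.050}
foreach_in_collection scen [all_scenarios] {
    set s [get_attribute $scen name]
    current_scenario $s
    # expected post-CTS insertion delay for this corner
    set_clock_latency -source -max $budget($s) [get_clocks CLK_CPU]
    # jitter + OCV derating + skew folded into one figure
    set_clock_uncertainty -setup $uncert($s) [get_clocks CLK_CPU]
}
```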
4.3.2. Clocktree exceptions
Most sequential cells of the design are aligned deep in the clocktree. Some elements such as clock gates may need a deskew, because they are structurally on the path from the port to the leaves. Their budgeted pre-cts latency needs to be set according to their depth. The parameters to consider when assessing the depth are the structural location within the clocktree and the fanout. A clock gate inserted by Power Compiler is generally deep in the clocktree and drives a few registers, whereas a functional clock gate with a fanout of nearly the entire design is likely to be early in the tree. They need different, specific deskew budgets. With our methodology, the pre-cts budget for deskewed elements is fully automated and takes into account the structural location of the cell within the clocktree.
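The automated budget essentially boils down to annotating a depth-dependent pre-cts latency on the clock pins of such gates; a hand-written equivalent, with hypothetical instance names and values, could look like:

```tcl
# Sketch: pre-cts latency annotated on clock-gate clock pins according to
# their structural depth (instance names and values are hypothetical)
set root_latency 0.450
# A Power Compiler clock gate sits deep in the tree: most of the insertion
# delay is already consumed at its clock pin
set_clock_latency [expr {$root_latency * 0.80}] [get_pins u_cpu/u_icg_deep/CK]
# A functional clock gate driving nearly the whole design is early: only a
# small fraction of the latency is ahead of its pin
set_clock_latency [expr {$root_latency * 0.15}] [get_pins u_cpu/u_icg_func/CK]
```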
4.3.3. Empirical calibration
Calibration is a trade-off between a tight global constraint that would allow clean timing at signoff at the cost of some overdesign, and a rather slack constraint which may leave a few violations but does not degrade other parameters such as density or leakage. With too much conservatism induced by the clock budget, the placement may not converge, and pre-cts margins need to be relaxed. Similarly, if pre-cts clock models are too loose, placement may give good results but timing violations may appear post-cts. In that case, clock uncertainties are adjusted based on the pre-cts as well as the post-cts optimisation results; the clock budget is empirically fine-tuned.
Clock calibration involves clocktree synthesis, but also the placement and post-cts steps. A few implementation loops may be necessary to assess the global results. The global benefit is a better convergence.
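In practice, the fine-tuning is a simple feedback on the uncertainty; a minimal sketch, with illustrative numbers only:

```tcl
# Sketch: tighten next run's pre-cts uncertainty by the pre/post-cts WNS gap
set prects_wns  -0.020   ;# WNS seen at prectsOpt   (illustrative)
set postcts_wns -0.055   ;# WNS seen at postctsOpt  (illustrative)
set base_uncert  0.080   ;# current pre-cts setup uncertainty
set delta [expr {$prects_wns - $postcts_wns}]   ;# 0.035 ns of optimism
set_clock_uncertainty -setup [expr {$base_uncert + $delta}] [get_clocks CLK_CPU]
```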
4.4. Pre-route vs post-route on data path
Timing paths are made of cell delays and net delays. Depending on the topology, the net length and the process, a timing path may be more sensitive to net resistance or to net capacitance. In our 28nm FDSOI 10ML technology, upper layers are thicker and wider than lower ones, resulting in differences in RC parasitics. Figure 9 shows the metal layer sizes. Figure 5 illustrates R and C sensitive paths.
Figure 5: C sensitive path (yellow), R sensitive path (orange)
Layer assignment and routing may have a big impact on timing. There is a paramount need to control the layer assignment all along the implementation. Assumptions made by the tool in pre-route stages have to be persistent and consistent until the design is fully routed.
The Synopsys design flow using DCG and ICC is optimized to guarantee the best correlation on WNS timing, area (after place_opt in ICC) and congestion. With the same global route engine, routability can be anticipated in DCG. Therefore parasitics and net delays are accurately predicted from physical synthesis, which has a positive impact on design convergence.
In synthesis and pre-cts, the tools make assumptions on the layer assignment for each net. They may use fat metal layers for long, timing-critical data paths. However, clock nets are privileged over data paths for the less resistive metal layers, because they need to be balanced with as small a latency and skew as possible. With dedicated tracks for clocks not being anticipated in pre-route steps, the resources available for data paths would be much less than what has been assumed. ICC would have to revert back to lower, more resistive layers for data nets, and a QoR degradation may be observed after routing.
To avoid this pitfall, we have deliberately chosen to forbid fat metal layer usage in all steps before CTS. After clock routing, the remaining resources on the upmost metal layers can be used again.
As a result, delay calculation on long data paths is a bit conservative during placement. This makes the placer work harder. We can afford to leave a few small violations, expecting they will easily be recovered post-route.
In conjunction with relevant implementation corners, the CPU/GPU team could achieve a good timing convergence.
We could further improve layer predictability in pre-cts if we could assess the routing resources taken by clocktrees. We would then allow partial use of B1 and B2 from synthesis. The shared groute engine guarantees DCG and ICC make exactly the same layer assignment; therefore the timing would be accurately predicted.
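One way to express this restriction in ICC is to keep optimization off the fat layers until the clocks are routed. The sketch below assumes IA/IB are the fat layers to protect and B2 is the highest layer left to data nets pre-CTS; it is an illustration of the idea, not our exact script:

```tcl
# Sketch: forbid fat layers for data nets until the clocks have their tracks
set_ignored_layers -max_routing_layer B2   ;# data optimization stays below IA/IB
place_opt -effort high
clock_opt -only_cts
route_group -all_clock_nets                ;# clocks take their resources first
set_ignored_layers -max_routing_layer IB   ;# release IA/IB for data nets
route_opt
```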
4.5. ICC vs PrimeTime correlation
4.5.1. Crosstalk effect
Crosstalk is the undesirable electrical interaction between two or more physically adjacent nets due to capacitive coupling. The two major effects of crosstalk are crosstalk-induced delay and static noise.
IC Compiler uses crosstalk prevention techniques during track assignment. During timing- and crosstalk-driven track assignment, the tool minimizes crosstalk effects by assigning long, parallel nets to nonadjacent tracks. It runs a simplified noise analysis to make sure the noise level from aggressor nets is minimized. After detail routing, the tool performs crosstalk-induced noise and delay analysis by calculating the coupling capacitance effects to identify any remaining violations. It fixes these violations during the post-route optimization phases.
ST's physical design flow kit allows the user to control the different thresholds for crosstalk prevention and fixing using the "set_si_options" command:
set_si_options \
-delta_delay true \
-max_transition_mode normal_slew \
-min_delta_delay true \
-route_xtalk_prevention true \
-route_xtalk_prevention_threshold $::STM::TECH::siThreshold(xtalk_prevention) \
-static_noise true \
-static_noise_threshold_above_low $::STM::TECH::siThreshold(static_noise) \
-static_noise_threshold_below_high $::STM::TECH::siThreshold(static_noise)
The lower the threshold voltage for crosstalk prevention, the harder the router tries to prevent crosstalk. With advanced nodes like 28nm FDSOI we are very sensitive to crosstalk; therefore the use of this feature has become mandatory.
We enable timing-driven global routing with "set_route_zrt_global_options -timing_driven true" and timing-driven track assignment with "set_route_zrt_track_options -timing_driven true"; however, we do not allow crosstalk prevention during global routing, with "set_route_zrt_global_options -crosstalk_driven false".
During post-route optimization, the "route_opt" command performs crosstalk optimizations based on the signal integrity options set with "set_si_options", for both setup and hold, by setting the "-delta_delay" and "-min_delta_delay" options to true.
The settings we are using today in our ST 28nm FDSOI process node for crosstalk prevention show a good convergence, as no crosstalk violation is reported at the first signoff.
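Gathering the routing options quoted above in one place, the setup reads:

```tcl
# Timing-driven global route and track assignment; crosstalk prevention is
# deferred to track assignment (thresholds come from set_si_options above)
set_route_zrt_global_options -timing_driven true -crosstalk_driven false
set_route_zrt_track_options  -timing_driven true
route_opt   ;# post-route optimization honors the set_si_options thresholds
```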
4.5.2. Delay calculation in ICC and PrimeTime using Graph Based Analysis and Path Based Analysis
The "graph" is the term used for the timing database. It is first created when the netlist is read in and the design is linked. During a timing update, the timing database is populated with timing values from delay calculation. The graph represents the entire design: all possible timing paths are contained within it. Ports and pins in the design become the nodes of the graph, and the timing arcs become the connections between the nodes. The graph stores both min and max timing values for all timing arcs in the design, along with other information.
Delay calculation is performed as the edges propagate in a forward direction across the logic. If we were computing the timing on a chain of buffers, we would simply feed the output slew from each stage into the next stage, performing delay calculation and storing the results on the graph as we propagate along.
However, when two slews arrive at the same point on the graph, the timing engine must choose one of them to propagate forward so it can continue delay calculation for the downstream logic. To ensure the min/max graph values always bound the fastest and slowest possible timing, the worst slew must be chosen and propagated forward: the fastest (numerically smallest) slew for min delays and the slowest (numerically largest) slew for max delays. This is the default, graph-based mode.
Path-based analysis mode allows the user to pull any timing path off the graph and perform a more accurate analysis of that specific path by propagating the exact slew along the path under analysis.
A major difference between IC Compiler and PrimeTime is graph-based versus path-based analysis. Full path-based optimization is not feasible in a physical implementation tool because of the analysis effort required, which would lead to excessive runtimes.
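In PrimeTime the two modes can be compared directly on the same worst path, for example:

```tcl
# Sketch: report the same worst path with graph-based and path-based slews
report_timing -delay_type max -nworst 1                  ;# GBA (default)
report_timing -delay_type max -nworst 1 -pba_mode path   ;# PBA recomputation
```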
We have established a statistical estimate of the divergence between signoff PBA and implementation GBA timing. An adjustment has been made in the implementation constraints through the uncertainty. Refined corrections have been applied for all implementation scenarios until the timing divergence is reduced to a minimum, while the violation count in PrimeTime still allows timing to be met after ECO.
We have seen a good QoR in setup, hold and leakage out of ICC. Hold and leakage could be further optimised after the signoff ECO.
5. DRC and physical convergence
Our team has faced various challenges with physical convergence; three of them are developed in this section. The first one is intrinsic congestion in a particular hierarchy of the design. Secondly, we shall see how we have overcome a routability issue related to the power grid structure. For the third challenge, related to global DFM convergence with physical signoff, we shall describe how we have improved our implementation flow to clean up physical violators.
5.1. Dealing with design related congestion
Some designs, or parts of them, are prone to congestion due to their high level of interconnection. Aggressive timing constraints sometimes tend to force the placement engine to stack the cells closely in order to reduce net delays. As a consequence, smaller cells are used by the placer and the instance count per area increases. The design can end up with a high net density, and local congestion hotspots can appear (see Figure 7). We have faced such a problem in an identified hierarchy of our cores (see Figure 6). The highlighted hierarchy was really sensitive to routing congestion during pre-cts steps, and it worsened after successive post-cts optimizations as the design was filled with buffers inserted for long nets or crosstalk.
Figure 6: Congestion prone hierarchy (red) and DRC (white)
ICC has no pin-density-aware placer and cannot prevent this type of issue. However, ICC has a feature to constrain the cell density, with a beneficial effect on congestion. It has the flexibility to set a density threshold on a particular area with the command:
create_placement_blockage -type partial -blocked_percentage 40
Applying it on a specific hierarchy of the A53 CPU core resulted in spreading the cells and decongesting this area. In some cases it has also helped the timing optimizer to get better results, since the tool can find more room to legalize cells (see Figure 8).
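In our scripts the blockage is bound to the congested hierarchy's bounding box; a sketch with hypothetical coordinates and name:

```tcl
# Sketch: leave 40% of the rows unusable over the congestion-prone hierarchy
# (coordinates and blockage name are illustrative)
create_placement_blockage -type partial -blocked_percentage 40 \
    -bbox {{120.0 80.0} {260.0 190.0}} -name pd_cong_hier
```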
Figure 7: Congestion map without partial density blockage on congestion prone hierarchy
Figure 8: Congestion map with partial density blockage on congestion prone hierarchy
5.2. Resolving DRC problems under power stripe with PNET feature
5.2.1. Power grid structure and routing resources
The core's power grid consists of two dedicated fat metal layers, IA/IB, and strengthening stripes on M3, before tapping down to the standard cell rails on M2 (see Figure 9: Layer structure from IA/IB tapping down to M1 with via stack, and Figure 10: Power grid structure).
The power grid structure is generally sized to minimize voltage drop without impacting routability too much in any direction. Power stripes and via stacks may sometimes act as routing obstructions.
During the implementation of our IPs we faced DRC errors under these M3 stripes. Many standard cells placed under this M3 grid have to be connected with limited routing resources: M3 blocks all vertical access, and considering that M1 cannot be used due to the standard cell pins, there is only one metal layer available, M2, hence a few routing tracks in a single direction.
Figure 9: Layer structure from IA/IB tapping down to M1 with via stack
Figure 10: Power grid structure
Without any specific action on pin access under the M3 stripes, a typical design ends up with unfixable DRCs. We have noticed the issue is more likely to happen when small cells are used, and more precisely when the pin density is higher.
ICC placement is not pin-density aware and cannot foresee later routing congestion. Therefore, nothing prevents the placer from stacking cells with limited access to their pins.
Figure 11: Routed signals DRC under M3 power stripes
Using placement blockages under the M3 power stripes would clear off the issue, but it results in a loss of placeable area; we have not chosen this solution.
5.2.2. Synopsys pnet feature
Synopsys has implemented a feature that allows defining placement rules under metal stripes. It is invoked during every legalization. We used this feature to restrict the usage of selected cells under M3, whenever they are small cells or prone to create DRCs.
5.3. Results
With the following options, the CPU cores got rid of all DRCs related to accessibility under the M3 stripes:
set_pnet_options -partial M3
set_app_var legalizer_avoid_pin_under_pnet_lib_cells $forbiddenCells
set_app_var legalizer_avoid_pin_under_pnet_layers M3
set legalizer_avoid_pin_under_pnet_min_width 0.500
5.4. DRC convergence with signoff
Considering the complexity of 28nm physical rules, running DRC and DFM checks late in signoff may cause a lot of rerouting and may have side effects on timing or crosstalk.
In implementation, the search-and-repair engine fixes DRC errors based on a set of rules specifically coded for ICC. Several iterations of DRC checks and fixes are executed after every post-route optimization, and the bulk of the DRC violations are fixed.
Despite the continuous effort of P&R tools to support the rule requirements of different foundries, we see a few cases where ICC Zroute misses DRC violations reported by the signoff tool. There are many reasons for this miscorrelation. DRCs involving shapes not present in the abstract cannot be spotted nor fixed by ICC. Also, specific rules with a very low occurrence in the design have deliberately not been coded in the ICC runset because they would severely penalize the runtime. Complex rules can be coded in different ways depending on the DRC engine, so there are divergences between ICC search-and-repair and the third-party signoff tool. For instance, geometries may be reshaped to reflect process variations: pieces of metal may be shrunk or extended prior to rule checking. In Figure 12 for instance, the via enclosure height is 0.095um in ICC; during signoff this piece of metal is extended to 0.100um, causing a DRC rule to fail.
Figure 12: Via enclosure height is 0.095um in ICC
The solution to improve the convergence between ICC and signoff DRC checks is to use IC Validator (ICV). The tool is dynamically invoked during the P&R flow to complement the use of the ICC runset, with a particular focus on a selection of rules identified as weaknesses of the ICC router. It keeps a low runtime and may be invoked many times during post-route optimizations. This has become a regular methodology and has been put in our execution flow (Figure 13: Built-in ICV flow).
Figure 13: Built-in ICV flow [1]
The benefit of this flow is to identify DRC violations based on the foundry signoff DRC runset and to target an automatic DRC fixing. The ability to select only a subset of rule checks for ICV decreases the runtime impact. The ICV feature has been activated post-route and post-signoff. Finally, the number of iterations between place-and-route and signoff verification has been limited to only one loop.
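The in-design flow of [1] is driven from ICC itself; a heavily simplified sketch, where the runset path is hypothetical:

```tcl
# Sketch: invoke ICV from ICC with the foundry signoff runset, focused on the
# rules known to miscorrelate with the ICC router
set_physical_signoff_options -exec_cmd icv \
    -drc_runset $FOUNDRY_KIT/drc_signoff.rs
signoff_drc           ;# in-design DRC with the signoff engine
signoff_autofix_drc   ;# automatic repair of the reported violations
```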
6. Conclusions
Calibration has been done at all implementation stages, and correlation for every tool handover. We have seen timing continuity, area stability and power predictability. Physical convergence pitfalls, whether design related or due to rule set definitions, could be avoided thanks to various ICC features. All these techniques can be re-used for similar designs if the constraints are modified (target frequency, scenarios) or if the technology node changes (14nm, layer, lithography or DRC rules).
With good correlation throughout the flow, ICC optimizations focus on real violations: no overdesign is done and the runtime is cut down. DFM indicators get good scores, guaranteeing manufacturing and yield. A design out of ICC has a number of violations left that can be handled within an acceptable timescale.
All these predictable results allowed us to give an early commitment on PPA.
7. References
[1] In-Design Automatic DRC Repair Flow Using IC Compiler and IC Validator, Stephane Pautou (STMicroelectronics, Crolles, France), Alain Boyer (Synopsys, Montbonnot, France), SNUG 2014
[2] Concurrent Top and Blocks Level Implementation of a High Performance Graphics Core Using One-Pass Timing Closure in Synopsys IC Compiler, Corine Pulvermuller, Julien Guillemain (STMicroelectronics, Grenoble, France)