ASIC Design Methodology

Divya Macharla, 200501148
Mr. Radhakrishnan Pasirajan, Engineering Manager, Open-Silicon
Prof. Rahul Dubey, DA-IICT
Evaluation Committee No. 4
Abstract - This project covers the design of a High Definition
Audio Controller that follows the Intel High Definition Audio
Specification 1.0 and also supports codecs that are
incompatible with that specification. The design flow includes
Logic Synthesis of the Register Transfer Level (RTL)
description (mapping to a target technology along with
automated Power Optimization techniques like Clock Gating
and Scan insertion for Testability), Formal Verification,
Pre-Layout Timing Analysis (using cell delays and wire load
models) for initial understanding of the design’s post-layout
performance, Floorplanning (including Power Planning,
Input/Output planning and Intellectual Property
Placement), Placement of the design according to the
floorplanning (along with timing and area optimizations
using techniques like buffering, sizing, trimming, pin
swapping, cloning and scan chain reordering), Clock Tree
Synthesis, Routing, Post-Layout netlist verification
(Regression Analysis and Formal Verification), Post-Layout
Timing Analysis, Reliability Analysis (Voltage Drop,
Crosstalk and Electromigration) and Layout Verification
(Design Rule and Layout versus Schematic Checks).
Index Terms - High Definition Audio, Synthesis, Static Timing
Analysis, Formal Verification, On-Chip Variation
Introduction
Audio in personal computers (PCs) has evolved from producing
beep sounds to competing with home theater systems. PC audio
started with beep sounds; then came sound cards, in which
digital signal processing (DSP) was performed in the codec. The
codec was interfaced with the I/O Controller Hub (ICH), and the
ICH was attached to the PC through the Peripheral Component
Interconnect (PCI) bus. AC97 allowed DSP operations to use the
CPU's computation power, leaving the codec to perform
analog-to-digital and digital-to-analog conversions. AC97
required a digital controller in the ICH to send and receive
processed digital audio samples. [1]
The problem with AC97 is that it cannot support the higher
sampling rates, higher bit resolutions and multiple channels
that help produce better quality sound. Because of these
drawbacks, AC97 has been replaced by Intel High Definition (HD)
Audio. HD Audio supports up to 8 channels at 192 kHz/32-bit
quality, whereas AC97 supports 6 channels at 48 kHz/20-bit [2].
In addition, HD Audio provides dedicated system bandwidth for
critical audio functions. High quality audio, combined with a
PC that can be operated by remote control to provide music,
video and telephone service with surround sound, not only in
one room but throughout the house and at a reasonably low cost,
makes the PC a viable choice for a home audio system [1] [4].
The HD Audio controller (HDAC) replaces the AC97 digital
controller in the ICH. In the HDAC, there is a dedicated DMA
engine associated with each audio stream. The functions of the
DMA engine are:
1. Commands and audio stream data are transferred from
memory to the respective codecs via the Stream Data Out
(SDO) lines between the controller and the codecs,
following the HDA Link protocol [3].
2. Codec Responses to the HDAC commands, codec requests
and audio stream data are transferred from codecs to memory
via the point to point Stream Data In (SDI) lines from the
respective codec to the controller following the HDA Link
protocol [3].
To achieve the above functions, the DMA engine uses the
information given in the controller registers to buffer the
frame data sent from memory to the appropriate audio codecs,
and to buffer the frame data sent from the codecs to the
appropriate memory location.
The HDA controller and codec architecture and the HDA link
protocol [3] form the foundation for the advanced features
provided by HDA. The limitation of the HDA architecture is that
only codecs with an interface compatible with the HDA link,
commands and responses can be used, which restricts the number
of codecs that can be used with HDA. To overcome this
limitation, additional logic needs to be implemented in the HDA
controller so that an HDA-incompatible codec can still use the
HDA features [5]. The project is about the design of the
modified HDA controller chip to interface with both
HDA-compatible and HDA-incompatible codecs.
Design Methodology
Logic Synthesis
The RTL (Register Transfer Level) description, along with the
Intellectual Property (IP) used in the chip design, is
synthesized using Synopsys Design Compiler to obtain a
gate-level netlist that is optimized for the given design
constraints. Synthesis starts with setting up the environment.
The environment information includes the technology library to
be used for synthesis, the libraries that provide macros and
pads, and the values of tool-specific variables. First, the RTL
is mapped to GTECH, the Synopsys technology-independent generic
library. Then the tool maps the design to the vendor-specific
set of technology-specific cells; it also uses the Synopsys
DesignWare architectural libraries of technology-independent
functions such as adders, multipliers, comparators and FIFOs.
The compiler uses the information in the library files to
optimize the design and to provide timing information.
There are two types of technology libraries: Logic and
Physical. The Logic Library contains the following information
required for gate-level synthesis and optimization: the type of
technology, default values that apply to the entire library,
environment attributes such as scaling factors for derating,
operating conditions across the various corners (best, typical
and worst), timing and power ranges across the corners, wire
load models, and models of Process, Voltage and Temperature
(PVT) variations. The library also describes, for each cell,
its function, timing, area, pin information, maximum allowed
capacitance and fanout, and other design rule checking
attributes; this information guides the synthesis. In very deep
submicron (VDSM) technologies, a non-linear delay model is
used: timing is characterized over a range of input slew rates
and output load capacitances, which gives the cell delay for a
particular slew rate and output load capacitance. The Physical
Library contains process-specific information such as the
number of layers, pitch, resistance and capacitance. It also
contains the physical characteristics of the cells, such as
physical dimensions, layer information, cell orientation and
other information required for physical placement and routing.
The performance of the synthesized design depends on the
library used, so it is necessary to use a good library with
some of the characteristics given in [8]. In addition to the
generic libraries, there are libraries such as LVT (Low VT),
RVT (Regular VT) and HVT (High VT), whose cells use transistors
with different threshold voltages (VT) to trade off speed
against leakage power. Clock gating is another technique for
reducing power; it can be inserted at the RTL level and
synthesized.
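The non-linear delay model described above can be pictured as a two-dimensional lookup table indexed by input slew and output load. The following is a minimal sketch in Python, assuming bilinear interpolation between the four surrounding table entries; the function name, the units and all table values are made up for illustration and do not come from any real library.

```python
from bisect import bisect_right


def interpolate_delay(slews, loads, table, slew, load):
    """Bilinear interpolation in a 2-D delay table.

    slews, loads: ascending index values (assumed units: ns and pF).
    table[i][j]:  cell delay characterized at slews[i] and loads[j].
    """
    def lower_index(axis, x):
        # Lower grid index of the cell that contains x, clamped so that
        # out-of-range values extrapolate from the nearest cell.
        i = bisect_right(axis, x) - 1
        return max(0, min(i, len(axis) - 2))

    i = lower_index(slews, slew)
    j = lower_index(loads, load)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tl = (load - loads[j]) / (loads[j + 1] - loads[j])

    # Interpolate along the load axis first, then along the slew axis.
    low = table[i][j] + (table[i][j + 1] - table[i][j]) * tl
    high = table[i + 1][j] + (table[i + 1][j + 1] - table[i + 1][j]) * tl
    return low + (high - low) * ts


# Hypothetical characterization data for one timing arc of one cell.
SLEWS = [0.1, 0.5, 1.0]
LOADS = [0.01, 0.05, 0.10]
DELAYS = [
    [0.10, 0.20, 0.30],
    [0.15, 0.25, 0.35],
    [0.20, 0.30, 0.40],
]

if __name__ == "__main__":
    print(interpolate_delay(SLEWS, LOADS, DELAYS, slew=0.3, load=0.03))
```

A query that falls between characterized points, such as a slew of 0.3 and a load of 0.03 here, returns a delay between the neighbouring table entries.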
We get better results from synthesis by partitioning a
design into smaller functional blocks [6] [8] [9]. Some of the
strategies that are helpful in partitioning a design are:
1. Keeping related combinational logic in a single block
2. Registering inputs and outputs of major blocks
3. Registering the outputs of sub blocks
4. Separating blocks with different design requirements such as
optimized area and optimized speed
5. Partitioning by compile technique
6. Keeping sharable resources together
7. Keeping the user-defined resources with the logic they drive
An efficient HDL coding style that considers the hardware
implications of the code results in smaller and faster
synthesized logic. Some Verilog coding guidelines for
successful synthesis are given in [11] [6] [8] [9].
Synthesis Constraints
The operating environment, design rule constraints and design
optimization constraints should be specified before a design
can be mapped to a gate-level netlist. [6]
1. The operating environment should be specified before the
design can be optimized. It comprises the operating conditions,
such as PVT, which affect the performance of the design; the
wire load models, which include coefficients for area,
capacitance and resistance per unit length and a
fanout-to-length table for estimating net lengths, and which
are used to estimate the effect of fanout and wire length on
the resistance, capacitance and area of nets; and the system
interface characteristics, such as drive characteristics for
input ports, loads on the I/O ports and fanout loads on output
ports. In a hierarchical design, the wire load model for nets
that cross hierarchical boundaries must also be specified.
2. Design rule constraints such as transition time, fanout
load, capacitance and cell degradation may be the implicit
constraints specified in the technology library, additional
user-specified constraints, or both. They ensure proper
functioning of the fabricated circuit.
3. Design optimization constraints that are usually specified
are timing, area and power constraints. Synthesis tries to
optimize the design based on these constraints without
violating any design rule constraints. Timing constraints
include the clock period, waveform, clock latency, clock skew,
I/O constraints with respect to the clocks, combinational delay
requirements and timing exceptions such as false paths and
multicycle paths. The area constraint is set by specifying the
maximum area of the design. When both area and timing
constraints are specified, Design Compiler tries to meet the
timing constraints before the area constraint. To prioritize
the area constraint over total negative slack, specify the area
constraint with the option that ignores total negative slack.
Design Compiler also allows priorities to be set among the
constraints if required.
After specifying the constraints, we need to make sure the
design is consistent before synthesizing it to a gate-level
netlist.
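The way a wire load model turns fanout into estimated net parasitics can be sketched as follows. This is a minimal illustration in Python, not the format or algorithm of any particular tool: the class name, the field names, the extrapolation slope and all numeric values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WireLoadModel:
    resistance: float     # resistance per unit length (assumed units)
    capacitance: float    # capacitance per unit length (assumed units)
    area: float           # area per unit length (assumed units)
    fanout_length: dict   # fanout -> estimated net length
    slope: float          # assumed extra length per fanout beyond the table

    def net_length(self, fanout):
        max_fanout = max(self.fanout_length)
        if fanout <= max_fanout:
            # The table is assumed to list every fanout up to max_fanout.
            return self.fanout_length[fanout]
        # Beyond the table, extrapolate linearly from the last entry.
        extra = fanout - max_fanout
        return self.fanout_length[max_fanout] + self.slope * extra

    def estimate(self, fanout):
        length = self.net_length(fanout)
        return {
            "length": length,
            "resistance": length * self.resistance,
            "capacitance": length * self.capacitance,
            "area": length * self.area,
        }


if __name__ == "__main__":
    # Hypothetical model; the numbers do not come from a real library.
    wlm = WireLoadModel(
        resistance=0.001,
        capacitance=0.0002,
        area=0.01,
        fanout_length={1: 10.0, 2: 25.0, 3: 40.0},
        slope=15.0,
    )
    for fanout in (1, 3, 5):
        print(fanout, wlm.estimate(fanout))
```

Because the estimate depends only on fanout, two nets with the same fanout always receive the same parasitics, which is why such models are replaced by extracted values after layout.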
Logic Synthesis Optimization [6]
The optimization step of synthesis ensures that the design is
mapped to an optimal combination of cells in the target
technology library so that the functional requirements and
constraints are met. Design Compiler performs architectural,
logic-level and gate-level optimization, and register retiming.
Architectural optimization is performed on the HDL code through
tasks such as sharing common subexpressions, sharing resources,
selecting DesignWare components, reordering operators, and
identifying arithmetic expressions for data-path synthesis so
that advanced arithmetic optimization can be applied.
Logic-level optimization operates on the GTECH netlist and
involves structuring and flattening. Structuring finds
subfunctions that can be factored out and evaluates these
factors based on their size and on the number of times they
appear in the design. Design Compiler converts the subfunctions
that reduce the logic equations the most into intermediate
variables and uses them in place of the subfunctions, which
reduces area. Flattening involves two-level optimization, which
minimizes the number of minterms, and multi-level optimization,
which minimizes the number of literals, resulting in speed
optimization. Gate-level optimization produces a
technology-specific netlist by mapping the cells of the generic
netlist to library cells chosen so that the specified
constraints are met. Register retiming is a sequential
optimization technique that moves registers through the
combinational logic of a design in order to optimize area and
timing. The optimal location of registers cannot easily be
decided and coded at the RTL level. Register retiming can
equalize the combinational delays of the stages, which is
useful when some stages exceed the timing constraints while
others have margin to spare. When no path exceeds the timing
constraints, the technique can be used to reduce the number of
registers. Retiming is particularly useful in pipelined
designs. Register retiming is performed on the mapped
gate-level netlist.
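The effect of register retiming on stage delays can be illustrated on a simple chain of gates. The sketch below, in Python, assumes a purely linear pipeline and tries every position for the internal registers by brute force; real retiming works on general netlists and must preserve sequential behaviour, so this only shows the balancing idea. The function names and gate delays are hypothetical.

```python
from itertools import combinations


def stage_delays(gate_delays, cuts):
    """Combinational delay of each stage when registers sit at `cuts`.

    A cut at position k places a register between gate k-1 and gate k.
    """
    bounds = [0, *cuts, len(gate_delays)]
    return [sum(gate_delays[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_register_positions(gate_delays, num_registers):
    """Exhaustively pick the cuts that minimize the slowest stage."""
    positions = range(1, len(gate_delays))
    return min(
        combinations(positions, num_registers),
        key=lambda cuts: max(stage_delays(gate_delays, cuts)),
    )


if __name__ == "__main__":
    gates = [3, 3, 1, 1]   # hypothetical gate delays along the chain
    original = (3,)        # register as originally coded in the RTL
    retimed = best_register_positions(gates, num_registers=1)
    print("original stages:", stage_delays(gates, original))
    print("retimed stages: ", stage_delays(gates, retimed))
```

In this example the slowest stage shrinks when the register moves toward the input, which corresponds to a shorter achievable clock period with the same number of registers.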
Logic Synthesis strategies [6]
Top-down, bottom-up and mixed synthesis strategies can be
used to synthesize hierarchical designs. Top-down compilation
automatically takes care of inter-block constraints and
optimizes the slack allocation of the top design, but it is not
a useful strategy for large designs because it needs a large
amount of memory and long runtimes. Bottom-up compilation suits
medium and large hierarchical designs and requires less memory
and runtime. In bottom-up compilation, the sub-designs are
synthesized first, then they are combined in the top-level
design, and finally the top-level constraints are used to check
for violations in the top-level design. Applying only top-level
constraints in this way might not capture inter-block
constraints accurately. To capture them accurately, first read
the top design, map it to the GTECH library, and link it to the
library cells and sub-designs that it references; then read the
top-level constraints and derive the constraints for the
sub-designs from them. The sub-designs are then synthesized
individually with their respective constraints and used in the
top design, and finally the top-level design is synthesized
with the top-level constraints. This strategy is also called
Design Budgeting. The mixed synthesis strategy uses both
approaches: top-down synthesis is applied to the small
hierarchies in the top design, and bottom-up synthesis is then
used to synthesize the top design from the already synthesized
small hierarchies.
Floorplanning [7]
Initially the gate-level netlist is linked to the physical
library. The core size, IP size, aspect ratio, area
utilization, type of core cell row structure and power ports
are set. Core power rings, I/O pads, bonding pads and filler
pads are generated and placed. IPs are oriented and placed, and
their blockages are established.
Placement [7]
The scan chain is disconnected so that it does not affect
register placement. After the cells are placed, the scan chains
are reconnected so that their loading effects are considered
during optimization. Initially, the cells in the netlist are
placed. High fanout nets and long wires can be buffered. The
cells are resized to match the drive strengths required for the
buffered nets. If a cell has a heavy load that cannot be driven
even after resizing and buffering, the cell is cloned to reduce
the load. Any logic redundancies are removed. Blast Fusion, the
physical synthesis tool, uses trimming and area buffering as
part of the area optimization of non-timing-critical paths.
Trimming assigns a delay budget to every cell in the design and
tries to bring the slack closer to zero. Area buffering is done
after trimming to fix any load violations that trimming might
have caused. Blast Fusion performs another optimization
technique, pin swapping, on symmetric logic in the worst timing
paths of the design: the nets connected to the pins of a
symmetric logic gate are swapped without changing the function
of the logic. The timing optimization techniques (buffering,
sizing, cloning and pin swapping) and the area optimization
techniques (trimming and area buffering) can be performed
multiple times in Blast Fusion to meet the timing and area
goals of the design. During buffering, Blast Fusion sometimes
also considers restructuring that improves timing (for example,
it converts an AND into a NAND followed by an inverter only if
that improves timing). After cell placement and optimization,
the scan chain is reordered by connecting the Q pin of each
scan flip-flop to the Scan-In pin of a nearby scan flip-flop.
This reduces routing congestion but can introduce hold
violations, which are fixed by default by inserting buffers.
(When scan is enabled during logic synthesis, flip-flops are
replaced by scan flip-flops.)
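The benefit of scan chain reordering after placement can be sketched with a greedy nearest-neighbour ordering. This Python example is an illustration only and is not the algorithm used by Blast Fusion; the flip-flop names, the coordinates and the use of Manhattan distance as a wire length estimate are assumptions.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) placement locations."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def reorder_scan_chain(flops, start):
    """Greedy reordering: always stitch to the closest unvisited flop.

    flops: dict mapping flip-flop name -> (x, y) location.
    start: name of the flop driven by the scan input port.
    """
    order = [start]
    remaining = set(flops) - {start}
    while remaining:
        here = flops[order[-1]]
        # sorted() makes the choice deterministic when distances tie.
        nearest = min(sorted(remaining),
                      key=lambda name: manhattan(here, flops[name]))
        order.append(nearest)
        remaining.remove(nearest)
    return order


def chain_length(flops, order):
    """Estimated scan wiring: sum of Q-to-Scan-In hop distances."""
    return sum(manhattan(flops[a], flops[b])
               for a, b in zip(order, order[1:]))


if __name__ == "__main__":
    # Hypothetical placement of four scan flip-flops.
    placed = {"ff0": (0, 0), "ff1": (10, 0), "ff2": (1, 0), "ff3": (9, 0)}
    synthesis_order = ["ff0", "ff1", "ff2", "ff3"]
    new_order = reorder_scan_chain(placed, "ff0")
    print(synthesis_order, chain_length(placed, synthesis_order))
    print(new_order, chain_length(placed, new_order))
```

The shorter hops that reduce congestion also shorten the delay between consecutive scan flip-flops, which is why hold violations can appear after reordering.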
Clock Tree Synthesis [7]
Clock logic is inserted and optimized for minimum clock
insertion delay and skew (or sometimes for useful skew, which
helps meet setup time requirements) while meeting the maximum
skew, power and design constraints with the minimum number of
clock tree buffers. Non-default routing and shielding rules
from the initial high fanout clock net are propagated to all
the branches of the new clock tree. With actual clock skew and
latency information, hold violations can be fixed. Cells are
resized to meet the new clock timing and to drive the loads
that were inserted to fix hold violations. Pin swapping can
also be used to meet timing. Buffering, cell resizing and pin
swapping should be done without affecting signal integrity.
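The two quantities that clock tree synthesis tries to minimize can be computed directly from the clock arrival time at each sink. The Python sketch below assumes that insertion delay is reported as the largest root-to-sink latency and skew as the difference between the largest and smallest latency; the sink names and latency values are hypothetical.

```python
def clock_tree_metrics(sink_latency):
    """Insertion delay and skew from per-sink clock latencies.

    sink_latency: dict mapping sink name -> latency from the clock root.
    """
    latencies = list(sink_latency.values())
    return {
        "insertion_delay": max(latencies),
        "skew": max(latencies) - min(latencies),
    }


def meets_max_skew(sink_latency, max_skew):
    """True if the tree satisfies the maximum skew constraint."""
    return clock_tree_metrics(sink_latency)["skew"] <= max_skew


if __name__ == "__main__":
    # Hypothetical latencies (ns) after clock tree synthesis.
    latency = {"ff_a/CK": 1.20, "ff_b/CK": 1.35, "ff_c/CK": 1.28}
    print(clock_tree_metrics(latency))
    print(meets_max_skew(latency, max_skew=0.20))
```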
Routing [7]
All signal nets, clock nets, power nets and shielding nets
are routed, and additional wire spacing is applied if required.
Buffering, cell resizing, and adjustment of placement and
routing can be done to fix any slew and noise violations after
routing. Any antenna violations are fixed using diodes. The
post-layout netlist is analysed using regression analysis and
is also formally verified to ensure that the functionality of
the routed netlist has not changed. IR drop, crosstalk and
electromigration analyses are performed and any problems found
are fixed. Design Rule Checks (DRCs) and Layout versus
Schematic (LVS) checks are done after routing. Cell delays and
interconnect parasitic delays are extracted and used for
accurate timing analysis in place of the wire load models that
were used in pre-layout timing analysis.
Formal Verification [6]
After the design is synthesized, the gate-level netlist is
verified against the RTL using Formal Verification (FV)
techniques. The tool used for this is Formality. FV shows
whether two designs or technology libraries (which can be
Verilog files, Synopsys design database format files or SPICE
netlist files) are functionally equivalent. Formal Verification
is an alternative to simulation or regression testing. FV
reduces verification time because it does not require input
test vectors to verify the design. FV requires a golden design
that is functionally correct; it is used to prove or disprove
the functional equivalence of a modified version of the golden
design. Formality verifies the optimization steps and other
transformations performed during RTL synthesis, clock gating
insertion, clock tree synthesis, and placement and routing,
including scan chain insertion and reordering and the addition
of I/O pads, and also after design fixes. Formality works on
compare points (which can be primary outputs, sequential
elements, input pins of black boxes, or nets driven by multiple
drivers): it compares the logic cone of each compare point in
the implemented or modified design against the logic cone of
the matching compare point in the golden design, in order to
verify the modified design against the golden design.
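The idea of comparing the logic cones of matching compare points can be illustrated on a very small scale. The Python sketch below checks equivalence by exhaustively enumerating all input combinations, which is feasible only for tiny cones; it is not how Formality works internally, and the example functions are hypothetical.

```python
from itertools import product


def first_mismatch(golden, revised, num_inputs):
    """Return an input assignment on which the two cones differ, or None.

    Every input combination is enumerated, so no hand-written test
    vectors are needed; the cost grows as 2**num_inputs.
    """
    for values in product([False, True], repeat=num_inputs):
        if golden(*values) != revised(*values):
            return values
    return None


def golden_cone(a, b, c):
    # Logic cone of a compare point as written in the RTL.
    return (a and b) or (a and c)


def revised_cone(a, b, c):
    # Same compare point after structuring factored out `a`.
    return a and (b or c)


def broken_cone(a, b, c):
    # A faulty transformation, included to show a counterexample.
    return a or (b and c)


if __name__ == "__main__":
    print(first_mismatch(golden_cone, revised_cone, 3))
    print(first_mismatch(golden_cone, broken_cone, 3))
```

A result of None means the two cones agree on every input, while a returned assignment is a counterexample that can be used to debug the failing compare point.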
Timing Analysis [6]
Most ASIC designs are synchronous. Synchronous sequential
logic circuits, in addition to storing information, contribute
to the various functions of the design [12]. For error-free
operation of such designs, it is necessary to ensure that their
setup and hold time requirements are met and that the design
can operate at its maximum frequency under different PVT
conditions, for all functional modes, while satisfying
constraints such as input slew, capacitive load, fanout, the
design constraints of the technology library, area and
porosity, and accounting for the effect of switching signals on
their neighbours. There are two ways to perform timing analysis
of the design: Dynamic Simulation and Static Timing Analysis
(STA). Simulation is more accurate but it requires all possible
combinations of functional vectors. In STA, by contrast, the
design is divided into timing paths whose start point is an
input port or a register clock pin and whose end point is an
output port or a register data (D) pin. This leads to four
types of timing paths:
1. Input ports to register data pins
2. Input ports to output ports
3. Register clock pins to register data pins
4. Register clock pins to output ports
In addition to these, Synopsys PrimeTime (the timing analysis
tool) creates path groups for paths that end on combinational
elements used for clock gating, paths that end on asynchronous
preset/clear inputs of flip-flops, default paths that do not
fall into any other path group, and other unconstrained paths.
Timing checks that are done across various corners with
On-Chip Variation (OCV) of PVT are:
• Setup and hold, including clock gating setup and hold,
recovery and removal
• Clock frequency, Minimum pulse width
• Design rules, User-specified constraints
• Glitch detection when clock gating is used in the design
• Crosstalk effects and Maximum Skew checks on sequential
devices which use more than one clock signal.
First the design, technology and constraints files are
analyzed by Synopsys PrimeTime. It then checks the design for
setup violations and reports the path with the least slack,
registers that are not clocked, and timing paths that are not
constrained at their end points. Timing analysis is performed
iteratively during the design cycle. Initially, timing analysis
is done on the synthesized gate-level netlist using cell delay
information from the library and net delays from the specified
wire load models. If the pre-layout timing is met, placement
and routing of the design can be performed; otherwise the
design needs to be re-synthesized. After placement and routing,
the parasitics can be extracted and used for accurate timing
analysis instead of relying on the statistical wire load
models, whose fanout-based interconnect estimates are
inaccurate in VDSM technologies.
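The setup and hold checks on a register-to-register path reduce to comparing a data arrival time with a required time. The following Python sketch assumes a single clock, a launch and a capture flip-flop, and ideal derating; the parameter names and the numbers (in ns) are illustrative and are not taken from the design.

```python
def setup_slack(period, launch_latency, capture_latency,
                data_delay_max, setup_time):
    """Setup slack: data must arrive before the next capture edge."""
    arrival = launch_latency + data_delay_max
    required = period + capture_latency - setup_time
    return required - arrival


def hold_slack(launch_latency, capture_latency,
               data_delay_min, hold_time):
    """Hold slack: data must not change too soon after the same edge."""
    arrival = launch_latency + data_delay_min
    required = capture_latency + hold_time
    return arrival - required


if __name__ == "__main__":
    # Hypothetical path; a negative slack would be a violation.
    print(setup_slack(period=10.0, launch_latency=1.0,
                      capture_latency=1.2, data_delay_max=8.5,
                      setup_time=0.3))
    print(hold_slack(launch_latency=1.0, capture_latency=1.2,
                     data_delay_min=0.4, hold_time=0.1))
```

In pre-layout analysis the data delays would come from library cell delays plus wire load estimates; after routing the same checks are repeated with extracted parasitics.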
OCV [6] [7]
Two identical cells on the same chip might have different
timing characteristics, so timing signoff cannot be complete
until timing analysis has been performed under both worst-case
and best-case operating conditions. The variations between
identical cells on the same chip can come from variations in
the mask, imperfections in optical proximity correction, etch
variations, variations in transistor width and channel length,
a drop in the supply voltage due to IR drop, and variations in
the interconnect resistance. In short, OCV in timing analysis
takes into account the delay differences of cells and
interconnects that might occur between identical cell instances
on the same chip. Without considering OCV, setup checks are
done in the worst case and hold checks are done in the best
case. It might appear that if setup checks pass in the worst
case and hold checks pass in the best case, this is sufficient
for timing signoff. But due to OCV, setup checks can fail in
the best or typical case and hold checks can fail in the worst
or typical corners. OCV analysis can be performed after clock
tree synthesis. With OCV, the setup check is performed by
considering the worst case for the data path and the launch
clock path and the best case for the capture clock path. It is
not exactly the best case for the capture clock path; rather,
the capture clock path is derated by an amount that makes it
slightly faster than the worst case, which builds margin by the
same amount. Similarly, the hold check considers the best case
for the launch clock path and the data path, and for the
capture clock path either the worst case or a case slightly
slower than the best case. When the capture and launch clock
paths share a common clock path, it is pessimistic to consider
both the worst and the best case at the same time for the
common clock path. To avoid this, PrimeTime uses an algorithm
called Clock Reconvergence Pessimism Removal (CRPR), which
ensures that the common clock path is not analyzed with both
the worst-case and the best-case delay at once. CRPR adds or
subtracts the difference between the worst-case and best-case
delays of the cells and nets on the common clock path for setup
and hold checks respectively. On the whole, using OCV and CRPR
in timing analysis helps increase the performance margin.
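The interaction between OCV derating and CRPR can be shown with a small worked calculation. The Python sketch below assumes simple multiplicative derate factors, a clock network split into a common segment and separate launch and capture branches, and a CRPR credit equal to the common-segment delay multiplied by the difference between the late and early derates; all names and values (in ns) are hypothetical and the sketch does not reproduce PrimeTime's exact algorithm.

```python
def ocv_setup_slack(period, common_clock, launch_branch, capture_branch,
                    data_delay_max, setup_time,
                    late=1.0, early=0.9, crpr=True):
    """Setup check: launch clock and data late, capture clock early."""
    launch_arrival = (common_clock + launch_branch + data_delay_max) * late
    capture_arrival = (common_clock + capture_branch) * early
    slack = period + capture_arrival - setup_time - launch_arrival
    if crpr:
        # The common segment cannot be both late and early at once,
        # so the artificial difference is credited back.
        slack += common_clock * (late - early)
    return slack


def ocv_hold_slack(common_clock, launch_branch, capture_branch,
                   data_delay_min, hold_time,
                   late=1.0, early=0.9, crpr=True):
    """Hold check: launch clock and data early, capture clock late."""
    launch_arrival = (common_clock + launch_branch + data_delay_min) * early
    capture_arrival = (common_clock + capture_branch) * late
    slack = launch_arrival - (capture_arrival + hold_time)
    if crpr:
        slack += common_clock * (late - early)
    return slack


if __name__ == "__main__":
    for use_crpr in (False, True):
        s = ocv_setup_slack(10.0, 2.0, 0.5, 0.5, 7.0, 0.2, crpr=use_crpr)
        h = ocv_hold_slack(2.0, 0.5, 0.5, 0.5, 0.1, crpr=use_crpr)
        print("CRPR" if use_crpr else "no CRPR", round(s, 3), round(h, 3))
```

The longer the shared portion of the clock tree, the larger the credit, so paths between neighbouring registers recover most of the pessimism introduced by derating.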
Techniques for timing closure
For timing closure, setup, hold and other timing constraints
need to be met. Techniques that help are:
1. Resize cells and clone cells to reduce fanout
2. Re-synthesize to meet the timing requirements of
critical paths
3. Insert buffers to fix hold violations, using only the
available setup margin so that no setup violation is
introduced in the process
4. Modify the floorplan
Crosstalk Delay Analysis [7] [10]
As technologies shrink, interconnects are placed closer
together, so the increased cross-coupling capacitance between
aggressor and victim nets must be considered for accurate
timing analysis. Both the delay changes and the static noise
effects of crosstalk are analyzed. Crosstalk affects signal
delays by changing the signal transition times and hence might
cause setup and hold violations. The extent to which crosstalk
affects signal delays depends on the magnitude of the
cross-coupling capacitance, the relative signal transition
times and slew rates, the switching polarities (rising or
falling) and the cumulative effect of multiple aggressor nets
on a single victim net. The static noise effects of crosstalk
cause glitches on static signals. To reduce crosstalk effects,
buffers can be inserted on the victim net, which reduces the
area of interaction between the victim net and the aggressor.
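The dependence of crosstalk delay on coupling capacitance and switching polarity can be approximated with a first-order model. The Python sketch below assumes the common textbook simplification in which each coupling capacitance is scaled by a factor of 0, 1 or 2 depending on whether the aggressor switches in the same direction, is quiet, or switches in the opposite direction, and it uses a lumped RC delay estimate; this is not the analysis performed by PrimeTime SI, and all values are hypothetical.

```python
# Assumed first-order scaling of coupling capacitance by aggressor activity.
MILLER_FACTOR = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}


def effective_capacitance(ground_cap, couplings):
    """Victim load seen by its driver.

    ground_cap: capacitance to ground (fF).
    couplings:  list of (coupling_cap_fF, aggressor_activity) pairs, one
                per aggressor, so multiple aggressors accumulate.
    """
    return ground_cap + sum(cap * MILLER_FACTOR[activity]
                            for cap, activity in couplings)


def victim_delay(driver_resistance, ground_cap, couplings):
    """Lumped RC estimate of the 50% delay (kOhm * fF gives ps)."""
    return 0.69 * driver_resistance * effective_capacitance(ground_cap,
                                                            couplings)


if __name__ == "__main__":
    cases = {
        "aggressors quiet": [(4.0, "quiet"), (2.0, "quiet")],
        "worst for setup": [(4.0, "opposite"), (2.0, "opposite")],
        "worst for hold": [(4.0, "same"), (2.0, "same")],
    }
    for label, couplings in cases.items():
        print(label, victim_delay(1.0, 10.0, couplings))
```

Opposite-direction switching slows the victim and threatens setup, while same-direction switching speeds it up and threatens hold, which matches the two kinds of violation mentioned above.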
Future Work
I will be working on the synthesis, timing and physical design
of the HDA controller chip.
Acknowledgment
I would like to thank my supervisor, Mr. Radhakrishnan
Pasirajan, for giving me the opportunity to work on the project
and for his valuable guidance and support. I would also like to
thank Prof. Rahul Dubey for his advice and support.
References
[1]. David R, Scott J, Wayne J, High Definition Audio for the
Digital Home, Intel Press.
[2]. www.intel.com/assets/pdf/general/hdaudio.pdf, last
accessed on 15th February 2009.
[3]. Intel High Definition Audio Specification, Revision 1.0.
[4]. http://www.intel.com/design/chipsets/hdaudio.htm, last
accessed on 15th February 2009.
[5]. Jau S C, Adrian J, High Definition Audio Architecture, US
Patent Application Publication, Pub. No. US 2007/0255432 A1,
Nov. 1, 2007.
[6]. Synopsys Design Compiler, Formality, PrimeTime,
PrimeTime SI Documentation, version 2006.12.
[7]. Magma Blast Fusion User Guide, version 2005.03.
[8]. Himanshu Bhatnagar, Advanced ASIC Chip Synthesis, 2nd
edition, Kluwer Academic Publishers.
[9]. Michael K, Pierre B, Reuse Methodology Manual, 3rd
edition, Kluwer Academic Publishers.
[10]. www.einfochips.com/download/dash_oct07_tech.pd, last
accessed on 15th February 2009.
[11]. Radhakrishnan P, Verilog Coding Styles for Successful
Synthesis, Jan 2003.
[12]. Radhakrishnan P, An Insight into Sequential Logic
Circuits, July 1999.