ASIC Design Methodology

Divya Macharla, 200501148
Mr. Radhakrishnan Pasirajan, Engineering Manager, Open-Silicon
Prof. Rahul Dubey, DA-IICT
Evaluation Committee No. 4

Abstract - The project concerns the design of a High Definition Audio Controller that follows the Intel High Definition Audio Specification 1.0 and also supports codecs that are incompatible with that specification. The design flow includes Logic Synthesis of the Register Transfer Level (RTL) description (mapping to a target technology, along with automated power optimization techniques such as clock gating, and scan insertion for testability), Formal Verification, Pre-Layout Timing Analysis (using cell delays and wire load models) for an initial understanding of the design’s post-layout performance, Floorplanning (including power planning, input/output planning and intellectual property placement), Placement of the design according to the floorplan (along with timing and area optimizations using techniques such as buffering, sizing, trimming, pin swapping, cloning and scan chain reordering), Clock Tree Synthesis, Routing, Post-Layout netlist verification (Regression Analysis and Formal Verification), Post-Layout Timing Analysis, Reliability Analysis (voltage drop, crosstalk and electromigration) and Layout Verification (Design Rule and Layout versus Schematic checks).

Index Terms - High Definition Audio, Synthesis, Static Timing Analysis, Formal Verification, On-Chip Variation

Introduction

Audio in personal computers (PCs) has evolved from producing beep sounds to competing with home theater systems. After the early beep sounds came sound cards, in which digital signal processing (DSP) is performed in a codec that is interfaced with the I/O Controller Hub (ICH); the ICH is attached to the PC through the Peripheral Component Interconnect (PCI) bus.
AC97 allowed DSP operations to use the CPU's computational power, while the codec performs analog-to-digital and digital-to-analog conversion. AC97 required a digital controller in the ICH to send and receive processed digital audio samples [1]. The problem with AC97 is that it cannot support the higher sampling rates, higher bit resolution and multiple channels that help produce better quality sound. Because of these drawbacks, AC97 has been replaced by Intel High Definition (HD) Audio. HD Audio supports up to 8 channels at 192 kHz/32-bit quality, whereas AC97 supports 6 channels at 48 kHz/20-bit [2]. In addition, HD Audio provides dedicated system bandwidth for critical audio functions. High-quality audio, combined with the ability of a PC with a remote control to provide access to music, video and telephone service throughout the house (not just in one room) with surround sound at a reasonably low cost, makes the PC a viable choice for a home audio system [1] [4].

The HD Audio controller (HDAC) replaces the AC97 digital controller in the ICH. In the HDAC, a dedicated DMA engine is associated with each audio stream. The functions of the DMA engine are as follows:
1. Commands and audio stream data are transferred from memory to the respective codecs via the Serial Data Out (SDO) lines between the controller and the codecs, following the HDA link protocol [3].
2. Codec responses to HDAC commands, codec requests and audio stream data are transferred from the codecs to memory via the point-to-point Serial Data In (SDI) lines from the respective codec to the controller, following the HDA link protocol [3].

To achieve this, the DMA engine uses the information in the controller registers to buffer the frame data to be sent from memory to the appropriate audio codecs, and to buffer the frame data to be sent from the codecs to the appropriate memory location.
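As a rough illustration of the AC97 and HD Audio figures quoted above from [2], the raw sample data rate implied by each format is channels x sampling rate x bits per sample. The sketch below only multiplies those quoted numbers; it ignores link framing and protocol overhead, and the function name is illustrative rather than taken from any specification.

```python
def raw_audio_bit_rate(channels, sample_rate_hz, bits_per_sample):
    """Raw PCM payload rate in bits per second (no framing or protocol overhead)."""
    return channels * sample_rate_hz * bits_per_sample


# Figures quoted in the text: AC97 = 6 ch, 48 kHz, 20-bit; HD Audio = 8 ch, 192 kHz, 32-bit.
ac97_rate = raw_audio_bit_rate(6, 48_000, 20)
hda_rate = raw_audio_bit_rate(8, 192_000, 32)

print(f"AC97     : {ac97_rate / 1e6:.3f} Mbit/s")
print(f"HD Audio : {hda_rate / 1e6:.3f} Mbit/s")
print(f"Ratio    : {hda_rate / ac97_rate:.2f}x")
```

Under these assumptions the maximum HD Audio payload is several times larger than that of AC97, which is consistent with the need for dedicated bandwidth and per-stream DMA engines described above.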
The HDA controller and codec architecture and the HDA link protocol [3] form the foundation for the advanced features provided by HDA. The limitation of the HDA architecture is that only codecs whose interface is compatible with the HDA link, commands and responses can be used, which restricts the number of codecs that can be used with HDA. To overcome this limitation, additional logic needs to be implemented in the HDA controller so that an HDA-incompatible codec can still use the HDA features [5]. The project is about the design of a modified HDA controller chip that interfaces with both HDA-compatible and HDA-incompatible codecs.

Design Methodology

Logic Synthesis

The RTL (Register Transfer Level) description, along with the Intellectual Property (IP) used in the chip design, is synthesized using Synopsys Design Compiler to obtain a gate-level netlist that is optimized for the given design constraints. Synthesis starts with setting up the environment. The environment information includes the technology library to be used for synthesis, the libraries that describe the macros and pads, and the values of the tool-specific variables. First the RTL is mapped to GTECH, the Synopsys technology-independent generic library. Then the tool maps the design to the vendor-specific set of technology-specific cells, and also uses the Synopsys DesignWare architectural libraries of technology-independent functions such as adders, multipliers, comparators, FIFOs and so on. The compiler uses the information provided by the library files to optimize the design and to provide timing information. There are two types of technology libraries: Logic and Physical.
The Logic Library contains the following information required for gate-level synthesis and optimization: the type of technology, default values that apply to the entire library, environment attributes such as scaling factors for derating, operating conditions across the various corners (best, typical and worst), timing and power ranges across the corners, wire load models, and models of Process, Voltage and Temperature (PVT) variations. The library also describes, for each cell, its function, timing and other information such as area, pin information, maximum allowed capacitance and fanout, and other design rule checking attributes, all of which guide the synthesis. In very deep submicron (VDSM) technologies, a non-linear delay model is used to characterize timing over a range of input slew rates and output load capacitances, giving the cell delay for a particular combination of input slew rate and output load capacitance.

The Physical Library contains process-specific information such as the number of layers, pitch, resistance, capacitance and so on. Apart from this, it contains the physical characteristics of the cells, such as physical dimensions, layer information, cell orientation and other information required for physical placement and routing.

The performance of the synthesized design depends on the library used. Therefore a good library with some of the characteristics given in [8] should be used. In addition to the generic libraries, there are libraries such as LVT (low VT), RVT (regular VT) and HVT (high VT), whose cells use transistors with different threshold voltages (VT), allowing a trade-off between speed and leakage power. Clock gating is another technique that can be used to reduce power. Clock gating can be inserted at the RTL level and synthesized.

Better synthesis results are obtained by partitioning a design into smaller functional blocks [6] [8] [9]. Some strategies that are helpful in partitioning a design are:
1. Keeping related combinational logic in a single block
2. Registering the inputs and outputs of major blocks
3. Registering the outputs of sub-blocks
4. Separating blocks with different design requirements, such as area-optimized and speed-optimized blocks
5. Partitioning by compile technique
6. Keeping sharable resources together
7. Keeping user-defined resources with the logic they drive

An efficient HDL coding style that considers the hardware implications of the code results in smaller and faster synthesized logic. Some Verilog coding guidelines for successful synthesis are given in [11] [6] [8] [9].

Synthesis Constraints

The operating environment, design rule constraints and design optimization constraints should be specified before a design can be mapped to a gate-level netlist [6].
1. The operating environment must be described before the design can be optimized. It comprises: operating conditions (PVT), which affect the performance of the design; wire load models, which include coefficients for area, capacitance and resistance per unit length and a fanout-to-length table for estimating net lengths, and which are used to estimate the effect of fanout and wire length on the resistance, capacitance and area of nets; and system interface characteristics, such as drive characteristics for input ports, loads on the I/O ports and fanout loads on output ports. For a hierarchical design, the wire load model for nets that cross hierarchical boundaries must also be specified.
2. Design rule constraints, such as transition time, fanout load, capacitance and cell degradation, may be implicit constraints specified in the technology library; additional constraints can also be specified to ensure proper functioning of the fabricated circuit.
3. Design optimization constraints that are usually specified are timing, area and power constraints.
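The wire load model mentioned in item 1 can be sketched as a small lookup: the fanout of a net selects an estimated length, and the per-unit-length coefficients convert that length into resistance, capacitance and area. The class name, the extrapolation slope and all numeric values below are hypothetical and chosen only for illustration; they do not come from any real technology library.

```python
from dataclasses import dataclass


@dataclass
class WireLoadModel:
    resistance: float    # resistance per unit length (hypothetical units)
    capacitance: float   # capacitance per unit length
    area: float          # area per unit length
    slope: float         # assumed extra length per fanout beyond the table
    fanout_length: dict  # fanout -> estimated net length

    def net_length(self, fanout: int) -> float:
        """Estimate net length from fanout, extrapolating past the table."""
        if fanout < 1:
            raise ValueError("fanout must be at least 1")
        if fanout in self.fanout_length:
            return self.fanout_length[fanout]
        largest = max(self.fanout_length)
        if fanout < largest:
            raise ValueError(f"no table entry for fanout {fanout}")
        return self.fanout_length[largest] + self.slope * (fanout - largest)

    def estimate(self, fanout: int) -> dict:
        """Return estimated length, resistance, capacitance and area of a net."""
        length = self.net_length(fanout)
        return {
            "length": length,
            "resistance": length * self.resistance,
            "capacitance": length * self.capacitance,
            "area": length * self.area,
        }


# Hypothetical model: values are placeholders, not library data.
wlm = WireLoadModel(
    resistance=0.001, capacitance=0.0002, area=0.01, slope=5.0,
    fanout_length={1: 10.0, 2: 20.0, 3: 32.0, 4: 45.0},
)
est = wlm.estimate(6)  # fanout 6 is beyond the table, so the slope is used
print(est)
```

Because the estimate depends only on fanout, two nets with the same fanout receive the same parasitics regardless of where their cells are eventually placed.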
The synthesis procedure tries to optimize the design based on these constraints without violating any design rule constraints. Timing constraints include the clock period, waveform, clock latency, clock skew, I/O constraints with respect to the clocks, combinational delay requirements and timing exceptions such as false paths and multicycle paths. The area constraint is set by specifying the maximum area of the design. When both area and timing constraints are specified, Design Compiler tries to meet the timing constraints before the area constraints. To prioritize the area constraint over total negative slack, the area constraint can be specified so that total negative slack is ignored. Design Compiler allows priorities to be set among the constraints if required. After specifying the constraints, the design must be checked for consistency before it is synthesized to a gate-level netlist.

Logic Synthesis Optimization [6]

The optimization step of synthesis ensures that the design is mapped to an optimal combination of cells in the target technology library so that the functional requirements and constraints are met. Architectural, logic-level and gate-level optimization and register retiming are performed by Design Compiler.

Architectural optimization is performed on the HDL code through tasks such as sharing common subexpressions, sharing resources, selecting DesignWare components, reordering operators, and identifying arithmetic expressions for data-path synthesis so that advanced arithmetic optimization can be applied.

Logic-level optimization, performed on the GTECH netlist, involves structuring and flattening. Structuring finds subfunctions that can be factored out and evaluates these factors based on their size and the number of times they appear in the design. Design Compiler converts the subfunctions that reduce the logic equations the most into intermediate variables and uses these variables in place of the subfunctions, resulting in reduced area.
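A toy illustration of structuring, assuming two hypothetical outputs f and g that share the subfunction (a OR b): factoring it out as an intermediate variable t reduces the literal count while leaving the function unchanged. The signal names and expressions are invented for this sketch and are not taken from the HDA controller design.

```python
from itertools import product


def flat(a, b, c, d, e):
    """Unfactored sum-of-products form: 8 literals for f, 4 for g."""
    f = (a and c) or (b and c) or (a and d) or (b and d)
    g = (a and e) or (b and e)
    return bool(f), bool(g)


def structured(a, b, c, d, e):
    """Factored form: t is the shared intermediate variable (7 literals in total)."""
    t = a or b            # shared subfunction, built once
    f = t and (c or d)
    g = t and e
    return bool(f), bool(g)


# Exhaustively confirm that factoring did not change either output.
equivalent = all(
    flat(*values) == structured(*values)
    for values in product([False, True], repeat=5)
)
print("functions equivalent:", equivalent)
```

The saving grows with the number of times the factor is reused, which is why the size and the number of occurrences of a factor are both weighed when choosing what to extract.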
Flattening involves two-level optimization, which minimizes the number of minterms, and multi-level optimization, which minimizes the number of literals, resulting in speed optimization.

Gate-level optimization produces a technology-specific netlist by mapping the cells of the generic netlist to library cells chosen such that the specified constraints are met.

Register retiming is a sequential optimization technique that moves registers through the combinational logic of a design in order to optimize area and timing. The optimal location of registers cannot easily be decided and coded at the RTL level. Register retiming can equalize the combinational delays of the stages, which is useful when some stages exceed the timing constraints while others fall well within them. When no path exceeds the timing constraints, the technique can be used to reduce the number of registers. Pipelining, a well-known technique, is useful for retiming the combinational logic. Register retiming is performed on the mapped gate-level netlist.

Logic Synthesis Strategies [6]

Top-down, bottom-up and mixed synthesis strategies can be used to synthesize hierarchical designs. Top-down compilation automatically takes care of inter-block constraints and optimizes the slack allocation of the top design. However, it is not a useful strategy for large designs because of its large memory requirement and long runtimes. Bottom-up compilation can be used for medium and large hierarchical designs and requires less memory and runtime. In bottom-up compilation, the sub-designs are synthesized first, then combined in the top-level design, and finally the top-level constraints are used to check for violations in the top-level design. Applying top-level constraints in this way might not capture inter-block constraints accurately.
To ensure that inter-block constraints are captured accurately, the top design is first read, mapped to the GTECH library and linked to the library cells and sub-designs that it references; then the top-level constraints are read, and from them the constraints for the sub-designs are created. The sub-designs are then synthesized individually with their respective constraints and used in the top design, and finally the top-level design is synthesized with the top-level constraints. This strategy is also called design budgeting.

In the mixed synthesis strategy, both the top-down and bottom-up strategies are used. Top-down synthesis can be done for the small hierarchies in the top design, and bottom-up synthesis can then be used to synthesize the top design from the already synthesized small hierarchies.

Floorplanning [7]

Initially the gate-level netlist is linked to the physical library. The core size, IP size, aspect ratio, area utilization, type of core cell row structure and power ports are set. Core power rings, I/O pads, bonding pads and filler pads are generated and placed. IPs are oriented and placed, and their blockages are established.

Placement [7]

The scan chain is disconnected so that it does not affect register placement, and the cells in the netlist are placed. After the cells are placed, the scan chains are reconnected so that the loading effects of the scan chain are considered during optimization. High-fanout nets and long wires can be buffered. The cells are resized to match the drive strengths required for the buffered nets. If a cell has a heavy load that cannot be driven even after resizing the cell and buffering, the cell is cloned to reduce the load. Any logic redundancies are removed. Blast Fusion, a physical synthesis tool, uses trimming and area buffering as part of the area optimization of non-timing-critical paths.
Trimming assigns a delay budget to every cell in the design and tries to bring the slack closer to zero. Area buffering is done after trimming to fix any load violations that trimming might have introduced. Blast Fusion performs another optimization technique, pin swapping, on symmetric logic in the worst timing paths of the design: the nets on the inputs of a symmetric logic gate are swapped without changing the function of the logic. The timing optimization techniques (buffering, sizing, cloning and pin swapping) and the area optimization techniques (trimming and area buffering) can be performed multiple times using Blast Fusion to meet the timing and area goals of the design. During buffering, Blast Fusion sometimes considers functional restructuring that improves timing (for example, it converts an AND into a NAND followed by an inverter, but only if this improves timing).

After cell placement and optimization, the scan chain is reordered by connecting the Q pin of each scan flip-flop to the Scan-In pin of a nearby scan flip-flop. This reduces routing congestion but introduces hold violations, which are by default fixed by inserting buffers. (When scan is enabled during logic synthesis, flip-flops are replaced by scan flip-flops.)

Clock Tree Synthesis [7]

Clock logic is inserted and optimized for minimum clock insertion delay and skew (sometimes for useful skew, which helps meet setup time requirements), while meeting the maximum skew, power and design constraints and using the minimum number of clock tree buffers needed to do so. Non-default routing and shielding rules from the initial high-fanout clock net are propagated to all the branches of the new clock tree. With actual clock skew and latency information, hold violations can be fixed. Cells are resized to meet the new clock timing and to drive the loads that were inserted to fix hold violations. Pin swapping can also be used to meet timing. Buffering, cell resizing and pin swapping should be done without affecting signal integrity.
Routing [7]

All signal nets, clock nets, and power and shielding nets are routed, and additional wire spacing is applied if required. Buffering, cell resizing and adjustment of placement and routing can be done to fix any slew and noise violations after routing. Any antenna violations are fixed using diodes. The post-layout netlist is analysed using regression analysis and is also formally verified to ensure that the functionality of the routed netlist has not changed. IR drop, crosstalk and electromigration analyses are performed and any violations are fixed. Design Rule Checks (DRC) and Layout versus Schematic (LVS) checks are done after routing. Cell delays and interconnect parasitic delays are extracted and used for accurate timing analysis in place of the wire load models that were used in pre-layout timing analysis.

Formal Verification [6]

After the design is synthesized, the gate-level netlist is verified against the RTL using Formal Verification (FV) techniques. The tool used for this is Formality. FV shows whether two designs or technology libraries (which can be Verilog files, Synopsys design database format files or SPICE netlist files) are functionally equivalent. Formal verification is an alternative to simulation or regression testing. FV reduces verification time because it does not require input test vectors to verify the design. FV requires a golden design that is functionally correct, which is used to prove or disprove the functional equivalence of a modified version of that design. Formality verifies the optimization steps and other transformations performed during RTL synthesis, clock gating insertion, clock tree synthesis, and placement and routing, including scan chain insertion and reordering and the addition of I/O pads, and also verifies the design after design fixes.
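The idea of proving or disproving equivalence against a golden design can be illustrated with a toy checker that compares two versions of a design output by output. This sketch enumerates every input combination, which is only feasible for tiny examples; production equivalence checkers rely on formal engines rather than enumeration. The signal names, the dictionary layout and the deliberately injected bug are all hypothetical.

```python
from itertools import product


def check_equivalence(golden, revised):
    """Map each checked point to None (equivalent) or a counterexample input."""
    results = {}
    for name, (inputs, golden_fn) in golden.items():
        _, revised_fn = revised[name]
        results[name] = None
        for values in product([0, 1], repeat=len(inputs)):
            env = dict(zip(inputs, values))
            if golden_fn(**env) != revised_fn(**env):
                results[name] = env  # first failing input pattern
                break
    return results


# Golden design: a 2:1 mux on a primary output and an XOR feeding a register.
golden = {
    "out_y": (("a", "b", "s"), lambda a, b, s: a if s else b),
    "reg_q/D": (("a", "b"), lambda a, b: a ^ b),
}
# Revised design: the mux is restructured correctly, the XOR was wrongly made an OR.
revised = {
    "out_y": (("a", "b", "s"), lambda a, b, s: (a & s) | (b & (1 - s))),
    "reg_q/D": (("a", "b"), lambda a, b: a | b),
}

report = check_equivalence(golden, revised)
for point, counterexample in report.items():
    status = "equivalent" if counterexample is None else f"FAILS for {counterexample}"
    print(f"{point}: {status}")
```

Checking each output and register input separately keeps every comparison small and localizes a failure to the logic that drives the failing point.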
Formality verifies the modified design against the golden design by comparing compare points (which can be primary outputs, sequential elements, input pins of black boxes, or nets driven by multiple drivers): the logic cone of each compare point in the implemented or modified design is compared against the logic cone of the matching compare point in the golden design.

Timing Analysis [6]

Most ASIC designs are synchronous. Synchronous sequential logic circuits, in addition to storing information, contribute to the various functions of the design [12]. For error-free operation of such designs, it is necessary to ensure that their setup and hold time requirements are met and that the design can operate at its maximum frequency under different PVT conditions in all functional modes, while satisfying constraints such as input slew, capacitive load, fanout, the design constraints of the technology library, area and porosity, and accounting for the effect of switching signals on other signals.

There are two ways to perform timing analysis of a design: dynamic simulation and Static Timing Analysis (STA). Simulation is more accurate, but it requires all possible combinations of functional vectors. In STA, by contrast, the design is divided into timing paths whose start point is an input port or a register clock pin and whose endpoint is an output port or a register data (D) pin. This leads to four types of timing paths:
1. Input ports to register data pins
2. Input ports to output ports
3. Register clock pins to register data pins
4. Register clock pins to output ports

In addition to these, Synopsys PrimeTime (the timing analysis tool) creates path groups for paths that end on combinational elements used for clock gating, paths that end on asynchronous preset/clear inputs of flip-flops, default paths that do not fall into any other path group, and other unconstrained paths.
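The four basic path types follow directly from the two possible kinds of start point and the two possible kinds of endpoint. The sketch below classifies a path from those two attributes; the string labels and the function name are illustrative and are not PrimeTime terminology.

```python
# The four combinations of start point and endpoint described in the text.
PATH_TYPES = {
    ("input_port", "register_data_pin"): "input port to register data pin",
    ("input_port", "output_port"): "input port to output port",
    ("register_clock_pin", "register_data_pin"): "register clock pin to register data pin",
    ("register_clock_pin", "output_port"): "register clock pin to output port",
}


def classify_path(start_kind, end_kind):
    """Return the basic timing path type for a (start point, endpoint) pair."""
    try:
        return PATH_TYPES[(start_kind, end_kind)]
    except KeyError:
        raise ValueError(
            f"not a basic timing path: {start_kind} -> {end_kind}"
        ) from None


print(classify_path("register_clock_pin", "register_data_pin"))
print(classify_path("input_port", "output_port"))
```

Because every path is bounded by such a start point and endpoint, the analysis can cover the whole design without functional vectors.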
Timing checks that are done across the various corners with On-Chip Variation (OCV) of PVT are:
• Setup and hold, including clock gating setup and hold, recovery and removal
• Clock frequency, minimum pulse width
• Design rules, user-specified constraints
• Glitch detection when clock gating is used in the design
• Crosstalk effects and maximum skew checks on sequential devices that use more than one clock signal

First, the design, technology and constraint files are analyzed by Synopsys PrimeTime. It then checks the design for setup violations and reports the path with the least slack, registers that are not clocked, and timing paths that are not constrained at their endpoints.

Timing analysis is performed iteratively during the design cycle. Initially, timing analysis is done on the synthesized gate-level netlist using cell delay information from the library and net delays from the specified wire load models. If the pre-layout timing is met, placement and routing of the design can be performed; otherwise the design needs to be re-synthesized. After placement and routing, the parasitic delays can be extracted and used for accurate timing analysis rather than relying on the statistical wire load models, particularly because statistical fanout-based interconnect models are inaccurate in VDSM technologies.

OCV [6] [7]

Two identical cells on the same chip might have different timing characteristics, so timing signoff cannot be complete until timing analysis is performed under both worst-case and best-case operating conditions. The variations between identical cells on the same chip can arise from variations in the mask, imperfections in optical proximity correction, etch variations, variations in transistor width and channel length, a drop in the supply voltage due to IR drop, and variations in interconnect resistance.
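A minimal numeric sketch of how such on-chip variation can be modeled in a setup check: late (worst-case) delays are used on the launch clock and data path, a derated, faster delay is used on the capture clock, and an optional credit removes the pessimism on the clock segment shared by both clock paths. The function name, the derating factor and all delay values are hypothetical, and this simplified formula is an assumption for illustration, not the exact calculation performed by any tool.

```python
def setup_slack(period, common, launch_branch, capture_branch, data, t_setup,
                early_derate=1.0, remove_common_pessimism=False):
    """Setup slack with an early derate applied to the capture clock path.

    common: delay of the clock segment shared by launch and capture paths.
    early_derate: factor below 1.0 that makes the capture clock faster.
    """
    # Launch side uses late (undersated worst-case) delays.
    launch_arrival = common + launch_branch + data
    # Capture side uses early (derated) delays.
    capture_arrival = early_derate * (common + capture_branch)
    # The shared segment cannot be both late and early at once, so credit back
    # the difference between its late and early delay when requested.
    credit = common * (1.0 - early_derate) if remove_common_pessimism else 0.0
    return (period + capture_arrival + credit) - (launch_arrival + t_setup)


# Hypothetical values in nanoseconds.
args = dict(period=10.0, common=2.0, launch_branch=1.0,
            capture_branch=1.0, data=7.5, t_setup=0.5)

no_ocv = setup_slack(**args)
with_ocv = setup_slack(**args, early_derate=0.9)
with_credit = setup_slack(**args, early_derate=0.9, remove_common_pessimism=True)

print(f"single corner        : {no_ocv:.2f} ns")
print(f"OCV derate only      : {with_ocv:.2f} ns")
print(f"OCV + common credit  : {with_credit:.2f} ns")
```

With these numbers the derate reduces the slack, and the common-path credit recovers the part of that reduction caused by treating the shared clock segment as both slow and fast.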
In short, OCV in timing analysis takes into account the delay differences of cells and interconnects that might occur between identical cell instances on the same chip. Without OCV, setup checks are done in the worst case and hold checks in the best case. It might appear that if setup checks pass in the worst case and hold checks pass in the best case, this is sufficient for timing signoff. But due to OCV, setup checks can fail in the best or typical case, and hold checks might fail in the worst or typical corners. OCV analysis can be performed after clock tree synthesis.

Using OCV, a setup check is performed by considering the worst case for the data path and launch clock path and the best case for the capture clock path. The capture clock path is not taken at exactly its best case; instead it is derated by an amount so that it is slightly faster than the worst case, thereby building in margin by the same amount. Similarly, for a hold check, the best case is used for the launch clock path and data path, and the worst case (or, with derating, a delay slightly slower than the best case) for the capture clock path. When the capture and launch clock paths share a common clock path, it is pessimistic to apply both the worst and the best case to that common path at the same time. To avoid this, PrimeTime uses an algorithm called Clock Reconvergence Pessimism Removal (CRPR), which ensures that only one of the worst-case or best-case delays is used for the common path in the analysis. CRPR adds or subtracts the difference between the worst-case and best-case delays of the common cells and nets for setup and hold checks respectively. On the whole, using OCV and CRPR in timing analysis helps in increasing the performance margin.

Techniques for timing closure

For timing closure, setup, hold and other timing constraints need to be met. The following techniques can be used:
1. Resize cells and clone cells to reduce fanout
2. Re-synthesize to meet the timing requirements of critical paths
3. Insert buffers to fix hold violations, using the available setup margin so that no setup violation is introduced in the process
4. Modify the floorplan

Crosstalk Delay Analysis [7] [10]

As technologies shrink, interconnects are closer together, so the effect of increased cross-coupling capacitance between aggressor and victim nets must be considered for accurate timing analysis. Both the delay changes and the static noise effects of crosstalk are analyzed. Crosstalk affects signal delays by changing signal transition times and hence might cause setup and hold violations. The extent to which crosstalk affects signal delays depends on the magnitude of the cross-coupling capacitance, the relative signal transition times and slew rates, the switching polarities (rising/falling) and the cumulative effect of multiple aggressor nets on a single victim net. The static noise effects of crosstalk cause glitches on static signals. To reduce crosstalk effects, buffers can be inserted on the victim net, which reduces the area of interaction between the victim net and the aggressor.

Future Work

I will be working on synthesis, timing and physical design of the HDA controller chip.

Acknowledgment

I would like to thank my supervisor, Mr. Radhakrishnan Pasirajan, for giving me the opportunity to work on the project and for his valuable guidance and support. I would also like to thank Prof. Rahul Dubey for his advice and support.

References

[1] David R, Scott J, Wayne J, High Definition Audio for the Digital Home, Intel Press.
[2] www.intel.com/assets/pdf/general/hdaudio.pdf, last accessed on 15th February 2009.
[3] Intel High Definition Audio Specification, Revision 1.0.
[4] http://www.intel.com/design/chipsets/hdaudio.htm, last accessed on 15th February 2009.
[5] Jau S C, Adrian J, High Definition Audio Architecture, US Patent Application Publication, Pub. No. US 2007/0255432 A1, Nov. 1, 2007.
[6] Synopsys Design Compiler, Formality, PrimeTime, PrimeTime SI Documentation, version 2006.12.
[7] Magma Blast Fusion User Guide, version 2005.03.
[8] Himanshu Bhatnagar, Advanced ASIC Chip Synthesis, 2nd edition, Kluwer Academic Publishers.
[9] Michael K, Pierre B, Reuse Methodology Manual, 3rd edition, Kluwer Academic Publishers.
[10] www.einfochips.com/download/dash_oct07_tech.pd, last accessed on 15th February 2009.
[11] Radhakrishnan P, Verilog Coding Styles for Successful Synthesis, Jan 2003.
[12] Radhakrishnan P, An Insight into Sequential Logic Circuits, July 1999.