STRUCTURED IC SYNTHESIS

Contents
Introduction
Switch Models of Transistors
Architectures
Advantages / Disadvantages
Remarks

Introduction
Structured ASICs include everything between an FPGA and a standard cell-based design
Structured ASICs are used mainly for mid-volume designs
The design task for structured ASICs is to map the circuit into a fixed arrangement of known cells

Properties
Low NRE cost
– Implementation engineering effort
– Mask tooling charges
High performance
Low power consumption
Less complex
– Fewer layers to fabricate
Short time to market
– Pre-made cell blocks available for placing

Architecture
Two main levels
– Structured elements
  Combinational and sequential function blocks
  Can be a logical or storage element
– Array of structured elements
  Uniform or non-uniform array styles
  A fixed arrangement of structured elements

Main Implementation Steps
1. RTL design
   Register transfer level design
2. Logical synthesis
   Maps RTL into structured elements
3. Design-for-test insertion
   Improves testability and fault coverage
4. Placement
   Maps each structured element onto array elements
   Places each element into a fixed arrangement
5. Physical synthesis
   Improves the timing of the layout
   Optimizes the placement of each element
6. Clock synthesis
   Distributes the clock network
   Minimizes the clock skew and delay
7. Routing
   Inserts the wiring between the elements

Implementation Issues
Logical synthesis, placement, and routing all depend on the target structured element architecture, which adds complexity to the design process.
The completeness of the target structured ASIC library also affects what can be implemented from the design.

FPGA Vs.
Easy to design
Short development time
Low NRE costs
Design size limited
Design complexity limited
Performance limited
High power consumption
High per-unit cost

Standard Cell ASIC
Difficult to design
Long development time
High NRE costs
Supports large designs
Supports complex designs
High performance
Low power consumption
Low per-unit cost (at high volume)

There are things in between (sometimes referred to as structured ASICs) that combine the best of both

Structured ASIC Architectures: Fine-Grained
Structured elements contain unconnected discrete components
Could include transistors, resistors, and others

Structured ASIC Architectures: Medium-Grained
Structured elements contain generic logic
Could include gates, MUXes, LUTs, or flip-flops

Structured ASIC Architectures: Hierarchical
Uses mini structured elements that contain only gates, MUXes, and LUTs
A mini element does not contain storage elements such as flip-flops
The mini element is then combined with registers or flip-flops

Architecture Comparison
Fine-grained elements require many connections in and out of a structured element
Coarser granularities reduce the connections to the structured element but also reduce the functionality it can support
Each individual design will benefit differently at different granularities

Structured ASIC Advantages
Largely prefabricated
– Components are "almost" connected in a variety of predefined configurations
– Only a few metal layers are needed for fabrication
– Drastically reduces turnaround time
[Figure: layer stack with routing layers above pre-routed layers]

Structured ASIC Advantages
Easier and faster to design than standard cell ASICs
– Multiple global and local clocks are prefabricated
– No skew problems that need to be addressed
– Signal integrity and timing issues are inherently addressed

Structured ASIC Advantages
Capacity, performance, and power consumption closer to those of a standard cell ASIC
Faster design time, reduced NRE costs, and quicker turnaround
Therefore, the
per-unit cost is reasonable for production runs of several hundred to 100k units

Structured ASIC Disadvantages
Lack of adequate design tools
– Expensive
– Altered from traditional ASIC tools
These new architectures have not yet been subject to formal evaluation and comparative analysis
– Tradeoffs between 3-, 4-, and 5-input LUTs
– Tradeoffs between sizes of distributed RAM

Technology Comparison
Generally speaking, for standard cell ASICs, structured ASICs, and FPGAs, respectively:
– 100:33:1 ratio for the number of gates in a given area
– 100:75:15 ratio for performance (based on clock frequency)
– 1:3:12 ratio for power

Design Tools
Many companies are using existing standard cell-based CAD tools
– They add product-specific placement tools
To maximize benefits, we need CAD tools designed specifically for structured ASICs
– Need updated algorithms to exploit the modularity of structured ASICs
– Clock-aware design
Need architectural evaluation and analysis tools

Embedded Clocks (Sometimes)
2 main clocks
– Accessible from anywhere

Embedded Clocks
8 local clocks
– Chip divided into 4 regions
– 4 local clocks can be assigned to each region
– Each region divided into 4 subregions
– Each subregion assigned 2 local clocks

More Clock Signals Needed?
Use a custom layer to implement an additional clock signal
The custom layer is limited, so this may not be feasible
Try to avoid this as much as possible

Assigning Clock Signals
Main/local clock assignment
– Which clock should be the main clock?
– Which clock should be the local clock?
Region clock assignment
– Which local clock should be assigned to each region?
Do we need a custom clock?
– We generally do not want it
3 methods to solve this

Number Based Heuristics: Method 1
Assign the 2 most used clocks as main clocks
Other clocks are local clocks
Assign local clocks to subregions based on I/O positions
Perform placement
Problems
– May not be possible
– What about delay optimization?

Placement Based Clock Optimization: Method 2
1.
Perform placement without clock constraints
   Based on interconnect delays
2. Clock assignment as a result of step 1
   Which clock should be the main clock?
   Which local clock should be assigned to each region?
3. Move violating FFs to other regions
4. Map FFs to embedded positions

Placement Based Clock Optimization: Method 2 Problems
Moving FFs to different regions will drastically increase interconnect delays
Huge performance loss
How do we solve this?

Design Flow
Partition
Front-end physical design
Floorplanning
Placement
Routing
Back-end physical design
Extraction and verification

Floorplanning Based Clock Optimization
1. Circuit partitioning (consider clock and delay domains)
2. Floorplanning without embedded clock constraints
3. Embedded clock constraint violation? If no, done. If yes, continue.
4. Regional clock assignment based on the current floorplan
5. Incremental floorplanning using embedded clock constraints
Done

Let's look at some basics

Series and Parallel Transistor Networks
nMOS: 1 = ON
pMOS: 0 = ON
Series: both must be ON
Parallel: either can be ON
[Figure: (a) series nMOS, (b) series pMOS, (c) parallel nMOS, (d) parallel pMOS networks between terminals a and b with gates g1 and g2, showing the ON/OFF state for each of the four input combinations]

Example: NOR Cell
Activity:
– Sketch a 4-input CMOS NOR gate
[Figure: 4-input NOR gate with inputs A, B, C, D and output Y]

Compound Gates
Compound gates can do any inverting function
Ex: $Y = \overline{A \cdot B + C \cdot D}$ (AND-AND-OR-INVERT, AOI22)
[Figure: panels (a) to (f) building the AOI22 pull-down and pull-up networks and the complete gate]

CMOS O3AI
$Y = \overline{(A + B + C) \cdot D}$
[Figure: O3AI transistor schematic with inputs A, B, C, D and output Y]

Gate Layout
Layout can be very time consuming
– Design gates to fit together nicely
– Build a library of standard cells
Standard cell design methodology
– VDD and GND should abut (standard height)
– Adjacent gates should satisfy design rules
– nMOS at bottom and pMOS at top
– All gates include well and substrate contacts

Example: Inverter

Example: NAND3
Horizontal n-diffusion and p-diffusion strips
Vertical polysilicon gates
Metal1 VDD rail at top
Metal1 GND rail at bottom
32 λ by 40 λ

Stick Diagrams
Stick diagrams help plan layout quickly
– Need not be to scale
– Draw with color pencils or dry-erase markers

Wiring Tracks
A wiring track is the space required for a wire
– 4 λ width, 4 λ spacing from neighbor = 8 λ pitch
Transistors also consume one wiring track

Well Spacing
Wells must surround transistors by 6 λ
– Implies 12 λ between opposite transistor flavors
– Leaves room for one wire track

Area Estimation
Estimate area by counting wiring tracks
– Multiply by 8 to express in λ

Example: O3AI
Sketch a stick diagram for O3AI and estimate area
– $Y = \overline{(A + B + C) \cdot D}$

Placement
Problem
– Given a netlist and fixed-shape cells (small, standard cell), find the exact location of the cells to minimize area and wire length
– Consistent with the standard-cell design methodology
  Row-based, no hard macros
– Modules:
  Usually fixed, equal height (exception: double-height cells)
  Some fixed (I/O pads)
  Connected by edges or hyperedges
Objectives
– Cost components: area, wire length
  Additional cost components: timing, congestion

Placement Cost Components
Area
– Would like to pack all the modules very tightly
Wire length (half-perimeter of the bounding box of each hypernet)
– Minimize average wire length
– Would result in tight packing of modules with high connectivity
Overlap
– Could be prohibited by the moves, or used as a penalty
– Keeps the cells from overlapping (moves cells apart)
Timing
– Not a 1-1 correspondence with wire length minimization, but consistent on average
Congestion
– Measure of routability
– Tends to move cells apart

Importance of Placement
Placement: fundamental problem in physical design
Glue of the physical synthesis
Became very active again in recent years:
– 9 new academic placers for wire-length minimization since 2000
– Many other publications to handle timing, routability, etc.
Reasons:
– Serious interconnect issues (delay, routability, noise) in deep-submicron design
  Placement determines interconnect to the first order
  Need placement information even in early design stages (e.g., logic synthesis)
  Need to have a good placement solution
– Placement problem becomes significantly larger
– Cong et al. [ASPDAC-03, ISPD-03, ICCAD-03] point out that existing placers are far from optimal, not scalable, and not stable
[© He]

Placement Can Make a Difference
MCNC benchmark circuit e64 (contains 230 4-LUTs), placed onto an FPGA.
[Figure: three panels: random initial placement, final placement, after detailed routing]

Design Types
ASICs
– Lots of fixed I/Os, few macros, millions of standard cells
– Placement densities: 40-80% (IBM)
– Flat and hierarchical designs
SoCs
– Many more macro blocks, cores
– Datapaths + control logic
– Can have very low placement densities: < 20%
Microprocessor (μP) Random Logic Macros (RLM)
– Hierarchical partitions are placement instances (5-30K)
– High placement densities: 80%-98% (low whitespace)
– Many fixed I/Os, relatively few standard cells
– Recall "Partitioning w Terminals" DAC '99, ISPD '99, ASPDAC '00

Requirements for Placers
Must handle 4-10M cells, 1000s of macros
– 64 bits + near-linear asymptotic complexity
– Scalable/compact design database (OpenAccess)
Accept fixed ports/pads/pins + fixed cells
Place macros, especially with variable
aspect ratios
– Non-trivial heights and widths (e.g., height = 2 rows)
Honor targets and limits for net length
Respect floorplan constraints
Handle a wide range of placement densities (from <25% to 100% occupied), ICCAD '02

Placement Footprints
Standard cell
Data path
IP / floorplanning
Mixed data path and sea of gates
[Figure: footprint with core, reserved areas, IO, and control regions]
Perimeter IO
Area IO

[Figure: unconstrained placement]
[Figure: floorplanned placement]
[Figure: VLSI global placement examples: bad placement vs. good placement]

Placement Algorithms
Top-down
– Partitioning-based placement
– Recursive bi-partitioning or quadrisection
– Cut direction? Partition vs. physical location
Iterative
– Simulated annealing or force directed
– Start with an initial placement, iteratively improve wire length / area
Constructive
– Start with a few cells in the center, and place highly connected adjacent modules around them

Simulated Annealing Placement
Cost
– Area (usually fixed number of rows, variable row width)
– Wire length (Euclidean or Manhattan)
– Cell overlap (penalty increases with temperature)
Moves
– Exchange two cells within a radius R (R temperature dependent?)
– Displace a cell within a row
– Flip a cell horizontally
Low vs. high temperature
– If used as post-processing, start with a low temperature
Post-processing?
– Might be needed if there are still overlaps

Case Study: TimberWolf
"The TimberWolf Placement and Routing Package", Sechen, Sangiovanni-Vincentelli; IEEE Journal of Solid-State Circuits, vol. SC-20, no.
2 (1985), 510-522
"TimberWolf3.2: A New Standard Cell Placement and Global Routing Package", Sechen, Sangiovanni-Vincentelli; 23rd DAC, 1986, 432-439

TimberWolf
Stage 1
– Modules are moved between different rows as well as within the same row
– Module overlaps are allowed
– When the temperature is reduced below a certain value, stage 2 begins
Stage 2
– Remove overlaps
– Annealing process continues, but only interchanges adjacent modules within the same row

Solution Space
All possible arrangements of modules into rows, possibly with overlaps

Neighboring Solutions
Three types of moves:
– M1: Displace a module to a new location
– M2: Interchange two modules
– M3: Change the orientation of a module
[Figure: M1 displacement, M2 interchange, and M3 reflection about an axis, illustrated on modules 1 to 4]

Move Selection
TimberWolf first tries to select a move between M1 and M2
– Prob(M1) = 4/5
– Prob(M2) = 1/5
If a move of type M1 is chosen (for a certain module) and it is rejected, then a move of type M3 (for the same module) will be chosen with probability 1/10
Restrictions on:
– How far a module can be displaced
– What pairs of modules can be interchanged

Move Restriction
Range limiter (rectangular window R)
– At the beginning, R is very large, big enough to contain the whole chip
– Window size shrinks slowly as the temperature decreases. In fact, the height and width of R are proportional to log(T)
– Stage 2 begins when the window size is so small that no inter-row module interchanges are possible

Cost Function
Cost = $C_1 + C_2 + C_3$
[Figure: bounding box of net i with width $w_i$ and height $h_i$]
– $C_1 = \sum_i (a_i w_i + b_i h_i)$
– $a_i$, $b_i$ are horizontal and vertical weights, respectively
– $a_i = 1$, $b_i = 1$ gives 1/2 the perimeter of the bounding box
– Critical nets: increase both $a_i$ and $b_i$
– Double metal technology: over-the-cell routing is possible, so fewer feed-through cells are needed
– When vertical wirings are "cheaper" than horizontal wirings, use smaller vertical weights, i.e.
$b_i < a_i$

Cost Function (Cont'd)
C2: penalty function for module overlaps
– O(i,j) = amount of overlap in the X-dimension between modules i and j
– $C_2 = \sum_{i \neq j} \left[ O(i,j) + a \right]^2$
– a: offset parameter to ensure $C_2 \to 0$ when $T \to 0$
C3: penalty function that controls the row lengths
– Desired row length = d(r)
– l(r) = sum of the widths of the modules in row r
– $C_3 = b \sum_r \left| l(r) - d(r) \right|$

Annealing Schedule
– $T_k = r(k) \cdot T_{k-1}$, k = 1, 2, 3, ...
– r(k) increases from 0.8 to a maximum value of 0.94 and then decreases to 0.1
– At each temperature, a total of K·n attempts is made
– n = number of modules
– K = user-specified constant

Force-Directed Placement
Model
– Wires simulated as springs (if this is the only force, what will happen?)
  Force_ij = Weight_ij × distance_ij
– Cell sizes as repellent forces
– [Eisenmann, DAC'98]: "vacant" regions work as "attracting" forces, "overcrowded" regions work as "repelling" forces
Algorithm
– Solve a set of linear equations to find an intermediate solution (module locations)
– Repeat the process until equilibrium

Force-Directed Placement (cont.)
Model (details):
– Cell distances: either [equation missing in source]
– or: [equation missing in source]
– Forces: [equation missing in source]
– Objective: find x, y coordinates for all cells such that the total force exerted on each cell is zero

Force-Directed Placement (cont.)
Avoiding overlaps or collapsing into one point?
– Use fixed boundary I/O cells
– Use a repelling force between cells that are not connected by a net
– Do not allow a move that results in overlap
– Use repelling "field" forces from congested areas to sparse ones [Eisenmann, DAC'98]
Problems with force directed:
– Overlap still might occur (cell sizes modeled artificially)
– Flat design, no hierarchy

Partitioning-based Placement
Simultaneously perform:
– Circuit partitioning
– Chip area partitioning
– Assign circuit partitions to chip slots
Problem:
– Circuit partitioning is unaware of the physical location
– Solution: terminal propagation (add dummy terminals)
[Figure: partitions A and B assigned to slots with and without terminal propagation] [She99] p.239

Partitioning-based Placement
More problems:
– Direction of the cut? [Yildiz, DAC'01]
  [Figure: panels (a) to (d) showing alternative cut sequences and the resulting slot numbering]
– How to handle fixed blocks? (the area assigned to a partition might not be enough)
– How to correct a bad decision made at a higher level?
Advantages:
– Hierarchical, scalable
– Inherently apt for congestion minimization, easily extendable to timing optimization
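
Example: Partitioning-based Placement Sketch
The recursive bi-partitioning idea above can be illustrated with a minimal Python sketch. It is not taken from the slides: the function names (hpwl, cut_size, bipartition, place) are hypothetical, a greedy pairwise-swap pass stands in for a real min-cut partitioner, terminal propagation is left out, and the grid is assumed to be w × h unit slots with w and h powers of two and exactly w·h cells. The cut direction is chosen by splitting the longer side of the region, and the result is scored with the half-perimeter wire length described under Placement Cost Components.

```python
def hpwl(nets, pos):
    """Half-perimeter wire length: sum of bounding-box half perimeters over all nets."""
    total = 0
    for net in nets:
        xs = [pos[cell][0] for cell in net]
        ys = [pos[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def cut_size(nets, left, right):
    """Number of nets that have at least one cell on each side of the cut."""
    return sum(1 for net in nets
               if any(cell in left for cell in net)
               and any(cell in right for cell in net))


def bipartition(cells, nets):
    """Split cells into two equal halves, then greedily swap pairs while the cut shrinks.

    Assumption: this simple swap loop is only a stand-in for a proper
    min-cut heuristic; it stops at the first local minimum.
    """
    half = len(cells) // 2
    left, right = set(cells[:half]), set(cells[half:])
    best = cut_size(nets, left, right)
    improved = True
    while improved:
        improved = False
        for a in sorted(left):
            for b in sorted(right):
                new_left = (left - {a}) | {b}
                new_right = (right - {b}) | {a}
                new_cut = cut_size(nets, new_left, new_right)
                if new_cut < best:  # keep only strictly improving swaps
                    left, right, best = new_left, new_right, new_cut
                    improved = True
                    break
            if improved:
                break
    return sorted(left), sorted(right)


def place(cells, nets, x0, y0, w, h, pos):
    """Recursively assign cells to the w x h block of slots whose corner is (x0, y0)."""
    if not cells:
        return
    if len(cells) == 1:
        pos[cells[0]] = (x0, y0)
        return
    first, second = bipartition(cells, nets)
    if w >= h:  # vertical cut line: split the width
        place(first, nets, x0, y0, w // 2, h, pos)
        place(second, nets, x0 + w // 2, y0, w - w // 2, h, pos)
    else:       # horizontal cut line: split the height
        place(first, nets, x0, y0, w, h // 2, pos)
        place(second, nets, x0, y0 + h // 2, w, h - h // 2, pos)


# Usage: four cells, two 2-pin nets, placed on a 2 x 2 grid of slots.
cells = ["a", "b", "c", "d"]
nets = [("a", "c"), ("b", "d")]
pos = {}
place(cells, nets, 0, 0, 2, 2, pos)
print(pos, hpwl(nets, pos))
```

Because each sub-problem only sees the cells inside its own region, connections to cells outside the region are ignored at lower levels. That is the "circuit partitioning unaware of the physical location" problem noted above, which terminal propagation addresses.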