ECE 551 Digital System Design & Synthesis
Lecture 10: Synthesis Techniques

Lecture 10 Topics
  - Synthesis Process Revisited
  - Optimization Stages in Synthesis
  - Advanced Synthesis Strategies

Synthesis
  - Verilog files aren’t hardware yet! Need to “synthesize” them
  - Tool reads hardware descriptions
  - Figures out what hardware to make
  - Done automatically
      - Faster! Easier!
  - Designers still have to understand hardware!
      - Avoid pre- vs. post-synthesis discrepancies
      - Describe EFFICIENT hardware

Useful Documentation
  - Fairly complete documentation is available for the Synopsys tools using:
      /afs/engr.wisc.edu/apps/eda/synopsys/syn_Y-2006.06SP1/sold
  - See especially (through Design Compiler link)
      - Design Vision User Guide
      - Design Compiler User Guide
      - Design Compiler Reference Manuals
      - HDL Compiler (Presto Verilog) Reference Manual
      - HDL Compiler for Verilog Reference Manual
  - Use as references

[Figure: HDL Compiler for Verilog Reference Manual, pg. 1-5]
  - HDL Compiler is called by Design Compiler and Design Vision
  - Why do we need to compare synthesized code to initial code?

[Figure: Design Compiler User Guide, pg. 2-17]
  - Design Vision is GUI for Design Compiler: use design_vision
  - Can also run Design Compiler directly using dc_shell
  - To compile using a synthesis script use
      dc_shell -tcl_mode -f file_name

Synthesis Script Example [1]
    # To run, place in the directory with all the Verilog files
    # and type: dc_shell -tcl_mode -f script.tcl

    # Analyze input files.
    analyze -library WORK -format verilog {./prob5.v ./prob1.v ./prob2.v}

    # Elaborate the design.
    elaborate GF_multiplier_mword -architecture verilog -library WORK

    # Sets clock constraint of 2ns with 50% duty cycle on signal "clock".
    create_clock -name "clk" -period 2 -waveform {0 1} {clock}
    set_dont_touch_network [ find clock clk ]

    # Sets the area constraint for the design
    set_max_area 50000

Synthesis Script Example [2]
    # Check and compile the design
    check_design > check_design.txt
    uniquify
    compile -map_effort medium

    # Export netlist for post-synthesis simulation into synth_netlist.v
    change_names -rule verilog -hierarchy
    write -format verilog -hierarchy -output synth_netlist.v

    # Generate reports
    report_resources > resource_report.txt
    report_area > area_report.txt
    report_timing > timing_report.txt
    report_constraint -all_violators > violator_report.txt
    report_register -level_sensitive > latch_report.txt
    exit

Internal Synthesizer Flow (Synopsys)
  [Figure: flow diagram with boxes HDL Description, Syntax Checking, Synthesizer Policy Checking, Elaboration & Translation, Structural Representation, Architectural Optimization, Multi-Level Logic Optimization, Technology Mapping, Technology Library, Technology-Based Implementation]

Initial Steps
  - Parsing for Syntax and Semantics Checking
      - Gives error messages and warnings to user
      - User may modify the HDL description in response
  - Synthesizer Policy Checking (“Check Design”)
      - Check for adherence to allowable language constructs
      - Are you using unsupported operators or constructs?
      - Combinational feedback? Multiple drivers to non-tristate?
      - This is where you find out you can’t use certain Verilog constructs
      - This is synthesizer-dependent
          - Example: Advanced DesignWare library allows modulo with any value; most other tools only allow modulo with powers of 2.
      - Certain things common to MOST synthesizers
      - See HDL Compiler for Verilog Reference Manual for constructs

Elaboration & Translation
  - Unrolls loops, substitutes macros & parameters, computes constant functions, evaluates generate conditionals
  - Builds a structural representation of the design
      - Like a netlist, but includes larger components
      - Not just gate-level, may include adders, etc.
  - Gives additional errors or warnings to the user
      - Issues in initial transformation to hardware. For example, port sizes do not match
  - Affects quality achieved by optimization steps
      - Structural representation depends on HDL quality
      - Poor HDL can prevent optimization

Importance of Translation
  - It is important for the tool to recognize the sort of logic structures you are trying to describe.
  - If it sees a 32-bit full adder, the tool has built-in solutions for optimizing adders
      - Ripple-carry, carry-save, carry look-ahead, etc.
  - If it just sees a Boolean function with 65 inputs, it has to work a lot harder to achieve the same results
      - Do you think it can invent a CLA on the fly?

Implications of Translation
  - Writing clear, easy to understand code not only benefits other engineers, but may give you better synthesis results.
  - Another reason for standard coding guidelines
      - Brush up on the list in “Verilog Styles That Kill”
  - If you have a decent synthesis tool, it’s usually better to use Verilog’s built-in arithmetic operators rather than trying to build them from gates or Boolean equations

Optimization in Synthesis
  - None of these are guaranteed!
  - Most synthesizers will make at least some attempt
  - Detect and eliminate redundant logic
  - Detect combinational feedback loops
  - Exploit don't-care conditions
  - Try to detect unused states
  - Detect and collapse equivalent states
  - Make state assignments if not made already
  - Synthesize multi-level logic equations subject to:
      - constraints on area and/or speed
      - available technology (library)

Optimization Process
  - Optimization modifies the generic netlist resulting from elaboration and translation.
  - Uses cells from the technology library (mapping)
  - Attempts to meet all specified constraints
  - The process is divided into major phases
      - All or some selection of the major phases may be performed during optimization
      - Phase selection can be controlled by the user
      - Some optimizations can be disabled (ex: set_structure) or forced (ex: set_flatten)

Optimization Phases
  - Major Optimization Stages
      - Architectural
      - Logic-Level
      - Gate-Level
  - Architectural optimization
      - High-level optimizations that occur before the design is mapped to the logic-level
      - Based on constraints and high-level coding style
      - After optimization, circuit function is represented by a generic, technology-independent netlist (GTECH)

Architectural Optimization
  - In Synopsys, optimizations include:
      - Sharing common mathematical subexpressions
      - Sharing resources
      - Selecting DesignWare* implementations
          - Replacing the generic representation from Translation with pre-built, optimized circuits
      - Reordering operators
      - Identifying arithmetic expressions for datapath synthesis
  *DesignWare is Synopsys’s library of pre-designed circuit implementations

Architectural Optimization
  - Examples:
      - Replace an adder used as a counter with incrementer
          count = count + 1;
      - Replace adder and separate subtractor with adder/subtractor if not used simultaneously
          if (~sub) z = a + b;
          else z = a - b;
      - Performs selection of pre-designed components (Synopsys DesignWare)
          - adders, multipliers, shifters, comparators, muxes, etc.
  - Need good code for synthesizer to do this
      - Designer knows more about the project than the tool does! It can only do so much on its own.

Logic/Gate-Level Optimization
  - Works on the generic netlist created by logic synthesis
  - Produces a technology-specific netlist.
  - In Synopsys, it consists of four stages:
      - Mapping
      - Delay optimization
      - Design rule fixing
      - Area optimization
  - This phase often runs in multiple iterations if constraints are not met on the first try

Logic/Gate-Level Optimization
  - Mapping
      - Generates a gate-level implementation using tech library
      - Tries to meet timing and area goals
  - Delay optimization
      - Tries to fix delay violations from mapping phase.
      - Does not fix design rule violations or meet area constraints.
  - Design rule fixing
      - Tries to correct design rule violations
          - Inserting buffers or resizing existing cells
      - If necessary, violates optimization constraints
  - Area optimization
      - Tries to meet area constraints, which have lowest priority

[Figure: Combinational Optimization]

[Figure: Gate-Level Optimization]

Boolean Logic-Level Optimizations
  [Figure: block diagram with Verilog Description, TRANSLATION ENGINE, Two-level Logic Functions, OPTIMIZATION ENGINE, Optimized Multi-level Logic Functions, MAPPING ENGINE, Technology Libraries, Technology Implementation]

Logic Optimizations
  - Area
      - Number of gates (fewer == smaller)
      - Size of gates (# inputs) (fewer == smaller)
  - Delay
      - Number of logic levels (fewer == faster)
      - Size of gates (# inputs) (fewer == faster)
  - Note that examples that follow ignore NOT gates for gate count / levels of circuits
      - This is because many libraries offer gate cells with one or more inputs already inverted.

Logic Optimizations
  - Decomposition
  - Extraction
  - Factoring
  - Substitution
  - Elimination
  - You don’t have to remember the names of these
      - But should understand logic optimization
      - Different techniques targeting area vs.
        delay

Decomposition
  - Find common expressions in a single function
  - Reduce redundancy
      - Reduce area (number/size of gates)
  - May increase delay
      - More levels of logic
  - Define a G(x) cost function to compare expressions
      - G(inverter) = 0
      - G(basic gate) = #inputs to the gate
      - Basic gates: AND, OR, NAND, NOR
      - Based on the concept that the size of a gate is proportional to the number of inputs

Decomposition Example
    F = abc + abd + a’c’d’ + b’c’d’
    F = ab(c + d) + c’d’(a’ + b’)
    F = ab(c + d) + (c + d)’(ab)’
    X = ab               1 gate, 1 level
    Y = c + d            1 gate, 1 level
    F = XY + X’Y’        3 gates, 2 levels (5 gates, 3 levels total)
  - G(Original) = 16 (four 3-input, one 4-input gates)
  - G(Decomposed) = 10 (five 2-input gates)

Extraction
  - Find common sub-expressions between functions
      - Like decomposition, but across more than one function
  - Reduce redundancy
      - Reduce area (number/size of gates)
  - May increase delay if more logic levels introduced

Extraction Example
    F = (a + b)cd + e    3 gates, 3 levels
    G = (a + b)e’        2 gates, 2 levels
    H = cde              1 gate, 1 level
  Common subexp: X = a + b, Y = cd    1 gate, 1 level (each)
    F = XY + e           4 gates, 3 levels
    G = Xe’              2 gates, 2 levels
    H = Ye               2 gates, 2 levels
  - Before: (3) 2-input ORs, (2) 3-input ANDs, (1) 2-input AND
      G(original) = 6 + 6 + 2 = 14
  - After: (2) 2-input ORs, (4) 2-input ANDs
      G(extracted) = 4 + 8 = 12

Factoring
  - Traditional two-level logic is sum-of-products
  - Sometimes better expressed by product-of-sums
  - Fewer literals => less area
  - May increase delay if logic equation not completely factored (becomes multi-level)

Factoring Example
  - Definitely good:
      F = ac + ad + bc + bd    7 gates, 3 levels*
      F = (a + b)(c + d)       3 gates, 2 levels
  - Maybe good:
      F = ac + ad + e          3 gates, 2 levels (G=7)
      F = a(c + d) + e         3 gates, 3 levels (G=6)
  - This one might improve area...
  - But will likely increase delay (tradeoff)
  *Assuming 2-input gates

Substitution
  - Similar to Extraction
      - When one function is a sub-function of another
  - Reduce area
      - Fewer gates
  - Can increase delay if more logic levels

Substitution Example
    G = a + b            1 gate, 1 level
    F = a + b + c        1 gate, 1 level
    F = G + c            2 gates, 2 levels
  - Before: (1) 2-input OR, (1) 3-input OR
  - After: (2) 2-input ORs (better area but increased levels)
  - With compile_ultra, the sub-expressions do not have to explicitly match, i.e. a + b would still be identified if F = b + c + a

Elimination (Flattening)
  - Opposite of previous optimizations
  - Goal is to reduce delay
      - Make signals travel through as few logic levels as possible
  - But will likely increase area
      - Gate replication / redundant logic
  - Can force/disable this step using set_flatten true / set_flatten false

Elimination Example
  Before:
    G = c + d            1 gate, 1 level
    F = Ga + G’b         3 gates, 3 levels
  After:
    G = c + d            1 gate, 1 level
    F = ac + ad + bc’d’  4 gates, 2 levels
  - Before: (2) 2-input ORs, (2) 2-input ANDs
  - After: (1) 2-input OR, (1) 3-input OR, (2) 2-input ANDs, (1) 3-input AND (worse area, but fewer levels)

compile_ultra Optimizations
  - Ultra-high mapping effort, 2-pass Compilation
  - Automatic hierarchical ungrouping
      - Ungroups small modules before mapping
      - Ungroups critical path based on delay
  - Automatic datapath extraction *
      - E.g.
        carry-save adders, sharing/unsharing
  - Boundary optimization
      - Propagates logic across hierarchical boundaries (constants, NC inputs/outputs, NOT)
  - Sequential inversion *
      - Sequential elements can have their outputs inverted

Datapath Extraction Optimizations
  - Uses carry-save adders where beneficial
  - Carry-propagate adders only when result is needed

Datapath Extraction Optimizations
  - Comparator sharing
      - A>B, A=B, A<B use a single subtractor with multiple outputs
  - Optimization of parallel constant multipliers
  - SOP to POS transformation
  - Operand reordering
  - Explores trade-offs of common sub-expression sharing and mutually exclusive resource sharing

Sharing and Unsharing
  - Expression sharing may be overridden later due to timing
      Z1 <= A + B + C
      Z2 <= A + B + D
  - Arrival time is A < B < D < C

Sharing and Unsharing
  - Mutually exclusive operations can share resources
      if (SEL) Z = A + B;
      else Z = C + D;
  - When would this kind of sharing be a bad idea?

Sequential Inversion
    set compile_seqmap_enable_output_inversion true
  - Useful if the available flip-flops do not have the same asynchronous input (preset or clear) as required in the design

Register Retiming
  - At the HDL level, determining the optimal placement of registers is difficult and tedious at best, or just plain impossible at worst
  - The register retiming tool moves registers through the synthesized combinational logic network to improve timing and/or area
      - Equalize delay (i.e. reduce critical path delay by increasing delay in other paths)
      - Reduce the number of flip-flops if timing criteria are met
      - Usually propagate registers forward
  - Be aware that this may change the values of some internal signals compared to pre-synthesis.
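
The "equalize delay" goal above can be pictured with a toy model. The following is a minimal Python sketch, not how Design Compiler implements retiming: it assumes a single path of gates with exactly one register somewhere along it, and the gate delays and the function names clock_period and retime are hypothetical, chosen only for illustration.

```python
def clock_period(delays, reg_pos):
    """Longest register-to-register delay when the single register
    sits after the first reg_pos gates of the path."""
    return max(sum(delays[:reg_pos]), sum(delays[reg_pos:]))


def retime(delays):
    """Try every register position and return (position, period)
    for the one that minimizes the longer of the two stages."""
    best = min(range(len(delays) + 1), key=lambda p: clock_period(delays, p))
    return best, clock_period(delays, best)


# Hypothetical gate delays in ns along one path (assumed values).
delays = [1.0, 2.0, 0.5, 1.5, 1.0]

# Assume the HDL placed the register after the first gate.
before = clock_period(delays, 1)

# Let the toy "retimer" pick a better position.
position, after = retime(delays)

print("period before retiming:", before)
print("register moved to after gate", position, "period:", after)
```

In this made-up path the register moves forward by one gate. The long stage shrinks from 5.0 ns to 3.0 ns while the short stage grows from 1.0 ns to 3.0 ns, which is the "increase delay in other paths" trade mentioned above. The register now captures a different internal node, which is why internal signal values can differ from pre-synthesis simulation.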
[Figure: Register Retiming Example (1)]

[Figure: Register Retiming Example (2)]

DC Topographical Mode
  - When optimizing for delay, the synthesis engine is not aware of the net delays, since the place-and-route has not been accomplished
  - Delays can be back-annotated and synthesis repeated after place-and-route, until closure is reached
  - Layout-aware synthesis attempts to get faster timing closure by predicting the physical design and using that information in synthesis and optimization, particularly with respect to delay
      - Estimates the placement and routing
      - Predicts and uses net capacitances in synthesis and optimization

Further Reading
  - There are many more commands out there to give you greater control over the synthesis process if you want it.
  - See:
      - Synopsys Online Documentation (SOLD)
      - Design Compiler man pages