FPGA design practices and optimization Gyula Istvan Nagy Avoid clock-gating for FPGA design › Clock-gating results in lower performance for FPGA implementation › Use clock-enable Page 2 Avoid data for clock › Using a derived data signal for clock results in low performance › Data propagation delay and skew decreases maximal clock frequency › Use derived signal as clock enable Page 3 Avoid latch usage (1) › Latches read input data until Gate de-asserts › Fed-back networks must have higher inherent data propagation delay than Gate assertion › Constrains minimal data propagation delay › Hard to comply for each route without excessive data propagation › FPGA internal structure is optimized for fastest data propagation Page 4 Avoid latch usage (2) › Master-slave latch cancels minimal data propagation delay requirement › Also reduces maximal allowed data propagation delay that can lead to unnecesary tight timing requirements Page 5 Avoid latch usage (3) › D-type Flip-Flop consists of three RS bistables Page 6 Avoid latch usage (4) › Design with Flip-Flops to make timing closure possible Page 7 Output driving (1) › Tri-state drivers are only in the output buffers for external signaling › No internal tri-state signal definition possible › For synthesized combinational functions control tri-state drivers directly with Flip-Flop output to avoid hazard propagation Page 8 Output driving (2) › Use only Flip-Flop sourced output ports in top-level and subsequent blocks › Use I/O block Flip-Flops for better timing match › Comply maximal SSO count › Use differential driver nodes for most accurate phase relationship of a signal pair if required › Implement parallel bus clock and data with DDR output drivers for best timing match › Sort pins by pad-to-pin propagation of used package and silicon for timing-sensitive peripherals › Take internal PLL phase swing into count when transmitting data to synchronous external device Page 9 Design goals (1) › Design synchronous network with registered outputs and registered inputs if possible › Balance combinational delay between Flip-Flops › Design low fanout networks › Keep the design on small area to be able to operate with low-skew clock network › Use multiple Flip-Flops to reduce fan-outs if necessary › Apply pipelining if necessary › Consider different realizations › Use only Flip-Flop sourced output ports in Top-Level and subsequent blocks Page 10 Design goals (2) › Use as low resources as necessary for the required performance › Implement common reset signal in a clock-domain to avoid combinational logic on reset signals Page 11 Consider different realizations (1) › Example on a single-phase FIR filter › Direct structure has a lot of adders at output that creates large combinational delay although it can be pipelined Page 12 Consider different realizations (2) › Transposed structure is optimal for one sample per clock data-rate Page 13 Consider different realizations (3) › MAC structure is optimal for lower data rates because of less resource consuption Page 14 Pipelining (1) › Pipelining is inserting registers into a combinational network in order to distribute the combinational delay between many registers and allow higher clock frequency operation as long as side effects make it possible › Raises the design Flip-Flop consumption › Data throughput is increased by the achieved higher clock frequency › Input-to-output group delay is increased by the number of inserted FF stages › Overall propagation time is not necessary higher because of the higher clock frequency › Pipelining of fed-back systems requires functional redesign Page 15 Pipelining (2) › Example pipelining on an integrator › Integrator is implemented as a fed-back adder › Full adders are fed-back bitwise, carry bits are connected to adjacent adders › Carry propagation is only fed-forward and can be pipelined › Network is divided into many smaller networks during pipelining that are separated with registers in the carrychain › Input and output data flow requires group-delay balancing Page 16 Pipelining (3) › 100-bit adder direct structure › Can run at 100 MHz in a Spartan 3 -4 speed grade FPGA › Can run at 243 MHz in a Virtex 5 -3 speed grade FPGA Page 17 Pipelining (4) › 1-stage pipelined structure › Can run at 128 MHz in a Spartan 3 -4 speed grade FPGA › Can run at 400 MHz in a Virtex 5 -3 speed grade FPGA Page 18 Pipelining (5) › 2-stage pipeline structure has two registers in the carry chain and two stages of group-delay balancing › Can run at 147 MHz in a Spartan 3 -4 speed grade FPGA › Can run at 344 MHz in a Virtex 5 -3 speed grade FPGA › As a side effect design area is increasing during pipelining › Performance is reduced for Virtex-5 because of higher clock-skew is comparable to data delay › Spartan-3 performance is still increased because data propagation dominates over the increased clock skew Page 19 Pipelining (6) Page 20 Fan-out reduction (1) › Flip-flop multiplication in a parallel manner shares fan-outs of the single Flip-Flop › Raises the design Flip-Flop consumption › Decreased data propagation allows higher clock frequency resulting in higher throughput › Can be automated in the design flow Page 21 Fan-out reduction (2) Page 22 Signal processing (1) › Signal truncation reduces signal-to-noise ratio in the overall spectrum › FIR system coefficient truncation reduces filter rejection as quantization noise spectrum is added to the filter characteristic › IIR system coefficient truncation re-places pole-zero map and can lead settling to an attractor › Always separate parts with pole or zero at z=1 and implement standalone with direct feedback or feedforward › Take quantization noise accumulation into count during partial sum calculations Page 23 Signal processing (2) › Use spare bits for quantization noise storage during accumulation › Round to half-LSB for minimal quantization noise mean value › Also consider time-domain amplitudes for implementation Page 24 General development process Development time ratio › Verification phase is at about three-times longer than source code development › By default verification phase determines the critical path to project goals › Process phase series prioritically affect product quality 3% Specification recognition 7% Architecture design and block definition Prioritically affecting quality › Most of development time is consumed by source code design and verification 25 % RTL source code design Software-based simulation and verification 90 % 75 % Post-synthesis verification Hardware verification Page 25 Reusable design (1) › High-quality system architecture creates clear data flow definition and block-level functional separation › Required for module reusability and effective system verification Page 26 Reusable design (2) › Always consider the simplest and easiest developable solution for the task › Define blocks for separate able parts of the overall task such a way that leads to the fewest, most obvious and reusable block-level port signals which individually reflect the block functionality › Separate only different tasks into different blocks › Avoid the operation-related communication between blocks, keep the communication on task-related level, avoid the communication between block FSMs › Plan the complete functionality of each block before starting the HDL coding in order to avoid further patching Page 27 Reusable design (3) › Consider the most difficult conditions first and plan the blocks’ operation be able to handle all of them › Write tidy HDL code with signal names reflect the signal behavior rather than the operation it’s produced from, the operation can be in comments › Patching can lead to side effects affecting the system-level functionality. Be aware of the causal connections that changes during patching can result in › Keep the simplest hierarchy and most clear design during patching Page 28 Naming convention (1) › Helps indentify signal functionality and physical representation › Faster development possible › Improves reusability Page 29 Naming convention (2) › Generic parameter: „g_” initial › Input port: › Output port: › Bidirectional port: „i_” inital „o_” initial „io_” initial › › › › › › › › › › „const_” initial „t_” initial „q_” initial „c_” initial „w_” initial „_n” ending „proc_” initial „func_” initial „gen_” initial „inst_” initial Constant: Type: Register: Combinational logic: Wire: Negated signal: Process: Function: Generate construct: Instance: Page 30