FPGA design practices and optimization

FPGA design practices and optimization Gyula Istvan Nagy Avoid clock-gating for FPGA design › Clock-gating results in lower performance for FPGA implementation › Use clock-enable Page 2 Avoid data for clock › Using a derived data signal for clock results in low performance › Data propagation delay and skew decreases maximal clock frequency › Use derived signal as clock enable Page 3 Avoid latch usage (1) › Latches read input data until Gate de-asserts › Fed-back networks must have higher inherent data propagation delay than Gate assertion › Constrains minimal data propagation delay › Hard to comply for each route without excessive data propagation › FPGA internal structure is optimized for fastest data propagation Page 4 Avoid latch usage (2) › Master-slave latch cancels minimal data propagation delay requirement › Also reduces maximal allowed data propagation delay that can lead to unnecesary tight timing requirements Page 5 Avoid latch usage (3) › D-type Flip-Flop consists of three RS bistables Page 6 Avoid latch usage (4) › Design with Flip-Flops to make timing closure possible Page 7 Output driving (1) › Tri-state drivers are only in the output buffers for external signaling › No internal tri-state signal definition possible › For synthesized combinational functions control tri-state drivers directly with Flip-Flop output to avoid hazard propagation Page 8 Output driving (2) › Use only Flip-Flop sourced output ports in top-level and subsequent blocks › Use I/O block Flip-Flops for better timing match › Comply maximal SSO count › Use differential driver nodes for most accurate phase relationship of a signal pair if required › Implement parallel bus clock and data with DDR output drivers for best timing match › Sort pins by pad-to-pin propagation of used package and silicon for timing-sensitive peripherals › Take internal PLL phase swing into count when transmitting data to synchronous external device Page 9 Design goals (1) › Design synchronous network with registered outputs and registered inputs if possible › Balance combinational delay between Flip-Flops › Design low fanout networks › Keep the design on small area to be able to operate with low-skew clock network › Use multiple Flip-Flops to reduce fan-outs if necessary › Apply pipelining if necessary › Consider different realizations › Use only Flip-Flop sourced output ports in Top-Level and subsequent blocks Page 10 Design goals (2) › Use as low resources as necessary for the required performance › Implement common reset signal in a clock-domain to avoid combinational logic on reset signals Page 11 Consider different realizations (1) › Example on a single-phase FIR filter › Direct structure has a lot of adders at output that creates large combinational delay although it can be pipelined Page 12 Consider different realizations (2) › Transposed structure is optimal for one sample per clock data-rate Page 13 Consider different realizations (3) › MAC structure is optimal for lower data rates because of less resource consuption Page 14 Pipelining (1) › Pipelining is inserting registers into a combinational network in order to distribute the combinational delay between many registers and allow higher clock frequency operation as long as side effects make it possible › Raises the design Flip-Flop consumption › Data throughput is increased by the achieved higher clock frequency › Input-to-output group delay is increased by the number of inserted FF stages › Overall propagation time is not necessary higher because of the higher clock frequency › Pipelining of fed-back systems requires functional redesign Page 15 Pipelining (2) › Example pipelining on an integrator › Integrator is implemented as a fed-back adder › Full adders are fed-back bitwise, carry bits are connected to adjacent adders › Carry propagation is only fed-forward and can be pipelined › Network is divided into many smaller networks during pipelining that are separated with registers in the carrychain › Input and output data flow requires group-delay balancing Page 16 Pipelining (3) › 100-bit adder direct structure › Can run at 100 MHz in a Spartan 3 -4 speed grade FPGA › Can run at 243 MHz in a Virtex 5 -3 speed grade FPGA Page 17 Pipelining (4) › 1-stage pipelined structure › Can run at 128 MHz in a Spartan 3 -4 speed grade FPGA › Can run at 400 MHz in a Virtex 5 -3 speed grade FPGA Page 18 Pipelining (5) › 2-stage pipeline structure has two registers in the carry chain and two stages of group-delay balancing › Can run at 147 MHz in a Spartan 3 -4 speed grade FPGA › Can run at 344 MHz in a Virtex 5 -3 speed grade FPGA › As a side effect design area is increasing during pipelining › Performance is reduced for Virtex-5 because of higher clock-skew is comparable to data delay › Spartan-3 performance is still increased because data propagation dominates over the increased clock skew Page 19 Pipelining (6) Page 20 Fan-out reduction (1) › Flip-flop multiplication in a parallel manner shares fan-outs of the single Flip-Flop › Raises the design Flip-Flop consumption › Decreased data propagation allows higher clock frequency resulting in higher throughput › Can be automated in the design flow Page 21 Fan-out reduction (2) Page 22 Signal processing (1) › Signal truncation reduces signal-to-noise ratio in the overall spectrum › FIR system coefficient truncation reduces filter rejection as quantization noise spectrum is added to the filter characteristic › IIR system coefficient truncation re-places pole-zero map and can lead settling to an attractor › Always separate parts with pole or zero at z=1 and implement standalone with direct feedback or feedforward › Take quantization noise accumulation into count during partial sum calculations Page 23 Signal processing (2) › Use spare bits for quantization noise storage during accumulation › Round to half-LSB for minimal quantization noise mean value › Also consider time-domain amplitudes for implementation Page 24 General development process Development time ratio › Verification phase is at about three-times longer than source code development › By default verification phase determines the critical path to project goals › Process phase series prioritically affect product quality 3% Specification recognition 7% Architecture design and block definition Prioritically affecting quality › Most of development time is consumed by source code design and verification 25 % RTL source code design Software-based simulation and verification 90 % 75 % Post-synthesis verification Hardware verification Page 25 Reusable design (1) › High-quality system architecture creates clear data flow definition and block-level functional separation › Required for module reusability and effective system verification Page 26 Reusable design (2) › Always consider the simplest and easiest developable solution for the task › Define blocks for separate able parts of the overall task such a way that leads to the fewest, most obvious and reusable block-level port signals which individually reflect the block functionality › Separate only different tasks into different blocks › Avoid the operation-related communication between blocks, keep the communication on task-related level, avoid the communication between block FSMs › Plan the complete functionality of each block before starting the HDL coding in order to avoid further patching Page 27 Reusable design (3) › Consider the most difficult conditions first and plan the blocks’ operation be able to handle all of them › Write tidy HDL code with signal names reflect the signal behavior rather than the operation it’s produced from, the operation can be in comments › Patching can lead to side effects affecting the system-level functionality. Be aware of the causal connections that changes during patching can result in › Keep the simplest hierarchy and most clear design during patching Page 28 Naming convention (1) › Helps indentify signal functionality and physical representation › Faster development possible › Improves reusability Page 29 Naming convention (2) › Generic parameter: „g_” initial › Input port: › Output port: › Bidirectional port: „i_” inital „o_” initial „io_” initial › › › › › › › › › › „const_” initial „t_” initial „q_” initial „c_” initial „w_” initial „_n” ending „proc_” initial „func_” initial „gen_” initial „inst_” initial Constant: Type: Register: Combinational logic: Wire: Negated signal: Process: Function: Generate construct: Instance: Page 30

FPGA design practices and optimization

Related documents

Products

Support

FPGA design practices and optimization

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib