Virtex-5 FPGA Coding Techniques, part 2 Script
1-- Hello and welcome to this recorded e-Learning module on Virtex-5 HDL Coding
Techniques. This module is one of the recommended recorded e-Learning modules, for
the Designing for Performance course and is also a part of the FPGA to ASIC REL
curriculum.
My name is Frank Nelson, I will be your instructor for this module. I am a Xilinx
Technical Trainer and Course Developer. This module introduces some of the
architectural differences of Virtex-5 and their impact on HDL Coding Techniques. This
module introduces some of the primary concepts that impact the quality of results a
designer will get when synthesizing for a Virtex-5 FPGA. This module also provides
some detailed recommendations about creating effective HDL that will provide high
speed and reduce the amount of FPGA resources used in your design.
2--You can download a printable PDF version of this module and a copy of the script by
clicking on the icon you see here. If you would like to do so now, pause this recording
and then continue when you are ready to proceed.
3-- This slide describes the curriculum path for ASIC designers to learn Xilinx software,
hardware and FPGA design techniques.
<click> There are 5 RELs to help you get familiar with FPGA design. There is also a
corresponding HDL Coding Techniques module for Spartan-3 devices. The content in
this module will help you immediately start making coding improvements to your
Virtex-5 FPGA design. There are two modules for Virtex-5 HDL Coding Techniques content.
This is part 2. For more information on these courses, visit support.xilinx.com and click
on the education link at the top of the page.
Also note that these RELs do not introduce the ISE Design Suite, good design practices,
and other essential information. This content is part of our Fundamentals of FPGA
Design and Designing for Performance courses. We recommend that you take the time to
attend these courses.
4- After completing this module you will be able to build an efficient Virtex-5 design
that runs at high speed.
<click> You will also be able to avoid the most common HDL coding mistakes that
prevent Virtex-5 designers from achieving their performance objectives.
<click> Careful attention to the content included in this module is essential if you are
new to designing for Virtex-5, and especially if you are handling the design conversion.
5- We are going to discuss some of the most important ways to optimize your HDL code
for a Virtex-5 device.
<click> After listening to this entire module, you should be able to make your own
checklist of tips for building a high-speed Virtex-5 design.
6—A common question is “How do I control how my synthesis tool infers clock
enables?” Recall that like synchronous resets, clock enables can be connected directly to
the FF or be connected to a LUT input. The idea is that you may want your lowest-fanout
clock enables to use a LUT input. This will decrease the number of control sets in your
design.
To force the CE signal onto a LUT input, code it as part of the equation for the register’s
output (as seen on the right). So in this example, instead of coding it as we usually do (as
seen on the left, which will force the CE to the register’s CE port), write the detailed
equations and the explicit connections to the register’s output. Because the equation then
has more than one input, the CE is forced onto a LUT input.
Xilinx is not recommending that you design asynchronously, just that you try to reduce
the number of low-fanout CEs (those driving fewer than eight registers). Synthesis tools will get “smarter”
about when to use dedicated CE control signals and when to move the CE function to the
datapath, but it will take some time.
You should also note that some synthesis tools allow the user to disable the use of the CE
port on the register. Xilinx does not recommend this use, typically, because on average it
results in the use of 25% more LUTs.
<click> So remember to code for your low fan-out clock enables to drive LUT inputs.
This will enable the FF to be part of a larger control set and give the tools more flexibility
to meet your timing objectives.
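As a rough illustration of the point above, here is a minimal Python sketch (a behavioral model, not HDL) that compares a register using its dedicated CE port with the same register whose enable is written out as part of the data equation. The function names are made up for this example. It only shows that the two forms compute the same next state; which one your synthesis tool actually maps to a LUT input depends on the tool.

```python
from itertools import product

def ff_with_ce_port(q: int, d: int, ce: int) -> int:
    """Next state when the dedicated CE pin gates the update."""
    return d if ce else q

def ff_with_ce_in_equation(q: int, d: int, ce: int) -> int:
    """Next state when CE is folded into the D-input equation.

    The register output q is fed back explicitly, so the equation has
    three inputs (q, d, ce) and the FF itself is always enabled.
    """
    return (ce & d) | ((1 - ce) & q)

# Exhaustively compare both forms over every 1-bit input combination.
mismatches = [
    (q, d, ce)
    for q, d, ce in product((0, 1), repeat=3)
    if ff_with_ce_port(q, d, ce) != ff_with_ce_in_equation(q, d, ce)
]
print("mismatches:", mismatches)
```

Because the behavior is identical, moving a low-fanout enable into the equation changes only where the enable is implemented, not what the register does.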
7—We also want to mention that the MAP report (MAP is a program that is part of place
and route) will report on the number of control sets.
<click> To do this, you will have to run MAP with the -detail switch to get the analysis.
Note that this report is long, but it may be useful (or you may just be curious).
<click> Don’t forget that having a low number of members for each control set is cause
for concern because the tools may have trouble sharing slice and CLB resources with
those small groups of registers. There is no typical range, but if I were attempting a
design conversion, having some statistics would help me see the improvement I made
during the design conversion. So this is just an FYI.
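To make the idea of control set statistics concrete, here is a small Python sketch that groups registers by their clock, clock enable, and set/reset signals and flags the groups with few members. The register list and signal names are hypothetical, and this is not the format of the MAP report. It only shows the kind of bookkeeping you might do before and after a conversion.

```python
from collections import defaultdict

# Hypothetical register list: (register name, clock, clock enable, set/reset).
# None means the register does not use that control signal.
registers = [
    ("cnt[0]", "clk", "cnt_en", "rst"),
    ("cnt[1]", "clk", "cnt_en", "rst"),
    ("cnt[2]", "clk", "cnt_en", "rst"),
    ("flag", "clk", "flag_en", "rst"),
    ("sync_a", "clk", None, None),
    ("sync_b", "clk", None, None),
]

def group_control_sets(regs):
    """Group register names by their (clock, CE, set/reset) combination."""
    sets = defaultdict(list)
    for name, clk, ce, sr in regs:
        sets[(clk, ce, sr)].append(name)
    return dict(sets)

def small_control_sets(sets, threshold=8):
    """Return the control sets that have fewer than `threshold` members."""
    return {key: names for key, names in sets.items() if len(names) < threshold}

control_sets = group_control_sets(registers)
# A threshold of 2 is used here only because the example list is tiny.
singletons = small_control_sets(control_sets, threshold=2)

print("number of control sets:", len(control_sets))
print("control sets with one member:", singletons)
```

The default threshold of eight simply echoes the low-fanout figure mentioned earlier in this module; treat it as a starting point rather than a rule.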
8—As you know, gated clocks can cause glitches, increased clock delay, clock skew, and
other undesirable effects. Using clock enables improves the timing characteristics and
reliability of your design. There are several ways to use clock-enable resources.
<click> To gate entire clock domains for power reduction, it is preferable to use the
clock-enabled global buffer resource called the BUFGCE. This is the symbol here, taken
from the Xilinx Unified Library Guide. This will de-activate the clock being distributed
on a global basis. As an FYI, don’t forget that Virtex-5 supports a synchronous
switching of individual clock domains with the BUFGMUX primitive.
<click> For applications that only attempt to pause the clock for a few cycles on small
areas of the design, the preferred method is to use the clock-enable port of each FPGA
register.
<click> So try to take advantage of the BUFGCE functionality. It will save you from
having to route a high-fanout CE throughout your device. And saving routing resources
can do a lot to improve the speed of your FPGA design.
We discuss synchronous design techniques and efficient ways to build clocks in the
Fundamentals of FPGA Design course.
9—Many signal processing algorithms perform an arithmetic operation on an input
stream of samples followed by a summation of all outputs of the arithmetic operation. To
implement the summation in an FPGA, an adder tree structure is typically used. One of
the key difficulties with adder trees is their varying size: the number of adders depends
on the number of inputs to the adder tree, so more inputs simply require more adders.
This increases both the amount of resources and the power consumption of the system.
Larger trees also mean larger adders in the last stages, which can negatively impact
speed.
<click> It is simply better to implement these with dedicated resources, such as the DSP
slice. This implementation involves computing the summation incrementally using
chained adders instead of adder trees. This maximizes performance and lowers power for
DSP algorithms because both logic and interconnect are contained within the dedicated
silicon. When pipelined, the performance is 500 MHz in the fastest speed grade,
independent of the number of adders.
The process of converting a direct form filter to a transposed or systolic form is
detailed in UG073: XtremeDSP for Virtex-4 FPGAs User Guide.
Likewise, you should check out UG193: Virtex-5 FPGA XtremeDSP Design
Considerations.
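The difference between the two summation structures can be sketched with a short Python model. This is a behavioral illustration only, with made-up function names and sample data; it counts adders and stages and says nothing about actual DSP slice timing.

```python
def adder_tree_sum(samples):
    """Sum by pairwise reduction; returns (total, adders used, tree levels).

    Assumes at least one sample.
    """
    level, adders, levels = list(samples), 0, 0
    while len(level) > 1:
        # Add neighbouring pairs; an odd leftover passes through to the next level.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], adders, levels

def adder_chain_sum(samples):
    """Sum incrementally, one fixed-size adder per stage; returns (total, stages)."""
    total, stages = 0, 0
    for sample in samples:
        total += sample  # each stage adds its own input to the running sum
        stages += 1
    return total, stages

samples = [3, 1, 4, 1, 5, 9, 2, 6]
tree_result = adder_tree_sum(samples)
chain_result = adder_chain_sum(samples)
print("tree  (total, adders, levels):", tree_result)
print("chain (total, stages):", chain_result)
```

Both structures give the same total. In the tree, the number of levels grows with the number of inputs, while in the chain every stage is the same single adder, which is why a pipelined chain can run at a rate that does not depend on how many stages there are.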
10—When inferring dual port block memories, it is possible for both ports to access the
same location at the same time. In this case, the contents cannot be guaranteed. The
latest FPGAs have three programmable operating modes to govern the memory output
while a write operation is occurring. They are write first (or transparent mode), read
first (or read before write mode), and no change mode. Synthesis tools can infer these
configurations based on your HDL coding style. You should avoid read before write
mode to achieve maximum Block RAM performance.
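The three operating modes can be illustrated with a minimal Python model of a single port. This is a sketch for intuition only: the function name and memory contents are invented for the example, and the mode strings are chosen to mirror the names used above.

```python
MODES = ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE")

def ram_port_cycle(mem, addr, write_enable, data_in, prev_out, mode):
    """Model one clock edge on one port and return the new value of its output."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    old_value = mem[addr]
    if not write_enable:
        return old_value   # plain read: all modes behave the same
    mem[addr] = data_in    # the write itself always happens
    if mode == "WRITE_FIRST":
        return data_in     # output shows the data being written
    if mode == "READ_FIRST":
        return old_value   # output shows the previous contents
    return prev_out        # NO_CHANGE: output holds its last value

# Write 0xAB to address 1 in each mode and record what the port outputs.
outputs = {}
for mode in MODES:
    mem = [0x11, 0x22]
    outputs[mode] = ram_port_cycle(mem, 1, True, 0xAB, prev_out=0x00, mode=mode)

print({mode: hex(value) for mode, value in outputs.items()})
```

In all three modes the memory ends up holding the new data; only the value seen on the output during the write differs.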
<click> Synthesis tools, such as Synplify, insert bypass logic around the RAM to
prevent a possible mismatch between the RTL and hardware behavior. This extra logic
is intended to force the RAM outputs to known values when collisions can occur.
<click> If the designer knows that these collisions cannot occur in the system, a
synthesis tool setting can be used to prevent bypass logic from being added to the
application. This extra logic has a negative impact on the memory performance.
11—All Xilinx FPGAs have dedicated registers on the input and output pads. By using
these registers, set-up times for the input paths and clock-to-output times for the output
paths can be minimized, making it easier to meet timing requirements for capturing and
providing data to external devices.
<click> However, using the dedicated IO registers can have a negative effect on meeting
the timing requirements within the FPGA; their use can lengthen route delays to the
internal logic.
<click> Unless the designer needs the IO registers to meet IO timing, Xilinx recommends
that the registers be placed within the FPGA fabric. The ideal solution is that your
synthesis tool, like Synplify, will automatically place registers based on the timing
specifications you make. Otherwise, the following steps must be performed…
<click> Disable global IO register usage in your synthesis tool. This will prevent the
synthesis tool from mapping these registers to the IOBs every time. This option is
usually on by default.
<click> Disable the Map option to pack registers into the IOBs. Map is a part of the
place and route process. It is the phase where logic is grouped into CLBs and, in our
case, IOBs. Since the implementation tools have this option on by default, it may be
worth evaluating your timing with this option off.
<click> So after turning both default options off, you can now selectively move registers
into the IOBs with a UCF attribute. So start with your timing critical IO pads and use the
IOB=TRUE syntax in your UCF file or your source HDL. For more information on how
to do this, refer to the Xilinx Constraints Guide. Likewise, check your synthesis vendor’s
documentation if you want to set these attributes with your synthesis tool or your
HDL.
<click> But however you choose to map specific registers to IOBs, try to only use the
IOB registers to help meet IO timing.
Now let’s take a couple of minutes and take a quiz on what has been presented so far.
12—(questions)
13--Keep in mind that it is important to register all inputs and outputs to a hierarchical
block when there is a possibility incremental design practices will be employed. You
may recall that incremental design allows a designer to maintain place and route on a
module basis. The benefit is that it saves development time when you have found a result
that meets your needs. So implementing the design each time now becomes
implementing just the components you are not happy with. In general, these practices are
popular and worthwhile. But for them to be effective in meeting system performance
needs, designers have to anticipate having signals travel a significant distance across the
die. To help with this, we recommend registering all inputs and outputs to your most timing
critical hierarchical blocks.
<click> Likewise, you should try to keep your IO resources at the top level of your
design. Usually this is done through direct instantiation. While inferring IO registers is
usually okay, instantiation of DDR, Serdes, and delay resources is recommended. It is
just plain easier than inferring those resources.
<click> Also keep in mind that any logic that needs to be placed in a single resource, such
as a Block RAM or a DSP slice, should be contained in a single hierarchical block.
<click> Any logic that the synthesis tool needs to resource share should be placed in a
single hierarchical block. Recall that we recommend turning resource sharing off,
but if you decide that it is best left on for a particular component, then isolate
that logic into one block.
<click> Also, manually duplicate registers with high fanout at a hierarchical boundary.
<click> By following these guidelines, it is far less likely that your chosen design
hierarchy will interfere with design optimization and performance.
14—The idea of replicating registers that have a high fanout net on the output still
applies in Virtex-5. This allows a high fanout signal to be moved closer to some of the
many destinations where it is required.
The easiest way to determine if this is part of a problem is to generate a timing report,
either from synthesis or after implementation, and determine if you have a high fanout
net as part of the data path. If so, manual duplication is recommended. This will require
you to prevent the synthesis tool from removing the extra logic (which is its default
behavior) by using a keep attribute. If your synthesis tool allows you to control
replication, that can also be useful. But keep in mind that replicating logic just once is
rarely sufficient. Depending on how far you are from meeting your timing needs, you will
probably have to replicate that logic a few times.
<click> Once you have added registers through pipeline stages, you should also
remember to allow your synthesis tool to migrate those registers where needed. This is
done with the Retiming option.
<click> (nothing said)
15—Since we have been talking about registers being at a premium in Virtex-5 designs, we
wanted to take a second and mention some considerations about synthesis options.
<click> First of all, don’t over constrain your design. This means that you should not
place timing constraints on your design that are unrealistic. As we always say, only place
timing constraints that reflect what you must have. Remember that unless you are
operating at worst case operating conditions, you already have some built in slack. Over
constraining typically makes designs bigger and increases the number of registers used,
since synthesis tools tend to replicate logic to decrease fan-out on timing critical paths.
Typically this increases register usage by 1-5%.
<click> Global optimization settings don’t usually solve your register usage problems,
although they can generate different results. Typically, global optimization can reduce
your FF usage by up to 10%. However, this often forces the synthesis tool to generate
extra control sets, and can result in unusable FFs. So feel free to try it and compare your
results, but don’t expect it to solve all of your problems.
<click> FSM optimization can save you some FFs, but you will probably have to encode
the FSM yourself. This is because most FSM optimizations will use One Hot Encoding
which uses more FFs in an effort to increase speed. With the Virtex-5 FPGA, the 6-input
LUT means one-hot encoding is not quite as useful as it used to be with the 4-input LUT
structure. I would recommend that you synthesize your FSM yourself by trying binary,
OHE, and Gray encoding. Then compare your results and see which one you like best. It
is also important to determine whether your FSM is part of a timing critical path (often
they are). If so, you may have little choice but to use OHE. But be an engineer and
evaluate the encoding yourself. Note that over time, this is something synthesis tools will
learn from and improve upon.
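To see why the choice of encoding matters for register usage, here is a small Python sketch that counts the state register bits each encoding needs and shows the code assigned to one state. The function names and the ten-state example are assumptions for illustration. It does not model the next-state logic, which is where the speed difference comes from.

```python
def state_register_bits(num_states: int, encoding: str) -> int:
    """Number of flip-flops needed to hold the state for a given encoding."""
    if encoding == "one_hot":
        return num_states  # one FF per state
    if encoding in ("binary", "gray"):
        # Smallest width that can represent num_states distinct codes.
        return max(1, (num_states - 1).bit_length())
    raise ValueError(f"unknown encoding: {encoding}")

def encode_state(index: int, num_states: int, encoding: str) -> str:
    """Bit string assigned to the state with the given index."""
    width = state_register_bits(num_states, encoding)
    if encoding == "one_hot":
        value = 1 << index
    elif encoding == "gray":
        value = index ^ (index >> 1)  # standard binary-to-Gray conversion
    else:
        value = index
    return format(value, f"0{width}b")

num_states = 10
ff_counts = {enc: state_register_bits(num_states, enc)
             for enc in ("binary", "gray", "one_hot")}
print("FFs per encoding:", ff_counts)
print("state 3 codes:", {enc: encode_state(3, num_states, enc) for enc in ff_counts})
```

For a ten-state machine, binary and Gray need four flip-flops while one-hot needs ten. Whether the extra registers buy you enough speed is exactly what you find out by synthesizing each encoding and comparing the results.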
<click> Finally, don’t use slice or LUT compression switches. While this does not
consume too many extra FFs, in some cases latch-thrus are used and waste registers.
Recall that the slice does not allow you to separate the FF from the LUT, so in the case of
a latch-thru the LUT is used, but the FF is wasted.
<click> So we don’t recommend over-constraining your design. It just increases register
usage too much. We also don’t recommend using slice or LUT compression switches. It
can waste registers.
16—So to summarize our synthesis options.
<click> Turn on logic replication and retiming. Logic replication is designed to duplicate
logic that is generating a high fan-out signal. This duplication can be done by the
designer if they know their design well enough.
Retiming is best done when you have already evaluated your need to pipeline and you
have added those extra registers. Remember that you don’t need to pipeline with
Virtex-5 as much as you did with older devices.
<click> Turn off resource sharing.
<click> Turn on logic optimization. This is a starting point. You may find that you
prefer it to be used for widening deep data paths, but this will depend on your design.
<click> Turn off FSM optimization. Synthesis tools tend to use one-hot too much and
this wastes registers. Synthesize your large FSMs with each encoding yourself and choose
the one that is fastest and smallest.
<click> Don’t over-constrain during synthesis. Synthesis tools tend to make the design
much larger than necessary.
<click> Don’t use slice or LUT compression switches. This tends to waste FFs.
<click> All of these options tend to make the design larger, but save FFs and give the
tools more flexibility.
Now we want to reiterate that we are providing guidelines in these modules. These are
not hard rules that always deliver the best performance, but you should start out by trying
these techniques with Virtex-5 devices. After that, you can experiment as you see fit, but
I am betting you will only be experimenting a little bit, if at all.
17—As I mentioned earlier, migrating designs to Virtex-5 can be a little challenging
since the architecture is so different. So let’s see what designs are easiest to migrate.
First off, designs that use the dedicated hard IP are an excellent fit. In fact, taking
advantage of this IP is critical, since this will improve your device utilization and design
speed very quickly. This is especially true of designs that use the low-power MGT
resources (Serial Gigabit Transceivers), EMACs, DSP slice resources, Block RAMs,
Power PC 440 processor, and PCI resources.
<click> Low power applications that use the dedicated IP are also ideal, because Virtex-5
uses optimum power and the dedicated IP is very power efficient.
<click> Low-speed designs that were not optimized for a predecessor device
family (that is, not highly pipelined) also work well. This is because such a design probably
has several logic levels on its timing critical path. In this case, the 6-input LUT will yield a
performance improvement.
<click> However, note that as long as you are willing to evaluate your pipelining
techniques, instantiate some dedicated IP, and verify your existing coding styles, Virtex-5
should be optimum.
18—The toughest designs are those that have not gone through any conversion. They
tend to be the slowest and the most challenging. This usually includes designs that have
not been re-synthesized and contain old netlists and cores from previous architectures and
applications.
The first thing I am going to have to do is convince this customer that they are not taking
enough advantage of the dedicated hardware, which means that they are wasting some of
their money. And as soon as they start adding stuff and changing their design, all they
need is some guidance. And hopefully this REL is helping provide that guidance.
Also keep in mind that designs that have not been re-optimized may struggle to meet
timing requirements in Virtex-5. So what is the chance of the design being successful if
it’s not going to meet your timing needs?
<click> But besides that, some designs that don’t fully utilize the dedicated DSP
resources as much as possible tend to be challenging. This could be because their
synthesis tool did not infer all the DSP slice resources they hoped and in such case they
have to instantiate the components they want.
<click> And as I mentioned earlier, designs that were heavily pipelined for an older
architecture and never re-optimized struggle to meet timing.
<click> Well, what is in common? Yeah, they just need to optimize their code some.
<click> But in general, I find that most customers realize that every new FPGA requires
some work on their part to fully utilize all of the architecture advantages of a new product
family. In this case, the 6-input LUT and the dedicated IP will need to be targeted.
19—Now for some common questions.
<click> We often get asked “Why can’t I code how I want to?”
<click> The answer is simply that synthesis tools and implementation tools cannot do as
much of the work as customers would like. In the end, there will always be design work
necessary to compensate for synthesis and implementation tools. This is especially
important if you are going to try and get the most out of any FPGA product family.
<click> Another common question is “Shouldn’t the tools be able to make my code
optimal?”
<click> Well, the bottom line is synthesis and implementation tools have limitations.
They always will. This is exactly the reason we provided this content: so you can
improve your code and get the most out of your Virtex-5 design.
20—Some more common questions…
<click> “The Virtex-5 FPGA should always be a speed grade faster than Virtex-4, right?”
<click>No, this is not always true, especially if the design was heavily pipelined.
<click> “This design easily fit in Virtex-4, and now it can’t fit in Virtex-5. What’s
wrong?”
<click> Remember to check your control sets. I am willing to bet you have too many.
Check your use of Resets and Clock Enables as well. You should also remember that
each device family has a different set of dedicated hardware. Did you use as much of the
dedicated hardware as you could? Don’t forget to make sure that your Cores also share
the same control signals as your HDL.
<click> “Why can’t the software just optimize my inverters across a partition?”
<click> Remember the purpose of partitions is for the designer to control hierarchy and
preserve logic. If any tool selectively removed this option, customers would scream at
us.
Now let’s take a minute for some review questions.
21—(questions)
22—Well, we tried to provide a detailed summary of the most important coding tips we
provided in this module. (read slide)
23— Well, there are lots of places to learn more about FPGAs and they all start at
support.xilinx.com.
I want to point out that I referenced several White Papers that you should utilize. They
all provide more information on optimizing your design and HDL coding for FPGAs.
The third white paper “Get your Priorities Right” pertains to Spartan-3 and older Virtex
devices which are 4-input LUT based devices. The first two are specific to Virtex-5.
I spoke about a User Guide for the DSP slice resources. That is UG193, Virtex-5 FPGA
XtremeDSP Design Considerations. It provides insight into the slice construction and offers
numerous suggestions for its use. To make sure you’re getting the most out of your
Virtex-5 device, I recommend you check it out.
I also mentioned the Constraints Guide that is located on our web site as well. This guide
covers all of Xilinx’s supported constraints, including XST’s support of timing
constraints with the XCF file. I mentioned this guide when discussing how to assign
specific registers to IOBs. We call that a mapping constraint. Have a look at the
constraints documentation, but you might get more out of it after attending the Essentials
of FPGA Design and Designing for Performance courses.
24—Thank you for listening to the Virtex-5 FPGA Coding Techniques RELs. There is
still more to learn about FPGA design so we encourage you to attend the Fundamentals
of FPGA Design course followed by the Designing for Performance course.
If you would like to see what other courses we offer, or what other Free RELs are
available click on the icon you see here.
But whatever you do, please take a second and let us know what you thought of this REL.
Just click on this icon and tell us what you think.
My name is Frank Nelson. You have been listening to Virtex-5 FPGA Coding
Techniques, Part 2 REL. Thanks for listening.
25- (nothing said)