Virtex-5 FPGA Coding Techniques, Part 2 Script

1-- Hello and welcome to this recorded e-Learning module on Virtex-5 HDL Coding Techniques. This module is one of the recommended recorded e-Learning modules for the Designing for Performance course and is also a part of the FPGA to ASIC REL curriculum. My name is Frank Nelson, and I will be your instructor for this module. I am a Xilinx Technical Trainer and Course Developer. This module introduces some of the architectural differences of Virtex-5 and their impact on HDL coding techniques. It also introduces some of the primary concepts that affect the quality of results a designer will get when synthesizing for a Virtex-5 FPGA, and it provides some detailed recommendations about creating effective HDL that will provide high speed and reduce the amount of FPGA resources used in your design.

2-- You can download a printable PDF version of this module and a copy of the script by clicking on the icon you see here. If you would like to do so now, pause this recording and then continue when you are ready to proceed.

3-- This slide describes the curriculum path for ASIC designers to learn Xilinx software, hardware, and FPGA design techniques. <click> There are 5 RELs to help you get familiar with FPGA design. There is also a corresponding HDL Coding Techniques module for Spartan-3 devices. The content in this module will help you immediately start making coding improvements to your Virtex-5 FPGA design. There are two modules for Virtex-5 HDL Coding Techniques content. This is part 2. For more information on these courses, visit support.xilinx.com and click on the education link at the top of the page. Also note that these RELs do not introduce the ISE Design Suite, good design practices, and other essential information. That content is part of our Fundamentals of FPGA Design and Designing for Performance courses. We recommend that you take the time to attend these courses.
4-- After completing this module, you will be able to build an efficient Virtex-5 design that runs at high speed. <click> You will also be able to avoid the most common HDL coding mistakes that prevent Virtex-5 designers from achieving their performance objectives. <click> Careful attention to the content in this module is essential if you are new to designing for Virtex-5, and especially if you are handling the design conversion.

5-- We are going to discuss some of the most important ways to optimize your HDL code for a Virtex-5 device. <click> After listening to this entire module, you should be able to make your own checklist of tips for building a high-speed Virtex-5 design.

6-- A common question is “How do I control how my synthesis tool infers clock enables?” Recall that, like synchronous resets, clock enables can be connected directly to the FF or be connected to a LUT input. You may want your lowest-fanout clock enables to use a LUT input, because this decreases the number of control sets in your design. To force the CE signal onto a LUT input, code it as part of the register’s output (as seen on the right). So in this example, instead of coding it as we usually do (as seen on the left, which will force the CE to the register’s CE port), write the detailed equations and the explicit connections to the register’s output. In this case, because the equation has more than one input, the CE is forced onto a LUT input. Xilinx is not recommending that you design asynchronously, just that you try to avoid designing low-fanout CEs (those driving fewer than eight registers). Synthesis tools will get “smarter” about when to use dedicated CE control signals and when to move the CE function to the datapath, but it will take some time. You should also note that some synthesis tools allow the user to disable the use of the CE port on the register. Xilinx does not typically recommend this, because on average it results in the use of 25% more LUTs.
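The two coding styles on this slide can be compared with a small behavioral model. This is only a sketch in Python, not the HDL shown on the slide; the function names `ff_with_ce_port` and `ff_with_ce_in_datapath` are made up for illustration. It shows that folding the enable into the logic in front of the register gives the same cycle-by-cycle behavior as using the dedicated CE port.

```python
def ff_with_ce_port(q, d, ce):
    """Usual style: the flip-flop's dedicated CE pin decides whether D is loaded."""
    return d if ce else q


def ff_with_ce_in_datapath(q, d, ce):
    """Datapath style: the flip-flop loads on every clock edge, and a LUT in
    front of D computes next = (ce AND d) OR (NOT ce AND q)."""
    lut_out = (ce & d) | ((1 - ce) & q)
    return lut_out


def simulate(step, stimulus, q0=0):
    """Apply one (d, ce) pair per clock edge and record the register value."""
    q = q0
    trace = []
    for d, ce in stimulus:
        q = step(q, d, ce)
        trace.append(q)
    return trace


# Hypothetical stimulus: (d, ce) on five consecutive clock edges.
STIMULUS = [(1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]

if __name__ == "__main__":
    print(simulate(ff_with_ce_port, STIMULUS))
    print(simulate(ff_with_ce_in_datapath, STIMULUS))
```

The second style spends LUT inputs on the enable and the feedback of the register's own output, which is the trade the slide describes: a little more logic in exchange for one less control set.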
<click> So remember to code your low-fanout clock enables to drive LUT inputs. This will enable the FF to be part of a larger control set and give the tools more flexibility to meet your timing objectives.

7-- We also want to mention that the MAP report (MAP is the program that runs as part of place and route) will report on the number of control sets. <click> To get this analysis, you will have to run MAP with the -detail switch. Note that this report is long, but it may be useful (or just interesting). <click> Don’t forget that having a low number of members in each control set is cause for concern, because the tools may have trouble sharing slice and CLB resources with those small groups of registers. There is no typical range, but if I were attempting a design conversion, having some statistics would help me see the improvement I made during the conversion. So this is just an FYI.

8-- As you know, gated clocks can cause glitches, increased clock delay, clock skew, and other undesirable effects. Using clock enables improves the timing characteristics and reliability of your design. There are several ways to use clock-enable resources. <click> To gate entire clock domains for power reduction, it is preferable to use the clock-enabled global buffer resource called the BUFGCE. This is the symbol here, taken from the Xilinx Unified Library Guide. This will deactivate the clock being distributed on a global basis. As an FYI, don’t forget that Virtex-5 supports synchronous switching of individual clock domains with the BUFGMUX primitive. <click> For applications that only attempt to pause the clock for a few cycles on small areas of the design, the preferred method is to use the clock-enable port of each FPGA register. <click> So try to take advantage of the BUFGCE functionality. It will save you from having to route a high-fanout CE throughout your device. And saving routing resources can do a lot to improve the speed of your FPGA design.
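To make the control-set concern from slide 7 a little more concrete, here is a rough sketch in Python. It is not how MAP produces its report, and the register list is hypothetical. It relies on one architectural fact: the four flip-flops in a Virtex-5 slice share the same clock, clock enable, and set/reset, so registers from different control sets cannot share a slice.

```python
from collections import defaultdict
from math import ceil

FFS_PER_SLICE = 4  # Virtex-5 slice: four flip-flops sharing clock, CE, and set/reset


def control_set_report(registers):
    """registers: iterable of (name, clock, ce, reset) tuples.

    Returns {control_set: (ff_count, slices_needed, unused_ffs)}, where
    unused_ffs counts flip-flops left over in partly filled slices that no
    other control set can use.
    """
    groups = defaultdict(list)
    for name, clock, ce, reset in registers:
        groups[(clock, ce, reset)].append(name)

    report = {}
    for control_set, members in groups.items():
        count = len(members)
        slices = ceil(count / FFS_PER_SLICE)
        report[control_set] = (count, slices, slices * FFS_PER_SLICE - count)
    return report


# Hypothetical design: one 16-bit register bank plus two flags with private CEs.
REGS = [(f"data[{i}]", "clk", "ce_main", "rst") for i in range(16)]
REGS += [("flag_a", "clk", "ce_a", "rst"), ("flag_b", "clk", "ce_b", "rst")]
REPORT = control_set_report(REGS)

if __name__ == "__main__":
    for control_set, (count, slices, unused) in REPORT.items():
        print(control_set, "FFs:", count, "slices:", slices, "unused FFs:", unused)
```

In this made-up example, the two single-register control sets each tie up a slice and strand three flip-flops, which is the kind of small group the narration warns about.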
We discuss synchronous design techniques and efficient ways to build clocks in the Fundamentals of FPGA Design course.

9-- Many signal processing algorithms perform an arithmetic operation on an input stream of samples, followed by a summation of all outputs of the arithmetic operation. To implement the summation in an FPGA, an adder tree structure is typically used. One of the key difficulties with adder trees is their varying size: the number of adders depends on the number of inputs to the tree, so more inputs simply require more adders. This increases both the amount of resources and the power consumption of the system. Larger trees also mean larger adders in the last stages, which can negatively impact speed. <click> It is simply better to implement these with dedicated resources, such as the DSP slice. This implementation involves computing the summation incrementally using chained adders instead of adder trees. This maximizes performance and lowers power for DSP algorithms because both logic and interconnect are contained within the dedicated silicon. When pipelined, the performance is 500 MHz in the fastest speed grade, independent of the number of adders. The process of converting a direct form filter to a transposed or systolic form is detailed in UG073: XtremeDSP for Virtex-4 FPGAs User Guide. Likewise, you should check out UG193: Virtex-5 FPGA XtremeDSP Design Considerations.

10-- When inferring dual-port block memories, it is possible for both ports to access the same location at the same time. In this case, the contents cannot be guaranteed. The latest FPGAs have three programmable operating modes to govern the memory output while a write operation is occurring: write first (or transparent mode), read first (or read-before-write mode), and no change mode. Synthesis tools can infer these configurations based on your HDL coding style.
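The three operating modes can be sketched with a small behavioral model. This is Python for illustration only, with a hypothetical class name `BlockRamPort`. It models the output of a single port while that same port writes, and it does not model the dual-port collision case mentioned above.

```python
class BlockRamPort:
    """Behavioral model of one synchronous block RAM port (illustrative only)."""

    MODES = ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE")

    def __init__(self, depth, mode):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.mem = [0] * depth
        self.mode = mode
        self.dout = 0

    def clock(self, addr, din=0, we=False):
        """One rising clock edge with the port enabled; returns the output latch."""
        if not we:
            self.dout = self.mem[addr]      # plain read
        elif self.mode == "WRITE_FIRST":
            self.dout = din                 # new data appears on the output
        elif self.mode == "READ_FIRST":
            self.dout = self.mem[addr]      # old contents appear on the output
        # NO_CHANGE: the output keeps its previous value during a write
        if we:
            self.mem[addr] = din
        return self.dout


def demo(mode):
    """Return (output during a write, output of a later read of that address)."""
    ram = BlockRamPort(depth=4, mode=mode)
    ram.clock(addr=1, din=7, we=True)   # store 7 at address 1
    ram.clock(addr=2, din=5, we=True)   # store 5 at address 2
    ram.clock(addr=2)                   # read address 2, output becomes 5
    during_write = ram.clock(addr=1, din=9, we=True)
    after = ram.clock(addr=1)
    return during_write, after


if __name__ == "__main__":
    for mode in BlockRamPort.MODES:
        print(mode, demo(mode))
```

In all three modes the memory ends up holding the new value; only what appears on the output during the write cycle differs.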
You should avoid read-before-write mode to achieve maximum Block RAM performance. <click> Synthesis tools, such as Synplify, insert bypass logic around the RAM to prevent a possible mismatch between the RTL and hardware behavior. This extra logic is intended to force the RAM outputs to known values when collisions can occur. <click> If the designer knows that these collisions cannot occur in the system, a synthesis tool setting can be used to prevent bypass logic from being added to the application. This extra logic has a negative impact on memory performance.

11-- All Xilinx FPGAs have dedicated registers on the input and output pads. By using these registers, setup times for the input paths and clock-to-output times for the output paths can be minimized, making it easier to meet timing requirements for capturing and providing data to external devices. <click> However, using the dedicated IO registers can have a negative effect on meeting the timing requirements within the FPGA; their use can lengthen route delays to the internal logic. <click> Unless the designer needs the IO registers to meet IO timing, Xilinx recommends that the registers be placed within the FPGA fabric. The ideal solution is that your synthesis tool, like Synplify, will automatically place registers based on the timing specifications you make. Otherwise, the following steps must be performed… <click> Disable global IO register usage in your synthesis tool. This will prevent the synthesis tool from mapping these registers to the IOBs every time. This option is usually on by default. <click> Disable the Map option to pack registers into the IOBs. Map is a part of the place and route process. It is the phase where logic is grouped into CLBs and, in our case, IOBs. Since the implementation tools have this option on by default, it may be worth evaluating your timing with this option off.
<click> So after turning both default options off, you can now selectively move registers into the IOBs with a UCF attribute. Start with your timing-critical IO pads and use the IOB=TRUE syntax in your UCF file or your source HDL. For more information on how to do this, refer to the Xilinx Constraints Guide. Likewise, check your synthesis vendor’s documentation if you want to make these attributes with your synthesis tool or your HDL. <click> But however you choose to map specific registers to IOBs, try to only use the IOB registers to help meet IO timing. Now let’s take a couple of minutes for a quiz on what has been presented so far.

12-- (questions)

13-- Keep in mind that it is important to register all inputs and outputs to a hierarchical block when there is a possibility that incremental design practices will be employed. You may recall that incremental design allows a designer to maintain place and route on a module basis. The benefit is that it saves development time when you have found a result that meets your needs. So instead of implementing the whole design each time, you implement just the components you are not happy with. In general, these practices are popular and worthwhile. But for them to be effective in meeting system performance needs, designers have to anticipate having signals travel a significant distance across the die. To help with this, we recommend registering all inputs and outputs to your most timing-critical hierarchical blocks. <click> Likewise, you should try to keep your IO resources at the top level of your design. Usually this is done through direct instantiation. While inferring IO registers is usually okay, instantiation of DDR, SerDes, and delay resources is recommended. It is just plain easier than inferring those resources. <click> Also keep in mind that any logic that needs to be placed in a single resource, such as a Block RAM or a DSP slice, should be contained in a single hierarchical block.
<click> Any logic that needs the synthesis tool to share resources should be placed in a single hierarchical block. Recall that we recommend turning resource sharing off, but if you decide that it is best left on for a particular component, then isolate that logic into one block. <click> Also, manually duplicate registers with high fanout at a hierarchical boundary. <click> By following these guidelines, it is far less likely that your chosen design hierarchy will interfere with design optimization and performance.

14-- The idea of replicating registers that have a high-fanout net on the output still applies in Virtex-5. This allows a high-fanout signal to be moved closer to some of the many destinations where it is required. The easiest way to determine whether this is part of a problem is to generate a timing report, either from synthesis or after implementation, and check whether you have a high-fanout net as part of the data path. If so, manual duplication is recommended. This will require you to prevent the synthesis tool from removing the extra logic (which is its default behavior) by using a keep attribute. If your synthesis tool allows you to control replication, that can also be useful, but keep in mind that replicating logic just once is rarely sufficient. Depending on how far you are from meeting your timing needs, you will probably have to replicate that logic a few times. <click> Once you have added registers through pipeline stages, you should also remember to allow your synthesis tool to migrate those registers where needed. This is done with the Retiming option. <click> (nothing said)

15-- Since we have been talking about registers being at a premium in Virtex-5 designs, we wanted to take a second and mention some considerations about synthesis options. <click> First of all, don’t over-constrain your design. This means that you should not place timing constraints on your design that are unrealistic.
As we always say, only place timing constraints that reflect what you must have. Remember that unless you are operating at worst-case operating conditions, you already have some built-in slack. Over-constraining typically makes designs bigger and increases the number of registers used, since synthesis tools tend to replicate logic to decrease fanout on timing-critical paths. Typically this increases register usage by 1-5%. <click> Global optimization settings don’t usually solve your register usage problems, although they can generate different results. Typically, global optimization can reduce your FF usage by up to 10%. However, this often forces the synthesis tool to generate extra control sets and can result in unusable FFs. So feel free to try it and compare your results, but don’t expect it to solve all of your problems. <click> FSM optimization can save you some FFs, but you will probably have to encode the FSM yourself. This is because most FSM optimizations will use one-hot encoding (OHE), which uses more FFs in an effort to increase speed. With the Virtex-5 FPGA, the 6-input LUT means one-hot encoding is not quite as useful as it used to be with the 4-input LUT structure. I would recommend that you synthesize your FSM yourself, trying binary, OHE, and Gray encoding. Then compare your results and see which one you like best. It is also important to determine whether your FSM is part of a timing-critical path (often they are). If so, you may have little choice but to use OHE. But be an engineer and evaluate the encoding yourself. Note that over time, this is something synthesis tools will learn from and improve upon. <click> Finally, don’t use slice or LUT compression switches. While this does not consume too many extra FFs, in some cases latch-thrus are used and waste registers. Recall that the slice does not allow you to separate the FF from the LUT, so in the case of a latch-thru the LUT is used, but the FF is wasted.
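For the FSM encoding comparison recommended on this slide, the register cost of each encoding is easy to estimate before you synthesize anything. The following Python sketch is illustrative only; the helper names `state_bits`, `gray_codes`, and `encode` are made up, and the sketch says nothing about speed, which you still have to measure with your own synthesis runs.

```python
def state_bits(num_states, encoding):
    """Number of state flip-flops each encoding needs."""
    if encoding == "one_hot":
        return num_states                       # one FF per state
    if encoding in ("binary", "gray"):
        # ceil(log2(num_states)), with at least one bit
        return max(1, (num_states - 1).bit_length())
    raise ValueError(f"unknown encoding: {encoding}")


def gray_codes(num_states):
    """Reflected Gray code: consecutive values differ in exactly one bit."""
    return [i ^ (i >> 1) for i in range(num_states)]


def encode(num_states, encoding):
    """Return the state codes as bit strings, in state order."""
    width = state_bits(num_states, encoding)
    if encoding == "binary":
        values = list(range(num_states))
    elif encoding == "gray":
        values = gray_codes(num_states)
    else:
        values = [1 << i for i in range(num_states)]
    return [format(value, f"0{width}b") for value in values]


if __name__ == "__main__":
    for name in ("binary", "gray", "one_hot"):
        print(name, state_bits(10, name), encode(4, name))
```

For a 10-state machine, binary and Gray encoding need 4 state flip-flops while one-hot needs 10, which is why leaving every FSM on one-hot can use more registers than necessary.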
<click> So we don’t recommend over-constraining your design. It just increases register usage too much. We also don’t recommend using slice or LUT compression switches. They can waste registers.

16-- So to summarize our synthesis options. <click> Turn on logic replication and retiming. Logic replication is designed to duplicate logic that is generating a high-fanout signal. This duplication can be done by the designer if they know their design well enough. Retiming is best done when you have already evaluated your need to pipeline and you have added those extra registers. Remember that you don’t need to pipeline with Virtex-5 as much as you did with older devices. <click> Turn off resource sharing. <click> Turn on logic optimization. This is a starting point. You may find that you prefer it to be used for widening deep data paths, but this will depend on your design. <click> Turn off FSM optimization. Synthesis tools tend to use one-hot too much, and this wastes registers. Synthesize your large FSMs with each encoding yourself and choose the one that is fastest and smallest. <click> Don’t over-constrain during synthesis. Synthesis tools tend to make the design much larger than necessary. <click> Don’t use slice or LUT compression switches. This tends to waste FFs. <click> All of these options tend to make the design larger, but save FFs and give the tools more flexibility. Now we want to reiterate that we are providing guidelines in these modules. These are not hard rules that always deliver the best performance, but you should start out by trying these techniques with Virtex-5 devices. After that, you can experiment as you see fit, but I am betting you will only be experimenting a little bit, if at all.

17-- As I mentioned earlier, migrating designs to Virtex-5 can be a little challenging since the architecture is so different. So let’s see what designs are easiest to migrate. First off, designs that use the dedicated hard IP are an excellent fit.
In fact, taking advantage of this IP is critical, since this will improve your device utilization and design speed very quickly. This is especially true of designs that use the low-power MGT resources (Serial Gigabit Transceivers), EMACs, DSP slice resources, Block RAMs, the PowerPC 440 processor, and PCI resources. <click> Low-power applications that use the dedicated IP are also ideal, because Virtex-5 uses optimum power and the dedicated IP is very power efficient. <click> But low-speed designs that were not optimized for a predecessor device family (that is, not highly pipelined) would also work well. This is because they probably have several logic levels on their timing-critical paths. In this case, the 6-input LUT will yield a performance improvement. <click> However, note that as long as you are willing to evaluate your pipelining techniques, instantiate some dedicated IP, and verify your existing coding styles, Virtex-5 should be optimum.

18-- The toughest designs are those that have not gone through any conversion. They tend to be the slowest and the most challenging. This usually includes designs that have not been re-synthesized and contain old netlists and cores from previous architectures and applications. The first thing I am going to have to do is convince this customer that they are not taking enough advantage of the dedicated hardware, which means that they are wasting some of their money. And as soon as they start adding to and changing their design, all they need is some guidance. And hopefully this REL is helping provide that guidance. Also keep in mind that designs that have not been re-optimized may struggle to meet timing requirements in Virtex-5. So what is the chance of the design being successful if it’s not going to meet your timing needs? <click> But besides that, designs that don’t use the dedicated DSP resources as much as possible tend to be challenging.
This could be because their synthesis tool did not infer all the DSP slice resources they hoped for, in which case they have to instantiate the components they want. <click> And as I mentioned earlier, designs that were heavily pipelined for an older architecture and never re-optimized struggle to meet timing. <click> Well, what do these have in common? Yeah, they just need to optimize their code some. <click> But in general, I find that most customers realize that every new FPGA requires some work on their part to fully utilize all of the architectural advantages of a new product family. In this case, the 6-input LUT and the dedicated IP will need to be targeted.

19-- Now for some common questions. <click> We often get asked “Why can’t I code how I want to?” <click> The answer is simply that synthesis tools and implementation tools cannot do as much of the work as customers would like. In the end, there will always be design work necessary to compensate for synthesis and implementation tools. This is especially important if you are going to try and get the most out of any FPGA product family. <click> Another common question is “Shouldn’t the tools be able to make my code optimal?” <click> Well, the bottom line is synthesis and implementation tools have limitations. They always will. This is exactly the reason we provided this content: so you can improve your code and get the most out of your Virtex-5 design.

20-- Some more common questions… <click> “The Virtex-5 FPGA should always be a speed grade faster than Virtex-4, right?” <click> No, this is not always true, especially if the design was heavily pipelined. <click> “This design easily fit in Virtex-4, and now it can’t fit in Virtex-5. What’s wrong?” <click> Remember to check your control sets. I am willing to bet you have too many. Check your use of resets and clock enables as well. You should also remember that each device family has a different set of dedicated hardware.
Did you use as much of the dedicated hardware as you could? Don’t forget to make sure that your cores also share the same control signals as your HDL. <click> “Why can’t the software just optimize my inverters across a partition?” <click> Remember the purpose of partitions is for the designer to control hierarchy and preserve logic. If any tool selectively removed this option, customers would scream at us. Now let’s take a minute for some review questions.

21-- (questions)

22-- Well, we tried to provide a detailed summary of the most important coding tips we provided in this module. (read slide)

23-- Well, there are lots of places to learn more about FPGAs, and they all start at support.xilinx.com. I want to point out that I referenced several White Papers that you should utilize. They all provide more information on optimizing your design and HDL coding for FPGAs. The third white paper, “Get your Priorities Right,” pertains to Spartan-3 and older Virtex devices, which are 4-input LUT based devices. The first two are specific to Virtex-5. I spoke about a User Guide for the DSP slice resources. That is UG193, Virtex-5 FPGA XtremeDSP Design Considerations. It provides insight into the slice construction and offers numerous suggestions for its use. To make sure you’re getting the most out of your Virtex-5 device, I recommend you check it out. I also mentioned the Constraints Guide, which is located on our web site as well. This guide covers all of Xilinx’s supported constraints, including XST’s support of timing constraints with the XCF file. I mentioned this guide when discussing how to assign specific registers to IOBs. We call that a mapping constraint. Have a look at the constraints documentation, but you might get more out of it after attending the Fundamentals of FPGA Design and Designing for Performance courses.

24-- Thank you for listening to the Virtex-5 FPGA Coding Techniques RELs.
There is still more to learn about FPGA design, so we encourage you to attend the Fundamentals of FPGA Design course followed by the Designing for Performance course. If you would like to see what other courses we offer, or what other free RELs are available, click on the icon you see here. But whatever you do, please take a second and let us know what you thought of this REL. Just click on this icon and tell us what you think. My name is Frank Nelson. You have been listening to the Virtex-5 FPGA Coding Techniques, Part 2 REL. Thanks for listening.

25-- (nothing said)