Basic FPGA Architecture (Virtex-6), Slice and I/O Resources
REL script V13.1

1—Basic FPGA Architecture (Virtex-6)
Hello and welcome to this recorded e-Learning on Basic FPGA Architecture for the Virtex-6 device family. My name is Frank Nelson, and I will be your instructor for this module. This module introduces the slice and I/O resources available in Virtex-6 FPGAs, and it limits its scope to those resources. Please note that corresponding Memory Resources and Clocking Resources recorded e-Learning modules are also available.

2—Objectives
The objectives of this module…<read slide>

3—Virtex-6 CLB
All Xilinx FPGAs contain essentially the same basic logic resources. Slices, which are grouped into configurable logic blocks or CLBs, contain combinatorial logic and register resources. As you can see from this picture, there are a lot of CLBs on this die. Each device family is offered with a range of CLB counts, and determining which density is right for your application often requires you to at least synthesize your design before choosing. Only with some experience can you properly estimate how many slice resources your design will need without synthesizing it first. The slice resources are the most essential and fundamental to an FPGA design. They include LUTs and registers. LUTs implement your combinatorial logic, and each has an associated register. The LUTs and registers are grouped into slices, which also have other resources. Besides LUTs and registers, slices have carry logic resources, which implement arithmetic functions at high speed thanks to a dedicated carry chain that runs vertically in columns from one slice to the next.

4—Routing
The implementation tools select an appropriate routing that may include different kinds of routing resources.
The routed solution is chosen to try to meet your timing needs. To make that possible, it is important that you add timing constraints with the ISE Design Suite so you can communicate your timing objectives to the implementation tools. This allows the software to choose the most intelligent routing solution to meet your needs. In fact, the placement and routing of logical resources is handled by PAR, the place and route utility. In Virtex-6 all routing, besides the carry logic resources, goes through the switch matrices. While understanding the interconnect structure is seldom needed, this diagram is meant to describe the “pipulation” between slices and CLBs. You see, between the CLBs there are switch matrices, each designed to connect to other switch matrices as well as to neighboring CLBs. The possible interconnects are called “pips”, or programmable interconnect points. Please note that proper use of global timing constraints, which are explained in the Essentials of FPGA Design course, is going to be required.

5—6-Input LUT with Dual Output
Now, combinatorial logic is stored in lookup tables, or LUTs. Lookup tables are also sometimes called function generators. The capacity of a LUT is limited by the number of inputs, not by the complexity of the function. So the delay through the LUT is constant, regardless of what logic function is being performed inside. Each LUT in a Virtex-6 slice is a 6-input LUT that can also be split into two 5-input LUTs. This allows the resource to be broken down when necessary to help assure maximum device utilization. Every customer must understand the input-limited nature of LUTs. Because many customers are used to building with ASICs, it is important to understand that the HDL code you write needs to be optimal. This is because ASICs are faster than FPGAs, and it is much easier to get good speed from an ASIC with suboptimal HDL coding styles.
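To make the lookup-table idea concrete, here is a minimal behavioral sketch in Python (not HDL, and not a Xilinx tool or API). The helper names `make_lut` and `lut_read` are made up for illustration. It assumes a LUT is nothing more than a precomputed truth table addressed by its inputs, which is why a simple function and a complicated one cost the same single lookup.

```python
from itertools import product

def make_lut(func, n_inputs=6):
    """Precompute the truth table of func: one stored bit per input combination."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]

def lut_read(table, *bits):
    """Look up one output bit; the first input is the most significant address bit."""
    index = 0
    for b in bits:
        index = (index << 1) | (b & 1)
    return table[index]

# Two functions of very different logical complexity...
and6 = make_lut(lambda a, b, c, d, e, f: a & b & c & d & e & f)
parity6 = make_lut(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)

# ...fill the same 64-entry table, and reading either one is a single lookup.
print(len(and6), len(parity6))
print(lut_read(and6, 1, 1, 1, 1, 1, 1), lut_read(parity6, 1, 0, 1, 1, 0, 0))
```

The table size depends only on the number of inputs (2 to the power 6 is 64 entries), which is the sense in which a LUT is limited by inputs rather than by complexity.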
The input-limited nature of LUTs is highlighted when you consider what you would get if your combinatorial logic had seven inputs. Because the LUT has six inputs, your synthesis tool would be required to add a second LUT in series with the first. This would then dramatically increase the delay associated with this path and cut the maximum frequency possible. Also keep in mind that if you are trying to design at the maximum speed quoted in the device data sheet, you have to limit all of your delay paths to one LUT, or one logic level. This is not typical, but some customers do it by using pipelining extensively and by minimizing their combinatorial logic. The Designing for Performance course discusses pipelining. Every customer who builds his or her first FPGA inevitably struggles to get optimum speed because they are not using optimal HDL code. I recommend you listen to our HDL Coding Techniques RELs so you can be as successful as possible. Xilinx also provides the Xilinx Synthesis and Simulation Design Guide at www.xilinx.com/support for your reference. It includes terrific coding examples and detailed recommendations.

6—FPGA Slice Resources
Each CLB contains two slices. From this diagram you can see that each slice has four LUTs, four flip-flops, and four flip-flops that can also be programmed as latches. We added the additional registers because the LUTs can be split into 5-input LUTs, and this should help you get more logic out of your FPGA. The carry chain is supported on four of the eight FFs. There are also dedicated muxes, which can be used to build larger multiplexers and save LUTs so your system can reach higher speed.

7—Wide Multiplexers
In this diagram, we see a number of muxes used to connect the LUTs together. Each slice contains two F7 muxes, each of which combines the outputs of two LUTs and can create an 8-to-1 multiplexer. Each slice also contains one F8 mux, which combines the outputs of the two F7 muxes and can thus make a 16-to-1 mux.
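The way the F7 and F8 muxes build up a wide multiplexer can be sketched behaviorally. This is a minimal Python model, not HDL: the function names are invented for illustration, and it assumes each 6-input LUT is used as a 4-to-1 mux (four data inputs plus two select inputs).

```python
def mux4_lut(data4, s1, s0):
    """One 6-input LUT used as a 4-to-1 mux: 4 data inputs + 2 select inputs."""
    return data4[(s1 << 1) | s0]

def f7_mux(a, b, sel):
    """Dedicated F7 mux: picks between two LUT outputs (gives 8-to-1 overall)."""
    return b if sel else a

def f8_mux(a, b, sel):
    """Dedicated F8 mux: picks between two F7 outputs (gives 16-to-1 overall)."""
    return b if sel else a

def mux16(data16, sel):
    """16-to-1 mux from the four LUTs, two F7 muxes, and one F8 mux of a slice."""
    s0, s1, s2, s3 = [(sel >> i) & 1 for i in range(4)]
    luts = [mux4_lut(data16[4 * i:4 * i + 4], s1, s0) for i in range(4)]
    low = f7_mux(luts[0], luts[1], s2)    # covers data16[0..7]
    high = f7_mux(luts[2], luts[3], s2)   # covers data16[8..15]
    return f8_mux(low, high, s3)

data = [1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(all(mux16(data, sel) == data[sel] for sel in range(16)))
```

In this model the two low select bits go to every LUT, the third select bit drives both F7 muxes, and the top select bit drives the F8 mux, so the whole 16-to-1 mux passes through only one LUT level.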
The mux outputs can drive out of the slice or connect to the available flip-flop/latch. These dedicated multiplexers are used to improve the speed of large multiplexers and save LUTs for other purposes. They improve the speed of the design because the routing between the multiplexers is dedicated, low fanout, and fixed between the logic resources. They are also faster than the equivalent logic built from LUTs alone. What is most important to remember is that each of the dedicated resources has the intended purpose of saving LUTs and improving the speed of the design. Also be aware that if you are going to code for a large multiplexer, you should code it with a CASE statement in your HDL. This is required for inference of the dedicated hardware.

8—Carry Logic
Carry logic is the dedicated hardware resource that improves the speed of arithmetic functions, such as adders, accumulators, subtractors, and comparators. From the diagram you can see that the left-hand slices are grouped into one carry chain. Please note that carry logic propagates the carry signal vertically upward. This is important to note because most designs use carry logic extensively. When you use it, the least significant bit should be placed at the bottom of the carry chain, and the designer should keep the bit ordering, with each less significant bit placed beneath the more significant bits.

9—Flip-Flops and Latches
Each slice has four flip-flop/latches and four flip-flops. The difference between them, besides the programmability as a latch, is that the flip-flop/latches (seen here labeled FF/L) can be driven by the LUT, the carry chain, or the wide muxes. Conversely, the flip-flops can only be driven by the O5 signal, which comes from the 6-input LUT when it is used as two 5-input LUTs, or by a separate input to the slice. However, these limitations are not a big deal.
In the end, if you are building lots of functions with fewer inputs that share inputs (as is required to have the 6-input LUT split into two 5-input LUTs), you should end up with the FFs being used quite a bit.

10—CLB Control Signals
All flip-flops and flip-flop/latches share the same control signals. The set of control signals associated with a FF is important, since only FFs that share the same control signals can be grouped into the same slice. This means that if you don’t manage your control signals and try to reduce the number of control signals in your design, you might have trouble getting high device utilization. The set and reset signal can be configured as synchronous or asynchronous. For synchronous design purposes, Xilinx recommends that, unless absolutely necessary, you design with a synchronous set or reset. We discuss good synchronous design practices in the Essentials of FPGA Design course. To help designers with this, we have made a series of Virtex-6 and Spartan-6 HDL Coding Style RELs. They discuss the impact of these choices and provide many good design practices that will help you get high performance and high device utilization. So check those out.

11—SLICEM as Distributed RAM
Many, but not all, LUTs in the FPGA can be configured as distributed RAM. This is a small 64-bit memory that can act as a single-, dual-, or quad-port memory. Each port has independent address inputs, with a synchronous write and an asynchronous read. There are four configurations, as you can see. Most customers use these resources for DSP applications. Sometimes customers try to build larger memories out of these. It is possible, but the extra logic needed to connect them together makes their performance prohibitive, so you may want to consider building a larger memory with the dedicated Block RAM instead. In general, the customers who build with distributed RAM are using the Core Generator or System Generator to build a DSP application.
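The synchronous-write, asynchronous-read behavior of distributed RAM can be shown with a small behavioral sketch. This is Python, not HDL; the class name `DistributedRam64` and its method names are hypothetical, and it models only a 64 x 1 memory with one write port and an independently addressed read.

```python
class DistributedRam64:
    """Behavioral model of one LUT used as a 64 x 1 distributed RAM."""

    DEPTH = 64

    def __init__(self):
        self.mem = [0] * self.DEPTH

    def read(self, addr):
        """Asynchronous read: the output follows the address with no clock."""
        return self.mem[addr % self.DEPTH]

    def clock(self, write_enable, addr, data_in):
        """Synchronous write: the memory only changes on a clock edge."""
        if write_enable:
            self.mem[addr % self.DEPTH] = data_in & 1

ram = DistributedRam64()
ram.clock(write_enable=1, addr=5, data_in=1)   # write happens on the clock edge
print(ram.read(5), ram.read(6))                # reads need no clock at all
ram.clock(write_enable=0, addr=6, data_in=1)   # write enable low: nothing stored
print(ram.read(6))
```

Because `read` takes its own address, the same model also hints at the dual-port case, where one port writes while another reads a different location.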
12—SLICEM as 32-bit Shift Register
Many LUTs can be configured as a 32-bit shift register as well. This is used primarily as a delay element for balancing pipelined applications. But it can also be used for variable-length shift registers, synchronous FIFOs, CAMs, or pattern generators. It is cascadable up to 128 bits in length. The most challenging thing is that most customers don’t realize that it is not loadable, has no reset, and only offers serial-in, serial-out capability. So if you try to infer it with any of these behaviors, your synthesis tool will infer regular registers instead. And that will be a big waste. The benefit of this resource is that it can effectively do the work of 32 registers with a single LUT. In pipelined applications that is a huge savings.

13—Shift Register LUT Example
This is an example of the use of a shift register LUT. The SRL can be used as a programmable delay element (or No Operation, NOP). In this example, you see a 64-bit bus being processed through operations A, B, and C. A has a delay of 8 cycles, B has a delay of 12 cycles, and C has a delay of 3 cycles. Because the processed data is also grouped at its output with a multiplexer, these data paths must be synchronized so that the appropriate data is compared at the multiplexer. A and B together take 20 cycles, so the C path needs 17 more cycles to match. To do this, the SRL can be used to delay the C operation by 17 clock cycles. If we were to do this with registers, it would require 1,088 registers. If we use the SRL functionality instead, we only need 64 LUTs, each programmed for 17 clock cycles of delay. So this example uses 64 LUTs to replace 1,088 flip-flops and the associated routing resources (pretty good justification for using the SRL, huh?). Because there are so many registers in FPGAs, pipelining is an effective way to increase design performance. Because pipelines can sometimes become unbalanced when too much logic must be generated, it is necessary to delay some of the signals.
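The numbers in this example can be checked with a short sketch. The Python below is a behavioral model, not HDL. The class name `Srl32` is hypothetical, and the arithmetic assumes operations A and B sit in series on one path while C sits on the parallel path, which is what the 17-cycle figure implies.

```python
from collections import deque
from math import ceil

class Srl32:
    """Behavioral model of one LUT used as a serial-in, serial-out shift register."""

    DEPTH = 32

    def __init__(self, delay):
        if not 1 <= delay <= self.DEPTH:
            raise ValueError("one SRL supports a delay of 1 to 32 clocks")
        self.delay = delay
        self.bits = deque([0] * self.DEPTH, maxlen=self.DEPTH)

    def clock(self, d):
        """Return the bit that entered `delay` clocks ago, then shift d in."""
        out = self.bits[self.delay - 1]
        self.bits.appendleft(d & 1)
        return out

# Delay needed to balance the paths (assumption: A and B are in series).
bus_width = 64
delay_needed = (8 + 12) - 3                                # 17 cycles
registers = bus_width * delay_needed                       # one FF per bit per cycle
luts = bus_width * ceil(delay_needed / Srl32.DEPTH)        # one SRL per bit
print(delay_needed, registers, luts)

# A single pulse comes out of the SRL exactly 17 clocks after it went in.
srl = Srl32(delay_needed)
outputs = [srl.clock(1 if t == 0 else 0) for t in range(40)]
print(outputs.index(1))
```

Note that the model has no load and no reset, matching the restrictions described for inference; there is only a serial input and a serial output.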
One of the best uses of the SRL is to add delay to balance pipelines.

14—Two Types of Slices
So now that we know that each LUT has three different configurations (logic, distributed RAM, and the SRL), we can discuss the two types of slices in Virtex-6. Most customers don’t use as many LUTs for distributed memory or the SRL as you might think, so this segmentation of functionality does not usually have a performance impact, but it saves transistors for other purposes that will be more beneficial to the user. Also keep in mind that the implementation tools will lay your design out onto the die, so understanding which slice is an M or an L is not important. All we wanted to do was explain that they are different and show you how they are different.

15—I/O Bank Structure
The input/output blocks, or IOBs, interface between the FPGA and the outside world. IOBs are grouped into I/O banks located in columns throughout the device. The IOBs contain registers and some specialized resources. But the main purpose of the IOBs is to clock data into and out of the FPGA. In fact, using registers to clock data in and out of the device is the fastest way to move data. Besides registers, the IOBs contain interface logic that is designed to translate between the internal voltage domain of the FPGA and whatever I/O standard you are using. This saves you from having to add all the necessary interface logic outside of the FPGA. This requires that each bank share a common supply for inputs (Vref) and outputs (Vcco). These voltages vary by I/O standard, and Virtex-6 supports many I/O standards. Fortunately, many of these standards can share the same power supply. Virtex-6 has 9 to 30 I/O banks, depending on the device density, and each I/O bank has 40 I/O pins.

16—I/O Versatility
The versatile SelectIO™ resource was first introduced with the Virtex™ family of FPGAs several years ago. It has been enhanced with every new product family since.
I/O standards have been added and selected to support what our customers require. Virtex-6 FPGAs support over 40 I/O standards, including those you see listed here. Check the data sheet to see if your I/O standard is supported. Note that there are memory-specific standards of varied classes, as well as differential standards that require two I/O pins to receive the signal. The differential signaling standards include LVDS and LVPECL. The single-ended I/O standards include LVCMOS, SSTL, and HSTL. The SelectIO™ feature allows direct connection of the FPGA pin to external signals of varied voltages and thresholds. This optimizes the speed/noise tradeoff and saves you from having to place interface components onto your board. All of the dedicated interface logic already exists and can be selected on a pin-by-pin basis. However, note that to mix different I/O standards in the same I/O bank, you must be aware of the I/O banking rules. These rules apply because the input and output pins in each I/O bank share power pins. Specifically, Vref powers the input circuitry and Vcco powers the output circuitry. So, whichever I/O standards you group into each I/O bank, make certain that the input pins have the same Vref and the output pins have the same Vcco. This is easily done with the Pin Assignment utility in PlanAhead.

17—I/O Electrical Resources
This slide is just trying to show you how differential pins work in Xilinx FPGAs. Each pin in the device can be configured as single-ended or differential. You should keep in mind that every pin is predefined as either a P or an N pin (as shown in this diagram). And for every pin there is a corresponding partner pin that is pre-assigned. So if I am using a differential I/O standard, I have to plan to use my P and N pairs accordingly. So I don’t have complete flexibility to plan my design’s pin layout.
If a single I/O pin is configured as single-ended, then the corresponding P or N partner is free to be a single I/O pin of a compatible I/O standard. And in this case, the interconnection we see in the diagram between the two pins would not apply. Note that these limitations have an impact on your pin layout, so be sure to plan the layout of your differential transmitter pins properly and early with the Pin Planning functionality in the PlanAhead software.

18—IOB Element
The primary use of the IOBs is for registering data. In this slide, you see that there are six registers associated with the IOB to support double-data-rate applications. As you can see, there are two registers for input, two for output, and two for 3-state enable. However, the IOB can still be used for single-data-rate applications just as easily. The input and output registers have separate clocks and clock enables; the set and reset signals, however, must be shared. To clock the DDR registers, remember that you can use any pair of the PLL outputs that are 180 degrees out of phase (such as the CLK90 and CLK270 outputs, or likewise CLK2X and CLK2X180, or CLKFX and CLKFX180).

19—I/O Logical Resources
Besides the register and SelectIO resources, there is also SERDES functionality built into each IOB. This includes both parallel-to-serial and serial-to-parallel conversion. This is programmable functionality that can decode a wide range of signals. The OSERDES/ISERDES combination is probably one of the most useful and radical developments. Many issues that you would traditionally address by using the FPGA fabric are handled within these blocks. Tasks such as training pattern recognition, clock-to-data alignment, and crossing of clock domains between high-speed serial I/O and slower parallel clock domains can now be accomplished with the dedicated logic included in the SERDES resources.
When we couple these resources with the I/O clock resources and regional clock resources, this feature becomes very significant and well suited for source-synchronous applications. We will discuss the clocking resources in the Basic FPGA Architecture: Memory and Clocking Resources module. The I/O SERDES capability includes input and output SERDES resources. There is an OSERDES parallel-to-serial converter for both OQ and TQ. It is arranged as a master and slave IOB pair to allow for differential inputs or larger single-ended bit widths. So not only can you complete 6-to-1 parallel and serial conversion, but when the IOBs are paired appropriately, you can do wider parallel and serial conversion. For this to be done, an appropriate master/slave pair must be found on the die. You cannot just pair any two IOBs and make them a master/slave pair. The SERDES functionality also has a user-selectable, fine-grained programmable delay element for synchronization.

20—Flip-Flop Details
Each FF has three control ports: CK, CE, and a Set/Reset port. You have to remember that the CE and Set/Reset ports are active high. This is important when coding your HDL for the FFs. Likewise, note that there are not separate set and reset ports, just one port. And this port can be asynchronous or synchronous. This is important because your synthesis tool should be able to map the set or reset to a LUT input, if it is synchronous. This gives the tools the flexibility they need to pack the most logic into the FPGA and get high performance, since all eight FFs share the same control signals. So our recommendations are to design for active-high control signals and build with a synchronous set or reset.

21—Design Tips
Besides building synchronously, you should also try to use the FPGA’s global reset functionality whenever possible. This will allow you to perform a global reset on your system using a dedicated routing resource.
This requires the instantiation of the Startup component from the Xilinx Unified Library and can save you from routing a high-fanout signal to all of your FPGA resources. And don’t forget to keep your control signals active high so you don’t need a local inverter mapped to each control signal input. Inverters are not dedicated in the silicon, so an active-low control signal will force the synthesis tool to waste a LUT input.

22—Software
The reason we have spent some time discussing the limitations of control signals is that they can limit your ability to get high device utilization and high speed out of your FPGA. This matters because the implementation tools can only pack related logic together in the same slice, since there are a limited number of inputs to each slice. This is important with registers, since all the FFs share the same control signals. Likewise, if the tools are going to use the 5-input LUT configuration, they have to be able to find another combinatorial function with the same five inputs to group into the same slice. You see, it’s all about grouping related logic into the same slice. If the tools can pack related logic together, they do not end up wasting LUTs and registers.

23—Control Signals
As I mentioned earlier, your synthesis tool can map CE and synchronous set or reset functionality to a LUT input. This is very powerful for being able to group registers in the same slice, since all the FFs share the same control signals. In this example, we see how a CE and a synchronous set or reset are mapped to a LUT input. This is very helpful, but it is a job only your synthesis tool can do. And if you build with an asynchronous set or reset, it cannot happen. In that case, the asynchronous set or reset would be forced onto a register’s control port. Likewise, if you build with an active-low control signal, the LUT will be required to invert the control signal.

24—Control Set Reduction
And as I mentioned, FFs with different control sets cannot be packed into the same slice.
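The cost of many control sets can be estimated with a rough sketch. The Python below is a simplification for illustration only: the function name `slices_needed` is hypothetical, and it assumes that a control set is the (clock, clock enable, set/reset) triple, that a slice holds eight FFs, and that only FF packing matters (it ignores LUTs, placement, and anything else the real tools consider).

```python
from collections import Counter
from math import ceil

FFS_PER_SLICE = 8  # assumption taken from the narration: eight FFs per slice

def slices_needed(flip_flops):
    """Lower bound on slices for a list of (clk, ce, sr) control sets, one per FF.

    FFs with different control sets cannot share a slice, so each distinct
    control set is rounded up to whole slices on its own.
    """
    per_control_set = Counter(flip_flops)
    return sum(ceil(count / FFS_PER_SLICE) for count in per_control_set.values())

# 16 FFs that all share one control set fit in two slices.
shared = [("clk", "ce", "rst")] * 16

# 16 FFs spread over eight different clock enables need eight slices,
# each only one-quarter full.
scattered = [("clk", f"ce{i}", "rst") for i in range(8) for _ in range(2)]

print(slices_needed(shared), slices_needed(scattered))
```

In this model, moving a CE or synchronous reset onto a LUT input amounts to replacing it with a common value in the tuple, which is how control set reduction lets more FFs share a slice.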
In the example on this slide, three FFs are synthesized so that some of their control signals are moved to LUT inputs, which lets these registers be packed into the same slice. This is done in XST by using the Reduce Control Set synthesis option. It is not done by default (hopefully it will be eventually), but what is most important is that you code for a synchronous reset and code for a CE so that you can take advantage of this functionality in your FPGA.

25—Using the Slice Resources
There are three ways to get the FPGA resources you need in your design. You can infer logic with your synthesis tool, instantiate primitives from the Xilinx Unified Library, or instantiate an optimized component from the Core Generator. One of the biggest challenges for new users is to make certain that they are inferring the appropriate resources. One of the best ways to check that you are getting the resources inferred is to use the schematic viewer in your synthesis tool. All of the high-end synthesis tools have schematic viewers. Viewers graphically display a gate-level version of your netlist and also a technology view of your netlist. While the gate-level view is worthwhile for checking the Boolean optimization produced by your synthesis tool, the technology view helps you understand which Xilinx primitives were inferred. A primitive is an element written to the netlist that represents a particular FPGA resource, such as a LUT, SRL, or distributed RAM. Xilinx provides a library of primitives, called the Xilinx Unified Library, to all of our synthesis partners, so they can create netlists that are optimal for their customers. This same library is available to you any time you want to instantiate a primitive in your FPGA. The Core Generator and Architecture Wizard make larger optimized components available to you.
While some cores have a charge associated with them, most do not. The cores range from simple components to large, highly complex IP. The Architecture Wizard is used most commonly to customize DCM and PLL components.

26—Inference
All primary slice resources can be inferred by XST and Synplify. This includes LUTs, FFs, SRLs, muxes, and carry logic. However, you must be sure to code for the resources properly. This means that if you define a FF with an incompatible combination of control signals, you will get something slightly different (probably with a LUT input used for a control signal). Likewise, the SRL is not loadable, has no reset, and only supports serial functionality. So if you code for a parallel read, you will get a pure register implementation, and this may not be what you expected. Similarly, muxes should use a CASE statement, and carry logic should use the proper arithmetic operator. Don’t code the XOR functionality yourself and expect it to map to carry logic. In general, inference is best and is preferred by our customers. It assures portability to another technology and is reusable. However, most customers will have to do some instantiation.

27—Instantiation
A list of Xilinx primitives and their behavior can be found in the Libraries Guide. This is available at support.xilinx.com or from the Help menu in the ISE software. It describes all of the ports and attributes that can be customized. There is also a complete description of the XST attributes on the software manuals web page. This is always a useful resource for inferring logic.

28—Architecture Wizard
The Core Generator and Architecture Wizard help you instantiate a high-level component. These cores are optimized and make it easier to design with all of the detailed functionality of things like the DCM or PLL. They also provide access to some very powerful design tools that can create common sophisticated components at no charge.
29—Summary
In summary, <read slide>

30—Where Can I Learn More?
Well, there is lots of useful information about inference and synthesis attributes in the Synthesis and Simulation Design Guide. This is available from the ISE software through its Help menu or directly from support.xilinx.com. The XST User Guide provides very helpful coding recommendations that all Xilinx users should live by. It also includes a detailed description of all of XST’s synthesis options. This information can also be found from the Help menu. If you would like to see what other courses we offer, or what other free RELs are available, go to the Xilinx Education link you see here. I would also like to mention again that there are architecture modules available that discuss the basics of Xilinx’s newest devices. You may find these useful, especially if you want to learn more about the device differences. But whatever you do, please take a second and let us know what you thought of this REL. Just click on the icon on the next page and tell us what you think. My name is Frank Nelson. You have been listening to the Basic FPGA Architecture module for Virtex-6. Thanks for listening, and thanks for your business.

31—Trademark Information
<nothing said>