Basic Slice and I/O Resources, REL script

Basic FPGA Architecture (Virtex-6), Slice and I/O Resources
REL script
V13.1
1—Basic FPGA Architecture (Virtex-6)
Hello and welcome to this recorded e-Learning on Basic FPGA Architecture for
the Virtex-6 device family. My name is Frank Nelson, I will be your instructor for
this module. This module introduces slice and IO resources available in Virtex-6
FPGAs. This Basic FPGA Architecture module limits its scope to discussing the
slice and I/O resources. Please note that there are also corresponding Memory
Resources and Clocking Resources Recorded e-Learning modules available as
well.
2—Objectives
The objectives of this module…<read slide>
3—Virtex-6 CLB
All Xilinx FPGAs contain essentially the same basic logic resources. Slices, which are
grouped into configurable logic blocks or CLBs, contain combinatorial logic and
register resources.
As you can see from this picture, there are a lot of CLBs on this die. While each
device family is offered with a different number of CLBs, determining which
density is right for your application often requires you to at least synthesize your
design before choosing an appropriate density of FPGA. Only with some
experience can you properly estimate how many slice resources will be necessary
for your design without at least synthesizing the design first.
The slice resources are the most essential and fundamental to an FPGA design.
They include LUTs and registers. LUTs implement your combinatorial
logic. Each LUT has an associated register. The LUTs and registers are
grouped into slices which also have other resources.
Besides LUTs and registers, slices also have carry logic resources which are
designed to implement arithmetic logic functions with high speed performance
due to their dedicated carry chain which runs vertically in columns, from one
slice to another.
4—Routing
The implementation tools select an appropriate routing that may include different
kinds of routing resources. The solution that is routed is designed to try and
meet your timing needs.
To do this, it is important that you add timing constraints with the ISE Design
Suite so you can communicate your timing objectives to the implementation tools.
This will allow the software to choose the most intelligent routing solution to
meet your needs. In fact, the mapping and placement of logical resources is
handled by PAR, the placement and routing utility.
In Virtex-6 all routing, besides the carry logic resources, is routed through the
switch matrices. While understanding the interconnect structure is seldom
needed, this diagram is meant to describe the “pipulation” between slices and
CLBs. You see, next to each CLB is a switch matrix which is designed to
connect to other switch matrices as well as neighboring CLBs. The possible
interconnects are called “pips”, or programmable interconnect points.
Please note that proper use of global timing constraints, which are explained in
the Essentials of the FPGA Design course, is going to be required.
5—6-Input LUT with Dual Output
Now combinatorial logic is stored in lookup tables or LUTs. Lookup tables are
also sometimes called Function Generators.
The capacity of a LUT is limited by the number of inputs, not by the complexity.
So delay through the LUT is constant, regardless of what logic function is being
performed inside.
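To make the "lookup" idea concrete, here is a minimal Python sketch of a LUT as a stored truth table. This is a software model for illustration only, not Xilinx code; the function names make_lut_init and lut_read are made up for this example. Note that evaluating the LUT is always one indexed read, whatever function was stored.

```python
def make_lut_init(func, num_inputs=6):
    """Build the truth-table bits for any Boolean function of num_inputs bits."""
    init = 0
    for address in range(1 << num_inputs):
        # Input i is bit i of the address (an assumed but conventional ordering).
        bits = [(address >> i) & 1 for i in range(num_inputs)]
        if func(*bits):
            init |= 1 << address
    return init


def lut_read(init, *inputs):
    """Evaluate the LUT: the inputs simply form an address into the table."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (init >> address) & 1


# A simple function and a "complex" one cost exactly the same: one 64-bit table.
and6 = make_lut_init(lambda a, b, c, d, e, f: a & b & c & d & e & f)
parity6 = make_lut_init(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
```

Both and6 and parity6 are just 64-bit constants, which is why the delay does not depend on the logic function.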
Each LUT in a Virtex-6 slice is a 6-input LUT that can also be broken into two
5-input LUTs that share the same inputs. This allows the resource to be broken
down if necessary to help assure maximum device utilization.
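As a rough illustration of the dual-output arrangement, the Python sketch below models one 64-bit table with two outputs, here called o6 and o5. It assumes the lower 32 bits feed the 5-input output and that the sixth input is tied high when the LUT is split; the function names lut6_2 and pack_two_lut5 are hypothetical helpers for this example, so check the Libraries Guide for the exact primitive behavior.

```python
def lut6_2(init, i0, i1, i2, i3, i4, i5):
    """Return (o6, o5) for a 64-bit table whose two outputs share inputs i0..i4."""
    addr5 = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3) | (i4 << 4)
    o5 = (init >> addr5) & 1                 # reads the lower 32 bits only
    o6 = (init >> (addr5 | (i5 << 5))) & 1   # reads the full 64-bit table
    return o6, o5


def pack_two_lut5(init_for_o6, init_for_o5):
    """Pack two 32-bit tables into one 64-bit table (valid when i5 is tied to 1)."""
    return (init_for_o6 << 32) | init_for_o5


# Two different 5-input functions of the same five signals in a single LUT:
XOR5 = 0x96696996   # 5-input parity
AND5 = 0x80000000   # 5-input AND
shared = pack_two_lut5(XOR5, AND5)
```

With i5 held at 1, o6 behaves as the parity function and o5 as the AND function of the same five inputs.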
Every customer must understand the input limiting nature of LUTs. Because
many customers are used to building with ASICs, it is important to understand
that the HDL code you write needs to be optimal. This is because ASICs are
faster than FPGAs and it's much easier to get good speed from an ASIC with suboptimal HDL coding styles. This is highlighted when you consider what you would
get if your combinatorial logic had seven inputs. Because the LUT has six inputs,
your synthesis tool would be required to add a second LUT in series with the first.
This would then dramatically increase the delay associated with this path and cut
the maximum frequency possible.
Also keep in mind that if you're trying to design at the maximum speed quoted in
the device data sheet, you have to remember to limit all of your delay paths to
one LUT or one logic level. This is not typical, but some customers do it by
extensively using pipelining and by minimizing their combinatorial logic. The
Designing for Performance course discusses pipelining. Every customer that
builds his or her first FPGA inevitably struggles to get optimum speed because
they're not using optimal HDL code. I recommend you listen to our HDL Coding
Techniques RELs so you can be as successful as possible. Xilinx also provides the Xilinx
Synthesis and Simulation Design Guide at www.xilinx.com/support for your
reference. It includes terrific coding examples and detailed recommendations.
6—FPGA Slice Resources
Each CLB contains two slices. From this diagram you can see that each slice has
four LUTs, four flip-flops, and four flip-flops that can also be programmed as
latches. We added the additional registers because the LUTs can be broken into
5-input LUTs, and this should help you get more logic into your FPGA.
The carry chain is supported on four of the eight FFs.
There are also dedicated muxes which can be used to build larger multiplexers
and save LUTs so your system can get higher speed.
7—Wide Multiplexers
In this diagram, we see a number of muxes used to connect the LUTs together.
Each slice contains two F7 muxes, each of which combines the outputs of two
LUTs and can create an 8-to-1 multiplexer.
Each slice also contains one F8 mux that combines the outputs of the F7 muxes
and can thus make a 16-to-1 mux. The mux outputs can drive out of the slice or
connect to the available flip-flop/latch.
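The way the mux sizes add up can be sketched in a few lines of Python. This is only a behavioral model with made-up function names: a 4-to-1 mux needs four data inputs plus two selects, which is exactly six LUT inputs, and each dedicated mux level then adds one more select bit.

```python
def mux4_lut(data, sel):
    """4-to-1 mux: four data bits plus two select bits fit in one 6-input LUT."""
    return data[sel & 0b11]


def mux8_f7(data, sel):
    """8-to-1 mux: two LUT-based 4-to-1 muxes combined by an F7 mux on select bit 2."""
    low = mux4_lut(data[0:4], sel)
    high = mux4_lut(data[4:8], sel)
    return high if (sel >> 2) & 1 else low


def mux16_f8(data, sel):
    """16-to-1 mux: two F7 outputs combined by the F8 mux on select bit 3."""
    low = mux8_f7(data[0:8], sel)
    high = mux8_f7(data[8:16], sel)
    return high if (sel >> 3) & 1 else low
```

A 16-to-1 mux built this way uses four LUTs, two F7 muxes, and one F8 mux, all within one slice.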
These dedicated multiplexers are used to improve the speed of large multiplexers
and save LUTs for other purposes. The reason that these multiplexers improve
the speed of the design is that there is dedicated routing between the
multiplexers that is low fanout and fixed between the logic resources. It is also
fast because the dedicated multiplexers are faster than building the equivalent
logic just with LUTs.
What is most important to remember is that each of the dedicated resources has
an intended purpose of saving LUTs and improving the speed of the design. Also
be aware that if you are going to code for a large multiplexer, you should code it
with a CASE statement in your HDL. This is required for inference of the dedicated
hardware.
8—Carry Logic
Carry logic is the dedicated hardware resource that improves the speed of
arithmetic functions, such as adders, accumulators, subtractors, and comparators.
From the diagram you can see that the left hand slices are grouped into one
carry chain. Please note that carry logic propagates the carry signal vertically
upward. This is important to note because most designs use carry logic
extensively. When you use it, the least significant bit should be placed at the
bottom of the carry chain.
Likewise, whenever carry logic is used, the designer should follow this bit
ordering, with each less significant bit placed beneath the more significant
bits.
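The following Python sketch shows why bit order matters. It models a carry chain as a LUT-computed propagate signal per bit, a dedicated XOR for the sum, and a dedicated mux for the carry, with index 0 at the bottom of the chain. It is a behavioral illustration under those assumptions, not a description of the exact silicon, and the helper names are invented here.

```python
def to_bits(value, width):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from a least-significant-bit-first list."""
    return sum(bit << i for i, bit in enumerate(bits))


def carry_chain_add(a_bits, b_bits, carry_in=0):
    """Add two equal-width operands; the carry ripples upward from index 0."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        propagate = a ^ b                    # computed in the LUT
        sum_bits.append(propagate ^ carry)   # dedicated XOR gives the sum bit
        carry = carry if propagate else a    # dedicated mux gives the next carry
    return sum_bits, carry


total_bits, carry_out = carry_chain_add(to_bits(200, 8), to_bits(100, 8))
```

Because each stage only passes its carry to the stage above it, the least significant bit has to sit at the bottom for the result to be correct.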
9—Flip-Flops and Latches
Each slice has four flip-flop/latches and four flip-flops. The difference between
them besides their programmability as a latch is that the flip-flop/latches (seen
here labeled FF/L) can be driven by the LUT, carry chain or the wide muxes.
Conversely, the flip-flop can only be driven by an O5 input which can come from
the 6-input LUT when it is used as two 5-input LUTs or from a separate input to
the slice.
However, these limitations are seldom a problem in practice.
In the end, if you are building lots of lower input functions that share inputs (as
is required to have the 6-input LUT broken into two 5-input LUTs) you should
end up with the FFs being used quite a bit.
10—CLB Control Signals
All flip-flops and flip-flop/latches share the same control signals. The group of
control signals associated with a FF is important since only FFs that share the
same control signals can be grouped into the same slice. This means that if you
don’t manage the use of your control signals and try to reduce the number of
control signals your design has, then you might have trouble getting high device
utilization.
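A quick back-of-the-envelope model shows how control signals affect utilization. The Python sketch below is a simplification: it treats each distinct (clock, clock enable, set/reset) combination as one group and assumes eight registers per slice, as described earlier. The function name and signal names are hypothetical, and real packing depends on many other factors.

```python
from collections import Counter
from math import ceil

FFS_PER_SLICE = 8  # four flip-flops plus four flip-flop/latches per slice


def min_slices_for_registers(registers):
    """Lower bound on slices needed when FFs can only share a slice if their
    (clock, clock_enable, set_reset) signals are identical.

    registers: iterable of (clock, clock_enable, set_reset) tuples, one per FF.
    """
    groups = Counter(registers)
    return sum(ceil(count / FFS_PER_SLICE) for count in groups.values())


# 16 registers sharing one group of control signals
shared = [("clk", "ce", "rst")] * 16
# 16 registers that each use their own clock enable
scattered = [("clk", f"ce{i}", "rst") for i in range(16)]
```

In this model the shared case needs at least 2 slices, while the scattered case needs at least 16, even though both designs contain the same number of registers.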
The set and reset signal can be configured as synchronous or asynchronous. For
synchronous design purposes, Xilinx recommends that unless absolutely
necessary you design with a synchronous set or reset. We discuss good
synchronous design practices in the Essentials of FPGA Design course.
To help designers with this we have made a series of Virtex-6 and Spartan-6 HDL
Coding Style RELs. They discuss the impact of this and provide you with many
good design practices that will help you get high performance and high device
utilization. So check those out.
11—SLICEM as Distributed RAM
Many, but not all, LUTs in the FPGA can be configured as distributed RAM. This
is a small 64-bit memory that can be configured as a single-, dual-, or quad-port memory.
Each port has independent address inputs and has a synchronous write and an
asynchronous read.
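The difference between the synchronous write and the asynchronous read can be modeled in a few lines of Python. This is a behavioral sketch of a single 64 x 1 memory with one port; the class and method names are invented for illustration.

```python
class DistributedRam64x1:
    """Behavioral model of one LUT used as a 64 x 1 memory."""

    def __init__(self):
        self.cells = [0] * 64

    def read(self, address):
        """Asynchronous read: the output follows the address, no clock needed."""
        return self.cells[address & 0x3F]

    def clock_edge(self, write_enable, address, data_in):
        """Synchronous write: contents change only on a clock edge with WE high."""
        if write_enable:
            self.cells[address & 0x3F] = data_in & 1


ram = DistributedRam64x1()
ram.clock_edge(write_enable=1, address=5, data_in=1)   # stored on the edge
ram.clock_edge(write_enable=0, address=6, data_in=1)   # ignored, WE is low
```

After these two edges, a read of address 5 returns 1 immediately, while address 6 still holds 0.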
There are four configurations as you can see. Most customers use these
resources for DSP applications. Sometimes customers try to build larger
memories out of these. It is possible, but the extra logic needed to connect them
together makes their performance prohibitive, so you may want to consider
building a larger memory with the dedicated Block RAM instead. In general,
those customers that build with this are using the Core Generator or System
Generator to build a DSP application.
12—SLICEM as 32-bit Shift Register
Many LUTs can be configured as a 32-bit shift register, as well. This is used
primarily as a pipelined delay element for balancing pipelined applications. But it
can also be used for variable length shift registers, synchronous FIFOs, CAMs, or
for pattern generators.
It is cascadable up to 128 bits in length.
The most challenging thing with this is that most customers don’t realize that it is
not loadable, has no reset, and only offers serial-in, serial-out capability. So if
you try to infer it with a parallel load, a reset, or a parallel output, your synthesis
tool will infer regular registers. And that will be a big waste.
The benefit of this resource is that it can effectively provide the work of 32
registers with a single LUT. In pipelined applications that is a huge savings.
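Here is a small behavioral model of the shift register LUT in Python, written to mirror the restrictions just listed: serial in, one selectable serial out, no reset, and no parallel load. The class name Srl32 and its methods are hypothetical, and the assumption that tap address n gives n + 1 cycles of delay should be checked against the Libraries Guide.

```python
class Srl32:
    """Behavioral model of one LUT used as a 32-bit shift register."""

    def __init__(self):
        # Power-up contents; the model deliberately has no reset or load method.
        self.bits = [0] * 32

    def clock_edge(self, serial_in, clock_enable=1):
        """Shift one position on a clock edge when the clock enable is high."""
        if clock_enable:
            self.bits = [serial_in & 1] + self.bits[:-1]

    def output(self, address):
        """Select one tap; address n is assumed to give n + 1 cycles of delay."""
        return self.bits[address & 0x1F]


srl = Srl32()
pattern = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
observed = []
for bit in pattern:
    srl.clock_edge(bit)
    observed.append(srl.output(3))   # tap 3, so a 4-cycle delay in this model
```

Changing the address changes the delay without touching the stored data, which is what makes variable-length shift registers possible.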
13—Shift Register LUT Example
This is an example of the use of a shift register LUT.
The SRL can be used as a programmable delay element (or No Operation, NOP).
In this example, you see a 64-bit bus being processed through operations A, B,
and C. A has a delay of 8 cycles, B has a delay of 12 cycles, and C has a delay
of 3 cycles. Because the processed data is also grouped at the output with a
multiplexer, these data paths must be synchronized so that corresponding data
arrives at the multiplexer together. With A and B in series on one path (8 + 12 =
20 cycles) and C on the other (3 cycles), the SRL can be used to delay the C
operation by the difference, 17 clock cycles.
If we were to do this with registers, it would require 1,088 registers (64 bits
times 17 cycles). If we use the SRL functionality instead, we only need 64 LUTs,
each programmed for 17 clock cycles of delay. So, this example uses 64 LUTs to
replace 1,088 flip-flops and the associated routing resources (pretty good
justification for using the SRL, huh?).
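The arithmetic behind these numbers can be checked with a short Python sketch. It assumes, as the numbers in the example imply, that operations A and B sit in series on one path while C sits alone on the other; the function and variable names are made up for illustration.

```python
def balance_delay(path_latencies):
    """Return the extra delay each path needs to match the slowest path."""
    longest = max(path_latencies.values())
    return {name: longest - latency for name, latency in path_latencies.items()}


BUS_WIDTH = 64
SRL_DEPTH = 32   # one LUT can delay one bit by up to 32 cycles

paths = {"A_then_B": 8 + 12, "C": 3}
padding = balance_delay(paths)

# One flip-flop per bit per cycle of delay if built from registers
registers_needed = BUS_WIDTH * padding["C"]
# One SRL per bit for every 32 cycles (or part thereof) of delay
srl_luts_needed = BUS_WIDTH * -(-padding["C"] // SRL_DEPTH)
```

With these inputs, path C needs 17 cycles of padding, which costs 1,088 registers or 64 LUTs used as shift registers.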
Because there are so many registers in FPGAs, pipelining is an effective way of
designing to increase design performance. Because pipelines can sometimes
become unbalanced when too much logic must be generated, it is necessary to
delay some of the signals. One of the best uses of the SRL is to add delay to
balance pipelines.
14—Two Types of Slices
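So now that we know that each LUT has three different configurations (logic,
distributed RAM, and the SRL) we can discuss the two types of slices in Virtex-6,
SLICEM and SLICEL. LUTs in a SLICEM support all three configurations, while
LUTs in a SLICEL can only be used as logic.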
Most customers don’t use as many LUTs for distributed memory or the SRL as
you might think, so this segmentation of functionality does not usually have a
performance impact, but saves transistors for other purposes that will be more
beneficial to the user.
Also keep in mind that the implementation tools will lay your design out onto the
die, so understanding which slice is an M or L is not important. All we wanted to
do was explain that they are different and show you how they are different.
15—I/O Bank Structure
The input/output blocks or IOBs interface between the FPGA and the outside
world. IOBs are grouped into IO banks located in columns throughout the device.
The IOBs contain registers and some specialized resources. But the main
purpose of the IOBs is to clock data into and out of the FPGA. In fact using
registers to clock data in and out of the devices is the fastest way to move data.
Besides registers, the IOBs contain interface logic that is designed to translate
the internal voltage domain of the FPGA to whatever I/O standard you are using.
This saves you from having to add all the necessary interface logic outside of the
FPGA. This requires that each bank share a common power supply for inputs
(Vref) and outputs (Vcco). These voltages vary by IO standard and Virtex-6
supports many IO standards. Fortunately, many of these standards can share
the same power supply.
Virtex-6 has 9-30 IO banks, depending on the device density and each IO bank
has 40 IO pins.
16—I/O Versatility
The versatile SelectIO™ resource was first introduced with the Virtex™ family of
FPGAs several years ago. It has been enhanced with every new product family
ever since.
I/O standards have been added and selected to support what our customers
require. Virtex-6 FPGAs support over 40 I/O standards including those you see
listed here. Check the data sheet to see if your I/O standard is supported.
Note that there are memory-specific standards of varied classes, as well as differential
standards that require two I/O pins to receive the signal. The differential
signaling standards include LVDS and LVPECL. The single-ended I/O standards
include LVCMOS, SSTL, and HSTL.
The SelectIO™ feature allows direct connection of the FPGA pin to external
signals of varied voltages and thresholds. This optimizes the speed/noise
tradeoff and saves you from having to place interface components onto your
board. All of the dedicated interface logic already exists and can be selected on
a pin by pin basis.
However, note that to mix different I/O standards into the same I/O bank, you
must be aware of the I/O banking rules. These rules apply because each I/O
bank shares power pins for each of the input and output pins in the bank.
Specifically, the Vref supplies the input circuitry and the Vcco powers the output
circuitry.
So, whichever I/O standards you group into each I/O bank, make certain that
the input pins have the same Vref, and the output pins have the same Vcco.
This is easily done with the Pin Assignment utility in PlanAhead.
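The banking rule as stated here can be sketched as a simple check in Python. This is a deliberately simplified model of only the two conditions mentioned (one output supply voltage and one input reference voltage per bank); the real banking rules have more cases. The voltage table is illustrative and should be confirmed against the data sheet, and check_bank is a hypothetical helper, not part of any Xilinx tool.

```python
# Illustrative values only; confirm every entry against the device data sheet.
IO_STANDARDS = {
    "LVCMOS25": {"vcco": 2.5, "vref": None},
    "LVCMOS18": {"vcco": 1.8, "vref": None},
    "SSTL18_I": {"vcco": 1.8, "vref": 0.9},
    "HSTL_I": {"vcco": 1.5, "vref": 0.75},
}


def check_bank(pins):
    """pins: list of (standard_name, direction), direction is 'in' or 'out'.

    Returns a list of conflicts; an empty list means the simplified rule holds.
    """
    output_vcco = {
        IO_STANDARDS[std]["vcco"] for std, direction in pins if direction == "out"
    }
    input_vref = {
        IO_STANDARDS[std]["vref"]
        for std, direction in pins
        if direction == "in" and IO_STANDARDS[std]["vref"] is not None
    }
    conflicts = []
    if len(output_vcco) > 1:
        conflicts.append(f"outputs need different Vcco values: {sorted(output_vcco)}")
    if len(input_vref) > 1:
        conflicts.append(f"inputs need different Vref values: {sorted(input_vref)}")
    return conflicts
```

For example, LVCMOS18 outputs and SSTL18_I inputs pass this check in the same bank, while LVCMOS25 and LVCMOS18 outputs together do not.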
17—I/O Electrical Resources
This slide is just trying to show you how differential pins work in Xilinx FPGAs.
Each pin in the device can be configured as single-ended or differential.
You should keep in mind that every pin is predefined as either a P or an N pin
(as shown in this diagram). And for every pin there is a corresponding pair that
is pre-assigned. So if you are using a differential I/O standard, you have to plan to
use your P and N pairs accordingly. You don’t have complete flexibility to plan your
design’s pin layout.
If a single I/O pin was configured as single-ended, then the corresponding P or N
pair would be free to be a single I/O pin of a compatible I/O standard. And in
this case, the diagram we see showing the interconnection between the two pins
would not apply.
Note that these limitations have an impact on your pin layout, so be sure to plan
on laying out your differential transmitter pins properly and early with the Pin
Planning functionality in the PlanAhead software.
18—IOB Element
The primary use of the IOBs is for registering data. In this slide, you see that
there are six registers associated with the IOB to support the use of double-data
rate applications.
As you can see, there are two registers for input, two for output, and two for
3-state enable. However, the IOB can still be used for single data rate applications
just as easily.
Each register supports separate clocks and clock enables for input and output.
The set and reset signals must be shared, however.
To clock the DDR registers, remember that you can use any pair of the PLL
outputs that are 180 degrees out of phase (such as the CLK90 and CLK270
outputs, likewise the CLK2X and CLK2X180, CLKFX and CLKFX180).
19—I/O Logical Resources
Besides the register and SelectIO resources there is also Serdes functionality
built into each IOB. This includes both parallel to serial and serial to parallel
conversion. This is a programmable functionality that can decode a wide range
of signals.
The OSERDES/ISERDES combination is probably one of the most useful and
radical developments. Many issues that you would traditionally address by using
the FPGA fabric are handled within these blocks. Tasks such as training pattern
recognition, clock-to-data alignment, and crossing of clock domains between
high-speed serial I/O and slower parallel clock domains can now be
accomplished with the dedicated logic included in the SERDES resources.
When we couple these resources with the I/O clock resources and regional clock
resources, this feature becomes very significant and well suited for
source-synchronous applications. We will discuss the clocking resources in the Basic
FPGA Architecture: Memory and Clocking Resources module.
The I/O SERDES capability includes input and output SERDES resources. There
is an OSERDES parallel-to-serial converter for both OQ and TQ. It is arranged as
a master and slave IOB pair to allow for differential inputs or larger single-ended
bit widths. So not only can you complete 6-to-1 parallel and serial conversion,
but when paired appropriately, you can do larger parallel and serial conversion.
For this to be done, an appropriate master/slave pair must be found on the die.
You cannot just pair any two IOBs and make them a master/slave pair.
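As a simple picture of what parallel-to-serial and serial-to-parallel conversion means, here is a Python sketch of a 6-to-1 serializer and its matching deserializer. The bit order (least significant bit first) is an assumption made for the example, and the function names are invented; the real blocks have many more modes.

```python
def serialize(words, width=6):
    """Parallel-to-serial: send each word one bit at a time, LSB first (assumed)."""
    return [(word >> i) & 1 for word in words for i in range(width)]


def deserialize(bits, width=6):
    """Serial-to-parallel: regroup a bit stream into words of the given width."""
    words = []
    for start in range(0, len(bits) - width + 1, width):
        chunk = bits[start:start + width]
        words.append(sum(bit << i for i, bit in enumerate(chunk)))
    return words


parallel_words = [0b101101, 0b000111, 0b110000]
serial_stream = serialize(parallel_words)      # 18 bits on the fast side
recovered = deserialize(serial_stream)         # back to three 6-bit words
```

The width parameter stands in for the larger conversions that become possible when a master and slave are paired.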
The SERDES functionality also has a user selectable fine grained programmable
delay element for synchronization.
20—Flip-Flop Details
Each FF has three control ports, CK, CE, and a Set/Reset port.
You have to remember that the CE and Set/Reset port are active high. This is
important when coding your HDL for the FFs.
Likewise, note that there are no separate set and reset ports, just one port. And
this port can be asynchronous or synchronous. This is important because your
synthesis tool should be able to map the set or reset to a LUT input, IF it is
synchronous. This gives you more flexibility that is necessary for the tools to
pack the most logic in the FPGA and get high performance, since all eight FFs
share the same control signals.
So our recommendations are to design for active high control signals and build
with a synchronous set or reset.
21—Design Tips
Besides building synchronously, you should also try to use the FPGA's global reset
functionality whenever possible.
This will allow you to perform a global reset on your system using a dedicated
routing resource. This requires the instantiation of the Startup component from
the Xilinx Unified Library and can save you routing a high fanout signal to all of
your FPGA resources.
And don’t forget to keep your control signals active high so you don’t need a
local inverter mapped to each control signal input. Inverters are not dedicated in
the silicon, so an active-low control will force the synthesis tool to waste a LUT input.
22—Software
The reason we have spent some time discussing the limitations of control signals
is that they can limit your ability to get high device utilization and high speed out
of your FPGA.
The reason this is necessary is the implementation tools can only pack related
logic together in the same slice since there are a limited number of inputs to
each slice.
This is important with registers since all the FFs share the same control signals.
Likewise, if the tools are going to use the 5-input LUT configuration, they have to be
able to find another combinatorial function with five of the same inputs to
group into the same slice. You see, it's all about grouping related logic into the
same slice. If the tools can pack related logic together, they do not end up
wasting LUTs and registers.
23—Control Signals
As I mentioned earlier your synthesis tool can map CE and synchronous set or
reset functionality to a LUT input. This is very powerful for being able to group
registers in the same slice, since all the FFs share the same control signals.
In this example, we see how a CE and a synchronous set or reset is mapped to a
LUT input.
This is very helpful, but a job only your synthesis tool can do. But if you build
with an asynchronous set or reset, then it cannot happen.
In that case, the asynchronous set or reset would be forced to a register's control
port. Likewise, if you build with an active low control signal, the LUT will be
required to invert the control signal.
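To see why a clock enable and a synchronous reset can move into the LUT, compare the two Python functions below. Both compute the value a register holds after the next clock edge. The first models dedicated control ports, the second models a plain D register fed by a LUT that takes CE and reset as ordinary inputs. The reset-over-enable priority and the function names are assumptions of this sketch.

```python
def next_q_with_control_ports(q, d, ce, sr):
    """Register with dedicated ports; synchronous reset wins over clock enable."""
    if sr:
        return 0
    return d if ce else q


def next_q_with_lut_controls(q, d, ce, sr):
    """Same behavior with CE and reset folded into the logic feeding a plain D FF."""
    lut_out = (not sr) and (d if ce else q)   # just another Boolean function
    return int(lut_out)
```

The two give the same result for every input combination because everything here happens on the clock edge. An asynchronous reset acts between clock edges, so it cannot be expressed as logic in front of the D input.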
24—Control Set Reduction
And as I mentioned, FFs with different control sets cannot be packed into the
same slice.
So in this example, we see three FFs are synthesized to force some control
signals to be associated with LUT inputs so that these registers can be packed
into the same slice.
This is done in XST by using the Reduce Control Set synthesis option.
It is not done by default (hopefully it will be eventually) but what is most important
is that you code for a synchronous reset and code for a CE so that you can take
advantage of this functionality in your FPGA.
25—Using the Slice Resources
There are three ways to get the FPGA resources you need in your design. You
can either infer logic with your synthesis tool, instantiate primitives from the
Xilinx Unified Library, or instantiate an optimized component from the Core
Generator.
One of the biggest challenges for new users is to make certain that they are
inferring the appropriate resources. First of all, one of the best ways to check
that you are getting the resources inferred is to check the schematic viewer from
your synthesis tool. All of the high-end synthesis tools have schematic viewers.
Viewers are used to graphically display a gate level version of your netlist and
also a technology view of your netlist. While a gate-level netlist is worthwhile to
check the Boolean optimization produced by your synthesis tool, the technology
viewer helps you to understand what Xilinx primitives were inferred by your
synthesis tool. A primitive is a resource that is written to the netlist that
represents a particular FPGA resource, such as a LUT, SRL, distributed RAM, etc.
Xilinx provides a library of primitives, called the Xilinx Unified Library, to all of our
synthesis partners, so they can create netlists that are optimal for their
customers. Well, this same library is available to you any time you want to
instantiate a primitive into your FPGA.
The Core Generator and Architecture Wizard make larger optimized components
available to you. While some cores have a charge associated with them, most
do not. The cores range from simple
components to large, highly complex IP. The Architecture Wizard is used most
commonly to customize DCM and PLL components.
26—Inference
All primary slice resources can be inferred by XST and Synplify. This includes
LUTs, FF, SRL, Muxes, and carry logic. However, you must be sure to code for
the resources properly. This means that if you define a FF with an incompatible
combination of control signals you will get something slightly different (probably
with a LUT input used for a control signal).
Likewise, the SRL is non-loadable, has no reset, and only supports a serial
functionality. So if you code for a parallel read, you will get a pure register
implementation, and this may not be what you expected.
Similarly, muxes should use a CASE statement and carry logic should use the
proper arithmetic operator. Don’t code for the XOR functionality and expect it to
map to carry logic.
In general, inference is best and preferred by our customers. It assures
portability to another technology and is re-usable. However, most customers will
have to do some instantiation.
27—Instantiation
A list of Xilinx primitives and their behavior can be found in the Libraries Guide.
This is available at support.xilinx.com or can be found from the Help menu in the
ISE software. This will describe all of the ports and attributes that can be
customized.
There is also a complete description of the XST attributes available from the
software manual web page. This is always a useful resource for inferring logic.
28—Architecture Wizard
The Core Generator and Architecture Wizard help you instantiate a high-level
component.
These cores are optimized and make designing for all of the detailed functionality
of things like the DCM or PLL easier. They also provide access to some very
powerful design tools that can create common sophisticated components at no
charge.
29—Summary
In summary, <read slide>
30—Where Can I Learn More?
Well, there is lots of useful information about inference and synthesis attributes
in the Synthesis and Simulation Design Guide. This is available from the ISE
software using its Help menu or directly from support.xilinx.com.
The XST User Guide provides very helpful coding recommendations that all Xilinx
users should live by. It also includes a detailed description of all of XST's
synthesis options. This information can also be found from the Help menu.
If you would like to see what other courses we offer, or what other free RELs
are available, go to the Xilinx Education link you see here.
I would also like to mention again that there are architecture modules available
that discuss the basics of Xilinx’s newest devices. You may find this useful,
especially if you want to learn more about the device differences.
But whatever you do, please take a second and let us know what you thought of
this REL. Just click on the icon on the next page and tell us what you think.
My name is Frank Nelson. You have been listening to the Basic FPGA
Architecture module for Virtex-6. Thanks for listening and thanks for your
business.
31—Trademark Information
<nothing said>