Designing for 100+ MHz

1999 Designs Demand...
 Higher system speed
 Higher integration
— smaller size, less power, better reliability
 Lower cost
 Shorter development time
 Better product differentiation
Traditional Multi-Chip Boards
 Discrete design components
— CPU, memory
— bus transceivers, PCI controller, FIFOs
— Ethernet controller, Graphics accelerator,
MPEG, DSP, etc.
— programmable logic as glue and custom function
 Advantages:
— well-documented sophisticated functions
— readily available as IP in silicon
Multi-Chip Board Problems
 Physical size
 Power consumption and reliability
 PC board signal integrity
 Limited flexibility
— prevents design modifications and upgrades
— prevents product diversification
— prevents product customization
 Poor product differentiation
— standard parts = standard architecture
FPGA Advantages
 Smaller size
 Lower power consumption
 Better signal integrity
— fewer PC-board issues
 Enhanced flexibility
— easy modifications, upgrades, etc.
 Enhanced product differentiation
— proprietary architectures
FPGA Users Want...
 System clock rate of 100+ MHz
 >100,000 gates
 Efficient design methodologies
 Availability of well-documented Cores
 Reasonable cost
The FPGA Solution
4th Generation FPGA
Logic+Memory+Routing
Multi-Standard Select I/O
Temperature Sensing
Delay-Locked Loop for Fast Clock and I/O
3.3 ns Synchronous Dual-Port SRAM
500 Mbps SelectMAP Configuration
Now the Challenge...
Design a 100+ MHz system
 Together, we can do it...
— we’ll supply the ingredients...
— you use them intelligently
 But don’t forget...
— the clock period is less than 10 ns!
Designing for 100+ MHz.
 Volts, Amps, and Watts
— PCB signal distribution
— chip inputs and outputs
— power and thermal considerations
 Ones and zeros
— logic emulation
 Bits and bytes
— memory hierarchy
Moore Meets Einstein
[Figure: clock frequency (MHz) and trace length (inches per 1/4 clock period) on a logarithmic scale from 1 to 2048, plotted by year from ’65 to ’10]
Speed Doubles Every 5 Years…
...But the speed of light never changes
Volts, Amps, and Watts
 PCB design issues
— capacitive loading
— transmission lines and termination
 Chip inputs and outputs
— clock distribution and DLLs
— I/O standards
 Power and thermal considerations
— temperature sensing diode
— power supply decoupling
 Configuration
— new SelectMAP mode
Capacitive Loading
 Capacitance slows outputs and increases power
— output delay increase:
– ~ 25 ps per pF of additional loading
— output power dissipation increase:
– 11 µW per MHz per pF with 3.3-V swing
 Sources of capacitance
— 10 pF max for each device pin
— 2 pF per inch for narrow traces (0.8 pF/cm)
— 130 pF per inch² for copper areas (20 pF/cm²)
 IBIS files provide output impedance details
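As a rough worked example of the figures above, the sketch below adds up a load capacitance and applies the slide's rules of thumb: about 25 ps of extra delay per pF, and C·V²·f for the output power, which works out to about 11 µW per MHz per pF at a 3.3-V swing. The function names and the example net (one receiving pin and 4 inches of narrow trace) are illustrative assumptions, not part of the original material.

```python
# Rules of thumb quoted on the slide.
DELAY_PS_PER_PF = 25.0      # extra output delay per pF of loading
PIN_PF_MAX = 10.0           # maximum capacitance of one device pin
TRACE_PF_PER_INCH = 2.0     # narrow PCB trace


def load_capacitance_pf(pins: int, trace_inches: float) -> float:
    """Estimate the lumped load (pF) from receiving pins plus narrow trace."""
    return pins * PIN_PF_MAX + trace_inches * TRACE_PF_PER_INCH


def added_delay_ps(load_pf: float) -> float:
    """Extra output delay caused by the load, in picoseconds."""
    return load_pf * DELAY_PS_PER_PF


def output_power_uw(load_pf: float, freq_mhz: float, swing_v: float = 3.3) -> float:
    """Dynamic power C * V^2 * f in microwatts.

    pF * V^2 * MHz = 1e-12 * 1e6 W = 1e-6 W, so the product is already in uW.
    """
    return load_pf * swing_v ** 2 * freq_mhz


if __name__ == "__main__":
    load = load_capacitance_pf(pins=1, trace_inches=4.0)
    print(f"load: {load} pF")
    print(f"added delay: {added_delay_ps(load)} ps")
    print(f"output power at 100 MHz: {output_power_uw(load, 100.0):.0f} uW")
```

With these assumptions a single 18-pF net adds 450 ps of output delay, which is a noticeable share of a 10-ns clock period.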
Transmission Lines
 Some traces must be treated as transmission lines
to minimize ringing
— transmission line if round trip > transition time
— lumped-capacitance if round trip < transition time
 Signal delay on a PCB:
— 140 to 180 ps per inch ( 50 to 70 ps/cm)
 Lumped-capacitance trace length:
— 3 inches max for a 1-ns transition time (7.5 cm)
— 6 inches max for a 2-ns transition time (15 cm)
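The round-trip rule above can be sketched as a small check. This is a minimal illustration that assumes a propagation delay of 160 ps per inch (the midpoint of the 140 to 180 ps range quoted above); the function names are hypothetical.

```python
def round_trip_ns(length_inches: float, delay_ps_per_inch: float = 160.0) -> float:
    """Time for a signal edge to travel down the trace and back, in ns."""
    return 2.0 * length_inches * delay_ps_per_inch / 1000.0


def is_transmission_line(length_inches: float, transition_ns: float,
                         delay_ps_per_inch: float = 160.0) -> bool:
    """True if the round trip exceeds the transition time (needs termination)."""
    return round_trip_ns(length_inches, delay_ps_per_inch) > transition_ns


def max_lumped_length_inches(transition_ns: float,
                             delay_ps_per_inch: float = 160.0) -> float:
    """Longest trace that can still be treated as a lumped capacitance."""
    return transition_ns * 1000.0 / (2.0 * delay_ps_per_inch)


if __name__ == "__main__":
    for t in (1.0, 2.0):
        print(f"{t} ns edge: lumped up to {max_lumped_length_inches(t):.1f} in")
```

At 160 ps per inch the limits come out near 3 inches for a 1-ns edge and 6 inches for a 2-ns edge, in line with the figures above.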
Terminated Transmission Lines
Reflections and ringing
Traditional Thevenin termination at the end
[Figure: VCC, 100 Ω, 50 Ω, 100 Ω]
Dynamic termination at the end is better and saves power
[Figure: 50 Ω, 50 Ω, 100 pF]
Series termination at the source is best, for a single source and destination only!
[Figure: 22 Ω, 27 Ω, 50 Ω (50 Ω total)]
On-Chip Clock Distribution
[Figure: on-chip clock distribution; labels: Clock, CLB, Data, IOB]
 Clock distribution introduces delay
— larger chips suffer more clock delay
Clock Delay Problems
 Clock delay increases clock-to-output times
 Clock delay leads to unacceptable input hold time
— set-up time is negative
 Additional data delay can eliminate the hold time
— set-up time becomes positive
— but tolerance build-up widens the data-valid window
[Figure: IOB flip-flop (D, Q) with a data delay element and the clock distribution delay; timing diagram of the required data-valid window relative to the clock, without and with the data delay]
DLLs Maximize I/O Speed
 Clock-to-output time plus set-up time determines
the I/O speed and data bandwidth
— min clock period = max clock-to-out + max set-up
 Traditional solution:
— use highly buffered, balanced clock trees
– needed to reduce internal clock skew
– cannot totally eliminate the delay
 The Virtex solution:
— use a Delay-Locked-Loop ( DLL )
– aligns the internal and external clocks
– effectively eliminates the clock-distribution delay
Virtex Has 4 Independent DLLs
[Figure: DLL control loop; labels: Clock, Comparator, Error, Delay, CLB, Data, IOB]
 DLLs adjust clock delay to align internal and external clocks
— digital closed-loop control
— 25 to 200-MHz range, 35-picosecond resolution
Fast Clock-to-Out With DLL
 160 MHz inter-chip data rate
— 16-mA LVTTL
— IOB register to IOB register
[Figure: two Virtex FPGAs, each with a DLL, and a Clock input; the IOB register (D, Q) to IOB register path is annotated 3.8 ns, 0.5 ns, and 1.9 ns]
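The 160 MHz figure can be checked against the rule that the minimum clock period is the maximum clock-to-out plus the maximum set-up time. The sketch below assumes the three values in the figure are the clock-to-out (3.8 ns), the board delay between the chips (0.5 ns), and the set-up time (1.9 ns); that assignment, the extra board-delay term, and the function name are assumptions made for illustration.

```python
def max_io_rate_mhz(clock_to_out_ns: float, setup_ns: float,
                    board_delay_ns: float = 0.0) -> float:
    """Highest register-to-register clock rate between two chips, in MHz.

    min clock period = max clock-to-out + board delay + max set-up
    """
    period_ns = clock_to_out_ns + board_delay_ns + setup_ns
    return 1000.0 / period_ns


if __name__ == "__main__":
    # LVTTL figures from this slide (assumed assignment of the three values).
    print(f"LVTTL: {max_io_rate_mhz(3.8, 1.9, 0.5):.0f} MHz")
    # SSTL 3 Class II figures from the SSTL slide later in this deck.
    print(f"SSTL3: {max_io_rate_mhz(2.8, 1.9, 0.3):.0f} MHz")
```

Under this reading, 3.8 + 0.5 + 1.9 = 6.2 ns gives about 161 MHz, which matches the quoted 160 MHz inter-chip data rate.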
LVTTL Data Rate with DLL
1.4 ns measured clock-to-output delay
Output standard = LVTTL Fast 16mA (OBUF_F_16)
Temp=100C, Vdd=2.375V, Vcco=3.3V
Waveforms:
1: CLKIN
2: DATA OUT (no DLL)
3: DATA OUT (DLL deskewed)
Timing:
          r->r   r->f
w/o DLL   3.9n   3.9n
w/ DLL    1.4n   1.4n
Other DLL Functions
 Double the incoming clock frequency
— fast internal operation – slow external clock
 Clock mirroring to the PCB
 Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
 Adjust clock duty cycle to 50-50
 Create four quadrature clock phases
— input four sequential bits per clock period
Duty Cycle Correction
~25% duty cycle in – 50% duty cycle out
[Figure: Virtex FPGA DLL (1X); input 25 MHz with 25% duty cycle, output 25 MHz with 50% duty cycle]
Clock Doubling and Mirroring
 Clock mirror with less than 100 ps skew
— simplifies PCB clock distribution
[Figure: actual HDTV customer example. A 37 MHz system clock (1 input load) feeds DLL 1 and DLL 2 in the Virtex; labels: zero-delay internal clock buffer, 74 MHz #1 and 74 MHz #2 to the SDRAMs. The waveforms (74 MHz internal and 37 MHz internal inside the FPGA, system clock, SDRAM) are exactly aligned.]
Precise Clock Mirroring
2x system clock for board use
[Figure: Virtex FPGA DLL (2X); 66 MHz clock in, 132 MHz clock out]
Clock Division
 Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16
— maintain synchronous edges
CLKIn 200 MHz
CLKout 200 MHz
CLKDV 12.5 MHz
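The divide ratios listed above can be tabulated for any input clock. The sketch below is only an arithmetic illustration; the function name is hypothetical, and the 25 to 200 MHz input check reuses the DLL range quoted earlier in this deck.

```python
CLKDV_RATIOS = (1.5, 2, 2.5, 3, 4, 5, 8, 16)   # divide ratios from the slide
DLL_MIN_MHZ, DLL_MAX_MHZ = 25.0, 200.0         # DLL input range quoted earlier


def clkdv_frequencies_mhz(clkin_mhz: float) -> dict:
    """Map each divide ratio to the resulting divided-clock frequency in MHz."""
    if not DLL_MIN_MHZ <= clkin_mhz <= DLL_MAX_MHZ:
        raise ValueError(f"{clkin_mhz} MHz is outside the DLL input range")
    return {ratio: clkin_mhz / ratio for ratio in CLKDV_RATIOS}


if __name__ == "__main__":
    for ratio, freq in clkdv_frequencies_mhz(200.0).items():
        print(f"divide by {ratio}: {freq:.2f} MHz")
```

For a 200 MHz input, dividing by 16 gives the 12.5 MHz CLKDV shown in the waveform above.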
Multi-Standard SelectI/O
[Figure: a Virtex FPGA connected to a microprocessor, SRAM, SDRAM, FLASH, DSP, mixed-signal devices, and busses/backplanes (3/5V PCI, ISA, GTL…); interface labels: GTL+, 2.5V SSTL, 1.8V, 5V tolerant, 5V, 3.3V LVTTL]
Mix & Match Output Standards
 User-supplied voltages determine output swing
— 3.3 V, 2.5 V, 1.5 V
— one voltage per bank
— a bank is half of a chip edge
 Output characteristics are programmable on a
per-pin basis
— push-pull or open-drain
— LVTTL drive strength
– 2-mA to 24-mA sink and source current
— LVTTL Slew rate
Mix & Match Input Standards
 Internal or user-supplied threshold voltage
— selectable on a per-pin basis
— one user-supplied threshold voltage per bank
 Programmable over-voltage protection
— 5-V tolerant or diode clamp to VCCO
— selectable on a per-pin basis
[Figure: a bank of inputs with VREF pins and an internal reference]
SSTL Clock-to-Out With DLL
 200 MHz inter-chip data rate
— SSTL 3, Class II (Stub Series Terminated Logic)
— IOB register to IOB register
[Figure: two Virtex FPGAs, each with a DLL, and a Clock input; the IOB register (D, Q) to IOB register path is annotated 2.8 ns, 0.3 ns, and 1.9 ns]
SSTL Data Rate with DLL
 1.3 ns measured clock-to-output delay
— much lower noise than LVTTL
Output standard = SSTL 3 Class 2 (OBUF_SSTL3_II)
Temp=100C, Vdd=2.375V, Vcco=3.3V, Vtt=1.5V
Waveforms:
1: CLKIN
2: DATA OUT (no DLL)
3: DATA OUT (DLL deskewed)
Timing:
          r->r   r->f
w/o DLL   3.5n   3.8n
w/ DLL    1.1n   1.3n
From FPGA to System Component
‘Redefining the FPGA’
[Figure: the FPGA as a system component with x1 and x2 clocks, connected to a low-voltage CPU, cache SRAM (Mbytes), SDRAM (133MHz), and a high-speed system backplane; interface labels: LVCMOS, SSTL3, LVTTL, GTL+]
“Virtex moves FPGAs from glue to system component” - Ron Neale, EE
Power and Thermal Issues
 Power and heat are serious concerns
 All CMOS power consumption is dynamic
— proportional to VCC²
— proportional to capacitance
— proportional to frequency
 Virtex conserves power
— 2.5-V supply voltage
— small geometries and short interconnects
reduce capacitance
Virtex Power Consumption
 Virtex is designed to conserve power
— 100 MHz 16-bit counters
– 12.5 MHz average transition rate
– 6.5 mW per counter including clock distribution
— 100 MHz 8-bit counters
– 25 MHz average transition rate
– 5 mW per counter including clock distribution
                   XCV300               XCV1000
16-bit counters    384, 2.5 W total     1536, 9.8 W total
8-bit counters     768, 3.7 W total     3072, 14.7 W total
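The average transition rates quoted above follow from how a binary counter toggles: bit 0 changes on every clock, bit 1 on every second clock, and so on. The sketch below reproduces the 12.5 MHz and 25 MHz figures under that assumption; the function name is illustrative.

```python
def average_transition_rate_mhz(clock_mhz: float, bits: int) -> float:
    """Average number of transitions per bit per microsecond for a binary counter.

    Bit k changes state once every 2**k clock cycles, so its transition
    rate is clock / 2**k. The result is the mean over all bits.
    """
    total = sum(clock_mhz / 2 ** k for k in range(bits))
    return total / bits


if __name__ == "__main__":
    print(f"16-bit: {average_transition_rate_mhz(100.0, 16):.2f} MHz")
    print(f" 8-bit: {average_transition_rate_mhz(100.0, 8):.2f} MHz")
```

Halving the counter width roughly doubles the average transition rate, which is why the 8-bit counters draw more power per flip-flop than the 16-bit ones.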
Thermal Management
 Temperature-sensing diode
— matched to the Maxim MAX1617 A/D
— programmable alarms
— similar to the Pentium II solution
[Figure: the Virtex FPGA DXP and DXN pins connect to a Maxim MAX1617; MAX1617 signals: SMBCLK, SMBDATA, ALERT]
Power Supply Decoupling
 CMOS power-supply current is dynamic
— current pulse every active clock edge
 Peak current can be 5x the average current
— instantaneous current peaks can only be
supplied by decoupling capacitors
 Use one 0.1 µF ceramic chip capacitor for each
power-supply pin
— low L and R are more important than high C
— double up for lower L and R if necessary
— use direct vias to the supply planes, close to the
power-supply pins
Virtex Configuration
 New byte-wide SelectMAP mode
— up to 528 Mbps at 66 MHz
– simple handshake protocol
— up to 400 Mbps at 50 MHz
– no handshake required
 Configuration bit-stream length
— 0.5 Mbits to 6.1 Mbits
[Figure: control logic (EPLD) and a configuration EPROM driving the Virtex FPGA; signals: Address, Data, CS, WE, Busy]
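The SelectMAP throughput figures are simply the byte-wide bus width times the clock rate, and they give a lower bound on configuration time. The sketch below is a back-of-the-envelope estimate that assumes 1 Mbit = 10^6 bits and ignores any handshake stalls or start-up overhead; the function names are illustrative.

```python
def selectmap_rate_mbps(clock_mhz: float, bus_width_bits: int = 8) -> float:
    """Peak SelectMAP throughput in Mbps: one bus-wide word per clock."""
    return clock_mhz * bus_width_bits


def configuration_time_ms(bitstream_mbits: float, clock_mhz: float) -> float:
    """Lower-bound configuration time in milliseconds at the peak rate."""
    return bitstream_mbits / selectmap_rate_mbps(clock_mhz) * 1000.0


if __name__ == "__main__":
    print(selectmap_rate_mbps(66.0))   # peak rate with handshake, Mbps
    print(selectmap_rate_mbps(50.0))   # peak rate without handshake, Mbps
    print(f"{configuration_time_ms(6.1, 50.0):.2f} ms for the largest bit-stream")
```

Even the largest 6.1-Mbit bit-stream loads in roughly 15 ms at 50 MHz under these assumptions.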
Volts, Amps, and Watts: Recap
 PCB design issues
— minimize capacitance for higher speed
— terminate transmission lines to reduce ringing
 Chip inputs and outputs
— use DLLs to maximize I/O bandwidth
— use SelectI/O to interface with different standards
 Power and thermal considerations
— use the sensing diode to manage chip temperature
— decouple the power supply well
 Configuration
— configure faster with the SelectMAP mode
Designing for 100+ MHz.
 Volts, Amps, and Watts
— PCB signal distribution
— chip inputs and outputs
— power and thermal considerations
 Ones and zeros
— logic emulation
 Bits and bytes
— memory hierarchy
Spending the 10 ns Budget
 Fast logic requires fast function generators
— signals often pass through several
function generators
 Routing delays must also be kept short
— there are routing delays between every
function generator
 Arithmetic delays are important
— carry chains often create critical paths
You Don’t Have To Be An Expert
 You don’t have to be an FPGA architecture expert to
implement high-performance designs
— the benefits of a good architecture are automatic
– all the logic goes faster
– software provides easy access to the features
 You can achieve high performance only with a good
FPGA architecture
— a good FPGA empowers its users
 You’ll design better if you know the architecture
— matching your design style to the available features
increases performance and/or lowers cost
Virtex CLB
 Logic and arithmetic delay reduction demands
improvements in the CLB
 Virtex CLB is divided into two slices, each with:
– 2 function generators
– 2 flip-flops
– 2 bits of carry logic
[Figure: a Virtex CLB with two slices, each containing two function generators and two bits of carry logic]
Fast Function Generators
 Each function generator emulates
2 to 3 levels of logic
— a 10-level logic path typically requires
3 to 5 Function Generators in series
— at 100 MHz, they must be less than
2 ns each including the routing
 Virtex has 0.6-ns function generators
— leaves 1.4 ns for each route
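The budget arithmetic above can be written out explicitly. This minimal sketch divides the clock period evenly across the function generators in series and subtracts the 0.6-ns function-generator delay; like the slide, it ignores flip-flop clock-to-out and set-up times. The function name is an illustrative assumption.

```python
def routing_budget_ns(clock_mhz: float, stages: int,
                      function_generator_ns: float = 0.6) -> float:
    """Routing delay available per stage, in ns.

    Splits the clock period evenly over `stages` function generators in
    series and subtracts the function-generator delay from each share.
    """
    period_ns = 1000.0 / clock_mhz
    return period_ns / stages - function_generator_ns


if __name__ == "__main__":
    for stages in (3, 4, 5):
        print(f"{stages} stages: {routing_budget_ns(100.0, stages):.2f} ns per route")
```

Five stages at 100 MHz is the worst case considered here and leaves the 1.4 ns per route quoted above.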
Connecting Function Generators
 Some functions need several function generators
— F5 MUXs connect pairs of function generators
– functions with 5 to 9 inputs
— F6 MUXs connect all 4 function generators
– functions with 6 to 17 inputs
[Figure: four function generators; each pair feeds an F5 MUX, and an F6 MUX combines the two F5 outputs]
Fast Local Routing
 Local routing provides fast interconnects
— in a CLB, Function Generators connect with minimal
routing delays
— fast paths between adjacent CLBs increase flexibility
[Figure: two adjacent CLBs, each with four function generators and four bits of carry logic]
Use Pipelining for Speed
 Shorter clock periods mean doing less each period
— create a pipeline structure
— pipeline stages operate concurrently
— more functions are done at the same time
— throughput increases
 All function generators have output flip-flops
— most pipeline support is “free”
16-Bit Pipeline in One LUT
 In directly cascaded pipelines the flip-flops are not free
 One SRLUT can implement up to 16 bits of delay
— shift data in and select the appropriate tap
[Figure: Input feeds a 16-bit shift register; a Delay Select chooses the tap that drives Output]
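A small behavioral model may help picture the shift-register LUT. The class below is an illustrative software sketch, not vendor code: it assumes that tap 0 holds the most recently shifted bit and tap 15 the oldest, and the class and method names are hypothetical.

```python
class ShiftRegisterLut:
    """Behavioral model of a 16-bit shift register with a selectable tap."""

    DEPTH = 16

    def __init__(self) -> None:
        self.bits = [0] * self.DEPTH   # bits[0] is the newest bit

    def shift_in(self, bit: int) -> None:
        """One clock with shift enabled: insert a bit, drop the oldest."""
        self.bits = [bit & 1] + self.bits[:-1]

    def tap(self, select: int) -> int:
        """Read the bit that was shifted in `select + 1` clocks ago."""
        if not 0 <= select < self.DEPTH:
            raise ValueError("tap select must be 0..15")
        return self.bits[select]


if __name__ == "__main__":
    srl = ShiftRegisterLut()
    for b in (1, 0, 1, 1):
        srl.shift_in(b)
    print([srl.tap(i) for i in range(4)])   # newest first
```

Changing the tap select changes the pipeline delay without using any additional flip-flops.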
Fast Logic Needs Fast Routing
 Our typical design with 3 to 5 CLBs needed an average routing delay of 1.4 ns or less
— the Virtex routing architecture delivers this performance
 Delay is independent of direction
— dependably short delays
[Figure: Vector-based Interconnect. The circles show 1.4-ns routing regions]
Go Farther, Faster
 Virtex achieves its speed through a hierarchy of
highly buffered routing resources
— wires span 1, 2, or 6 CLBs
 The Virtex routing architecture is designed for
large arrays
— today’s FPGAs are big…
but tomorrow’s will be even bigger
 Virtex is designed to maintain its performance
even in very large arrays
No Routing Congestion
 For high-speed applications, routing must be
dependably fast
— not just capable of being fast
 In the past, high device utilization has caused routing
congestion
— critical nets might be forced to meander
 Virtex minimizes these problems
— abundant resources prevent congestion
If it needs to be fast, it will be fast – automatically!
Built-in Tri-State Busses
 Bi-directional busses are supported directly by
tri-state buffers built into each CLB
— two drivers per CLB
— segmentable every four CLB columns
[Figure: a row of five CLBs on a tri-state bus]
Arithmetic – A Special Case
 Adders, accumulators, counters, and comparators
all depend on carry chains
 Carry-chain logic is usually much deeper than the
rest of the design
— 32 levels for a 16-bit ripple adder
— too deep to use function generators at 100 MHz
— arithmetic delays would limit performance
 Dedicated carry logic provides the desired speed
— 16-bit adders can operate at up to
200 MHz register-to-register
Wide Arithmetic
 64-bit adders would require 128 levels of logic
— expensive complex carry schemes would be needed
to preserve performance
 Virtex minimizes the carry propagation delay
— 100 ps per bit pair
— zero routing delay between CLBs
 Minimal performance loss for each extra bit
16-bit adders operate at up to 200 MHz
64-bit adders operate at up to 135 MHz
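The two adder speeds above are consistent with the quoted carry delay. The sketch below assumes that widening an adder only adds carry-chain delay (100 ps per bit pair) to the 5-ns period of the 16-bit, 200 MHz reference point; that simplification and the function name are assumptions for illustration.

```python
def adder_fmax_mhz(bits: int, ref_bits: int = 16, ref_fmax_mhz: float = 200.0,
                   carry_ps_per_bit_pair: float = 100.0) -> float:
    """Estimate register-to-register adder speed from the carry-chain delay.

    Starts from a reference adder (16 bits at 200 MHz) and adds 100 ps of
    carry propagation for every extra pair of bits.
    """
    ref_period_ns = 1000.0 / ref_fmax_mhz
    extra_pairs = (bits - ref_bits) / 2
    period_ns = ref_period_ns + extra_pairs * carry_ps_per_bit_pair / 1000.0
    return 1000.0 / period_ns


if __name__ == "__main__":
    for width in (16, 32, 64):
        print(f"{width}-bit adder: about {adder_fmax_mhz(width):.0f} MHz")
```

Going from 16 to 64 bits adds 24 bit pairs, or 2.4 ns, giving a 7.4-ns period and about 135 MHz.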
Efficient Virtex Multipliers
 Cascade vs. tree structure
— cascade simpler and smaller
— tree is faster
 Virtex gives the best of both worlds
— as fast as a tree
— smaller than a cascade
 160 MHz clock rate for pipelined 16 x 16 multiplier
[Figure: two charts comparing Cascade, Tree, and Virtex Tree multipliers at 4x4, 8x8, and 16 x 16; one shows delay, the other the number of CLBs]
Fast Address Decoders
 Wide address decoders could slow operation
— wide AND gates with invertible inputs
 Virtex carry-chain MUXs can act as AND gates
— combine function generator ANDs
 64-bit decoders operate at up to 155 MHz
[Figure: a chain of carry MUXs with 0 and 1 inputs]
Speed Is Never Wasted
 You can never have too much performance
— excess performance can always be traded for
size and cost reduction
 Replace single-cycle functions with smaller
multi-cycle versions
— a 2-cycle multiplier is half the cost of a
single-cycle multiplier
Reduce costs by designing down
to the performance you need
Creating a High-Speed Clock
 Logic sometimes needs to operate faster than
the available clock
— multiple RAM accesses in a single cycle
— low-speed PCB clock distribution for power or
noise reduction
 Virtex DLLs can double and redouble
incoming clocks
[Figure: 45 MHz in; DLL1 (2X) produces 90 MHz; DLL2 (2X) produces 180 MHz]
Optimized for the Future
 Deep sub-micron technology permits larger
and larger array sizes
— poses new circuit-design challenges
— changes the rules of FPGA architecture
 Across-chip routing is the most vulnerable
— could easily limit design performance
 Virtex is designed for long-term growth
— even long, across-chip routes will remain fast
Virtex is tomorrow’s FPGA… today!
10 ns is Long Enough
 Virtex CLBs can implement relatively complex
functions in 10 ns
— 0.6 ns per 4-input function generator
 Virtex offers fast interconnections
— even across-chip when fully utilized
— fast tri-state buses
 Support for very fast arithmetic operations
— 16-bit adders at 200MHz
Implement Designs Automatically
 You don’t have to be an FPGA wizard to use Virtex
 Virtex is optimized for automated implementation
— uniform structure
– efficient mapping/synthesis
— ample routing
– simple placement and no congestion
— predictable performance
– effective synthesis
 IP cores speed design even more
— validated functionality with guaranteed performance
Designing for 100+ MHz
 Volts, Amps, and Watts
— PCB signal distribution
— chip inputs and outputs
— power and thermal considerations
 Ones and zeros
— logic emulation
 Bits and bytes
— memory hierarchy
100+ MHz Memory
 Virtex memory operates up to 200 MHz
 High-speed memory has two benefits
— data storage
– “work-in-progress”
– input/output buffers, FIFOs
— accelerating complex functions
– store pre-computed values in look-up tables
61
Designing for 100+MHz
Data Storage Hierarchy
Virtex supports 3 levels of memory hierarchy
 On-chip SelectRAM+
— small-to-medium memories
— 0.6-ns read access time
 On-chip Block SelectRAM+
— larger memories
— true dual-ported operation
— 3.3-ns read access time
 Fast SelectI/O interfaces to external RAM
— DLL boosts memory bandwidth
62
Designing for 100+MHz
SelectRAM+
 SelectRAM+ uses CLB LUTs as user memory
— 16-deep RAMs
— 32-deep RAMs
— 16-deep dual-ported RAMs
— 16-deep shift registers
 Cascadable for larger memories
— 128 or more words deep
— uses logic resources for expansion
63
Designing for 100+MHz
Block SelectRAM+
 Up to 32 dual-ported 4096-bit RAM Blocks
— synchronous read and write
 True dual-port memory
— each port has full read and write capability
— different clocks for each port
 Configurable aspect ratio
— trade width for depth
– 4096 x 1 bit to 256 x 16 bits
— separate configurations for each port
 Dedicated routing for memory expansion
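The width-for-depth trade can be tabulated directly from the 4096-bit block size. The sketch below assumes power-of-two port widths between the 4096 x 1 and 256 x 16 extremes named above; the helper names are illustrative.

```python
BLOCK_BITS = 4096
PORT_WIDTHS = (1, 2, 4, 8, 16)   # assumed power-of-two widths, 1 to 16 bits


def port_depth(width_bits: int) -> int:
    """Number of words a 4096-bit block holds at the given port width."""
    if width_bits not in PORT_WIDTHS:
        raise ValueError("port width must be 1, 2, 4, 8, or 16 bits")
    return BLOCK_BITS // width_bits


def reads_per_write(write_width: int, read_width: int) -> float:
    """Read-port words needed to drain one write-port word."""
    port_depth(write_width)   # validate both widths
    port_depth(read_width)
    return write_width / read_width


if __name__ == "__main__":
    for w in PORT_WIDTHS:
        print(f"{port_depth(w)} x {w}")
    print(reads_per_write(16, 4))
```

Configuring one port as 256 x 16 and the other as 1024 x 4, for example, converts each 16-bit word written into four 4-bit words read.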
64
Designing for 100+MHz
High-Speed Memory Interfaces
 SelectI/O and DLLs together provide fast access to
many types of external memory
 Xilinx currently offers two reference designs
— fully synthesized
— automatic placement and routing
SDRAM … up to 125 MHz
ZBTRAM (Zero Bus-Turn-around) … up to 143 MHz
65
Designing for 100+MHz
Input/Output Data Buffers
 High-performance systems need data buffers to
decouple internal operation from I/O activity
— I/O may be sporadic (burst-mode busses)
— I/O may be faster or slower
— I/O may be wider or narrower
 I/O buffers can take several forms
— dual-ported RAMs
— ping-pong buffers
— FIFOs
66
Designing for 100+MHz
Dual-ported I/O Buffers
 Block SelectRAM+ is ideal for I/O buffers
— dual-ported operation
– independent clocks and controls
– bridges between clock domains
– simultaneous read and write
— port-specific aspect-ratio control
– built-in rate/width conversions
 SelectRAM+ provides similar benefits
on a smaller scale
67
Designing for 100+MHz
Ping Pong Buffers
 Ping-pong buffers are pairs of blocks that alternate
between input and processing
 SRLUT for small buffers
— self-addressing input
— 0.6-ns read access
 Larger buffers can use the dual-ported Block RAM
— one address bit alternates read/write areas
— 3.3-ns read access
[Figure: two groups of 16-bit shift registers share the Input; a Read Address and a Select choose the Output]
68
Designing for 100+MHz
Small FIFOs in SRLUTs
 Small FIFOs can be implemented in SRLUTs
— word count addresses the output data
— increment and enable SRLUT to Push
— decrement to Pop
— enable only for both
 16-Byte FIFO in 4 CLBs
— 16 x 16 in 6 CLBs
— 200+ MHz
 Expandable for deeper FIFOs
[Figure: a word counter (Up on Push, Down on Pop) addresses a bank of 16-bit shift registers between Input and Output]
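The push and pop rules above can be modeled in a few lines. This is a behavioral sketch rather than a hardware description: it assumes the oldest word sits at index `count - 1` of the shift register, and the class and method names are hypothetical.

```python
class SrlFifo:
    """Behavioral model of a 16-deep FIFO built from shift-register LUTs."""

    DEPTH = 16

    def __init__(self) -> None:
        self.regs = [0] * self.DEPTH   # regs[0] is the newest word
        self.count = 0                 # word counter addresses the output

    def clock(self, push: bool = False, pop: bool = False, data: int = 0):
        """Advance one clock; returns the popped word, or None if no pop."""
        if pop and self.count == 0:
            raise IndexError("pop from empty FIFO")
        if push and not pop and self.count == self.DEPTH:
            raise OverflowError("push to full FIFO")
        out = self.regs[self.count - 1] if pop else None
        if push:                       # push enables the shift register
            self.regs = [data] + self.regs[:-1]
        if push and not pop:
            self.count += 1            # push only: increment
        elif pop and not push:
            self.count -= 1            # pop only: decrement
        return out                     # push and pop together: count unchanged


if __name__ == "__main__":
    fifo = SrlFifo()
    for word in (1, 2, 3):
        fifo.clock(push=True, data=word)
    print(fifo.clock(pop=True))                  # oldest word first
    print(fifo.clock(push=True, pop=True, data=4))
```

A simultaneous push and pop only enables the shift, so the counter and the read address stay put while the data moves past them.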
69
Designing for 100+MHz
Large FIFOs in Block RAM
 Large FIFOs can use the dual-ported block RAM
— add read and write address counters
 Asynchronous push and pop
 Different port sizes give rate-for-width conversion
 Block RAM FIFOs can operate at up to 170 MHz
including flag logic
[Figure: Block SelectRAM+ with input and output data ports, two address counters (En), WE, and control logic driven by Push and Pop that generates Full and Empty]
70
Designing for 100+MHz
Pre-computing for Speed
 Some functions are too complex for 10-ns
logic implementation
— pipelining is not always possible
 An alternative is to pre-compute all the possible
results and store them in memory
— select a result according to the inputs
 Function time is independent of complexity
— 0.6 ns SelectRAM+ access time
— 3.3 ns Block SelectRAM+ access time
 The function table can be smaller than the logic
71
Designing for 100+MHz
Multiplication By A Constant
 Sometimes, data has to be “scaled”
— multiplied by a constant value
 A full multiplier is too expensive
— it can multiply by a variable
— unnecessarily general and too
complex
 Storing all multiples of the constant is a better alternative
— smaller and much faster
[Figure: a multiplier array with Constant and Input operands, compared with a product table addressed by Input; both produce the scaled data]
72
Designing for 100+MHz
16-bit Scaler
 A 2^16-word product table is impractical
— partition the input into nibbles
– use 16-word LUTs for nibble products
– combine the partial products in adders
 Roughly half the CLBs of a full multiplier
— for a 16-bit Coefficient: 36 CLBs vs. 62 CLBs
 Pipeline the adders for extra speed
[Figure: the four input nibbles address four LUTs; the LUT outputs are weighted x4096, x256, x16, and x1 and summed to give the scaled data]
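The nibble-partitioned scaler can be sketched in software to show why one 16-word table is enough. The code below is an illustrative model with hypothetical function names: each 4-bit nibble of the input looks up its product with the constant, and the partial products are shifted by the nibble weight (x1, x16, x256, x4096) and added.

```python
def build_product_table(constant: int) -> list:
    """The 16 multiples of the constant, one per possible nibble value."""
    return [constant * n for n in range(16)]


def scale(value: int, table: list) -> int:
    """Multiply a 16-bit value by the constant using nibble look-ups."""
    if not 0 <= value < 1 << 16:
        raise ValueError("value must fit in 16 bits")
    result = 0
    for i in range(4):                       # four nibbles, LSB first
        nibble = (value >> (4 * i)) & 0xF
        result += table[nibble] << (4 * i)   # weight: x1, x16, x256, x4096
    return result


if __name__ == "__main__":
    table = build_product_table(12345)
    print(scale(0xBEEF, table) == 0xBEEF * 12345)
```

In hardware the four look-ups happen in parallel in separate LUTs, so only the adder tree sits between the input and the scaled result.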
73
Designing for 100+MHz
Changing the Constant
 The SRLUT mode can be used to update the table
— “push-only” stack
— last 16 bits loaded define the table
 A simple accumulator computes all products of a new constant
[Figure: labels: Constant Register, Change Constant, Load, Clear, Output Register, Input, 16-Bit Shift Register]
74
Designing for 100+MHz
Large Function Tables
 Larger functions can be implemented in the
Block SelectRAM+
— 12-input functions
— micro-coded state machines
 Data tables can also be implemented
— sine/cosine tables for DSP, for example
— dual-ported access gives the sine and cosine
simultaneously
— a simple address offset gives 90° phase shift for
accessing sine and cosine from a single table
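The single-table sine/cosine idea can be illustrated as follows. The sketch assumes a 256-entry table of 16-bit signed samples (256 x 16 bits, which happens to equal one 4096-bit block) and reads the cosine a quarter of the table ahead of the sine; the table size, scaling, and names are assumptions made for this example.

```python
import math

TABLE_SIZE = 256          # assumed: 256 x 16 bits fills one 4096-bit block
FULL_SCALE = 32767        # 16-bit signed amplitude

# One full sine period, as it would be loaded at configuration time.
SINE_TABLE = [round(math.sin(2 * math.pi * i / TABLE_SIZE) * FULL_SCALE)
              for i in range(TABLE_SIZE)]


def sin_cos(phase: int) -> tuple:
    """Return (sine, cosine) for a phase index using two table reads.

    The second read uses an address offset of a quarter table (90 degrees),
    modeling the second port of a dual-ported memory.
    """
    sine = SINE_TABLE[phase % TABLE_SIZE]
    cosine = SINE_TABLE[(phase + TABLE_SIZE // 4) % TABLE_SIZE]
    return sine, cosine


if __name__ == "__main__":
    for phase in (0, 32, 64):
        print(phase, sin_cos(phase))
```

With a dual-ported memory both reads can occur in the same clock cycle, one per port.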
75
Designing for 100+MHz
Block RAM/ROM Creation
 CORE Generator software creates RAMs and ROMs
— simple GUI
 Initialization file is loaded into RAMs and ROMs at configuration time
76
Designing for 100+MHz
Memory Summary
 Virtex has two kinds of internal memory
— distributed SelectRAM+ for small RAMs
— Block SelectRAM+ for larger RAMs
 SelectRAM+
— 0.6 ns read access time
— 16- and 32-word RAMs
— 16-word dual-ported RAMs
— 16-word shift registers
– sequential write/random-access read
– FIFOs, pipelining, LUT functions, etc.
77
Designing for 100+MHz
Memory Summary
 Dual-ported 4096-bit Block SelectRAM+
— 3.3 ns read access time
— true dual-ported operation
– both ports are read/write
– ports can be clocked asynchronously
— configurable aspect ratio
– 4096 x 1 bit to 256 x 16 bits
– configure ports differently for width/rate conversion
 High-speed SelectI/O access to external RAM
78
Designing for 100+MHz
Designing for 100+ MHz
 Volts, Amps, and Watts
— DLLs and flexible I/O standards
— fast inter-chip communication
— simple rules for good signal integrity
 Ones and zeros
— fast logic and fast interconnect
— dependable high performance
 Bits and bytes
— distributed SelectRAM+
— dual-ported Block SelectRAM+
79
Designing for 100+MHz
The Virtex Family
Device    System Gates   Logic Cells   Block RAM
XCV50           57,906         1,758       32 Kb
XCV100         108,904         2,700       40 Kb
XCV150         164,674         3,888       48 Kb
XCV200         236,666         5,292       56 Kb
XCV300         322,970         6,912       64 Kb
XCV400         468,252        10,800       80 Kb
XCV600         661,111        15,552       96 Kb
XCV800         888,439        21,168      112 Kb
XCV1000      1,124,022        27,648      128 Kb
[Table: user I/O per device and package. Packages: CS144, TQ144, PQ/HQ240, BG256, BG352, BG432, BG560, FG256, FG456, FG600, FG680. I/O counts shown: 94, 164, 176, 180, 260, 284, 312, 316, 404, 500, 514]
The complete Virtex Data Sheet is on your AppLinx CD-ROM
and at www.xilinx.com/partinfo/virtex.pdf