Introduction to Virtex-II Architecture
Presented by: Ankur Agarwal

Xilinx Design Flow
1. Plan & budget
2. Create code / schematic
3. HDL RTL simulation
4. Synthesize to create netlist
5. Functional simulation
6. Implement: translate, map, place & route
7. Attain timing closure
8. Timing simulation
9. Create bit file

Xilinx Architecture Features
• High performance at 2.5 V, 3.3 V, and 5 V
• Technology independence: EDIF, VHDL, Verilog, SDF interface
• Footprint compatibility: devices within each family are compatible with each other
• Pin locking

Virtex
• Up to 2 million system gates at 100+ MHz
• Features:
  - Distributed and block RAM available
  - Low power
  - Delay-Locked Loops
  - 2.5 V internal operation with support of common power

Naming Conventions
Example: XC4028XL-3-BG256
• XC40: family (4000, 9500); Spartan part numbers start with XCS
• 28: number of gates
• XL: sub-family (3 V = XL, 5 V = no XL)
• -3: speed grade
• BG256: package

CPLD and FPGA
               Complex Programmable Logic Device (CPLD)   Field-Programmable Gate Array (FPGA)
Architecture   PAL/22V10-like; more combinational         Gate-array-like; more registers + RAM
Density        Low to medium; 0.5-10K logic gates         Medium to high; 1K to 3.2M system gates
Performance    Predictable timing; up to 250 MHz today    Application dependent; up to 200 MHz
Interconnect   "Crossbar switch"                          Incremental

Overview of Xilinx FPGA Architecture
• I/O Blocks (IOBs)
• Programmable interconnect
• Configurable Logic Blocks (CLBs)
• Tristate buffers
• Global resources

Block Diagram of Virtex-II Architecture
[Figure: Virtex-II block diagram. Labels: SONET/SDH, DCM, distributed RAM, PCI-X, PCI, 18 Kb BRAM, LVDS, CAM, FIFO, shift registers, DDR SDRAM, QDR SRAM, DDR CAM, multiplier, BLVDS, backplane]

CLB Resources
• The basic resource unit is the logic cell
  - 1 CLB contains 2-4 logic cells, depending on device family
  - Logic cell = 4-input look-up table (LUT) + D flip-flop
• LUT capacity is limited by the number of inputs, not by the complexity of the function
• LUTs can be used as ROM or synchronous RAM
• The flip-flop can be configured as a transparent latch in Virtex and Spartan-II
[Figure: logic cell, a LUT feeding a flip-flop]

Closer Look at a CLB Structure
[Figure: CLB made of two slices; each slice has two look-up tables (inputs G1-G4 and F1-F4), carry & control logic between CIN and COUT, and two flip-flops driving outputs Y/YB and X/XB]
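The CLB Resources statement that LUT capacity depends on the number of inputs, not on the complexity of the function, can be illustrated with a small software model. The sketch below is only an illustration: the helper names (lut4_init, lut4_eval), the choice of A as the most significant address bit, and the two example functions are assumptions made here, not Xilinx conventions.

```python
from typing import Callable


def lut4_init(func: Callable[[int, int, int, int], int]) -> int:
    """Return the 16-bit table contents for any 4-input Boolean function.

    Bit `addr` of the result holds the output for inputs ABCD = addr,
    with A as the most significant bit (an assumption of this sketch).
    """
    init = 0
    for addr in range(16):
        a, b, c, d = (addr >> 3) & 1, (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
        if func(a, b, c, d) & 1:
            init |= 1 << addr
    return init


def lut4_eval(init: int, a: int, b: int, c: int, d: int) -> int:
    """Look up the stored output bit for one input combination."""
    addr = (a << 3) | (b << 2) | (c << 1) | d
    return (init >> addr) & 1


# Two hypothetical functions of very different gate-level complexity.
# Both occupy exactly one 16-entry table.
and4 = lut4_init(lambda a, b, c, d: a & b & c & d)
parity4 = lut4_init(lambda a, b, c, d: a ^ b ^ c ^ d)

if __name__ == "__main__":
    print(f"AND4 table:    {and4:#06x}")
    print(f"parity4 table: {parity4:#06x}")
    print("parity of 1,0,1,1:", lut4_eval(parity4, 1, 0, 1, 1))
```

A 4-input AND and a 4-input XOR each fill the same 16 bits, which is why the input count, not the logic depth, determines how much fits in one LUT.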
• Each slice has 2 LUT-FF pairs with associated carry logic
• Two 3-state buffers (BUFT) are associated with each CLB, accessible by all CLB outputs

Interconnect Technology Offered by Virtex-II
• The interconnect is an array of switch matrices
• All Virtex-II features can access routing resources through the switch matrix
• Simplifies design and place & route
[Figure: array of switch matrices, each attached to a CLB, IOB, DCM, 18 Kb BRAM, or MULT 18x18]

Simplified Slice Structure
• Each slice has four outputs
  - Two registered outputs
  - Two non-registered outputs
• Two BUFTs associated, accessible by all 16 CLB outputs
• Carry logic for fast addition
  - Two independent carry chains per CLB

Fast Carry Logic
• Each CLB contains separate logic and routing for the fast generation of carry signals
  - Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
• Carry logic is independent of normal logic and routing resources
[Figure: carry logic routing from LSB to MSB]

CLB (Configurable Logic Blocks)
• Each CLB is connected to one switch matrix, providing access to general routing resources
[Figure: CLB with slices S0 (X0Y0), S1 (X0Y1), S2 (X1Y0), and S3 (X1Y1), two TBUFs, the switch matrix, fast connects, the CIN/COUT carry chains, and the SHIFT chain]
• High level of logic integration
• Wide-input functions:
  - 16:1 multiplexer in 1 CLB or any function
  - 32:1 multiplexer in 2 CLBs (1 level of LUT)
• Fast arithmetic functions
  - 2 look-ahead carry chains per CLB column
• Addressable shift registers in LUT
  - 16-bit shift register in 1 LUT
  - 128-bit shift register in 1 CLB (dedicated shift chain)

Four-Input LUT
• Implements combinatorial logic
  - Any 4-input logic function
  - Cascaded for wide-input functions
Truth table of a 4-input logic function (inputs ABCD):
ABCD   Z (output)
0000   0
0001   0
0010   1
0011   0
...    ...
1110   1
1111   1
[Figure: LUT with inputs A, B, C, D and output Z]

Multiplexers
• MUXF5 combines 2 LUTs to create
  - A 4x1 multiplexer
  - Or any 5-input function (LUT5)
  - Or selected functions up to 9 inputs
• MUXF6 combines 2 slices to form
  - An 8x1 multiplexer
  - Or any 6-input function (LUT6)
  - Or selected functions up to 19 inputs
• Dedicated muxes are faster and more space efficient
[Figure: CLB with two slices; each slice has two LUTs and a MUXF5, and a MUXF6 combines the two slices]

CLB Multiplexer Location
• MUXF6 combines slices X0Y0 & X0Y1
• MUXF6 combines slices X1Y0 & X1Y1
• MUXF7 combines the 2 MUXF6 outputs
• MUXF8 combines the 2 MUXF7 outputs (two CLBs)
[Figure: CLB with slices S0-S3; each slice holds an F5 multiplexer plus one of F6, F6, F7, F8]

Horizontal Cascade Chain
• Wide AND-OR functions (sum of products)
[Figure: SOP chain running horizontally across three CLBs, each with slices S0-S3]

Shift Register LUT
• Each LUT can be configured as a shift register
  - Serial in, serial out
• Dynamically addressable delay of up to 16 cycles
  - For programmable pipelines
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
[Figure: a LUT drawn as a chain of D flip-flops with CE, clocked by CLK, input IN, output OUT selected by DEPTH[3:0]]

Shift Register
• A register-rich FPGA allows for the addition of pipeline stages to increase throughput
• Data paths must be balanced to keep the desired functionality
[Figure: two parallel 64-bit paths. Operation A (4 cycles) followed by Operation B (8 cycles) takes 12 cycles; Operation C takes 3 cycles, leaving a 9-cycle imbalance]

Shift Register Look-Up Table
• High-density integration of shift registers
  - DSP applications use SRL16 for delay matching
  - CDMA wireless and video applications require shift registers
• Up to 128 bits per CLB
• 16 bits per LUT
• Cascadable output
• Dynamically addressable output
• Multiple SRLC16 cascadable to any length

Digital Clock Manager
• High-speed 420 MHz clock generation
  - Clock de-skew on-chip and off-chip
  - Up to 12 DCMs per device
  - Fully digital circuitry
• Flexible frequency synthesis
  - Synthesis outputs: clock 0° and 180° (default: 4X)
• High-resolution phase shifting
  - DPS fixed and variable modes
• Delay-Locked Loop (DLL)
  - Precise clock de-skew
  - DLL outputs: clock 0°, 90°, 180°, 270°
  - DLL outputs: clock 2X and clock division
  - 50/50 duty cycle correction

Digital Clock Manager: DCM
[Figure: DCM block. Inputs: CLKIN, CLKFB, RST, DSSEN, PSINCDEC, PSEN, PSCLK. Outputs: CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, CLKFX180, LOCKED, STATUS[7:0], PSDONE. Clock and control signals are marked separately]
Delay-Locked Loop
• Clock phase de-skew
• Duty cycle correction
• Temperature compensation
• RST input
• LOCKED output
• Attributes:
  - DUTY_CYCLE_CORRECTION
  - DLL_FREQUENCY_MODE
  - CLKDV_DIVIDE = 1.5 to 16.0
  - STARTUP_WAIT
  - CLK_FEEDBACK = CLK0 or CLK2X
• Up to 4 clock outputs per DCM

Advanced Frequency Synthesis
Frequency synthesis
• CLKFX is any M / D product of the CLKIN frequency
  - M = 2 to 32, D = 1 to 32
  - Default: M = 4, D = 1 (4X CLKIN)
• Always nominal 50/50 duty cycle
• Attributes:
  - CLKFX_MULTIPLY (integer)
  - CLKFX_DIVIDE (integer)
  - DFS_FREQUENCY_MODE
• After LOCKED: Freq(CLKFX) = (M / D) x Freq(CLKIN)

High Resolution Phase Shifting
Fine phase shifting
• Applies to all CLK outputs
• Phase shift = fraction of the CLKIN period
• Fixed or variable modes
• Inputs in variable mode:
  - PSINCDEC = increase / decrease
  - PSEN = enable phase shift
  - PSCLK synchronizes the phase shift
• PSDONE output
• Attributes:
  - CLKOUT_PHASE_SHIFT = NONE, FIXED, VARIABLE
  - PHASE_SHIFT (signed integer) = -255 to +255

Global Clocks
• Up to 16 dedicated low-skew clocks
  - 16 global clock multiplexers & buffers
  - 8 clock nets in each quadrant
• Global clock enable
• Switch glitch-free from one clock to another
• 16 clock pads (can be used as user I/O)

Clock Distribution
• 16 global clock multiplexers
  - Eight on the top
  - Eight on the bottom
• Switch "glitch free" from one clock to the other
• 8 clocks selectable per quadrant
[Figure: 16 clocks distributed from 8 BUFGMUX at the top and 8 BUFGMUX at the bottom to the NW, NE, SW, and SE quadrants, 8 max per quadrant]
• Unused branches are disabled (power saving)

Use Global Buffers to Reduce Clock Skew
• Global buffers are connected to dedicated routing
• This routing network is balanced to minimize skew
• All Xilinx FPGAs have global buffers
[Figure: a design that contains 2 clock signals, CLK1 and CLK2. With a BUFG on CLK1 only, clock skew is introduced between CLK1 and CLK2; using an extra BUFG reduces skew on CLK2]

Global Clocks: BUFGMUX
Three modes:
• Clock buffer (BUFG primitive)
  - Low-skew clock distribution
• Clock enable (BUFGCE)
  - Stop the clock High or Low; BUFGCE stops Low
• Clock multiplexer (BUFGMUX), "glitch-free"
  - Switch from one clock to another
  - Clocks can be unrelated
  - No pulse width shorter than 1/2 of the period
[Figure: BUFG (I, O), BUFGCE (I, CE, O), and BUFGMUX (I0, I1, S, O) symbols]

Memory
On-Chip SelectRAM™ Memory: Terabit Memory Continuum
Tier               Size        Organization                  Typical uses
Distributed RAM    bytes       128x1, shallow/wide           DSP coefficients, small FIFOs, CAM
Block RAM          kilobytes   18 Kb blocks, deep/wide       Large FIFOs, packet buffers, video line buffers, cache tag memory, CAM
External RAM/CAM   megabytes   DDR & QDR, up to 400 Mbps/pin

Embedded 18 Kb Block RAM
• Up to 3 Mb of on-chip block RAM
  - High internal buffering bandwidth
  - Reduced I/O count and more embedded memory
• 18 Kbit block RAM
  - Parity bit locations (parity in/out buses)
  - Data width up to 36 bits
  - 3 WRITE modes
  - Output latches with set/reset
• True dual-port RAM
  - Independent clock (async.) & control

Distributed RAM
• A CLB LUT is configurable as distributed RAM
  - A LUT equals a 16x1 RAM
  - Implements single and dual ports
  - Cascade LUTs to increase RAM size
• Synchronous write
• Synchronous/asynchronous read
  - Accompanying flip-flops are used for synchronous read
Primitive   LUTs   Inputs                                   Outputs
RAM16X1S    1      D, WE, WCLK, A0-A3                       O
RAM32X1S    2      D, WE, WCLK, A0-A4                       O
RAM16X2S    2      D0, D1, WE, WCLK, A0-A3                  O0, O1
RAM16X1D    2      D, WE, WCLK, A0-A3, DPRA0-DPRA3          SPO, DPO

18 x 18 Embedded Multiplier
• Fast arithmetic functions
  - Optimized to implement multiply / accumulate modules
• 18 x 18 signed multiplier
  - Fully combinatorial
  - Optional registers with CE & RST (pipeline)
  - Independent from adjacent block RAM

18 x 18 Multiplier
• Embedded 18-bit x 18-bit multiplier
  - 2's complement signed operation
• Multipliers are organized in columns
[Figure: 18 x 18 multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and Output (36 bits)]

Basic I/O Block Structure
[Figure: IOB with three flip-flops (D, Q, EC, SR): a three-state FF on the three-state control path, an output FF on the output path, and an input FF on the input path, which provides both a direct input and a registered input. Signals: Three-State FF Enable, Output FF Enable, Input FF Enable, Clock, Set/Reset]

I/O Signal Types
• Single-ended: LVTTL, LVCMOS, HSTL, SSTL
• Differential: LVDS, Bus LVDS, LVPECL
NOTE: Only the popular I/O types are shown here

IOB: Double Data Rate Registers
• DDR registers can be clocked by
  - Clock and NOT(clock), if the duty cycle is 50/50
  - CLK0 and CLK180 DLL outputs
[Figure: timing diagram of CLK, DATA_1 (D1A, D1B, D1C), DATA_2 (D2A, D2B, D2C), and the dual-data-rate output that interleaves them]

Built-In HSTL II Support
What is the advantage of using HSTL Class II?
• High-speed I/O interface
• Bidirectional
• Double parallel termination
[Figure: transmission line with Zo = 50 Ω, terminated at each end with R = 50 Ω to Vtt = 0.75 V; Vref = 0.75 V]

Digitally Controlled Impedance
• Dynamically adjusted termination resistors
• Provides drivers that are matched to the impedance of the traces
• Provides on-chip termination
  - Transmitter or receiver
• On-chip termination advantages:
  - No termination resistors on board
  - Improves signal integrity by eliminating stub reflection
  - Eliminates the need for source termination (single-ended I/O)
  - Reduces board routing headaches and component count

Virtex-II Family: Four and Six Columns Block RAM & Multiplier
[Figure: device XC2V250]

Virtex-II Family Members
Device XC2V   CLB array   18 Kb BRAM   Multipliers   DCMs   Max IOBs
40            8 x 8       4            4             4      88
80            16 x 8      8            8             4      120
250           24 x 16     24           24            8      200
500           32 x 24     32           32            8      264
1000          40 x 32     40           40            8      432
1500          48 x 40     48           48            8      528
2000          56 x 48     56           56            8      624
3000          64 x 56     96           96            12     720
4000          80 x 72     120          120           12     912
6000          96 x 88     144          144           12     1,104
8000          112 x 104   168          168           12     1,296
Devices are grouped by 2, 4, or 6 columns of BRAM & multipliers.

Virtex-II Packaging
Packages: CS144, FG256, FG456, FG676, FF896, FF1152, FF1517, BG575, BG728, BF957
User I/Os by device (XC2V): max user I/Os first, then per-package counts
40:                 88; 88 88
80:                 120; 92 120
250:                200; 92 172 200
500, 1000:          264 432; 172 264 172 324 432
1500, 2000, 3000:   528 624 720; 392 528 456 624 484 720 328 392 408 456 624 516 684
4000, 6000, 8000:   912 (1296) 1104 1108; 824 912 824 1104 824 1108 684 684 684
• FF and BF are flip-chip ball grid array packages
• Pinout compatibility inside same color rectangle
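Worked example for the DCM frequency synthesis relation given on the Advanced Frequency Synthesis slide, Freq(CLKFX) = (M / D) x Freq(CLKIN), with M = 2 to 32 and D = 1 to 32. The sketch below is a rough helper, not a Xilinx tool: the function names (clkfx_mhz, best_m_d) are hypothetical, and it deliberately ignores the input and output frequency limits of the DFS, which depend on device, speed grade, and DFS_FREQUENCY_MODE and are not given in these slides.

```python
from fractions import Fraction

# Legal attribute ranges as stated on the slide.
M_RANGE = range(2, 33)  # CLKFX_MULTIPLY: 2 to 32
D_RANGE = range(1, 33)  # CLKFX_DIVIDE:   1 to 32


def clkfx_mhz(clkin_mhz, m=4, d=1):
    """CLKFX frequency after LOCKED; defaults are M=4, D=1 (4X CLKIN)."""
    if m not in M_RANGE or d not in D_RANGE:
        raise ValueError("M must be 2..32 and D must be 1..32")
    return Fraction(clkin_mhz) * m / d


def best_m_d(clkin_mhz, target_mhz):
    """Exhaustively pick (M, D) whose CLKFX is closest to the target.

    Ties are broken toward the smallest M, then the smallest D.
    Frequency limits of the real DFS are NOT checked (assumption).
    """
    target = Fraction(target_mhz)
    error, m, d = min(
        (abs(clkfx_mhz(clkin_mhz, m, d) - target), m, d)
        for m in M_RANGE
        for d in D_RANGE
    )
    return m, d, clkfx_mhz(clkin_mhz, m, d)


if __name__ == "__main__":
    print("default, 100 MHz in:", clkfx_mhz(100), "MHz")
    m, d, f = best_m_d(50, 125)
    print(f"50 MHz -> 125 MHz: M={m}, D={d}, CLKFX={f} MHz")
```

Exact fractions are used so that ratios such as 5/2 compare without floating-point rounding; any real design should still check the chosen M and D against the data sheet limits.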