ADC Board VHDL Firmware development for Mona Lisa
Roy Wastie

Overview
• Introduction
• ADC Board Hardware Blocks
• Basic FPGA Architectures
• Xilinx ISE 10.1
• Tool Flow
• USB
• Algorithm
• VHDL

Introduction
• Applications of FPGAs include digital signal processing, software-defined radio, aerospace and defense systems, ASIC prototyping, medical imaging, computer vision, speech recognition, cryptography, bioinformatics, computer hardware emulation, and glue logic for PCBs.

ADC Board Hardware Blocks
[Block diagram labels: External Clock & Trigger; 16 channel ADC; FIFO FPGA DAQ FPGA Memory controller; USB Interface; SDRAM Memory]

Basic FPGA Architectures
Overview
• All Xilinx FPGAs contain the same basic resources
  – Logic Resources
    • Slices (grouped into CLBs)
      – Contain combinatorial logic and register resources
    • Memory
    • Multipliers
  – Interconnect Resources
    • Programmable interconnect
    • IOBs
      – Interface between the FPGA and the outside world
  – Other resources
    • Global clock buffers
    • Boundary scan logic

Basic Building Block
Configurable Logic Block
• Slices contain logic resources and are arranged in two columns
• A switch matrix provides access to general routing resources
• Local routing connects slices in the same CLB and provides routing to neighboring CLBs
[Figure: Virtex-II CLB contains four slices (S0 to S3), with switch matrix, local routing, SHIFT, BUFTs, and CIN/COUT carry connections]

Basic Building Blocks
Simplified Slice Structure
• Each slice has four outputs
  – Two registered outputs, two non-registered outputs
  – Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
• Carry logic runs vertically, up only
  – Two independent carry chains per CLB
[Figure: Slice 0 with two LUTs, carry logic, and two registers (D, Q, CE, PRE, CLR)]

The Slice
Detailed Structure
• The next few slides discuss the slice features
  – LUTs
  – MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
  – Carry Logic
  – MULT_ANDs
  – Sequential Elements

Combinatorial logic
Boolean logic is
stored in Look-Up Tables (LUTs)
• Also called Function Generators (FGs)
• Capacity is limited by the number of inputs, not by the complexity
• Delay through the LUT is constant

  A B C D | Z
  0 0 0 0 | 0
  0 0 0 1 | 0
  0 0 1 0 | 0
  0 0 1 1 | 1
  0 1 0 0 | 1
  0 1 0 1 | 1
  . . .
  1 1 0 0 | 0
  1 1 0 1 | 0
  1 1 1 0 | 0
  1 1 1 1 | 1

Storage Elements
Can be implemented as either flip-flops or latches
• Two in each slice; eight in each CLB
• Inputs come from LUTs or from an independent CLB input
• Separate set and reset controls
  – Can be synchronous or asynchronous
• All controls are shared within a slice
  – Control signals can be inverted locally within a slice
[Figure: FDRSE_1 (D, S, Q, CE, R), FDCPE (D, PRE, Q, CE, CLR), and LDCPE (D, PRE, Q, CE, G, CLR) primitives]

Dedicated Logic
FPGAs contain built-in logic for speeding up logic operations and saving resources
• Multiplexer Logic
  – Connect Slices and LUTs
• Carry Chains
  – Speed up arithmetic operations
• Multiplier AND gate
  – Speed up LUT-based multiplication
• Shift Register LUT
  – LUT-based shift register
• Embedded Multiplier
  – 18x18 Multiplier

Multiplexer Logic
Dedicated MUXes are provided to connect slices and LUTs
• MUXF5 combines LUTs in each slice
• MUXF6 combines slices S0 and S1
• MUXF6 combines slices S2 and S3
• MUXF7 combines the two MUXF6 outputs
• MUXF8 combines the two MUXF7 outputs (from the CLB above or below)
[Figure: CLB with slices S0 to S3 and their F5, F6, F7, and F8 multiplexers]

Carry Chains
Dedicated carry chains speed up arithmetic operations
• Simple, fast, and complete arithmetic logic
  – Dedicated XOR gate for single-level sum completion
  – Uses dedicated routing resources
  – All synthesis tools can infer carry logic
[Figure: first and second carry chains through slices S0 to S3 of a CLB, with CIN inputs and COUT outputs labelled "To S0 of the next CLB" and "To CIN of S2 of the next CLB"]

Multiplier AND Gate
Speeds up LUT-based multiplication
• Highly efficient multiply and add implementation
  – Earlier FPGA architectures require two LUTs per bit to perform the multiplication and
addition
  – The MULT_AND gate enables an area reduction by performing the multiply and the add in one LUT per bit
[Figure: LUT with inputs A and B, MULT_AND gate producing AxB, and carry logic CY_MUX (S, DI, CI, CO) and CY_XOR]

Shift Register LUT (SRL16CE)
The shift register LUT avoids the need for dedicated registers
• Dynamically addressable serial shift registers
  – Maximum delay of 16 clock cycles per LUT (128 per CLB)
  – Cascadable to other LUTs or CLBs for longer shift registers
• Dedicated connection from Q15 to D input of the next SRL16CE
  – Shift register length can be changed asynchronously by toggling address A
[Figure: SRL16CE drawn as a chain of registers inside a LUT, with D, CE, CLK, address A[3:0], output Q, and Q15 (cascade out)]

Embedded Multiplier Blocks
Avoid using LUTs to implement multiplications, and increase performance
• 18-bit two's complement signed operation
• Optimized to implement Multiply and Accumulate functions
• Multipliers are physically located next to block SelectRAM™ memory
[Figure: 18 x 18 Multiplier with Data_A (18 bits) and Data_B (18 bits) inputs and a 36-bit output; 4 x 4, 8 x 8, 12 x 12, and 18 x 18 signed operation]

IOB Element
Connects the FPGA design to external components
• Input path
  – Two DDR registers
• Output path
  – Two DDR registers
  – Two 3-state enable DDR registers
• Separate clocks and clock enables for I and O
• Set and reset signals are shared
[Figure: IOB with input registers (ICK1, ICK2), 3-state and output register pairs (OCK1, OCK2) feeding DDR MUXes, and the PAD]

Distributed RAM
Uses a LUT in a slice as memory
• Synchronous write
• Asynchronous read
  – Accompanying flip-flops can be used to create synchronous read
• RAM and ROM are initialized during configuration
  – Data can be written to RAM after configuration
• Emulated dual-port RAM
  – One read/write port
  – One read-only port
• 1 LUT = 16 RAM bits
[Figure: slice LUTs used as RAM16X1S (D, WE, WCLK, A0 to A3, O), RAM32X1S (D, WE, WCLK, A0 to A4, O), and RAM16X1D (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO)]

Block RAM
Embedded blocks of RAM arranged in columns
• Up to 3.5 Mb of RAM in 18-kb blocks
  – Synchronous read
and write
• True dual-port memory
  – Each port has synchronous read and write capability
  – Different clocks for each port
• Supports initial values
• Synchronous reset on output latches
• Supports parity bits
  – One parity bit per eight data bits
• Situated next to embedded multiplier
[Figure: 18-kb block SelectRAM memory with port A (DIA, DIPA, ADDRA, WEA, ENA, SSRA, CLKA, DOA, DOPA) and port B (DIB, DIPB, ADDRB, WEB, ENB, SSRB, CLKB, DOB, DOPB)]

Global Routing
• Sixteen dedicated global clock multiplexers
  – Eight on the top-center of the die, eight on the bottom-center
  – Driven by a clock input pad, a DCM, or local routing
• Global clock multiplexers provide the following:
  – Traditional clock buffer (BUFG) function
  – Global clock enable capability (BUFGCE)
  – Glitch-free switching between clock signals (BUFGMUX)
• Up to eight clock nets can be used in each clock region of the device
  – Each device contains four or more clock regions

Digital Clock Manager (DCM)
• Up to twelve DCMs per device
  – Located on the top and bottom edges of the die
  – Driven by clock input pads
• DCMs provide the following:
  – Delay-Locked Loop (DLL)
  – Digital Frequency Synthesizer (DFS)
  – Digital Phase Shifter (DPS)
• Up to four outputs of each DCM can drive onto global clock buffers
  – All DCM outputs can drive general routing

The Spartan-3 Family
Built for high volume, low-cost applications
• 18x18 bit Embedded Pipelined Multipliers for efficient DSP
• Configurable 18K Block RAMs + Distributed RAM
• Up to eight on-chip Digital Clock Managers to support multiple system clocks
• 4 I/O Banks, Support for all I/O Standards including PCI, DDR333, RSDS, mini-LVDS
[Figure: Spartan-3 device with I/O Bank 0, Bank 1, Bank 2, and Bank 3]

Spartan-3 Family
Based upon Virtex-II Architecture – Optimized for Lower Cost
• Smaller process = lower core voltage
  – .09 micron versus .15 micron
  – Vccint = 1.2V versus 1.5V
• Logic resources
  – Only one-half of the slices support RAM or SRL16s (SLICEM)
  – Fewer block RAMs and multiplier blocks
• Clock Resources
  – Fewer global clock multiplexers and DCM blocks
• I/O Resources
  – Fewer pins per package
  – No internal 3-state buffers
  – Support for different standards
    • New standards: 1.2V LVCMOS, 1.8V HSTL, and SSTL
    • Default is LVCMOS, versus LVTTL

SLICEM and SLICEL
• Each Spartan™-3 CLB contains four slices
  – Similar to the Virtex™-II
• Slices are grouped in pairs
  – Left-hand SLICEM (Memory)
    • LUTs can be configured as memory or SRL16
  – Right-hand SLICEL (Logic)
    • LUT can be used as logic only
[Figure: Spartan-3 CLB with left-hand SLICEM and right-hand SLICEL pairs (slices X0Y0, X0Y1, X1Y0, X1Y1), switch matrix, fast connects, SHIFTIN/SHIFTOUT, and CIN/COUT]

Xilinx Tool Flow
Xilinx Design Flow
[Figure: design flow. Plan & Budget; Create Code/Schematic; HDL RTL Simulation; Synthesize to create netlist; Functional Simulation; Implement (Translate, Map, Place & Route); Attain Timing Closure; Timing Simulation; Generate BIT File; Configure FPGA]

Synthesis
Generate a netlist file
• After writing your HDL code, you need a tool to generate a netlist (NGC or EDIF)
  – Xilinx Synthesis Technology (XST) included
  – Support for popular third-party synthesis tools: Synplify, Leonardo Spectrum

Implementation
Process a netlist file
• Consists of three phases
  – Translate: Merge multiple design files into a single netlist
  – Map: Group logical symbols from the netlist (gates) into physical components (slices and IOBs)
  – Place & Route: Place components onto the chip, connect the components, and extract timing data into reports
• Access Xilinx reports and tools at each phase
  – Timing Analyzer, Floorplanner, FPGA Editor, XPower
[Figure: netlist generated from synthesis feeding Implement (Translate, Map, Place & Route)]
Configuration
• Once a design is implemented, you must create a file that the FPGA can understand
  – This file is called a bitstream: a BIT file (.bit extension)
• The BIT file can be downloaded
  – Directly into the FPGA
    • Use a download cable such as Platform USB
  – To an external memory device such as a Xilinx Platform Flash PROM
    • Must first be converted into a PROM file

ISE Project Navigator
Xilinx ISE Foundation is built around the Xilinx Design Flow
• Enter Designs
• Access to synthesis tools
  – Including third-party synthesis tools
• Implement your design with a simple double-click
  – Fine-tune with easy-to-access software options
• Download
  – Generate a bitstream
  – Configure FPGA

Synthesizing Designs
Generate a netlist file using XST (Xilinx Synthesis Technology)
1. Highlight HDL Sources
2. Double-click to Synthesize
Synthesis Processes and Analysis
• Access report
• View Schematics (RTL or Technology)
• Check syntax
• Generate Post-Synthesis Simulation Model

The Design Summary
Displays Design Data
• Quick View of Reports, Constraints
• Project Status
• Device Utilization
• Design Summary Options
• Performance and Constraints
• Reports

Outline
• Overview
• ISE
• Summary
• Lab 1: Xilinx Tool Flow

USB
USB2
• Peer to Peer.
• Host computer is master.
• 480 Mbit/s signalling rate; 53.24 MB/s theoretical.
• 30 MB/s readily achievable in Bulk transfer mode.
• Speeds: USB 1.0 Low and Full; USB2 High.
• Hot Plug.
• Peripheral electronics can be relatively simple and inexpensive.
• Power: 500 mA from the bus.
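
The theoretical figure quoted above can be reproduced with a short calculation when it is read in megabytes per second. This is a sketch that assumes the USB 2.0 high-speed limit of 13 bulk packets of 512 bytes per 125 µs microframe (8000 microframes per second); the slides do not state where the number comes from.

```latex
\begin{align*}
\text{raw signalling rate} &= \frac{480\ \text{Mbit/s}}{8\ \text{bit/byte}} = 60\ \text{MB/s} \\
\text{bulk payload rate} &= 13 \times 512\ \text{bytes} \times 8000\ \text{microframes/s} \\
  &= 53\,248\,000\ \text{bytes/s} = 53.248\ \text{MB/s}
\end{align*}
```

The difference between the two results is protocol overhead such as token and handshake packets.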
USB Data Travels in Packets
• Identified by "Packet ID" (PID)
• Token packet tells what's coming
• Data packets deliver bytes
• Handshake packets report success or otherwise

USB Packets
A Control Write Transfer

  Stage                 Token Packet                     Data Packet                 H/S Pkt
  Setup Stage           SYNC, SETUP, ADDR, ENDP, CRC5    SYNC, DATA0, Data, CRC16    SYNC, ACK
  Data Stage            SYNC, OUT, ADDR, ENDP, CRC5      SYNC, DATA1, Data, CRC16    SYNC, ACK
  Data Stage (cont'd)   SYNC, OUT, ADDR, ENDP, CRC5      SYNC, DATA0, Data, CRC16    SYNC, ACK
  Data Stage (cont'd)   SYNC, OUT, ADDR, ENDP, CRC5      SYNC, DATA1, Data, CRC16    SYNC, ACK
  Status Stage          SYNC, IN, ADDR, ENDP, CRC5       SYNC, DATA1, CRC16          SYNC, ACK

USB2 Controller
• EZ-USB FX2LP(TM) USB Microcontroller: High-Speed USB Peripheral Controller
• Integrated 8051 Microprocessor.
• Code/Data Downloaded via USB, or EEPROM.
• Many Integrated Peripherals.

Simple Algorithm
• Sample data at full rate, 2.77 MS/s (16 channels)
• Down-convert data by a factor of 4
• Write data to the USB interface at 21.19 MB/s

VHDL
VHDL Example
An example of a two-input XNOR gate is shown below.

    -- std_logic is declared in the IEEE std_logic_1164 package
    library IEEE;
    use IEEE.std_logic_1164.all;

    entity XNOR2 is
      port (A, B: in std_logic;
            Z: out std_logic);
    end XNOR2;

    architecture behavioral_xnor of XNOR2 is
      -- signal declaration (of internal signals X, Y)
      signal X, Y: std_logic;
    begin
      X <= A and B;              -- '1' when both inputs are '1'
      Y <= (not A) and (not B);  -- '1' when both inputs are '0'
      Z <= X or Y;               -- '1' when the inputs are equal
    end behavioral_xnor;
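
The XNOR gate above can be exercised in a simulator with a small testbench. The following is a minimal sketch, not part of the original design: the names XNOR2_tb, sim, dut, and stimulus are hypothetical, and it assumes the XNOR2 entity from the example has been compiled into library work with the IEEE std_logic_1164 package visible. It applies all four input combinations and compares Z against the built-in xnor operator (VHDL-93).

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical testbench entity; a testbench has no ports.
entity XNOR2_tb is
end XNOR2_tb;

architecture sim of XNOR2_tb is
  signal A, B, Z : std_logic := '0';
begin
  -- Device under test: the XNOR2 entity from the example,
  -- assumed to be compiled into library "work".
  dut : entity work.XNOR2
    port map (A => A, B => B, Z => Z);

  stimulus : process
    type vec_array is array (natural range <>) of std_logic_vector(1 downto 0);
    -- All four combinations of (A, B).
    constant inputs : vec_array := ("00", "01", "10", "11");
  begin
    for i in inputs'range loop
      A <= inputs(i)(1);
      B <= inputs(i)(0);
      wait for 10 ns;  -- let the combinatorial outputs settle
      -- XNOR is '1' exactly when both inputs are equal.
      assert Z = (A xnor B)
        report "XNOR2 mismatch for A=" & std_logic'image(A) &
               " B=" & std_logic'image(B)
        severity error;
    end loop;
    report "XNOR2 testbench finished" severity note;
    wait;  -- suspend forever so the simulation can end
  end process stimulus;
end sim;
```

If the gate is correct, no error-severity assertion should fire and only the final note is reported. Checking against the xnor operator rather than a hand-written table keeps the expected values independent of the and/or decomposition used inside the architecture.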