Recent Progress in Field Programmable Gate Arrays: Hardware, CAD software, evaluation boards, and reconfigurable computing Marek Perkowski Chengdu, June 2008 Programmable Logic The simplest programmable logic devices are PALs (see 22V10 figure next page). PLD – my students use them in their first year at PSU – Programmable Logic Devices What is the next step in the evolution of PLDs? More gates! How do we get more gates? We could put several PALs on one chip and put an interconnection matrix between them!! This is called a Complex PLD (CPLD). 22V10 PLD Cypress CPLD Programmable interconnect matrix. Each logic block is similar to a 22V10. Any other approaches? Another approach to building a “better” PLD is place a lot of primitive gates on a die, and then place programmable interconnect between them: FPGA Technology 1. Bird’s Eye View of FPGA Technology 2. FPGAs in 2004: Virtex-4 Introduction 3. Software and Design 4. Special Problems and Solutions Field Programmable Gate Arrays The FPGA approach to arrange primitive logic elements (logic cells) arrange in rows/columns with programmable routing between them. What constitutes a primitive logic element? Lots of different choices can be made! Primitive element must be classified as a “complete logic family”. • A primitive gate like a NAND gate • A 2/1 mux (this happens to be a complete logic family) • A Lookup table (I.e, 16x1 lookup table can implement any 4 input logic function). Often combine one of the above with a DFF to form the primitive logic element. Other FPGA features Besides primitive logic elements and programmable routing, some FPGA families add other features Embedded memory Many hardware applications need memory for data storage. Many FPGAs include blocks of RAM for this purpose Dedicated logic for carry generation, or other arithmetic functions Phase locked loops for clock synchronization, division, multiplication. Altera Flex 10K FPGA Family Altera Flex 10K FPGA Family (cont) Dedicated memory 16 x1 LUT DFF Emedded Array Block Memory block, Can be configured: 256 x 8, 512 x 4, 1024 x 2, 2048 x 1 EPROM/EEPROM Technology EPROM can be reprogrammed, no need for external storage. EPROM can not be re-programmed in circuit. EEPROM can be re-programmed in circuit. EEPROM consumes 2X more area as EPROM. Erasable PLD (EPLD) SOP-based PAL Logic array In, Out, bidirection Registers I/Os Configured to D, T, JK, SR FFs. Programmable clock to each FF. Programming the FPGA Configuration. Readback - design verification and debugging. Security - a security-bit to prevent readback. Advantages and Disadvantages of FPGA Fast turnaround. Low NRE (non-recurring engineering) changes. Low risk. Effective design verification. low testing cost. Chip size & cost. Slow speed. CPLD versus FPGA CPLD Interconnect style Architecture and timing Software compile times In-system performance Power consumption Applications addressed Continuous Predictable Short Fast High Combinational and registered logic FPGA Segmented Unpredictable Long Moderate Moderate Registered logic only Source: Altera FPGAs What? - Programmable logic + programmable routing = FPGAs. Why? - Zero NREs, easy bug fixes, and short time-to-market. How? Comparison of Different Design Technologies Design time Fabrication Chip area Design cost Unit cost Design cycle Custom Std Cells Gate Arrays Long Short Short Long Long Short Small Med. Large High Med. Low Low Low Med. Long Med. Short FPGAs Short None Very large Very low High Very short Emerging FPGA-based Applications Low-volume production. Urgent time-to-market competition. Rapid prototyping. Logic emulation. Custom-computing hardware. Reconfigurable computing. Design Considerations Target architecture. Fixed logic and routing resources. Fixed I/O pins. Slow signal delays. FPGA Selection Criteria Density. Speed. Price. Flexibility. COSTS of Technologies Lower Cost Moore’s Law is alive Smaller geometries and larger wafers and lower defect density (=higher yield ) continue to achieve lower cost per function LUT + flip-flop: $1.- in 1990, $ 0.002 in 2003 State-of-the-art: 90 nm on 300 mm wafers Spartan-3 uses this technology for lowest cost Rapid price reductions, intense competition Changing costs of FPGAs and technologies More Logic and Better Features: >100,000 LUTs & flip-flops >200 BlockRAMs, and the same number of 18 x 18 multipliers 1156 pins (balls) with > 800 GP I/O 50 I/O standards, incl. LVDs with internal termination 16 low-skew global clock lines Multiple clock management circuits On-chip microprocessor(s) and Gbps transceivers Gate count is really a meaningless metric A Bird’s Eye View… Higher Speed Smaller and faster transistors 90 nm technology, using 193 nm ultra-violet light Cu interconnect ( instead of Al ) was easily achieved Low-K dielectric progress is disappointing System speed: up to 500 MHz, Mainly through smart interconnects, clock management, dedicated circuits, flexible I/O. Integrated transceivers running at 10 Gigabits/sec Speeding up general-purpose logic is getting difficult A Bird’s Eye View… Better tools Back-End Place&Route and XST synthesis VHDL and Verilog becoming entry point IP/Cores speed up design and verification Embedded Software Development Tools support architectures and merge HW and SW Domain-Specific Languages System Generator bridges the gap between Matlab/Simulink and FPGA circuit description ASIC-size FPGAs need ASIC-like tools ASICs Are Losing Ground Mask set >$1M + design + verification + risk Source:IBM ASICS are only for extreme designs: Extreme volume, speed, size, low power SPGA Allow multiple building blocks. Logic. Memory. Data path. Applications Using SPGAs Intellectual property (IP). Communication & networking. Graphical processing. Embedded processing. Designing with SPGAs A team-based approach. Understanding how to use SPGA system features will be the key to pulling the entire design into a single device. CMOS PLD Market Share 31% 5% 5% 6% 24% 11% 3% Other Cpress AT&T Actel Lattice AMD Altera Xilinx 15% Source:dataquest CMOS Logic Market 8% 14% 10% 30% 9% Std logic Programmable GA Std cell Custom Chipset 29% But the market share is growing Source:dataquest FPGAs Growth 2500 2000 1500 M USD 1000 Milions US dollars 500 0 1996 1997 1998 1999 2000 Source: Integrated Circuit Engineering CMOS Programmable-logic Market 5 4 3 B USD 2 Billions US dollars 1 0 1997 1998 1999 2000 Source:dataquest Rapid Prototyping What? Why? How? What is prototyping? Basic components: FPGAs and FPICs. Hardware : boards, boxes, and cabinets. Software: methodologies and CAD tools. Field Programmable Gate Arrays Field Programmable Interconnect Devices Product Development Cycle Market survey Customer acceptance Product development Production Pressures on Today’s Product Development Time-to-market! Design complexity! Why Needs Prototyping? Design verification. Limited production. Concurrent engineering. This requires cooperation of engineers, computer science specialists and marketing Design Verification Specification Functionality & requirements ? Final product Final functionality & performance Design Process Specification System-level design RTL design Logic-level design Physical-level design Final chips Simulation Fast prototyping Formal verification Logic emulation Formal verification is just one of options Verification Alternatives Modeling System Prepare accuracy integration time Speed Event Driven Simulation High No Short Slow Cycle-Based Simulation Med. No Short Med. Behavioral Simulation Low No Short Med. Hardware Accelerated Sim Varies No Med. Med. Fast Breadboarding Med. Yes Long Very Fast Emulation or Prototyping Med. Yes Med. Very Fast A Minute in the Life of a 100K Gates Design One minute 1 --------- Actual hardware at 50MHz 10 -------- Logic emulator or prototype at 5MHz 100------2K-------- HW accelerator at 250M evals/sec 1 Mon. 50K------- Cycle-based simulator at 1K insts/sec 3 Mon. 120K----- Compiled-code logic simulator at 125MIPs 1.5 Yr. 800K----- Event-driven logic simulator at 125 MIPs ten minutes We need FPGA emulation because simulation is too slow Development with Prototyping small gap SW Design Code HW Design Build CHIP Design Fab Big gap Integration Integration Debug Debug Debug Development with Prototyping You speed up development through parallelism SW HW CHIP Design Design Design Integration Code System & SW Debug Build HW Integration & Debug Chip debug Fab Final Integration How to Develop a Prototyping using FPDs Custom-designed prototyping board. Logic-emulation systems. Field-programmable printed-circuit-boards. Field Programmable Devices FPGA State of the Art 2004 90-nanometer manufacturing technology Ten Gigahertz serial I/O (SerDes) in silicon 0.07 femtosecond asynchronous data capture window causes 1.5 ns metastable delay Issues in FPGA Technologies Complexity of Logic Element How many inputs/outputs for the logic element? Does the basic logic element contain a FF? What type? Interconnect How fast is it? Does it offer ‘high speed’ paths that cross the chip? How many of these? Can I have on-chip tri-state busses? How routable is the design? If 95% of the logic elements are used, can I route the design? More routing means more routability, but less room for logic elements Issues in FPGA Technologies (cont) Macro elements Are there SRAM blocks? Is the SRAM dual ported? Is there fast adder support (i.e. fast carry chains?) Is there fast logic support (i.e. cascade chains) What other types of macro blocks are available (fast decoders? register files? ) Clock support How many global clocks can I have? Are there any on-chip Phase Logic Loops (PLLs) or Delay Locked Loops (DLLs) for clock synchronization, clock multiplication? Issues in FPGA Technologies (cont) What type of IO support do I have? TTL, CMOS are a given Support for mixed 5V, 3.3v IOs? 3.3 v internal, but 5V tolerant inputs? Support for new low voltage signaling standards? GTL+, GTL (Gunning Tranceiver Logic) - used on Pentium II HSTL - High Speed Transceiver Logic SSTL - Stub Series-Terminate Logic USB - IO used for Universal Serial Bus (differential signaling) AGP - IO used for Advanced Graphics Port Maximum number of IO? Package types? Ball Grid Array (BGA) for high density IO Altera FPGA Family Summaries Now we discuss some popular families Altera Flex10K/10KE LEs (Logic elements) have 4-input LUTS (look-up tables) +1 FF Fast Carry Chain between LE’s, Cascade chain for logic operations Large blocks of SRAM available as well Altera Max7000/Max7000A EEPROM based, very fast (Tpd = 7.5 ns) Basically a PLD architecture with programmable interconnect. Max 7000A family is 3.3 v Xilinx FPGA Family Summaries Virtex Family SRAM Based Largest device has 1M gates Configurable Logic Blocks (CLBs) have two 4-input LUTS, 2 DFFs Four onboard Delay Locked Loops (DLLs) for clock synchronization Dedicated RAM blocks (LUTs can also function as RAM). Fast Carry Logic XC4000 Family Previous version of Virtex No DLLs, No dedicated RAM blocks Actel FPGA Family Summaries MXDS Family Fine grain Logic Elements that contain Mux logic + DFF Embedded Dual Port SRAM One Time Programmable (OTP) - means that no configuration loading on powerup, no external serial ROM AntiFuse technology for programming (AntiFuse means that you program the fuse to make the connection). Fast (Tpd = 7.5 ns) Low density compared to Altera, Xilinx - maximum number of gates is 36,000 Cypress CPLDs Ultra37000 Family 32 to 512 Macrocells Fast (Tpd 5 to 10ns depending on number of macrocells) Very good routing resources for a CPLD Evolution 1965 1980 1995 2010(?) Max Clock Rate (MHz) 1 10 100 1000 Min IC Geometries (µ) - 5 0.5 0.05 # of IC Metal Layers 1 2 3 12 2000 500 100 25 1-2 2-4 4-8 10-20 PC Board Trace Width (µ) # of PC-Board Layers Every 5 years: System speed doubles, IC geometry shrinks 50% Every 7-8 years: PC-board min trace width shrinks 50% The Ever-Shrinking Circuitry Number of LUTs + flip-flops + routing that fit on the cross section of a human hair 2000 2002 2004 2005 2 LUTs in Virtex-II (150 nm) 3 LUTs in Virtex-IIPro (130 nm) 4 LUTs in Virtex-4 (90 nm) 8 LUTs = one CLB in 65 nm Moore’s law is alive and well in FPGAs Middle-of-the-Road Xilinx FPGAs 1990 1994 1998 2000 2002 2004 2005 XC3042 XC4005 XC4013XL XCV300 XC2V1000 XC2VP30 XC4V60-LX 288 512 1,152 6,144 10,240 27,382 53,248 LUTs + flip-flops LUTs + flip-flops LUTs + flip-flops LUTs + flip-flops LUTs + flip-flops LUTs + flip-flops LUTs + flip-flops Same price for each: One day’s engineering salary Thirteen Years of Progress of Xilinx Devices 200x More Logic plus memory, µP, DSP, MGT 40x Faster 50x Lower Power per function x MHz 500x Lower Cost per function Moore Meets Einstein 2048 1024 Trace Length in cm per 1/4 clock period 512 256 128 64 32 16 Clock Frequency in MHz 8 4 2 1 ’65 ’70 ’75 ’80 ’85 ’90 ’95 ’00 ’05 ’10 Year Speed Doubles Every 5 Years… ...but the speed of light never changes FPGAs in 2003 1000 to 80,000 LUTs and flip-flops, millions of bits in dual-ported RAMs Low-skew Global Clocks, Frequency synthesis, 50 ps phase control 18 Kbit BlockRAMs and 18 x 18 multipliers FPGAs are not glue-logic anymore FPGAs in 2003 1000 to 80,000 LUTs and flip-flops, millions of bits in dual-ported RAMs Low-skew Global Clocks, Frequency synthesis, 50 ps phase control 18 Kbit BlockRAMs and 18 x 18 multipliers FPGAs are not glue-logic anymore FPGAs in 2003 300+ MHz system clock, 800 MHz I/O 3+ Gigabit transceivers Embedded hard and soft microprocessors Design security: Triple-DES encryption VHDL/Verilog entry, synthesis, auto place and route FPGAs are a compelling alternative to ASICs FPGAs in 2004 Virtex-4 in September 2004 4th Generation Advanced Logic ASMBL™ Column-Based Architecture 500 MHz SmartRAM™ BRAM/FIFO Integrated 450 MHz PowerPC Cores 0.6 - 11.1 Gbps RocketIO™ Integrated Tri-Mode Ethernet MAC Cores Integrated System Monitor 500 MHz Xesium™ Clocking 500 MHz Xtreme DSP™ Slice SelectIO with ChipSync™ Technology: - 1 Gbps LVDS - 600 Mbps SE New ASMBL™ Columnar Architecture Enables “Dial-In” Resource Allocation Mix Logic, DSP, BRAM, I/O, MGT, DCM, PowerPC Made possible by Flip-Chip Packaging I/O Columns Distributed throughout the Device FPGA Innovation: Virtex-4 90 nm technology, triple-oxide, 1.2-V Vccint supply General-purpose I/O up to 1 Gbps, Vcco=1.5, 2.5, or 3.3-V 0.6 to 11.2 Gigabit/sec RocketI/O transceivers Advanced Silicon Modular Block architecture Three sub-families: V4-LX for logic-intense applications V4-SX for DSP-intensive applications V4-FX with PPC micros and multi-gigabit transceivers Common architecture for diverse applications FPGA Innovation: Virtex-4 Higher Performance: 500 MHz for all sub-blocks More Versatility New innovative functions Higher Level of Integration More LUTs, flip-flops, RAMs, multipliers Lower Cost Smaller area = lower cost per function Lower Power per ( Function times MHz ) FPGA Innovation: Virtex-4 Flip-chip packaging: lower pin-inductance, stiffer Vcc distribution Lower power per function and MHz Triple-oxide gates, multiple thresholds, smaller size, lower Vcc, better design Better clocking, less skew, more flexibility Better configuration control, partial reconfiguration Robust configuration cell, SEU tolerant like 130 nm FPGA Innovation: Virtex-4 Improved I/O Flexibility and Performance Supports >50 standards, on-chip termination Source-synchronous and system-synchronous Serializer/deserializer behind each pin Programmable delay available for each pin > 1Gbps SelectI/O on each pin >10 Gbps transceivers on dedicated pins (-FX family only) Source-synchronous I/O improves performance Serial I/O saves pins and pc-board area FPGA Innovation: Virtex-4 Faster logic and memory 500+ MHz operation of all on-chip functions 32-bit arithmetic 48-bit adders and synchronous loadable counters Up to 72-bit wide memory 4- to 36-bit wide FIFO control in each BlockRAM Operates with fully independent write and read clocks Reliable EMPTY and FULL outputs also ALMOST Empty and ALMOST Full FIFOs need no fabric resources and no design expertise Advanced Clocking Proper clocking is extremely important for performance and reliability Most design need many global clock lines with minimal clock delay and clock skew Digital Clock Manager (DCM) provides: Four-phase outputs, Frequency multiplication and division Fine phase adjustment Advanced I/O >50 Different Output Standards (strength, voltage, input threshold, etc) multiple parallel output transistors which are either fully on or fully off, Nothing is ever analog, except in LVDS Digitally Controlled Impedance =DCI for series-termination of transmission-line drivers Adjusts up/down strength to be = external resistor One external pull-up and pull-down resistor per bank V2Pro and Virtex-4 can “update-only-if-necessary” System Synchronous System-Synchronous when the clock arrives “simultaneously” at all chips typically used below 200 MHz clock rate On-chip clock distribution DCM Zero clock delay controls set-up time, and avoids hold time requirements The traditional design methodology Source Synchronous Each data bus has its own clock trace typically used at 200 to 800 MHz clock rate On-chip clock-distribution DCM centers the clock in the data eye Adds more unidirectional-only clock lines The only way above 300 MHz Serial Transceiver Technology 3.125 Gbps over each pair 32b @ 78 MHz Virtex-II Pro 32b @ 78 MHz Virtex-II Pro Serial Transceiver Technology Up to 11.1 Gbps over each pair 64b @ 168 MHz 64b @ 168 MHz Virtex-4 Virtex-4 RocketIO™ Multi-Gigabit Transceiver 8 to 24 per device TXDATA FIFO Encode 8-64b Wide Transmitter 78MHz to 700MHz Serializer Transmit Buffer Serial Out TX Clock Generator Programmable Features: 16X/20X Multiplier REFCLK Receiver Comma Detect and Word Alignment RXDATA Elastic 8-64b Wide Buffer RX Clock Generator Decode DeSerializer 622 Mb/s – 11.1 Gb/s Receive Buffer Serial In 64b/66b or 8b/10b EnDec Comma Detect Rx and Tx FIFO Pre-Emphasis Receiver Equalization Output Swing On-Chip Termination Channel bonding AC & DC Coupling Virtex-4 Capabilities Any type of design runs at >400 MHz Pipelining provides extra performance “for free” Synchronous is best, but 32 clock are available Gigabit serial saves pins and board area On-chip termination for board signal integrity I/O features support double-data rate operation and source-synchronous design Virtex-4 Capabilities Popular functions are hard-wired for lower cost, higher performance, and ease-of-use: microprocessors, FIFOs, serial I/O, clock management, etc. Many pre-tested soft cores are available Some are free, some for a fee One-hot state machines are preferred But MicroBlaze and PicoBlaze may be better Massive parallelism enhances DSP, Up to 1024 fast two’s complement multipliers per chip, faster than dedicated DSP chips, but needs system-rethinking 2004 Challenges Technology moves rapidly: 130, 90, 65 nm Multiple Vcc, lower voltage - higher current Lower Vcc makes decoupling very critical Moore’s law becomes more difficult to sustain Leakage current has increased significantly Triple-oxide transistors and clever design provide relief Signal integrity on pc-boards is crucial “homebrew” prototyping would waste money and time Use Standard Evaluation Boards Instead