Logic Emulation and Prototyping: It’s the Interconnect (Rent rules)
Mike Butts
NVIDIA
RAMP at Stanford, August 2010

In the beginning
• I’ve always been a computer architect.
• Before the ASIC (early 1980’s) we built computers with off-the-shelf chips.
  – Am2901 bit slices, PALs, 7400 logic. Just hook up some parts and run it now.
• Full-speed wire-wrapped prototypes.
• When it ran it shipped.
• Design Verification: It doesn’t crash.
• Debug visibility: scope, maybe LA.
• Design revision: wire-wrap gun.
• Project time: months, not years.
Example: Kurzweil 1978
  – Nova clone for Kurzweil Reading Machine
  – 2901s, 74F TTL, 16Kb DRAMs, 4 MHz clock
  – When the prototype ran the reading machine app for three days without crashing, I released the design to manufacturing.

Mike Butts - RAMP - August, 2010

Then came the ASIC Tapeout
• Must get the design perfect before tapeout
• Emergence of EDA, design capture, logic simulation: “Daisy/Mentor/Valid”
• Simulation is very slow, must write testbenches, can’t run the real app.
• This makes the design process very conservative. Crimps architect’s style.
• To me EDA has always been a bit of a video game.

FPGAs Emerge!
• Real hardware! We can prototype again!
• But simulators are automatic, and FPGA tools are strange and hard.
• What if we had an automatic box of FPGAs that plugs into an ASIC socket? Emulate!
• Many FPGAs are needed. How to interconnect? Extend the row-column FPGA architecture:
XC2064 FPGA, 64 CLBs, 1986
Sample, US 5,109,353, 1992

First Logic Emulator Product
• Quickturn RPM: 1989
• Nearest-neighbor interconnect
• Hard to get expected logic capacity, hard to manage delays.
• But it worked!
Sample, US 5,109,353, 1992

First big success: Intel P5
• Quickturn worked closely with Intel to emulate the original Pentium microarchitecture: P5.
  – Ten RPM systems were cabled together, and the design was manually broken up into RPM-sized segments which were emulated.
• “The emulator had one more benefit: blunting the spread of RISC. At a technology forum for PC companies and software developers last November (1991), (Intel VP Albert Yu) dialed it up and ran a Lotus 1-2-3 spreadsheet from a terminal. The crowd was astonished that a model was already working. Six months later, Compaq Computer Corp. scrubbed its plans for a RISC-based PC.”
  - Business Week 6/1/1992 “Inside Intel”

But row/column doesn’t scale
• Logic circuit topology is not flat, 2D nearest-neighbor. Wires go anywhere.
• FPGA pins get used up by nets that are just passing through. Long delays.
• Quickturn RPM had serious capacity, placement and routing issues.
• It turns out the wires and pins of an FPGA are its most precious resource.
  – 80-90% of FPGA transistors are interconnect.
  – “We charge for the wires, the gates are free” -- Altera VP Eng. Clive McCarthy, 1994
• Logic density follows Moore’s Law, but packaging and pin counts do not.
  – Not even the square root (perimeter).
• Logic emulators inevitably outstripped FPGA pin counts. Why???

Rent’s Rule
• The problem of how many pins to provide for each partition of a system came up in the IBM 1401 project, 1960.
• Ed Rent found this empirical rule for the relationship between pins per logic block and the number of gates in the block:
  \(p = K g^{r}\)
  where p = pins, g = gates, r is the “Rent exponent”, and K is the “Rent constant”.

Rent’s Rule
• IBM 1401 used a Standard Modular System (SMS) of logic modules, backplanes and chassis, with standard pin counts. How to size? Rent’s Rule.
• Rent never published, but in 1971 Landman and Russo did.
  B. S. Landman, R. L. Russo, “On a Pin Versus Block Relationship For Partitions of Logic Graphs”, IEEE Trans. Comp., vol. C-20, 1971.
• Profound influence on system architecture and CAD/EDA tools.
• Different Rent coefficients apply to different environments.
• Empirical. Theory? Inconclusive.
  – Exponent > 0.5: global connectivity.
  – Constant > 1: net fanout.
• Rent’s Rule guided FPGA emulation system architecture. We used \(p = 2.5\, g^{0.57}\)
IEEE Solid-State Circuits magazine, winter 2010

Emulators: Big Green Button
• A logic emulator is automatic and universal.
• It takes any arbitrary netlist and implements it in standard hardware, with little or no user intervention.
• Uniform hardware, uniform-size FPGAs. Design netlist is cut arbitrarily into many equal partitions to keep the chips full.
  – Balanced k-way partitioning (NP-hard)
• This means Rent’s Rule applies.
M. Butts, “Emulators”, Wiley Encyclopedia of Electrical and Electronics Engineering, 1999.
• An FPGA prototype is manual and specific. Hardware is usually chosen for one project, the design is manually partitioned according to its modular structure, FPGAs are sized accordingly.
• System modules naturally have smaller pinouts than arbitrary cuts. Rent’s Rule does not apply. (Well, yes it does but weakly.)
G. Schelle et al., “Intel Nehalem Processor Core Made FPGA Synthesizable”, ACM FPGA 2010

Rent’s Rule says FPGA Pins are Precious
• XC3090: 640 LUTs, 5K gates. Rent’s Rule says 325 pins, FPGA has 144 pins, only 44%
• Lesson: FPGA pins are vital to FPGA emulator capacity. => Separate interconnect
• Crossbar is ideal
  – Interconnects any pins, any way, with any fanout
  – Uniform delay: one level
• Far too expensive: \(O(n^{2})\)
• Far more fanout than needed, average net fanout is 2 to 3.
• Doesn’t take advantage of FPGA pin routability.
Butts, US 5,036,473, 1991

Partial Crossbar Interconnect
• Drop out most of the crosspoints, leaving a partial crossbar.
  – Group FPGA pins into subsets,
  – Fully populate crosspoints within each subset,
  – Leave the rest out.
• For each net, find a subset which can route it.
  – High fanout nets first.
• Map nets to FPGA pins accordingly.
• Still uniform single-level delay.
• Symmetrical, no placement needed.
• Scalable: \(O(n)\)
Butts, US 5,036,473, 1991

Partial Crossbar Systems
• Redraw: Group each subset’s crosspoints into a crossbar chip for that subset
• Each crossbar has pins to every FPGA, and vice versa.
• Make crossbar chip or use cheap FPGA
• Multilevel for systems: second-level crossbars on the backplane.
• Max delay is three hops.
• Cost is slightly higher than \(O(n)\). Scalable.
• Partial crossbar interconnect made large-scale logic emulation practical.
Butts, US 5,036,473, 1991

History of FPGA Emulators, 1989-2000
Nearest-neighbor architecture
• Quickturn RPM (1989): First commercial emulator
• Virtual Machine Works (1994): Virtual Wires pin multiplexing
Partial Crossbar architecture
• Mentor Realizer (1989): First hardware, emulated Apple II mobo
• Mentor Realizer (1991): Proof-of-concept system prototype
  – 8 logic boards (14 XC3090 FPGAs, 32 XC2018 xbars), 64 XC2018 2nd-level xbars
• Mentor sold this logic emulator technology to Quickturn (1992).
• Quickturn Enterprise (1993): First commercial partial crossbar emulator
  – 11 logic boards (46 XC3090s, 46 custom xbars), 144 2nd-level xbars, 330K gates
• HP Teramac (1995): Configurable computing research machine: 1M gates
• Quickturn System Realizer (1995): XC4000 series, 2M gates
• Quickturn Mercury Plus (2000): Large custom emulation FPGA, 20M gates

FPGA Emulation Clocking Issues
• ASIC and custom chips have gated clocks, latches, many clock domains. FPGAs can introduce their own violations.
• FPGA interconnect delay is very hard to manage.
  – FPGAs use dedicated low-skew clock networks.
• Gated clocks: must run clock through logic blocks. Hold-time violations: the delayed clock edge can reach a flip-flop later than the new data.
• Latches: timing of both edges matters, plus there’s latch transparency.
• How to reliably map these to FPGA? Re-synthesis.
  – Map gated clocks to FPGA FF clock enables (which is the gate, which is the clock?)
  – Map latches into flops, using 2x clocking.
• Emulators developed sophisticated design mapping techniques.

Emulator User Psychology
• Emulators were often hard to use, especially in the early days.
  – First-time users + clocking issues = errors.
  – Ultra-high pincount backplanes, cabling = errors.
• This trained users to blame the emulator.
• After weeks of effort, they finally get their design up and running on the emulator. A bug is found. What is their response?
  a) “Wonderful! It found a bug in our design. We’re getting value from all this expense.”
  b) “It’s not our design, it’s your emulator.”
• User starts running diagnostics and swapping boards.
• Swap enough boards and guess what happens.....
• Solutions: Locked board extractors, Better emulators.
Emulators have thousands of pins per board

1995: Quickturn System Realizer
• Up to 990 FPGAs (Xilinx XC4013), custom crossbar chips
• Logic board: 45 FPGAs, 100K gates
  – 2500 pins to backplane, 900 pins in-circuit or LA
• Max system 22 boards, 2M gates, 14 MB RAM
• Built-in LAPG
• 14K I/Os for multiple systems
• Compiler 100KG/hr
• Two-level partial crossbar connects 990 FPGAs in 3 hops max.
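
The partial crossbar routing step described earlier (group FPGA pins into subsets, give each subset its own crossbar, route every net through a single subset, high-fanout nets first) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the algorithm from the patent or from any Quickturn compiler: the function name `route_partial_crossbar`, the first-fit choice of subset, and the tiny three-FPGA example are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def route_partial_crossbar(
    nets: Dict[str, Set[int]],
    num_fpgas: int,
    num_subsets: int,
    pins_per_subset: int,
) -> Dict[str, Optional[int]]:
    """Assign each inter-FPGA net to one pin subset (one crossbar chip).

    nets maps a net name to the set of FPGA indices it touches.
    Returns net name -> chosen subset index, or None if unroutable.
    Assumption: first-fit subset selection, chosen only for brevity.
    """
    # free[f][s] = unused pins that FPGA f still has into crossbar s
    free: List[List[int]] = [
        [pins_per_subset] * num_subsets for _ in range(num_fpgas)
    ]
    assignment: Dict[str, Optional[int]] = {}

    # High-fanout nets first: they need a free pin on the most FPGAs at once.
    ordered = sorted(nets, key=lambda name: len(nets[name]), reverse=True)
    for name in ordered:
        fpgas = nets[name]
        chosen = None
        for s in range(num_subsets):
            # A subset can route the net only if every FPGA on the net
            # still has a free pin wired to that subset's crossbar.
            if all(free[f][s] > 0 for f in fpgas):
                chosen = s
                break
        assignment[name] = chosen
        if chosen is not None:
            for f in fpgas:
                free[f][chosen] -= 1  # one pin per FPGA, one crossbar hop
    return assignment


# Hypothetical toy system: 3 FPGAs, 2 subsets, 1 pin per FPGA per subset.
example = route_partial_crossbar(
    {"a": {0, 1, 2}, "b": {0, 1}, "c": {1, 2}},
    num_fpgas=3, num_subsets=2, pins_per_subset=1,
)
print(example)
```

In this toy case net "a" takes subset 0, net "b" takes subset 1, and net "c" is left unrouted because FPGA 1 has no free pins. No placement step is needed because every subset is equivalent, which is the symmetry the slides point out.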

2000: Mercury Plus FPGA
• Custom FPGA for emulation
• Five-level partial crossbar across entire 20M gate system:
  – Logic cluster: full crossbar
  – Two partial crossbar levels on-chip
  – Two more levels in the system
• 10x faster compile
• Predictable capacity and delays
• 6-LUTs, FFs, RAMs
  – hold time trimmers
• Full visibility, on-chip logic analyzer
• QT’s last FPGA emulator

FPGA Pin Shortage Gets Worse Over Time
• Using FPGAs directly in logic emulators falls to Rent’s Rule
  – FPGA-based emulators were always starved for pins.
  – Xilinx FPGAs from the beginning. Altera, other FPGAs are similar.

FPGA         LCs (4-LUT)  Gates      Rent pins  Real pins*  Shortfall (Rent / real)
XC2064       128          1024       130        58          2.24
XC3090       640          5120       325        144         2.26
XC4062       5472         43776      1105       352         3.14
XC40200      16758        134064     2092       448         4.67
XCV800       21168        169344     2390       512         4.67
XC2V6000     67584        540672     4631       1104        4.20
XC4VLX160    200448       1603584    8607       960         8.97
XC6VLX550T   549888       4399104    15299      1200        12.75
XC7V2000T    1954560      15636480   31521      1200        26.27

* ordinary pins only, SERDES latency is too long for logic emulation

FPGA Emulator Pin Multiplexing
• Multiple nets per pin, slower design clock
Xilinx data book
• Quickturn:
  – Asynchronous free-running high-speed using DDR IOBs
  – Transparent to the emulated design
• VMW: Virtual Wires
  – Synchronous to design
  – Modify design netlist: Eval/mux/latch, many levels
  – Multiple clock domains?
Babb et al., “Logic Emulation with Virtual Wires”, vol. 16, pp. 609 - 626, 1997.

Continuous to Discrete Time
• As FPGAs got further and further from Rent’s Rule, FPGA emulators went to deeper and deeper pin multiplexing.
• Continuous time:
  – Pure FPGA emulator runs in the continuous time of the design. Signals propagate as in the real hardware, just with different delays.
• Continuous / discrete time mix:
  – Pin-multiplexed FPGA emulator runs in an ad-hoc mix of continuous and discrete time. Yet pins still mostly lie idle.
• Discrete time:
  – Go all the way into discrete time == levelized simulation
• Now it’s a massively parallel computer

Processor-based Emulation
• Levelize netlist, evaluate all gates every cycle, level-by-level.
• No branches: deep pipelining, fast, massively parallel, very scalable.
• Compile-time net scheduling: Emulated design escapes Rent’s Rule
• IBM Yorktown Simulation Engine (Monty Denneau, DAC 1982)
  – “... high speed special purpose parallel processor designed and built at the IBM Thomas J. Watson Research Center to simulate logical operation ... up to 2,000,000 gates at a rate exceeding 3 billion gate computations per second”
• IBM Engineering Verification Engine (Beece et al., DAC 1988)

Quickturn CoBALT
Wm. Beausoleil et al., IBM
• 1997 commercialization of IBM engines
• 8M gates, 1 MHz emulation speed
• IBM HW, QT front end compiler
• Maps multi clock domains, latches, gated clocks onto single faster clock, making use of FPGA compiler experience
• Compiles 1M gates / hour
• Full custom 100 MHz 250 nm chip with 64 logic processors
• 65 chips / board

Processor-based Emulation in 2000’s
• IBM technology and team acquired by QT, then QT acquired by Cadence
• FPGA emulators dropped
• 2002: Palladium
  – 128M gates, 0.75 MHz
  – Full visibility
  – Compile 30M gates / hour
  – Multi-user
• 2004: Palladium II
  – 256M gates, 1.5 MHz
• 2007: Palladium III
  – 256M gates, 2 MHz
• 2010: Palladium XP
  – 2000M gates, 4 MHz

Emulation at NVIDIA
One of the largest emulation labs in the world

Early Emulation Success
• In 1995, CEO Jensen Huang “spent $1 million, a third of the company’s cash, on a technology known as emulation, which allows
  engineers to play with virtual copies of their graphics chips before they put them into silicon. That allowed Nvidia to speed a new graphics chip to market every six to nine months, a pace the company has sustained ever since.”
  - Forbes, 1/7/08
• RIVA 128, or "NV3", was one of the first consumer graphics processing units to integrate 2D and 3D acceleration. When announced in 1997, the market found the specifications hard to believe: performance superior to market-leader 3dfx. RIVA 128 shipped in volume, and the combination of its low cost and high performance made it a popular choice for OEMs.
  - Wikipedia

Emulation in 2005
The specific verification goals that were required for the GeForce 6800 project include:
• Bring up a new generation of GPUs on an accelerated verification platform in a one-week time frame. Derivative chips must be brought up in a few days.
• Automate the Compile-Run-Debug process so that ASIC design engineers could use an accelerated verification platform.
• Verify GPU and frame-buffer/system-memory interaction.
• Validate AGP/PCI-bus interface functions.
• Ensure functionality at various levels of abstraction (RTL and gates).
• Expand accelerated verification solution to ATPG and BIST applications.
  - Chip Design Magazine, January 2005

Emulation Today
• 2010: Cadence Palladium XP
• Up to 2 billion gates, up to 4 MHz, up to 512 users
  – Compile up to 35M gates / hour on 1 PC
• Full visibility to all signals
• Integrates with logic and power simulation, SystemC/C++ models, prototype hardware
• System integration steps used at NVIDIA:
  – Design and verify the silicon itself.
    • Power analysis is vital.
  – Run silicon in the virtual system (such as a PC), verify that the GPU works in a system.
  – Run lots of software applications on the virtualized platform.
  - “NVidia Engineer Cites HW/SW Integration Challenges”, 5/5/10, cadence.com

FPGA Prototyping today
• FPGA prototyping is widely used as a verification tool by chip development projects (not to mention RAMP of course).
• Practical for one to four to maybe ten FPGAs.
  – 2-4M gates each, typically 10 to 50 MHz
• Prototypes are rarely disclosed. Two research efforts that were:
  – Nehalem CPU in five FPGAs, 520 kHz due to 18- to 24-way pin multiplexing (ACM FPGA ‘10)
  – Atom CPU in one Virtex-5 LX330, 50 MHz (ACM FPGA ‘09)

Future
• State-of-the-art projects continue to rely heavily on processor-based emulation and FPGA prototyping for tapeouts.
• State-of-the-art tapeouts today cost $50-100M++.
  – Only possible for established $B vendors.
  – Very hard to get new chip startups funded.
• Therefore, ASIC project starts are dropping.
• FPGAs and GPUs are the only processing silicon that scales with Moore’s Law (so far).
  – Their vendors are the “foundries” for new HW efforts.
• Off-the-shelf chips: we’re coming full circle.

The Ultimate Interconnect
Human brain: \(10^{11}\) neurons, \(10^{14}\) to \(10^{15}\) total synapses, 20-40 W, somewhat reconfigurable.
“The Brain Unveiled”, Technology Review, Nov-Dec, 2008
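
As a closing worked example, the Rent’s Rule pin estimates in the pin-shortage table can be recomputed from the formula the talk says was used, p = 2.5 g^0.57. The sketch below is a hedged illustration rather than anything from the talk itself: the helper names `rent_pins` and `shortfall` are hypothetical, and the factor of 8 gates per 4-LUT logic cell is an assumption inferred from the table's own Gates and LCs columns.

```python
# Rent's Rule with the coefficients quoted in the talk: p = K * g**r.
RENT_K = 2.5    # "Rent constant" from the slides
RENT_R = 0.57   # "Rent exponent" from the slides
GATES_PER_LC = 8  # assumption: Gates / LCs equals 8 in every table row


def rent_pins(gates: float, k: float = RENT_K, r: float = RENT_R) -> float:
    """Pins that Rent's Rule predicts for a partition of `gates` gates."""
    return k * gates ** r


def shortfall(gates: float, real_pins: int) -> float:
    """Ratio of pins Rent's Rule asks for to pins the package provides."""
    return rent_pins(gates) / real_pins


# (part, logic cells, ordinary package pins) copied from the table.
PARTS = [
    ("XC2064", 128, 58),
    ("XC3090", 640, 144),
    ("XC4062", 5472, 352),
    ("XC40200", 16758, 448),
    ("XCV800", 21168, 512),
    ("XC2V6000", 67584, 1104),
    ("XC4VLX160", 200448, 960),
    ("XC6VLX550T", 549888, 1200),
    ("XC7V2000T", 1954560, 1200),
]

for part, logic_cells, real_pins in PARTS:
    gates = logic_cells * GATES_PER_LC
    print(f"{part:<11} gates={gates:>9} "
          f"rent_pins={rent_pins(gates):8.0f} "
          f"shortfall={shortfall(gates, real_pins):6.2f}")
```

For the XC3090 row this gives about 325 Rent pins against 144 real pins, matching the figure quoted in the talk. Because the exponent is above 0.5 while package pin counts have stayed nearly flat, the shortfall keeps growing as devices get larger.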