Part III
Logic Emulation

What is a Logic Emulation System?
1. Programmable hardware built from programmable logic devices (FPGAs) and programmable interconnect devices (PIDs).
2. Software that automatically programs the hardware according to the circuit under design.
3. Control hardware and software that let the emulated design operate as a hardware component in real time.

Typical Logic Emulation Environment
[Figure: a workstation (compiler, runtime software), the logic emulator (logic module, probe module), a stimulus generator and logic analyzer, and an in-circuit interface to the target system.]

Why Do We Need Logic Emulation?
- Design verification issues.
- Real-time operation.
- System-level testing.
- Rapid prototyping.

Design Verification Issues
- Simulation-based verification methods run out of steam as chip complexity grows.
- Emulation is a verification technology that scales with design size.

Real-Time Operation
- Simulation requires test vector development, which is costly and difficult.
- Verification depends on test vector correctness.
- Certain applications must be verified in real time by human perception: audio and video.
- An emulator connected to actual hardware can run real diagnostic code, operating systems, and applications.

System-Level Testing
- Often the chip meets its specifications but fails in the system.
- The system-level interactions between the chip and other components must be verified, and they are hard to formalize.
- Internal probing is impossible once the chip is fabricated and placed in a system, but it is possible with emulation.

Rapid Prototyping
- Once the emulated design is debugged, it is immediately available to software developers for software debugging.
- The emulated design is available for demos and for architecture experiments on real applications and data.

Programmable Hardware Includes Programmable Interconnect
[Figure: logic elements, programmable interconnect, memory element, VLSI core, and interface.]

Considerations for Programmable Interconnect
- The capacity of logic and interconnect depends on package constraints. This forces a hierarchical system: chips => boards => boxes => system.
- The interconnect structure must:
  1. Provide successful connectivity,
  2. Maximize FPGA utilization, and
  3. Minimize delay and skew.
- Rent's rule can be used to predict the interconnect needs.

Structures of Multi-FPGA Systems
Topologies:
- Mesh: nearest neighbor.
- Crossbar: full and partial.
Interconnect schemes:
- Circuit switched.
- Time multiplexed.

Nearest Neighbor Interconnection
[Figure: an array of nine FPGAs, each connected to its nearest neighbors.]

Advantages and Disadvantages of Nearest Neighbor Interconnection
Advantages:
- Uniform: all chips are the same.
- Easy to lay out on a PCB.
Disadvantages:
- Routing is easily blocked.
- The "through pins" limit the logic utilization of the FPGAs.
- Long and unpredictable delays.
- No natural hierarchical extension.

Nearest Neighbor Extensions
- Connect to non-neighbors.
- Add more neighbors.
[Figure: FPGA arrays with skip connections to non-neighbors and with additional neighbor connections.]

Advantages and Disadvantages of Extended Nearest-Neighbor Architectures
Advantages:
- More choices for the router, thanks to the added diagonal lines and skip lines.
Disadvantages:
- More complex PCB.
- More complex routing software.

Partial Crossbar Interconnect
[Figure: four logic blocks, each with pin subsets A, B, C, and D; crossbars for the A pins, B pins, C pins, and D pins; second-level crossbars.]

Partial Crossbar Interconnect (cont.)
- A partial crossbar consists of a set of small full crossbars, connected to the logic blocks but not to each other.
- The I/O pins of each FPGA are divided into subsets.
- Each subset is connected by a full crossbar circuit switch.
- A partial crossbar is a potentially blocking network.

Characteristics of the Partial Crossbar Architecture
- The size of a partial crossbar is proportional to the number of FPGA pins.
- Every interconnection goes through one crossbar chip in a one-level partial crossbar, or three crossbar chips in a two-level one, so delays are uniform and bounded.

Mixed Full and Partial Crossbar
[Figure: FPGAs connected to local FPICs (full crossbar); local FPICs connected to global FPICs (partial crossbar); external connections.]

Circuit Switched versus Time Multiplexed Interconnect Schemes
The trade-off is between operating speed and hardware cost.
Time multiplexing:
- Can greatly expand the available interconnect.
- Allows lower-cost IC packages and PCBs.
- Makes partitioning easier.
BUT
- System power increases due to frequent signal switching (higher hardware cost).
- Scheduling software is complex.
- Operating speed is slow.

Virtual Wires
[Figure: the logical outputs of one FPGA are multiplexed onto shared physical wires and demultiplexed into the logical inputs of another FPGA.]
- Idea: trade space for time.

Logic Emulation Systems and Their Interconnection Schemes
- Systems with mesh topology: Quickturn's RPM and Virtual Machine Works (IKOS).
- Systems with partial crossbar: Quickturn's Enterprise, Mars, and System Realizer.
- System with mixed full and partial crossbar: Aptix Prototyping System.
- Systems using time-multiplexed interconnect: Virtual Machine Works (IKOS), CoBALT and Arkos (Quickturn).

Memory Solutions in Emulators and Future Devices/Systems
Goal: programmable memories with different width/depth/port combinations.
FPGA-based memories:
- Use logic resources inefficiently.
- Make timing correctness difficult to ensure.
- Require large or highly multi-ported memories to be partitioned across several FPGAs.
Alternative: SRAMs with dedicated or programmable controllers.

Logic Emulation Design Flow
[Figure: design flow with the labels HDL synthesis, synthesis, pre-configuration preparation, partitioning, system mapping, P&R, full-chip configuration, design downloading, emulators, and in-circuit emulation.]

Logic Emulation Design Compiler and Its Components
A logic emulation design compiler is a large and complex EDA tool which includes:
- Front-end design importer.
- HDL-based synthesizer.
- Clock and timing analyzer.
- Partitioner.
- System-level placer and router.
- FPGA-based placer and router.

Objectives of the Logic Emulation Compiler
- Fast compilation time.
- Fast emulation clock.
- Timing correctness.
- Easy ECO (engineering change order).
- Minimal circuit size.

Design Considerations for Logic Emulators
HDL synthesis:
- Trade-off between run time and quality.
- CLB-based vs. gate-based designs.
Clock and timing analysis:
- Timing correctness, free of hold-time violations.
- Clock skew minimization.
Partitioning:
- Run time.
- Timing and area.

Design Considerations for Logic Emulators (cont.)
System placement and routing:
- Timing.
- Completeness of routing.
FPGA-based placement and routing:
- Fast run time.
- Parallel compilation.
Remember: the logic you emulate is not identical to the logic of your design.

Hold-Time Violation
[Figure: two flip-flops (D, Q, CK) connected through a LUT in a CLB, with routing delay.]
- This is a clock distribution (skew) problem.
- A hold-time violation occurs when the routing delay exceeds the LUT delay.

Timing Correctness: Delay Insertion
[Figure: the same two-flip-flop circuit with a delay element added alongside the LUT and the routing delay.]

Timing Correctness: Use Clock Enables for Gated Clocks
[Figure: flip-flops with clock enable (CE); the LUT drives CE, and the clock path carries the primary clock to CK on a low-skew net.]

Methodology and Components of a Logic Emulator System
- Pre-configuration preparation: prepare netlists and control files for configuration.
- Testbed preparation: prepare the emulation-based operating environment.
- Full-chip configuration: download the design to the emulator.
- In-circuit emulation: test the design.

Pre-Configuration in an Emulator System
- Translate the leaf-cell libraries into emulation primitives.
- Verify that the translated libraries are functionally equivalent to the originals.
- Modify and redesign components that are incompatible with emulation techniques, such as precharge logic circuits.
- Assemble all the gate-level netlists for the entire design.

Testbed in a Logic Emulator
- Design and implement the target ICE board, combining the emulated design with real hardware.
- Slow the testbed down to emulation speed.
- Assemble the testbed and emulation equipment.
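
As a rough illustration of the hold-time and delay-insertion slides earlier in this part, the Python sketch below checks one flip-flop-to-flip-flop path and computes how much delay would have to be inserted to make it safe. It uses the usual textbook hold check, with clock skew standing in for the clock routing delay. The `Path` fields, the function names, and all numbers are illustrative assumptions, not values from any particular emulator or FPGA.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One launch-flop to capture-flop path (all times in ns, hypothetical)."""
    clk_to_q: float    # launch flip-flop clock-to-Q delay
    data_delay: float  # LUT plus data routing between the two flip-flops
    clock_skew: float  # capture clock arrival minus launch clock arrival
    hold: float        # hold requirement of the capture flip-flop


def hold_slack(p: Path) -> float:
    """Positive slack: new data arrives after the hold window has closed."""
    return p.clk_to_q + p.data_delay - p.clock_skew - p.hold


def delay_to_insert(p: Path) -> float:
    """Extra data-path delay needed to remove a hold violation (0 if none)."""
    return max(0.0, -hold_slack(p))


# A path whose clock arrives late at the capture flip-flop (large skew).
bad = Path(clk_to_q=1.0, data_delay=2.0, clock_skew=5.0, hold=0.5)
# The same path on a low-skew clock net.
good = Path(clk_to_q=1.0, data_delay=2.0, clock_skew=0.5, hold=0.5)

for name, path in (("bad", bad), ("good", good)):
    print(name, "slack:", hold_slack(path), "insert:", delay_to_insert(path))
```

In this simplified model, inserting `delay_to_insert(p)` into the data path brings the slack back to zero; a real compiler would also have to confirm that the added delay does not break setup timing.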

Full-Chip Configuration & In-Circuit Emulation
Full-chip configuration:
- Prepare control files.
- Partition the design to fit into the emulation system.
- Download the design into the system.
- Verify that the emulation model faithfully implements the design as specified by the RTL.
In-circuit emulation.

Part IV
Reconfigurable Computing and Systems

General-Purpose Computing vs. Custom Computing
- General-purpose computing: running applications on a general-purpose computer.
- Custom computing: running applications on custom-made, application-specific hardware.
- Field-programmable devices make this a reality.

Goals of Reconfigurable Computing
- Tailor the architecture to the application.
- Minimize or eliminate instruction interpretation.
- Exploit fine-grained parallelism.
- Map software to hardware.

Applications of Reconfigurable Computing
- Database search and analysis.
- Image processing and machine vision.
- Data compression.
- Signal processing.
- Neural networks.
- Biological computing.
- Medical computing.
- Design automation (PSU).
- Many more.

Multi-Mode Systems
Map various applications to one reconfigurable system.
[Figure: a ROM holding the configurations for Application 1 and Application 2, feeding a reconfigurable system.]
- Different configurations for the read and write operations of a tape drive (Honeywell).
- Different configurations for different printer controllers (Tektronix).

Run-Time Reconfiguration in a Military Image Recognition System
[Figure: image data arriving through I/O is tested against successive configurations: Truck? Jeep? Tank?]
- Break a single computation into multiple pieces.
- Page in components as needed (virtual hardware), e.g., automatic target recognition.

Custom Computing
- Application-specific systems.
- Numerous applications for similar reconfigurable systems.
- Offers hardware performance with the flexibility to handle numerous algorithms.
- Multi-FPGA systems can be viewed as hardware supercomputers.
- Example to discuss: DEC PeRLe.

Reconfigurable Co-processors
[Figure: a processor running Program 1 and Program 2, with a coprocessor supplying instructions Inst1 and Inst2.]
- Provide custom instructions on a per-application basis.
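
The "page in components as needed (virtual hardware)" idea from the run-time reconfiguration slide above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `VirtualHardware`, the fixed number of configuration slots, the least-recently-used replacement policy, and the placeholder bitstream strings are all hypothetical and are not taken from any of the systems named in these slides.

```python
from collections import OrderedDict


class VirtualHardware:
    """Toy model: a few physical slots shared by many configurations."""

    def __init__(self, slots: int):
        if slots < 1:
            raise ValueError("need at least one configuration slot")
        self.slots = slots
        self.resident = OrderedDict()  # configuration name -> bitstream
        self.reconfigurations = 0      # how many times a bitstream was loaded

    def run(self, name: str, library: dict) -> str:
        """Make configuration `name` resident, then pretend to run it."""
        if name in self.resident:
            # Already loaded: just mark it as most recently used.
            self.resident.move_to_end(name)
        else:
            if len(self.resident) >= self.slots:
                # Evict the least recently used configuration.
                self.resident.popitem(last=False)
            self.resident[name] = library[name]
            self.reconfigurations += 1
        return name


# Hypothetical recognizer configurations, as in the target recognition example.
library = {"truck": "truck.bit", "jeep": "jeep.bit", "tank": "tank.bit"}

hw = VirtualHardware(slots=2)
for wanted in ["truck", "jeep", "truck", "tank", "jeep"]:
    hw.run(wanted, library)

print("resident:", list(hw.resident), "loads:", hw.reconfigurations)
```

The analogy with virtual memory is direct: a configuration that is already resident costs nothing, while a miss costs one reconfiguration, so the order in which pieces of the computation are requested determines how much time is lost to reloading.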

Types of Reprogrammable Systems
Three ways to attach custom computing units (PU = processing unit):
[Figure: a coprocessor next to the CPU; an attached processing unit at the memory caches; a standalone PU on the I/O interface.]

Types of Reprogrammable Systems (cont.)
- Attached and standalone processing units are reprogrammable systems on computer add-on cards and in separate reprogrammable cabinets.
- Consideration: a large communication overhead may overshadow the speed gain.
- Application-specific coprocessors can achieve significant improvement over a wide range of applications.

Types of Reprogrammable Systems (cont.)
- Integrate the reprogrammable logic into the processor itself.
- A reprogrammable functional unit can be configured on a per-algorithm basis.
- It provides special-purpose instructions tailored to the needs of a given application.

Architectures of Multi-FPGA (Reconfigurable) Systems
The most commonly used topologies:
- Mesh: 1D (linear array), 2D, and 3D.
- Crossbar: full, partial, mixed, and hierarchical.
- Hybrid between mesh and crossbar.
- Application-specific architecture.

Hybrid Topology of a Reconfigurable System
[Figure: a linear array of 16 FPGAs, each with RAM, with external interfaces at the ends.]
- Splash 2: augments a linear array of FPGAs with a crossbar switch.
- Goal: supporting systolic circuits.

Hybrid Topology
[Figure: four FPGAs with RAMs and a host interface.]
- Anyboard: a linear array of FPGAs augmented by global buses.

Hybrid Topology
[Figure: a 4 x 4 mesh of FPGAs with RAMs and a host interface.]
- DECPeRLe-1: a 4 x 4 mesh of FPGAs augmented with shared global buses.

Application-Specific Topology of Marc-1, One Subsystem
[Figure: Marc-1 subsystem 1: FPGAs, a memory, and an FPU, with numbered connections (1 to 5) to other FPGAs.]
- Application: circuit simulation, where the program to be executed can be optimized on a per-run basis.
- The optimization uses values that are constant within a run but may vary from dataset to dataset.

Application-Specific Topology of Marc-1, cont.
[Figure: the Marc-1, built from subsystems labeled 1 to 5.]

Application-Specific Topology
[Figure: an array of FPGAs bordered by RAMs.]
- The RM-nc system: neural networks.

Architecture for Computer Prototyping
[Figure: FPGAs implementing the cache memory, register file, ALU, and FPU, attached to a VME bus and RAM.]
- The Mushroom processor prototyping system.

Expandable Topologies
- Hierarchical crossbar topology: can be expanded by adding an extra level. Example: Quickturn systems.
- Expandable mesh topology: can be expanded by connecting individual boards to form a larger mesh. Example: the Virtual Wires Emulation System (IKOS).

Topology for Adapting Other Components
- Many multi-FPGA systems include non-FPGA resources to provide more general-purpose solutions.
- The MORRPH system: sockets next to the FPGAs allow arbitrary devices to be added to the array.
- The G800 board: contains two FPGAs and four sockets.

Topology for Adapting Other Components (cont.)
- The COBRA system contains base modules (expandable to a 2D mesh), RAM modules, I/O modules, and bus modules.
- The Springbok system: a pre-made daughter board holds an arbitrary device (on the top) and an FPGA (on the bottom). Daughter boards are mounted on a baseplate.

Topology for Adapting Other Components (cont.)
- The Quickturn systems: external component adapters.
- The Aptix FPCB: a reprogrammable PCB.

Design Methodology for General-Purpose Configurable Systems
[Figure: applications are mapped onto a host computer and a reprogrammable system.]

Typical Software Methodology for General-Purpose Configurable Systems
[Figure: application spec. => analysis => system-level synthesis, which produces a software spec. (=> code generation => object code) and a hardware spec. (=> hardware synthesis).]

Typical Software Methodology for General-Purpose Configurable Systems (cont.)
[Figure: hardware spec. => synthesis => partitioning & placement => pin assignment & routing => FPGA P&R => bit-stream files.]

Considerations for Such Complex Software Systems
- Architecture-specific design tasks.
- Design automation process.
- The mapping time dominates the setup time for operating the system.
- Run-time reconfigurability.

Design Specification and Languages for Reconfigurable Software Systems
- Standard software programming languages (e.g., C, C++, FORTRAN, and assembly language) vs. HDLs.
- Standard software programming languages: a sequential execution model.
- HDLs: a parallel execution model.
- Who will use it, and which one is more suitable for system description?

Compilation Issues
- Translate code from software languages into hardware without losing the inherent concurrency of hardware.
- Compiler techniques for parallelizing code.
- Straight-line code, control flow, and loops.
- Transmogrifier C compiler.

System-Level and High-Level Synthesis
- System-level design evaluation and analysis.
- Design estimation.
- Hardware-software partitioning.
- Interface synthesis.
- RTL synthesis.
- Logic synthesis and technology mapping.

Partitioning and Placement
- Topology-aware partitioning methods.
- Partitioning onto a multi-FPGA system is equivalent to a placement problem.
- Logic utilization and timing.

Pin Assignment and Routing
- Pin assignment: the process of determining which I/O pins to use for each inter-FPGA signal.
- Pin assignment for a pre-fabricated multi-FPGA system is equivalent to the global routing problem.
- Pin assignment greatly affects each FPGA's logic utilization and routability.

Run-Time Reconfigurability
- This is a new issue in system design: how much of the processor is virtual, and when should it be reconfigured?
- Virtual hardware <=> virtual memory. How are they related?
- Artificial intelligence, robotics, vision.
- Hardware on demand.
- What is the initial unconfigured structure?
- What are the reconfiguration methods?
- Software supporting time-varying mapping.
- Many open problems need to be solved in the forthcoming years.

Applications: Splash 2
- Stream-oriented systolic and SIMD applications.
- Scalable linear array of 16 to 256 processing elements (each one XC4010 with 1/2 Mbyte of memory).
- VHDL based.
- Sequence comparison: 2300M vs. 0.75M cell updates/sec (Splash 2 vs. Sparc 10).
- Edge detection: 10M vs. 242K pixels/sec (Splash 2 vs. Sparc 10).

Applications: PAM (DEC)
- Programmable Active Memory (PAM).
- C++ based; mesh arrays of XC3090 FPGAs (DECPeRLe-1).
- Applications:
  - Multiple-precision arithmetic.
  - RSA encryption.
  - Video compression (JPEG, MPEG, DCT).
  - High-energy physics.
  - Telecommunications.

Sources of Some Slides
Peter Alfke, Xilinx, Inc., peter.alfke@xilinx.com