fpga05.FPGA-logic-emulation-and-reconfigurable

Part III
Logic
Emulation
What is a Logic
Emulation System?
1. Programmable hardware built with
programmable logic (FPGAs) and
programmable interconnect devices (PIDs).
2. Software that automatically programs the
hardware according to the circuit under design.
3. Control HW/SW to support operation of the
emulated design as a hardware component
operating in real time.
Typical Logic Emulation
Environment
Compiler, runtime software
Workstation
Logic Emulator
Logic Module
Probe Module
Stimulus generator, logic analyzer
Target System
In-circuit
Interface
Why we need Logic
Emulation?
Design verification issues.
Real-time operation.
System-level testing.
Rapid prototyping.
Design Verification
Issues
Simulation-based verification methods run
out of steam as chip complexity
grows.
Emulation is a verification technology that
grows along with design size.
Real-Time Operation
Simulation requires test vector development
which is costly and difficult.
Verification depends on test vector correctness.
Certain applications must be verified in real time because they involve human perception: audio and video.
Emulation connected to actual hardware can run:
real diagnostic code,
operating systems, and
applications.
System-Level Testing
Often the chip meets its specifications but it fails
in the system.
We have to verify the system-level interactions
between the chip and other components. They
are hard to formalize.
Internal probing is impossible once the chip is
fabricated and placed in a system,
but it is possible using emulation.
Rapid Prototyping
Once the emulated design is debugged, it is
available for immediate use by software
developers for software debugging.
The emulated design is also available for demos and
for architecture experiments with real
applications and data.
Programmable Hardware includes
programmable interconnect
Logic
element
Logic
element
Programmable
interconnect
Memory
element
VLSI
core
Interface
Considerations for
programmable interconnect
The capacity of logic and interconnection depends on
package constraints.
This forces a hierarchical system.
Chips => boards => boxes => system
The interconnect structure must:
1. Provide successful connectivity,
2. Maximize FPGA utilization, and
3. Minimize delay and skew.
Rent’s rule applies to predict the interconnect needs.
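A minimal sketch of how Rent's rule can be used to estimate interconnect needs. It assumes the usual form T = t * g^p, where g is the number of gates in a block, t is the average number of terminals per gate, and p is the Rent exponent. The default values of `t` and `p` below are illustrative assumptions, not figures from these slides.

```python
def rent_pins(gates: int, t: float = 2.5, p: float = 0.6) -> float:
    """Estimate the external pins a block of `gates` gates needs: T = t * gates**p."""
    return t * gates ** p


def max_gates_for_pins(pins: int, t: float = 2.5, p: float = 0.6) -> float:
    """Invert Rent's rule: the largest gate count whose pin estimate fits in `pins`."""
    return (pins / t) ** (1.0 / p)


if __name__ == "__main__":
    # With p = 0.5, a 10,000-gate partition needs about 2.5 * 100 = 250 pins.
    print(rent_pins(10_000, t=2.5, p=0.5))
    # How much logic can a 250-pin package keep busy under the same assumptions?
    print(max_gates_for_pins(250, t=2.5, p=0.5))
```

The inverse function shows why packages are pin-limited: under these assumptions the usable logic per FPGA is bounded by the pin count, not by the gate capacity.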
Structures of Multi-FPGA
Systems
Topologies:
- Mesh - nearest neighbor.
- Crossbar - full and partial.
Interconnect scheme:
- Circuit switched.
- Time multiplexed.
Nearest Neighbor
Interconnection
[Figure: nine FPGAs in a mesh, each connected to its nearest neighbors.]
Advantages and Disadvantages of
Nearest Neighbor Interconnection
Advantages:
Uniform: all chips the same.
Easy to lay out on PCB.
Disadvantages:
Routing is easily blocked.
The “through pins” limit the logic utilization of FPGAs.
Long and unpredictable delays.
No natural hierarchical extension.
Nearest Neighbor Extensions
[Figure: two FPGA arrays showing the two extensions: connect to non-neighbors, and add more neighbors.]
Advantages and Disadvantages of
nearest-neighbor extended architectures
Advantages:
More choices for router by adding diagonal
lines & skip lines.
Disadvantages:
More complex PCB.
More complex routing software.
Partial Crossbar Interconnect
[Figure: four logic blocks, each with pin subsets A, B, C, and D, connected to crossbars labeled A pins, B pins, C pins, and D pins, with second-level crossbars.]
Partial Crossbar Interconnect
Partial crossbar consists of a set of small full
crossbars,
connected to logic blocks
but not to each other.
I/O pins of each FPGA are divided into subsets.
Each subset is connected by a full crossbar circuit
switch.
Partial crossbar is a potentially blocking network.
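The structure described above can be sketched as a small model. This is an illustrative toy, not any vendor's router: the class name `PartialCrossbar`, the first-fit routing policy, and the pin counts are assumptions. Crossbar `x` is wired to pin subset `x` of every FPGA, and a net must find one crossbar that still has a free pin on every FPGA it touches.

```python
from typing import List, Optional, Sequence


class PartialCrossbar:
    def __init__(self, num_fpgas: int, num_crossbars: int, pins_per_subset: int):
        # free[x][f] = unused pins of FPGA f that are wired to crossbar x
        self.free: List[List[int]] = [
            [pins_per_subset] * num_fpgas for _ in range(num_crossbars)
        ]

    def route(self, fpgas: Sequence[int]) -> Optional[int]:
        """Route one net touching `fpgas`; return the crossbar used, or None if blocked."""
        for x, row in enumerate(self.free):
            if all(row[f] > 0 for f in fpgas):
                for f in fpgas:
                    row[f] -= 1  # consume one pin of this subset on each FPGA
                return x
        return None


if __name__ == "__main__":
    xb = PartialCrossbar(num_fpgas=4, num_crossbars=2, pins_per_subset=1)
    print(xb.route([0, 1]))  # uses crossbar 0
    print(xb.route([1, 2]))  # crossbar 0 is full on FPGA 1, so crossbar 1
    print(xb.route([0, 2]))  # blocked: the free pins sit on different crossbars
```

The last net is blocked even though FPGA 0 and FPGA 2 each still have a free pin, because those pins belong to different subsets. This is the sense in which the network is potentially blocking.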
Characteristics of “Partial
Crossbar Architecture”
Partial crossbar’s size is proportional to the
number of FPGA pins.
Every interconnection goes through one crossbar
chip in a one-level partial crossbar, or three
in a two-level partial crossbar, so
delays are uniform and bounded.
Mixed Full and Partial
Crossbar
[Figure: FPGAs attach to local FPICs (full crossbar); the local FPICs connect through global FPICs (partial crossbar) to the external connections.]
Circuit Switched versus Time
Multiplexed Interconnect Schemes
Trade-offs between the operating speed and the
hardware cost.
Time-multiplexing method:
can greatly expand available interconnect.
allows lower cost IC package and PCB.
makes partitioning easier.
BUT
System power increases due to frequent signal
switching (higher hardware cost).
Complex scheduling software.
Slow operating speed.
Virtual Wires
[Figure: in the sending FPGA, a mux places several logical outputs on the physical wires; in the receiving FPGA, a demux recovers the logical inputs.]
Virtual wires trade space for time.
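A minimal sketch of the space-for-time trade in virtual wires: many logical signals share a few physical wires by being sent in successive time slots. The function names and the simple in-order schedule are assumptions made for illustration; a real scheduler must also respect signal dependencies.

```python
from math import ceil
from typing import List


def slots_needed(n_logical: int, physical_wires: int) -> int:
    """Time slots required to carry n_logical signals over the physical wires."""
    return ceil(n_logical / physical_wires)


def mux_schedule(logical: List[int], physical_wires: int) -> List[List[int]]:
    """Sender side: split the logical values into slots of at most `physical_wires`."""
    return [
        logical[i:i + physical_wires]
        for i in range(0, len(logical), physical_wires)
    ]


def demux(slots: List[List[int]]) -> List[int]:
    """Receiver side: reassemble the logical values in their original order."""
    return [value for slot in slots for value in slot]


if __name__ == "__main__":
    signals = [1, 0, 1, 1, 0, 0, 1]      # seven logical inter-FPGA signals
    slots = mux_schedule(signals, 3)     # only three physical wires
    print(len(slots), demux(slots) == signals)
```

Each emulated clock cycle now costs several physical transfer slots, which is the speed penalty paid for the extra interconnect.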
Logic Emulation Systems and their
interconnection schemes
System with mesh topology - Quickturn’s RPM and
Virtual Machine Works (IKOS).
System with partial crossbar - Quickturn’s Enterprise,
Mars, and System Realizer.
System with mixed full and partial crossbar - Aptix
Prototyping System.
System using time-multiplexed interconnect - Virtual
Machine Works (IKOS) , CoBALT and Arkos (Quickturn).
Memory Solutions in Emulators and
future devices/systems
Goal: programmable memories with
different width/depth/port combinations.
FPGA-based memories:
use logic resources inefficiently.
timing correctness is difficult to ensure.
large or highly multi-ported memories must be
partitioned across several FPGAs.
SRAMs with dedicated or programmable
controllers.
Logic Emulation Design Flow
[Figure: design-flow diagram. Labels: HDL synthesis, synthesis, pre-configuration preparation, partitioning, system mapping, P&R, full-chip configuration, design downloading, emulators, in-circuit emulation.]
Logic Emulation Design
Compiler and its components
Logic emulation design compiler is a large and complex
EDA tool which includes:
Front-end design importer.
HDL-based synthesizer.
Clock and timing analyzer.
Partitioner.
System-level placer and router.
FPGA-based placer and router.
Objectives of logic emulation
compiler
Fast compilation time.
Fast emulation clock.
Timing correctness.
Easy ECO (Engineering Change Order).
Minimize circuit size.
Design Considerations for Logic
Emulators
HDL synthesis:
Trade-off between run time and quality.
CLB-based vs. gate-based designs.
Clock and timing analysis:
Timing correctness, hold-time violation free.
Clock skew minimization.
Partitioning:
Run time.
Timing and area.
Design Considerations for Logic
Emulators
System placement and routing:
Timing.
Completeness of routing.
FPGA-based placement and routing:
Fast run time.
Parallel compilation.
Remember: the logic you
emulate is not the same
as the logic in your design.
Hold-Time Violation
Clock distribution problem (skew)!
[Figure: two flip-flops (D, Q, CK) with a LUT in a CLB between them; a routing delay is marked.]
A hold-time violation occurs
when routing delay > LUT delay.
Timing Correctness
Delay insertion
[Figure: two flip-flops (D, Q, CK) with a delay element inserted next to the LUT in the CLB; the routing delay is marked.]
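The hold-time condition and the delay-insertion fix can be sketched as a slack calculation. This is a simplified model under stated assumptions: the routing delay is taken to be the extra delay on the clock path to the receiving flip-flop, and `clk_to_q` and `hold_time` are hypothetical parameters that default to zero so the check reduces to the slide's "routing delay > LUT delay".

```python
def hold_slack(lut_delay: float, clock_routing_delay: float,
               clk_to_q: float = 0.0, hold_time: float = 0.0) -> float:
    """Data-path delay minus the lateness of the receiving clock (and hold time).

    A negative result means new data can reach the receiving flip-flop
    before its late clock edge has captured the old data.
    """
    return clk_to_q + lut_delay - clock_routing_delay - hold_time


def delay_to_insert(lut_delay: float, clock_routing_delay: float,
                    clk_to_q: float = 0.0, hold_time: float = 0.0) -> float:
    """Minimum extra data-path delay that brings the hold slack back to zero."""
    return max(0.0, -hold_slack(lut_delay, clock_routing_delay, clk_to_q, hold_time))


if __name__ == "__main__":
    # Clock arrives 3 ns late, data takes only 2 ns: violation, pad by 1 ns.
    print(hold_slack(2.0, 3.0), delay_to_insert(2.0, 3.0))
    # Data path slower than the clock skew: nothing to insert.
    print(delay_to_insert(3.0, 2.0))
```

Any inserted delay also lengthens the data path, so in practice it trades hold safety against the achievable emulation clock.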
Timing Correctness
Use clock enables for gated clocks
[Figure: a flip-flop in a CLB using its clock-enable (CE) input; the clock path carries the primary clock on a low-skew net.]
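A small behavioral sketch of why a clock enable can replace a gated clock. Both functions are illustrative models, not vendor primitives, and the equivalence shown assumes the gate signal changes only while the clock is low. The gated version clocks the flip-flop with `clk AND gate`; the enable version keeps the flip-flop on the primary clock and uses the gate as CE.

```python
from typing import List


def run_gated(clk: List[int], gate: List[int], d: List[int]) -> List[int]:
    """Flip-flop clocked by the gated clock (clk AND gate)."""
    q, prev, out = 0, 0, []
    for c, g, x in zip(clk, gate, d):
        gated_clk = c & g
        if gated_clk and not prev:   # rising edge of the gated clock
            q = x
        prev = gated_clk
        out.append(q)
    return out


def run_enable(clk: List[int], ce: List[int], d: List[int]) -> List[int]:
    """Flip-flop on the primary clock, loading only when CE is high."""
    q, prev, out = 0, 0, []
    for c, e, x in zip(clk, ce, d):
        if c and not prev and e:     # rising edge of the primary clock, enabled
            q = x
        prev = c
        out.append(q)
    return out


if __name__ == "__main__":
    clk = [0, 1, 0, 1, 0, 1, 0, 1]
    gate = [1, 1, 0, 0, 1, 1, 0, 0]
    d = [1, 1, 0, 0, 0, 0, 1, 1]
    print(run_gated(clk, gate, d) == run_enable(clk, gate, d))
```

With the enable form, every flip-flop sees the same primary clock on the low-skew net, so the gating logic no longer adds delay to the clock path.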
Methodology and components of Logic
Emulator System
Pre-configuration preparation - prepare netlists
and control files for configuration.
Testbed preparation - prepare emulation-based
operation environment.
Full-chip configuration - download design to the
emulator.
In-circuit emulation - test the design.
Pre-Configuration in Emulator
System
Translate the leaf-cell libraries into emulation
primitives.
Translated libraries must be verified for functional
equivalence to original.
Modify and redesign some components to attain
compatibility with emulation techniques, such as
precharge logic circuits.
Assemble all the gate-level netlists for the entire
design.
Testbed in Logic Emulator
Design and implement the target ICE board
combining the emulated design with real
hardware.
Slow down the testbed to emulation speed.
Assemble the testbed and emulation
equipment.
Full-Chip Configuration & In-Circuit Emulation
Full-chip configuration:
Prepare control files.
Partition the design to fit into the emulation
system.
Download design into the system.
Verify that the emulation model faithfully
implements the design as specified by RTL.
In-circuit emulation
Part IV
Reconfigurable
Computing and
Systems
General-Purpose Computing
vs. Custom Computing
General-purpose computing - running
applications on a general-purpose computer.
Custom computing - running applications
on custom-made, application-specific
hardware.
Field-programmable devices make this a
reality.
Goals of Reconfigurable
Computing
Tailor the architecture to the application.
Minimize or eliminate instruction interpretation.
Exploit fine-grained parallelism.
Map software to hardware.
Applications of reconfigurable
computing
Database search and analysis.
Image processing and machine vision.
Data compression.
Signal processing.
Neural networks.
Biology computing.
Medical computing.
Design Automation (PSU)
Many more.
Multi-Mode Systems map
various applications to a reconfigurable
system
[Figure: a ROM holding Application 1 and Application 2, either of which can be loaded into the reconfigurable system.]
• Different configurations for read & write
operations of a tape driver (Honeywell).
• Different configurations for different
printer controllers (Tektronix).
Run-Time Reconfiguration in
military image recognition system
[Figure: image data enters through I/O and is tested against candidate targets in turn: Truck? Jeep? Tank?]
• Break a single computation into multiple pieces.
• Page in components as needed (virtual hardware),
e.g., automatic target recognition.
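The "page in components as needed" idea can be sketched by analogy with virtual-memory paging. The class below is a hypothetical illustration: the name `VirtualHardware`, the fixed number of configuration slots, and the least-recently-used eviction policy are all assumptions, not a description of any real system.

```python
from collections import OrderedDict
from typing import List


class VirtualHardware:
    def __init__(self, slots: int):
        self.slots = slots                 # configurations that fit on the fabric
        self.loaded: "OrderedDict[str, bool]" = OrderedDict()
        self.reconfigurations = 0          # how often a bitstream had to be loaded

    def run(self, config: str) -> None:
        """Run one piece of the computation, paging its configuration in if needed."""
        if config in self.loaded:
            self.loaded.move_to_end(config)        # mark as most recently used
            return
        if len(self.loaded) >= self.slots:
            self.loaded.popitem(last=False)        # evict the least recently used
        self.loaded[config] = True
        self.reconfigurations += 1


if __name__ == "__main__":
    hw = VirtualHardware(slots=2)
    for template in ["tank", "truck", "tank", "jeep", "truck"]:
        hw.run(template)
    print(hw.reconfigurations, list(hw.loaded))
```

The reconfiguration count is the cost to watch: each miss stalls the computation for the time it takes to load a configuration.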
Custom Computing
Application-specific systems.
Numerous applications for similar reconfigurable
systems.
Offers hardware performance, flexibility to handle
numerous algorithms.
Multi-FPGA systems can be viewed as hardware
supercomputers.
(Example to discuss: DECPeRLe-1.)
Reconfigurable Co-processors
[Figure: a processor with a coprocessor; Program 1 uses custom instruction Inst1 and Program 2 uses Inst2.]
- Provide custom instructions
on a per-application basis.
Types of Reprogrammable Systems
Three ways to attach
custom computing units
[Figure: a CPU with a coprocessor, an attached processing unit alongside the memory caches, and a standalone PU behind the I/O interface.]
PU = processing unit
Types of Reprogrammable
Systems
Attached processing units are reprogrammable
systems on computer add-on cards; standalone
processing units are separate reprogrammable cabinets.
Consideration: a large communication overhead may
overshadow the speed gain.
Application-specific coprocessors can achieve
significant improvement over a wide range of
applications.
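The communication-overhead concern can be made concrete with a back-of-the-envelope model. It is a simplification under stated assumptions: the whole task is offloaded, `hw_speedup` is the raw speedup of the reprogrammable hardware, and `comm_time` is the time spent moving data to and from it. The numbers are invented for illustration.

```python
def net_speedup(sw_time: float, hw_speedup: float, comm_time: float) -> float:
    """Overall speedup once data-transfer time is added to the hardware run time."""
    return sw_time / (sw_time / hw_speedup + comm_time)


if __name__ == "__main__":
    # A 100 ms software task, 50x faster in hardware.
    print(net_speedup(100.0, 50.0, 0.0))    # no transfer cost: the full 50x
    print(net_speedup(100.0, 50.0, 98.0))   # 98 ms of transfers: no gain at all
    print(net_speedup(100.0, 50.0, 198.0))  # slower than staying in software
```

This is why tighter coupling (a coprocessor or an on-chip functional unit) matters: it shrinks `comm_time`, not `hw_speedup`.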
Types of Reprogrammable
Systems
Integrate the reprogrammable logic into
the processor itself.
A reprogrammable functional unit can be
configured on a per-algorithm basis.
Providing some special-purpose instructions
tailored to the needs of a given application.
Architectures of Multi-FPGA
(Reconfigurable) Systems
The most commonly used topologies:
Mesh: 1D (linear array), 2D, and 3D.
Crossbar: full, partial, mixed, and
hierarchical.
Hybrid between mesh and crossbar.
Application-specific architecture.
Hybrid Topology of a reconfigurable
system
[Figure: 16 FPGAs with attached RAMs and external interfaces.]
Splash 2: augments a linear array of FPGAs with
a crossbar switch.
Goal: Supporting systolic circuits.
Hybrid Topology
[Figure: four FPGAs with RAMs and a host interface.]
Anyboard: A linear array of FPGAs augmented
by global buses.
Hybrid Topology
[Figure: a 4 X 4 mesh of FPGAs with four RAMs and a host interface.]
DECPeRLe-1: a 4 X 4 mesh of FPGAs augmented
with shared global buses.
Application-Specific Topology of
MARC-1, one subsystem
[Figure: one Marc-1 subsystem: FPGAs, a memory, and an FPU, with numbered connections (1 to 5) to other FPGAs.]
The Marc-1: subsystem 1.
• Application: circuit simulation, where the
program to be executed can be optimized on a
per-run basis.
• The optimization exploits values that are constant
within a run but may vary from dataset to
dataset.
Application-Specific
Topology of Marc-1, cont.
[Figure: the Marc-1, showing blocks labeled Subsystem1 and connections numbered 1 to 5.]
Application-Specific Topology
[Figure: an array of FPGAs with attached RAMs.]
The RM-nc system: neural network.
Architecture for Computer
Prototyping
[Figure: FPGAs on a VME bus, with blocks labeled cache memory, register file, ALU, and FPU.]
The Mushroom processor
prototyping system.
Expandable Topologies
Hierarchical crossbar topology: can be
expanded by adding extra level.
- Quickturn systems.
Expandable mesh topology: can be
expanded by connecting individual boards
to form a large mesh.
The Virtual Wires Emulation System (IKOS).
Topology for Adapting Other
Components
Many multi-FPGA systems include non-FPGA
resources to provide more general-purpose
solutions.
The MORRPH system - sockets next to the
FPGAs allow arbitrary devices to be added
to the array.
The G800 board - contains two FPGAs and
four sockets.
Topology for Adapting Other
Components
The COBRA system
Contains:
base modules (expanding to a 2D mesh),
RAM modules,
I/O modules,
and bus modules.
The Springbok system
A pre-made daughter board holds an arbitrary
device (on the top) and an
FPGA (on the bottom).
Daughter boards are mounted on a baseplate.
Topology for Adapting Other
Components
The Quickturn systems - external
component adapters.
The Aptix FPCB - a reprogrammable PCB.
Design Methodology for
general-purpose configurable
systems
[Figure: applications are mapped onto a host computer and a reprogrammable system.]
Typical Software Methodology for
general-purpose configurable systems
Application spec. => Analysis => System-level synthesis
Software spec. => Code generation => Object code
Hardware spec. => Hardware synthesis
Typical Software Methodology for
general-purpose configurable systems
Hardware spec.
Synthesis
Partitioning & placement
Pin assignment & routing
FPGA P & R
Bit-stream files
Considerations for such
complex software systems
Architectural-specific design tasks.
Design automation process.
The mapping time dominates the setup
time for operating the system.
Run-time reconfigurability.
Design Specification and Languages for
reconfigurable software systems
Standard software programming languages,
e.g., C, C++, FORTRAN, and assembly language, vs.
HDLs.
Standard software programming languages - a
sequential execution model.
HDLs - a parallel execution model.
Who will use the system, and which kind of language is
better suited to system description?
Compilation Issues
Translate code from software languages
into hardware without losing the inherent
concurrency of hardware.
Compiler techniques for parallelizing code.
Straight-line code, control flow, and loops.
Transmogrifier C compiler.
System-level and High-level Synthesis
System-level design evaluation and analysis.
Design estimation.
Hardware-software partitioning.
Interface synthesis.
RTL synthesis.
Logic synthesis and technology mapping.
Partitioning and
Placement
Topology-aware partitioning methods.
Partitioning onto a multi-FPGA system is
equivalent to a placement problem.
Logic utilization and timing.
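A minimal sketch of how a candidate partition can be scored for pin usage, which is what ties partitioning to the topology. The netlist format (each net is a tuple of cell names) and the function name `pins_used` are assumptions made for illustration. A net that spans several FPGAs costs one I/O pin on each FPGA it touches.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def pins_used(nets: Iterable[Sequence[str]], assignment: Dict[str, int]) -> Counter:
    """Count the I/O pins each FPGA needs for nets that cross FPGA boundaries."""
    pins: Counter = Counter()
    for net in nets:
        fpgas = {assignment[cell] for cell in net}
        if len(fpgas) > 1:              # the net is cut by the partition
            for f in fpgas:
                pins[f] += 1
    return pins


if __name__ == "__main__":
    nets = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
    good = {"a": 0, "b": 0, "c": 1, "d": 1}   # keeps two nets internal
    bad = {"a": 0, "b": 1, "c": 0, "d": 1}    # cuts every net
    print(dict(pins_used(nets, good)), dict(pins_used(nets, bad)))
```

Comparing the counts against each FPGA's available pins gives a quick feasibility check before system-level routing is attempted.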
Pin Assignment and
Routing
Pin assignment - the process of determining
which I/O pins are used for each inter-FPGA
signal.
Pin assignment for a pre-fabricated multi-FPGA
system is equivalent to the global routing
problem.
Pin assignment greatly affects the FPGAs'
logic utilization and routability.
Run-Time Reconfigurability
This is a new issue in system design: how much of the processor is
virtual, and when should it be reconfigured?
Virtual hardware <=> virtual memory: how are they
related? Artificial intelligence, robotics, vision.
Hardware on demand.
What is the initial unconfigured structure?
What are the reconfiguring methods?
Software supporting time-varying mapping.
Many open problems need to be solved in the
coming years.
Applications: Splash 2
Stream oriented systolic and SIMD applications.
Scalable linear array of 16 to 256 processing
elements (1 XC4010 with 1/2 Mbyte).
VHDL based.
Sequence comparison - 2300M:0.75M cell
updates/sec (Splash 2:Sparc 10).
Edge detection - 10M:242K pixels/sec (Splash
2:Sparc 10).
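The ratios quoted above can be turned into speedup factors with a short calculation. The rates are the ones on this slide; only the helper name `speedup` is introduced here.

```python
def speedup(fast_rate: float, slow_rate: float) -> float:
    """Ratio of two throughput figures expressed in the same unit."""
    return fast_rate / slow_rate


# Sequence comparison: 2300M vs 0.75M cell updates/sec (Splash 2 vs Sparc 10).
sequence_speedup = speedup(2300e6, 0.75e6)
# Edge detection: 10M vs 242K pixels/sec (Splash 2 vs Sparc 10).
edge_speedup = speedup(10e6, 242e3)

if __name__ == "__main__":
    print(round(sequence_speedup), round(edge_speedup, 1))
```

So the sequence-comparison figure corresponds to a speedup of roughly 3000x, while edge detection gains roughly 40x.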
Applications: PAM (DEC)
Programmable Active Memory (PAM).
C++ based and mesh arrays of XC3090
(DECPeRLe-1).
Applications:
Multiple precision arithmetic.
RSA encryption.
Video compression (JPEG, MPEG, DCT).
High energy physics.
Telecommunications.
Sources of some slides
Peter Alfke
Xilinx, Inc
peter.alfke@xilinx.com