Implementing Algorithms in FPGA-Based Reconfigurable Computers Using C-Based Synthesis

Doug Johnson, Technical Marketing Manager
NCSA/OSC Reconfigurable Systems Summer Institute
Urbana, Illinois, July 11-13 2005


Celoxica

• UK-based system design company
• Provider of design tools, IP & services for Digital Imaging & Signal Processing
  • Image Processing
  • Video Processing
  • Sonar/Radar signal processing
  • Biometrics
  • Massively parallel data mining and matching
• Complete solutions for Electronic System Level (ESL) Design
  • System/algorithm acceleration
  • Co-design partitioning
  • Co-simulation & co-verification (C/C++/SystemC/Handel-C/Matlab/VHDL/Verilog)
  • Hardware compilation & C synthesis to reconfigurable architectures
• Consulting and professional services
  • Systems analysis and design strategy
  • System implementation capability


Presentation Objectives

Prerequisites
• Motivations for using FPGAs in RC and HPC
• HPC and RC FPGA systems hardware and infrastructure
• HPC algorithms and considerations for Reconfigurable Computing (RC)

Objectives
• Share a perspective on the state of the art for C-based HW design
• Describe the C to FPGA flow
• Illustrate with code examples

… Look forward to some critical debate…


Agenda

• Reconfigurable Computing
  • Considerations, core algorithm relationships, commercial applications
• C-based design
  • The solution space (its place in EDA, Electronic Design Automation)
  • Nature of C for HW design
  • The Design Flow
• Summary
• JPEG2000 Design Example


Reconfigurable Computing (RC)

"RC = Using FPGAs for (algorithmic) computation"

1. Embedded: Well established – body of knowledge/experience
2. Enterprise: Some
3. HPC: Starting Out


Reconfigurable Computing

[Figure: timeline running 1980, 1990, 2000, 20X0? with the milestones Commercial C-to-FPGA tools, FPGAs, Closely Coupled Systems, Partitioning Frameworks, Intimately Coupled Systems, Advanced Compilers, First RC Successes]

Promised Opportunities
• Algorithm Acceleration: exploit parallelism to increase performance with custom HW implementation
• Algorithm Offload: free CPU resource by offloading bottleneck processes

BIG Challenges
• Development complexity: design framework and methods, deployment and integration/middleware
• Coupling to coprocessor/data bandwidth
• Price/Performance/Power!
• Choosing the right applications!


FPGA Computing and Methodology

High Performance Embedded and Reconfigurable Computing: Why FPGA Computing?
• Moore's Law showing signs of strain
• Ability to parallelize in HW
• Price/GOPS coming down rapidly
• Hard IP blocks – excellent density
• Example: Floating Point Performance
  • Maximum for Virtex-4 – 50 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
  • Maximum for Virtex-2 – 17.5 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
  • "Can fit 10's of FPUs on 2 Xilinx Virtex-4's" (Courtesy of Justin Tripp, LANL)
  • Use of hard macros for functions is mandatory (example DSP48 on Virtex-4)

C-based design for FPGAs
• Several offerings on commercial marketplace or in research
  • Commercial – Celoxica, Mentor Graphics, Impulse Technologies, Mitrion…
  • Research – Sandia, UC Riverside, LANL
• RTL/HDL is the most widely used way to get to FPGAs but is not usable by SW engineers


2005 Conventional Wisdom for RC

1. Small data objects
   • Data transfer overhead to coprocessor, high operation to byte ratio
   • Closely coupled systems
2. Modest arithmetic
   • Integer/fixed precision calculations; floating point too resource expensive
   • Difficult to design and implement complex algorithms in HW
   • High Density Devices; C-based design
3. Data-parallelism (Essential)
   • Parallelism essential - FPGA clocks an order of magnitude slower than CPUs
   • Fine grain - wide data widths
   • Medium grain - operation/function routine
   • Coarse grain - multiple instantiations of application processes
4. Pipeline-ability
   • Streaming Applications – most successful
   • Fewer Issues with Latency in HPC
5. Simple Control
   • Difficult to design complex scheduling schemes in parallel HW
   • Soft Cores/C-based design


Further Considerations

6. Exploiting "Soft" programmable HW
   • Configurable Applications: schedule and load HW content prior to HW execution
   • Reconfigurable Applications: dynamically change HW content during HW execution
   • Few Compelling Examples in HPC


Commercial RC Applications …using C-based design

Well established in embedded systems:
• Digital Video Technology and Image Processing
  • "PROCESSING AT THE SENSOR" versus local and/or remote processing
  • 3D LCD display development and test
  • Real-time verification of HDTV image processing algorithms
  • Robust image matching - product tracking and production line control
• Defense & Security
• Digital Signal Processing
• Communications and Networking
• Consumer
• Automotive & Industrial
• Internet reconfigurable multimedia terminal, MP3, VoIP etc.
• Ground traffic simulation testbed for broadband satellite network communications
• Satellite based Internet data tracking system
• Rapid Systems Prototyping
• Engine control unit for 3-phase motors
• Radar and sonar beamforming and spatial filtering
• Computer aided tomography security system
• Automotive safety system incorporating sensor fusion
• Robotic vision system for object detection and robot guidance


Commercial RC Applications …using C-based design

Enterprise Computing
• Content processing solutions: XML parsing, virus checking
• Packet/Pattern Matching/Filtering
• Compression/decompression
• Security/Encryption – DES/3-DES, SHA, MD5, AES/Rijndael

High Performance Computing
• Image processing: CT scan analysis, 3D modeling, Ray Tracing
• Finite element analysis and simulation
• Custom Vector Engines
• Genome calculations
• Seismic data processing


Core Algorithm Relationships in HPC

[Figure: map relating HPC application domains (for example drug design, weather and climate, seismic processing, genome processing, cryptography, VLSI design, MRI imaging) to a central group of Basic Algorithms & Numerical Methods (PDE, ODE, Fields, Transport, n-body, Fourier Methods, Monte Carlo, Discrete Events, Graph Theoretic, Pattern Matching, Symbolic Processing, Raster Graphics). Source: Rick Stevens - ANL]

How do we map out the right Apps?


Exploiting FPGA in HPC

• Hardware: How do we select and benchmark?
  • "Enterprise Quality" co-processor system products (Cray XD1, SGI RASC)
  • Robust PCI/PCI-X/VME-based FPGA card solutions for development
• A software design methodology is essential:
  • SW dominated application sector
    • Target developers have a SW background
    • Register Transfer Level (RTL) and Hardware Description Languages (HDL) are foreign
  • Complete designs can be specified in a C environment
  • Porting to HW implementations simplified
  • Platform abstractions through APIs and Libraries
  • Simplified Specification, Development, Deployment


Embedded Hardware (HW) Design

[Figure: embedded HW design flow from Specification through Function (algorithm design, block design, fixed-point extraction, DSP IP), Architecture (TLM, APIs/Libraries, frameworks, IP models, fast mixed simulation, architecture exploration, design analysis, HW-accelerated simulation, custom processors), C-based HLL synthesis and interface synthesis, and Implementation (reconfigurable FPGA/SoPC prototypes, implementation IP, emulation platforms, RTL verification, RTL) to Physical Design, with one path labeled "C to FPGA/SoPC"]


C to FPGA Accelerated System

[Figure: C to FPGA flow. Function & Architecture levels AL C/C++ and CA C for HW. Stages: Specification Model, Design (Algorithm Design, Testbench), Software Model, System Model, Partitioning into HW and SW with COMMS between them, APIs/Libraries, Mixed Simulation, Architecture Exploration, Design Analysis, Optimization, C-Based Synthesis, BSP, outputs RTL/EDIF/OBJ, Synthesis, P&R, Implementation on FPGA and Processor]


Challenges for C-based synthesis

• Concurrency (Parallelism): compiler-determined (behavioral synthesis) or explicit
• Timing: constraints or explicit, rules-based
• Data Types: annotations, additional or C++
• Communication: additional or C-like


Two Approaches to C-based Design

Handel-C: C Algorithm to FPGA
• Core Libraries: TLM (PAL/DSM), Fixed/Floating point …
• Core Language: par{…}, seq{…}, Interfaces, Channels, Bit Manipulation, RAM & ROM, single cycle assignment
• Data Types: bits and bit-vectors, arbitrary width integers, signals
• Built on the ANSI/ISO C Language Standard

SystemC: SoC (System-on-a-Chip) Prototyping/Verification
• Core Libraries: SCV, TLM, Master/Slave …
• Standard Channels for Various MOC: Kahn Process Networks, Static Dataflow…
• Primitive Channels: Signal, Timer, Mutex, Semaphore, FIFO, etc.
• Core Language: Modules, Ports, Processes, Events, Interfaces, Channels, Event Driven Sim Kernel
• Data Types: 4-valued logic/vectors, bits and bit-vectors, arbitrary width integers, fixed-point, C++ user-defined types
• Built on the ANSI/ISO C++ Language Standard


System Design Refinement

Levels shown: AL C/C++, CP Handel-C, CA Handel-C, CA Handel-C

Function
• System Function
• Coarse grain parallelism

    par{
        processA(…);
        processB(…);
        processC(…);
        processD(…);
    }

Algorithm
• Parallel algorithm design
• Fine-grain parallelism
• Bit/cycle true processes
• Testbench

    void processD(…){
        unsigned 9 a,b,c;
        par{
            a=1;
            b=2;
        }
        c=3;
    }

Architecture
• Add interfaces
• Signal/cycle accurate test

    void main(){
        interface port_in…
        interface port_out…
        …
    }

Output: EDIF/RTL


Systems Integration

Implementation
• Complete system design
• Interface to pins
• Multi-Clock domain
• IP Integration

    set clock = external "CLK";
    set reset = external "RST";
    interface Data(…)…

    void main()
    {
        par{
            processA(…);
            processB(…);
            processC(…);
            processD(…);
        }
    }

Callouts on the slide: { interface processB(…)…}; and { interface processD(…)…}; alongside the labels "EDIF (Electronic Design Interchange Format)" and "RTL from HDL IP".

Output: EDIF/RTL


Parallel Debug in C environment

[Screenshot: Algorithm Design]


Resource Usage/Speed Estimations

[Screenshot: Architecture Exploration]


FPGA Support

[Screenshot: Technology mapping, Optimizations]


Handel-C Template Multiplier (Pipelined)

    set clock = external "clk";

    void main()
    {
        …
        while(1) par
        {
            …
            process();
        }
    }

    void process()
    {
        unsigned W A, B, C;
        while(1) par
        {
            …
            Multiply(A, B, &C);
            …
        }
    }

    void Multiply(unsigned W A, unsigned W B, unsigned W *C)
    {
        static unsigned W a[W], b[W], c[W];
        par{
            a[0] = A;
            b[0] = B;
            c[0] = a[0][0] == 0 ? 0 : b[0];
            par (i = 1; i < W; i++)
            {
                a[i] = a[i-1] >> 1;
                b[i] = b[i-1] << 1;
                c[i] = c[i-1] + (a[i][0] == 0 ? 0 : b[i]);
            }
            *C = c[W-1];
        }
    }


Summary

• Commercial C-based design is a reality
• For the HPC and RC communities it offers:
  • Fastest route to accelerating SW designs in FPGA
  • Deterministic and quality results
  • State of the art tools used by embedded systems designers
  • RC platforms for rapid prototyping
  • Lower barrier to adoption than RTL technologies
  • Greater customization and productivity than block based approaches
  • Complete integration with RTL/block based approaches for "Power users"
  • Simple migration, development to deployment with full library support


Design Example: JPEG2000 Image Compression Algorithm


Example Design: JPEG 2000 Compressor

[Figure: block diagram with Original Image, Pre processing, RGB to YUV conversion, DWT, Quantization, Rate Control, Tier-1 Encoder, Tier-2 Encoder, Coded Image]

Five Steps to HW Platform:
1. Specification Model
2. Functional System Model
3. Architecture and Communication Model
4. Implementation Model
5. HW Platform

Activities between the steps: Algorithm Profiling, System Estimations, Optimization, Direct Synthesis C to EDIF, Board level integration


Step 1: Specification Model

Function & Architecture: AL C/C++
Flow stages shown: Specification Model, Design, Testbench, Software Model

• 22 *.c and *.h files
• 1468 lines of code
• Algorithm Profiling
  - Memory
  - Processing Time
  - Data Flow
• DWT/Tier1 are the compute intensive blocks

[Figure: JPEG 2000 block diagram as above]
[Chart: Memory Usage (x86), in MB, axis 0 to 6, series Current and Sum]
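Worked sketch: the profiling in the Specification Model singles out the DWT as a compute-intensive block. The deck does not say which wavelet filter the design uses, so the C below assumes the reversible 5/3 integer lifting filter from JPEG 2000, applied to a single row of even length. The names floor_div, dwt53_forward and dwt53_inverse are hypothetical and chosen only for illustration. The code uses integer adds and floor divisions by 2 and 4 only, which matches the deck's earlier point that integer/fixed precision arithmetic suits FPGAs.

```c
#include <assert.h>
#include <stddef.h>

/* Integer division rounding toward minus infinity (C's "/" truncates toward zero). */
static int floor_div(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0)))
        q -= 1;
    return q;
}

/* Forward 5/3 lifting on x[0..n-1], n even.
 * s receives n/2 low-pass samples, d receives n/2 high-pass samples.
 * Borders use symmetric (mirror) extension. */
void dwt53_forward(const int *x, size_t n, int *s, int *d)
{
    assert(n >= 2 && n % 2 == 0);
    size_t half = n / 2;

    /* Predict: each odd sample minus the floor-average of its even neighbours. */
    for (size_t k = 0; k < half; k++) {
        int left  = x[2 * k];
        int right = (2 * k + 2 < n) ? x[2 * k + 2] : x[n - 2];
        d[k] = x[2 * k + 1] - floor_div(left + right, 2);
    }

    /* Update: each even sample plus a rounded quarter of the adjacent details. */
    for (size_t k = 0; k < half; k++) {
        int prev = (k > 0) ? d[k - 1] : d[0];
        s[k] = x[2 * k] + floor_div(prev + d[k] + 2, 4);
    }
}

/* Inverse transform: undo the update step, then undo the predict step. */
void dwt53_inverse(const int *s, const int *d, size_t n, int *x)
{
    assert(n >= 2 && n % 2 == 0);
    size_t half = n / 2;

    for (size_t k = 0; k < half; k++) {
        int prev = (k > 0) ? d[k - 1] : d[0];
        x[2 * k] = s[k] - floor_div(prev + d[k] + 2, 4);
    }
    for (size_t k = 0; k < half; k++) {
        int left  = x[2 * k];
        int right = (2 * k + 2 < n) ? x[2 * k + 2] : x[n - 2];
        x[2 * k + 1] = d[k] + floor_div(left + right, 2);
    }
}
```

For the row 1, 2, 3, 4, 5, 6, 7, 8 the forward step gives s = {1, 3, 5, 7} and d = {0, 0, 0, 1}, and the inverse step restores the original row exactly. The two loops inside each function have no dependencies between iterations, so in a hardware description each loop is a candidate for the fine-grain parallelism the deck describes.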
Step 2: Functional System Model

Function & Architecture: AL C/C++, CA Handel-C
Flow stages shown: Specification Model, Design, Testbench, Software Model, System Model, Partitioning (HW/SW)

[Figure: JPEG 2000 block diagram (Original Image, Pre processing, RGB to YUV conversion, DWT, Rate Control, Quantization, Tier-1 Encoder, Tier-2 Encoder, Coded Image) partitioned between HW and SW]

    /* Handel-C */
    extern "C" sw_block(…);

    void main(void){
        while(1) par{
            sw_block(…);
            hw_block(…);
        }
    }

    void hw_block(…)
    { … }

    /* C */
    void sw_block(…)
    { … }

Estimates at this step: Cycles/speed/area…


Step 3: Architecture and Communication Model

Function & Architecture: AL C/C++, CA Handel-C

[Figure: JPEG 2000 block diagram with FIFO channels between blocks and a DsmPortH2S port]

• DsmRead(…)
• DsmWrite(…)
• DsmFlush(…)

Estimates at this step: Dataflow/Cycles/speed/area…


Step 4: Implementation Model

[Figure: processes A, B, C, D synthesized for a device family to EDIF or RTL]

    void main(){
        interface port_in…
        interface port_out…
        …
    }

Estimations from Synthesis: DWT ~6% of a VII1000


Step 5: Hardware Platform

[Figure: processes A, B, C, D and the DWT mapped onto uP, HW and RAM blocks]

Flow: EDIF -> P&R -> FPGA

Implementation Model Estimations: DWT ~6%

From P&R Report for VII1000-4:
• Slices: 758
• Device utilization: 7%
• Speed (MHz): 151
• Lines of code: 395

Board Level Integration
• Specific I/O Implementations
• Pin Location constraints

Implementation
• Microblaze + Xilinx FPGA
• Nios + Altera FPGA
• Xilinx V2Pro
• Toshiba MeP + FPGA
• PowerPC + PLB + FPGA
• PC + FPGA PCI Card
• …etc


JPEG2000 DWT Implementation

• Example taken from a "Xilinx Design Challenge"
• Comparison made with HDL approach
• See Article in Xcell Volume 46
  http://www.xilinx.com/publications/xcellonline/xcell_46/xc_celoxica46.htm

                        C-Based Design                       HDL
                        1st pass    2nd pass    Final
  Slices                646         546         758          800
  Device utilization    6%          5%          7%           7%
  Speed (MHz)           110         130         151          128
  Lines of code         386         386         395          435
  Design time (days)    6           7 (6+1)     7 (6+1)      20*
  Simulation time       5 mins      5 mins      20 mins      +6 hours

  * Doesn't include partitioning spec. development
  Lena used as testbench throughout, input bit width 12, max 1K image width

Observations
• Comparable
• Using C faster
• Using C quicker
• Expert vs Novice


JPEG2000 MQ coder Implementation

                                  Celoxica 1st Pass    Celoxica Final    HDL
  Slices                          1,347                1,999             620
  Device utilization              12%                  18%               6%
  Speed (MHz)                     89.5                 115.5             76
  Lines of code                   310                  330               800
  Design time (days)              10                   12 (10+2)         30*
  Simulation time for Lena jpeg   5 mins               5 mins            Hours

  * Doesn't include partitioning spec. development

> Common language base eased porting of the MQ coder source to hardware; DSM allowed partitioning, co-verification and movement of data between hardware and software
> Optimizations included adding parallelism, replacing for() loops with while() loops, and simplifying loop control
> Design developed in a unified design environment

Observations
• HDL Smaller
• HC Faster
• HC Quicker
• Expert vs Novice
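Worked sketch: the MQ coder slide lists "replacing for() loops with while() loops" and "simplifying loop control" among the optimizations but does not show the source. The plain C below is a generic illustration of that kind of rewrite, not the MQ coder code; sum_for and sum_while are hypothetical names. The presumed motivation, assuming Handel-C's single cycle assignment rule listed earlier in the deck, is that a for loop's increment is its own assignment and so costs a clock cycle per iteration, whereas a while loop lets the designer place the counter update next to the body where the two can share a cycle inside a par block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: counted for loop with an up-counter and a "less than" test. */
uint32_t sum_for(const uint16_t *x, size_t n)
{
    assert(x != NULL || n == 0);
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Restructured: while loop with a down-counter tested against zero.
 * Both statements in the body read the value of i from before the
 * iteration, so neither depends on the other's result. In Handel-C
 * (an assumption, not shown in the deck) they could therefore sit in
 * one par{...} block and complete in the same clock cycle. */
uint32_t sum_while(const uint16_t *x, size_t n)
{
    assert(x != NULL || n == 0);
    uint32_t acc = 0;
    size_t i = n;
    while (i != 0) {
        acc += x[i - 1];   /* body: uses the current i */
        i--;               /* loop control: also uses the current i */
    }
    return acc;
}
```

Both functions return the same sum for any input; only the loop structure differs. Testing the counter against zero is the "simplified loop control" part: an equality check with a constant is typically cheaper in logic than a magnitude comparison against a variable bound.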