Experimental Analysis of Multi-FPGA Architectures over RapidIO for Space-Based Radar Processing
Chris Conger, David Bueno, and Alan D. George
HCS Research Laboratory, College of Engineering, University of Florida
20 September 2006

Project Overview
- Considering advanced architectures for on-board satellite processing
  - Reconfigurable components (e.g., FPGAs)
  - High-performance, packet-switched interconnect
- Sponsored by Honeywell Electronic Systems Engineering & Applications
- RapidIO as candidate interconnect technology
- Ground Moving Target Indicator (GMTI) case-study application
- Design a working prototype system on which to perform performance and feasibility analyses
- Experimental research, with a focus on node-level design and memory-processor-interconnect interface architectures and issues:
  - FPGAs for main processing nodes; parallel processing of radar data
  - Computation vs. communication: application requirements, component capabilities
  - Hardware-software co-design
  - Numerical format and precision considerations
[Figure: satellite image, courtesy [5]]

Background Information
- RapidIO
  - Three-layered, embedded-system interconnect
  - Point-to-point, packet-switched connectivity
  - Peak single-link throughput ranging from 2 to 64 Gbps
  - Available in serial or parallel versions, with message-passing or shared-memory programming models
- Space-Based Radar (SBR)
  - Space environment places tight constraints on the system
    - Frequency-limited, radiation-hardened devices
    - Power- and memory-limited
  - Streaming data for continuous, real-time processing of radar or other sensor data
  - Pipelined or data-parallel algorithm decomposition
- GMTI composed of several common kernels
  - Pulse compression, Doppler processing, CFAR detection
  - Space-Time Adaptive Processing (STAP) and beamforming
  - Composed mainly of linear algebra and FFTs
  - Transposes or distributed corner turns of the entire data set are required, which stresses the memory hierarchy
[Figure: data-parallel decomposition (each incoming data cube split across PEs, every PE running pulse compression, Doppler processing, STAP, and CFAR detection) vs. pipelined decomposition (each stage mapped to a group of PEs, one data cube entering per CPI, with results of each cube ready a few CPIs later); image courtesy [6]]

Testbed Hardware
- Custom-built hardware testbed, composed of:
  - Xilinx Virtex-II Pro FPGAs (XC2VP20-FF1152-6) with RapidIO IP cores
  - 128 MB SDRAM (8 Gbps peak memory bandwidth per node)
  - Custom-designed PCBs for enhanced node capabilities
  - Novel processing-node architecture (HDL)
- Performance measurement and debugging with:
  - 500 MHz, 80-channel logic analyzer
  - UART connection for file transfer
- While we prefer to work with existing hardware, if the need arises we have the ability to design custom hardware (e.g., a RapidIO switch PCB)
[Figure: RapidIO testbed, showing two nodes directly connected via RapidIO, as well as logic analyzer connections; RapidIO switch PCB layout]
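To make the GMTI kernel chain above concrete, the following is a minimal C sketch of one coherent processing interval (CPI). The cube layout, type names, and kernel entry points are hypothetical stand-ins for the co-processor engines described later, not the testbed's actual software API, and the kernel bodies are stubs.

```c
/* Minimal sketch of one CPI of the GMTI kernel chain (pulse compression,
 * Doppler processing, STAP, CFAR). All names and the cube layout are
 * illustrative; on the testbed these kernels run in hardware engines. */
#include <stddef.h>

#define RANGES   1024   /* range dimension of data cube   */
#define PULSES    128   /* pulse dimension of data cube   */
#define CHANNELS   16   /* channel dimension of data cube */

typedef struct {
    /* 16-bit fixed-point real/imaginary pairs, 32 bits per complex element */
    short data[CHANNELS][PULSES][RANGES][2];
} cube_t;

/* Hypothetical kernel entry points, one per GMTI stage (stubbed here). */
void pulse_compress(cube_t *c)  { (void)c; /* pulse compression engine   */ }
void doppler_process(cube_t *c) { (void)c; /* Doppler processing engine  */ }
void stap(cube_t *c)            { (void)c; /* STAP engine (future work)  */ }
size_t cfar_detect(const cube_t *c, unsigned *det, size_t max_det)
{
    (void)c; (void)det; (void)max_det;
    return 0;   /* detection count would come from the CFAR engine */
}

/* Each incoming data cube flows through the full chain once per CPI. */
size_t process_cpi(cube_t *cube, unsigned *detections, size_t max_det)
{
    pulse_compress(cube);
    doppler_process(cube);
    stap(cube);
    return cfar_detect(cube, detections, max_det);
}
```

In the data-parallel decomposition, each PE would run this full chain on its slice of the cube; in the pipelined decomposition, each stage would run on a different group of PEs, with cubes streaming between them every CPI.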
Node Architecture
- All processing performed by hardware engines; control performed by an embedded PowerPC
- Visualize the node design as a triangle of communicating elements:
  - External memory controller
  - Processing engine(s)
  - Network interface controller
- Parallel data paths (FIFOs and control logic) allow concurrent operations from different sources
- PowerPC interfaces with the DMA engine to control memory transfers
- PowerPC interfaces with the processing engines to control processing tasks
- Custom software API permits application development
- Locally-initiated transfers are completely independent of incoming, remotely-initiated transfers
- Internal memory used for processing buffers (no external SRAM)
[Figure: conceptual diagram of the FPGA design (node architecture): third-party SDRAM controller core and RapidIO core, command controller, DMA controller, on-chip memory controller, reset and clock generator, PowerPC, and hardware modules 1 through N, connected through local and remote request ports and FIFOs]

Processing Engine Architectures
- All co-processor engines wrapped in a standardized interface (single data port, single control port)
- Up to 32 KB of dual-port SRAM internal to each engine
  - Entire memory space addressable from the external data port, with read and write capability
  - Internally, the SRAM is divided into multiple, parallel, independent read-only or write-only ports
- The two example co-processor engine designs below illustrate their similarities
[Figure: Pulse Compression co-processor architecture (dual-port SRAM input/output buffers feeding a Xilinx FFT core and vector multiplier for FFT/IFFT-based processing) and CFAR co-processor architecture (dual-port SRAM buffers feeding an averaging function and a thresholding function that produces the detection output), both connected to the memory controller]

Experimental Environment
System and algorithm parameters:
  Parameter          Value                                  Description
  Ranges             1024                                   Range dimension of data cube
  Pulses             128                                    Pulse dimension of data cube
  Channels           16                                     Channel dimension of data cube
  Proc. Frequency    100 MHz                                PowerPC / co-processor engine clock frequency
  Mem. Frequency     125 MHz                                Memory clock frequency
  Net. Frequency     250 MHz                                RapidIO clock frequency
  Max System Size    2                                      Max number of FPGAs used experimentally
  Proc. SRAM Size    32 KB                                  Max SRAM internal to each co-processor
  FIFO Size (each)   8 KB                                   Size of FIFOs to/from SDRAM
  Numerical Format   Signed-magnitude fixed-point, 16-bit   Complex elements, 32 bits/element: sign bit, 7 integer bits, 8 fraction bits
[Figure: experimental setup, with the user workstation connected to the RapidIO testbed over USB/UART and JTAG, and the logic analyzer attached to the testbed]
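As a concrete illustration of the numerical format in the table above (signed magnitude, 16 bits, 7 integer and 8 fraction bits), here is a minimal C sketch of float-to-fixed conversion. The helper names, and the choice of round-to-nearest with saturation on overflow, are our assumptions rather than details taken from the testbed.

```c
#include <math.h>
#include <stdint.h>

/* 16-bit signed-magnitude fixed point: 1 sign bit, 7 integer bits,
 * 8 fraction bits. Two such values form one 32-bit complex element.
 * Helper names and rounding/saturation policy are illustrative only. */
#define FRAC_BITS 8
#define MAG_MASK  0x7FFFu   /* low 15 bits hold the magnitude */

uint16_t float_to_fix16(float x)
{
    uint16_t sign = (x < 0.0f) ? 0x8000u : 0x0000u;
    float mag = fabsf(x) * (float)(1 << FRAC_BITS);
    uint16_t m = (mag > (float)MAG_MASK) ? MAG_MASK : (uint16_t)(mag + 0.5f);
    return sign | m;   /* saturate rather than wrap on overflow */
}

float fix16_to_float(uint16_t v)
{
    float mag = (float)(v & MAG_MASK) / (float)(1 << FRAC_BITS);
    return (v & 0x8000u) ? -mag : mag;
}
```

Quantizing each real and imaginary component to 8 fraction bits is the precision loss referred to in the data-verification results later in this section.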
Experimental steps
- No high-speed input to the system, so data must be pre-loaded
  - XModem over UART provides file transfer between the testbed and the user workstation
- User prepares measurement equipment and initiates processing after the data is loaded through the UART interface
- Processing completes relatively quickly; the output file is transferred back to the user
- Post-analysis of output data and/or performance measurements
- In short: (1) configure FPGAs and load input data, (2) run the timed experiment, (3) unload the output file

Results: Baseline Performance
- The data path architecture results in independent clock domains as well as varied data path widths:
  - SDRAM:      64-bit, 125 MHz   (8 Gbps max theoretical)
  - Processors: 32-bit, 100 MHz   (4 Gbps max theoretical)
  - Network:    64-bit, 62.5 MHz  (4 Gbps max theoretical)
- Generic data transfer tests stress each communication channel and measure the actual throughputs achieved
- Maximum sustained* throughputs:
  - SDRAM:     6.67 Gbps
  - Processor: 4 Gbps
  - Network:   3.81 Gbps
- Notice that transfers never achieve over 4 Gbps: "a chain is only as strong as its weakest link"
- Simulations of the custom SDRAM controller core alone suggest a maximum sustained* throughput of 6.67 Gbps
- Processor transfers are SRAM-to-FIFO, so they achieve 100% efficiency; latency is negligible
[Figure: RapidIO (SWRITE) throughput vs. transfer size (256 B to 1 MB) and local memory throughput vs. transfer size (32 B to 32 KB), plotted against the network, SDRAM, and processor bandwidth limits]
* Assumes sequential addresses and data/space always available for writes/reads

Results: Kernel Execution Time
- Processing starts when all data is buffered; no inter-processor communication during processing
- Double-buffering maximizes co-processor efficiency
- For each kernel, processing is done along one dimension
- Multiple "processing chunks" may be buffered at a time:
  - The CFAR co-processor has 8 KB buffers; all others have 4 KB buffers
  - CFAR works along the range dimension (1024 elements, or 4 KB), implying 2 processing chunks per buffer for the CFAR engine
- CFAR is only 15% faster than Doppler processing, despite a 39% faster per-buffer execution time
  - The loss of performance for CFAR is due to under-utilization
- The model below gives the execution time of an individual kernel processing an entire cube (using double-buffering):
    T_kernel = 2 · T_trans + N · T_steady, where
      T_trans      = DMA_blocking + MAX[DMA_blocking, PROC_buffer]
      T_steady     = 2 · MAX[2 · DMA_blocking, PROC_buffer]
      N            = general term for the number of iterations
      DMA_blocking = time to complete a blocking DMA transfer of M elements
      PROC_buffer  = time to process M elements of buffered data (depends on the task)
- Kernel execution time can be capped by processing time as well as by memory bandwidth
  - Beyond a certain point, higher co-processor frequencies or more engines per node become pointless
[Figure: single co-processor engine kernel execution times for an entire data cube (milliseconds) and per-buffer execution times (microseconds) for pulse compression (PC), Doppler processing (DP), and CFAR]
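As a quick way to exercise this model, the short C sketch below evaluates T_kernel for a given per-buffer DMA time and per-buffer processing time; the function names and example numbers are ours, not measurements from the testbed.

```c
/* Sketch of the double-buffered kernel timing model above.
 * Times are in microseconds; names are illustrative, not testbed code. */
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* t_dma:  blocking DMA time for one M-element buffer
 * t_proc: co-processor time for one M-element buffer
 * n:      number of steady-state iterations                     */
static double t_kernel(double t_dma, double t_proc, double n)
{
    double t_trans  = t_dma + max2(t_dma, t_proc);
    double t_steady = 2.0 * max2(2.0 * t_dma, t_proc);
    return 2.0 * t_trans + n * t_steady;
}

int main(void)
{
    /* Example: once 2*t_dma exceeds t_proc, the kernel is memory-bound and
     * a faster co-processor clock no longer reduces execution time. */
    printf("compute-bound: %.1f us\n", t_kernel(10.0, 30.0, 100.0));
    printf("memory-bound:  %.1f us\n", t_kernel(20.0, 30.0, 100.0));
    return 0;
}
```

When 2 · DMA_blocking exceeds PROC_buffer, the steady-state term is set entirely by memory bandwidth, which is the regime where higher co-processor frequencies or additional engines per node stop helping.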
Results: Data Verification
- Processed data inspected for correctness
  - Expect a decrease in the accuracy of results due to the decrease in precision: fixed-point vs. floating-point, and 16-bit elements vs. 32-bit elements
- CFAR and Doppler processing results compared against "golden" reference data
  - Reference is a C version of an equivalent algorithm from Northwestern University & Syracuse University [7]
  - MATLAB also used for verification of the Doppler processing and pulse compression engines
  - The pulse compression engine is very similar to Doppler processing; its results are omitted due to space limitations
- CFAR detections suffer significantly from the loss of precision
  - 97 detections (some false) vs. 118 targets present
  - More false positives where values are very small; more false negatives where values are very large
- Slight algorithm differences prevent a direct comparison of the Doppler processing results with [7]
  - The MATLAB implementation and the testbed were both fed a square wave as input
  - Aside from expected scaling in the testbed results, data skewing can be seen from the loss of precision
[Figure: CFAR detection and Doppler processing outputs shown alongside the reference data]
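For reference, the sketch below shows a plain cell-averaging CFAR pass along the range dimension, the same averaging-and-thresholding structure the CFAR co-processor implements in hardware; the window, guard-cell, and threshold constants are placeholders, since the testbed's actual parameters are not given here.

```c
/* Simple software sketch of cell-averaging CFAR along one range line.
 * Window size, guard cells, and threshold factor are placeholders,
 * not the testbed's parameters. */
#include <stddef.h>

#define N_RANGE   1024
#define N_AVG        8   /* averaging cells on each side (placeholder) */
#define N_GUARD      2   /* guard cells on each side (placeholder)     */
#define THRESH_K   4.0f  /* threshold scale factor (placeholder)       */

/* power[] holds |x|^2 for one range line; detect[] is set to 1 where
 * the cell under test exceeds the scaled local noise average.        */
void cfar_range_line(const float *power, unsigned char *detect)
{
    for (size_t i = 0; i < N_RANGE; i++) {
        float sum = 0.0f;
        size_t count = 0;
        for (size_t k = N_GUARD + 1; k <= N_GUARD + N_AVG; k++) {
            if (i >= k)          { sum += power[i - k]; count++; }
            if (i + k < N_RANGE) { sum += power[i + k]; count++; }
        }
        float noise = (count > 0) ? sum / (float)count : 0.0f;
        detect[i] = (power[i] > THRESH_K * noise) ? 1u : 0u;
    }
}
```

In a 16-bit fixed-point version of this computation, very small power values quantize toward zero while very large ones saturate, which is consistent with the pattern of false positives and false negatives noted above.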
Results: FPGA Resource Utilization
- Virtex-II Pro (2VP40) FPGA is the target device
- Baseline design includes:
  - PowerPC, buses, and peripherals
  - RapidIO endpoint (PHY + LOG) and endpoint controller
  - SDRAM controller and FIFOs
  - DMA engine and control logic
  - A single CFAR co-processor engine
- The co-processor engines are the only real variable aspect of the design
  - Resource requirements increase with greater data precision

FPGA resource usage*, complete HCS_CNF design:
  Resource                 Used     Avail.   %
  Occupied Slices          11,059   19,392   57
  PowerPCs                 1        2        50
  BlockRAMs                111      192      58
  Global Clock Buffers     14       16       88
  Digital Clock Managers   5        8        63
  Equivalent gate count for design: 7,743,720

Co-processor engine usage*:
  CFAR Detection co-processor
    Occupied Slices   1,094   19,392   5
    BlockRAMs         16      192      8
    Equivalent gate count for design: 1,076,386
  Pulse Compression co-processor
    Occupied Slices   3,386   19,392   17
    BlockRAMs         23      192      11
    MULT18X18s        20      192      10
    Equivalent gate count for design: 1,726,792
  Doppler Processing co-processor
    Occupied Slices   2,349   19,392   12
    BlockRAMs         14      192      7
    MULT18X18s        16      192      8
    Equivalent gate count for design: 1,071,565
* Resource numbers taken from the mapper report (post-synthesis)

Conclusions and Future Work
- A novel node architecture was introduced and demonstrated
  - All processing performed in hardware co-processor engines
  - Applications developed in Xilinx's EDK environment using C; a custom API enables control of the hardware resources through software
- External memory (SDRAM) throughput at each node is critical for system performance in systems with hardware processing engines and an integrated high-performance network
  - Multiple request ports to the SDRAM controller improve concurrency, but do not remove the bottleneck
    - Different modules within the design can request and begin transfers concurrently through FIFOs
    - The SDRAM controller can still only service one request at a time (assuming one external bank of SDRAM)
    - The benefit of parallel data paths decreases with larger transfer sizes or more frequent transfers
  - Parallel state machines and control logic take advantage of the FPGA's affinity for parallelism
    - Custom design, not standardized like buses (e.g., CoreConnect, AMBA)
    - Parallel data paths are a nice feature, at the cost of more complex control logic and higher potential development cost
- Pipelined decomposition may be better for this system, due to co-processor (under-)utilization
  - If co-processor engines sit idle most of the time, why have them all in each node?
  - Some co-processor engines could be run at slower clock rates to conserve power without loss of performance
  - With sufficient memory bandwidth, multiple engines could be used concurrently
- Numerical precision
  - 32-bit fixed-point numbers (possibly larger) are required if not using floating-point processors
  - Notable error can be seen in the processed data simply by visually comparing it to the reference outputs
  - Error will compound as data propagates through each kernel in a full GMTI application
  - Larger precision means more memory and logic resources are required, but not necessarily slower clock speeds

Future Research
- Enhance the testbed with more nodes, more stable boards, and Serial RapidIO
- Complete the beamforming and STAP co-processor engines; demonstrate and analyze the full GMTI application
- Enhance the architecture with a direct data path between the processing SRAM and the network interface
- More in-depth study of precision requirements and error, along with their performance and resource implications

Bibliography
[1] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar," Proc. High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, Sep. 28-30, 2004.
[2] D. Bueno, A. Leko, C. Conger, I. Troxel, and A. George, "Simulative Analysis of the RapidIO Embedded Interconnect Architecture for Real-Time, Network-Intensive Applications," Proc. 29th IEEE Conf. on Local Computer Networks (LCN), via the IEEE Workshop on High-Speed Local Networks (HSLN), Tampa, FL, Nov. 16-18, 2004.
[3] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "RapidIO-based Space Systems Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator," Proc. High-Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, Sep. 20-22, 2005.
[4] D. Bueno, C. Conger, and A. George, "RapidIO for Radar Processing in Advanced Space Systems," ACM Transactions on Embedded Computing Systems, to appear.
[5] http://www.noaanews.noaa.gov/stories2005/s2432.htm
[6] G. Shippen, "RapidIO Technical Deep Dive 1: Architecture & Protocol," Motorola Smart Network Developers Forum, 2003.
[7] A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, M. Linderman, and R. Brown, "Design, Implementation and Evaluation of Parallel Pipelined STAP on Parallel Computers," IEEE Trans. on Aerospace and Electronic Systems, vol. 36, pp. 528-548, April 2000.