Rapid Prototyping of RADAR Signal Processing Systems using Ptolemy Classic
Ptolemy MiniConference, UCB
Denis Aulagnier, Patrick Meyer, Hans Schurer, Xavier Warzee, THALES
D. Aulagnier/P. Meyer/H.Schurer/X.Warzee, p1

CONTENTS
* The ESPADON programme & methodology
  – Environment & development process used for the benchmark
* ESPADON Ptolemy developments & the benchmark
  – Benchmarking application for ESPADON
  – Ptolemy developments
    • Library set-up and features
      – Improvements done after first use
    • MERCURY target development
  – Benchmark iterations and results
* Conclusions

ESPADON & INDUSTRIAL PARTNERS
* ESPADON: Environment for Signal Processing Application Development and PrOtotypiNg
* EUROFINDER PROGRAMME in France, UK, Netherlands:
  – FRANCE
    • THALES (Former THOMSON-CSF)
      – THALES AIRBORNE SYSTEMS,
      – THALES COMMUNICATION,
      – THALES OPTRONIC,
      – THALES AIR DEFENCE SYSTEMS
    • THOMSON MARCONI SONAR SAS
    • MATRA BAe Dynamics
  – UNITED KINGDOM
    • BAE SYSTEMS Advanced Technology Centres
    • THOMSON MARCONI SONAR Ltd
  – NETHERLANDS
    • THALES Naval Netherlands (former: THOMSON-CSF SIGNAAL)

THE ESPADON METHODOLOGY
* Risk driven development life cycle
* Model Year approach
* Reuse and capitalisation
* Support for:
  – Traceability
  – Cost performance trade off
[Figure: spiral model representation with an increasing level of refinement, cycling through risk analysis, definition, development and validation, with a risk register and GO/NO GO reviews]
  – Phase 1: Analysis and selection of the requirements allocated to the SP subsystem
  – Phase 2: Definition of the SP subsystem
  – Phase 3: Development of the SP subsystem
  – Phase 4: Validation of the SP subsystem
  – Examples of risk shown in the figure: SP algorithms, computing power, computer architecture choice, real time performance
  – Activities labelled in the figure: SP functional definition, functional design, functional simulation, architectural design, placement, development and validation of the performance model, validation of the virtual prototype, software/hardware development (synthesis), validation of the manufactured computer, integration, production

ESPADON DESIGN ENVIRONMENT (EDE)
[Figure: EDE framework. Labels, in reading order: Matlab (algorithm prototyping); Simulink/RTW; PTOLEMY (or GEDAE); tools; Target/Porting Kit; VSIP; HANDEL-C; libraries; standards; ICS; FPGA board; target H/W; rapid prototyping machine; Mercury G4/RACE++]

PTOLEMY DEVELOPMENT PROCESS
[Figure: development flow built on reusable components; functions F1 to F5 are partitioned over processing elements PE1 to PE3 with send/receive communication]
* Functional simulation: SDF (BDF, FSM), from stimuli to functional results
  – Functional design library:
    • SDF VSIP Stars
* Architecture design: target selection / partitioning / performance analysis
  – Performance analysis:
    • CGC VSIP Stars with performance info
  – PTOLEMY Gantt chart display
* Implementation: CGC, C code generation for the target, run on the target
  – Code generation libraries:
    • CGC VSIP Stars
    • Target optimised library
    • Communication library
  – Real time trace display (TATL)
* Implementation: CGC with a “Handel-C” syntax, C to VHDL/EDIF conversion (FPGA)
  – Code generation libraries:
    • VHDL library
    • VHDL drivers for communication
* MATLAB stimuli generator and MATLAB post-processing: display and comparison of functional results and prototype results

BEAMFORMER APPLICATION
* From a vertical array, e.g. 8 antenna channels, to 6 beams
* High level set-up of the radar beamformer application:
[Figure: receive antennas (Channel 1 to Channel 8); Window: stabilization, tapering, calibration...; N-point FFT/FIR; elevation beams (Beam 1 to Beam 6) output towards velocity filtering & detection]
* Waveform: 16 pulses, PRF = 3-6 kHz, Fsample = 2.5 MHz
* Input: 8 IQ-channels, 32 bits complex float: 160 MB/sec
* Output: 6 beams, 32 bits complex float: 120 MB/sec

BEAMFORMER CALIBRATION
* Normal burst pattern is one clutter sweep + 16 air pulses
* Calibration is performed instead of clutter measurement using 48 pulses (mode switch):
[Figure: timing of burst k. Normal mode: clutter pulse 0 (s=0, clutter measurement, t=0) followed by air-burst pulses 1 to 16 (s=1 ... s=16, T1 ... T16). Calibration mode: test pulse and calibration pulses numbered 1 to 10, etc.]
[Figure labels: first incoherent integration, sum of 4 RQs; second incoherent integration, sum of 4 RQs]

BEAMFORMER DESIGN
[Figure: beamformer functional diagram]
* Blocks: input interface to the file system, SIGNAL WEIGHTING (CMPLX MUL), BEAM FORMING (FFT), output interface to the file system, BEAM CALCULATION, CALCULATE WEIGHTING FUNCTION, CALIBRATION, INCOHERENT INTEGRATION, COHERENT INTEGRATION, CALCULATE PHASE DIFFERENCE, CALCULATE GAIN DIFFERENCE, NOISE MEASUREMENT, BEAMFORMER MONITOR & CONTROL
* Signals and control: I/Q VIDEO, level_stab_shift, CVE_phase, RF, receiver_STC, weighting_control, BF_control, BF_status

ESPADON PTOLEMY
* Within Ptolemy we only use:
  – SDF (or BDF) Domain for functional simulation
  – CGC Domain for Code Generation (and implementation)
* What we have developed for the benchmark is:
  – An extension of the Library of stars (both in SDF/BDF and CGC available, total: 70)
    • Radar Library (5 components)
    • VSIP Core Light Library (partially, 11 components)
    • Support Library (e.g. components for parallel operation, 19 components)
  – Target for the MERCURY Machine (G2 and G4 processors)
    • VSIP vectors are allocated in one buffer (per processor)
    • Synchronized Inter-Processor Communication for Complex Vectors (the Burst Message is always sent along with the data)

PTOLEMY LIBRARY
* Use of VSIPL standard library
  – Pass pointers of VSIPL views between stars instead of data (‘int’-type)
* Develop multi- and complex-interleave stars needed for the corner-turn process (in HOF domain)
* Extend CGC-BDF to handle multiprocessor architectures
* Important requirements for the developed elements:
  – Keep library platform independent, dependency is only in the target
  – Make control flow explicit in the data-flow graphs
* Stars with vector output are provided with 2 extra parameters:
  – MAX_BUF_LENGTH: Maximum length of a vector
  – OUT_BUF_OPT: Number of output buffers used for each vector
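
As a rough illustration of the two star parameters above, the following Python sketch models a star output that reserves OUT_BUF_OPT buffers of MAX_BUF_LENGTH elements and passes an integer handle downstream instead of the data. The class and method names (`VectorOutput`, `write`, `read`) are hypothetical, and the round-robin reuse of buffers is an assumption; the slides do not say how the real CGC stars cycle their buffers or how the ‘int’-typed VSIPL view pointers are encoded.

```python
from typing import List

MAX_BUF_LENGTH = 8   # maximum length of a vector (star parameter named on the slide)
OUT_BUF_OPT = 2      # number of output buffers per vector (star parameter named on the slide)


class VectorOutput:
    """Hypothetical model of a star output port that hands out integer handles."""

    def __init__(self, max_len: int = MAX_BUF_LENGTH, n_bufs: int = OUT_BUF_OPT) -> None:
        # All memory is reserved up front for the maximum vector length.
        self._bufs: List[List[complex]] = [[0j] * max_len for _ in range(n_bufs)]
        self._lengths: List[int] = [0] * n_bufs
        self._next = 0

    def write(self, samples: List[complex]) -> int:
        """Copy samples into the next buffer and return its handle (an int)."""
        if len(samples) > len(self._bufs[self._next]):
            raise ValueError("vector longer than MAX_BUF_LENGTH")
        handle = self._next
        self._bufs[handle][: len(samples)] = samples
        self._lengths[handle] = len(samples)
        # Assumption: buffers are reused round-robin.
        self._next = (self._next + 1) % len(self._bufs)
        return handle

    def read(self, handle: int) -> List[complex]:
        """A downstream star resolves the handle to the effective vector."""
        return self._bufs[handle][: self._lengths[handle]]
```

With two buffers, a downstream star can still read the vector behind handle 0 while the producer fills handle 1; a third `write` reuses handle 0.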

PTOLEMY LIBRARY FEATURES
* Support in-place operation (if possible)
* Support rate change, i.e. the output buffer is automatically duplicated as many times as needed
* Colours of the stars highlight the different kinds of stars used in the design:
  – Standard Ptolemy stars (WHITE) that use only std C library,
  – VSIPL stars (GREEN) that use the std C library and the VSIPL Core Light library,
  – Application specific stars (RED) that also use MERCURY library (ICS) and/or are specific to the ESPADON radar benchmark.

LIBRARY SET-UP (CGC)
[Figure]

PTOLEMY LIBRARY IMPROVEMENT
* All the stars allocate the required buffers in the “Global Buffer” during the setup phase:
[Figure: layout of the GLOBAL BUFFER (SMAB) with growing offsets starting at OFFSET 0. Labels: PARAMETERS; SYNCHRONISATION FLAGS; COMMUNICATION CHANNELS (CHANNEL 1, CHANNEL 2 ... CHANNEL N; SLOT 1 to SLOT 4; ONE SLOT: DATA & BURST MESSAGE); LIBRARY STARS (FIRST STAR, SECOND STAR ... LAST STAR; SIGNAL DATA); FREE SPACE]

PTOLEMY MERCURY TARGET (1)
* Features
  – Generate a C-file for each processor, compile, load and run the application on the machine
  – Use MERCURY ICS Library and VSIPL (exclusively) ⇒ Make it portable to any MERCURY machine
  – Arrange synchronisation and data transfer between PPCs
  – Data transfer uses DMA ⇒ efficient
    • Synchronisation protocol uses simple flags
    • Support Variable Vector Length: each communication buffer is duplicated N times (user defined) and the effective transfer length is set in real time
    • Memory is allocated for the maximum vector length (user defined)
    • Support both complex storage types (interleaved & split)
    • Support complex float vectors (only)
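
The buffer scheme described on the last two slides (setup-time allocation at growing offsets in one global buffer, N slots per communication channel, simple flags, and an effective length set at run time) can be sketched in Python as follows. This is a single-process model under stated assumptions: `GlobalBuffer` and `Channel` are hypothetical names, the slot reuse order is assumed to be round-robin, and the DMA transfer between PowerPCs through the MERCURY ICS library is not modelled at all.

```python
from typing import List


class GlobalBuffer:
    """Hypothetical bump allocator: each request gets the next growing offset."""

    def __init__(self, size: int) -> None:
        self.mem: List[complex] = [0j] * size
        self.next_offset = 0  # allocation starts at offset 0

    def alloc(self, length: int) -> int:
        if self.next_offset + length > len(self.mem):
            raise MemoryError("global buffer exhausted")
        offset = self.next_offset
        self.next_offset += length
        return offset


class Channel:
    """Hypothetical communication channel with n_slots flag-guarded slots."""

    def __init__(self, buf: GlobalBuffer, n_slots: int, max_len: int) -> None:
        self.buf = buf
        self.max_len = max_len
        # Setup phase: every slot is reserved for the maximum vector length.
        self.offsets = [buf.alloc(max_len) for _ in range(n_slots)]
        self.full = [False] * n_slots   # simple synchronisation flags
        self.lengths = [0] * n_slots    # effective length, set at run time
        self.wr = 0
        self.rd = 0

    def send(self, data: List[complex]) -> bool:
        """Fill the next slot; return False if the receiver has not freed it yet."""
        if len(data) > self.max_len:
            raise ValueError("vector exceeds the maximum length")
        if self.full[self.wr]:
            return False
        off = self.offsets[self.wr]
        self.buf.mem[off:off + len(data)] = data
        self.lengths[self.wr] = len(data)
        self.full[self.wr] = True
        self.wr = (self.wr + 1) % len(self.offsets)  # assumed round-robin order
        return True

    def receive(self) -> List[complex]:
        """Read the oldest full slot and clear its flag."""
        if not self.full[self.rd]:
            raise RuntimeError("no data available")
        off = self.offsets[self.rd]
        data = self.buf.mem[off:off + self.lengths[self.rd]]
        self.full[self.rd] = False
        self.rd = (self.rd + 1) % len(self.offsets)
        return data
```

Because every slot is sized for the maximum vector length and all channels share one length parameter, short vectors waste memory, which is the allocation overhead the deck later lists as a known issue.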

PTOLEMY MERCURY TARGET (2)
* Features (continued)
  – Implement TATL Trace Tool from MERCURY
  – Overview of the main parameters (to be set by the user):
    • Number of processors
    • CE id for each processor
    • Size of the Shared Memory Buffer (SMB) for each processor (only one SMB is created in each processor)
    • Size of the “heap” is set for all processors
    • Communication buffer length (only one parameter for all the communication channels)
    • ON/OFF switches for debug messages and TATL (trace for all stars possible)
    • Any ‘runmc’ command line option can be given

PTOLEMY MERCURY TARGET (3)
* Interface with VSIPL issues
  – If the input vector is already allocated inside the SMB and the stride of the vector view is equal to one, then the copy is not needed. ⇒ efficient transfer is possible (using In-Place operation) (Vector views with a stride > 1 are not supported; a 2D DMA would be required.)
  – But according to VSIPL policy, any VSIPL function is allowed to move the data to a more appropriate place (e.g. to internal memory for a DSP). Therefore the copy is always needed if we use the ‘VSIPL data’ space.
  – This problem is solved if we use only ‘User data’ space. In doing this, however, we do not follow the defined VSIPL standard!
  ⇒ VSIPL does not fit well on a multi-processor machine like the MERCURY machine (interface VSIPL - ICS not efficient).
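
The copy rule stated on this slide can be written down as a small decision function. The sketch below is only a reading of the three bullets, not VSIPL or ICS code: `VectorView` is a hypothetical stand-in for a VSIPL vector view, and its three fields are assumptions chosen to mirror the conditions the slide names (location in the SMB, stride, and ‘User data’ versus ‘VSIPL data’ space).

```python
from dataclasses import dataclass


@dataclass
class VectorView:
    """Hypothetical stand-in for a VSIPL vector view; not the VSIPL API."""
    in_smb: bool      # data block is already allocated inside the Shared Memory Buffer
    stride: int       # element stride of the view
    user_data: bool   # True: 'User data' space, False: 'VSIPL data' space


def copy_needed(view: VectorView) -> bool:
    """Return True if the data must be copied into the SMB before a transfer."""
    if view.stride != 1:
        # The slide states that strided views are unsupported (a 2D DMA would be needed).
        raise ValueError("only unit-stride vector views are supported")
    if not view.user_data:
        # VSIPL may move 'VSIPL data' to a more appropriate place,
        # so the copy is always needed in that case.
        return True
    # 'User data' stays where the application put it: no copy if it is in the SMB.
    return not view.in_smb
```

Only the combination of ‘User data’ space, unit stride and SMB residency avoids the copy, which is why the slide concludes that the efficient path departs from the VSIPL standard.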

ESPADON PTOLEMY ISSUES (1)
* Future work to solve known problems:
  – The same buffer size is applied to all communication channels ⇒ Memory allocation overhead
  – The Burst Message structure is hard-coded ⇒ Application dependent stars are used in the design
  – The BDF stars are available only for galaxies with single input & single output, and multi-rate is not supported ⇒ Strong design constraint
  – The BDF stars can only be used inside a processor ⇒ Design constraint
  – The CGC library elements are not calibrated in terms of execution time ⇒ Automatic mapping may fail

ESPADON PTOLEMY ISSUES (2)
* Future work to solve known problems (continued):
  – The Memory boards are implemented inside the I/O stars ⇒ Memory boards are not really integrated in the design environment
  – The inter-processor communication functions support only VSIPL complex float vectors ⇒ Design constraint
  – The TATL Tool cannot be used if the design counts more than 384 different stars (due to the limited number of event types) ⇒ Design constraint

ITERATION 1: BARE BEAMFORMER DESIGN
* Iteration 1 (6 processor design):
[Figure: Data in + Distribute; four parallel branches, each processing 2 striplines of NofRQ/NofProcs; Collect + Data out]

TATL RESULTS FOR ITERATION 4
* Bare beamformer on 8 processors
[Figure]

BENCHMARK PERFORMANCE METRICS

                        Iteration 1      Iteration 2           Iteration 3   Iteration 4
NofChannel              8                8                     8             8
NofSweep                17               17                    17            17
NofProc                 4 (+2)           5 (+2)                4+4           8
Possible NofProc        1, 2, 4, 8 (+2)  2, 3, 5, 9, 17 (+2)   >1            1, 2, 4, 8
Input data              DMA              DMA                   PRE-LOADED    PRE-LOADED
Output data             (DMA)            (DMA)                 (DMA)         - (PbMCS)
CORNER-TURN             4->4             NO                    4->4          8->8
RACE++ peak load        ?                ?                     ?             53 %
LATENCY                 1 burst          1 burst               2 bursts      1 burst
PERFORMANCE*            25 ms            21 ms                 9.5 ms        9 ms
Support Var. Burst L.   YES              YES                   YES           YES
Design Time#            72 H             16 H                  12 H          16 H

* The performance is the average processing time for one burst. The measurement has been done with TATL on 10 bursts of 19000 RQ of 400 ns (i.e. 7.6 ms).
# Time is without extensive functional testing.

BENCHMARK FINAL DESIGN (1)
[Figure]

BENCHMARK FINAL DESIGN (2)
[Figure]

FINAL BURST TIMING RESULTS
[Figure]

CONCLUSIONS (1)
* Main functional requirements are met by the final design (12 of the 19 requirements)
* Throughput and latency requirements are almost met; expected to be met in case of full speed G4 daughter cards and/or VSIPL functions redesign
* Review of graphical Ptolemy designs seems faster and more efficient than code reviews
  – Disadvantage is parameter handling and scope.
  – Design is highly multi-rate, but this is difficult to see
  – Some functionality is inside stars (hidden)
* Total design, validate & test time for bare beamformer was 354.5 hours, while normal development takes 481 hours: Approximately 36% faster (improvement ~1.36)

CONCLUSIONS (2)
* Development time from functional/architectural design to implementation is very short: matter of days
* For which purpose can we use it?
  – Mainly for rapid prototyping of new algorithms
  – Rapid prototyping of demonstrators
  – Open source approach enables us to adapt the tool to our needs
* Many improvements are needed before it can be used for a complete application/project
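
The figures quoted across the deck can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes the beamformer data rates, the burst duration from the metrics footnote, and the development-time ratio from the conclusions. Three assumptions are mine rather than the slides': that "32 bits complex float" means 8 bytes per sample (32-bit I plus 32-bit Q), that MB means 10^6 bytes, and that the 7.6 ms in the footnote is the duration of one burst, against which the per-burst processing time can be compared.

```python
F_SAMPLE_HZ = 2.5e6      # Fsample from the beamformer application slide
BYTES_PER_SAMPLE = 8     # assumption: 32-bit I + 32-bit Q per complex sample


def rate_mb_per_s(n_streams: int) -> float:
    """Aggregate data rate in MB/s (assumption: 1 MB = 10**6 bytes)."""
    return n_streams * F_SAMPLE_HZ * BYTES_PER_SAMPLE / 1e6


input_rate = rate_mb_per_s(8)    # 8 IQ channels: 160 MB/s
output_rate = rate_mb_per_s(6)   # 6 beams: 120 MB/s

# One sample period at 2.5 MHz matches the 400 ns RQ quoted in the footnote.
sample_period_ns = 1e9 / F_SAMPLE_HZ

# Burst duration: 19000 RQ of 400 ns each, in milliseconds.
burst_ms = 19000 * 400 / 1e6

# Average processing time per burst for iterations 1 to 4 (metrics table).
processing_ms = {1: 25.0, 2: 21.0, 3: 9.5, 4: 9.0}
load_ratio = {it: t / burst_ms for it, t in processing_ms.items()}

# Development time comparison from the conclusions.
ptolemy_hours = 354.5
normal_hours = 481.0
improvement = normal_hours / ptolemy_hours      # about 1.36
time_saved = 1.0 - ptolemy_hours / normal_hours  # about 0.26
```

Under these assumptions the input and output rates come out at 160 and 120 MB/s as stated, iteration 4 needs about 1.18 burst durations per burst (consistent with "almost met"), and the improvement factor of about 1.36 corresponds to roughly 26% fewer hours.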