An Automatic Approach to Generate Haste Code from Simulink Specifications

Maurizio Tranchero (1), Leonardo M. Reyneri (1), Arjan Bink (2), and Mark de Wit (2)
(1) Politecnico di Torino, Department of Electronics, Italy
(2) Handshake Solutions, The Netherlands

Outline
- Simulink-Based Design and CodeSimulink
- Haste Coding Choices
- Simulink-Specific Issues
- Proposed Flow and its Implementation
- Case Studies and Performance

Haste Code from Simulink 2 of about 50

Simulink-Based Design

Simulink®: what is it, and why?
- General-purpose graphical tool for describing and simulating heterogeneous systems
- Based on MATLAB®
- Widely used in different application and industrial areas: signal and image processing, control, aerospace, modeling, etc.
- Does not require knowledge of electronic/digital design; allows interdisciplinary teams
- Uses the dataflow (DF) computational model

Simulink diagrams
- A set of interconnected blocks
- Each block performs an operation (e.g. a multiply-and-accumulate model)
- Includes stimuli and test points
[Figure: multiply-and-accumulate model; labels: ACCUM, SOURCES, DISPLAY RESULTS, MULTIPLY, ADD]

Simulink to develop digital systems
- Simulink works well for general-purpose modeling, but:
  - what are the implications of HW/SW implementations?
  - what about the effects of data representation?
  - what about the effects of timing, latencies and delays?
- Can Simulink models be implemented physically? Yes, but some external tools are required:
  - For SW: Real-Time Workshop (The MathWorks)
  - For HW: System Generator (Xilinx), DSP-Builder (Altera), HDL Coder (The MathWorks), CodeSimulink (Politecnico di Torino)
  - For mixed HW/SW: CodeSimulink (Politecnico di Torino)

Data Flow vs. Register Transfer
- Simulink is natively Data Flow (each block computes only when data is valid)
- Sequential SW is DF (by compilation)
- Synchronous HW is natively Register Transfer (each block computes independently of data being valid)
  - RT has the problem of repipelining and synchronization
- Asynchronous HW is natively DF (because of handshake)
- Analog systems are time-continuous
- True Simulink HW has to be DF!
- Mixed HW/SW systems have to be DF!

Commercial tools: Simulink to HW/SW
- Not Simulink-compliant (they are RT, not DF!): System Generator (Xilinx), DSP-Builder (Altera), HDL Coder (The MathWorks)
  - they use Simulink ONLY as a graphical interface
  - they do NOT support any Simulink block
- Simulink-compatible (fully DF):
  - Real-Time Workshop (The MathWorks; only for SW!)
  - CodeSimulink/SMT6040 (Politecnico di Torino); also supports mixed HW/SW/analog systems
  - both implement Simulink blocksets natively in a transparent manner

CodeSimulink/SMT6040 Tool

Our Tool: CodeSimulink/SMT6040
- True DF, Simulink-compatible, model-based, hybrid codesign environment
- Co-simulates SW + digital HW + analog HW + external world (e.g. mechanical), modeling the real behavior of the chosen implementation(s)
- Generates SW (C) + digital HW (VHDL) + analog (SPICE), or VHDL + HASTE code
- Digital HW is either synchronous DF or asynchronous DF
- Commercially available (SMT6040)
- Student edition available at http://polimage.polito.it/groups/codesimulink.html

A Simple CodeSimulink model

Implementation Parameters
Available parameters:
- DATAWIDTH (number of bits)
- BINARYPOINT (position of fixed point)
- REPRESENTATION ((un)signed, sign/modulus, floating point)
- OVERFLOW (saturation/wraparound)
- TRUNCATION (floor, ceil, round, etc.)
- PIPELINE (latency, speed)
[Figure: example fixed-point word: sign (+/-) followed by bits 1 0 1 1 0, representing +5.50]

From Simulink to CodeSimulink
An automatic process composed of these steps:
- Model simulation
- Model conversion (namely 1-to-1 block substitution)
- HW parameter setting (based on simulation results)
Simulink: double precision (64b) floating point
CodeSimulink:
- Selectable data-width
- integer, fixed point, floating point...
- signed, unsigned, modulus & sign
- wrap around, saturate output

A Semi-Automatic Process
- This conversion cannot be completely automated
- Input and output blocks must be inserted manually
- Some block parameters have to be set manually (overflow, truncation and pipeline)

CodeSimulink Environment
[Flow diagram; block labels: System Description; Functional + Timing Simulation; HW-SW Partitioning; Digital HW; SW; DigHw Compiler; RTW; Analog HW; AnHw Compiler; Synchronous; Asynchronous; PCB Tool; P&R; Target Programming; Schematic]

Advantages of (Code)Simulink
- Flexibility: very high (short redesign time); no need to take care of interfaces and timing; quick system-level performance optimization
- Reusability: may use existing Simulink models
- Time-to-market: very short (consequently), although the design is suboptimal (it can be optimized later on)
- Accessibility: does not require an experienced designer; simpler integration of work teams with heterogeneous know-how
- Academic: optimal for teaching Electronic Systems and Asynchronous circuits; student version available

Advantages of CodeSimulink
- Allows choosing the implementation later in the design flow
- Timing analysis and pipeline balancing
- Natively handles scalars, vectors, matrices
- Supports multi-system (multi-platform, multi-core, multi-SW, multi-FPGA, multi time-domain, GALS, mixed synch/asynch, hybrid, etc.)
- Supports synchronous bit-parallel, bit-serial, bundled-data asynchronous designs
- Interfaces to low-level simulators (ModelSim, MaxPlus, Quartus, ISE, Spice-like)

Limitations of CodeSimulink
- Best suited to data-dominated systems
- Mostly fixed-rate (which does not mean synchronous!) sampling strategy (including multi-rate)
- Library-based (sub-optimal)
- Fast timing models (optional) require technology characterization

Library Blocks
Large library of blocks (blockset) including:
- Low-level Simulink blockset: addition, multiplication, min/max, floating/fixed point converters, etc.
- High-level functions: FIR filters, FFTs, custom transfer functions, etc.
- Special-purpose functions
- Interface blocks:
  - I/Os
  - SW/HW/SW
  - Analog/digital/analog
  - Synchronous/asynchronous

CodeSimulink digital blocks
Each CodeSimulink block is translated into:
- A combinational functional block (VHDL)
- A sequential protocol controller + register, in one of:
  - VHDL, synchronous
  - VHDL, asynchronous
  - Haste code
[Figure labels: REQ, ACK, CHANNEL, CLK, VAL, RDY]

Asynchronous CodeSimulink
- Just change the protocol handling box
- Supports bundled data transfers
- Analyzes and optimizes timing:
  - Timing analysis identifies bottlenecks and helps to minimize them
  - Forces timing constraints during synthesis accordingly
  - Adds a delay line according to the required timing
  - Prevents optimization of the delay line

Haste Coding Choices

VHDL Usage within TiDE
- CodeSimulink uses a library-based approach: each block is described in VHDL
- To reuse this code, an automatic conversion into Verilog (which is fully supported in TiDE) is provided using RTL Compiler

Coding Styles in HASTE
Different coding styles are available: which is the best for Simulink blocks?
To benchmark, we used a simple datapath made of:
- 4 different arithmetic operations
- 2 x 16-bit-wide inputs
- 1 x 3-bit-wide selector
- 1 x 32-bit-wide output
[Figure: datapath with operations *, +, >, x3]

Multiple vs. Single Processes
The same block, described as a single process or as an ensemble of concurrent processes, produces different implementations.

Single process:

    forever do
        multiplier(...) || adder(...) || comparator(...) || fixedGain(...)
    od

Concurrent processes:

    forever do multiplier(...) od
    || forever do adder(...) od
    || forever do comparator(...) od
    || forever do fixedGain(...) od

Shared Variables vs. Channels
- Variables are cheaper than channels
- Channels provide synchronization automatically

Variables (function-based description):

    & C = func( & i0 ? var T & i1 ? var T ):T. ( i0 + i1 ) fit T
    & pipeline: main proc( & x ? chan T broad pas & y ! chan T ).
    begin
    |   forever do
            wait( outprobe( x ) )
        ;   ( y!C( .i0(dataprobe(x)), .i1(B(.i0(dataprobe(x)), .i1(A(dataprobe(x)))) ) ) || x?~ )
        od
    end

Channels (process-based description):

    & C = proc( & i0 ? chan T & i1 ? chan T & o0 ! chan T ).
    begin
    |   forever do
            wait( outprobe(i0) * outprobe(i1) )
        ;   o0!( dataprobe( i0 ) + dataprobe( i1 ) ) fit T
        ;   ( i0?~ || i1?~ )
        od
    end
    & pipeline: main proc( & x ? chan T broad pas & y ! chan T ).
    begin
        & c0 : chan T
        & c1 : chan T
    |   A(x,c0)||B(c0,x,c1)||C(c1,x,y)
    end

Tupled vs. Separated Channels
- Tupled channels (c) are cheaper than separated ones (b)
- But they can introduce deadlock in several configurations

Deadlock Exposed

Register Insertion
- Using HASTE and TiDE 5.2, registers are inserted at the inputs
- Simulink blocks usually have only one output and one or more inputs
- We would like to have the register on the output, to reduce area
- At the moment (TiDE 5.2) this is not possible, but it will be in TiDE 6

Some Figures (htcomp + htmap)
System: Datapath
Column labels, in the order they appear on the slide: Tupled inputs; Multiple forever do; Independ. parallel inputs; Pipelined version; Fully combinational; Area [mm2]; C-gates; Global forever do
Rows (six V/- option flags, then area and C-gates):

    V - V - V -    932.2   156.7
    V - - V V -    902.0   156.3
    - V - V V -    848.7   129.3
    - V V - V -    871.7   124.7
    V - V - - V    331.0    14.3
    V - - V - V    298.0     8.3

Conclusions
After this analysis we can decide to:
- Use a multiple-process description
- Use channels instead of variables
- Use separated channels
- Leave registers unoptimized by hand, relying on compiler optimization

Simulink-Specific Issues

Multidimensional Objects
- Simulink models can easily process scalars, vectors or matrices
- Depending on throughput constraints, we can decide to process each data component serially or in parallel
[Figure: serial vector: 1,3,5,7 multiplied by 2 gives 2,6,10,14; parallel vector: 1 3 5 7 multiplied by 2 2 2 2 gives 2 6 10 14]

Sampling Blocks
- Sampling blocks are the ones with special timing constraints, i.e., they have to guarantee data processing in a fixed amount of time
- They can be used to change the input/output data rate
- The main blocks belonging to this category:
  - unit delay
  - zero order hold
  - rate transition

Unit Delay
- It introduces one memory stage from input to output
- When a “Sampling Time” period has elapsed:
  - the old data (multiple data, in the case of arrays) is generated on the output
  - a new data item (multiple components, in the case of arrays) is sampled
[Figure: FSM for scalar data]

Zero Order Hold
- It maintains the output data until a “Sampling Time” period has elapsed
- When it elapses, newly acquired input data (possibly multiple) is transferred to the output
[Figure: FSM for scalar data]

Rate Transition
- It is a superset of the previous blocks: it is used to change the data rate from input to output, either increasing or decreasing it
- Replicates/consumes tokens
- It can be described as a cascade of “unit delay” and “zero order hold” blocks
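
The behaviour of the unit delay and zero order hold blocks described above can be sketched in a few lines of Python. This is only an illustrative software model, not CodeSimulink or Haste output: the function names, the list-of-samples representation (one list element per input tick), the `initial` state value and the integer `period` parameter are all assumptions made for the example.

```python
from typing import List, Sequence


def unit_delay(samples: Sequence[float], initial: float = 0.0) -> List[float]:
    """One memory stage: each sampling period emits the stored value,
    then stores the newly sampled input (hypothetical helper)."""
    state = initial
    out: List[float] = []
    for x in samples:
        out.append(state)  # the old data is generated on the output
        state = x          # a new data item is sampled
    return out


def zero_order_hold(samples: Sequence[float], period: int) -> List[float]:
    """Hold the output for `period` input ticks; when the period elapses,
    transfer the newly acquired input to the output (hypothetical helper)."""
    if period < 1:
        raise ValueError("period must be at least 1")
    out: List[float] = []
    held = None
    for tick, x in enumerate(samples):
        if tick % period == 0:  # sampling period elapsed: acquire new input
            held = x
        out.append(held)
    return out


if __name__ == "__main__":
    print(unit_delay([1, 2, 3, 4]))
    print(zero_order_hold([1, 2, 3, 4, 5], period=2))
```

A rate transition could then be modelled, as the slide suggests, by cascading these two helpers; the hardware blocks additionally need the timing reference discussed on the next slide, which this sketch ignores.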
Sampling Blocks Implementation
- All these blocks have to be connected to a clock/timing (?!?) signal to guarantee timing
- To reduce the overhead introduced by clock interaction, it is possible to use a fully asynchronous, yet precisely timed, version of such blocks
- Interaction with a timing clock is still necessary, but it could be moved to the I/Os

Simulink-Haste Flow Implementation

The Flow
- Integrates CodeSimulink with the existing TiDE flow
- Each block is converted into both Haste and RTL code
[Flow diagram; block labels: Simulink Model; CodeSimulink; VHDL Descriptions; Haste Description; RTL Compiler; htcomp + htmap; Verilog Descriptions; HT Back-end]

Haste File Generated
It is composed of 6 parts:
- Type definitions (used in the file itself)
- Top-level procedure interface definition
- Internal channels
- Internal functions (the interface to RTL code)
- Internal procedures (protocol management and function instances)
- Procedure instances and connections

E.g.: Haste File Generated

    // Types Definition
    & STD_LOGIC_VECTOR_17 = type [0..2^17-1]
    & STD_LOGIC_VECTOR_16 = type [0..2^16-1]
    & STD_LOGIC_VECTOR_15 = type [0..2^15-1]
    & STD_LOGIC_VECTOR_14 = type [0..2^14-1]
    & STD_LOGIC_VECTOR_1 = type [0..2^1-1]

    // Top entity instance
    & inout1 : main proc(
        & DIGINA ? chan STD_LOGIC_VECTOR_15
        & DIGINB ? chan STD_LOGIC_VECTOR_14
        & DIGOUTA ! chan STD_LOGIC_VECTOR_17
    ).
    begin
    ...

E.g.: Haste File Generated

    // Functions declarations
    & sim_sum1_f = func (
        & A1 ? var STD_LOGIC_VECTOR_16
        & A2 ? var STD_LOGIC_VECTOR_1
    ): STD_LOGIC_VECTOR_17. import

    // Component declarations
    & sim_sum1 = proc (
        & Y1 ! chan STD_LOGIC_VECTOR_17
        & A1 ? chan STD_LOGIC_VECTOR_16
        & A2 ? chan STD_LOGIC_VECTOR_1
    ).
    begin
        & v_A1 : var STD_LOGIC_VECTOR_16
        & v_A2 : var STD_LOGIC_VECTOR_1
    |
        forever do
            ( A1 ? v_A1 || A2 ? v_A2 )
            ; Y1 ! sim_sum1_f( .A1( v_A1 ), .A2( v_A2 ) )
        od
    end

E.g.: Haste File Generated

    // Internal signal declarations
    & Y1_5 : chan STD_LOGIC_VECTOR_1 broad
    & Y1_4 : chan STD_LOGIC_VECTOR_17 broad
    & Y1_1 : chan STD_LOGIC_VECTOR_16 broad
    ...
    // Component instantiation
    sim_constant ( .Y1( Y1_5 ) ) ||
    sim_digOut ( .A1( Y1_4 ), .DIGIO( DIGOUTA ) ) ||
    sim_sum1 ( .Y1( Y1_4 ), .A1( Y1_1 ), .A2( Y1_5 ) )
    ...

E.g.: VHDL File Generated

    -- Top entity instance
    ENTITY sim_sum1 IS
      PORT (
        DIGOUTA_i   : IN  STD_LOGIC_VECTOR(15 downto 0);
        DIGIN_VALA0 : IN  STD_LOGIC;
        DIGIN_RDYA  : OUT STD_LOGIC;
        DIGOUTB_i   : IN  STD_LOGIC_VECTOR(0 downto 0);
        DIGIN_VALA0 : IN  STD_LOGIC;
        DIGOUTA_o   : OUT STD_LOGIC_VECTOR(16 downto 0);
        DIGOUT_VALA : OUT SIM_SIGVAL_SYNCHPAR;
        DIGOUT_RDYA : IN  STD_LOGIC;
        nRESET      : IN  STD_LOGIC;
        CLK         : IN  STD_LOGIC -- left unconnected in this implementation
      );
    END sim_sum1 ;

- VHDL is used to describe the block functionality
- For each block, an HDL file is generated with the desired parameters (data width, binary point...)

Conversion of Simulink models
[Figure: Simulink model alongside the compiled Haste program; the Haste listing repeats the sim_sum1 function and component declarations shown above]

Case Studies

3-input 32-bit adder

Simple 16-bit ALU (*, +, <, gain)

8th-order, 20-bit-wide IIR filter

Results
Speed comparison by simulation on Cyclone II FPGA

    SPEED (Cyclone II) [Msa/s]
    Design Unit          Adder       Datapath    IIR
    C.S. Synchronous     218         132         120
    C.S. Asynchronous    112 - 125   45 - 80     70 - 98

Area comparison on commercial 90nm ASIC library

    AREA (ASIC) [gates]
                  Synchronous          Asynchronous         Haste                CS-Haste
    Design Unit   Logic  Regs  Total   Logic  Regs  Total   Logic  Regs  Total   Logic  Regs  Total
    Adder         128    130   258     137    130   267     130    130   260     146    161   307
    Datapath      1181   179   1360    1224   186   1410    1144   186   1330    1230   186   1416
    IIR           3654   1029  4683    4752   1029  5781    4528   1029  5557    NA     NA    NA

Proprietary Audio Test Chip

    Area [um2]    Handwritten + TiDE 5.2 (not available)   CodeSimulink + TiDE 5.2   CodeSimulink + TiDE 6.0
    Sequential    32,018                                    89,792                    11,632
    Logic         141,676                                   357,368                   152,468
    Total         173,694                                   468,746                   164,100

- TiDE 5.2 has limitations (e.g. registers placed at the inputs instead of the outputs) which made the Simulink-to-Haste conversion very inefficient.
- TiDE 6.0 has overcome these limitations, and the automatically generated ASIC is smaller than the handwritten one.
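
To put the audio test chip comparison in relative terms, the short Python sketch below divides each total area by the handwritten total. The three totals are the figures quoted for the test chip; their assignment to the three flows (handwritten, CodeSimulink with TiDE 5.2, CodeSimulink with TiDE 6.0) is an assumption read off the flattened slide table, and the dictionary and function names are hypothetical.

```python
# Total areas in um2 as quoted for the audio test chip.
# The mapping of values to flows is an assumption based on the slide layout.
AREA_UM2 = {
    "Handwritten + TiDE 5.2": 173_694,
    "CodeSimulink + TiDE 5.2": 468_746,
    "CodeSimulink + TiDE 6.0": 164_100,
}


def relative_area(design: str, baseline: str = "Handwritten + TiDE 5.2") -> float:
    """Return the area of `design` as a multiple of the baseline area."""
    return AREA_UM2[design] / AREA_UM2[baseline]


if __name__ == "__main__":
    for name in AREA_UM2:
        print(f"{name}: {relative_area(name):.2f}x the handwritten area")
```

Under that reading, the TiDE 5.2 flow comes out at well over twice the handwritten area, while the TiDE 6.0 flow lands slightly below it, which matches the statement that the automatically generated ASIC is smaller than the handwritten one.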
Conclusions
- Optimization at system level (CodeSimulink), followed by automatic translation to Haste, can achieve the same quality as manual coding in Haste followed by hand optimization at Haste level, although the major improvement is in the productivity, maintainability and reusability of the CodeSimulink model
- System-level cosimulation reduces development risks, makes optimization easier, and makes interdisciplinary interactions much easier
- Time to market is significantly shorter
- The performance reduction due to library-based design (about 10-20% on average) is more than compensated by the performance improvement achievable with high-level specification, simulation and optimization
- Further manual optimizations are feasible if the economic returns justify them

That's all, folks!
Thank you for your attention!