Technion – Israel Institute of Technology
Faculty of Electrical Engineering
High Speed Digital System Lab (HS DSL)

Elad Hadar, Omer Norkin
Supervisor: Mike Sumszyk
Winter 2010/11, single-semester project. Date: 30/5/11

• PROC_HILs is a Hardware-In-the-Loop acceleration tool for running Simulink designs on FPGAs.
• Automatically translates Simulink designs into FPGA code (compatible with the PROC board installed on the target PC) and runs it under Simulink.
• Dramatically improves simulation speed by acting as a dedicated accelerator for Simulink designs.
• Enables building a design visually and downloading it directly, with minimal effort, into the PROC board.
• Enables concurrent engineering at an early stage.
• Cuts development cycle time (and costs).
• Improves design reliability.

Implementing a video analysis design on the GiDEL PROCStar III platform, enabling the use and exploration of a new development platform (PART I – PROC_HILs).
Proper usage of development tools throughout all stages of implementation, from algorithm to hardware.

• PROC_HILs enables the user to download a Simulink design into the PROC board and run it.
• The design runs on the on-board FPGAs, communicating with Simulink in real time.
• The generation process is fully automatic.

Generation flow:
1. Simulink design.
2. HDL code is generated, synthesized and compiled to get an .rbf file (FPGA binary file) compatible with the specific PROC board.
3. A new Simulink design file is generated: a single HIL block including all the inputs and outputs that were present in the original design, connected to all the sources and sinks.
4. The design runs on the hardware, fully synchronized with Simulink, receiving the signals from the simulation sources and outputting the results into the sinks.
• Main development stages were carried out on a GiDEL PROCe III (Altera Stratix III) board (1 FPGA)
• GiDEL PROC_HILs (version 2.1.2)
• Altera DSPBuilder blockset for Simulink (version 10.1)
• ProcWizard (version 8.8)
• Quartus II (version 10.1)
• Matlab (version 2009a)
• Additional development was carried out on a GiDEL PROCStar III (Altera Stratix III) board (4 FPGAs)

• NLD is a hardware implementation of the Non-Linear Diffusion algorithm for video images.
• Enables local smoothing of the picture while preserving edges.
• The Simulink design in this project is based on a previous project (performed in the Technion HS DSL by Tsion Bublil & Yony Dekell).
• The original project was implemented on a PROCStar II (Altera Stratix II) board (4 FPGAs), using the SynplifyDSP blockset library for Simulink.

• All I/Os must be placed on the top level of the design.
• Simulink sources must be configured to the same clock that toggles the input port they feed.
• All signals from the workspace blocks feeding input blocks, and all frame output blocks, must use the same frame size (as seen in the previous slide).
• The design must obey the rules of the following table.*
* PROC_HILs User Guide V2.1.2, p. 49

[Figure: Simulink block diagram of the NLD design. Inputs R and beta; pipelined adders, delays (z^-1, z^-256, z^-206), a multiply-add block (y = a0 × b0 + a1 × b1), multipliers, a square root block and a divider (a = b × q + r); outputs g11, g12, g22, Rx, Ry, R256 and gm_out.]

• Determining the clock rate
– The video processing algorithm has to process 15 iterations of 256 by 256 pixels per frame, at a reasonable rate of 15 frames per second:
$\text{clock rate} = 256^2 \cdot 15 \cdot 15 = 14{,}745{,}600\ [\text{Hz}] \approx 15\ [\text{MHz}]$
• A long logic path prevents the design from meeting the clock rate requirement, and compilation fails.
– The Altera DSPBuilder Advanced blockset supports automatic pipelining (not used in this project).
– The Altera DSPBuilder blockset supports user pipelining, either through the internal pipeline setting of a block (determined by the user) or by inserting Delay blocks along the logic path. This method requires careful attention from the designer, who must guarantee by design that all logic paths remain fully synchronized.
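The manual pipelining method described above can be illustrated with a small behavioural model. This is only a sketch in plain Python, not DSPBuilder code: the helper names (`delay`, `pipelined_multiply`, `adder`) and the 2-cycle multiplier latency are hypothetical, chosen for illustration. It shows that when one input of an adder passes through a pipelined block, the other input needs a matching Delay, otherwise samples from different clock cycles are combined.

```python
def delay(samples, n):
    """Model an n-cycle register delay (z^-n): zero-filled, same length as the input."""
    return ([0] * n + list(samples))[:len(samples)]


def pipelined_multiply(a, b, latency):
    """Model a multiplier whose result appears `latency` cycles after its inputs."""
    return delay([x * y for x, y in zip(a, b)], latency)


def adder(a, b):
    """Model a combinational adder: adds the samples present in the same cycle."""
    return [x + y for x, y in zip(a, b)]


# Target function: y[n] = a[n] * b[n] + c[n]
a = [1, 2, 3, 4, 5, 6]
b = [2, 2, 2, 2, 2, 2]
c = [10, 20, 30, 40, 50, 60]
MULT_LATENCY = 2  # hypothetical pipeline depth chosen by the designer

reference = [x * y + z for x, y, z in zip(a, b, c)]
product = pipelined_multiply(a, b, MULT_LATENCY)

# Unbalanced: c arrives MULT_LATENCY cycles earlier than the matching product.
unbalanced = adder(product, c)

# Balanced: a Delay of the same depth is inserted on the c path.
balanced = adder(product, delay(c, MULT_LATENCY))

# After the initial latency, the balanced output equals the reference stream.
valid = len(reference) - MULT_LATENCY
print("balanced  :", balanced[MULT_LATENCY:] == reference[:valid])
print("unbalanced:", unbalanced[MULT_LATENCY:] == reference[:valid])
```

The balanced path reproduces the reference stream shifted by the multiplier latency, while the unbalanced path does not. In a real design the same bookkeeping has to be repeated at every point where paths of different depth meet.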
[Figure: Simulink block diagrams of the NLD design. First diagram: the gm_out computation (inputs g11, g12, g22, beta; multipliers, pipelined adder, square root, divider). Second diagram: inputs R, dt, g11, g12, g22, Rx, Ry; multipliers, pipelined adders and delays (z^-256, z^-255); intermediate signals Rp, dpx, dpy; output belt_r with min/max.]

• Validating the performance of the completed design using the Simulink environment.
• A fully automatic compilation and synthesis starts by activating the GiDEL HIL Generation Tool block.
[Figure: GiDEL HIL Generation Tool block]
• A preliminary compatibility test starts by pressing the prompt GUI button.
– Checks that the design meets the design rules.
– Does not check hardware fitting and feasibility.
• The “GO” button issues a full compilation and synthesis of the design.
• The generation flow can be adjusted by selecting “Advanced Mode”, which controls the enabling/disabling of the different flow stages.
• Generation ends with a new Simulink design file.
[Figure: generated <your_design_name>_HIL Simulink design. Signal From Workspace sources (vecR, beta, dt) and a clock (period 6.666666666666667E-8 sec) feed the hw_loop_6b_HIL_HW_block through Convert blocks; the output prob_belt goes to a To Workspace block.]

• PROC_HILs does not fully report the feasibility and hardware consumption of the design.
– Quartus files are generated only while the generation process is active, and are then automatically deleted.
– Solution: during generation, extract the Quartus top-level design and compile it independently with Quartus.
• NLD hardware consumption:
• Original image:
• Smoothed image (3 iterations):
• Calculated warm-up time:
– Simulation overhead: 9.9712 [sec]
– Hardware overhead: 9.60422 [sec]

[Figure: "Run time, simulation and hardware". Bar chart of run time [sec] versus vector length (15,000; 75,000; 150,000; 750,000; 1,500,000; 7,500,000; 15,000,000) for two series, Simulink simulation and hardware simulation. Data labels as extracted: 5792.6473, 2902.047122, 15.38741, 37.052384, 9.876949, 9.860184, 296.137446, 64.449291, 11.885162, 10.134778, 583.806265, 14.166178, 32.085269, 54.825545.]
• Run time ratio with the overhead subtracted: 128.645454
Eli’s comment: All simulations were made on:

[Figure: "Run time ratio, simulation and hardware". Ratio = (run time, simulation) / (run time, hardware), plotted against vector length from 0 to 15,000,000.]

• Implementing NLD as part of real-time video capture/view streaming.
• Webcam envelope:
– Resizing the image (256x256)
– Performing “log” on the resized image
– Spreading the image into vector form
– Reshaping to matrix form
– Performing “power” on the processed image
Frame rate: 15 [frames/sec]
• The NLD algorithm hardware block is inserted into the webcam envelope.
Insufficient frame rate: 0.077 [frames/sec]
• The hardware dramatically decreases the frame rate, although it is designed to sustain the desired frame rate.
– Operating frequency is 15 MHz.
• Conclusion: the Simulink/hardware interface overhead is too high to allow proper streaming in real-time applications.
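A back-of-the-envelope check of this conclusion, using only figures quoted in these slides (256x256 pixels, 15 iterations, a 15 MHz clock, 15 frames/sec target, 0.077 frames/sec measured). It is a rough sketch under two stated assumptions that are not from the original: the hardware processes one pixel-iteration per clock cycle, and all time not spent in the FPGA datapath is lumped together as interface (and envelope) time.

```python
FRAME_SIDE = 256        # pixels per image side
ITERATIONS = 15         # NLD iterations per frame
CLOCK_HZ = 15e6         # operating frequency quoted in the slides
TARGET_FPS = 15         # desired frame rate
MEASURED_FPS = 0.077    # frame rate measured with the hardware block in the loop

# Assumption: one pixel-iteration is processed per clock cycle.
cycles_per_frame = FRAME_SIDE ** 2 * ITERATIONS
compute_time = cycles_per_frame / CLOCK_HZ      # seconds of pure hardware work per frame

frame_budget = 1 / TARGET_FPS                   # seconds available per frame at the target rate
measured_time = 1 / MEASURED_FPS                # seconds actually spent per frame

# Assumption: everything that is not datapath time counts as interface time.
interface_time = measured_time - compute_time
interface_share = interface_time / measured_time

print(f"cycles per frame        : {cycles_per_frame}")
print(f"hardware time per frame : {compute_time:.4f} s (budget {frame_budget:.4f} s)")
print(f"measured time per frame : {measured_time:.2f} s")
print(f"share spent outside FPGA: {interface_share:.1%}")
```

Under these assumptions the datapath itself fits within the per-frame budget, and nearly all of the measured time per frame is spent outside the FPGA. This is consistent with the conclusion that the Simulink/hardware interface, not the hardware design, limits the frame rate.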
• A possible way to take advantage of PROC_HILs is to use a hardware loop.

[Figure: hardware-loop Simulink design. GiDEL Frame Input blocks fed by Signal From Workspace sources (vecR, beta, dt); a counter (mod 1048576) with If Statement blocks and constants driving a multiplexer; the Belt1d block (pipeline levels: 256); a FIFO (size: 256x256 - 256) with wreq, rreq, full, empty and usdw(15:0) signals; a GiDEL Frame Output feeding prob_belt (To Workspace); clock period 6.666666666666667E-8 sec; GiDEL HIL Generation Tool block.]

• Repeated attempts to generate the full HL design failed to converge within the hardware limits of the PROCe III board.
• The same design was implemented on a PROCStar III board, with no problems reported in the generation flow.
• Problem encountered: while the Simulink simulation showed reasonable results, the hardware simulation showed different results (efforts to find the origin and fix it were stopped due to the project’s time constraints).

• Strict software compatibility demands
– Only one combination of versions of the software involved works together (Matlab, PROC_HILs, Altera DSPBuilder, Quartus, ProcWizard).
• Moderate-size algorithms do not fit the common boards when using PROC_HILs and the Altera DSPBuilder blockset.
• The Altera DSPBuilder blockset offers a poor variety of blocks and lacks common operations (log, exp, power, nth root, not, min/max, …).
• For effective usage, one should use the Altera DSPBuilder Advanced blockset, but it requires the Simulink Fixed Point license.
• Requires data to flow as vectors and does not support matrices.
• Inconsistency between simulation and hardware performance.
• Inconvenient existing blocks
– Square Root: accepts and returns only whole numbers.
– Divider: returns the result only as a whole-number quotient and a remainder.

1. Allows easy design and implementation of algorithms in the Simulink environment.
• Direct hardware burn.
• Direct generation of HDL code that matches the target board.
• Fast HW simulation using the Simulink/Matlab interface.
2. Extremely efficient for resource-consuming processing algorithms.
3. Not suited for streaming-data designs (real-time designs).

Motivation: learning and practicing an effective debug methodology using PROC API.
GiDEL PROC_API – enables real-time configuration and querying of the board.
Main goals/phases:
1) Learning PROC API, PROC MegaFIFO
2) Defining and building an integrated DSPBuilder design combining PROC API video streaming functions, data channels and PROC MegaFIFO memories.

[Figure: PROC MegaFIFO with RX FIFO and TX FIFO, connected to PROC API]

Tasks (scheduled over Week 1 to Week 10):
• Learning PROC API, PROC MegaFIFO
• Building a simple design combining DSPBuilder and the PROC Wizard using PROC API
• Defining an integrated design combining PROC API video streaming functions and data channels, PROC MegaFIFO memories and a DSPBuilder design
• Verification and writing the project’s book