Mid Semester Presentation - High Speed Digital Systems Laboratory

Technion – Israel Institute of Technology
Faculty of Electrical Engineering
High Speed Digital System Lab (HS DSL)
Elad Hadar
Omer Norkin
Supervisor: Mike Sumszyk
Winter 2010/11, Single semester project.
Date: 30/5/11
• PROC_HILs is a Hardware-In-the-Loop
acceleration tool for running Simulink
designs on FPGAs.
• Automatically translates Simulink designs into
FPGA code (compatible with the PROC board
installed on the target PC) and run it under
Simulink.
• Dramatically improves simulation speed, with
a dedicated accelerator for Simulink designs.
• Enables building a design visually and
downloading it directly, with minimal effort,
into the PROC board.
• Enables concurrent engineering at an early
stage.
• Cuts development cycle time (and costs).
• Improves design reliability.
• Implementing a video analysis design on the GiDEL
PROCStar III platform, enabling use
and exploration of a new development platform
(Part I: PROC_HILs).
• Proper use of development tools
throughout all stages of implementation,
from algorithm to hardware.
• PROC_HILs enables the
user to download a
Simulink design into
PROC board and run it.
• The design runs on the
on-board FPGAs,
communicating with
Simulink in real time.
• Generation process is
fully automatic.
1. Simulink design.
2. HDL code is generated, synthesized and compiled to get an .rbf file
(FPGA binary file) compatible with the specific PROC board.
3. A new Simulink design file is generated: a single HIL block including
all the inputs and outputs that were present in the original design,
connected to all the sources and sinks.
4. The design runs on the hardware, fully synchronized with Simulink,
receiving the signals from the simulation sources and outputting the
results into the sinks.
• Main development stages were made
on a GiDEL PROCe III
(Altera Stratix III) board (1-FPGA)
• GiDEL PROC_HILs (Version 2.1.2)
• Altera's DSPBuilder blockset for Simulink (Version 10.1)
• ProcWizard (Version 8.8)
• Quartus II (Version 10.1)
• MATLAB (Version 2009a)
• Additional development was made
on a GiDEL PROCStar III
(Altera Stratix III) board (4-FPGA)
• NLD is a hardware implementation of the Non Linear
Diffusion algorithm for video images.
• Enables local smoothing of the picture while preserving
edges.
• The Simulink design in this project is based on a
previous project (performed in the Technion HS-DS Lab
by Tsion Bublil & Yony Dekell).
• The original project was implemented on a PROCStar
II (Altera Stratix II) board (4-FPGAs), using the
SynplifyDSP blockset library for Simulink.
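To make "local smoothing while preserving edges" concrete, here is a minimal 1-D sketch of one textbook form of nonlinear diffusion (Perona-Malik style). It is not the project's Simulink design and may differ from the NLD variant actually implemented. The function names, the edge-stopping function and the parameters `k`, `dt` and `iterations` are illustrative assumptions.

```python
def edge_stopping(d, k):
    """Diffusivity: close to 1 for small differences, close to 0 across strong edges."""
    return 1.0 / (1.0 + (d / k) ** 2)


def nonlinear_diffusion_1d(signal, k=5.0, dt=0.2, iterations=15):
    """Explicit 1-D nonlinear diffusion with zero-flux boundaries (illustrative only)."""
    u = [float(v) for v in signal]
    n = len(u)
    for _ in range(iterations):
        # flux[i] is the flow between sample i and sample i + 1
        flux = []
        for i in range(n - 1):
            d = u[i + 1] - u[i]
            flux.append(edge_stopping(d, k) * d)
        new_u = []
        for i in range(n):
            right = flux[i] if i < n - 1 else 0.0   # exchange with right neighbour
            left = flux[i - 1] if i > 0 else 0.0    # exchange with left neighbour
            new_u.append(u[i] + dt * (right - left))
        u = new_u
    return u


# Small ripple (amplitude 1) on both sides of a large step (height 100).
noisy_step = [0, 1, 0, 1, 0, 1, 100, 101, 100, 101, 100, 101]
smoothed = nonlinear_diffusion_1d(noisy_step)
```

In this sketch the small ripple is flattened because the diffusivity is near 1 for small differences, while the jump between samples 5 and 6 stays large because the diffusivity is near 0 there.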
• All I/Os must be placed on the top level of the design.
• Simulink sources must be configured to the same clock
that toggles the input port they feed.
• All signals from the workspace blocks feeding input
blocks and all frame output blocks must use the same
frame size (as seen in the previous slide).
• The design must obey the following table rules:
* PROC_HILs User Guide V2.1.2 p. 49
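The frame-size rule can be checked before starting a long generation run. The following is a minimal sketch, not part of PROC_HILs: the helper `check_frame_sizes` is hypothetical, and the block names and the frame size of 65536 samples (256 x 256) are assumed for illustration.

```python
def check_frame_sizes(frame_sizes):
    """Return the common frame size, or raise ValueError if the sizes differ."""
    distinct = set(frame_sizes.values())
    if len(distinct) != 1:
        raise ValueError(f"frame sizes differ: {frame_sizes}")
    return distinct.pop()


# Hypothetical workspace/frame blocks and their frame sizes (assumed values).
sizes = {"vecR": 65536, "beta": 65536, "dt": 65536, "prob_belt": 65536}
common = check_frame_sizes(sizes)
```

A mismatch (for example one block left at a different frame size) raises an error immediately instead of surfacing later in the generation flow.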
[Figure: Simulink block diagrams of the NLD design, built from Altera DSP Builder blocks (Pipelined Adder, Delay, Multiply Add, Multiplier, Square Root, Divider, Constant). Signals: R, Rx, Ry, R256, beta, g11, g12, g22; output: gm_out. Constants shown include 1048576, 9.5367e-007 and 0.00097656. Outputs Out1 to Out6 are aligned with z^-206 delays.]
• Determining clock rate
– The video processing algorithm has to process 15 iterations of
256 by 256 pixels per frame, at a reasonable rate of 15
frames per second.
clock rate = 256² × 15 × 15 = 14,745,600 [Hz] ≈ 15 [MHz]
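The clock-rate arithmetic can be reproduced with a short sketch. It assumes one pixel is processed per clock cycle, which the slide implies but does not state, and the function name is illustrative.

```python
def required_clock_hz(width, height, iterations_per_frame, frames_per_second):
    """Pixel operations per second, assuming one pixel per clock cycle."""
    return width * height * iterations_per_frame * frames_per_second


clock_hz = required_clock_hz(256, 256, 15, 15)   # 14,745,600 Hz
clock_period_s = 1.0 / 15e6                      # period at the rounded 15 MHz clock
```

The rounded 15 MHz clock corresponds to a period of about 6.67e-8 seconds.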
• A long logical path prevents meeting the clock rate demands,
and compilation fails.
– The Altera DSPBuilder Advanced blockset supports automatic
pipelining (not used in this project).
– The Altera DSPBuilder blockset supports user pipelining, either through
the internal pipeline setting of a block (determined by the user) or by
inserting Delays along the logical path. This method requires careful
attention from the designer, who must ensure by design that all
logical paths remain fully synchronized.
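Manual pipelining means every path that meets at a block must arrive with the same latency. A minimal sketch of that bookkeeping follows; the path names and latencies are hypothetical, and this is not a DSPBuilder API.

```python
def balancing_delays(path_latencies):
    """Return the extra delay stages each path needs so all paths match the slowest."""
    target = max(path_latencies.values())
    return {name: target - latency for name, latency in path_latencies.items()}


# Hypothetical latencies, in clock cycles, of paths that converge on one block.
latencies = {"multiplier_path": 3, "adder_path": 1, "bypass_path": 0}
extra = balancing_delays(latencies)
```

Here the bypass path would need three added Delay stages and the adder path two, so that all three signals reach the converging block on the same cycle.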
[Figure: pipelined versions of the NLD Simulink block diagrams. First diagram: Multipliers 1 to 5, Pipelined Adder, Square Root and Divider producing gm_out from beta, g11, g12 and g22. Second diagram: Multipliers 1 to 8, Pipelined Adders and Delays (z^-256, z^-255) combining R, dt, gm05, g11, g12, g22, Rx and Ry into dpx and dpy, with output belt_r and min/max limits.]
• Validating performance of the completed design, using
Simulink environment.
• A fully automatic compilation and
synthesis starts by activating the
GiDEL HIL Generation Tool block.
• A preliminary compatibility test starts by pressing the
prompt GUI button.
– Checks that the design meets the design rules.
– Does not check hardware fitting
and feasibility.
• The “GO” button issues a full compilation
and synthesis of the design.
• The generation flow can be adjusted
by selecting the “Advanced Mode”, which
controls the enabling/disabling of
different flow stages.
• Generation ends with a new Simulink design file.
[Figure: the generated Simulink model <your_design_name>_HIL. Signal From Workspace sources (vecR, beta, dt) pass through Convert blocks into the HIL hardware block (hw_loop_6b_HIL_HW_block); its output passes through a Convert block to To Workspace (prob_belt). Clock period: 6.666666666666667E-8 sec.]
• PROC_HILs does not fully report the feasibility
and hardware consumption of the design.
– Quartus files are generated only while the generation process is
active and are then automatically deleted.
– Solution: during generation, extract the Quartus top-level design and
compile it independently with Quartus.
• NLD Hardware consumption:
• Original image:
• Smoothed image
(3 Iterations):
• Calculated warm-up time:
• Simulation overhead: 9.9712 [sec]
• Hardware overhead: 9.60422 [sec]
Run time: Simulink simulation vs. hardware simulation (run time in seconds)

Vector length    Simulink simulation    Hardware simulation
15,000           15.38741               9.876949
75,000           37.052384              9.860184
150,000          64.449291              10.134778
750,000          296.137446             11.885162
1,500,000        583.806265             14.166178
7,500,000        2902.047122            32.085269
15,000,000       5792.6473              54.825545
• Reduced overhead, time ratio: 128.645454
Eli’s comment: All
Simulations were made
on:
[Figure: Run time ratio (Simulink simulation run time / hardware run time) versus vector length, for vector lengths from 0 to 15,000,000; ratio axis from 0 to 120.]
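The run-time ratio can be recomputed from the measured run times. In this sketch the pairing of each value with a series and a vector length is an assumption recovered from the chart labels. It gives the raw ratio only, without subtracting the warm-up overheads.

```python
vector_lengths = [15_000, 75_000, 150_000, 750_000, 1_500_000, 7_500_000, 15_000_000]
# Run times in seconds; assignment to series is inferred from the chart (assumption).
simulink_sec = [15.38741, 37.052384, 64.449291, 296.137446,
                583.806265, 2902.047122, 5792.6473]
hardware_sec = [9.876949, 9.860184, 10.134778, 11.885162,
                14.166178, 32.085269, 54.825545]

# Raw speed-up of the hardware run over the Simulink run for each vector length.
ratios = [s / h for s, h in zip(simulink_sec, hardware_sec)]

for n, r in zip(vector_lengths, ratios):
    print(f"{n:>10,d}  ratio = {r:6.2f}")
```

The ratio grows with vector length because the roughly constant hardware overhead is spread over more samples.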
• Implementing NLD as part of real-time video capture/view
streaming.
• Webcam envelope:
– Resizing image (256x256)
– Performing “log” on resized image
– Spreading image to vector form
– Reshaping to matrix form
– Performing “power” on processed image
Frame rate: 15 [frames/sec]
• The NLD algorithm hardware block is inserted into the webcam envelope.
Insufficient frame rate: 0.077 [frames/sec]
• The hardware dramatically decreases the frame rate, even though it is
designed to support the desired frame rate.
– Operating frequency is 15 MHz.
• Conclusion: the Simulink/hardware interface overhead is too high to
allow proper streaming in real-time applications.
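The size of the gap is easier to judge as time per frame. A short sketch using only the two frame rates quoted above (variable names are illustrative):

```python
required_fps = 15.0     # frame rate of the webcam envelope alone
measured_fps = 0.077    # frame rate with the hardware block inserted

seconds_per_frame_required = 1.0 / required_fps   # about 0.067 s
seconds_per_frame_measured = 1.0 / measured_fps   # about 13 s
slowdown = required_fps / measured_fps            # about 195 times slower
```

Each frame takes about 13 seconds with the hardware block in the loop, against a budget of about 67 milliseconds.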
• A possible way to gain advantage of PROC_HILs is using
a hardware loop.
[Figure: hardware-loop Simulink design. A Counter (mod 1048576) is compared with the constant 65536 by an If Statement that drives the select input of a Multiplexer, choosing between the GiDEL Frame Input (vecR) and data fed back through a FIFO (size 256x256-(256)). Pipeline levels: 256. Inputs beta and dt arrive through GiDEL Frame Input blocks; the result leaves through GiDEL Frame Output to To Workspace (prob_belt). Clock period: 6.666666666666667E-8 sec. The design includes the GiDEL HIL Generation Tool block.]
• Multiple attempts at the full hardware-loop (HL) design failed to
fit within the hardware limits of the PROCe III
board.
• The same design was implemented on a PROCStar III
board, with no problems reported in the generation flow.
• Problem encountered: while the Simulink simulation
showed reasonable results, the hardware simulation showed
different results (efforts to find the origin and fix it were
stopped due to the project's time constraints).
• Strict software compatibility demands
– There is only one combination of the involved
software versions that matches (MATLAB, PROC
HIL, Altera DSPBuilder, Quartus, PROC Wizard)
• Moderate algorithms do not fit the common boards
when using PROC HIL and the Altera DSPBuilder blockset.
• The Altera DSP blockset variety is poor and does not
contain common operations (log, exp, power,
nth root, not, min/max…)
• For effective usage, one should use the Altera
Advanced DSP blockset, but it requires the Simulink
Fixed Point license.
• Demands data flow as vectors and does not support
matrices.
• Inconsistency between simulation and hardware
performance.
• Inconvenient existing blocks
– Square Root: accepts and returns only whole
numbers.
– Divider: returns only a whole-number quotient
and a remainder.
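One common way to work with integer-only square root and quotient/remainder division is to pre-scale by a power of two and rescale afterwards. The original block diagrams contain the constants 1048576 (2^20), 9.5367e-007 (about 2^-20) and 0.00097656 (about 2^-10), which suggests a similar scaling was used, though the slides do not say so. The sketch below is an illustration under that assumption, with hypothetical function names, and does not model the DSPBuilder blocks themselves.

```python
import math

SCALE_BITS = 20  # assumption: scale by 2**20 = 1048576 before the integer operation


def sqrt_fixed(x):
    """Approximate sqrt(x) using only an integer square root."""
    scaled = int(round(x * (1 << SCALE_BITS)))
    root = math.isqrt(scaled)                 # whole number in, whole number out
    return root / (1 << (SCALE_BITS // 2))    # sqrt(2**20) == 2**10, so rescale by 2**-10


def divide_fixed(a, b):
    """Approximate a / b from an integer quotient and remainder (a * 2**20 = b*q + r)."""
    quotient, remainder = divmod(a << SCALE_BITS, b)
    return quotient / (1 << SCALE_BITS), remainder


approx_root = sqrt_fixed(2.0)            # close to 1.414
approx_ratio, _ = divide_fixed(1, 3)     # close to 0.333
```

The precision is set by the scale factor: with 20 fractional bits the division is accurate to about 1e-6, and the square root to about 1e-3.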
1. Allows easy design and implementation of algorithms in the
Simulink environment.
• Direct hardware burn.
• Direct generation of HDL code that matches the target board.
• Fast HW simulation using the Simulink/MATLAB interface.
2. Extremely efficient for resource-consuming processing
algorithms.
3. Not suited for streaming-data designs (real-time designs).
Motivation: learning and practicing an effective debug
methodology using PROC API.
GiDEL PROC_API: enables real-time configuration and
querying of the board.
Main goals/phases:
1) Learning PROC API, PROC MegaFIFO
2) Define and build an integrated DSPbuilder design
combining PROC API video streaming functions, data
channels and PROC MegaFIFO memories.
[Figure: block diagram showing PROC API, PROC MegaFIFO, RX FIFO and TX FIFO.]
Schedule (Week 1 to Week 10), by task:
• Learning PROC API, PROC MegaFIFO
• Build a simple design combining DSPbuilder and the PROC Wizard
using PROC API
• Define an integrated design combining PROC API video streaming
functions and data channels, PROC MegaFIFO memories and
DSPbuilder design
• Verification and writing the project's book.