High level synthesis tools

Some Trends in High-level Synthesis Research Tools
Tanguy Risset
Compsys, Lip, ENS-Lyon
http://www.ens-lyon.fr/COMPSYS
Outline
• Context: Why High level synthesis?
• HLS Hard problems
• Some solutions in existing tools
• Some on-going projects
Context: Embedded Computing Systems design
• SoCs and MPSoCs for multimedia applications will soon include:
 Network on chip
 Dozens of initiators (CPU, DMA, …)
 Mbytes of code
 Operating systems
 Shared memory coherency protocols
 …
• SoC design problems:
 Time to market
 Design space exploration
 Software complexity
Some envisaged solutions
• Time to market
 IP re-use
 High level design
• Design space exploration
 Fast prototyping and performance evaluation, refinement
methodology (specification, algorithm, TLM, CABA)
• Software complexity
 Tools for embedded code generation/embedded OS
• High level synthesis is only a small part of the « High level
Design » process
Definition of High Level Synthesis
• HLS generates a register-transfer level (RTL) description from a
behavioral specification, in an automatic or semi-automatic way.
• Input:
 A behavioral specification
 Design constraints
 Library of available RTL components
• Output:
 RTL description
 Performance evaluations
Refinement: from algorithm to hardware
[Figure: refinement flow from algorithm to hardware. Labels: algorithm domain; algorithmic exploration (Matlab, C); Transaction Level Modeling; abstract architecture; System; application design (SoC Intermediate Representation); SoC platform design; virtual prototype; Architecture Description Language; block specification; IP block design; block implementation (RTL synthesis; VHDL, Verilog).]
Abstraction levels for HLS
• AL = Algorithm
 Prior to HW/SW partition
• TLM = Transaction-Level Model
 After HW/SW partition
 Models bit-true behavior, register bank, data transfers, system synchronisation
 No timing needed
• T-TLM = Timed TLM (also PVT)
 TLM + timing annotation
 Refined communication model
• CABA = Cycle Accurate-Bit Accurate
 Models state at each clock edge
• RT = Register Transfer (ASIC flow entry point)
 Synthesisable model
Pros and cons
• « Traditional » motivations:
 Fast design
 Safe design: formal refinement approach
 « Must be used » to cope with Moore's law
• But!
 Commercial tools are not there yet
 A new tool is a big investment
 Designers have managed without it
New motivations?
• IP re-use
 Slightly changing the design parameters makes an IP re-usable
• New target technologies and languages (FPGA, SystemC, etc.)
 Tools can easily re-target the designs
• CAD tool companies are investing a lot in « high-level-like » synthesis tools
 Monet, Behavioral Compiler, VCC, …
• Technological advantage
 Traditional RTL design will be offshored to Asia
HLS Hard Problems
• Huge design space
 Complex design space exploration
 Multi-criteria optimization techniques
• Integration into a design environment
 Lack of standard interchange format
 SoC simulation time is a crucial issue
• Acceptance by the designers
 Find a language common to SoC designers and tool designers
• Refinement technical problems
 (detailed hereafter)
HLS technical problems
• Compilation assumes that the target architecture is precisely known
• In HLS, the target architecture is only partially specified. Examples:
 Data-flow architecture/systolic arrays: pure RTL description
 FSM + data path: closer to a processor description
• HLS technical problems:
 Initial specification format / language
 Specification refinement: fixed point arithmetic
 Scheduling/Mapping refinement: resource constraints
 Technological mapping refinement
Initial specification format
• Restrictions on the expressivity of the input language are necessary
• … but designers hate new languages
• C-like languages (Handel-C, Silicon-C, Hardware-C, etc.) are actually
hardware description languages
• Main problems:
 How to express parallelism/sequentiality
- Data-flow, CSP-like, process network, event-driven
 How to express both algorithmic and RTL descriptions
 How much expressivity
- Dynamic control, loops
 How to introduce constraints/hints
Fixed point arithmetic
• Problem: translate a floating point computation into a fixed
point computation
• Most of the tools start with an initial fixed point
specification found by extensive simulation.
• Automatic techniques do not handle loops
• For signal processing applications, signal processing theory
can help (the transfer function is used to compute the
signal-to-noise ratio).
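To make the floating point to fixed point translation concrete, here is a minimal C sketch. It assumes a Q15 format (1 sign bit, 15 fractional bits) and hypothetical helper names (`to_q15`, `from_q15`, `q15_mul`); none of this comes from the tools discussed here, which may choose word lengths very differently.

```c
#include <stdint.h>

/* Assumed format: Q15, i.e. value = integer / 2^15, range [-1, 1). */
#define Q15_FRAC_BITS 15
#define Q15_ONE       32768.0

/* Quantize a real value to Q15, rounding to nearest and saturating. */
static int16_t to_q15(double v)
{
    double scaled = v * Q15_ONE;
    if (scaled >= 32767.0)  return INT16_MAX;  /* saturate high */
    if (scaled <= -32768.0) return INT16_MIN;  /* saturate low  */
    return (int16_t)(scaled + (scaled >= 0 ? 0.5 : -0.5));
}

/* Convert a Q15 value back to a real number. */
static double from_q15(int16_t q)
{
    return (double)q / Q15_ONE;
}

/* Q15 x Q15 -> Q15 multiply: the 32-bit product is in Q30, so it is
   rounded and shifted back by 15 bits. Assumes an arithmetic right
   shift for negative values, which is what common compilers provide. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */
    p += 1 << (Q15_FRAC_BITS - 1);         /* round to nearest */
    p >>= Q15_FRAC_BITS;
    if (p > INT16_MAX) p = INT16_MAX;      /* only (-1) * (-1) overflows */
    return (int16_t)p;
}
```

The difference between `from_q15(to_q15(v))` and `v` is the quantization noise that the signal-to-noise analysis mentioned above has to bound.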
Scheduling/Mapping
• For a « basic block », resource-constrained scheduling is NP-hard, but
widely studied.
• Computations
 Currently, two ways to handle loops:
- Unroll them
- Keep them sequential
 Other solutions:
- Use software pipelining theory
- Use the polyhedral model
• Memory and communication
 Memory mapping is usually strongly guided by the user
- Highly active research field (Catthoor, Darte)
 Communication refinement is also an important issue
- Highly dependent on the chosen computation model (Gajski, Kienhuis)
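Since the exact problem is NP-hard, heuristics such as list scheduling are commonly used for a basic block. The sketch below is only an illustration under simplifying assumptions that are not from the slides: every operation takes one cycle, all operations compete for the same `units` identical operators, and priority is simply the operation index (real schedulers typically use a critical-path priority). The names `list_schedule`, `dep` and `MAX_OPS` are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 16

/* dep[i][j] != 0 means operation j consumes the result of operation i.
   Fills start[i] with the cycle assigned to operation i and returns the
   schedule length in cycles. The dependence graph must be acyclic. */
static int list_schedule(int n, int dep[MAX_OPS][MAX_OPS],
                         int units, int start[MAX_OPS])
{
    assert(n > 0 && n <= MAX_OPS && units > 0);
    int done = 0, cycle = 0;
    for (int i = 0; i < n; i++)
        start[i] = -1;                      /* -1 = not yet scheduled */

    while (done < n) {
        int used = 0;                       /* operators busy this cycle */
        for (int j = 0; j < n && used < units; j++) {
            if (start[j] >= 0)
                continue;                   /* already scheduled */
            int ready = 1;
            for (int i = 0; i < n; i++) {
                /* a predecessor must have run in an earlier cycle */
                if (dep[i][j] && (start[i] < 0 || start[i] >= cycle))
                    ready = 0;
            }
            if (ready) {
                start[j] = cycle;
                used++;
                done++;
            }
        }
        cycle++;
        assert(cycle <= n);                 /* fails on a cyclic graph */
    }
    return cycle;
}
```

For a 4-tap FIR body (four multiplications feeding a chain of three additions), this sketch yields 5 cycles with two operators and 4 cycles with four, which shows the area/latency trade-off the user constraints control.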
Technological mapping refinement
• Fine technological mapping is very target-dependent
• Predefined libraries are not precise enough
 Delays on wires
 Power consumption
• VLSI designers' « tricks » are difficult to integrate into tools
• Sub-micron technology constraints are changing too
fast for high-level tools
 Cross talk
 Capacitance
Some solutions in existing tools
• Digital signal processing circuits:
 Gaut: http://lester.univ-ubs.fr:8080
 Source: signal processing (one infinite loop)
 Target: RTL + FSM
• FSM + data path
 Ugh: http://www-asim.lip6.fr/recherche/disydent/
 Source: restricted C
 Target: FSM + data path
• Regular computation and polyhedral model
 MMAlpha: http://www.irisa.fr/cosi/ALPHA/
 Source: functional specification
 Target: systolic-like architectures
GAUT: Génération Automatique d'Unité de Traitement (automatic generation of processing units)
• Developed first at LASTI (Lannion) and then at LESTER (Lorient); free
• Generates an RTL description from a behavioral description of a signal
processing algorithm
• Kernel technology: highly optimized resource-constrained scheduling
• Inputs are
- a behavioral VHDL description (one process repeated
infinitely)
- libraries of pre-characterized operators
- some design constraints
• Outputs are
- a synthesizable RTL VHDL description (data path, memory,
and communication units)
- a Gantt chart for I/O specification
Gaut design flow
[Figure: Gaut design flow. A behavioral VHDL description is compiled (analysis, loop unrolling) into a graph; synthesis (selection, scheduling, mapping) then uses the operator library and the user constraints (latency, clock frequency, operators, allocation, etc.) to produce the RTL description (data path + control) and the memory and I/O specifications. File types shown: .vhd, .src, .lib, .gc, .mem.]
Gaut: VHDL input code
• Sequential instructions in one single process (no clock, no
reset, no sensitivity list)
ENTITY fir IS
  PORT (xn : IN INTEGER; yn : OUT INTEGER);
END fir;

ARCHITECTURE behavioral OF fir IS
  ...
BEGIN
  PROCESS
    VARIABLE H, x : vecteur;
    VARIABLE tmp  : INTEGER;
    VARIABLE i    : CONTROL;
  BEGIN
    -- multiply-accumulate over the N taps
    tmp := xn * H(0);
    FOR i IN 1 TO N-1 LOOP
      tmp := tmp + x(i) * H(i);
    END LOOP;
    yn <= tmp;
    -- shift the delay line and store the new sample
    FOR i IN N-1 DOWNTO 2 LOOP
      x(i) := x(i-1);
    END LOOP;
    x(1) := xn;
    WAIT FOR cadence;
  END PROCESS;
END behavioral;
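As a reading aid, the behavior of this process can be mirrored by a small C reference model, which is handy for checking a generated RTL description against expected samples. This is only a sketch: `TAPS`, `fir_state` and `fir_step` are hypothetical names, the filter length is an illustrative value, and one call stands for one iteration of the VHDL process.

```c
#define TAPS 4   /* filter length N; illustrative value, must be >= 2 */

typedef struct {
    int h[TAPS];  /* coefficients H(0) .. H(N-1) */
    int x[TAPS];  /* delay line; x[0] is unused, as in the VHDL process */
} fir_state;

/* One iteration of the process: consume sample xn, return output yn. */
static int fir_step(fir_state *s, int xn)
{
    int tmp = xn * s->h[0];
    for (int i = 1; i <= TAPS - 1; i++)      /* multiply-accumulate */
        tmp += s->x[i] * s->h[i];
    for (int i = TAPS - 1; i >= 2; i--)      /* shift the delay line */
        s->x[i] = s->x[i - 1];
    s->x[1] = xn;                            /* store the new sample */
    return tmp;
}
```

Feeding an impulse (1 followed by zeros) into a zero-initialised delay line returns the coefficients one by one, which is a quick sanity check of the model.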
Gaut: Input code
• Types
 Bit, Boolean, Std_Logic, Integer (single size), Bit_Vector,
Std_Logic_Vector
 Arrays (to be inlined)
• Sequential instructions
 Signal and variable assignments
 Only one level of if
 For and While loops (to be inlined)
 Procedure calls (to be inlined)
 Function calls corresponding to library elements
Gaut step 1: Source code transformation
• Control dependence elimination
 Loop unrolling
Before:
  y(0) := x(0) * h(0);
  for i in 1 to n - 1 loop
    y(i) := y(i - 1) + x(i) * h(i);
  end loop;
After (shown for n = 4):
  y(0) := x(0) * h(0);
  y(1) := y(1 - 1) + x(1) * h(1);
  y(2) := y(2 - 1) + x(2) * h(2);
  y(3) := y(3 - 1) + x(3) * h(3);
 Procedure inlining
 Static single assignment
Before:
  b := x + z;
  a := b + c;
  b := e + f;
  y := b;
After:
  b := x + z;
  a := b + c;
  b0001 := e + f;
  y := b0001;
Gaut step 1: Source code transformation
• Simple expression generation
Before:
  b := x + z * u;
After:
  tmp := z * u;
  b := x + tmp;
• Constant propagation
• Generation of the GC graph (data-flow graph format of synchronous
programming)
GAUT step 2: Scheduling/Mapping
• In addition to throughput and clock cycle, the user can
give:
 Resource constraints and mapping constraints
 Memory constraints
 I/O constraints
 Optimization type
• The result is an architecture and Gantt charts
 For computations
 For I/O
 For memory
Gaut step 3: memory and communication synthesis
• Optimizing memory layout and minimizing buses
[Figure: generated architecture. Blocks: I/O, control, datapath, communication unit, memory unit.]
Gaut: summary
• Advantages
 Advanced development status (still a research tool)
 User-guided synthesis
 Open library
 Active research team: memory optimization, communication
synthesis
• Drawbacks
 Loop flattening (complexity problem)
 Predefined timing characteristics
 Hard to get out of 1D signal processing
Ugh: User Guided High Level Synthesis
• Developed at LIP6 (Paris), as part of the Disydent project (Digital
System Design Environment); open source
• Behavioral-level synthesis tool for control-dominated coprocessors
• Emphasis on precise timing estimation
• Kernel technology: resource-constrained scheduling and (GNU-like)
compiler construction technology
• Inputs are
- a C or VHDL behavioral description with KPN
communication primitives
- a draft data path
- a cycle time constraint TC
• Outputs are
- a synthesizable RTL VHDL model
- a cycle-accurate simulation model
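Coprocessor System Environment
[Figure: coprocessor system environment. An R3000 processor with ICache and DCache, a bus controller unit, RAM and the coprocessor (through a master/slave interface) are connected by a PI-BUS.]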
UGH Structure
[Figure: UGH structure. Inputs: Ugh C description, draft data path, cell library. UGH-CGS (coarse grain scheduler) produces a VHDL data path; synthesis + characterization (depends on the synthesis tool, Synopsys) produces timing annotations; UGH-FGS (fine grain scheduler, given the clock CK) produces a VHDL FSM/C description. Final outputs: VHDL data path + FSM and a CABA simulation model.]
Input 1: UGH-C
C description:
#include <ughc.h>

/* Communication channels */
ugh_inChannel32 work2hcfa;
ugh_inChannel32 work2hcfb;
ugh_outChannel32 hcf2work;

/* Registers */
uint32 a, b;

/* Highest common factor by repeated subtraction */
void hcf(void)
{
    while (a != b)
        if (a < b)
            b = b - a;
        else
            a = a - b;
}

int ugh_main()
{
    while (1) {
        channelRead(work2hcfa, &a);
        channelRead(work2hcfb, &b);
        hcf();
        channelWrite(hcf2work, &a);
    }
}

VHDL interface:
library IEEE;
use ieee.std_logic_arith.all;

entity HCF is
  port (CK    : in bit;
        DINA  : in integer;
        READA : out bit;
        ROKA  : in bit;
        DINB  : in integer;
        READB : out bit;
        ROKB  : in bit;
        DOUT  : out integer;
        WRITE : out bit;
        WOK   : in bit);
end HCF;
Input 2: Draft data path
model Hcf(sofifo hcf2work;
          sififo work2hcfa,
                 work2hcfb)
{
    DFFl a, b;
    SUB subst;

    subst.A = a.Q, b.Q;
    subst.B = a.Q, b.Q;
    a.D = subst.S, work2hcfa;
    b.D = subst.S, work2hcfb;
    hcf2work = subst.S;
}
[Figure: draft data path. Registers a and b (ports D, Q) feed the subtractor Subst (inputs A, B; output S).]
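Output 1: Refined data path
[Figure: refined data path. Registers RegA and RegB (ports d, q; clock ck; write enables we_ra, we_rb), multiplexers M1 to M4 (inputs i0, i1; output z; selects sel_m1 to sel_m4), subtractor Subst (inputs a, b; output s; command op_subst; signals zero, co, inf), data inputs dina and dinb, data output dout.]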
Output 2: FSM for control
[Figure: control FSM. Labels: RESET, START, READY, WHILE, IF, READA, READB, ROKA, ROKB, S1, S2, WRITE, WOK.]
Ugh summary
• Advantages
 Precise timing information
 Multi-cycle operations
 Almost a compiler approach (restricted target architecture)
 Interfacing (integrated in a SoC design environment)
• Drawbacks
 Development status (research tool)
 Low-level information given by the user
 Highly dependent on a commercial tool (Synopsys)
 Dedicated to control-oriented applications
MMAlpha
• Developed at Irisa (Rennes); open source
• High level synthesis of highly pipelined accelerators
• Kernel technology: polyhedral model and systolic design
methodology
• Emphasis on loop transformations
• Input:
 Functional specification (Alpha language)
• Output:
 RTL description of a systolic-like architecture (Alpha or VHDL)
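MMAlpha design flow
[Figure: MMAlpha design flow. A loop nest (For i=1:1:N, For j=1:1:N) is expressed in Alpha, then goes through uniformization, scheduling and RTL derivation to VHDL for an FPGA; the FPGA is connected by a bus to a host running C code.]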
What is the polyhedral model?
• Abstracts a loop nest by the polyhedron described by the
loop indices during execution of the loop
• Can be used for any index-based structure: memory
(arrays), communications (accesses), etc.
• Example: convolution (FIR filter)
$$y(i) = \sum_{n=0}^{N-1} H(n)\, x(i-n)$$
for (i = N; i <= M; i++) {
    y[i] = 0;
    for (n = 0; n <= N - 1; n++) {
        y[i] = y[i] + H[n] * x[i - n];
    }
}
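FIR: iteration space
[Figure: FIR iteration space in the (i, n) plane, origin at 0. Labels: H(0) to H(N-1) along the n axis; x(N), x(N+1) along the i axis; results y(N), y(N+1).]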
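FIR polyhedral representation (MMAlpha input language)
$$\{\, i, n \mid N \le i \le M;\ 0 \le n \le N-1 \,\}: \quad Y[i,n] = Y[i,n-1] + H[n] \cdot x[i-n]$$
[Figure: the same iteration space in the (i, n) plane, with H(0) to H(N-1) along n, x(N), x(N+1) along i, and results y(N), y(N+1).]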
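MMAlpha polyhedral scheduling
$$\{\, i, n \mid N \le i \le M;\ 0 \le n \le N-1 \,\}: \quad Y[i,n] = Y[i,n-1] + H[n] \cdot x[i-n]$$
[Figure: the same iteration space in the (i, n) plane, with schedule lines t = 4, 5, 6 drawn across it.]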
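MMAlpha space-time transformation
$$\{\, t, p \mid p + N \le t \le p + M;\ 0 \le p \le N-1 \,\}: \quad Y[t,p] = Y[t-1,p-1] + H[p] \cdot x[t-2p]$$
[Figure: transformed iteration space in the (t, p) plane, with H(0) to H(N-1) along p, schedule lines t = 4, 5, 6, inputs x(N), x(N+1) and result y(N).]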
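Comparing the recurrence in (i, n) with the one in (t, p) suggests the change of basis t = i + n, p = n (an inference from the two formulas, not something stated on the slides). The following C sketch checks that enumerating the computation in (t, p) order, with time as the outer loop and processors as the inner loop, gives the same results as the original loop nest. The sizes `N` and `M` and the function names are illustrative assumptions.

```c
enum { N = 4, M = 9 };   /* illustrative sizes, not from the slides */

/* Original loop nest over (i, n). */
static void fir_loop_nest(const int H[N], const int x[M + 1], int y[M + 1])
{
    for (int i = N; i <= M; i++) {
        y[i] = 0;
        for (int n = 0; n <= N - 1; n++)
            y[i] += H[n] * x[i - n];
    }
}

/* Same computation enumerated over (t, p), assuming t = i + n, p = n.
   The outer loop is time, the inner loop ranges over processors. */
static void fir_space_time(const int H[N], const int x[M + 1], int y[M + 1])
{
    for (int i = N; i <= M; i++)
        y[i] = 0;
    for (int t = N; t <= M + N - 1; t++) {
        for (int p = 0; p <= N - 1; p++) {
            int i = t - p;                 /* inverse change of basis */
            if (i < N || i > M)
                continue;                  /* point outside the polyhedron */
            y[i] += H[p] * x[t - 2 * p];   /* x[i - n] becomes x[t - 2p] */
        }
    }
}
```

In the sequential version the inner loop over p is executed one processor after the other; in the systolic architecture all N processors perform their multiply-accumulate in the same cycle t.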
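MMAlpha mapping
$$\{\, t, p \mid p + N \le t \le p + M;\ 0 \le p \le N-1 \,\}: \quad Y[t,p] = Y[t-1,p-1] + H[p] \cdot x[t-2p]$$
[Figure: the (t, p) iteration space with schedule lines t = 4, 5, 6 and arrows labelled y, H and x showing how each variable moves across the processors; inputs x(N), x(N+1), coefficients H(0) to H(N-1), result y(N).]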
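MMAlpha resulting architecture
[Figure: resulting architecture, a linear array of cells p = 0, p = 1, …, p = N-1 holding weights w0, w1, w2, …, wN-1. Signals shown: x(n), d(n), y(n), e(n), the delayed values x(n+D-1), x(n-2N+D+1), x(n-2N+2), e(n-N+1), and a D-1 delay.]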
MMAlpha current features
• Tool box for designers:
 Powerful analysis tools
 Pipelining, change of basis, multi-dimensional scheduling,
control signal generation
 Code generation (C, VHDL)
 Hierarchical design methodology
• Work in progress:
 Resource-constrained scheduling (extension to Z-polyhedra)
 Multi-dimensional scheduling and memory synthesis
MMAlpha summary
• Advantages
 Design tool integrating loop transformations
 Parameterised design (N, the size of the filter, is not fixed until VHDL
generation)
 Formal approach for refinement (functional to operational)
 A real language that syntactically captures HLS input restrictions
• Drawbacks
 Does not yet handle resource constraints
 A language (Alpha) and a design methodology very different from
designers' habits
 Implementation status (research tool)
Some design results
• The Ugh team compared IDCT implementations against CoWare and Gaut, but
the results are highly dependent on design parameters

                          Ck period (ns)   #cycles   Exec time (µs)   Area (mm^2)   Area (#inverter)
Manual (time optimised)   10.41            118       1.228            N/A           242.1
CoWare                    21               1 645     34.545           19.94         165.6
Gaut                      17.5             526       9.2              19            123.5
Ugh                       17               1 466     25.922           10.9          70.9

• MMAlpha demonstrates a real implementation on an FPGA co-processor
board (DLMS algorithm)

8-tap DLMS filter   Area          Clock frequency   Synthesis time
MMAlpha             2600 slices   35 MHz            112 s
HLS conclusion
• HLS tools are not mature enough to produce the famous
« C-to-VHDL » magic tool
• Most tool designers agree that a highly « user guided »
approach is mandatory
• CAD tool vendors are still actively developing tools (Mentor:
Catapult-C, CoWare: Cocentric, …)
• Some progress has been made
 Domain-specific constraints are more clearly identified (control
oriented or data flow)
 Interfacing is studied together with the synthesis
 Fast simulation is an important issue addressed by HLS tools
On-going project: Data-Flow IP interface
[Figure: data-flow IP interface. A VCI input side (IN CTRL, I_FIFO 1, I_FIFO 2, input patterns) and a VCI output side (OUT CTRL, O_FIFO 1, O_FIFO 2, output patterns) connect a generic network to the dataflow hardware accelerator.]
• Gaut (Lester) and MMAlpha (Irisa, Lip) are developing a
common interface for their IPs (data-flow IPs)
On-going project: SocLib
• SocLib environment
 Public domain SystemC simulation models for SoC IP:
- Cycle-accurate hardware simulation
- TLM simulation
 VCI interconnection standard
 French open academic initiative (should become European through
EuroSoc): http://soclib.lip6.fr/
• Typical platform:
[Figure: typical SocLib platform. Four MIPS processors, each with a cache and a VCI interface, are connected through a bus / network on chip (SPIN) to RAM (prog, boot), TTY, ASIC and DMA components; prog.c is compiled with GCC-MIPS into the MIPS executable loaded in RAM.]
On-going project: Loop transformation for compilation
• Unified loop nest transformation framework for the optimization of
compute/data intensive programs (Alchemy Inria project: http://www-rocq.inria.fr/~acohen/software.html).
• WRaP-IT: an Open-64/ORC interface tool
Thanks
• Slides prepared with help from Lester and LIP6
• Here are some tools I did not talk about: Amical, Cathedral, High2, RapidPath, Flash, A/RT, Compaan,
Syndex, Phideo, Bach, SPARK, CriticalBlue, Chinook, SCE, CodeSign, Esterel, precisionC, Polis, Atomium, Ptolemy, HandelC, Cyber, Bridge, MCSE, Madeo, SpecC, and many more…
Any questions?