Introduction to circuit design using Celoxica's Handel-C

Presenter: Mr. David Sanders
Co-Sponsored by:
The Internet Innovation Centre and IEEE Computer Society
Agenda



FPGA overview
Purpose of Handel-C
Comparison with ANSI C
Handel-C data types
Parallelism
Special Handel-C constructs and data types
Hardware implementation of Handel-C constructs (some of them)
Optimization and retiming features of Celoxica’s design tool

FPGAs

Field Programmable Gate Array



A user programmable logic device with a collection of Look-Up-Tables (LUTs), routing resources, and Input/Output blocks (IOBs).
LUTs have a varying number of inputs depending on the vendor and technology used: usually at least 4, but state-of-the-art LUTs can have up to 8 (Altera’s 8-input fracturable LUT).
Modern FPGAs also include dedicated RAM blocks, ALUs, multipliers, and hard or soft-core processors (e.g. ARM, NIOS II, MicroBlaze, PPC).

So what does Handel-C do for us?


Those who have programmed in VHDL/Verilog know that you must think in terms of a state machine, and write the code accordingly.
Handel-C is one level of abstraction higher than an HDL.
  The compiler deals with the state machine generation automatically.
It will create a netlist, but not an FPGA programming file.
  The FPGA vendor tools are still required.
  It does, however, provide scripts for automating the place and route, and bit stream creation, for Xilinx and Altera tools.

Still no free lunch!!



Just as in Professor McLeod’s talk last time, there is no free lunch.
The state machine generated by the Handel-C compiler uses one-hot encoding.
Not necessarily optimal for every design, but it still gives good results in practice.

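To make the one-hot trade-off concrete, here is a minimal plain-C sketch (not Handel-C, and not how Celoxica's compiler is implemented) that compares the number of state flip-flops needed by one-hot and binary encodings. The function names are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Flip-flops needed by a one-hot encoded state machine: one per state. */
unsigned one_hot_flip_flops(unsigned num_states)
{
    return num_states;
}

/* Flip-flops needed by a binary encoded state machine:
   ceil(log2(num_states)). */
unsigned binary_flip_flops(unsigned num_states)
{
    unsigned bits = 0;
    /* find the smallest bit count whose range covers every state */
    while (bits < 32 && (UINT32_C(1) << bits) < num_states)
        bits++;
    return bits;
}

/* One-hot code word for state `index` (0-based, index < 32):
   exactly one bit is set. */
uint32_t one_hot_code(unsigned index)
{
    return UINT32_C(1) << index;
}
```

One-hot encoding spends more flip-flops, but testing whether the machine is in a given state only requires looking at a single bit, which tends to keep the next-state logic shallow.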
Other Capabilities of Handel-C Design Suite

Provides support for Altera, Xilinx, and Actel FPGAs
  Handel-C compiler can take advantage of technology available in a particular device (RAMs, ALUs, Multipliers, etc.)
Compiler can provide output in different formats:
  Netlist (EDIF)
  VHDL code from C code
  Debug (can be used with SystemC or ANSI C front-end for verification)
Provides a Platform Abstraction Layer (PAL)
  Set of common utilities for hardware devices commonly found on development boards
    Video, Keyboard, Mouse, Ethernet, RS-232, LED output, General I/O, etc.
Provides support for integration with company specific tools and/or intellectual property.
  Quartus II, SOPC Builder, NIOS II processor, MicroBlaze processor

Handel-C Data Types




Handel-C supports all of the primitive integral types provided by ANSI C (signed and unsigned): char, int, short, long.
Variables are implemented as registers.
Depth of an array must be specified at compile time.
Can also declare variables of arbitrary width from 1 to 128 bits, e.g.
unsigned 8 myVariable;
signed 25 myVariable2[15];

No native floating point types or calculations in the current version.
  Course instructor claims it will be included in the next release.

Operators

All the operators from ANSI C, plus a few others:
Relational: !=, ==, <, >, <=, >= (> and < are expensive to evaluate with combinational logic). Operands must have the same width. Result is a 1-bit value.
Logical: &&, ||, ! take 1-bit unsigned operands; however, for wider operands the compiler will take x || y as: x!=0 || y!=0
Bitwise: ^, |, &, ~ Operands must have equal width.
Shift: <<, >> For a << b, b must have a width of ceil(log2(width(a)+1)). Macros provided by the Platform Developers Kit (PDK).

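The shift-width rule can be checked with a few lines of arithmetic. The following is a minimal plain-C sketch of the formula ceil(log2(width(a)+1)); the helper name is hypothetical and it is not one of the PDK macros mentioned on the slide.

```c
/* Width in bits required for the shift amount b in "a << b",
   i.e. ceil(log2(width_a + 1)). Intended for small widths such as
   those of Handel-C variables. */
unsigned shift_amount_width(unsigned width_a)
{
    unsigned bits = 0;
    /* find the smallest `bits` such that 2^bits >= width_a + 1 */
    while (bits < 31 && (1u << bits) < width_a + 1u)
        bits++;
    return bits;
}
```

For an 8-bit operand this gives 4 bits, because the shift amount has to be able to represent every value from 0 to 8.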
…the others

Bit manipulation
a = 1 0 0 1 1
Take: <-    b = a <- 3 = 0 1 1
Drop: \\    c = a \\ 3 = 1 0
Very cheap in hardware since these operators are implemented as wires.

Range selection:
Expression[n:m] (bits n to m)
a[3:1] = 0 0 1

Concatenation:
expression 1 @ expression 2
d = a @ a[3:1] = 1 0 0 1 1 0 0 1

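For readers more used to masks and shifts, the same operations can be modelled in plain C. This is only a sketch of what take, drop, range selection, and concatenation compute on the slide's example value a = 10011; the function names are hypothetical, and in hardware none of this costs logic because the operators are just wiring.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the n least significant bits set (n must be below 32). */
static uint32_t low_mask(unsigned n)
{
    assert(n < 32);
    return (UINT32_C(1) << n) - 1u;
}

/* Take (a <- n): keep the n least significant bits. */
uint32_t take_lsb(uint32_t value, unsigned n)
{
    return value & low_mask(n);
}

/* Drop (a \\ n): discard the n least significant bits. */
uint32_t drop_lsb(uint32_t value, unsigned n)
{
    assert(n < 32);
    return value >> n;
}

/* Range selection (a[hi:lo]): bits hi down to lo, inclusive. */
uint32_t select_range(uint32_t value, unsigned hi, unsigned lo)
{
    assert(hi >= lo && hi < 31);
    return (value >> lo) & low_mask(hi - lo + 1u);
}

/* Concatenation (left @ right), where right is right_width bits wide. */
uint32_t concat(uint32_t left, uint32_t right, unsigned right_width)
{
    assert(right_width < 32);
    return (left << right_width) | (right & low_mask(right_width));
}
```

With a = 0x13 (binary 10011), take_lsb(a, 3) is 011, drop_lsb(a, 3) is 10, select_range(a, 3, 1) is 001, and concatenating a with that range gives 10011001, matching the values on the slide.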
Parallelism


Since logic circuit operation is highly parallel by nature, it is necessary for a design tool to support parallelism.
Accomplished in Handel-C by using a par statement, as opposed to a seq statement, where the code is executed sequentially.

static unsigned 8 a = 2;
static unsigned 8 b = 1;
par
{
    a++;
    b = a + 10;
}
Results: a = 3, b = 12
Each Handel-C assignment takes 1 clock cycle. Both statements begin execution at the same time, therefore both statements take only 1 clock cycle combined. Operations are performed on the value that the variable contained at the end of the previous cycle.

static unsigned 8 a = 2;
static unsigned 8 b = 1;
seq
{
    a++;
    b = a + 10;
}
Results: a = 3, b = 13
The seq block operates in the same manner as you would expect from an ANSI C program.

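The difference between the two results can be reproduced with a small plain-C model. This is a sketch, not Handel-C: the par version computes every right-hand side from the register values held before the clock edge and then commits them together, while the seq version commits after each statement. The type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint8_t a;
    uint8_t b;
} Regs;

/* Models: par { a++; b = a + 10; }
   One clock cycle; both statements read the old value of a. */
Regs run_par(uint8_t a, uint8_t b)
{
    Regs next;
    (void)b;                     /* the old value of b is not used */
    next.a = (uint8_t)(a + 1);   /* uses old a */
    next.b = (uint8_t)(a + 10);  /* also uses old a */
    return next;
}

/* Models: seq { a++; b = a + 10; }
   Two clock cycles; the second statement sees the updated a. */
Regs run_seq(uint8_t a, uint8_t b)
{
    Regs r;
    (void)b;
    r.a = (uint8_t)(a + 1);      /* cycle 1 */
    r.b = (uint8_t)(r.a + 10);   /* cycle 2 uses the new a */
    return r;
}
```

Starting from a = 2 and b = 1, run_par yields a = 3, b = 12 and run_seq yields a = 3, b = 13, the same results as the two Handel-C blocks.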
Signals



However, occasionally we need to use a value immediately after assigning it in a par block.
This can be done by declaring a variable as a signal.
The value of a signal lasts only for the duration of the current clock cycle.

signal unsigned 8 a;
static unsigned 8 b;
par
{
    a = 7;
    b = a;
}
Results: a = 0, b = 7

signal unsigned 8 a;
static unsigned 8 b;
seq
{
    a = 7;
    b = a;
}
Results: a = 0, b = 0

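One way to picture a signal is as a wire rather than a register. The plain-C sketch below models the two cases under the assumption, taken from the results on the slide, that a signal which is not driven in a cycle reads as 0. The names are hypothetical and the model deliberately ignores everything except these two blocks.

```c
#include <stdint.h>

typedef struct {
    uint8_t a_next_cycle; /* what signal a reads as once the block is done */
    uint8_t b;            /* register b after the block finishes */
} SignalResult;

/* Models: par { a = 7; b = a; } which takes one clock cycle. */
SignalResult signal_in_par(void)
{
    SignalResult r;
    uint8_t a_wire = 7;   /* the signal carries 7 during this cycle */
    r.b = a_wire;         /* b samples the wire in the same cycle */
    r.a_next_cycle = 0;   /* the value is gone when the cycle ends */
    return r;
}

/* Models: seq { a = 7; b = a; } which takes two clock cycles. */
SignalResult signal_in_seq(void)
{
    SignalResult r;
    uint8_t a_wire = 7;   /* cycle 1: a carries 7, but nothing reads it */
    a_wire = 0;           /* cycle 2: a is not driven, so it reads as 0 */
    r.b = a_wire;         /* cycle 2: b samples 0 */
    r.a_next_cycle = 0;
    return r;
}
```

The par case gives b = 7 because the read happens in the cycle in which the signal is driven; the seq case gives b = 0 because the read happens one cycle too late.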
Nesting seq and par

Can be nested as in the following example:
par
{
seq
{
/*some statements to be executed sequentially */
}
seq
{
/* these statements are executed sequentially,
but in parallel with previous seq block */
}
}

par will not return until all of the statements/sub-blocks have completed.

Special Data Types

Input/Output



Obviously there must be a mechanism for performing I/O with the FPGA.
Handel-C has data types for buses or interfaces (input, output, tri-state).
Also supports ports:
  I/O between modules/components in a design, not a physical pin.

I/O Declaration Examples
Input interface prototype:
interface bus_in(type portName) Name() with {data = {Pin List}};
Input interface usage:
interface bus_in(unsigned 2 val) myInput() with {data = {"P1","P2"}};
unsigned 2 inData;
inData = myInput.val; //read the value {P1 P2}

Examples cont’d
Output interface prototype:
interface bus_out() Name(type portName=Expression) with
{data = {Pin List}};
Output interface usage:
static unsigned 8 counter = 0;
interface bus_out() CountOut(unsigned 8 outVal=counter+1);
while(1)
{
counter++;
}
RAM and ROM


No such thing as malloc() on an FPGA.
Instead, Handel-C allows you to store variables in FPGA dedicated RAM blocks:
ram unsigned 9 myRam[256]; /* a RAM block that holds 256 unsigned 9-bit integers */
static rom unsigned 9 myRom[3] = {100,200,300}; /* must be static or global */

Different from arrays because declaring an array is the same as declaring multiple variables.
  This means that an array’s elements can be accessed simultaneously.
  RAM locations cannot, because a RAM only has 1 or 2 ports.
myRam[25]++; /* read, modify, and write in one cycle = undefined results */
par /* 2 writes during the same cycle -> this also won’t work */
{
    myRam[0] = 100;
    myRam[2] = 498;
}

If/Else


Handel-C if/else syntax is almost the same as in ANSI C.
The exception: the condition of the if() must take 0 clock cycles to evaluate. This implies that there cannot be any variable assignment in the condition expression.
if( (z = x + y) == 6) //legal in ANSI C, but not in Handel-C

Loops






while(), for(), do…while()
All have the same syntax as in ANSI C.
The same limitation applies to the conditions as with if/else.
When programming a PC, it is good practice to use a for loop when the context calls for it.
When writing C code for circuits, it’s almost never good practice to use for() loops at all.
  One clock cycle overhead per iteration.

While Loop Optimization

The limitations of a for() loop can be avoided by updating a counter variable in parallel with the body of a while() loop.
static unsigned 4 x = 15;
do
{
    par
    {
        //do something
        x--;
    }
} while(x != 0);

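The cost difference can be estimated with simple arithmetic. The plain-C sketch below counts clock cycles under the rules stated in these slides (each assignment takes one cycle, conditions take none) and under my assumption that a for() loop's initializer and its per-iteration step are each a separate one-cycle assignment. The function names are hypothetical.

```c
/* Cycles for: for (init; cond; step) body;  with n iterations.
   Assumes init costs 1 cycle and step costs 1 cycle per iteration. */
unsigned long for_loop_cycles(unsigned long n, unsigned long body_cycles)
{
    return 1ul + n * (body_cycles + 1ul);
}

/* Cycles for: do { par { body; counter update; } } while (cond);
   The counter update overlaps with the body, so it adds nothing
   (body_cycles must be at least 1). */
unsigned long par_loop_cycles(unsigned long n, unsigned long body_cycles)
{
    return n * body_cycles;
}
```

For 15 iterations of a one-cycle body, this model gives 31 cycles for the for() loop and 15 cycles for the loop with the counter updated in parallel.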
Macros, Channels, Prialt, and Semaphores


Scenario: Suppose you need to design a circuit that calculates pixel values in a frame buffer, and that each calculation takes 4 or 5 clock cycles. However, you need to calculate one pixel every clock cycle to meet a display timing constraint.
Possible Solution: Duplicate the calculation code 5 times, and have each block store values in the proper place in the frame buffer.

Macros



Macros can be used to implement parameterizable code, or to provide code re-use.
Like a regular function without parameter types.
For the solution to our scenario, declaring a macro would look like:
macro proc myCalculation(dataSource)
{
    //receive data from source
    //perform 3-5 clock cycles worth of calculations
}

Channels

Handel-C provides a channel type to allow for synchronization or communication between parallel processes.
Declaration: chan <type> <channelName>
Data can then be sent over the channel, or received from it, but only in one direction.
Each channel operation will block if the other party is not ready.
Two parallel blocks of code:

chan unsigned 8 dataPipe; //must be declared with global scope
static unsigned 8 someData = 5;
…
dataPipe ! someData;
…

static unsigned 8 recvData;
…
dataPipe ? recvData;
…

Prialt




Now suppose we have 5 of our ‘worker’ processes running in parallel. How do we use them to achieve our goal?
Each operation will complete in 3-5 cycles, so we don’t know which of the 5 will be free to perform the next pixel calculation.
But if we send data down a channel sequentially to each of the 5 processes, we might block on one of them when another is not doing anything: wasted clock cycles.
Prialt is the solution for this.

Prialt


Similar to a case statement that chooses the first channel able to receive data.
In other words, it gives a priority to each channel.
prialt
{
    case channel1 ! data:
        break;
    case channel2 ! data:
        break;
    default:
        break;
}
If default is not used, then prialt will block on the last case statement if a prior one was not taken.
Need to be careful that processes aren’t starved.
  Wasted resources

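The selection rule itself is easy to state in code. The following plain-C sketch models only the priority decision of a prialt with a default branch: given which channels are ready, it returns the index of the case that would be taken. The bitmask representation and the function name are assumptions made for this illustration, not part of Handel-C.

```c
/* Bit i of ready_mask is set when channel i can communicate right now.
   Returns the index of the first ready channel in priority order
   (lowest index first), or -1 when none is ready, which corresponds
   to taking the default branch. */
int prialt_select(unsigned ready_mask, int num_channels)
{
    for (int i = 0; i < num_channels && i < 32; i++) {
        if (ready_mask & (1u << i))
            return i;
    }
    return -1;
}
```

Because earlier cases always win, a worker listed late in the prialt only receives data when every worker listed before it is busy, which is where the starvation concern comes from.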
Semaphores




Once a process has finished its computation we need to update the frame buffer (FB), which is typically implemented in a RAM block for FPGA area efficiency.
Recall that a RAM block typically has only one write port, therefore we can’t have each process write to the frame buffer because we can’t guarantee that simultaneous access will not happen.
One solution is to have each process send the result down a separate channel to another process that deals with FB access.
But this is a section on semaphores, so we’ll go with them instead.

Semaphores




Semaphores can be used to guard critical sections of code against parallel access.
More like a mutex from POSIX threads.
trysema() tests the semaphore and takes it if the critical section is free; releasesema() frees it again.
e.g.
sema fbGuard;
…
while(trysema(fbGuard)==0) delay; /* loop until semaphore is free */
/* critical section of code, i.e. frame buffer access */
releasesema(fbGuard); /* skipping this step could result in deadlock */
…

Putting it all together…
#define NUM_CHANNELS 5
#define SCR_WIDTH 4
#define SCR_HEIGHT 4
set clock = external;

typedef struct point //just as in ANSI C
{
    unsigned 2 x;
    unsigned 2 y;
} point;

sema fbGuard;
//you can even send structures over channels
chan point dataChannels[NUM_CHANNELS];
ram unsigned 8 frameBuffer[SCR_WIDTH*SCR_HEIGHT];

macro proc increment(p)
{
    if(p.x==SCR_WIDTH-1)
    {
        par
        {
            p.x=0;
            p.y++;
        }
    }
    else
        p.x++;
}

macro proc coordGen()
{
    point pGen;
    pGen.x = 0;
    pGen.y = 0;
    while(1)
    {
        prialt
        {
            case dataChannels[0] ! pGen:
                increment(pGen);
                break;
            case dataChannels[1] ! pGen:
                increment(pGen);
                break;
            case dataChannels[2] ! pGen:
                increment(pGen);
                break;
            case dataChannels[3] ! pGen:
                increment(pGen);
                break;
            case dataChannels[4] ! pGen:
                increment(pGen);
                break;
            default:
                delay;
                break;
        }
    }
}

macro proc worker(channel)
{
    point p;
    static unsigned 8 pixel = 0;
    //loop forever waiting for data to compute pixels with
    while(1)
    {
        channel ? p;
        if((p.x <- 1) == 0 && (p.y <- 1) == 0) //x, y are even
        {
            pixel = 2;
            delay;
            delay;
        }
        else if((p.x <- 1) == 1 && (p.y <- 1) == 1) //both odd
        {
            pixel = 1;
            delay;
        }
        else //x is even/odd and y is odd/even
            pixel = 3;
        //critical section
        while(trysema(fbGuard) == 0)
            delay;
        {
            frameBuffer[p.y@p.x] = pixel;
            releasesema(fbGuard);
        }
    }
}

void main()
{
    par
    {
        //create the coord generator and the worker processes
        coordGen();
        worker(dataChannels[0]);
        worker(dataChannels[1]);
        worker(dataChannels[2]);
        worker(dataChannels[3]);
        worker(dataChannels[4]);
        //will never return because at
        //least 1 process has an infinite loop
    }
}

Mapping Handel-C to Logic


Ultimately, the statements you write in Handel-C must be mapped to logic by the compiler.
The following slides show the mapping for some of the constructs discussed so far:
  assignment
  seq and par
  if
  while
  do…while
The following logic circuits are taken from the course notes from Celoxica’s DK training course.

Assignment

a = b;
Sequential Statements
seq
{
statement1;
statement2;
}
Parallel Statements
par
{
statement1;
statement2;
}
If Statements
if (Condition)
statement2;
While Loops
while (Condition)
{
statement2;
}
do
{
statement2;
} while(Condition);
Automatic Retiming
Why Retime?



Many designs will require the use of a multiplier, divider, or other large combinational logic circuit.
The propagation delay through deep logic can be quite long.
Having even one path in the design with a long delay could cause the maximum clock rate to drop significantly, to the point where timing constraints cannot be met.
Retiming involves moving/adding flip-flops around the data path to reduce the depth of logic, and ultimately reduce the critical path delay.

Simple Example (1)
x = a+b+c+d;
The result is calculated through two adder stages. However, we can pipeline the result by inserting registers at intermediate locations.
The adder stages are split with two registers. This reduces the propagation delay of each stage, allowing a higher clock frequency.
The consequence is that the result is delayed by one cycle.
(1) Example adapted from Celoxica’s Handel-C and DK training course notes.

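The one-cycle delay can be seen in a small plain-C model of the pipelined adder. This is a sketch under the assumption that the two inserted registers hold a+b and c+d, with the final adder feeding the output register x; the type and function names are hypothetical, and each call to pipe_step stands for one clock edge.

```c
#include <stdint.h>

typedef struct {
    uint16_t ab; /* register after the first-stage adder for a + b */
    uint16_t cd; /* register after the first-stage adder for c + d */
    uint16_t x;  /* output register after the second-stage adder */
} AdderPipe;

/* All registers cleared, as after reset. */
AdderPipe pipe_reset(void)
{
    AdderPipe p = {0, 0, 0};
    return p;
}

/* One clock edge: every register loads a value computed from the
   state held before the edge, so x lags the inputs by one extra cycle. */
AdderPipe pipe_step(AdderPipe p, uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    AdderPipe next;
    next.x  = (uint16_t)(p.ab + p.cd); /* second stage uses last cycle's sums */
    next.ab = (uint16_t)(a + b);       /* first stage, left adder */
    next.cd = (uint16_t)(c + d);       /* first stage, right adder */
    return next;
}
```

Feeding 1, 2, 3, 4 on the first edge leaves x at 0; the sum 10 only appears in x after the second edge, while a new set of inputs is already being added in the first stage.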
Programming for Retiming





Retiming is not a trivial task; it is extremely time consuming to do by hand, especially for large designs.
Handel-C design tools can perform retiming automatically if the code is written properly.
The compiler will add/remove/move flip-flops as necessary, but will not alter the timing of the design.
Therefore, to use retiming, the design must be pipelined, or have extra pipelining stages built in.
The compiler can then shift logic and flip-flops around without altering the timing of the design.

Programming Example
Example: x = a*b+c*d;

unsigned 8 x[3]; //3 retiming stages
//output is the last of the retiming stages
interface bus_out() sumOut(unsigned 8 out = x[2]) with {data ={"P2","P3","P4","P5","P6","P7","P8","P9"}};
interface bus_clock_in(unsigned 8 in) input() with {data ={"P10","P11","P12","P13","P14","P15","P16","P17"}};
void main()
{
    unsigned 8 data[4];
    while(1)
    {
        par
        {
            //get the input and shift the previous inputs
            data[0] = input.in;
            data[1] = data[0];
            data[2] = data[1];
            data[3] = data[2];
            //coded like you would without retiming
            x[0] = data[0]*data[1] + data[2]*data[3];
            //result is shifted through the retiming registers
            x[1] = x[0]; //extra stages
            x[2] = x[1];
        }
    }
}

FIR Example

One of the exercises at the training course was to code a nine-tap FIR filter that was pipelined and retimed automatically.
  Nine multiplications of data and coefficients, followed by summation of the nine products.
  Very deep logic
A Xilinx Spartan™ 3 chip was targeted. The fmax results were recorded for various numbers of extra retiming stages.

[Figure: Fmax (MHz) vs. # of Retiming Stages (1 to 6)]
[Figure: Flip Flop Usage Before and After Retiming: # of Flip Flops (FF Before, FF After) vs. # of Retiming Stages (1 to 6)]

Final Notes

Not enough time to cover everything Handel-C has to offer.
There are ways to create parameterizable code.
  pointers, macro expressions
  Allows the designer to easily vary the # of worker processes, or pipeline/retiming stages, for example.
More information available at www.celoxica.com

Thank-You!
Questions?